Tips for #GPU Performance! Maximizing Unified Memory Performance in NVIDIA #CUDA
#HPC #DataScience http://bit.ly/2zoHdck
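
The article's main levers are managed allocation plus explicit prefetching. As a minimal sketch of that pattern (my own illustration, not code from the article; the kernel and sizes are arbitrary placeholders):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel: scales an array in place.
__global__ void scale(float *x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float *x = nullptr;
    cudaMallocManaged(&x, n * sizeof(float));  // one pointer, valid on host and device
    for (int i = 0; i < n; ++i) x[i] = 1.0f;

    int dev = 0;
    cudaGetDevice(&dev);
    // Prefetch to the GPU so the kernel does not pay for on-demand page migration.
    cudaMemPrefetchAsync(x, n * sizeof(float), dev, 0);
    scale<<<(n + 255) / 256, 256>>>(x, n, 2.0f);
    // Prefetch back before the host touches the data again.
    cudaMemPrefetchAsync(x, n * sizeof(float), cudaCpuDeviceId, 0);
    cudaDeviceSynchronize();

    printf("x[0] = %f\n", x[0]);  // expect 2.0
    cudaFree(x);
    return 0;
}
```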

ThunderSVM: A Fast SVM Library on GPUs and CPUs

(Zeyi Wen, Jiashuai Shi, Bingsheng He, Qinbin Li, Jian Chen)

#GPU #CUDA #OpenMP #MachineLearning #ML #DataMining #SVM #Package

Support Vector Machines (SVMs) are classic supervised learning models for classification, regression and distribution estimation. A survey conducted by Kaggle in 2017 shows that 26% of data mining and machine learning practitioners use SVMs. However, SVM training and prediction are computationally very expensive for large and complex problems. This paper presents ThunderSVM, an efficient and open-source SVM software toolkit that exploits the high performance of Graphics Processing Units (GPUs) and multi-core CPUs. ThunderSVM supports all the functionalities of LibSVM, including classification (SVC), regression (SVR) and one-class SVMs, and uses identical command line options, so existing LibSVM users can easily adopt the toolkit. ThunderSVM can be used through multiple language interfaces, including C/C++, Python, R and MATLAB. Our experimental results show that ThunderSVM is generally an order of magnitude faster than LibSVM while producing identical SVMs. Beyond this high efficiency, we design our convex optimization solver in a general way so that SVC, SVR and one-class SVMs share the same solver for ease of maintenance. Documentation, examples, and more about ThunderSVM are available.
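
For context on the shared solver: LibSVM, whose interface ThunderSVM mirrors, casts SVC, SVR and one-class SVMs as instances of a single quadratic dual. A standard statement of that form (from the LibSVM literature, not quoted from this paper) is

    \min_{\alpha} \; \frac{1}{2}\alpha^{T} Q \alpha + p^{T}\alpha
    \quad \text{subject to} \quad y^{T}\alpha = \delta, \;\; 0 \le \alpha_i \le C_i,

where $Q_{ij} = y_i y_j K(x_i, x_j)$ for a kernel $K$. The three model types differ only in the choice of $p$, $y$, $\delta$ and the bounds $C_i$, which is why one solver can serve all of them.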

https://hgpu.org/?p=17912

GPU Acceleration of a High-Order Discontinuous Galerkin Incompressible Flow Solver

(Ali Karakus, Noel Chalmers, Kasia Swirydowicz, Timothy Warburton)

#GPU #CUDA #CFD #FluidDynamics #NSE

We present a GPU-accelerated version of a high-order discontinuous Galerkin discretization of the unsteady incompressible Navier-Stokes equations. The equations are discretized in time using a semi-implicit scheme with explicit treatment of the nonlinear term and implicit treatment of the split Stokes operators. The pressure system is solved with a conjugate gradient method together with a fully GPU-accelerated multigrid preconditioner designed to minimize memory requirements and increase overall performance. A semi-Lagrangian subcycling advection algorithm shifts the computational load per timestep away from the pressure Poisson solve by allowing larger timestep sizes in exchange for an increased number of advection steps. Numerical results confirm that we achieve the design order of accuracy in both time and space. We optimize the performance of the most time-consuming kernels by tuning fine-grain parallelism, improving memory utilization, and maximizing bandwidth. To assess overall performance we present an empirically calibrated roofline performance model for a target GPU to explain the achieved efficiency. We demonstrate that, in most cases, the kernels used in the solver are close to their empirically predicted roofline performance.
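
For reference, the roofline bound being calibrated has the standard form: with arithmetic intensity $I$ (flop per byte moved), peak arithmetic rate $P_{\text{peak}}$, and achievable memory bandwidth $B$, the attainable performance is

    P(I) = \min\left(P_{\text{peak}},\; B \cdot I\right).

A kernel near this bound is limited by the hardware rather than the implementation, which is the sense in which the solver's kernels are reported to be close to roofline.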

https://hgpu.org/?p=17910

Analysing the Performance of GPU Hash Tables for State Space Exploration

(Nathan Cassee, Anton Wijs)

#GPU #CUDA #Hash #DSE #Performance

In the past few years, General Purpose Graphics Processing Units (GPUs) have been used to significantly speed up numerous applications. One area in which GPUs have recently led to a significant speed-up is model checking. In model checking, state spaces, i.e., large directed graphs, are explored to verify whether models satisfy desirable properties. GPUexplore is a GPU-based model checker that uses a hash table to efficiently keep track of already explored states. As a large number of states is discovered and stored during such an exploration, the hash table must be able to handle many concurrent inserts and queries quickly. In this paper, we experimentally compare two hash tables optimised for the GPU: the GPUexplore hash table and one based on Cuckoo hashing. We compare the performance of both hash tables on random data and on non-random data obtained from model checking experiments, to analyse their applicability to state space exploration. We conclude that Cuckoo hashing is three times faster than GPUexplore hashing for random data, and five to nine times faster for non-random data. This suggests great potential to further speed up GPUexplore in the near future.
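
To make the Cuckoo side of the comparison concrete, here is a minimal CUDA sketch of the classic GPU cuckoo-insert loop (in the style of Alcantara et al.; an illustration, not the authors' code). Each thread claims a slot with atomicExch and relocates any evicted key to its alternate position, giving up after a bounded number of evictions:

```cuda
#include <cstdint>

// Illustrative sizes; EMPTY must not occur as a real key.
#define TABLE_SIZE (1u << 20)
#define EMPTY      0xFFFFFFFFu
#define MAX_ITER   32

__device__ uint32_t hash1(uint32_t k) { return (k * 2654435761u) % TABLE_SIZE; }
__device__ uint32_t hash2(uint32_t k) { return (k * 40503u + 2531011u) % TABLE_SIZE; }

// One thread per key; the table must be initialised to EMPTY beforehand.
__global__ void cuckooInsert(uint32_t *table, const uint32_t *keys, int n, int *failed) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    uint32_t key  = keys[i];
    uint32_t slot = hash1(key);
    for (int it = 0; it < MAX_ITER; ++it) {
        key = atomicExch(&table[slot], key);  // claim the slot, possibly evicting a key
        if (key == EMPTY) return;             // slot was free: insert complete
        // The evicted key moves to its alternate location.
        slot = (hash1(key) == slot) ? hash2(key) : hash1(key);
    }
    atomicAdd(failed, 1);  // eviction chain too long: a rebuild with new hashes is needed
}
```

Lookups then probe at most two locations per key, which suits the query-heavy access pattern of state space exploration.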

https://hgpu.org/?p=17909

Baidu Apollo 2.0 Continues to Evolve with Neousys Technology's Nuvo-6108GC

This year, Chinese IT heavyweight #Baidu will launch the long-awaited #Apollo2.0 project, a comprehensive leap for China's #autonomous driving technologies. More than 200 partners support the project. The recommended reference hardware is based on the #Nuvo6108GC computer by #Neousys Technology (Taiwan), which uses Nvidia's #GTX1080Ti graphics card and Intel's 6th-generation processors. To communicate with CAN #invehicle networks, the computer provides PCI card slots; ESD's CAN-PCIe/402 interface boards are recommended for Baidu Apollo development partners.

▶ Find out what the Neousys GC series can do for you:
https://www.neousys-tech.com/en/discover/gpu-embedded-computing/?utm_source=Googleplus&utm_medium=social&utm_campaign=BaiduApollo2
▶ Learn more about Neousys Nuvo-6108GC #GPU Computer:
https://www.neousys-tech.com/en/product/application/rugged-embedded/nuvo-6108gc-gpu-computing/?utm_source=Googleplus&utm_medium=social&utm_campaign=BaiduApollo2

#CES #Autonomous #Autocar #Automation

[Thesis]: Parallel Matching and Clustering Algorithms on GPUs

(Md. Naim)

#GPU #CUDA #Clustering #MachineLearning #ML #Thesis

The main focus of this thesis is on developing efficient GPU algorithms for certain matching and clustering problems. Through extensive experiments we show that sparse and unstructured problems can benefit greatly from GPUs as long as the algorithms are carefully designed. Even though none of the presented algorithms are fundamentally new, they still require significant redesign to make them efficient on GPUs. Common to all the developed algorithms is an emphasis on achieving an even load balance and a high degree of parallelism, while avoiding time-consuming synchronization operations. Our experiments verify the performance of the suggested algorithms; in some cases, even a single GPU can outperform tens of multiprocessors on a state-of-the-art supercomputer.

The area of computing with GPU-enhanced systems is changing rapidly, especially with respect to memory management. For instance, simultaneous access to the same memory by a host and a device was not possible until it was recently introduced with the Pascal GPU [15]. However, since the early unified shader architecture, the fundamental architectural features of GPUs have not undergone radical changes; instead, features have been added incrementally while the speed and size of the devices have increased gradually. Based on this observation, we believe that algorithms developed for GPUs today will remain relevant in the near future.

Although the algorithms presented in this thesis can be viewed as proof of concept that carefully designed GPU algorithms for certain graph problems can compete with implementations on traditional parallel supercomputers, it should not be underestimated that doing so requires substantial effort. For this to become possible on a more regular basis, higher-level abstractions for graph analytics on GPUs need to be developed further. As an example, we believe that the strategy used in Paper III, where vertices were grouped according to their degree before being allocated to different thread blocks, is one technique that could benefit other applications; various reduction operations are also candidates for generic implementation.

It seems clear that GPUs will play a major role in the future of HPC, with large systems containing several interconnected GPUs. The design of efficient graph algorithms for such systems is only starting and is likely to generate much interesting work. Just as with the traditional distinction between shared-memory and distributed-memory algorithms, we believe a similar division will emerge between GPU algorithms depending on how the underlying devices are interconnected.
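
As an illustration of the degree-grouping strategy credited above to Paper III, here is a hedged CUDA/Thrust sketch (my own reconstruction, not the thesis code): vertex ids are sorted by degree and split into buckets so that each bucket can be launched with a configuration matched to its work size. processBucket and the degree thresholds are hypothetical placeholders:

```cuda
#include <thrust/binary_search.h>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>

// Hypothetical per-bucket kernel: every vertex handled here has a similar degree,
// so threads in a block receive roughly even amounts of work.
__global__ void processBucket(const int *verts, int count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) {
        // ... process vertex verts[i] ...
    }
}

void launchByDegree(const thrust::device_vector<int> &degree, int nVerts) {
    // Sort vertex ids by degree so vertices with similar work become contiguous.
    thrust::device_vector<int> verts(nVerts);
    thrust::sequence(verts.begin(), verts.end());
    thrust::device_vector<int> deg = degree;  // sort a copy, keep the original intact
    thrust::sort_by_key(deg.begin(), deg.end(), verts.begin());

    // Split at assumed thresholds (e.g., sub-warp work vs. block-sized work).
    const int thresholds[2] = {32, 1024};
    int cuts[2];
    for (int t = 0; t < 2; ++t)
        cuts[t] = (int)(thrust::lower_bound(deg.begin(), deg.end(),
                                            thresholds[t]) - deg.begin());

    const int *v = thrust::raw_pointer_cast(verts.data());
    const int starts[3] = {0, cuts[0], cuts[1]};
    const int ends[3]   = {cuts[0], cuts[1], nVerts};
    const int blocks[3] = {128, 256, 256};  // block size chosen per degree class
    for (int b = 0; b < 3; ++b) {
        int count = ends[b] - starts[b];
        if (count > 0)
            processBucket<<<(count + blocks[b] - 1) / blocks[b], blocks[b]>>>(
                v + starts[b], count);
    }
}
```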

https://hgpu.org/?p=17682