Thursday, February 28, 2013

Leveraging GPU from Java applications using JCuda

Introduction


GPU computing has become a major advantage for massively parallel applications, but leveraging the full power of the GPU from Java-based applications has long been a challenge for Java developers.

JCuda lets you call CUDA from your Java applications, giving you cross-platform bindings that run on any operating system supported by CUDA.

 

Prerequisites


You will need to have the following installed on your system:
  1. Java SDK
  2. An NVIDIA graphics card with the CUDA driver and toolkit installed
  3. JCuda library

Verify the Driver


You can verify if the CUDA driver is properly installed using the following command:

[centos@company-d017 ~]$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2012 NVIDIA Corporation
Built on Fri_Sep_21_17:28:58_PDT_2012
Cuda compilation tools, release 5.0, V0.2.1221

The code we are going to cover in this blog has been tested on Nvidia GeForce GTX 680 and CUDA driver version 5.0.
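In addition to nvcc, you can confirm that the driver actually detects your GPU with the nvidia-smi utility (assuming it was installed along with the driver); the exact output depends on your card and driver version:

```shell
# List the GPUs the driver can see, e.g. "GPU 0: GeForce GTX 680 (...)"
nvidia-smi -L
```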


Step I : Write a sample JCuda program 


This JCuda program adds two vectors, each with 100,000 elements, and displays the results.

The addition of the vectors is performed in parallel on the GPU device. Note that the JCudaVectorAdd.ptx file (the compiled kernel produced in Step III) is expected in the resources folder of the Java project.


     // Required imports for this snippet:
     //   import static jcuda.driver.JCudaDriver.*;
     //   import jcuda.*;
     //   import jcuda.driver.*;

     // Enable exceptions so failed driver calls throw,
     // letting us omit most explicit error checks below
     JCudaDriver.setExceptionsEnabled(true);
     
     // Initialize the driver and create a context for the first device.  
     cuInit(0);  
     CUdevice device = new CUdevice(); 
     if (cuDeviceGet(device, 0) != CUresult.CUDA_SUCCESS) {
           throw new RuntimeException("Unable to get GPU device");
     } 
     CUcontext context = new CUcontext();  
     cuCtxCreate(context, 0, device);  
     
     // Load the PTX file containing the compiled kernel.
     CUmodule module = new CUmodule();  
     cuModuleLoad(module, "src/main/resources/cuda-binaries/JCudaVectorAdd.ptx");  
     
     // Obtain a function pointer to the "add" kernel function.  
     CUfunction function = new CUfunction();  
     cuModuleGetFunction(function, module, "add");  
     int numElements = 100000;  
     
     // Allocate and fill the host input data  
     float hostInputA[] = new float[numElements];  
     float hostInputB[] = new float[numElements];  
     for(int i = 0; i < numElements; i++)  
     {  
       hostInputA[i] = (float)i;  
       hostInputB[i] = (float)i;  
     }  
     
     // Allocate the device input data, and copy the  
     // host input data to the device  
     CUdeviceptr deviceInputA = new CUdeviceptr();  
     cuMemAlloc(deviceInputA, numElements * Sizeof.FLOAT);  
     cuMemcpyHtoD(deviceInputA, Pointer.to(hostInputA),  
       numElements * Sizeof.FLOAT);  
     CUdeviceptr deviceInputB = new CUdeviceptr();  
     cuMemAlloc(deviceInputB, numElements * Sizeof.FLOAT);  
     cuMemcpyHtoD(deviceInputB, Pointer.to(hostInputB),  
       numElements * Sizeof.FLOAT);  
     
     // Allocate device output memory  
     CUdeviceptr deviceOutput = new CUdeviceptr();  
     cuMemAlloc(deviceOutput, numElements * Sizeof.FLOAT); 

     // Set up the kernel parameters: A pointer to an array  
     // of pointers which point to the actual values.  
     Pointer kernelParameters = Pointer.to(  
       Pointer.to(new int[]{numElements}),  
       Pointer.to(deviceInputA),  
       Pointer.to(deviceInputB),  
       Pointer.to(deviceOutput)  
     );  
     
     // Call the kernel function.  
     int blockSizeX = 256;  
     int gridSizeX = (int)Math.ceil((double)numElements / blockSizeX);  
     cuLaunchKernel(function,  
       gridSizeX, 1, 1,   // Grid dimension  
       blockSizeX, 1, 1,   // Block dimension  
       0, null,        // Shared memory size and stream  
       kernelParameters, null // Kernel- and extra parameters  
     );  
     cuCtxSynchronize();  
     
     // Allocate host output memory and copy the device output  
     // to the host.  
     float hostOutput[] = new float[numElements];  
     cuMemcpyDtoH(Pointer.to(hostOutput), deviceOutput,  
       numElements * Sizeof.FLOAT);  
     
     // View the result   
     for(int i = 0; i < numElements; i++)  
     {  
         System.out.println(  
           "At index "+i+ " found "+hostOutput[i]);    
     }  
     
     // Free the memory on device.  
     cuMemFree(deviceInputA);  
     cuMemFree(deviceInputB);  
     cuMemFree(deviceOutput);   
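Since both input vectors are filled with the index values, every element of the result should equal 2*i. As a quick sanity check, the GPU output can be compared against a CPU reference. The sketch below shows the idea without needing a GPU (VectorAddCheck and addOnCpu are hypothetical names for illustration, not part of JCuda):

```java
public class VectorAddCheck {

    // CPU reference implementation of the "add" kernel
    public static float[] addOnCpu(float[] a, float[] b) {
        float[] sum = new float[a.length];
        for (int i = 0; i < a.length; i++) {
            sum[i] = a[i] + b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        int numElements = 100000;
        float[] a = new float[numElements];
        float[] b = new float[numElements];
        for (int i = 0; i < numElements; i++) {
            a[i] = i;
            b[i] = i;
        }
        float[] expected = addOnCpu(a, b);
        // In the real program, compare hostOutput element-wise against 'expected'
        System.out.println("expected[42] = " + expected[42]); // prints 84.0
    }
}
```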


Step II : Write your CUDA kernel


CUDA kernels are functions written in CUDA C, an extension of the C language, and they execute directly on the GPU device.
The example below adds vectors 'a' and 'b' and stores the result in the vector 'sum'.

JCudaVectorAdd.cu

 extern "C"  
 __global__ void add(int n, float *a, float *b, float *sum)  
 {  
   int i = blockIdx.x * blockDim.x + threadIdx.x;  
   if (i<n)  
   {  
     sum[i] = a[i] + b[i];  
   }  
 }  
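Each GPU thread computes a single element, and the expression blockIdx.x * blockDim.x + threadIdx.x flattens the (block, thread) pair into a global element index. The arithmetic can be checked on the CPU; IndexMapping is a hypothetical helper name used purely for illustration:

```java
public class IndexMapping {

    // Mirrors the kernel's index computation:
    // i = blockIdx.x * blockDim.x + threadIdx.x
    public static int globalIndex(int blockIdx, int blockDim, int threadIdx) {
        return blockIdx * blockDim + threadIdx;
    }

    public static void main(String[] args) {
        // Thread 5 of block 2, with 256 threads per block
        System.out.println(globalIndex(2, 256, 5)); // prints 517
    }
}
```

The `if (i < n)` guard in the kernel matters because the grid is rounded up: 391 blocks of 256 threads cover 100,096 threads, so the last 96 must not write past the end of the arrays.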


Step III : Compile your CUDA code


CUDA kernels can be compiled into .ptx or .cubin files by the nvcc compiler. Either kind of file can then be loaded and executed using the Driver API.

The drawback of cubin files is that they are specific to a Compute Capability (essentially a version number for the hardware); a cubin compiled for one Compute Capability cannot be loaded on a GPU with a different Compute Capability.

We prefer compiling to a PTX file, since PTX is compiled just-in-time for the GPU of the target machine.
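If you do want a cubin for a specific GPU, nvcc can produce one when you pass the target compute capability explicitly; sm_30 below is an assumption matching the GTX 680 used in this post:

```shell
# Compile to a cubin for compute capability 3.0 (e.g. GeForce GTX 680)
nvcc -cubin -arch=sm_30 JCudaVectorAdd.cu -o JCudaVectorAdd.cubin
```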

Below is the command for compiling the CUDA code to a PTX file on Linux:

  nvcc -ptx JCudaVectorAdd.cu -o JCudaVectorAdd.ptx   


Step IV : Compile and run your Java program


You can compile your Java program using the following command from your Java project directory:

 javac -cp ".:jcuda-x.x.x.jar" JCudaVectorAdd.java  

This will create a 'JCudaVectorAdd.class' file in your project's directory.

You can then run the program using the following command:

 java -cp ".:jcuda-x.x.x.jar" JCudaVectorAdd  


NOTE : If you face errors while executing the program, try setting the environment variables below in your ~/.bashrc file and running it again:

export LD_LIBRARY_PATH=/usr/local/cuda/lib64/:/usr/local/cuda/lib:/path/to/your/jcuda/parent/directory

export LD_PRELOAD=/usr/lib64/libcuda.so
