Introduction
GPU computing has become a major advantage for massively parallel processing applications, and leveraging the full power of the GPU from Java-based applications has long been a concern among Java developers.
JCuda lets you create cross-platform CUDA offerings that can be easily accessed from your Java applications and run on any operating system supported by CUDA.
Prerequisites
You will need to have the following installed on your system:
- Java SDK
- An NVIDIA graphics card with the CUDA toolkit and driver installed
- JCuda library (available from jcuda.org)
Verify the Installation
You can verify that the CUDA toolkit (which includes the nvcc compiler) is properly installed using the following command:
[centos@company-d017 ~]$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2012 NVIDIA Corporation
Built on Fri_Sep_21_17:28:58_PDT_2012
Cuda compilation tools, release 5.0, V0.2.1221
The code we are going to cover in this blog has been tested on an NVIDIA GeForce GTX 680 with CUDA driver version 5.0.
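If you also want to confirm which GPU JCuda itself will pick up, the driver API can report the device name and Compute Capability. Below is a minimal sketch; the class name DeviceQuery is just for illustration, and it assumes the JCuda jar is already on your classpath:

import static jcuda.driver.JCudaDriver.*;

import jcuda.driver.CUdevice;
import jcuda.driver.JCudaDriver;

// Minimal device query: prints the name and Compute Capability
// of the first CUDA device.
public class DeviceQuery
{
    public static void main(String[] args)
    {
        JCudaDriver.setExceptionsEnabled(true);
        cuInit(0);
        CUdevice device = new CUdevice();
        cuDeviceGet(device, 0);

        // Query the device name.
        byte name[] = new byte[256];
        cuDeviceGetName(name, name.length, device);

        // Query the Compute Capability (e.g. 3.0 for a GTX 680).
        int major[] = new int[1];
        int minor[] = new int[1];
        cuDeviceComputeCapability(major, minor, device);

        System.out.println("Device: " + new String(name).trim()
            + ", compute capability " + major[0] + "." + minor[0]);
    }
}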
Step I : Write a sample JCuda program
This JCuda program adds two vectors of 100,000 elements each and displays the result.
The addition of the vectors is performed in parallel on the GPU device. Note that the program loads the JCudaVectorAdd.ptx file from the resources folder of the Java project; we will generate that file in Step III.
import static jcuda.driver.JCudaDriver.*;

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.driver.*;

public class JCudaVectorAdd
{
    public static void main(String[] args)
    {
        // Enable exceptions: every driver call below will throw a
        // CudaException on failure, so no manual error checks are needed.
        JCudaDriver.setExceptionsEnabled(true);

        // Initialize the driver and create a context for the first device.
        cuInit(0);
        CUdevice device = new CUdevice();
        cuDeviceGet(device, 0);
        CUcontext context = new CUcontext();
        cuCtxCreate(context, 0, device);

        // Load the PTX file (compiled from JCudaVectorAdd.cu in Step III).
        CUmodule module = new CUmodule();
        cuModuleLoad(module,
            "src/main/resources/cuda-binaries/JCudaVectorAdd.ptx");

        // Obtain a function pointer to the "add" kernel function.
        CUfunction function = new CUfunction();
        cuModuleGetFunction(function, module, "add");

        int numElements = 100000;

        // Allocate and fill the host input data.
        float hostInputA[] = new float[numElements];
        float hostInputB[] = new float[numElements];
        for (int i = 0; i < numElements; i++)
        {
            hostInputA[i] = (float)i;
            hostInputB[i] = (float)i;
        }

        // Allocate the device input data, and copy the
        // host input data to the device.
        CUdeviceptr deviceInputA = new CUdeviceptr();
        cuMemAlloc(deviceInputA, numElements * Sizeof.FLOAT);
        cuMemcpyHtoD(deviceInputA, Pointer.to(hostInputA),
            numElements * Sizeof.FLOAT);
        CUdeviceptr deviceInputB = new CUdeviceptr();
        cuMemAlloc(deviceInputB, numElements * Sizeof.FLOAT);
        cuMemcpyHtoD(deviceInputB, Pointer.to(hostInputB),
            numElements * Sizeof.FLOAT);

        // Allocate the device output memory.
        CUdeviceptr deviceOutput = new CUdeviceptr();
        cuMemAlloc(deviceOutput, numElements * Sizeof.FLOAT);

        // Set up the kernel parameters: a pointer to an array
        // of pointers which point to the actual values.
        Pointer kernelParameters = Pointer.to(
            Pointer.to(new int[]{numElements}),
            Pointer.to(deviceInputA),
            Pointer.to(deviceInputB),
            Pointer.to(deviceOutput)
        );

        // Launch the kernel with enough blocks to cover all elements.
        int blockSizeX = 256;
        int gridSizeX = (int)Math.ceil((double)numElements / blockSizeX);
        cuLaunchKernel(function,
            gridSizeX, 1, 1,        // Grid dimension
            blockSizeX, 1, 1,       // Block dimension
            0, null,                // Shared memory size and stream
            kernelParameters, null  // Kernel- and extra parameters
        );
        cuCtxSynchronize();

        // Allocate host output memory and copy the device output
        // to the host.
        float hostOutput[] = new float[numElements];
        cuMemcpyDtoH(Pointer.to(hostOutput), deviceOutput,
            numElements * Sizeof.FLOAT);

        // View the result.
        for (int i = 0; i < numElements; i++)
        {
            System.out.println("At index " + i + " found " + hostOutput[i]);
        }

        // Free the memory on the device.
        cuMemFree(deviceInputA);
        cuMemFree(deviceInputB);
        cuMemFree(deviceOutput);
    }
}
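Printing all 100,000 result lines is mostly noise; if you prefer, the print loop above can be replaced with a programmatic check. A small sketch, using only variables already defined in the program:

// Optional: verify the result on the host instead of printing every element.
boolean passed = true;
for (int i = 0; i < numElements; i++)
{
    float expected = hostInputA[i] + hostInputB[i];
    if (Math.abs(hostOutput[i] - expected) > 1e-5f)
    {
        System.out.println("Mismatch at index " + i);
        passed = false;
        break;
    }
}
System.out.println("Test " + (passed ? "PASSED" : "FAILED"));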
Step II : Write your CUDA kernel
CUDA kernels are functions written in CUDA C, an extension of the C language, and they are executed directly on the GPU device.
The kernel below adds the vectors 'a' and 'b' and stores the result in the vector 'sum'. Each thread computes its global index i and, if i is within bounds, adds a single pair of elements; the bounds check is needed because the grid may contain more threads than there are elements.
JCudaVectorAdd.cu
extern "C"
__global__ void add(int n, float *a, float *b, float *sum)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i<n)
{
sum[i] = a[i] + b[i];
}
}
Step III : Compile your CUDA code
CUDA kernels can be compiled into .ptx or .cubin files by the nvcc compiler. This creates a file that can be loaded and executed through the driver API.
The drawback of cubin files is that they are specific to a Compute Capability (essentially a version number for the hardware): a cubin compiled for one Compute Capability cannot be loaded on a GPU with a different one. We therefore prefer compiling to a PTX file, since PTX is compiled at runtime for the GPU of the target machine.
Below is the command for compiling the CUDA code into a PTX file on Linux:
nvcc -ptx JCudaVectorAdd.cu -o JCudaVectorAdd.ptx
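Copy the resulting JCudaVectorAdd.ptx into src/main/resources/cuda-binaries/ so that the cuModuleLoad call from Step I can find it.
If you do want an architecture-specific binary instead, nvcc can also emit a cubin for a single Compute Capability; sm_30 below matches the GTX 680 used here, so adjust it for your card:

nvcc -cubin -arch=sm_30 JCudaVectorAdd.cu -o JCudaVectorAdd.cubin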
Step IV : Compile and run your Java program
You can compile your Java program using the following command from your Java project directory:
javac -cp ".:jcuda-x.x.x.jar" JCudaVectorAdd.java
This will create a 'JCudaVectorAdd.class' file in your project's directory.
You can then run the program using the following command:
java -cp ".:jcuda-x.x.x.jar" JCudaVectorAdd
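Since both input vectors hold the values 0 to 99999, each output element is simply twice its index, so the program should print lines like the following:

At index 0 found 0.0
At index 1 found 2.0
At index 2 found 4.0
...
At index 99999 found 199998.0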
NOTE : If you face errors while executing the program (for example, complaints that the JCuda native libraries cannot be found), try setting the environment variables below in your ~/.bashrc file and try again:
export LD_LIBRARY_PATH=/usr/local/cuda/lib64/:/usr/local/cuda/lib:/path/to/your/jcuda/parent/directory
export LD_PRELOAD=/usr/lib64/libcuda.so