Thursday, February 28, 2013

Leveraging GPU from Java applications using JCuda

Introduction


GPU computing has become a major advantage nowadays for massively parallel processing applications. Leveraging the full power of the GPU from Java-based applications has been a long-standing concern among Java developers.

JCuda lets you create cross-platform CUDA offerings that can be easily accessed from your Java applications and run on any operating system supported by CUDA.

 

Prerequisites


You will need to have the following installed on your system:
  1. Java SDK
  2. An NVIDIA graphics card with the CUDA driver and toolkit installed
  3. JCuda library

Verify the Driver


You can verify if the CUDA driver is properly installed using the following command:

[centos@company-d017 ~]$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2012 NVIDIA Corporation
Built on Fri_Sep_21_17:28:58_PDT_2012
Cuda compilation tools, release 5.0, V0.2.1221

The code we are going to cover in this blog has been tested on Nvidia GeForce GTX 680 and CUDA driver version 5.0.
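
You can also confirm that the GPU itself is visible to the driver by running nvidia-smi (assuming the utility is on your PATH); it lists the installed GPUs along with the driver version:

[centos@company-d017 ~]$ nvidia-smi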


Step I : Write a sample JCuda program 


This JCuda program adds two vectors, each having 100,000 elements, and displays the result.

The addition of the vectors is performed in parallel on the GPU device. Note that the JCudaVectorAdd.ptx file is in the resources folder of the Java project; we will generate this PTX file in Step III.


     // Note: this listing assumes the usual JCuda imports, i.e.
     //   import static jcuda.driver.JCudaDriver.*;
     //   import jcuda.*;
     //   import jcuda.driver.*;

     // Enable exceptions and omit all subsequent error checks
     JCudaDriver.setExceptionsEnabled(true);
     
     // Initialize the driver and create a context for the first device.  
     cuInit(0);  
     CUdevice device = new CUdevice(); 
     if (cuDeviceGet(device, 0) != CUresult.CUDA_SUCCESS) {
           throw new RuntimeException("Unable to get GPU device");
     } 
     CUcontext context = new CUcontext();  
     cuCtxCreate(context, 0, device);  
     
     // Loads the ptx file.  
     CUmodule module = new CUmodule();  
     cuModuleLoad(module, "src/main/resources/cuda-binaries/JCudaVectorAdd.ptx");  
     
     // Obtain a function pointer to the "add" kernel function.  
     CUfunction function = new CUfunction();  
     cuModuleGetFunction(function, module, "add");  
     int numElements = 100000;  
     
     // Allocate and fill the host input data  
     float hostInputA[] = new float[numElements];  
     float hostInputB[] = new float[numElements];  
     for(int i = 0; i < numElements; i++)  
     {  
       hostInputA[i] = (float)i;  
       hostInputB[i] = (float)i;  
     }  
     
     // Allocate the device input data, and copy the  
     // host input data to the device  
     CUdeviceptr deviceInputA = new CUdeviceptr();  
     cuMemAlloc(deviceInputA, numElements * Sizeof.FLOAT);  
     cuMemcpyHtoD(deviceInputA, Pointer.to(hostInputA),  
       numElements * Sizeof.FLOAT);  
     CUdeviceptr deviceInputB = new CUdeviceptr();  
     cuMemAlloc(deviceInputB, numElements * Sizeof.FLOAT);  
     cuMemcpyHtoD(deviceInputB, Pointer.to(hostInputB),  
       numElements * Sizeof.FLOAT);  
     
     // Allocate device output memory  
     CUdeviceptr deviceOutput = new CUdeviceptr();  
     cuMemAlloc(deviceOutput, numElements * Sizeof.FLOAT); 

     // Set up the kernel parameters: A pointer to an array  
     // of pointers which point to the actual values.  
     Pointer kernelParameters = Pointer.to(  
       Pointer.to(new int[]{numElements}),  
       Pointer.to(deviceInputA),  
       Pointer.to(deviceInputB),  
       Pointer.to(deviceOutput)  
     );  
     
     // Call the kernel function.  
     int blockSizeX = 256;  
     int gridSizeX = (int)Math.ceil((double)numElements / blockSizeX);  
     cuLaunchKernel(function,  
       gridSizeX, 1, 1,   // Grid dimension  
       blockSizeX, 1, 1,   // Block dimension  
       0, null,        // Shared memory size and stream  
       kernelParameters, null // Kernel- and extra parameters  
     );  
     cuCtxSynchronize();  
     
     // Allocate host output memory and copy the device output  
     // to the host.  
     float hostOutput[] = new float[numElements];  
     cuMemcpyDtoH(Pointer.to(hostOutput), deviceOutput,  
       numElements * Sizeof.FLOAT);  
     
     // View the result   
     for(int i = 0; i < numElements; i++)  
     {  
         System.out.println(  
           "At index "+i+ " found "+hostOutput[i]);    
     }  
     
     // Free the memory on device.  
     cuMemFree(deviceInputA);  
     cuMemFree(deviceInputB);  
     cuMemFree(deviceOutput);   
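
Optionally, once you are completely done with the device, you can also destroy the CUDA context that was created at the beginning. This is not part of the listing above, just a small clean-up sketch using the driver API:

     // Destroy the context created earlier (optional clean-up)
     cuCtxDestroy(context);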


Step II : Write your CUDA kernel


CUDA kernels are functions written in CUDA C, a language quite similar to C. These kernels are executed directly on the GPU device.
The below example adds vectors 'a' and 'b' and saves the result to the vector 'sum'.

JCudaVectorAdd.cu

 extern "C"  
 __global__ void add(int n, float *a, float *b, float *sum)  
 {  
   int i = blockIdx.x * blockDim.x + threadIdx.x;  
   if (i<n)  
   {  
     sum[i] = a[i] + b[i];  
   }  
 }  


Step III : Compile your CUDA code


The CUDA kernels can be compiled into .ptx or .cubin files by the nvcc compiler. This creates a file that can be loaded and executed using the Driver API.

The drawback of using CUBIN files is that they are specific to the Compute Capability (essentially a version number for the hardware), and CUBIN files that have been compiled for one Compute Capability cannot be loaded on a GPU with a different Compute Capability.
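
For completeness, if you did want to produce a CUBIN for a specific card, you would pin it to that card's Compute Capability with the -arch flag; for example, for a Compute Capability 3.0 device such as the GTX 680 (a sketch, adjust the value for your GPU):

  nvcc -cubin -arch=sm_30 JCudaVectorAdd.cu -o JCudaVectorAdd.cubin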

We therefore prefer compiling to a PTX file, since PTX files are compiled at runtime for the GPU of the target machine.

Below is the command for compiling the CUDA code to a PTX file on Linux:

  nvcc -ptx JCudaVectorAdd.cu -o JCudaVectorAdd.ptx   


Step IV : Compile and run your Java program


You can compile your Java program using the following command from your Java project directory:

 javac -cp ".:jcuda-x.x.x.jar" JCudaVectorAdd.java  

This will create a 'JCudaVectorAdd.class' file in your project's directory.

You can then run the program using the following command:

 java -cp ".:jcuda-x.x.x.jar" JCudaVectorAdd  
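
If everything is set up correctly, the program prints one line per element. Since both input vectors hold the value i at index i, each output element is simply 2*i, so the first few lines should look roughly like this:

 At index 0 found 0.0
 At index 1 found 2.0
 At index 2 found 4.0
 ...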


NOTE : If you face errors while executing the program, try setting the environment variables below in your .bashrc file and try again:

export LD_LIBRARY_PATH=/usr/local/cuda/lib64/:/usr/local/cuda/lib:/path/to/your/jcuda/parent/directory

export LD_PRELOAD=/usr/lib64/libcuda.so

Thursday, August 30, 2012

Installing Oracle R Connector For Hadoop on Linux machine

Oracle R Connector for Hadoop (ORCH) is a package that provides a way of interacting with Hadoop from your local R session. Using it, you can copy data between R memory, the local filesystem and HDFS. Besides this, you can schedule R programs to execute map-reduce jobs in Hadoop and return the data to any of the above-mentioned locations.


Pre-requisites for installing ORCH
  • JVM
  • R distribution 2.13.2 or above, with all base libraries, on all nodes in the Hadoop cluster.

NOTE: You will need to install the ORCH package on each R distribution on each Hadoop node.


Steps for installing ORCH
  1. Set the environment variables for Hadoop and Java as follows:

    export HADOOP_HOME=/path/to/your/Hadoop/home

    export JAVA_HOME=/path/to/your/java/home
  2. Download the package from here and unzip the downloaded file.

    $ unzip orhc.tgz.zip

     NOTE: If you downloaded just a .zip file, after unzipping it please manually convert the result to a .tgz file. For example, I got a folder named 'orch' after unzipping, and I used the following command to create the .tgz file :

      $ tar cvzf orch.tgz orch
  3. Install the package using the following command:

     $ R CMD INSTALL orch.tgz
  4. Alternatively you can open R and install the package as follows:

     > install.packages("/path/to/orch.tgz", repos=NULL)

You are done with the installation now. Cheers!!



Sunday, May 1, 2011

Hadoop - an Introduction



Growing data - a real world problem


Storage capacities of hard drives have been increasing day by day at an alarming rate, but access speeds have not kept up. One typical drive from 1990 could store 1,370 MB of data and had a transfer speed of 4.4 MB/s, so you could read all the data from a full drive in around five minutes. Over 20 years later, one-terabyte drives are the norm, but the transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all the data off the disk.

Writing is even slower than reading.

Solution : read from multiple disks at the same time. Say 100 disks, each carrying 1/100th of the data; read time drops to roughly two minutes. HDFS helps here.

Hardware failure => loss of data. Redundancy helps here.

Analysis of the data will need to combine data from all the disks. MapReduce helps here.

=> Hadoop provides a reliable data storage and analysis system.

RDBMS vs Map-reduce


Why can't we use databases with lots of disks to do large-scale batch analysis?
Ans : The growing trend shows seek time improving much more slowly than transfer rate. Seeking is the process of moving the disk's head to a particular place on the disk to read or write data. It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk's bandwidth.
So with large data sets, seek time dominates and adds to the latency of reads and writes.
=> An RDBMS is good where there are many updates over a smaller range of data. MapReduce is good for batch analysis where updates are rare.
=> Another difference is that an RDBMS processes structured data, whereas MapReduce is powerful with semi-structured or unstructured data.
Relational data is often normalized to retain its integrity and remove redundancy. Normalization poses problems for MapReduce, since it makes reading a record a non-local operation, and one of the central assumptions MapReduce makes is that it is possible to perform (high-speed) streaming reads and writes.


Map reduce


MapReduce tries to collocate the data with the compute node, so data access is fast since it is local. This feature, known as data locality, is at the heart of MapReduce and is the reason for its good performance. Recognizing that network bandwidth is the most precious resource in a data center environment (it is easy to saturate network links by copying data around), MapReduce implementations go to great lengths to conserve it by explicitly modelling network topology.

History of Hadoop

Hadoop was created by Doug Cutting, the creator of Apache Lucene, a widely used text search library. Hadoop originated from Apache Nutch, an open-source web search engine, itself a part of the Lucene project.
It was the idea of creating a web search engine from scratch that formed the baby steps towards the creation of Hadoop.
2002 : Nutch was started.
2003 : Google published a white paper describing the Google File System (GFS) used in production at Google.
2004 : Nutch Distributed File System (NDFS) implementation started.
2004 : Google published another paper introducing MapReduce to the world.
2005 : Nutch had an implementation of MapReduce, and all its algorithms were run using MapReduce and NDFS.
2006 : NDFS and the MapReduce implementation moved out of Nutch to form a separate sub-project of Lucene called Hadoop. The same year, Doug Cutting joined Yahoo!.
2008 : Yahoo! announced that its production search index was being generated by a 10,000-core Hadoop cluster.
By this time Hadoop was being used by other companies too, such as Last.fm, Facebook and The New York Times.

Monday, December 13, 2010

Apache Lucene - Indexing documents

  Updated for Lucene 3.0.3

Introduction

Apache Lucene is a powerful, high-performance, and scalable open-source Java search library written by Doug Cutting that lets you easily add search to any application.
It is written in Java, yet has integrations with other programming languages (C/C++, C#, Ruby, Perl, Python, and PHP among others).

Be careful: Lucene is not an application. It is a software library that has to be used by your application.


No doubt, Lucene is fast! It is Lucene's inverted index implementation that gives it its real pace. An inverted index is a data structure that stores mappings from content to the locations of the documents containing it. To make this clearer, think of getting a list of employees from an employee database that match some user-specified content (name, employee ID, designation or anything else).


Download and Installation

You can download the latest version of Lucene from the Apache Download Mirrors.

Installing Lucene is simple. Just add lucene-core-xxx.jar to your classpath and you are done.

The Java API of Lucene 3.0.3 can be found here.

Lucene also provides a number of extension modules, like the spellchecker and highlighter modules. These modules can be found in the contrib module of the Lucene package.



Building blocks : Lucene Indexing

I would like you to understand the following classes before proceeding to creation of index :

  • IndexWriter
  • Directory
  • Document
  • Field
  • Analyzer

IndexWriter 

IndexWriter is the backbone of the indexing process. It either creates a new index or updates an existing one (depending on the boolean argument you pass), so that you can add, delete or update documents. IndexWriter is completely thread-safe. 

IndexWriter passes the input you feed it through an Analyzer before adding it to the index.

For the added documents, flushing is triggered either by the RAM used by the buffered documents or by the number of documents added. The default is to flush when 16 MB of RAM is used. 
For best performance, you should flush with a large RAM buffer, set via the setRAMBufferSizeMB(double) method.
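
For example, once you have an IndexWriter (as created in the indexing steps below), raising the buffer is a one-liner; the 48 MB here is just an illustrative value:

writer.setRAMBufferSizeMB(48.0);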

When you open an IndexWriter, a lock file is created. If you open another IndexWriter on the same directory you will get a LockObtainFailedException.

The IndexWriter also needs to know where to store the index. This is where the Directory class comes into the picture. 

Directory

Directory class represents the location where the index is to be stored. It is an abstract class with subclasses such as FileSwitchDirectory, FSDirectory and RAMDirectory. The Directory class indirectly uses Java I/O for file management.

Lucene Directory provides you the option to store the index as files (using FSDirectory) or in RAM (using RAMDirectory) for high performance.

Document

Lucene never indexes the raw data that you provide. Rather, it indexes Document objects. A Lucene index can be seen as a collection of Document objects. Further, each Document is a collection of Field objects, which are name-value pairs.

Choosing Field objects is a bit tricky depending on the structure of the file data and the data that you want to be indexed.

For example, consider a stream of XML documents with following structure :

<events>
          <event>
                    <color>red</color>
                    <shape>irregular</shape>
                    <width>5'1"</width>
                    <height>2'1"</height>
                    <speed>15kms/hr</speed>
                    <direction>east</direction>
                    <time>2005-10-30 T 10:45 UTC</time>
          </event>
          <event>
          ..........
</events>

This can be structured in an index by adding each event as a Document object and each attribute (color, shape etc.) as a Field object. For example, 'color' would be the field name and 'red' would be the value.

Lucene only handles java.lang.String, java.io.Reader, and native numeric types (such as int or float). So if you want to index large text files, you can create a Document with the title of the file and a Reader over the file's contents as its Field objects. Hence, every time you want to access the file, you can use the Reader object to read it. 
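
As a small sketch (the field names 'title' and 'contents' are just examples), indexing a large text file through a Reader could look like this:

File file = new File("path_to_your_large_text_file");
Document doc = new Document();
// The title is stored and indexed as a single token
doc.add(new Field("title", file.getName(), Field.Store.YES, Field.Index.NOT_ANALYZED));
// The contents are tokenized from the Reader but not stored in the index
doc.add(new Field("contents", new FileReader(file)));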


Field

Field class holds the textual value to be indexed. It has a name, a value, and options that let you decide how Lucene will index the field value. 

A field value can be a String, a Reader, or atomic keywords such as a date or URL, which are not processed further.

Note : A document can have more than one field with the same name. The values of all such fields are appended while indexing. When you search that field, you will get a single text field with all the values concatenated in the order they were appended. 
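
For instance, adding a hypothetical 'author' field twice behaves as if the two values had been appended to one another:

// Both values end up under the single field name "author"
doc.add(new Field("author", "John Doe", Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("author", "Jane Roe", Field.Store.YES, Field.Index.ANALYZED));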


Analyzer

The job of the Analyzer class is to take plain text as input and extract tokens from it, which are then stored in the index. The analyzer takes care of details such as omitting stop words, breaking text (usually at whitespace) and removing punctuation.

An analyzer consists of a Tokenizer that converts text into raw tokens, which are then passed through TokenFilters and finally stored in the index.

Lucene has Analyzer as an abstract class and several implementations of it.

Let's have a look how the most widely used analyzers work:

Input text :  "No pain, No gain" 


Results : 
  • WhitespaceAnalyzer : [No] [pain,] [No] [gain]
  • SimpleAnalyzer : [no] [pain] [no] [gain]
  • StopAnalyzer : [pain] [gain]
  • StandardAnalyzer : [pain] [gain]


Hey!! Do StopAnalyzer and StandardAnalyzer work similarly?


The answer is No. Let's understand the difference between both.

StandardAnalyzer filters the output of StandardTokenizer with StandardFilter, LowerCaseFilter and StopFilter.
StandardTokenizer does the following tasks :
  • Splits words at punctuation characters and removes the punctuation. A dot that is not followed by whitespace is considered part of a token.
  • Splits words at hyphens, provided there is no number in the token.
  • Recognizes email addresses and Internet hostnames as single tokens.


StopAnalyzer filters output from LetterTokenizer with LowerCaseFilter and StopFilter. A LetterTokenizer divides text at non-letters.
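
If you want to verify these token streams yourself, you can print the tokens an analyzer produces. Below is a minimal sketch for Lucene 3.0.x (the field name 'contents' is arbitrary; TokenStream and TermAttribute come from the org.apache.lucene.analysis packages):

Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
TokenStream stream = analyzer.tokenStream("contents", new StringReader("No pain, No gain"));
TermAttribute term = stream.addAttribute(TermAttribute.class);
while (stream.incrementToken()) {
    System.out.print("[" + term.term() + "] ");
}
stream.close();

Swap in SimpleAnalyzer, WhitespaceAnalyzer or StopAnalyzer to compare the outputs listed above.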


Lucene Indexing : It's fun

I will guide you through the Lucene indexing process with five simple steps : 

1. Acquire Content

Lucene doesn't provide any support for acquiring the content to be indexed. It is the sole responsibility of your application to collect the raw data. You can use third-party tools, crawlers (like Grub, Heritrix, Aperture and many more) or some piece of your own code. 

2. Create an IndexWriter

The first thing you would need to do is choose the location where you want to store the index.

Either you can store the index in memory as follows :


Directory dir = new RAMDirectory();

Or you can store the index in a physical directory as follows : 

Directory dir = FSDirectory.open(new File("path_to_your_directory"));

Before you create an instance of IndexWriter you would need to choose an analyzer. I will choose the StandardAnalyzer for the time being.

StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);

Now you should create the IndexWriter as follows : 

IndexWriter writer = new IndexWriter(dir, analyzer,
        true, IndexWriter.MaxFieldLength.UNLIMITED);

The third boolean argument creates a new index when 'true' and uses an existing one when 'false'. The fourth argument determines the maximum number of tokens a field can hold; in our case, it is unlimited.

3. Create Documents and add Fields 

As mentioned earlier, Lucene doesn't index your raw data. So it has to be converted into Document objects which can be indexed by Lucene core.

Document doc = new Document();
doc.add(new Field("color","red",Field.Store.YES,Field.Index.NOT_ANALYZED));
doc.add(new Field("shape","irregular",Field.Store.YES,Field.Index.NOT_ANALYZED));
doc.add(new Field("width","5\'1\"",Field.Store.YES,Field.Index.NOT_ANALYZED));
doc.add(new Field("height","2\'1\"",Field.Store.YES,Field.Index.NOT_ANALYZED));
doc.add(new Field("speed","15kms/hr",Field.Store.YES,Field.Index.NOT_ANALYZED));
doc.add(new Field("direction","east",Field.Store.YES,Field.Index.NOT_ANALYZED));
doc.add(new Field("time","2005-10-30 T 10:45 UTC", Field.Store.YES, Field.Index.NOT_ANALYZED));

The Field class constructor takes the name and value of the field as its first two arguments. The 'value' argument is the actual text that is indexed.

The third argument specifies whether a field should be stored and can hold one of two values :
  • Field.Store.YES : Store the original field value in the index.
  • Field.Store.NO : Do not store the field value in the index.
The fourth argument specifies whether and how a field should be indexed. It can hold one of five values : 
  • Field.Index.NO : Do not index the field value.
  • Field.Index.ANALYZED : Index the tokens produced by passing the field's value through an analyzer.
  • Field.Index.NOT_ANALYZED : Index the field value without using an analyzer.
  • Field.Index.NOT_ANALYZED_NO_NORMS : Index the field value without using an analyzer and do not store the norms.
  • Field.Index.ANALYZED_NO_NORMS : Index the tokens produced by passing the field value through an analyzer and do not store the norms.

4. Index Documents

The next step is indexing the document objects by adding the documents to IndexWriter as follows :

writer.addDocument(doc);

This method periodically flushes queued documents to the Directory, and also periodically merges segments in the index.

5. Close Index

The final step is to optimize the index for best performance and then close it.

If you desire to have optimal performance and you are not going to add more documents for a while, you can call the optimize() method as follows : 

writer.optimize();

The IndexWriter holds a lock from the moment it is instantiated, and you cannot open the index elsewhere in the meantime. Hence, you should finally call the close() method, which releases the lock and makes your index ready to use.

writer.close();

If you ever get an OutOfMemoryError, just call the close() method, which internally calls the rollback() method to undo all changes made to the index since the last commit.

Cheers! You are done now. 

Note : You can open and update an existing index while it is in use. This will not trouble the users who are searching the existing index. 


Testing Index : using Luke Index toolbox

You can now search, update and delete documents from your index by using the open-source Luke index toolbox.

This is quite simple. Just download the Luke toolbox, browse to your index directory and click 'OK'.

Enjoy! You can now view all the terms and documents in the Luke interface.

What's next?

I will walk you through the Lucene search API and more in my upcoming blogs.

Hope this blog was of great help to you. I will try to be more contextual in my future blogs.