Updated for Lucene 3.0.3
Introduction
Apache Lucene is a powerful, high-performance, and scalable open-source search library, originally written by Doug Cutting, that lets you easily add search to any application.
It is written in Java, yet has ports and integrations for many other programming languages (C/C++, C#, Ruby, Perl, Python, and PHP, among others).
Be careful: Lucene is not an application. It is a software library that has to be used by your application.
There is no doubt that Lucene is fast, and it is Lucene's inverted index implementation that gives it this speed. An inverted index is a data structure that maps content to the locations of the documents containing it. To make this clearer, think of fetching a list of employees from an employee database by some user-specified content (name, employee id, designation, or anything else).
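To make the idea concrete, here is a toy inverted index in plain Java. This is only a sketch of the concept, not Lucene's actual on-disk format; the class and method names are made up for illustration:

```java
import java.util.*;

// Minimal sketch of an inverted index: each term maps to the sorted set of
// document ids that contain it (hypothetical helper, not the Lucene API).
public class TinyInvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    public void add(int docId, String text) {
        // Naive analysis: lowercase and split at non-letters.
        for (String term : text.toLowerCase().split("[^a-z]+")) {
            if (term.isEmpty()) continue;
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    public SortedSet<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(),
                Collections.emptySortedSet());
    }

    public static void main(String[] args) {
        TinyInvertedIndex idx = new TinyInvertedIndex();
        idx.add(1, "Doug Cutting wrote Lucene");
        idx.add(2, "Lucene is fast");
        System.out.println(idx.search("Lucene")); // [1, 2]
        System.out.println(idx.search("fast"));   // [2]
    }
}
```

Because lookup goes from term to document ids, matching documents are found without scanning every document, which is where the speed comes from.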
Download and Installation
You can download the latest version of Lucene from the Apache Download Mirrors.
Installing Lucene is simple: just add lucene-core-xxx.jar to your classpath and you are done.
The Java API of Lucene 3.0.3 can be found here.
Lucene also provides a number of extension modules, such as the spellchecker and highlighter. These modules can be found in the contrib directory of the Lucene package.
Building Blocks: Lucene Indexing
I would like you to understand the following classes before proceeding to index creation:
- IndexWriter
- Directory
- Document
- Field
- Analyzer
IndexWriter
IndexWriter is the backbone of the indexing process. It either creates a new index or opens an existing one (depending on the boolean argument you pass), and lets you add, delete, or update documents. IndexWriter is completely thread-safe.
IndexWriter passes the input you feed it through an Analyzer before adding it to the index.
For added documents, flushing is triggered either by the RAM used by buffered documents or by the number of documents added. By default, a flush happens when 16 MB of RAM is used.
For best performance, set a large RAM buffer using the setRAMBufferSizeMB(double) method.
When you open an IndexWriter, a lock file is created. If you open another IndexWriter on the same directory, you will get a LockObtainFailedException.
The IndexWriter also needs to know where to store the index. This is where the Directory class comes into the picture.
Directory
The Directory class represents the location where the index is stored. It is an abstract class, with subclasses such as FileSwitchDirectory, FSDirectory, and RAMDirectory. The Directory class indirectly uses Java I/O for file management.
Lucene's Directory gives you the option to store the index as files on disk (using FSDirectory) or in RAM (using RAMDirectory) for high performance.
Document
Lucene never indexes the raw data that you provide. Rather, it indexes Document objects. A Lucene index can be seen as a collection of Document objects, and each Document is in turn a collection of Field objects, which are name-value pairs.
Choosing Field objects can be a bit tricky, depending on the structure of your data and on what you want indexed.
For example, consider a stream of XML documents with the following structure:
<events>
  <event>
    <color>red</color>
    <shape>irregular</shape>
    <width>5'1"</width>
    <height>2'1"</height>
    <speed>15kms/hr</speed>
    <direction>east</direction>
    <time>2005-10-30 T 10:45 UTC</time>
  </event>
  <event>
  ..........
</events>
This can be structured in an index by adding each event as a Document object and each attribute (color, shape, etc.) as a Field object. For example, 'color' would be the field name and 'red' its value.
Lucene only handles java.lang.String, java.io.Reader, and native numeric types (such as int or float). So if you want to index large text files, you can represent each file as a Document object whose Field objects hold the file's title and a Reader over its contents. Every time Lucene needs the file's text, it can read it through that Reader.
Field
The Field class holds the textual value to be indexed. It has a name, a value, and options that let you decide how Lucene will index the field value.
A field value can be a String, a Reader, or an atomic keyword (a date, URL, etc.) that is not further processed.
Note: A document can have more than one field with the same name. The values of all such fields are appended at indexing time; when you search that field, it behaves like a single text field with all the values concatenated in the order they were added.
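The append behaviour for same-named fields can be sketched in plain Java (a toy model, not the Lucene API; the class and method names here are made up):

```java
import java.util.*;

// Toy model: a document is a list of (name, value) pairs, and a search on a
// field name sees all its values joined in insertion order.
public class MultiValuedField {
    public static String fieldText(List<String[]> fields, String name) {
        StringBuilder sb = new StringBuilder();
        for (String[] f : fields) {
            if (f[0].equals(name)) {
                if (sb.length() > 0) sb.append(' ');
                sb.append(f[1]); // values are appended in the order added
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String[]> doc = new ArrayList<>();
        doc.add(new String[]{"author", "Doug"});
        doc.add(new String[]{"author", "Cutting"});
        System.out.println(fieldText(doc, "author")); // Doug Cutting
    }
}
```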
Analyzer
The job of the Analyzer class is to take plain text as input and extract tokens from it, which are then stored in the index. The analyzer takes care of everything: dropping stop words, breaking text (usually at whitespace), and removing punctuation.
An analyzer consists of a Tokenizer, which converts text into raw tokens, followed by TokenFilters, whose output is finally stored in the index.
Lucene provides Analyzer as an abstract class along with several implementations.
Let's have a look at how the most widely used analyzers work:
Input text : "No pain, No gain"
Results :
- WhitespaceAnalyzer : [No] [pain,] [No] [gain]
- SimpleAnalyzer : [no] [pain] [no] [gain]
- StopAnalyzer : [pain] [gain]
- StandardAnalyzer : [pain] [gain]
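The differences above can be reproduced with a plain-Java sketch. This is a hypothetical re-implementation for intuition only, not the Lucene API, and the stop-word list here is a small made-up subset:

```java
import java.util.*;

public class AnalyzerDemo {
    // Tiny illustrative stop-word list (Lucene's real list is longer).
    static final Set<String> STOP =
            new HashSet<>(Arrays.asList("no", "a", "an", "the", "and"));

    // WhitespaceAnalyzer: split on whitespace only; keep case and punctuation.
    static List<String> whitespace(String text) {
        return Arrays.asList(text.split("\\s+"));
    }

    // SimpleAnalyzer: split at non-letters and lowercase.
    static List<String> simple(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z]+"))
            if (!t.isEmpty()) out.add(t);
        return out;
    }

    // StopAnalyzer: SimpleAnalyzer plus stop-word removal.
    static List<String> stop(String text) {
        List<String> out = new ArrayList<>();
        for (String t : simple(text))
            if (!STOP.contains(t)) out.add(t);
        return out;
    }

    public static void main(String[] args) {
        String text = "No pain, No gain";
        System.out.println(whitespace(text)); // [No, pain,, No, gain]
        System.out.println(simple(text));     // [no, pain, no, gain]
        System.out.println(stop(text));       // [pain, gain]
    }
}
```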
Hey!! Do StopAnalyzer and StandardAnalyzer work similarly?
The answer is no. Let's understand the difference between the two.
StandardAnalyzer filters the output of StandardTokenizer with StandardFilter, LowerCaseFilter, and StopFilter.
StandardTokenizer does the following:
- Splits words at punctuation characters and removes the punctuation. A dot that is not followed by whitespace is considered part of a token.
- Splits words at hyphens, unless there is a number in the token.
- Recognizes email addresses and Internet host names as single tokens.
StopAnalyzer filters the output of LetterTokenizer with LowerCaseFilter and StopFilter. A LetterTokenizer divides text at non-letters.
Lucene Indexing : It's fun
I will guide you through the Lucene indexing process in five simple steps:
1. Acquire Content
Lucene doesn't provide any support for acquiring content to index; collecting the raw data is the sole responsibility of your application. You can use third-party tools, crawlers (such as Grub, Heritrix, Aperture, and many more), or your own code.
2. Create an IndexWriter
The first thing you need to do is choose the location where you want to store the index.
You can store the index in memory:
Directory dir = new RAMDirectory();
Or you can store the index in a physical directory:
Directory dir = FSDirectory.open(new File("path_to_your_directory"));
Before you create an instance of IndexWriter, you need to choose an analyzer. I will use StandardAnalyzer for the time being:
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
Now create the IndexWriter:
IndexWriter writer = new IndexWriter(dir, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
The third (boolean) argument creates a new index when true and uses an existing one when false. The fourth argument determines the maximum number of tokens a field can hold; in our case, it is unlimited.
3. Create Documents and add Fields
As mentioned earlier, Lucene doesn't index your raw data; it has to be converted into Document objects that the Lucene core can index.
Document doc = new Document();
doc.add(new Field("color", "red", Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("shape", "irregular", Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("width", "5'1\"", Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("height", "2'1\"", Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("speed", "15kms/hr", Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("direction", "east", Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("time", "2005-10-30 T 10:45 UTC", Field.Store.YES, Field.Index.NOT_ANALYZED));
The Field class constructor takes name and value of the field as the first two arguments. The 'value' argument is the actual text that is indexed.
The third argument specifies whether a field should be stored and can hold one of two values :
- Field.Store.YES : Store the original field value in the index.
- Field.Store.NO : Do not store the field value in the index.
The fourth argument specifies whether and how a field should be indexed. It can hold one of five values :
- Field.Index.NO : Do not index the field value.
- Field.Index.ANALYZED : Index the tokens produced by passing the field's value through an analyzer.
- Field.Index.NOT_ANALYZED : Index the field value without using an analyzer.
- Field.Index.NOT_ANALYZED_NO_NORMS : Index the field value without using an analyzer and do not store the norms.
- Field.Index.ANALYZED_NO_NORMS : Index the tokens produced by passing the field value through an analyzer and do not store the norms.
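To build intuition for the ANALYZED vs NOT_ANALYZED choice, here is a plain-Java sketch of which tokens would end up in the index for a given field value. The class and method names are invented for this illustration; this is not the Lucene API:

```java
import java.util.*;

public class FieldIndexingSketch {
    // ANALYZED: the value is run through an analyzer, producing many tokens.
    // (Naive stand-in analyzer: lowercase, split at non-alphanumerics.)
    static List<String> analyzed(String value) {
        List<String> out = new ArrayList<>();
        for (String t : value.toLowerCase().split("[^a-z0-9]+"))
            if (!t.isEmpty()) out.add(t);
        return out;
    }

    // NOT_ANALYZED: the whole value becomes a single token, verbatim.
    static List<String> notAnalyzed(String value) {
        return Collections.singletonList(value);
    }

    public static void main(String[] args) {
        String time = "2005-10-30 T 10:45 UTC";
        System.out.println(analyzed(time));    // [2005, 10, 30, t, 10, 45, utc]
        System.out.println(notAnalyzed(time)); // [2005-10-30 T 10:45 UTC]
    }
}
```

For a value like the time field above, NOT_ANALYZED is usually what you want: searches must then match the stored value exactly, rather than its fragments.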
4. Index Documents
The next step is to index the Document objects by adding them to the IndexWriter:
writer.addDocument(doc);
This method periodically flushes buffered documents to the Directory and also periodically merges segments in the index.
5. Close Index
The final step is to optimize the index for best performance and close it.
If you want optimal search performance and you are not going to add more documents for a while, you can call the optimize() method:
writer.optimize();
The IndexWriter holds a lock from the moment it is instantiated, so you cannot open the index elsewhere in the meantime. Hence, you should finally call the close() method, which releases the lock and makes your index ready to use:
writer.close();
If you ever get an OutOfMemoryError, still call the close() method; it internally calls rollback(), undoing all changes made to the index since the last commit.
Cheers! You are done now.
Note : You can open and update an existing index while it is in use. This will not trouble users who are searching the existing index.
Testing the Index : using the Luke index toolbox
You can now search, update, and delete documents in your index using the open-source Luke index toolbox.
It is quite simple: just download the Luke toolbox, browse to your index directory, and click 'OK'.
Enjoy! You can now view all the terms and documents in the Luke interface.
What's next?
I will walk you through the Lucene search API and more in my upcoming blogs.
I hope this blog was of great help to you. I will try to be more contextual in my future blogs.
indoos said...
What are the maximum data volumes that can be handled by Lucene on an average server class machine effectively?
How can clustering/sharding be supported to have indexed searches running on data running into TBs of raw data?
***************************************************************************************************************
1. Lucene performs well on commodity machines.
I would like to share one experience where Lucene worked fine with around 1.5 billion documents. The total size of the data, in 16 partitions, was around 950 GB. After all the data was extracted from the source system, it took about 8 hours to index everything. The test ran on a dedicated 16-node cluster, each machine with 4 cores and 3 GB of RAM, just for indexing.
2. Sorry to say, Lucene itself doesn't support sharded search. For that you can use Solr Cloud or Katta.
An approach that can be applied is:
Database-Clustered Local Search
In this approach, indices are used from the local disk but backed up to the database as Lucene segments. A cluster app node synchronizes its local copy of the search index with the database. When new content is added by one of the cluster app nodes, that node updates the backup copy in the database. On receipt of index reload events, all cluster app nodes resynchronize with the database, downloading changed and new search segments.