Sunday, May 1, 2011

Hadoop - an Introduction



Growing data - a real-world problem


Storage capacities of hard drives are increasing day by day at an alarming rate, but access speeds have not kept up.
A typical drive from 1990 could store 1,370 MB of data and had
a transfer speed of 4.4 MB/s, so you could read all the data from a full drive in around
five minutes. Over 20 years later, one-terabyte drives are the norm, but the transfer
speed is around 100 MB/s, so it takes more than two and a half hours to read all the
data off the disk.

Writing is even slower than reading.

Solution: read from multiple disks at the same time. Say 100 disks, each carrying 1/100th of the data and read in parallel; the read time drops to roughly 2 minutes. HDFS helps here.
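As a quick back-of-the-envelope check of those numbers (just a throwaway sketch in Java; the capacity, transfer rate, and disk count are the approximate figures quoted above):

public class ReadTime {
    public static void main(String[] args) {
        double driveCapacityMB = 1000000;  // ~1 TB drive, expressed in MB
        double transferRateMBps = 100;     // ~100 MB/s transfer speed

        // One disk: read the whole drive serially.
        double oneDiskMinutes = driveCapacityMB / transferRateMBps / 60;
        System.out.printf("1 disk   : ~%.0f minutes%n", oneDiskMinutes);        // ~167 minutes (~2.8 hours)

        // 100 disks in parallel, each holding 1/100th of the data.
        int disks = 100;
        double parallelMinutes = (driveCapacityMB / disks) / transferRateMBps / 60;
        System.out.printf("%d disks: ~%.1f minutes%n", disks, parallelMinutes); // ~1.7 minutes
    }
}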

But with more disks, hardware failure => loss of data becomes more likely. Redundancy (keeping copies of the data) helps here.

Analysis will typically need to combine the data read from all the disks in some way. MapReduce helps here.

=>  Hadoop provides a reliable data storage and analysis system.

RDBMS vs MapReduce


Why can't we use databases with lots of disks to do large-scale batch analysis?
Ans: The growing trend is that seek time is improving much more slowly than transfer rate. Seeking is the process of moving the disk's head to a particular place on the disk to read or write data. It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk's bandwidth.
So when access to a large dataset is dominated by seeks (lots of small reads and writes scattered across the disk), latency dominates, and it can take longer than simply streaming through the whole dataset.
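To see how big the gap can be, here is a rough sketch (the 10 ms seek time and 100-byte record size are assumed typical values, not figures from any benchmark; the transfer rate is the one quoted earlier):

public class SeekVsStream {
    public static void main(String[] args) {
        double datasetBytes = 1e12;      // 1 TB dataset
        double transferBps  = 100e6;     // ~100 MB/s streaming transfer rate
        double seekSeconds  = 0.010;     // ~10 ms per seek (assumed typical value)
        double recordBytes  = 100;       // illustrative record size
        double fraction     = 0.01;      // touch 1% of the records, scattered randomly

        double recordsTouched = (datasetBytes / recordBytes) * fraction;
        double seekDays  = recordsTouched * seekSeconds / 86400;
        double streamHrs = datasetBytes / transferBps / 3600;

        System.out.printf("Seeking to 1%% of the records: ~%.0f days%n", seekDays);   // ~12 days
        System.out.printf("Streaming the whole dataset : ~%.1f hours%n", streamHrs);  // ~2.8 hours
    }
}

So even touching a small fraction of a large dataset record by record can be far slower than reading the entire dataset sequentially, which is why MapReduce streams through data rather than seeking into it.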
=>  RDBMS is good where there are many updates to a smaller range of data. MapReduce is good for batch analysis of whole datasets where updates are rare.
=>  Another difference is that an RDBMS processes structured data, while MapReduce also handles semi-structured or unstructured data well.
Relational data is often normalized to retain its integrity and remove redundancy.
Normalization poses problems for MapReduce, since it makes reading a record a nonlocal
operation, and one of the central assumptions that MapReduce makes is that it
is possible to perform (high-speed) streaming reads and writes.


MapReduce


MapReduce tries to collocate the data with the compute node, so data access is fast
since it is local. This feature, known as data locality, is at the heart of MapReduce and
is the reason for its good performance. Recognizing that network bandwidth is the most
precious resource in a data center environment (it is easy to saturate network links by
copying data around), MapReduce implementations go to great lengths to conserve it
by explicitly modelling network topology.
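
To make the programming model concrete, here is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API (the org.apache.hadoop.mapreduce classes; the input and output paths passed on the command line are just placeholders):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: for every word in the input split, emit (word, 1).
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce phase: sum all the 1s emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        // Combiner sums counts on the map side, saving network bandwidth.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The mappers run on the nodes where the input blocks live (data locality), and only the much smaller (word, count) pairs cross the network to reach the reducers. It would be run with something like: hadoop jar wordcount.jar WordCount /user/me/input /user/me/output (paths are illustrative).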

History of Hadoop

Hadoop was created by Doug Cutting, the creator of Apache Lucene, a widely used text search library. Hadoop originated from Apache Nutch, an open source web search engine, itself a part of the Lucene project.
It was the idea of building a web search engine from scratch that provided the baby steps towards the creation of Hadoop.
2002 The Nutch project was started.
2003 Google published a white paper describing the Google File System (GFS) used in production at Google.
2004 Nutch Distributed File System (NDFS) implementation started.
2004 Google published another paper introducing MapReduce to the world.
2005 Nutch had a working MapReduce implementation, and all its major algorithms were running on MapReduce and NDFS.
2006 NDFS and the MapReduce implementation moved out of Nutch to form a separate Lucene sub-project called Hadoop. The same year Doug Cutting joined Yahoo!.
2008 Yahoo! announced that its production search index was being generated by a 10,000-core Hadoop cluster.
By this time Hadoop was also being used by other companies such as Last.fm, Facebook and The New York Times.
