Sunday, May 1, 2011

Hadoop - an Introduction



Growing data - a real-world problem


Storage capacities of hard drives are increasing day by day at an alarming rate, but access speeds have not kept up.
A typical drive from 1990 could store 1,370 MB of data and had
a transfer speed of 4.4 MB/s, so you could read all the data from a full drive in around
five minutes. Over 20 years later, one-terabyte drives are the norm, but the transfer
speed is around 100 MB/s, so it takes more than two and a half hours to read all the
data off the disk.

Writing is even slower than reading.

Solution: read from multiple disks at the same time. Say 100 disks, each carrying 1/100th of the data and read in parallel; the read time drops to roughly 2 minutes. HDFS helps here.
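As a quick back-of-the-envelope check of those numbers (just a throwaway sketch in Java; the capacity, transfer rate, and disk count are the approximate figures quoted above):

public class ReadTime {
    public static void main(String[] args) {
        double driveCapacityMB = 1000000;  // ~1 TB drive, expressed in MB
        double transferRateMBps = 100;     // ~100 MB/s transfer speed

        // One disk: read the whole drive serially.
        double oneDiskMinutes = driveCapacityMB / transferRateMBps / 60;
        System.out.printf("1 disk   : ~%.0f minutes%n", oneDiskMinutes);        // ~167 minutes (~2.8 hours)

        // 100 disks in parallel, each holding 1/100th of the data.
        int disks = 100;
        double parallelMinutes = (driveCapacityMB / disks) / transferRateMBps / 60;
        System.out.printf("%d disks: ~%.1f minutes%n", disks, parallelMinutes); // ~1.7 minutes
    }
}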

But with more disks, hardware failure => loss of data becomes more likely. Redundancy (keeping copies of the data) helps here.

Analysis will typically need to combine the data read from all the disks in some way. MapReduce helps here.

=>  Hadoop provides a reliable data storage and analysis system.

RDBMS vs MapReduce


Why can't we use databases with lots of disks to do large-scale batch analysis?
Ans: The growing trend is that seek time is improving much more slowly than transfer rate. Seeking is the process of moving the disk's head to a particular place on the disk to read or write data. It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk's bandwidth.
So when access to a large dataset is dominated by seeks (lots of small reads and writes scattered across the disk), latency dominates, and it can take longer than simply streaming through the whole dataset.
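To see how big the gap can be, here is a rough sketch (the 10 ms seek time and 100-byte record size are assumed typical values, not figures from any benchmark; the transfer rate is the one quoted earlier):

public class SeekVsStream {
    public static void main(String[] args) {
        double datasetBytes = 1e12;      // 1 TB dataset
        double transferBps  = 100e6;     // ~100 MB/s streaming transfer rate
        double seekSeconds  = 0.010;     // ~10 ms per seek (assumed typical value)
        double recordBytes  = 100;       // illustrative record size
        double fraction     = 0.01;      // touch 1% of the records, scattered randomly

        double recordsTouched = (datasetBytes / recordBytes) * fraction;
        double seekDays  = recordsTouched * seekSeconds / 86400;
        double streamHrs = datasetBytes / transferBps / 3600;

        System.out.printf("Seeking to 1%% of the records: ~%.0f days%n", seekDays);   // ~12 days
        System.out.printf("Streaming the whole dataset : ~%.1f hours%n", streamHrs);  // ~2.8 hours
    }
}

So even touching a small fraction of a large dataset record by record can be far slower than reading the entire dataset sequentially, which is why MapReduce streams through data rather than seeking into it.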
=>  RDBMS is good where there are many updates to a smaller range of data. MapReduce is good for batch analysis of whole datasets where updates are rare.
=>  Another difference is that an RDBMS processes structured data, while MapReduce also handles semi-structured or unstructured data well.
Relational data is often normalized to retain its integrity and remove redundancy.
Normalization poses problems for MapReduce, since it makes reading a record a nonlocal
operation, and one of the central assumptions that MapReduce makes is that it
is possible to perform (high-speed) streaming reads and writes.


MapReduce


MapReduce tries to collocate the data with the compute node, so data access is fast
since it is local. This feature, known as data locality, is at the heart of MapReduce and
is the reason for its good performance. Recognizing that network bandwidth is the most
precious resource in a data center environment (it is easy to saturate network links by
copying data around), MapReduce implementations go to great lengths to conserve it
by explicitly modelling network topology.
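
To make the programming model concrete, here is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API (the org.apache.hadoop.mapreduce classes; the input and output paths passed on the command line are just placeholders):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: for every word in the input split, emit (word, 1).
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce phase: sum all the 1s emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        // Combiner sums counts on the map side, saving network bandwidth.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The mappers run on the nodes where the input blocks live (data locality), and only the much smaller (word, count) pairs cross the network to reach the reducers. It would be run with something like: hadoop jar wordcount.jar WordCount /user/me/input /user/me/output (paths are illustrative).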

History of Hadoop

Hadoop was created by Doug Cutting, the creator of Apache Lucene, a widely used text search library. Hadoop originated from Apache Nutch, an open source web search engine, itself a part of the Lucene project.
It was the idea of building a web search engine from scratch that provided the baby steps towards the creation of Hadoop.
2002 The Nutch project was started.
2003 Google published a white paper describing the Google File System (GFS) used in production at Google.
2004 Nutch Distributed File System (NDFS) implementation started.
2004 Google published another paper introducing MapReduce to the world.
2005 Nutch had a working MapReduce implementation, and all its major algorithms were running on MapReduce and NDFS.
2006 NDFS and the MapReduce implementation moved out of Nutch to form a separate Lucene sub-project called Hadoop. The same year Doug Cutting joined Yahoo!.
2008 Yahoo! announced that its production search index was being generated by a 10,000-core Hadoop cluster.
By this time Hadoop was also being used by other companies such as Last.fm, Facebook and The New York Times.
