Hadoop and Big Data: 101


What is Hadoop and how might it fit in your organisation?

The technology that powers your social networks, your favourite search engines, your online gaming experience and your anti-spam services is provocatively accessible to every enterprise, thanks to an open source project named after a child's stuffed toy.

Apache Hadoop is the open source release of a technology that preceded just about every data storage and analytics tool that has since been labelled 'big data'. 

It is - in many respects - the foundation of new types of services we've never seen before, and many businesses have quickly found themselves playing catch-up for fear of being left behind. While we don't all have the unique technology requirements of a Google or a Facebook, any IT shop with terabytes of data streaming in from every direction will find it a useful tool.

Just about every organisation is being challenged to manage structured data from their traditional enterprise systems, as well as unstructured data from numerous other sources. We've become comfortable with managing many terabytes of structured data - given enough time to process it. 

Big Data relates to comfortably storing and processing unstructured data in the order of multiple petabytes. Unstructured data might include, for example, the data generated by routers, firewalls, servers, business and consumer websites, shopping cart click streams, real-time sensor logs and high-volume financial transaction data. Unstructured data is usually messy and mostly real time, and it often needs to be analysed on the fly.

Big Data is commonly defined by the 'three V's': volume of data (size), velocity of data (speed), and variety of data (type). In simple terms, we're talking about large volumes of unstructured live data.

Managing the capture, collection, storage and analysis of this type of data presents a wide range of new and unique challenges that traditional data warehouse technologies were not designed to cope with.

What is Hadoop?

Hadoop is an open source platform you have probably heard of, but most likely don't know much about.

In essence, it's an open source data storage and compute platform that allows for the processing of large volumes of data at far greater speed and much lower cost.

Hadoop makes it possible to quickly, easily and cost effectively build very large scale distributed data storage and data processing solutions using low cost servers and low cost networking hardware. It scales on a near linear basis, and there really isn't a lot of rocket science required to make it work. 

You can string together cheap computers to build distributed storage and compute clusters capable of storing and processing very large-scale data sets.

You can start out small with a single system for development and testing. You could run Hadoop on your laptop, or on one or more virtual machines on your servers, or in the cloud for that matter.

The beauty of it is that from humble beginnings as small as a single node, Hadoop can scale to tens of thousands of systems, allowing you to keep up with almost any storage or data processing growth or capacity requirements as they arise at a price that won’t send you broke in the process.

Hadoop isn't only relevant to the web-scale data problems faced by the likes of Twitter, LinkedIn, Facebook, Google, Amazon, PayPal or eBay.

Thanks to its release as a freely available open source tool, that same technology is now used on a daily basis by organisations of all sizes for analytics challenges once considered too hard or too expensive to run on traditional enterprise platforms.

Why is Hadoop so different?

The Hadoop platform is unique in its approach to both large-scale data storage and distributed data processing. Hadoop performs its work on or near the nodes that actually store the data you want to process, as opposed to copying that data across the cluster to a single node and then processing it.

Let's look at each of these in detail:

Hadoop Distributed File System (Storage)

In a Hadoop cluster, the Hadoop Distributed File System (HDFS) allows you to ingest data "into" Hadoop and then work as if your data is simultaneously on all of the disks and on all of the servers in your cluster. Essentially you treat it as if it's a single massive hard drive.

HDFS takes the total data copied into a Hadoop cluster, breaks it up into smaller chunks, and stores it in a distributed fashion across multiple servers, usually with three or more copies for redundancy.

A file in HDFS usually consists of many blocks (large blocks of 64MB and above) distributed across the cluster. The main components of HDFS are the NameNode, Secondary NameNode, and DataNodes.

The NameNode is the master of the HDFS system, and maintains the name space, as in directories and files. The NameNode manages data blocks stored on DataNodes.

The Secondary NameNode is responsible for regular data checkpoints, so that in the event of a NameNode failure of any kind, you can restart the HDFS NameNode using the latest checkpoint.

The DataNodes are the HDFS slaves deployed on every machine in the Hadoop cluster. DataNodes provide the actual storage, and are responsible for serving read and write requests.
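
As a rough illustration of the division of labour described above, the following Python sketch models a NameNode that only keeps metadata, while the named DataNodes are where the blocks would live. It is a toy under stated assumptions, not HDFS code: the class `ToyNameNode`, the node names `dn1` to `dn4`, the file path and the round-robin placement are all hypothetical, and real HDFS placement is more sophisticated (it takes racks into account, for example).

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the smallest block size quoted above
REPLICATION = 3                # three copies of every block


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes (in bytes) of the blocks a file is cut into."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


class ToyNameNode:
    """Holds metadata only: the namespace and the location of every block."""

    def __init__(self, datanodes, replication=REPLICATION):
        if len(datanodes) < replication:
            raise ValueError("need at least as many DataNodes as replicas")
        self.datanodes = list(datanodes)
        self.replication = replication
        self.files = {}      # file path -> ordered list of block ids
        self.locations = {}  # block id -> DataNodes holding a copy
        self._cursor = 0     # round-robin position for the next block

    def add_file(self, path, file_size):
        """Record a new file and decide which DataNodes store each block."""
        block_ids = []
        for index, _size in enumerate(split_into_blocks(file_size)):
            block_id = f"{path}#block{index}"
            # Simplified round-robin placement on distinct DataNodes
            # (an assumption of this sketch, not the real HDFS policy).
            self.locations[block_id] = [
                self.datanodes[(self._cursor + k) % len(self.datanodes)]
                for k in range(self.replication)
            ]
            self._cursor += 1
            block_ids.append(block_id)
        self.files[path] = block_ids
        return block_ids


namenode = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
blocks = namenode.add_file("/logs/web.log", 200 * 1024 * 1024)  # a 200 MB file
for block_id in blocks:
    print(block_id, "->", namenode.locations[block_id])
```

A 200 MB file ends up as three full 64 MB blocks plus one 8 MB block, each recorded against three different DataNodes.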

In properly configured Hadoop clusters, if a server fails, copies of the "chunks" of data on that server are also stored on two or more other servers in the cluster. Hadoop automatically discovers when a server has failed, and immediately replicates copies of the data that was on the failed server across the cluster to ensure the minimum number of copies is maintained. Administrators can configure how many copies they want to store across the cluster.
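
The recovery behaviour described above can be sketched in a few lines of Python. This is a simplified model, not Hadoop's actual logic: the function `handle_datanode_failure`, the block and node names, and the "pick the least loaded node" rule are all assumptions made for illustration.

```python
def handle_datanode_failure(locations, live_nodes, failed, min_replicas=3):
    """Forget the failed node and re-replicate any block that is now short.

    locations: dict mapping block id -> list of DataNodes holding a copy
               (updated in place).
    Returns a list of (block_id, source_node, target_node) copy instructions.
    """
    survivors = [node for node in live_nodes if node != failed]
    copies = []
    for block_id, holders in locations.items():
        if failed in holders:
            holders.remove(failed)
        while len(holders) < min_replicas:
            candidates = [node for node in survivors if node not in holders]
            if not candidates or not holders:
                break  # nowhere left to copy to, or no surviving copy to read
            # Assumed policy: prefer the node currently holding the fewest blocks.
            target = min(
                candidates,
                key=lambda node: sum(node in h for h in locations.values()),
            )
            copies.append((block_id, holders[0], target))
            holders.append(target)
    return copies


locations = {
    "blk0": ["dn1", "dn2", "dn3"],
    "blk1": ["dn2", "dn3", "dn4"],
    "blk2": ["dn1", "dn3", "dn4"],
}
copies = handle_datanode_failure(
    locations, ["dn1", "dn2", "dn3", "dn4", "dn5"], failed="dn3"
)
for block_id, source, target in copies:
    print(f"copy {block_id} from {source} to {target}")
```

After the call, every block is back to three copies and none of them sits on the failed server.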

When a failed server is simply replaced with a "blank" new one, the Hadoop cluster rebalances the distribution of data and compute workload to bring the new server into the cluster.

MapReduce (processing)

Hadoop provides a new approach to distributed computing by implementing an idea called MapReduce, originally developed at Google.

MapReduce is essentially a programming model for processing massive data sets with a parallel distributed algorithm that allows fault tolerant clustering of commodity off the shelf (COTS) low cost computers. It allows for the splitting, processing and aggregation of large data sets.
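
To make the splitting, processing and aggregation steps concrete, here is a minimal word count written in plain Python. It runs in a single process and does not use Hadoop at all; the function names `map_phase`, `shuffle` and `reduce_phase` are chosen for this sketch only. On a real cluster the map calls would run in parallel on many machines and the framework would do the grouping.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all values by key, the step that sits between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reduce: aggregate every value emitted for one key."""
    return word, sum(counts)


def word_count(splits):
    """Run the three steps over a data set that is already cut into splits."""
    # Each split stands in for the input of one independent map task.
    mapped = [
        pair
        for split in splits
        for line in split
        for pair in map_phase(line)
    ]
    return dict(reduce_phase(word, counts) for word, counts in shuffle(mapped).items())


splits = [["big data big"], ["data lake"]]
print(word_count(splits))
```

Because each map call only looks at its own line and each reduce call only looks at one key, both steps can be spread across as many machines as are available.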

MapReduce is unique in that it runs your applications either on the actual nodes where copies of the relevant data are hosted, or on nodes as close to the data as it can - in the same rack, if possible. This dramatically reduces the amount of copying of data across the network.

The main components of MapReduce are the JobTracker, TaskTrackers, and the JobHistoryServer.

The JobTracker is the master of the Hadoop processing system, and manages jobs and resources in the Hadoop cluster. The JobTracker schedules jobs as close to the actual data being processed as possible, ideally on a TaskTracker which is running on the same DataNode (server) as the underlying data block.

The TaskTrackers are slaves deployed onto each machine in a Hadoop cluster, and are responsible for running the map and reduce tasks as instructed by the JobTracker.
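
The preference for running work next to the data can be sketched as a three-step choice: same node, then same rack, then anywhere. The Python below is an illustration under assumptions, not the JobTracker's real scheduler; the function `pick_tasktracker`, the node names and the rack map are hypothetical.

```python
def pick_tasktracker(block_replicas, free_trackers, rack_of):
    """Choose a free TaskTracker for a map task, preferring data locality.

    block_replicas: nodes holding a copy of the task's input block
    free_trackers:  nodes that currently have a free task slot
    rack_of:        dict mapping node name -> rack name
    Returns (node, locality), or (None, None) if no slot is free.
    """
    # 1. Best case: a free slot on a node that stores the block itself.
    for node in free_trackers:
        if node in block_replicas:
            return node, "node-local"
    # 2. Next best: a free slot in the same rack as one of the copies.
    replica_racks = {rack_of[node] for node in block_replicas}
    for node in free_trackers:
        if rack_of[node] in replica_racks:
            return node, "rack-local"
    # 3. Otherwise take any free slot and copy the block across the network.
    if free_trackers:
        return free_trackers[0], "off-rack"
    return None, None


rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2", "n5": "r3"}
print(pick_tasktracker(["n1", "n3"], ["n3", "n5"], rack_of))
print(pick_tasktracker(["n1"], ["n2", "n5"], rack_of))
print(pick_tasktracker(["n1"], ["n5"], rack_of))
```

Only the last case moves a block between racks, which is why keeping several copies of each block also gives the scheduler more chances to find a local slot.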

The JobHistoryServer serves historical information about completed applications. Typically the JobHistoryServer is deployed with the JobTracker, but ideally it is run as a separate process.

A new era of data analytics

Hadoop has changed the landscape of data analytics in ways many industries are still coming to grips with.

US President Barack Obama's 2012 presidential campaign used Hadoop to predict, in near real time, where it needed to send teams of volunteers out on the streets, knocking on doors, to focus its efforts to swing voters.

China Mobile uses Apache Hadoop - in combination with several supporting technologies - to serve billing inquiries for a staggering 650 million mobile phone customers.

A wide range of companies with asset management challenges (such as power utilities) are using Hadoop to ingest image and spatial data to map their power grids, poles, wires and surrounding vegetation.

Hadoop FAQs

  • Who developed Hadoop, and why?

Hadoop is credited to software engineer Doug Cutting. In 2002, Cutting was working on a project to build distributed crawler and search technologies capable of dealing with data sets at internet scale; he joined Yahoo! in 2006.

With the help of database guru Mike Cafarella, he combined the MapReduce data processing idea first conceived by engineers at Google with distributed file system components from an open source search engine project he had developed prior to joining Yahoo!, called Nutch. It was this work that formed the basis of the Hadoop framework.

He then gifted the technology to the world by distributing it under an open source license.

  • Where did the name Hadoop come from?

Hadoop is not an acronym. It is the name Cutting's two-year-old son gave his favourite possession, a yellow stuffed toy elephant. He pronounced it with an emphasis on the first syllable, as in HA-doop.

  • Who are the biggest users of Hadoop?

As early as 2006, Yahoo! was running a Hadoop cluster of around 3,000 nodes. By late 2008 that cluster was up to 30,000 nodes, and by 2011 it was cited as running 42,000 nodes holding between 180 and 200 petabytes of data.

It is estimated that the likes of Google, Facebook, Baidu and Yahoo! have clusters of up to a million servers each, the bulk of which run Hadoop in some form. Twitter, LinkedIn, eBay and others have similarly impressive numbers ranging from tens of thousands to hundreds of thousands.

Anil Madan from eBay’s Analytics Platform Development team went on record in mid-2012 reporting that the e-commerce platform's first large Hadoop cluster, a 500-node cluster called Athena, was built in just under three months.

Since as early as 2009, CERN has been using Hadoop clusters to process data sets of multiple petabytes, produced at ridiculous rates by its High Energy Physics (HEP) experiments. One CERN experiment was producing data at 320 terabytes per second from thousands of sensors.

Facebook is reportedly ingesting as much as eight terabytes of image data per second and successfully storing and processing it (real time facial recognition and tagging of your photo galleries) using Hadoop.

  • Why is Hadoop so hyped?

Hadoop is most revered for its ability to run on low-cost commodity hardware and to scale from terabytes to petabytes by simply bolting on more hardware. 

It can feasibly be run on a small cluster of three ARM-based Pi computers running Java, or as 2,000 instances on an IBM Mainframe running System Z.

Further, the entire Hadoop stack is written in Java (making it just about 100 percent cross-platform), and can be downloaded for free, or be further tweaked from the source code.

Those two attributes alone have contributed to its rapid growth.

  • How is Hadoop distributed and supported?

Like many large-scale open source projects, the Hadoop project and its various components were placed in the care and governance of the Apache Software Foundation. You can download the entire Hadoop platform from the Apache project website.

The Hadoop core continues to grow at a rapid pace under the guidance and direction of approximately 43 non-paid volunteers who form what's known as the Hadoop project management committee (PMC). Most of these individuals are active code contributors to Hadoop as well as a wide range of related projects such as Hive, Pig and YARN.

A good indicator of when an open source project has shifted from a quirky idea to a serious platform is when businesses are launched to sell and support it, and to supply consulting and professional services for it. A wide range of commercially supported distributions of Hadoop are now available. These distributions have added sophisticated system and cluster management tools or bundled Hadoop as part of fully integrated data analytics software suites.

Notable distributions of the Hadoop platform include Cloudera (which Hadoop creator Doug Cutting joined in 2009), MapR, and Hortonworks.

  • On-premise, or in the cloud?

Public cloud providers Rackspace, Amazon and Microsoft now offer support for Hadoop on their cloud platforms.

By late 2012, universities and niche HPC service providers like Cycle Computing started spinning up large-scale Hadoop clusters in third party public clouds, some as proof of concept trials, others as bona fide projects looking for short-term access to readily available resources on a grand scale.

The largest Hadoop cluster we've found operating in a public cloud is a 50,000 core cluster, spun up in 45 minutes, which costs around $4,850 per hour to run and can be shut down almost instantly.

This in itself is a game changer. It is akin, in some respects, to an on-demand Hadoop supercomputer with as many as fifty thousand cores for around five thousand dollars an hour. An organisation that requires rapid answers from 100 terabytes of data would have to consider this a very attractive prospect. That's about 10 cents per core per hour for one of the largest Hadoop clusters on the planet! A single Hadoop system admin with a laptop and cloud formation scripts could single-handedly deploy the largest Hadoop cluster on the planet in minutes, monitor it, run jobs, and then shut it down once the jobs are run, all without leaving their desk.
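
The per-core figure follows from dividing the hourly price by the number of cores. The short Python sketch below works it through using the two numbers quoted above; the three-hour job is a made-up example, and the function names are chosen for illustration only.

```python
CORES = 50_000          # cluster size quoted above
COST_PER_HOUR = 4_850   # dollars per hour, as quoted above


def cost_per_core_hour(cost_per_hour=COST_PER_HOUR, cores=CORES):
    """Hourly cost of a single core in the cluster."""
    return cost_per_hour / cores


def job_cost(hours, cost_per_hour=COST_PER_HOUR):
    """Total bill for keeping the whole cluster up for the given hours."""
    return hours * cost_per_hour


print(f"${cost_per_core_hour():.3f} per core per hour")
print(f"${job_cost(3):,} for a hypothetical three-hour job")
```

That works out to a little under 10 cents per core per hour, or $14,550 for the hypothetical three-hour run.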

  • Will Hadoop replace traditional databases or storage?

Hadoop is not a replacement for traditional databases or storage. Rather, it is a completely new approach to solving a subset of data storage and analytics problems. In the past, these specific challenges might have required High Availability (HA) servers clustered in pairs, running a Structured Query Language (SQL) based Relational Database Management System (RDBMS) on either a Storage Area Network (SAN) or Network Attached Storage (NAS).

Hadoop was created to solve a whole new type of problem, one that databases and storage at the time could not easily solve, either due to scale (= cost), or the unstructured nature of the data being stored and processed.

Databases are good at managing structured data. The data produced by a Human Resources (HR) or an Enterprise Resource Management (ERM) system is well structured and tightly controlled. When you enter employee details into an HR system, you have specific fields you can enter for things like First Name, Last Name, and Age. Age, for example, would be stored as an integer as opposed to a floating-point number: we generally don't care if someone is 45.6 years old, 45 will do.

Hadoop was designed to deal with the exact opposite: unstructured data.

You could store your HR system’s data in Hadoop if you wanted to, but in reality Hadoop is better suited to the streams of data logged by your routers, switches and firewalls, page view logs from websites or click stream data from eCommerce shopping carts. It's for sensor network data, spatial data, social data.
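
The contrast between the two kinds of data can be shown in a few lines of Python. The `Employee` record mirrors the HR fields mentioned earlier, while the log lines are invented samples (using documentation-only IP addresses) of the sort of output routers, firewalls and sensors might produce. Everything here is a hypothetical sketch rather than a real log format.

```python
import re
from dataclasses import dataclass


@dataclass
class Employee:
    """Structured data: fixed fields with fixed types, like a database row."""
    first_name: str
    last_name: str
    age: int  # a whole number, not 45.6


# Unstructured data: every source writes whatever shape it likes.
# These sample lines are invented for illustration.
LOG_LINES = [
    '203.0.113.9 - - [12/Mar/2013:10:01:22 +1100] "GET /cart/add?id=42 HTTP/1.1" 200',
    "firewall: DROP src=198.51.100.7 dst=10.0.0.5 proto=TCP dport=23",
    "sensor-17 temp=21.4C humidity=40%",
]

# A loose pattern for dotted IPv4-style addresses (it does not validate ranges).
IP_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def extract_ips(line):
    """Pull out anything that looks like an IP address, wherever it appears."""
    return IP_PATTERN.findall(line)


staff = Employee(first_name="Ada", last_name="Nguyen", age=45)
print(staff.age + 1)  # structured: the field can be used directly

for line in LOG_LINES:
    print(extract_ips(line))  # unstructured: the meaning has to be dug out
```

With the structured record, the schema is known before any data arrives. With the log lines, the structure is only imposed at the moment the data is read, which is the style of processing Hadoop was built for.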

  • What next for Hadoop?

The major challenge for Hadoop adoption in Australia is the sourcing of skilled staff that have some experience deploying Hadoop or developing on the platform.

Working from scratch, your team is going to require Linux OS skills and programming skills.

While work is afoot to port Hadoop to Windows, Hadoop has usually been deployed on Linux. MapReduce applications and tools have until recently been developed in Java, but gradually we are starting to see Python and Ruby alternatives appear.

We should expect these challenges to be resolved over time, as more integrators and OEMs rush to expand their capabilities to include Hadoop.

Further, many of the latest distributions of Hadoop have been designed to install and deploy without expert Linux or Java skills, much in the same way modern Linux distributions have matured.

Savvy database players have also jumped on the Hadoop bandwagon and have integrated their existing ANSI SQL-92 database engines with Hadoop and MapReduce APIs, or have integrated related projects such as Hive or Pig into their platforms.

The likes of Intel have developed their own distributions, HP has developed a reference architecture for use of its servers in a Hadoop cluster, and storage vendors are producing database engines that talk natively to the Hadoop MapReduce API.

The big end of town is taking Hadoop seriously.


Copyright © iTnews.com.au . All rights reserved.
