Hadoop and Big Data: 101

Hadoop FAQs

  • Who developed Hadoop, and why?

Hadoop is credited to software engineer Doug Cutting. In 2002, before joining Yahoo!, Cutting was working on Nutch, an open source project to build distributed crawler and search technology capable of dealing with data sets at internet scale.

With the help of database guru Mike Cafarella, he combined the MapReduce data processing idea first conceived by engineers at Google with the distributed file system components of Nutch. It was this work, which Cutting carried with him to Yahoo!, that formed the basis of the Hadoop framework.

He then gifted the technology to the world by distributing it under an open source licence.

  • Where did the name Hadoop come from?

Hadoop is not an acronym; it is the name Cutting's two-year-old son gave his favourite possession, a yellow stuffed toy elephant. He pronounced it with the emphasis on the first syllable, as in HA-doop.

  • Who are the biggest users of Hadoop?

As early as 2006, Yahoo! was running a Hadoop cluster of around 3,000 nodes. By late 2008 that cluster had grown to 30,000 nodes, and by 2011 it was cited as running 42,000 nodes holding between 180 and 200 petabytes of data.

It is estimated that the likes of Google, Facebook, Baidu and Yahoo! have clusters of up to a million servers each, the bulk of which run Hadoop in some form. Twitter, LinkedIn, eBay and others have similarly impressive numbers ranging from tens of thousands to hundreds of thousands.

Anil Madan from eBay's Analytics Platform Development team went on record in mid-2012, reporting that the e-commerce platform's first large Hadoop cluster, a 500-node machine called Athena, was built in just under three months.

As early as 2009, CERN was processing multi-petabyte data sets on Hadoop clusters, data produced at ridiculous rates by its High Energy Physics (HEP) experiments. One CERN experiment was producing data at 320 terabytes per second from thousands of sensors.

Facebook is reportedly ingesting as much as eight terabytes of image data per second and successfully storing and processing it with Hadoop, powering real-time facial recognition and tagging of your photo galleries.

  • Why is Hadoop so hyped?

Hadoop is most revered for its ability to run on low-cost commodity hardware and to scale from terabytes to petabytes by simply bolting on more hardware. 

It can feasibly be run on a small cluster of three ARM-based Raspberry Pi computers running Java, or as 2,000 instances on an IBM System z mainframe.

Further, the entire Hadoop stack is written in Java (making it just about 100 percent cross-platform), and it can be downloaded for free or further tweaked from the source code.

Those two attributes alone have contributed to its rapid growth.

  • How is Hadoop distributed and supported?

Like many large-scale open source projects, the Hadoop project and its various components were placed in the care and governance of the Apache Software Foundation. You can download the entire Hadoop platform from the Apache project website.

The Hadoop core continues to grow at a rapid pace under the guidance and direction of some 43 unpaid volunteers who form what's known as the Hadoop project management committee (PMC). Most of these individuals are active code contributors to Hadoop as well as to a wide range of related projects such as Hive, Pig and YARN.

A good indicator of when an open source project has shifted from a quirky idea to a serious platform is when businesses are launched to sell and support it, and to supply consulting and professional services for it. A wide range of commercially supported distributions of Hadoop are now available. These distributions have added sophisticated system and cluster management tools or bundled Hadoop as part of fully integrated data analytics software suites.

Notable distributions of the Hadoop platform include Cloudera (which employs Hadoop creator Doug Cutting as its chief architect), MapR and Hortonworks.

  • On-premise, or in the cloud?

Public cloud providers Rackspace, Amazon and Microsoft now offer support for Hadoop on their cloud platforms.

By late 2012, universities and niche HPC service providers like Cycle Computing started spinning up large-scale Hadoop clusters in third party public clouds, some as proof of concept trials, others as bona fide projects looking for short-term access to readily available resources on a grand scale.

The largest Hadoop cluster we've found operating in a public cloud is a 50,000-core cluster, spun up in 45 minutes, which costs around $4,850 per hour to run and can be shut down almost instantly.

This in itself is a game changer. It is akin, in some respects, to an on-demand Hadoop supercomputer with as many as fifty thousand cores for around five thousand dollars an hour. An organisation that requires rapid answers from 100 terabytes of data would have to consider this a very attractive prospect: $4,850 an hour spread across 50,000 cores works out to roughly 10 cents per core per hour for one of the largest Hadoop clusters on the planet. A single Hadoop system admin with a laptop and a set of cloud formation scripts could single-handedly deploy that cluster in minutes, monitor it, run jobs, and then shut it down once the jobs have finished, all without leaving their desk.

  • Will Hadoop replace traditional databases or storage?

Hadoop is not a replacement for traditional databases or storage. Rather, it is a completely new approach to solving a subset of data storage and analytics problems. In the past, these specific challenges might have required High Availability (HA) servers clustered in pairs, running a Structured Query Language (SQL) based Relational Database Management System (RDBMS) on either a Storage Area Network (SAN) or Network Attached Storage (NAS).

Hadoop was created to solve a whole new type of problem, one that databases and storage at the time could not easily solve, either because of scale (and therefore cost) or because of the unstructured nature of the data being stored and processed.

Databases are good at managing structured data. The data produced by a Human Resources (HR) or an Enterprise Resource Planning (ERP) system is well structured and tightly controlled. When you enter employee details into an HR system, you have specific fields for things like First Name, Last Name and Age. Age, for example, would be stored as an integer as opposed to a floating-point number, as we generally don't care if someone is 45.6 years old; 45 will do.

Hadoop was designed to deal with data that is the exact opposite: unstructured data.

You could store your HR system's data in Hadoop if you wanted to, but in reality Hadoop is better suited to the streams of data logged by your routers, switches and firewalls, page view logs from websites, or clickstream data from e-commerce shopping carts. It's for sensor network data, spatial data and social data.
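
To make the contrast concrete, here is a minimal sketch in Java (the Employee class, its fields and the sample log line are purely illustrative, not drawn from any particular product): structured data fits a fixed schema of typed fields that a database can enforce, while the data Hadoop targets arrives as free-form text that only gains structure when a job parses it.

    // Structured data: a fixed schema with typed fields, as an RDBMS would store it.
    public class Employee {
        private final String firstName;
        private final String lastName;
        private final int age;  // whole years are enough; nobody needs to record 45.6

        public Employee(String firstName, String lastName, int age) {
            this.firstName = firstName;
            this.lastName = lastName;
            this.age = age;
        }
    }

    // Unstructured (or at best semi-structured) data: raw lines of text, such as a web
    // server or firewall log entry, which carry no schema until a Hadoop job parses them.
    // Example line (illustrative only):
    // 203.0.113.7 - - [12/Mar/2013:10:41:07 +1100] "GET /cart/add?sku=1234 HTTP/1.1" 200 512

The point is not that Hadoop cannot hold the first kind of data, but that the second kind has no schema for a traditional database to enforce in the first place.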

  • What next for Hadoop?

The major challenge for Hadoop adoption in Australia is sourcing skilled staff who have some experience deploying Hadoop or developing on the platform.

Working from scratch, your team is going to need Linux operating system skills as well as programming skills.

While work is afoot to port Hadoop to Windows, Hadoop has usually been deployed on Linux. MapReduce applications and tools have until recently been developed almost exclusively in Java, though Python and Ruby alternatives are gradually starting to appear.
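
To give a sense of what that Java development involves, below is a minimal sketch modelled on the canonical word-count example from the Apache Hadoop documentation; the class names and the input and output paths are illustrative, and a production job would add error handling around them.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Counts how often each word appears across every file in an input directory.
    public class WordCount {

        // The map phase: runs in parallel across the cluster, one split of the input per task.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                // Emit (word, 1) for every token in the line.
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // The reduce phase: receives all the counts for a given word and sums them.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        // The driver: configures the job and submits it to the cluster.
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged into a jar and submitted with the hadoop jar command, a job like this is split by the framework into map and reduce tasks that run in parallel wherever the input data sits in the cluster. It is exactly this kind of skill set that is currently in short supply.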

We should expect these challenges to be resolved over time, as more integrators and OEMs rush to expand their capabilities to include Hadoop.

Further, many of the latest distributions of Hadoop have been designed to install and deploy without expert Linux or Java skills, much in the same way modern Linux distributions have matured.

Savvy database players have also jumped on the Hadoop bandwagon, either integrating their existing ANSI SQL-92 database engines with Hadoop and the MapReduce APIs, or integrating related projects such as Hive or Pig into their platforms.

The likes of Intel have developed their own distributions, HP has developed a reference architecture for use of its servers in a Hadoop cluster, and storage vendors are producing database engines that talk natively to the Hadoop MapReduce API.

The big end of town is taking Hadoop seriously.

 

