Apache Hadoop is an open source technology that offers a radically cheaper way to process and store large amounts of unstructured data.
But for all of its potential benefits, Hadoop can be an uncomfortable fit within many IT environments. It challenges traditional approaches to data warehousing architecture and to the way in which IT projects are funded, and in some cases it can even threaten jobs.
On that basis, selling a Hadoop-based analytics project into a business is not as straightforward as a back-of-the-envelope cost calculation might suggest.
A different approach
Hadoop features characteristics that aren't shared with most data processing solutions IT departments have supported to date:
- the core software is 100 percent free and available for just about any form of use,
- the core software is entirely open source and can be downloaded directly off the internet,
- Hadoop can perform extremely well using low-cost network equipment and does not require expensive high-end server hardware, a high-performance SAN or NAS, or tape or disk-based archival, and
- a properly configured Hadoop cluster can provide data protection without requiring additional tools.
The fundamental selling point of the technology is that it doesn't require a large capital investment to get started.
Equally, however, some of these positive attributes have presented challenges for individuals pursuing big data projects inside large organisations.
Some CIOs are naturally cautious about adopting open source technologies unless they are confident that adequate support structures are on the ground in Australia.
They are also likely to be cautious about any claim that Hadoop reduces dependence on RAID designs that have been a staple of data protection in the enterprise for decades.
And finally, Hadoop is radically different enough that some storage, server and data warehousing engineers will see a potential for the technology to make their (vendor-specific) skills obsolete. As with any technology project, change management is crucial. We'll explore each of these barriers on page two.
Tip #1: Focus on business benefits
There is ample data available online to help sell the benefits of adopting new approaches to data management and analytics. A very healthy industry is being formed around big data and in particular Hadoop. You would be wise to make use of the massive body of work already published.
It's relatively easy to win an argument on the technical merits of the Hadoop architecture. Typically, the earliest business cases have been formed around:
- reducing data warehousing costs,
- faster time to market for decisions,
- validation for retaining and leveraging historical data, and
- leveraging 'sunk costs' of older hardware that would otherwise have been retired.
Ideally, however, these are supporting arguments for a project that needs to be more compelling.
The best business case for Hadoop is written by an author with a question in mind for which the business urgently requires an answer. And that answer should have the power to inform or validate a new approach to your organisation's way of doing business - something as critical as marketing or distribution strategies.
Tip #2: Start small
Even when you have a big question in mind to answer, Apache Hadoop offers a great opportunity to 'start small' and first win over your peers with a proof-of-concept which offers fast results at low cost.
Large organisations adopting Hadoop tend to have delivered proofs of concept using little more than the freely available software running on commodity servers, before investing any dollars in supported distributions, dashboards and management tools on top of the base platform.
You may choose to first introduce Hadoop skills into the IT department for low-risk use cases such as:
- new dev and test platforms,
- reducing the cost of retaining machine data and log files, and
- a temporary data store for Extract, Transform & Load (ETL) requirements.
IT staff may relish the challenge of building a small Hadoop cluster using six to eight decommissioned desktop PC systems, on an old blade chassis with low power blades or on rack servers due for retirement.
We've also seen great success within large financial organisations hosting 'hack days', where new technologies are put in front of a mix of IT staff, developers and business analysts, with an opportunity to learn about them, play with them, and dream up new possibilities for the business as a team.
We've also seen business cases approved without the use of any on-premise hardware - by pushing non-critical data onto Hadoop instances hosted on third party cloud platforms. The ability to spin up (and spin down) a virtual machine in minutes, without seeking external funding beyond your corporate expense account, is incredibly powerful.
You will, however, need to be cognisant of the fears many in your organisation would hold about pushing any corporate data onto a third party cloud. You will either need to go through the necessary approvals processes (which presents the unfortunate potential for your project to be scrapped before it begins) or make a very calculated gamble based on measurable results from non-critical data.
In any case, it is advisable to keep pilots as simple as possible - senior IT execs are more likely to be won over by the speed of a result than by the magnitude of the problem you first attempt to solve.
Some of the most successful deployments of Hadoop have started from development or proof of concept clusters of as little as three or four desktop-grade personal computers, a small desktop-grade network switch, and an interesting data analytics challenge to solve.
Yahoo!, for example, started using Hadoop with just four servers in a development project. As of March 2013 they were reported to have over 42,000 servers storing and performing analysis on hundreds of petabytes of data.
Tip #3: Communicate
When you take concepts around Hadoop (and big data generally) beyond your IT team, it's important to start with a clear common vocabulary for everyone to use when discussing your business case.
You could consider writing articles for your in-house newsletter that provide high-level introductions to key topics and examples of success stories in your industry. Few HR departments burdened with the challenge of publishing an in-house newsletter each month will say no to unsolicited articles from the IT department, provided you can make the content relevant and easy to consume. (We tend to find iTnews.com.au popping up on a lot of intranet sites for this very reason.)
You might consider running some educational workshops on data analytics prior to presenting your business case or proposal, being sure to engage with those stakeholders likely to be a hard sell further down the track.
Lunchtime 'brown bag' workshops with an open door 'walk in' format can bring about good results - just make sure you have a few backers in the room with you.
It is important to get the tone right - these sessions are about education. Nobody likes to hear a colleague come off sounding like a salesman. But most of us are open to being offered the opportunity to learn about something new if it is presented in an engaging and informative way.
Your sessions might cover topics like:
- What is 'big data' and why is it important?
- Case studies on competitors and peers sharing results with this technology
- What is open source and how is it licensed, distributed and supported?
Make sure you explain the topics in plain language anyone can understand. Avoid too much jargon or three letter acronyms. If you can explain the opportunity without having to wade knee deep in geek speak, you will find a lot less resistance.
Consider using analogies that make it easy to grasp potentially complex concepts. MapReduce is complex, and the technical overhead of breaking up and processing multiple 'chunks' of a data set isn't easy to describe. How would you explain it in a language the board can understand? How about this example:
Imagine trying to complete a head count of how many children are playing soccer on a large community sports field at 9am on a Saturday.
You could send one person out to count every single child on the field one at a time. That may take a long time, and would most likely produce errors as children move around the field through the course of the morning.
But suppose you broke the problem up into lots of little bits by dispatching all of the team captains in parallel to count how many players they have in their team on the field at exactly 9am. Each captain reports back with a single number (a sub-total) for their team. You then sum those team counts to arrive at a final (grand total) number of children on the field at that time. It would take a lot less time and provide a more accurate result for less effort. This is what MapReduce does with data.
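For the technically minded in the room, the same head count can be sketched in a few lines of Python. This is only an illustration of the analogy, not Hadoop's actual API: the team names, the sample data and the function names (split_by_team, captain_count, head_count) are all made up for the example, and a thread pool stands in for the nodes of a cluster.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sample data: one (team, child) record per child on the field at 9am.
FIELD = [
    ("Red", "Ava"), ("Red", "Ben"), ("Red", "Cal"),
    ("Blue", "Dee"), ("Blue", "Eli"),
    ("Green", "Fay"), ("Green", "Gus"), ("Green", "Hal"), ("Green", "Ivy"),
]


def split_by_team(records):
    """Break the data set into 'chunks': one list of players per team."""
    chunks = {}
    for team, child in records:
        chunks.setdefault(team, []).append(child)
    return chunks


def captain_count(players):
    """Map step: one captain counts one team and reports a sub-total."""
    return len(players)


def head_count(records):
    """Return (sub-totals per team, grand total) for the whole field."""
    chunks = split_by_team(records)
    # Captains count in parallel; pool.map returns results in input order,
    # so each sub-total lines up with its team name.
    with ThreadPoolExecutor() as pool:
        counts = pool.map(captain_count, chunks.values())
        sub_totals = dict(zip(chunks, counts))
    # Reduce step: sum the sub-totals into a grand total.
    return sub_totals, sum(sub_totals.values())


sub_totals, grand_total = head_count(FIELD)
print(sub_totals)   # one sub-total per team
print(grand_total)  # total children on the field
```

Each captain only ever sees their own team, which is why the counting can happen at the same time; the only step that needs every result in one place is the final sum.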