Many deeply entrenched ideas about how to address traditional enterprise data storage and analysis needs are directly challenged by Hadoop.
The first entrenched attitude you will need to dispel is that proposing open source is somehow a career-limiting move.
Open source and software support
Open source software is usually developed by a community of like-minded individuals and distributed freely online under well-known licences (Apache, GNU etc). The term open source refers to the fact that the source code can be downloaded and viewed or modified to suit any requirement.
The obvious benefits are avoiding upfront purchase and ongoing licensing costs, avoiding vendor lock-in, and often (arguably) higher quality code due to the efforts of multiple motivated contributors.
But some CIOs still perceive a range of technical and business risks associated with deploying open source in enterprise environments. They fear a lack of support or a 'throat to choke' when there are problems.
They fear that the software has been developed for back-room geeks rather than business users. They fear it is advocated for political reasons over practical ones. They fear its unpredictability, its potential to splinter.
IT professionals faced this same challenge when Linux made the transition from a pet project for computer geeks to a safe, secure, stable platform that powers the engine houses of the biggest names on the internet.
Linux initially seemed so radically different and it took years for some organisations to develop skills around it. Today many businesses or key technology platforms simply couldn’t live without it.
There is arguably enough movement around Hadoop to suggest that - like Linux - it will be supported as a vital tool in the enterprise IT armoury well into the future.
According to IDC, the Hadoop software market will be worth at least US$813 million (A$860 million) by 2016. IDC also predicts that the big data market will reach US$23 billion by 2016. While big data is unquestionably over-hyped, the dollars flowing into this sector should just about guarantee support well into the future.
CIOs should also take comfort from the distribution model for Hadoop - which again mirrors Linux.
Hadoop is available as a free download directly from the Apache Hadoop website. The Hadoop core remains under the open source Apache software licence; development of the core happens in the open, and contributions accepted back into the Apache project are likewise released under that licence.
Hadoop is also available via value-add suppliers that distribute Hadoop bundled with proprietary management, monitoring, reporting or security tools, or as part of a larger analytics suite. Some of these companies are new to Australia (MapR, Hortonworks, Cloudera), but they have formed partnerships with many of the companies your organisation already contracts with (HP, Intel, EMC etc). The ecosystem is growing.
For support, IT organisations can again choose between a massive global community surrounding the open source core (online forums, wikis, email mailing lists, blogs and FAQ websites), or seek support from distributions, systems integrators and consulting professionals that specialise in the technology.
Skills might be scarce on the ground in Australia today, but more options are becoming available on a weekly basis.
Your organisation will most likely have an existing data warehouse platform and employ staff to directly support that platform. There is a tendency, whenever disruptive technology is introduced, for some of these individuals to search for reasons why Hadoop doesn't tick all the boxes.
You should argue that traditional architecture approaches to data warehousing do not need to compete with or conflict with your proposal to bring Hadoop into your organisation. Hadoop, as described in our FAQ section earlier this week, aims to derive results from large volumes of unstructured data and can happily live as a complementary solution to your existing data warehouse.
If you engage your existing database and analytics staff and share your vision with them from the outset, they are far less likely to act against your interests. Reach out early, solicit their input and map out any issues, risks or concerns they might share.
Risks to data integrity
You may also come up against concerns that Hadoop plays fast and loose with data integrity, at least when compared with the RAID architectures used in most existing systems.
RAID refers to a redundant array of independent (or inexpensive) disks, used to provide data protection and disk redundancy in case of hard drive failure. Enterprise storage platforms such as network attached storage (NAS) and storage area networks (SAN) usually employ some form of RAID configuration.
The file system is a logical overlay spread across multiple physical hard drives, and data is written across multiple disks to provide disk redundancy. If one disk fails, copies of the data are available on one or more other disks.
RAID doesn't tend to be applied within the Hadoop architecture. But data can be just as well protected, if not better; it is simply a different approach.
Like any storage platform in traditional enterprise platforms, the Hadoop distributed file system (HDFS) storage platform requires careful planning to ensure data availability and integrity.
Files copied into a HDFS cluster are immediately broken up into chunks and distributed across multiple nodes in the cluster. As files are loaded into HDFS, Hadoop distributes these chunks to multiple nodes based on two key configuration settings: chunk size, in bytes (dfs.block.size), and replication factor (dfs.replication).
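Both settings are set cluster-wide in the hdfs-site.xml configuration file. As an illustrative sketch only (the values below are examples, not recommendations, and the dfs.block.size property name shown here was renamed dfs.blocksize in later Hadoop releases):

```xml
<!-- hdfs-site.xml: illustrative example values only -->
<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value> <!-- chunk size in bytes: 128MB -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value> <!-- number of copies of each chunk kept across the cluster -->
  </property>
</configuration>
```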
Chunk size is usually one of 64MB, 128MB, 256MB or 512MB. You would usually need to run a series of design and test iterations to arrive at the ideal size, based on the volume of input data to be processed and the compute power of each node.
The replication factor sets the number of copies of each chunk to be spread across the cluster. By default, the replication factor is set to three - so each chunk will be available on three independent nodes across the cluster.
This replication is used to improve data redundancy and also distribute compute workloads into more manageable pieces.
So rather than writing data across multiple disks in a single server at the file system level, the way RAID does, a HDFS cluster treats its disks as "just a bunch of disks" (JBOD) and writes data across multiple nodes in the cluster. If one disk fails in a node, the data is available from one or more other nodes across the cluster.
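The chunking and replication arithmetic described above can be sketched in a few lines. This is a back-of-envelope estimator only, assuming the 128MB default chunk size and replication factor of three mentioned earlier; it is not part of any Hadoop API.

```python
import math

def hdfs_chunk_layout(file_size_bytes, block_size=128 * 1024 * 1024, replication=3):
    """Estimate how HDFS splits a file into chunks and what it costs in raw storage.

    Returns (number of chunks, total replicated chunks across the cluster,
    approximate raw storage consumed in bytes).
    """
    num_chunks = math.ceil(file_size_bytes / block_size)
    total_replicas = num_chunks * replication
    raw_storage = file_size_bytes * replication  # the final chunk may be partial
    return num_chunks, total_replicas, raw_storage

# A 1GB file with 128MB chunks and the default replication factor of three
# is split into 8 chunks, held as 24 replicated chunks cluster-wide,
# and consumes roughly 3GB of raw disk across the nodes.
chunks, replicas, raw = hdfs_chunk_layout(1024 * 1024 * 1024)
```

The tripled raw storage cost is the price of the redundancy: losing any one disk, or even a whole node, leaves at least two copies of every chunk elsewhere in the cluster.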
Have you deployed Hadoop in your organisation? What tips would you offer your peers on putting a business case forward?