For years, Microsoft and other companies badly burnt by security incidents have said that if you let someone else run their code on your server, it’s no longer your server.
How about if you run someone else’s code in your data centre, without checking what it does or where it came from?
Crazy perhaps, but that appears to be the norm these days for big data and other cloud projects.
Mathematician and CompSci PhD student Erich Schubert’s blog on setting up Hadoop is a good read on just how bad the situation is.
Hadoop is written in Java, which by itself should make anyone who’s been in IT the last few years wary from a security point of view.
Schubert checked out a typical Hadoop installation and was appalled at the lack of security. He criticised it as being “an incredible mess of dependencies, version requirements and build tools”, meaning it’s nigh impossible to package up properly.
While Hadoop is very interesting technology, at the moment it requires users to turn a blind eye to security and download binaries from all over the internet - binaries that aren't signed or authenticated, and whose behaviour and provenance you most likely cannot verify.
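To see what’s missing, here’s a minimal sketch of the kind of check that ought to happen before any downloaded binary reaches a server: verifying it against a published SHA-256 checksum. The file names below are placeholders, not real Hadoop artifacts - in practice the tarball and checksum file would both come from the vendor.

```shell
#!/bin/sh
# Hypothetical example: verify a downloaded tarball against a published
# SHA-256 checksum before unpacking it. File names are placeholders.

# Stand-in for the downloaded artifact (in reality: curl -O https://...).
printf 'pretend this is a release tarball\n' > release.tar.gz

# Stand-in for the checksum file the vendor would publish alongside it.
sha256sum release.tar.gz > release.tar.gz.sha256

# Refuse to proceed unless the checksum matches.
if sha256sum -c release.tar.gz.sha256; then
    echo "checksum OK - safe to unpack"
else
    echo "checksum MISMATCH - do not install" >&2
    exit 1
fi
```

A properly signed release would go one step further: verifying a detached signature with `gpg --verify` against the project’s release key would also establish who produced the binary, which a bare checksum cannot.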
Wham, into the data centre they go.
It’s hard to disagree with Schubert’s definition of “stack” as meaning “I have no idea what I’m actually using”.
The malaise of people downloading untrusted binaries extends beyond Hadoop, of course.
As Schubert notes, it makes life easy for the bad people who don’t even have to figure out a clever exploit. All they need to do is create images of virtual machines or Docker containers, which trusting users will download and run.
Thinking it surely couldn't be that bad, I checked on Hadoop specialist HortonWorks’ website. The automated “recommended way to install HortonWorks Data Platform [HDP] version 2.2 for a production environment” lists the following Linux distributions as supported:
- Red Hat Enterprise Linux (RHEL) v6.x
- Red Hat Enterprise Linux (RHEL) v5.x (deprecated)
- CentOS v6.x
- CentOS v5.x (deprecated)
- Oracle Linux v6.x
- Oracle Linux v5.x (deprecated)
- SUSE Linux Enterprise Server (SLES) v11, SP1 and SP3
- Ubuntu Precise v12.04
All of the above are old releases that have since been superseded by updated versions which, obviously, contain security fixes.
These are the Java runtime environments that are supported by HDP 2.2:
- Oracle JDK 1.7_67 64-bit (default)
- Oracle JDK 1.6_31 64-bit (deprecated)
- OpenJDK 7 64-bit (not supported on SLES)
The default JDK dates from August last year and, yes, it contains several known vulnerabilities. It's not what you should be installing on servers.
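A deployment script can at least refuse to proceed on an outdated runtime. A minimal sketch, comparing the JDK’s update number against a chosen baseline - the version strings here are illustrative, not a statement of which update fixed what:

```shell
#!/bin/sh
# Hypothetical check: compare the installed JDK's update number against a
# chosen minimum before deploying. Version strings are illustrative only.
CURRENT="1.7.0_67"   # in reality: parsed from `java -version 2>&1`
MINIMUM="1.7.0_91"   # assumed patched baseline, picked for illustration

# Extract the update number after the underscore (67 and 91 here).
cur_update=${CURRENT#*_}
min_update=${MINIMUM#*_}

if [ "$cur_update" -lt "$min_update" ]; then
    echo "JDK $CURRENT is older than $MINIMUM - update before deploying"
fi
```

A real check would also compare the major version and vendor, but even this crude gate is more than the install path described above performs.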
How did it get to be this way? We should know by now that if there are security holes in networked systems, they will be exploited.
Ever-increasing pressure to reduce time to market, coupled with perpetual betas shipped in the hope that someone will sort out the flaws at some point, has conspired to create an environment that’s basically wide open to security breaches.
Furthermore, because the code base is so convoluted, open though it is, people wouldn’t know how to fix the mess even if they had the time to do it. The requirement to use specific, old Linux distributions points to some serious hackiness under the hood, unfortunately.
What’s the long and short of it? As Schubert puts it: “as is, I cannot recommend to trust your business data in Hadoop.”
It may sound harsh, but really, don’t run Hadoop (or other big data products) until you know the code and can audit and update it.
Think about that before letting the insecure Hadoop elephant waltz around in your data centre.