VMware has established a crack team of engineers to virtualise applications usually considered too technically challenging to run in a virtual machine, including big data, high-frequency trading, and high-performance computing platforms.
Successful big data deployments at Google, Yahoo, Facebook and Target have prompted organisations such as banks to attempt to apply the technology to their businesses.
But most organisations experimenting with Hadoop do so in small iterations, often on standalone, physical servers and storage – undermining the use of VMware’s vSphere as an enterprise standard operating environment.
At a technical level, big data applications are I/O and compute intensive, and are designed to take advantage of a very different storage architecture to that of most corporate apps.
Hadoop users have also opted for cheaper storage than the VMware-supported SAN and NAS arrays, using clustering and replication software to create SAN-like redundancy from cheap disks ($0.10 per GB versus up to $5 per GB on a SAN).
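The economics are worth spelling out. The sketch below uses the per-GB figures quoted above, plus Hadoop's default HDFS replication factor of three (an assumption not stated in the article): even after tripling the raw disk required, cheap direct-attached storage comes out well ahead of a single-copy SAN.

```python
# Illustrative cost comparison using the per-GB figures quoted above.
# Hadoop's default replication factor of 3 means direct-attached disks
# hold three copies of the data; SAN redundancy is assumed to be built
# into its (higher) per-GB price.

DAS_PER_GB = 0.10   # cheap local disk, $/GB
SAN_PER_GB = 5.00   # upper bound quoted for SAN, $/GB
REPLICATION = 3     # HDFS default copy count

def storage_cost(dataset_gb, per_gb, copies=1):
    """Total disk spend for a dataset at a given $/GB and copy count."""
    return dataset_gb * per_gb * copies

dataset_gb = 100 * 1024  # a 100 TB dataset
das = storage_cost(dataset_gb, DAS_PER_GB, REPLICATION)
san = storage_cost(dataset_gb, SAN_PER_GB)
print(f"DAS with 3x replication: ${das:,.0f}")  # $30,720
print(f"SAN, single copy:        ${san:,.0f}")  # $512,000
```

Even carrying three copies of every block, the cheap-disk approach is roughly 17 times less expensive on these figures.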
“We asked ourselves the question: is virtualisation ready for I/O and CPU intensive applications?” VMware's Australian-born CTO of application infrastructure Richard McDougall told iTnews at the VMworld conference last week.
“We sat down with our partners for some co-engineering tests. The good news is we have hit some big milestones – we’re now able to prove that when systems are set up properly, there is no performance overhead to running virtualisation underneath Hadoop.
“That will be a big eye opener for a lot of people.”
McDougall took to the stage at VMworld to demonstrate ‘Project Serengeti’, an open source effort aiming to run Apache Hadoop on virtual machines.
Using the high availability features of vSphere, McDougall demonstrated how a single system could be multi-purposed to focus nearly all of its processing power on production web servers during business hours, with only one or two nodes running a Hadoop instance collecting click stream data in the background.
Trying to analyse that data in real-time could have a negative I/O impact on live transactions. But McDougall demonstrated how the system could be set up to share resources between the web server and Hadoop application in a flexible fashion.
As traffic to the web server dropped away into the evening, the system was set up to gradually dedicate more of the available nodes to the Hadoop cluster to run analytics against the stored click-stream data.
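The day/night trade-off McDougall demonstrated can be sketched as a simple allocation policy. This is a hypothetical illustration, not Serengeti or vSphere code; the node counts and the `allocate` function are invented for the example.

```python
# Hypothetical sketch of the demo's scheduling policy: a fixed pool of
# nodes is split between web serving and Hadoop analytics according to
# current web traffic, always keeping a couple of nodes ingesting
# click-stream data in the background.

TOTAL_NODES = 16
MIN_HADOOP_NODES = 2   # background click-stream collectors

def allocate(web_load: float) -> tuple[int, int]:
    """Split the pool by web load (0.0 = idle overnight, 1.0 = peak).

    Returns (web_nodes, hadoop_nodes)."""
    web_nodes = round(web_load * (TOTAL_NODES - MIN_HADOOP_NODES))
    return web_nodes, TOTAL_NODES - web_nodes

for hour, load in [(9, 0.9), (17, 0.5), (23, 0.1)]:
    web, hadoop = allocate(load)
    print(f"{hour:02d}:00  web nodes={web:2d}  hadoop nodes={hadoop:2d}")
```

At peak the Hadoop instance shrinks to its two collector nodes; overnight nearly the whole pool swings across to analytics, without any data leaving the cluster.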
The exercise might have seemed trivial – but for the fact that McDougall was demonstrating the collection and processing of terabytes of data per node, without having to “shift any data around”.
McDougall told iTnews the “obvious value prop” for running Hadoop on virtual machines was utilisation.
“It’s reported that Yahoo’s big Hadoop cluster has CPU utilisation of only 11 percent or thereabouts. And you see that trend across other environments: the majority of customers aren’t making use of their computing capacity when systems are set up to tackle single-purpose problems. Being able to use that platform for more than one thing would save huge amounts of money.”
But like all other apps before it, perhaps the bigger driver is the ability to clone or provision Hadoop environments in a rapid fashion. “We want to make it dead simple to create a new virtual cluster on existing infrastructure,” McDougall said.
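Serengeti exposes that provisioning workflow through a command-line interface. The fragment below is illustrative only, based on the project's early documentation; the hostname is invented and exact command flags may differ between Serengeti releases.

```
# Point the Serengeti CLI at the management server (hostname is a
# placeholder), then provision a Hadoop cluster from the default template
connect --host serengeti.example.com:8080
cluster create --name demo-hadoop
cluster list
```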
VMware now just needs a big name reference customer – someone of the size of Facebook, Google, Yahoo or eBay – to drive the point home.
Trading systems and HPC
McDougall’s team is also tackling the virtualisation of low-latency applications for financial trading systems.
Brokers have an unusual set of requirements when connecting to stock exchange systems, and there is massive competitive advantage to be gained from making any given trading decision faster.
Applications require extremely fast response times and networks with as little latency as possible – to the point where many stock exchanges offer co-location to bring customers closer to the action.
Whilst a subset of these applications aren’t likely to live on virtualised infrastructure any time soon, McDougall says there is a larger share of apps in the market that don’t require such extreme performance.
“We have spent time with most of the software vendors in that space – and most talk of low-latency and real-time trading,” McDougall said.
“But for the majority of apps - when they say real time what they mean is that it is sensitive to milliseconds of latency.
“It is only the pointy end that needs to buy the data centre space at the speed of light distance away. The great majority of apps in the middle are low hanging fruit for virtualisation.”
The same goes for high performance computing – increasingly used in mining and manufacturing, insurance, biotechnology and other research fields. These systems tend to run simulations on terabytes of data, with thousands of CPUs in a cluster.
Like big data apps, these high performance computing clusters tend to be run in silos, separate from corporate IT.
McDougall’s team wants to abstract away that hardware infrastructure so that whether you are running a simulation or checking your email, you are still accessing the same shared pool of hardware resources.
“If there are adjacencies where virtualisation hasn’t been used, we are making sure we have everything we need to prove to customers it can be done.”