Why Monash is sharing its MASSIVE supercomputing cluster

Tweaking OpenStack and feeding NeCTAR for the common good.

It was heralded as a major jump in computing power when it was switched on last year, and Monash University’s MASSIVE-3 supercomputer is now being opened to Australia’s nationwide academic and research community thanks to participation in the rapidly expanding NeCTAR project.

Touted as “Australia’s largest biological microscope”, the MASSIVE-3 node has been designed to support imaging applications such as angstrom-scale microscopy, in which scientists generate what is often terabytes of data from atomic-level scans of molecules.

Processing this data into usable images can take many hours using a conventional desktop system, but MASSIVE-3 – which is built on 1700 Intel Haswell CPU cores and 50 Nvidia Tesla K80 graphics processing units (GPUs) – can complete the same work in 5 minutes, senior HPC consultant Lance Wilson told an audience at this month’s OpenStack Australia Day Melbourne.

“You’re looking at around 5000 files comprising 1 to 4 terabytes [per image],” Wilson said.

“It really smashes your file system, and pipeline analysis means you have to do lots and lots of steps.

“You need monster GPUs and a lot of memory – the last job I did needed over 240GB of memory to process and was tripping over our system – and if you do have parallel processing like we do, it absolutely smashes it. One job will take up all of the bandwidth.”
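To put Wilson's figures in perspective, a rough back-of-the-envelope sketch is shown below. It assumes decimal terabytes and files of roughly equal size, neither of which was specified in the talk, so the result is indicative only.

```python
# Rough per-file size implied by "around 5000 files comprising 1 to 4 terabytes".
# Assumptions (not from the talk): decimal units (1 TB = 10**12 bytes) and
# files of roughly equal size.
FILES_PER_IMAGE = 5000
TB = 10**12
MB = 10**6


def average_file_size_mb(dataset_tb: float, n_files: int = FILES_PER_IMAGE) -> float:
    """Return the mean file size in megabytes for a dataset of dataset_tb terabytes."""
    return dataset_tb * TB / n_files / MB


for size_tb in (1, 4):
    print(f"{size_tb} TB dataset: ~{average_file_size_mb(size_tb):.0f} MB per file")
```

Under those assumptions, each image involves thousands of files of a few hundred megabytes apiece, which is consistent with Wilson's description of the load on the file system.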

Since the cluster’s launch a year ago, the university’s team of supercomputing specialists has been working hard to optimise its performance.

Over the course of what senior HPC consultant Blair Bethwaite called a “6-month learning curve”, the team has steadily worked to tweak the environment – which runs on 100Gbps Ethernet, has 100 nodes and is about to get a few hundred more – for optimal performance.

This included optimising interfaces with CPU enumerative topology and virtual cores, disabling kernel consolidation features, removing host-based data-proxy features, and adjusting technical minutiae around use of memory pages, GPU instances, memory allocations in multi-threaded environments, tweaks to internal APIs, configuration of the Nvidia NVLink GPU interconnect, configuration of the System Resource Orchestrator (SRO), and more.

The Intel Linpack benchmark – used to rank the TOP500 list of supercomputer sites – offered guidance in a range of configurations and helped the Monash team tweak performance and resolve issues.

“Every time you think you’ve solved a problem, someone comes along and says ‘this doesn’t work’,” Bethwaite said. “It’s been tricky.”

Yet performance is only one of the attributes the team has been targeting with its work: it has also been working to improve accessibility to the resources, offering browser-based access to the system and engaging with the NeCTAR national research cloud to integrate the system’s resources into the deepening pool of available supercomputing power.

“This was never just about building a cluster,” Wilson said. “This space has a lot of different use requirements, and this was going to be just one of the services that would be needed. So when it came to doing cluster 3 we chose to do it through OpenStack through NeCTAR.”

Interfacing with NeCTAR puts MASSIVE-3 into a fraternity that has been growing rapidly in recent years, NeCTAR deputy director for research platforms Paul Coddington told the OpenStack audience. The national research cloud allows scientists at Melbourne University and 7 other institutions to sign into a pool of supercomputing resources and lodge their jobs using OpenStack tools alongside schedulers such as the SLURM queueing system.

“The whole point of increasing general national research infrastructure is to enable national and international collaboration and research to provide the tools that researchers need,” he said.

NeCTAR has grown from its original 100 computing cores in 2011 to comprise around 40,000 CPU cores in total, with 4 petabytes of RAM.

Biological sciences systems such as Monash’s MASSIVE-3 are the largest contributor – providing 12,090 CPU cores overall – but information and computing sciences (8688 cores), mathematical sciences (4582), engineering (3645), medical and health (2554), physical sciences (2511), and other disciplines are also contributing their resources.
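As an illustrative sketch of how these contributions compare, the snippet below works out each discipline's share of the pool. The article gives the total only as "around 40,000" cores, so treating it as exactly 40,000 is an assumption made purely for the arithmetic, and the figure for "other disciplines" is implied rather than reported.

```python
# CPU cores contributed to NeCTAR by discipline, as quoted in the article.
cores_by_discipline = {
    "biological sciences": 12090,
    "information and computing sciences": 8688,
    "mathematical sciences": 4582,
    "engineering": 3645,
    "medical and health": 2554,
    "physical sciences": 2511,
}

# Assumption: the article says "around 40,000" cores; treated as exact here
# only so that shares and the remainder can be computed.
APPROX_TOTAL_CORES = 40_000

named_total = sum(cores_by_discipline.values())
other_cores = APPROX_TOTAL_CORES - named_total  # implied "other disciplines"

for name, cores in cores_by_discipline.items():
    print(f"{name}: {cores} cores ({cores / APPROX_TOTAL_CORES:.1%})")
print(f"named disciplines: {named_total} cores")
print(f"other disciplines (implied): ~{other_cores} cores")
```

On these numbers the six named disciplines account for roughly 34,000 cores, with biological sciences alone contributing about 30 percent of the approximate total.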

A federated authentication model allows users to submit virtual machines to run “anywhere there is an available resource, but people see a single interface to the federation,” Coddington said.

“This has been an extremely low hurdle that has been extremely useful for driving uptake,” he said, noting that NeCTAR has been growing user numbers by 40 percent annually and now counts more than 10,000 users nationwide.

NeCTAR has already implemented a range of OpenStack infrastructure services – including the Glance image service, Horizon cloud dashboard, Keystone identity service, Swift object store, and Nova compute and Cinder volume storage services – as well as a range of helpdesk and user support capabilities.

This infrastructure will allow NeCTAR to pursue its long-term goal of interoperability with other national research clouds to develop ‘international science clouds’, ultimately complementing local resources with 22 OpenStack-based systems already in place – ranging from France’s 8000-core Grid'5000 to CERN’s 250,000-core computing cloud.

Ultimately, the technological decisions made by the Monash supercomputing team are driving a push to make the platforms ever more accessible to scientists who want to focus on the science rather than the esoterica of the high-performance computing world.

By building on the OpenStack framework – components such as Ironic bare-metal provisioning and Cyborg acceleration-management are next on the to-do list – the team is continuing to boost performance while complementing technological improvements with ease-of-use considerations.

“We’re dramatically reducing the time between when people collect data, and when they have a result they can see,” Wilson said.

“The next level for this is to be able to scan a pathogen and see what it is almost in real time.”

Copyright © iTnews.com.au. All rights reserved.