The University of Melbourne has launched a new high-performance compute cluster called Spartan, which has been designed to triage workloads and end the queuing headache for priority researchers.
The unconventional setup will allow power users to access optimised bare-metal HPC capabilities, backed by high-speed networking and storage, while general users with less critical workloads can be pushed to the cloud component of the system.
According to the uni's head of research computer services, Bernard Meade, general users of the old system, who did not necessarily need to access bare-metal performance, ended up in the same queue as power users who did.
“Usually, how universities build HPC systems is they pick a couple of principal use cases – no more than about five – and they build an HPC system to fit, making compromises between those power users. These are usually the most important use cases,” Meade told iTnews.
“But then everyone else comes aboard as well, and they don’t know how to use the system quite as well. So they get in the way [of the power users] because they’re part of the same job queue, and the power users the system was designed for have to wait in queue for them to finish their jobs.”
The Spartan design means general users gain access to spare cycles available in the research cloud, reducing the time they wait for jobs to start, while power users no longer have to wait in line.
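On the scheduler side, this kind of routing is typically expressed as SLURM partitions (SLURM is Spartan's stated workload manager). The sketch below is a hypothetical batch script; the partition names are illustrative assumptions, not confirmed by the article.

```shell
#!/bin/bash
# Hypothetical Spartan batch script. SLURM is the cluster's scheduler per the
# article, but the partition names here are invented for illustration.
#SBATCH --partition=cloud      # general users: jobs land on research-cloud VMs
##SBATCH --partition=physical  # power users would target bare metal instead
#SBATCH --ntasks=4
#SBATCH --time=00:30:00

# The job body is ordinary shell; #SBATCH lines are comments to bash itself.
MSG="job started on $(hostname)"
echo "$MSG"
```

Submission would be `sbatch job.sh`; because `#SBATCH` directives are plain comments, the script also runs unmodified outside the scheduler.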
Meade said the system will also scale up and down depending on how much of the Nectar national research cloud is available for HPC workloads at any given time.
“So they might allow us to have 1000 cores, or if usage is lower, perhaps 5000 cores or even 10,000 cores. So we can scale up and down and use that extra capacity in the cloud, which isn’t being used, for our HPC clients.”
While the cloud component of Spartan runs on the same physical hardware that hosts the Melbourne node of Nectar, Meade notes it isn’t actually part of the Nectar cloud itself.
“It is the same hardware and it’s all connected together on the same rack – it’s not physically separate hardware, but there’s a virtual division we impose. So we can scale it whenever we want.”
Meade hopes the new approach will stop a worrying trend towards frustrated research departments feeling forced to invest in their own HPC systems.
Meanwhile, the Spartan architecture makes it easy for these departments to add their own dedicated nodes to the system.
“They can attach it to Spartan as a separate partition. So they don’t have to buy head nodes, log-in nodes, interconnect and storage. We’ve got all that. All they have to buy is the hardware that they want, if they have a particular need.”
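In SLURM terms, a departmental buy-in can be modelled as its own partition over its own nodes, while sharing the cluster's existing login nodes, interconnect and storage. The fragment below is a hypothetical `slurm.conf` excerpt; node names, counts and group names are invented for illustration.

```
# Hypothetical slurm.conf excerpt -- all names and sizes are illustrative.
# A department's purchased hardware becomes a partition only its group can use.
NodeName=dept[01-04] CPUs=32 RealMemory=128000 State=UNKNOWN
PartitionName=dept Nodes=dept[01-04] AllowGroups=dept_users Default=NO MaxTime=INFINITE
```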
Along with accepting additional nodes, Spartan’s management nodes can be live-migrated to new hardware, allowing equipment to be upgraded or replaced without bringing the entire cluster down. The cluster’s configuration is managed using a platform called Puppet.
"So if everything went south, we could build it back up within a day,” Meade said.
“As we buy new hardware in, the performance of the hardware is better and we can migrate virtual machines on to the new hardware without any interruption to the service. So we can do rolling upgrades of the system without having to replace everyone in one big hit.”
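The ability to rebuild the cluster within a day reflects Puppet’s declarative model: node state lives in manifests and can simply be re-applied to fresh hardware. A minimal sketch might look like the following hypothetical fragment; the node name and resources are invented, not taken from Spartan’s actual configuration.

```
# Hypothetical Puppet manifest -- node and resource names are illustrative.
node 'spartan-mgmt01' {
  package { 'slurm':
    ensure => installed,
  }
  service { 'slurmctld':
    ensure  => running,
    enable  => true,
    require => Package['slurm'],
  }
}
```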
The dedicated ‘bare-metal’ component of Spartan consists of 204 standard CPU cores plus two GPU nodes, built from Dell systems with switches from Mellanox and Cisco. The system runs Red Hat Enterprise Linux, with SLURM as the workload manager.
Along with the main HPC system, the project features a 4000-core expansion of Melbourne Uni’s existing research cloud, to around 11,000 cores.
Work on the system began over a year ago, with design assistance provided by the Victorian Life Sciences Computation Initiative (VLSCI).
It is set to be used by researchers across a wide array of disciplines, including Victoria University’s bushfire modelling.