CSIRO’s Data61 has revealed the difficulty it faces in getting funding for hardware to run infrastructure services, with one of its research projects only recently being able to retire “dying” 20-year-old servers.
Principal research engineer in the Trustworthy Systems team, Peter Chubb, told a pre-gathering of Linux.conf.au 2019 in Christchurch that getting funding for infrastructure hardware outside of a specific piece of research is “almost impossible”.
Chubb said that the team’s infrastructure runs services for around 40 desktops and “the same number of laptops, plus around 30 development boards” for its embedded systems research.
This is growing as the number of people in the research group approaches 50, he said.
Chubb said that the team had typically tried to repurpose research hardware to run infrastructure, but this had become less possible over the years as the team’s focus had shifted.
“Back in the late 90s/early 00s, we were doing lots of server work, mostly sponsored by IBM and HP, but we’ve moved to embedded systems work so most of the research hardware we get now is prototype … boards, and you can’t easily take those and run infrastructure on them,” he said.
“Most of our hardware was last installed in the year 2000 so it’s about 18-20 years old now, and it’s dying.
“There’s a limit to how many times you can swap a power supply and still get parts over eBay.”
Running infrastructure on boxes handed down from expired research projects has been the norm for a long time, he noted.
“As a research organisation, it’s easier to get funding for new research hardware but it’s almost impossible to get funding for infrastructure hardware,” Chubb said.
“So what that means is when we’ve got a project that uses a server, we try and spec it so that it’s going to be useful for infrastructure purposes afterwards.
“The difficulty is that when you buy a server for research, they want to cut down everything [except] for the bits you need for your research. So get a 28-core machine - fine - but it’s only got 16GB of RAM, no redundant power supplies, and no hot-swap fans.”
Chubb said it took three years of applying to secure capital expenditure to fund the purchase of a single large server that could replace some of the dying gear.
“When I applied, I went to Dell’s configure-it-yourself website and got the quote, and ordered it on that,” he said.
“But because CSIRO’s so big we got a huge corporate discount, so I bought two”.
The two new machines each have 24 cores, with 300GB RAM, 16TB of spinning disk, and two 10Gbps fibre connections linking them.
Having the two machines allowed the team to retire about a dozen older servers.
Chubb’s next goal has been to set the two machines up in a “high-ish availability” configuration, allowing services to be replicated between the two, and this was the major topic of his presentation.
“Given some excess capacity in the new machines, I decided to try to set up replication and failover, so I can bring one machine down for maintenance, and people won't notice (much),” he said in notes for his presentation.
“What I want to be able to do is plan for downtime so I can do kernel upgrades or replace a dead network card without affecting people and without coming in at ridiculous times of the morning [to do it],” he said on Monday.
“With our growing group - we’ve now got 50 people in the group relying on this infrastructure - downtime costs more.”
Chubb said he looked at open source solutions like Corosync and Pacemaker to manage this.
However, he found that “they all seem to want you to start with three servers. We’ve got two.”
Since they did not seem applicable to Trustworthy Systems’ use case, Chubb decided to skip them and roll his own solution.
“In hindsight this may have been a mistake,” he said.
“What I’m going to do is put every service in its own LXC container, replicated on the two systems, and use lsyncd, which is a Linux utility that monitors file systems for changes.
“When a change happens, it invokes rsync to ship everything across.
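Chubb did not share his actual configuration, but the mechanism he describes maps onto lsyncd’s standard Lua-based config: watch a directory tree with inotify and batch changes out to a peer via rsync. The following is a minimal sketch only; the container path, peer hostname, and log locations are hypothetical placeholders, not Data61’s setup.

```lua
-- Hypothetical lsyncd config sketch: replicate one LXC container's
-- filesystem to the second server. All paths/hosts are placeholders.
settings {
    logfile    = "/var/log/lsyncd.log",
    statusFile = "/var/log/lsyncd-status.log",
}

sync {
    default.rsync,                           -- built-in rsync sync layer
    source = "/var/lib/lxc/dns/",            -- container on the primary
    target = "peer-host:/var/lib/lxc/dns/",  -- same path on the standby
    delay  = 5,                              -- batch changes for 5 seconds
    rsync  = {
        archive  = true,                     -- preserve permissions, times, links
        compress = true,
    },
}
```

One `sync` block per replicated container would follow the same pattern, which matches the “every service in its own LXC container” design.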
“And I put some monitoring software on there so if the TFTP [Trivial File Transfer Protocol] server dies it’ll notice it’s dead and bring up the other one.
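The monitoring side of the design, as described, is simple probe-and-failover logic: check whether the primary instance of a service answers, and if not, bring up the standby copy. As a minimal sketch of that idea only (Chubb’s actual scripts were not published; `probe_and_failover` and both commands are hypothetical stand-ins):

```shell
#!/bin/sh
# Hypothetical probe-and-failover sketch: run a health-check command for a
# service; if it fails, run a command that starts the standby instance.
# In Chubb's setup the probe might query the TFTP server and the start
# command might launch the replicated LXC container on the second machine.

probe_and_failover() {
    probe_cmd="$1"   # command that exits 0 if the primary is healthy
    start_cmd="$2"   # command that brings up the standby copy

    if $probe_cmd >/dev/null 2>&1; then
        echo "primary healthy"
    else
        echo "primary down, starting standby"
        $start_cmd
    fi
}

# Demonstration with true/false standing in for a real service probe:
probe_and_failover true  "echo standby-started"   # prints "primary healthy"
probe_and_failover false "echo standby-started"   # starts the standby
```

A real deployment would run such a check periodically (e.g. from a timer) and guard against both machines claiming the primary role, which is exactly the kind of split-brain problem tools like Pacemaker use a third node to arbitrate.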
“This all seems to work in all my initial tests so I decided to do a full runtime test.”
Chubb came in early one morning to “try to shut down one machine and watch everything fail over.” The problem was that it didn’t, and by 8.15am “people started arriving at work and couldn’t do anything because they couldn’t resolve any of the machine names for the servers they used to use.”
He managed to get everything working again by 11am that day but it “was not a good experience.”
“I obviously had to go back to the drawing board and rethink how we were going to do this high-ish availability,” he said.
He had since come up with several alternatives, though he noted that “the adventure continues”.