Gumtree owner eBay Classifieds Group spent the first six months of the year patching 1000 hypervisors and rebooting 80,000 CPU cores powering its private cloud to mitigate the Meltdown and Spectre vulnerabilities.
The company, which operates under the eBay umbrella and runs 10 local market websites such as Gumtree Australia where users buy and sell goods, started the project soon after the vulnerabilities became public in the first days of 2018.
The group was not accustomed to restarting its entire infrastructure.
“We don’t [restart frequently] currently so we are doing it right now [only] when it’s needed,” cloud engineer Bruno Bompastor told last month’s OpenStack Summit in Berlin.
“We intend to do it in future every three months so we are looking next year to come up with a plan to automate this.”
eBay Classifieds Group’s infrastructure consists of a multi-region private cloud that runs OpenStack on Dell hardware. The two data centres (regions) are in Amsterdam and Dusseldorf, and there are four availability zones in each region.
Cloud reliability engineer Adrian Joian said the company learned of Meltdown and Spectre in the same way as most other people.
“We were not one of the companies that knew upfront about the exploits so we were informed with the rest of the world,” Joian said.
“We started to understand what mitigating the bug would mean.”
Bompastor added, “We started with an assessment phase to investigate what packages we needed to patch and what we needed to do for this to be covered.”
eBay Classifieds Group’s plan boiled down to patching the hypervisors in all four availability zones of each region with the latest Linux kernel, KVM version and BIOS updates.
Having scoped the extent of the work, the group “started to develop all of the scripts and automation that we needed to apply [these] patches”.
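The article does not describe the group's actual scripts. As a rough illustration only, an assessment step of this kind might check whether a hypervisor's running kernel reports the vulnerabilities as mitigated. The sketch below reads the standard Linux sysfs directory `/sys/devices/system/cpu/vulnerabilities`; the function names and the decision to treat a missing or unrecognised status as "needs patching" are assumptions made for the example.

```python
from pathlib import Path

# Standard Linux sysfs location for CPU vulnerability status files.
# Kernels that predate the Meltdown/Spectre fixes do not have this directory.
SYSFS_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def classify_status(text: str) -> str:
    """Map one sysfs status line to a coarse state (hypothetical helper)."""
    status = text.strip()
    if status.startswith("Not affected"):
        return "not_affected"
    if status.startswith("Mitigation"):
        return "mitigated"
    if status.startswith("Vulnerable"):
        return "vulnerable"
    return "unknown"


def read_vulnerabilities(base_dir: Path = SYSFS_DIR) -> dict:
    """Return {vulnerability_name: state} for every status file in base_dir."""
    if not base_dir.is_dir():
        return {}
    return {
        entry.name: classify_status(entry.read_text())
        for entry in sorted(base_dir.iterdir())
        if entry.is_file()
    }


def needs_patching(report: dict) -> bool:
    """Assumption: an empty report or any vulnerable/unknown entry needs attention."""
    if not report:
        return True
    return any(state in ("vulnerable", "unknown") for state in report.values())


if __name__ == "__main__":
    report = read_vulnerabilities()
    for name, state in report.items():
        print(f"{name}: {state}")
    print("needs patching:", needs_patching(report))
```

Run on each hypervisor before and after a reboot, a check like this gives a simple pass or fail signal that the new kernel is active; it says nothing about BIOS or KVM versions, which would need separate checks.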
The company’s strategy was not to “patch everything at once”, Bompastor noted.
“Initially, we just added some infrastructure with the patches to see how it behaved,” he said.
“At the same time we were keeping an eye on the community to see if there was some consequences of patching the hypervisors - if the load would increase immensely - because people were afraid this could happen.”
Once comfortable that patches could safely be applied, the company started with one of four zones in its Dusseldorf region.
“Dusseldorf was the first data centre that we decided to patch,” Joian said.
“We went the full cloud way where we told our users we were going to take down one availability zone per week and they were fine with that.
“But we [had] never done maintenance at this scale so it was a learning curve for us as well.”
The company initially targeted zone 4 in the Dusseldorf region.
“Our tenants had the choice to either move their workload into the other three availability zones or have the downtime with us, and then we would bring their VMs back up,” Joian said.
“We took the zone completely down and then up again, patched everything, and restarted the VMs.
“We did the same for zone 3, and then we got a bit of momentum with the last two zones and did them in the same week.”
Dusseldorf - with its 37,300 cores - was patched against Meltdown and Spectre in a month, though it wasn’t all smooth sailing.
“The lessons we learned were, yes we can do it very fast, we can upgrade an entire data centre in a few weeks, but it was not that optimal,” Joian said.
“We had feedback from our users that their Elasticsearch clusters were resyncing shards a lot. If we weren’t careful enough in bringing up entire racks they had problems with rebalancing their Elastic clusters.”
The company decided to give itself more time to patch the four availability zones in Amsterdam.
This wound up taking four months. Part of the additional time was needed because eBay Classifieds Group found that other elements of its infrastructure, including its Juniper Contrail software-defined network (SDN) and Avi Networks load balancer-as-a-service, needed patches of their own.
“We started with the same approach [as Dusseldorf] of one zone per week, but on the first upgrade we had an issue where we needed to patch our SDN,” Bompastor said.
“That happened as well while we did the second zone, so we had two patches in-between [the Meltdown/Spectre patch application] which is why it stretched a little bit on our target to do it also in one month.”
Bompastor said that eBay Classifieds Group ultimately decided to apply the Meltdown and Spectre patches less aggressively to ease user concerns.
“We started with one zone per week but ended up with a rack per day because it looked like a good compromise between velocity and impact,” he said.
“Some platforms - which is a group of users for us - were afraid it was impacting too much on their workload.”
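The shift from one zone per week to one rack per day is essentially a change in batch size for a rolling maintenance schedule. A minimal sketch of such a schedule follows. The zone and rack names, the start date and the choice to skip weekends are all hypothetical; the article does not say how the group planned its calendar.

```python
from datetime import date, timedelta


def rack_per_day_schedule(racks_by_zone, start, skip_weekends=True):
    """Yield (day, zone, rack) tuples: one rack per day, zone by zone.

    racks_by_zone: dict mapping a zone name to an ordered list of rack names.
    start: first candidate maintenance day (datetime.date).
    skip_weekends: assumption for this sketch, not something the source states.
    """
    day = start
    for zone, racks in racks_by_zone.items():
        for rack in racks:
            # Move forward past Saturday (5) and Sunday (6) if requested.
            while skip_weekends and day.weekday() >= 5:
                day += timedelta(days=1)
            yield day, zone, rack
            day += timedelta(days=1)


if __name__ == "__main__":
    # Hypothetical layout: two zones with three racks each.
    layout = {
        "zone-1": ["rack-a", "rack-b", "rack-c"],
        "zone-2": ["rack-d", "rack-e", "rack-f"],
    }
    for day, zone, rack in rack_per_day_schedule(layout, date(2018, 3, 5)):
        print(day.isoformat(), zone, rack)
```

Smaller batches stretch the calendar but limit how much capacity is offline at once, which is the trade-off between velocity and impact that Bompastor describes.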
The upgrades and reboots were completed in July of this year, the company noted.
Tests on key workloads - such as the ability for users of its e-commerce sites to search for goods to buy and sell - showed no adverse performance effects after the patches were applied. That was a relief for the company, which, like others, had worried about a potential performance hit of 10 to 15 percent on its hardware.