REA Group moved 500TB of BigQuery data from the EU multi-region to the Sydney region of Google Cloud over a period of five weeks.
The project was performed by REA Group subject matter experts and two consultants from IT services firm Servian, and the work is detailed in a Medium post.
The post notes REA began using Google Cloud “several years ago, primarily focusing on using the data analytics tools and services on the Google Cloud Platform (GCP) technology stack.”
“Back at the beginning of the journey, the region in Sydney did not exist,” Servian senior consultant Pablo Caif wrote.
“Like many other GCP customers at that time, REA Group chose the EU multi-region for analysing their data in BigQuery.
“Fast forward to today, and because of newly established contractual obligations and data sovereignty requirements, REA Group wanted to repatriate its BigQuery EU datasets to the relatively new Sydney region.”
Caif said REA’s GCP-based data workloads served analytical and “critical reporting functions” for sales, marketing, audience and other purposes.
He said Google Cloud Storage (GCS) and its transfer service “was used as the main technology” to repatriate the data.
“We used GCS to extract the data into, and then reload it back into BigQuery on the Sydney side,” Caif wrote.
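The post doesn't include the actual jobs, but the extract-and-reload round trip can be sketched as a pair of `bq` CLI commands. Everything below (project, dataset and bucket names, the `_syd` suffix, and the choice of Avro as the interchange format) is illustrative, not taken from the post.

```python
def round_trip_cmds(project, dataset, table, bucket):
    """Illustrative bq CLI pair for one table's EU -> Sydney round trip.

    A wildcard URI is used because single-file extracts are size-capped;
    with a wildcard, BigQuery shards the export across as many files as
    it needs.
    """
    uri = f"gs://{bucket}/{dataset}/{table}/part-*.avro"
    extract = (f"bq extract --destination_format=AVRO "
               f"'{project}:{dataset}.{table}' '{uri}'")
    # On the Sydney side, the files are loaded into a dataset created in
    # an Australian region (BigQuery tables inherit the dataset's location).
    load = (f"bq load --source_format=AVRO "
            f"'{project}:{dataset}_syd.{table}' '{uri}'")
    return extract, load
```

In practice the GCS bucket (or a copy made by the transfer service) must live in the destination region before the load runs, which is where GCS and its transfer service earn their place as "the main technology".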
‘Hot data’, the data called upon most frequently by REA, had to be moved “within an aggressive 48 hour window.”
“This, coupled with the need to validate that the data had been migrated successfully and without corruption, made it all the more challenging from an engineering perspective,” Caif wrote.
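The post doesn't say how that validation was done. One common BigQuery pattern, shown here purely as an assumption, is to run the same row-count-plus-fingerprint query against the EU source and the Sydney copy and compare the results; `FARM_FINGERPRINT`, `TO_JSON_STRING` and `BIT_XOR` are standard BigQuery SQL functions, while the table name is a placeholder.

```python
def validation_query(table):
    """Order-insensitive checksum for one table.

    Running this against both the source and destination tables and
    comparing the two result rows gives good (though not cryptographic)
    evidence that the reload was complete and uncorrupted.
    """
    return (
        "SELECT COUNT(*) AS row_count, "
        "BIT_XOR(FARM_FINGERPRINT(TO_JSON_STRING(t))) AS fingerprint "
        f"FROM `{table}` AS t"
    )
```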
Caif was at pains not to over-simplify the project.
“When you’re shovelling half a petabyte of data around from one continent to another, things get a lot more interesting and challenging,” he wrote.
“The movement of that much data did in fact throw up a few considerations that weren’t in play for smaller repatriation projects that we’d done in the past.
“For example, BigQuery has limits and quotas for extracting and loading with GCS that we needed to consider and engineer solutions for.”
The repatriation covered a few “hot data” tables that exceeded 100TB in size and were being updated in real time by streaming jobs.
“Migrating these tables [was] by far the most technically challenging hurdle that we needed to overcome,” Caif wrote.
Servian wound up breaking the tables into smaller chunks that could be migrated more easily, keeping each job within the data extraction limits set by Google.
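The post doesn't detail the chunking scheme, but for a date-partitioned BigQuery table one natural approach is to split its history into contiguous ranges of partitions, each small enough to extract as a single job. The chunk size below is a placeholder, not a figure from the project.

```python
from datetime import date, timedelta

def partition_chunks(first_day, last_day, days_per_chunk):
    """Split a date-partitioned table's history into contiguous
    (start, end) ranges, each covering at most days_per_chunk partitions,
    so each range can be extracted to GCS as one job within quota."""
    chunks, cur = [], first_day
    while cur <= last_day:
        end = min(cur + timedelta(days=days_per_chunk - 1), last_day)
        chunks.append((cur, end))
        cur = end + timedelta(days=1)
    return chunks
```

For example, ten days of partitions split four days at a time yields three chunks, the last one shorter.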
“On the other side, when reloading them back in, we of course needed to reassemble/recombine them back into one table with the right partitions,” Caif wrote.
“This also involved some more heavy duty engineering effort.”
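BigQuery's `table$YYYYMMDD` partition decorator is one mechanism for that reassembly: each chunk's files can be loaded straight into the corresponding partition of the destination table, with no final merge step. The names below are placeholders; the decorator syntax itself is real BigQuery.

```python
from datetime import date

def load_partition_cmd(project, dataset, table, day, bucket):
    """Illustrative bq load for one day's shard of a chunked table.

    The $YYYYMMDD decorator targets a single partition, so the chunks
    recombine into one correctly partitioned table as they are loaded.
    """
    part = day.strftime("%Y%m%d")
    uri = f"gs://{bucket}/{table}/{part}/part-*.avro"
    return (f"bq load --source_format=AVRO "
            f"'{project}:{dataset}.{table}${part}' '{uri}'")
```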