Research institutions participating in Australia and New Zealand's bid to host the Square Kilometre Array radio astronomy project have begun collaborating with the likes of IBM and Cisco on projects to develop new methods of processing and storing data from the telescopes.
The moves come in an effort to tame what is expected to be one of the most data-intensive projects to date.
The SKA, a 3000-dish array of telescopes, is expected to yield more than an exabyte of raw data per day by 2024, some 99 percent of which will likely be dumped in favour of usable images and time-series information for use by researchers.
Should Australia and New Zealand win rights in February to host the project over South Africa, data from the project will ultimately be held in the 500 square metre spare capacity at the Pawsey Centre high performance computing facility in Perth.
"Historically what they would do is put it on tape or disk and ship it down the mountain and load it up on some labs," said Dougal Watt, IBM New Zealand chief technology officer.
"None of them have approached the level the SKA proposes; nothing is in that league at the moment and there will probably be nothing in that league for a very long time."
The closest parallel, according to Watt, is Europe's Large Hadron Collider, which is estimated to produce approximately 15 petabytes per year. A global computing grid comprising 140 centres in 35 countries was established in 2008 to handle, process and store the data from the scientific experiments run in the LHC. The grid currently handles peaks of 10 GB of data per second.
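To put the quoted figures side by side, the following is a rough back-of-the-envelope sketch rather than a forecast. It assumes decimal units, treats "more than an exabyte" as exactly one exabyte per day, takes the 99 percent discard rate as exact and uses a 365-day year. None of those assumptions come from the researchers, and the function and variable names are illustrative only.

```python
# Rough comparison of the data volumes quoted in the article.
# Assumptions (not from the source): decimal SI units, exactly one exabyte
# of raw SKA data per day, exactly 99 percent discarded, a 365-day year.

EXABYTE = 10**18
PETABYTE = 10**15
TERABYTE = 10**12
SECONDS_PER_DAY = 24 * 60 * 60


def summarise(raw_bytes_per_day=EXABYTE, kept_percent=1,
              lhc_bytes_per_year=15 * PETABYTE):
    """Return headline rates derived from the quoted figures."""
    kept_bytes_per_day = raw_bytes_per_day * kept_percent // 100
    return {
        # Sustained ingest rate if the raw data arrived evenly over the day.
        "raw_tb_per_second": raw_bytes_per_day / SECONDS_PER_DAY / TERABYTE,
        # Volume left after discarding the unwanted share.
        "kept_pb_per_day": kept_bytes_per_day / PETABYTE,
        # Days of retained SKA data needed to match one year of LHC output.
        "days_to_match_lhc_year": lhc_bytes_per_year / kept_bytes_per_day,
    }


if __name__ == "__main__":
    for name, value in summarise().items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions the raw stream works out to roughly 11.6 terabytes per second, and the 1 percent that survives (about 10 petabytes a day) would match the LHC's estimated annual output in about a day and a half. How much of that retained data is ultimately kept in long-term storage is a separate question the figures above do not answer.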
Phase one of the SKA, to begin construction in 2016, will deliver about a tenth of the array's full capacity by 2019 and require a 100 petaFLOP machine. While a huge amount of data will be processed and disposed of, the project will require petabytes of storage every year.
Researchers continue to explore ways of processing the data quickly, storing it logically and moving it efficiently between its ultimate home and research institutions, ahead of a full design of the data system due by 2015.
Significant research and development projects have begun between IBM, Cisco and DataDirect Networks (DDN) to develop methods of implementing a hierarchical file system that will store the data appropriately for researchers to access in minute detail.
The companies had previously signed memoranda of understanding with the global SKA Organisation to contribute to research and development on the project.
"We believe that the storage companies like DDN and IBM, the hierarchical storage systems they're working on are probably adequate in terms of data volume for the SKA," said Peter Quinn, director of the International Centre for Radio Astronomy Research.
"I think where the challenges are going to lie is in data management - how you stage the data in the right way to process it, how you pre-condition the data, how you store the data."
Researchers at ICRAR have focused on the NASA-funded Hierarchical Data Format 5 (HDF5) as a possible file format for the data.
The format, combined with research involving IBM and DDN, would allow the storage systems to store data based on latency and high or low availability requirements for researchers.
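As an illustration only, the idea of placing data according to latency and availability requirements might look something like the sketch below. It is not the HDF5 API, nor anything IBM, DDN or ICRAR have described: the tier names, latency figures, replica counts and the `AccessHint`, `Placement` and `choose_placement` names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessHint:
    """What an application says about how it will use a data set."""
    max_latency_ms: float      # slowest acceptable time to first byte
    high_availability: bool    # whether the data must survive a device loss


@dataclass(frozen=True)
class Placement:
    tier: str
    replicas: int


# Hypothetical tiers, fastest first: (name, typical access latency in ms).
TIERS = (("flash", 1.0), ("disk", 50.0), ("tape", 60_000.0))


def choose_placement(hint):
    """Pick the slowest tier that still meets the latency bound."""
    tier = TIERS[0][0]  # fall back to the fastest tier if nothing qualifies
    for name, latency_ms in TIERS:
        if latency_ms <= hint.max_latency_ms:
            tier = name
    # Assumed policy: keep a second copy when high availability is requested.
    replicas = 2 if hint.high_availability else 1
    return Placement(tier, replicas)


if __name__ == "__main__":
    print(choose_placement(AccessHint(max_latency_ms=100, high_availability=True)))
```

In this toy policy, a data set that tolerates 100 ms of latency and needs high availability lands on disk with two copies, while one that can wait an hour goes to tape. The point is simply that the hint travels with the data, so the storage layer does not have to guess how it will be used.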
The goal is to "effectively try to enable applications to instruct storage on the best way to put the data in place".
"If you don't [optimise], you can have a very intelligent and structured storage system but the storage system doesn't know anything about what the data is going to be used for," Quinn told iTnews.
Also under way are research projects with Cisco to optimise networking equipment to better respond to the specific data flow patterns and rates expected from the SKA during the first phase of the project.
Concurrently, researchers in New Zealand recently completed a three-month prototype of software that automates classification of star systems and other astronomical objects, a process that previously required manual intervention by scientists.
The prototype processed reference data on laptops and would need to scale up massively to handle the volume of data expected from the telescopes.
The software, built on the open source International Virtual Observatory Alliance Ontology, is expected to make processing and storing the data easier.
"It has the ability to represent peta-size data sets, it would dramatically reduce the amount of data sets you'd need to have stored," IBM's Watt said.
"You'd still need to have the original peta-size data file because at some point you do need to go back and you might need to read about things inside the original image but we think we can get a massive reduction down to tera-sizes."
Though the prototype was considered a success, there have been few further talks on whether the software will be integrated into the wider Australian/New Zealand project.
"There's no one way that this problem of managing that data is going to be solved, it requires I think multiple approaches to try and address the problem," Watt said.
Quinn said ICRAR would continue to pursue further research and development projects over coming years ahead of finalising the systems ultimately required for the project.