Information technologists at the European Organisation for Nuclear Research (CERN) face a tough challenge in the wake of its discovery of a "particle consistent with the Higgs Boson" last month.
CERN has committed to keeping its Large Hadron Collider (LHC) online for several more months, as physicists conduct yet more experiments to determine precisely what the discovery means.
The commitment requires CERN's IT team to once again adjust planning around the lifecycle of IT infrastructure to meet an extended project deadline.
iTnews’ Brett Winterford sits down with CERN’s deputy head of IT, David Foster, to discuss how it plans to stay ahead of the technology curve.
What does the recent Higgs Boson discovery mean for IT?
What was discovered was a particle at an energy that seems to be compatible with what we understand the Higgs should be. All of this – apart from the fact that a particle exists at that energy – has to be verified.
Yes, it’s a discovery, something new and interesting that seems to be compatible with the Higgs, but the experiments slightly disagree as to what energy it resides in.
So we need to do a lot more measurements. It’s all about data. You need to collect a lot of event data to have a statistical chance of seeing the same outcome enough times to prove that what you’ve found is real.
We’re going to keep the accelerator running three months longer than planned, into 2013, at current energy levels.
The LHC was supposed to shut down in 2013 so we could upgrade the accelerator to the highest energy available. We are currently operating at 4 TeV per beam in each direction, which means a collision generates a sum of 8 TeV. We originally intended to move to 7 TeV in each direction (total of 14 TeV in a collision) by the end of 2014. But the success of the current experiments has convinced us to stay the course at the current energy levels.
The discovery of the new particle will require intense analysis work. It will certainly require at least what we have today in terms of computing, disk and network, and will continue the trend towards increased capacity in all these areas going forward.
Can you give us some idea of the current data processing requirements for the LHC experiments?
When the LHC is working, there are about 600 million collisions per second. But we only record here about one in 10^13 (ten trillion).
If you were to digitize all the information from a collision in a detector, it’s about a petabyte a second or a million gigabytes per second.
There is a lot of filtering of the data that occurs within the 25 nanoseconds between each bunch crossing (of protons). Each experiment operates its own trigger farm – each consisting of several thousand machines – that conducts real-time electronics within the LHC. These trigger farms decide, for example: was this set of collisions interesting? Do I keep this data or not?
The non-interesting event data is discarded; the interesting events go through a second filter or trigger farm of a few thousand more computers, also on-site at the experiment. [These computers] have a bit more time to do some initial reconstruction – looking at the data to decide if it’s interesting.
Out of all of this comes a data stream of a few hundred megabytes to 1 GB per second that actually gets recorded in the CERN data centre, the facility we call ‘Tier Zero’.
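The two-stage filtering described in this answer can be pictured as a cascade of accept/reject decisions. The sketch below is a toy illustration only, not CERN's trigger software: the `Event` fields, the thresholds and every function name are hypothetical, chosen just to show a fast, coarse first stage followed by a slower second stage.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Sequence


@dataclass
class Event:
    """Toy stand-in for one collision's detector readout (hypothetical fields)."""
    event_id: int
    energy_deposit: float  # arbitrary units, assumption for illustration
    track_count: int


def level1_trigger(event: Event) -> bool:
    # Fast, coarse decision: keep only events with a large energy deposit.
    return event.energy_deposit > 50.0


def level2_trigger(event: Event) -> bool:
    # Slower decision with more time per event: require enough tracks.
    return event.track_count >= 4


def run_cascade(events: Iterable[Event],
                stages: Sequence[Callable[[Event], bool]]) -> Iterator[Event]:
    """Yield only the events accepted by every stage, checked in order."""
    for event in events:
        # all() stops at the first rejecting stage, so later (more expensive)
        # stages never see events that an earlier stage discarded.
        if all(stage(event) for stage in stages):
            yield event


def recorded_ids(events: Iterable[Event]) -> list:
    """IDs of the events that survive both trigger levels."""
    return [e.event_id for e in run_cascade(events, [level1_trigger, level2_trigger])]
```

Only events that pass both stages would be written out; everything else is dropped before it reaches storage.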
How much of the data is stored and processed on site at Tier Zero?
For the first time in particle physics, not all the processing for the data generated by these experiments could be done on-site. It simply requires too much processing power. So what we have is the concept of a grid, where we tie together all the computing resources of collaborating institutes so that it essentially looks like one pool of infrastructure.
The filtered raw data comes into CERN’s Tier Zero data centre, but it is also sent out in almost real-time to eleven other large ‘Tier One’ data centres. These large data centres provide storage and long-term curation (backups) of the data.
Other data products flow down to Tier Two centres. The Tier Twos provide CPU and disk for analysis and simulation. There are around 150 Tier Two data centres on the grid, made up of other labs, institutes and universities all over the globe.
So there is a continuous evolution of the data. You reconstruct the events, filter out noise, group data sets together and create new data sets.
There is also always at least a second or third copy of the raw data distributed on the grid.
We have a distributed back-up system. You don’t have just one copy of the ATLAS data at the CERN site; that data is distributed at various sites that support the experiment. There is always one copy at CERN and a number of other copies elsewhere on the grid.
What defines this system as a ‘grid’?
If I’m a physicist and I have some processing I want to do on some data, I submit that job to a physical machine that knows about the whole infrastructure. It knows at what site that data has been stored or replicated to on the grid, and it sends my job to one of those sites. It will be executed there against the data, and the results sent back to me. So jobs are flying all over the place.
What makes it a grid is that the user doesn’t see which server or storage system is being used, but rather submits a job and the system takes care of the execution of the job somewhere on a worldwide basis.
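The routing idea described here, where a job is sent to a site that already holds the data, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the actual LHC grid middleware: the `Broker` class, the replica catalogue layout, the site names and the "least-loaded site" tie-breaking policy are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Job:
    name: str
    dataset: str  # name of the dataset the job needs to read


class Broker:
    """Toy job broker: routes each job to a site holding a replica of its data."""

    def __init__(self, replica_catalogue: Dict[str, List[str]]):
        # Maps dataset name -> sites that store a copy (hypothetical names).
        self.replica_catalogue = replica_catalogue
        self.dispatched: Dict[str, int] = {}  # site -> jobs sent there so far

    def submit(self, job: Job) -> str:
        """Return the site chosen to run the job."""
        sites = self.replica_catalogue.get(job.dataset)
        if not sites:
            raise LookupError(f"no replica of {job.dataset!r} on the grid")
        # Assumed policy: least-loaded site that has the data;
        # ties go to the site listed first.
        site = min(sites, key=lambda s: self.dispatched.get(s, 0))
        self.dispatched[site] = self.dispatched.get(site, 0) + 1
        return site
```

The caller only ever names a dataset; which site runs the job is decided inside `submit`, which is the property the answer above identifies as making the system a grid.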
To what degree is the data processed in real-time?
It occurs at CERN in real time, and I guess you would call it pseudo-real time out to the Tier One facilities and a cached batch process out to Tier Twos.
Essentially, what the system allows is almost real time analysis of the data. And that is a new innovation. We are able to go from taking the raw data to actually writing the research papers in an extremely short period of time. Sometimes we are showing results at presentations within weeks or days of the raw data being captured. This was never possible in the past.
Grid computing has enabled the very fast production of results into the hands of the physics community. And that’s been very spectacular.
You mentioned that the LHC has been nearly 30 years in design and construction. How can you predict on day one what technology will be available once the accelerator and experiments have been realized?
Generally when the initial planning is done for a physics idea, the technology is not necessarily available. What you are relying on is that the time it takes to design the accelerator and experiments is so long that, by the time you need the technology, it will be available.
So Moore’s law has been very important to us, and it has proven real. We have been relying on the doubling of processing speed every 18 months. And these days we are looking at parallelization using multi-cores to keep the trend going.
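To see why a long design period makes this bet reasonable, the doubling rule quoted above can be turned into a quick calculation. This is a back-of-the-envelope sketch; the function name and the 15-year example period are illustrative assumptions, not figures from the interview.

```python
def growth_factor(years: float, doubling_period_years: float = 1.5) -> float:
    """Capacity multiplier after `years` if capacity doubles every doubling period.

    The default of 1.5 years corresponds to the 18-month doubling quoted above.
    """
    return 2.0 ** (years / doubling_period_years)
```

Under this rule, a hypothetical 15-year gap between planning and operation spans ten doublings, so `growth_factor(15)` is about a thousandfold increase.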
But it’s the evolution of networking that made the grid realisable.
In the late 90s we anticipated the networking we might have available to us for the LHC would be 622 Mbps. But what happened was the deregulation of the telecom industry in the late 90s and 2000s. That totally changed the landscape: instead of a few hundred Mbps, we found that we could have gigabits.
Without this step-change in networking, it wouldn’t make sense to move all this data around. It enabled a change of the business model for computing. We were quick to realize that and adapt the way we did computing in order to capitalize on that. That decade from 1998 to 2008 was formative to developing a new concept of collaboration on a global scale.
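The scale of that step-change is easier to appreciate with a rough transfer-time estimate. The sketch below assumes an idealised, fully utilised link with no protocol overhead and decimal units; the 1 TB dataset size and the 10 Gbps figure are illustrative assumptions (the interview says only "gigabits"), and the function name is hypothetical.

```python
def transfer_hours(data_terabytes: float, link_megabits_per_s: float) -> float:
    """Hours needed to move the data over an ideal link (no overhead assumed)."""
    bits = data_terabytes * 1e12 * 8                 # terabytes -> bits
    seconds = bits / (link_megabits_per_s * 1e6)     # Mbps -> bits per second
    return seconds / 3600
```

On these assumptions, moving 1 TB takes roughly 3.6 hours at 622 Mbps but only about 13 minutes at 10 Gbps, which is the kind of difference that makes shipping data between sites practical.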
Was the particle physics community open to such collaboration?
This is one characteristic of the high-energy physics community – it is highly collaborative, highly distributed and highly mature in its methods. We were able to build the LHC computing grid because of both the technology becoming available, plus the will and desire of people to become a part of that collaboration.
Everything in high-energy physics has been a major collaboration, and in that sense it is quite different to a lot of other scientific disciplines. For example, any experiment at the LHC tends to be a major collaboration in its own right. The number of people working on an experiment at the LHC – dedicated to a detector they are building and the results they are seeking – is as big as the total staff of CERN.
To what degree do these experiments also compete?
From an IT perspective, each experiment – be that ATLAS, ALICE, CMS etc. – develops and designs its own computing solution. That’s part of the competitive advantage of each experiment.
ATLAS and CMS, for example, are both looking to solve similar physics problems. They do not communicate during experiments to avoid systematic errors and prevent bias creeping into the analysis. They do not reveal results to each other until the data is analysed, but then they issue joint press releases. So it’s a mix of collaboration and competition. It’s co-opetition.