The US National Institutes of Health has made some 200 terabytes of genomic data publicly available on Amazon’s S3 cloud in a bid to boost participation in the international 1000 Genomes Project.
The dataset contains DNA information from 1700 individuals, and will expand to include the genomic sequences of more than 2662 individuals from 26 populations around the world by the end of 2012.
The data has been publicly available since 2008 from the National Center for Biotechnology Information, part of the National Institutes of Health (NIH), and from the European Bioinformatics Institute.
Project organisers said this week that storing the data on Amazon Web Services (AWS) would make it cheaper for researchers to access and analyse the data, by using AWS’ cloud computing power.
Amazon said the data could be “seamlessly accessed” from its Elastic Compute Cloud (EC2) and Elastic MapReduce offerings, which provided computing power and big data processing capacity.
“Researchers can use the Amazon EC2 utility computing service to dive into this data without the usual capital investment required to work with data at this scale,” the company stated.
“Making the data available via a bucket in Amazon S3 also means that customers can crunch the information using Hadoop via Amazon Elastic MapReduce, and take advantage of the growing collection of tools for running bioinformatics job flows, such as CloudBurst and Crossbow.”
NIH’s partnership with Amazon Web Services was part of the US Government’s $200 million Big Data Initiative, announced this week by the President’s Office of Science and Technology Policy.
The 1000 Genomes Project aims to document human genetic variation, to identify regions that are associated with particular diseases or traits.
The data may be accessed for free from s3.amazonaws.com/1000genomes, either over HTTP or through Amazon software development kits for Ruby, Java, Python, .NET and PHP.
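As a rough illustration of the HTTP route, the sketch below lists a few objects in the bucket using only Python's standard library and the S3 REST bucket-listing request. It assumes the bucket still permits anonymous listing at the address given above. The function names are invented for this example and are not part of any Amazon SDK.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Bucket address as given in the article; anonymous access is an assumption.
BUCKET_URL = "https://s3.amazonaws.com/1000genomes"

# XML namespace used by S3 bucket-listing responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def listing_url(prefix="", max_keys=10):
    """Build the URL for an S3 bucket-listing request (hypothetical helper)."""
    query = urllib.parse.urlencode({"prefix": prefix, "max-keys": max_keys})
    return f"{BUCKET_URL}?{query}"


def parse_listing(xml_text):
    """Extract (key, size in bytes) pairs from a ListBucketResult document."""
    root = ET.fromstring(xml_text)
    return [
        (
            item.findtext("s3:Key", namespaces=S3_NS),
            int(item.findtext("s3:Size", namespaces=S3_NS)),
        )
        for item in root.findall("s3:Contents", S3_NS)
    ]


def list_objects(prefix="", max_keys=10):
    """Fetch one page of the bucket listing over plain HTTPS (needs network)."""
    with urllib.request.urlopen(listing_url(prefix, max_keys)) as response:
        return parse_listing(response.read())
```

Calling `list_objects(max_keys=5)` would return up to five key and size pairs if the bucket answers as assumed. An individual object can then be fetched by appending its key to the bucket address. For anything beyond browsing, the SDKs mentioned above are likely the more robust choice, since they handle paging and retries.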