Two years ago, in Spring 2007, Cold Spring Harbor submitted the first DNA sequence data—Jim Watson’s 454 sequence reads—to the Short Read Archive at NCBI (National Center for Biotechnology Information). Since then, the SRA has become a critical component of the genomics community infrastructure, providing two-way access to enormous datasets, integrating with European and Japanese repositories, and storing sequence information on nearly 1,000 different organisms. As current and future technologies push the read-lengths closer to Sanger territory, the SRA has even undergone a name change: it now stands for “Sequence Read Archive.”
Kevin Davies spoke to NCBI’s Jim Ostell (Chief, Information Engineering Branch) and staff scientist Martin Shumway about the past, present and future prospects of the SRA.
Bio-IT World: How did the SRA come about?
OSTELL: The SRA came out of the Trace Archive activity, which was originally started for the mouse genome [shotgun sequencing] project. People thought they’d need a place to combine reads from a number of different centers and have one collection that you could try with different assemblers. That very quickly grew in demand, expanded, and became international when the Sanger Center Trace Archive joined. It was recently moved to EBI [European Bioinformatics Institute], and DDBJ [DNA Data Bank of Japan] created a resource as well. We started exchanging data. With the advent of short-read technology, it was clear that the database design of the trace repository would not scale to short reads, so NCBI architected the SRA to accommodate the bulk expected from short reads while also learning from the history of the trace archive’s design.
For example, in the trace archive, a lot of the metadata about each experiment is associated with each trace. That could be an issue when there’s a correction to the information and you have to update thousands of traces. In the SRA, [we were] careful about what level of the hierarchy information was placed at, so the least amount of information is touched during updates.
Another thing to acknowledge is that the unit of submission to the trace repository is a trace, but the unit of submission to the SRA is a run... It is also unclear what levels of data are going to be important to hold, and for how long. Some argue that for samples that are difficult to reproduce (e.g., a particular patient where you won’t redo the experiment), or cases where there may be additional computational interpretation over time (e.g., SNP calling), you want to store the machine intensity data, not just the base calls. For other activities, e.g., gene expression or epigenomics, where you’re essentially counting the number of reads and the details of which bases were called may not be so critical, you may not need to supply the intensities.
The SRA has a column-oriented design, which has several advantages. We can use the same architecture for people to submit data with and without certain elements, but it also means that we can take one of those columns and store it a different way. We’re not expecting people to want the intensities very often, so we can store those on a low-cost tape farm, which is slower, and store the reads and the base calls on disk. There’s a meta-database orchestrating those different columns, so from the point of view of a user, it’s all one thing.
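Ostell’s column-oriented, tiered-storage idea can be sketched in a few lines. This is an illustrative toy, not NCBI’s implementation: the class, the column names, and the tier labels are all hypothetical, standing in for the real meta-database that maps each column of a run to its storage backend.

```python
# Toy sketch of a column-oriented store: each column of a sequencing run
# (reads, qualities, intensities) can live on a different storage tier,
# while a small metadata layer presents one unified view of the run.

class ColumnStore:
    """Maps each named column of a run to a (tier, data) pair."""

    def __init__(self):
        self._columns = {}  # column name -> (tier, data)

    def put(self, name, data, tier="disk"):
        self._columns[name] = (tier, data)

    def get(self, name):
        tier, data = self._columns[name]
        return data  # a real system would fetch tape-resident data asynchronously

    def tier_of(self, name):
        return self._columns[name][0]


run = ColumnStore()
run.put("reads", ["ACGT", "TTGA"], tier="disk")      # hot: frequently accessed
run.put("quality", [[30, 31, 28, 33], [29, 30, 27, 26]], tier="disk")
run.put("intensity", [[0.8, 0.1], [0.2, 0.9]], tier="tape")  # cold: rarely needed

print(run.get("reads"))          # columns resolve transparently
print(run.tier_of("intensity"))  # tape
```

The point of the design shows in the last two lines: a caller asks for a column by name and never needs to know whether it came from disk or the tape farm, which is also what lets a submitter omit a column (like intensities) entirely.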
Does “short read” mean shorter than traditional Sanger traces?
OSTELL: That’s a good point. SRA started out meaning short-read archive, but now it means sequence-read archive. The reason is exactly the point you raised. It’s a superior architecture to the trace repository. We’d expect next-gen technology to generate longer and longer reads while still producing massive amounts of data. Even as the reads get longer, we’ll still store them in SRA.
What can you tell us about the IT Infrastructure?
SHUMWAY: We use SQL Server—that’s the meta-database, if you will. That’s keeping track of where all the pieces are. The pieces themselves are disk files that use a directory system. We use Panasas…
OSTELL: That’s like a NetApp: lots of disks distributed on a network and a virtual file system. It was selected by our systems group on the basis of price and performance. It seems to perform and scale well… We’re also installing a tape farm, a tape jukebox, like a really slow disk drive, for the intensities. You get a virtual view spanning this room full of disk drives connected to a tape jukebox.
How does the SRA relate to GenBank?
OSTELL: There are two relationships. One is a structural biological relationship—SRA has the reads, and it will also eventually have the alignment of the reads to the reference genome. It doesn’t have the assembled sequence—that goes to GenBank. Ultimately, structurally you’d like to have the assembly, the reads, and the alignment all connected together, even though they go to different repositories with different needs. The IDs will match them up.
The political thing is that the TA [trace archive] and SRA began as NCBI projects and then got picked up by [Wellcome Trust] Sanger [Institute]… There’s been consolidation on the European side. The trace archive moved from Sanger to EBI. In addition, DDBJ, the third partner in the GenBank partnership, said they’d have a short read archive. At our last meeting, we agreed to officially make the trace archive and SRA part of the GenBank-EMBL-DDBJ collaboration. We’re beginning to integrate them together.
EBI has an SRA running—they’ve taken a copy of the software for our SRA and they are running it there after discussion of design and architecture. We’re attempting to have a common codebase for that, which obviously simplifies a lot of things. Historically, that’s not the case for GenBank—the formats are different and the databases are different.
Why is it so important to have these geographically dispersed identical archives?
OSTELL: There’s good reason in terms of redundancy and the sense that there’s a lot of work [invested] and a lot of data, and this is an evolving set of data. Having more people able to interact with their local constituencies and projects is good—it distributes the load. There’s no way one group could do all the molecular biology, and I think it’s been positive that we’re trying to work effectively as if we were one group, while taking advantage of the investments being made in Europe and here and the strengths and weaknesses of each group.
How fast is data in the SRA growing?
SHUMWAY: We’re currently at 8.5 Terabases (Tb) of biological sequence under management. We’re growing by about 1 Tb/month. The major contributors are the 1000 Genomes Project and The Cancer Genome Atlas. We’re also bringing up a controlled-access instance of SRA within the dbGaP resource, to provide the same privacy protection to research participants as the NIH GWAS studies have. The Human Microbiome Project and epigenetics projects are other major contributors. Right now, the growth [rate] is roughly linear at 1 Tb/month. We may see the growth curve steepen later this year or next year, as the American stimulus funds translate into sequencing data.
The archive contains 85% human data, whereas the old trace archive is 15% human. There are still a lot of genomes in the SRA—about 800-900 single organisms. As for complete human genomes, we have the two Korean genomes, and EBI received the Han Chinese genome. We’ve received three whole genomes from Illumina—a Yoruban trio. One of those [Yoruban] individuals, HapMap sample NA18507, was also sequenced to deep coverage on the SOLiD platform. We also received a whole human genome sequencing dataset done on the Complete Genomics platform. Finally, we have been archiving the output of the 1000 Genomes Project, which consists of a number of deeply covered human genomes.
How are the various next-gen platforms represented in the SRA?
SHUMWAY: Illumina data currently occupies about 75% of the repository. The other platforms share 25%. We have good coverage from 454, a number of datasets from the SOLiD platform, and then we have one dataset each from Helicos—a yeast genome—and Complete Genomics.
How do users interact with SRA, such as accessing the Complete Genomics dataset?
SHUMWAY: There are those who download the complete dataset. It’s a challenge, no doubt. We’ve adopted the Aspera technology as a replacement for FTP, and it’s worked very well, for submissions and otherwise. The data are huge, and most people seem interested in working with the data in the alignment form. If you’re on Internet2, it would take a few hours [to download a human genome dataset]. If you’re not, it would probably take overnight.
Are you trying to drive the SRA agenda or adopt a more reactive posture to serve the community’s needs?
OSTELL: Oh, we’re very aware of that difference! We’re trying to strike the appropriate balance between those two extremes. This is a public service—we’re doing this for other people. But there are a number of constituencies. The mega-projects have their deadlines and needs and ways to deliver the data. The small and mid-scale projects, university sequencing centers, may not have an automated pipeline for providing data. We have people multiplexing lots of smaller samples, so their needs are more like a run-based archive: how do I give you a tenth of my run? Professor X is ready to give you his part but Professor Y isn’t.
We work with the vendors, who, of course, are anxious to have their platforms well represented. We try to make sure, for example, that the archive is capable of representing ABI color-space data, which obviously wouldn’t be the highest priority for a different vendor like Illumina. [laughs] But I think the vendors appreciate our even-handedness among them. At the same time, we’re using our experience to judge what’s likely to be important in the longer term. For example, a certain type of mega-user dominates right now; they download everything over Internet2. But we recognize that as the data accumulate, other classes of user will have questions, e.g., ‘I’d like all the short-read sequences under this particular gene; I’m not interested in the entire genome.’ That’s a vertical slice through the archive across many submissions—that’s the utility of working through an alignment. We think that’s going to be needed, and we’re working towards providing that for the next wave of people interested in using the SRA.
Are you talking to future third-generation sequencing providers?
OSTELL: Oh, we try to talk to them. Most of them know to talk to us, and we strongly encourage that as early as possible.
SHUMWAY: Each has their idiosyncrasies, but in a very basic way they all do the same thing. We’re gratified that our data modeling has held up as long as it has. For example, the PacBio platform actually looks like the old Sanger platform, in terms of data management—not technology.
So where does the SRA go from here?
OSTELL: We’re very aware that sequencing will be the tool to be applied to any biological problem—it’ll be cheap and fast and quantitative. We see this impacting many resources—gene expression is going from microarray to sequence-based; genotyping is going from microarray to sequence-based. With the flow of sequence data through NCBI, it’s hard to think of it now as just being the sequence database, because it’s part of GEO, dbGaP, all sorts of things. It’s becoming an architecture where sequence flows in at various levels—the read, the alignment, the assembly, the genome… We see the SRA as a core enabling infrastructure spanning lots of projects and reaching across most of bioscience.