Building A File System To Keep Up With Genomics England’s Five Million Genomes Project

March 24, 2020

By Allison Proffitt

March 24, 2020 | When David Ardley took over the platforms function at Genomics England (GEL), he inherited “quite a challenging storage environment,” he told Bio-IT World. The UK Department of Health launched Genomics England in 2013 with an audacious goal to sequence 100,000 genomes. Since October 2018, the vision has expanded to 5 million genomes—a growth of 4,900%.

At the end of 2018, GEL already had 21 petabytes of genomic data and expected that number to grow to over 140 petabytes by 2023, when the 5 million genomes project is slated for completion. GEL’s previous scale-out NAS solution had already hit its limit on storage node scaling and was experiencing performance issues. 

It was time for something new. GEL kicked off the RFP process in January 2019 seeking a storage platform for a new era of genomics.

GEL needed several capabilities from the new storage system, Ardley explained. The RFP benchmarked proposals against three primary requirements: the new storage needed a robust security and disaster recovery plan, all of the data needed to be active and accessible for researchers, and the system needed to perform at scale to achieve the goal of five million genomes by 2023.

The GEL team evaluated four proposals. They rejected parallel file systems because of their complexity and lack of enterprise features; they rejected all-flash scale-out NAS because the costs wouldn’t scale as GEL’s needs did.

Ultimately GEL chose WekaIO’s WekaFS solution. “That was primarily driven by Weka as a cache layer that required potentially low management,” Ardley explained. 

The WekaFS product is a distributed, shared file system that combines commodity, off-the-shelf flash and disk-based technologies into a single, hybrid solution, says Barbara Murphy, VP of Marketing at WekaIO. “Our software layers on top of all those individual servers and creates those massively distributed file systems with all the underlying storage in each one of those servers combined together to present a single scale-out NAS solution.”

The solution met all of GEL’s priority requirements. First, storage needed to be distributed across multiple sites. Backing up 21 petabytes of data—and growing—is impractical, but GEL still needed a disaster recovery strategy and a strong security plan. “One of the key requirements of storage was to have sites 50 to 100 miles apart,” Ardley explained. For security and backup, the object store is geographically dispersed over three locations all 50 miles apart, but all located within England. If a major disaster occurs in the primary location, Weka’s Snap-to-Object feature allows the system to be restarted in a second location. 
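Conceptually, the approach pairs point-in-time snapshots with a geo-replicated object store: capture a consistent image of the file system, push it to all three sites, and rehydrate from a surviving copy if the primary site is lost. The sketch below is only an illustration of that pattern, with hypothetical names and data structures; it is not how Snap-to-Object or ActiveScale replication is actually implemented.

```python
# Conceptual sketch of snapshot-based disaster recovery against a
# geo-replicated object store. Hypothetical names; not Weka's Snap-to-Object
# or ActiveScale internals.
import json
import time

SITES = ["site-a", "site-b", "site-c"]           # three locations, ~50 miles apart
object_store = {site: {} for site in SITES}      # stand-in for the dispersed object store

def snapshot_to_object(filesystem_state: dict) -> str:
    """Capture a point-in-time snapshot and replicate it to every site."""
    key = f"snapshot-{int(time.time())}"
    blob = json.dumps(filesystem_state).encode()
    for site in SITES:                           # every site holds a full copy
        object_store[site][key] = blob
    return key

def restore_at(site: str, key: str) -> dict:
    """Rehydrate the file system at a surviving site from a stored snapshot."""
    return json.loads(object_store[site][key])

snap = snapshot_to_object({"/genomes/sample-001.cram": "checksum-and-metadata"})
recovered = restore_at("site-b", snap)           # primary site lost; restart elsewhere
```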

The existing storage was a fairly traditional tiered approach, Ardley explained, but the new solution needed to be more flexible. “Researchers are looking at the same data and they randomly access everything,” he said. “Having all the data really active and performing is a requirement.” 

WekaFS delivers a two-tier architecture. The primary tier consists of 1.3 petabytes of NVMe-based flash storage that supports working datasets. The secondary tier consists of 40 petabytes of object storage to provide a long-term data lake and repository. In this case, the underlying object store is ActiveScale from Quantum (which recently acquired the product line from Western Digital).

The whole 41 petabytes are presented as a single namespace. The two tiers scale independently: if more performance is needed, the primary flash tier can grow without expanding the secondary object tier.

“We manage the moving back and forth of data between those two tiers seamlessly to the user,” Murphy explained. “The user doesn’t have to do anything, they don’t have to load special software, they don’t need data migration software, they don’t need anything. We manage that all internally. When they say, ‘I want XYZ file,’ we move it from the cold tier which is on the object store right into the flash tier so it’s available immediately for the user.”
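Murphy’s description maps onto a familiar pattern: a hot flash tier backed by a cold object tier, with misses promoted transparently on read. The sketch below illustrates that pattern in simplified form; the class, method names, and eviction policy are hypothetical and are not WekaFS internals.

```python
# Simplified illustration of transparent two-tier access (hot flash cache
# backed by a cold object store). Hypothetical names; not WekaFS internals.
class TieredStore:
    def __init__(self, flash_capacity: int):
        self.flash = {}              # hot tier: path -> bytes (NVMe in the real system)
        self.object_store = {}       # cold tier: path -> bytes (e.g., an S3-compatible store)
        self.flash_capacity = flash_capacity

    def write(self, path: str, data: bytes) -> None:
        self.flash[path] = data      # new data lands on the hot tier first
        self._evict_if_needed()

    def read(self, path: str) -> bytes:
        if path not in self.flash:   # cache miss: promote from the object store
            self.flash[path] = self.object_store[path]
            self._evict_if_needed()
        return self.flash[path]      # the caller never sees which tier served it

    def _evict_if_needed(self) -> None:
        # Demote the oldest entries to the object store when flash fills up.
        while len(self.flash) > self.flash_capacity:
            path, data = next(iter(self.flash.items()))
            self.object_store[path] = data
            del self.flash[path]
```

Whatever the real placement and prefetch logic looks like, the property Murphy emphasizes is the one the sketch preserves: the application sees one namespace, and tier placement stays invisible to it.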

Finally, GEL needed a system that could perform at scale. For five million genomes—ignoring compression—Ardley estimates about 150 petabytes of needed storage. And as the number of genomes increases, he needs the system to handle ingest rates that will eventually reach 3,000 genomes per day. “The I/O requirements for storage are very high,” he said. Of course there are other bottlenecks to that ingest rate, but “we wanted to design it such that storage and the high performance compute elements weren’t a bottleneck,” he emphasized.
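As a rough back-of-envelope calculation derived from the figures above (not a sizing model GEL has published), those numbers work out to roughly 30 GB per genome and a sustained write rate on the order of a gigabyte per second:

```python
# Back-of-envelope sizing implied by the figures above (not GEL's own model).
GENOMES_TOTAL = 5_000_000              # five million genomes target
TOTAL_STORAGE_PB = 150                 # Ardley's estimate, ignoring compression
PEAK_INGEST_GENOMES_PER_DAY = 3_000

bytes_per_pb = 10**15
per_genome_bytes = TOTAL_STORAGE_PB * bytes_per_pb / GENOMES_TOTAL
daily_ingest_bytes = PEAK_INGEST_GENOMES_PER_DAY * per_genome_bytes
sustained_rate_gb_s = daily_ingest_bytes / 86_400 / 10**9

print(f"~{per_genome_bytes / 10**9:.0f} GB per genome")          # ~30 GB
print(f"~{daily_ingest_bytes / 10**12:.0f} TB ingested per day") # ~90 TB
print(f"~{sustained_rate_gb_s:.1f} GB/s sustained write rate")   # ~1.0 GB/s
```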

“The rate that they’re actually creating data is phenomenal, and in fact we’re just about to do another expansion of the system,” Murphy said.  

Cost, Ardley said, wasn’t a primary priority, but, “It just so happens that the one we picked was also the cheapest.”

Rollout Underway

GEL chose the WekaIO solution in April 2019, installation took about two months, and data migration began in September 2019. That migration is still ongoing. “We literally had about 25 petabytes of read/write data in a very unstructured way,” Ardley said. “You can imagine that’s not going to be a very easy migration process. It’s been very challenging, but we’re at the tail end of that.”

Some of the biggest challenges in data migration came from identifying who owned various datasets, whether they could be deleted, and what authentication was needed.

“Prior to starting this project, we did a lot of cleanup, basically just to keep the servers running. There was a lot of duplication, directories that no one knew who owned them,” he said. “There were lots of directories that were very small or empty… We spent a lot of time just trying to find owners of data… knowing whether we could delete something. Who do we need to talk to say we’re moving this data? If we didn’t have that really locked down and managed, you’re just creating a problem later on.”

But even with the migration and data stewardship challenges, Ardley is pleased.  

“So far the performance has been really good, though I wouldn’t say it’s been as stressed as it’s going to be. We’ve had to fine tune it, but it’s lived up to the performance we were expecting to see, which is quite nice!”