Getting The Cloud To Work For Genomics

Contributed Commentary By Arik Keshet

October 30, 2017 | As in other industries, the cloud holds the promise of flexible, globally accessible, and scalable genomics computing. Unlike the information silos of most Internet content companies, the tremendous value of sharing genomics information provides a strong incentive for cloud migration. Why is it, then, that – certain national initiatives aside – this promise has been so slow in materializing?

The Challenges

Content upload is the first major hurdle: unlike the trickles of data from millions of separate Internet users, huge data samples (hundreds of gigabytes each) need to be transferred from the sequencing lab to the cloud. All this data is then stored – likely forever – and becomes a significant cost element: contrary to popular belief, cloud storage is actually more expensive than a carefully engineered large-scale on-premises facility. Additionally, applications and workflows that were written for on-premises High Performance Computing (HPC) installations need to be adapted to work in the alien environment of object stores, ephemeral compute instances, and connectivity constraints. To make cloud computing work effectively for genomics, a number of new technologies must be developed and deployed, and some of the established paradigms reassessed.

Data Compression

Data compression can mitigate the financial cost of cloud storage and the operational burden of lengthy transfer times. Data integrity is paramount, so only 100% lossless compression can be considered; ZIP is a well-known, general-purpose example. The challenge is that compression cannot be universal and efficient at the same time: only by taking advantage of the unique characteristics of genomics data can an algorithm capture its inherent redundancy and compress it efficiently. Thus, ZIP will typically compress a FASTQ file by a factor of only 3. GZIP-compatible compression (the blocked BGZF variant) is in fact used in converting SAM files to the BAM format, achieving a savings factor of around 2.5. In contrast, by leveraging biological insight into the structure of the analyzed DNA and RNA molecules, as well as the workings of the sequencing machine producing the FASTQ and eventually the BAM file, FASTQ files can be losslessly compressed by a factor of 10 or more (3 times beyond ZIP), and BAM files by a factor of 2.5 and, in certain circumstances, even 4.
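To make the idea of exploiting domain structure concrete, here is a minimal Python sketch of one textbook technique: packing each A/C/G/T base into 2 bits instead of one 8-bit character. It is an illustration only, not the method the author describes, and the function names are hypothetical. Real FASTQ data also carries N calls, quality scores and read identifiers, all of which this sketch ignores.

```python
# Illustrative only: 2-bit packing of A/C/G/T bases (not the article's algorithm).
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASE = "ACGT"


def pack_bases(seq: str) -> bytes:
    """Pack an ACGT-only string into 2 bits per base, prefixed by a 4-byte length."""
    out = bytearray(len(seq).to_bytes(4, "big"))  # length header for exact round-trip
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | _CODE[base]  # raises KeyError on non-ACGT input
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)


def unpack_bases(data: bytes) -> str:
    """Invert pack_bases, using the length header to drop padding bases."""
    n = int.from_bytes(data[:4], "big")
    bases = []
    for byte in data[4:]:
        for shift in (6, 4, 2, 0):
            bases.append(_BASE[(byte >> shift) & 3])
    return "".join(bases[:n])
```

By construction this stores four bases per byte, so the sequence text shrinks by a factor of 4 (plus the small header) without any statistical modeling. A general-purpose compressor has to discover the four-letter alphabet on its own, whereas a format-aware one can assume it from the start.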

Cloud Infrastructure

However valuable data compression might be, it will not be widely adopted if that means introducing yet another file format and having to educate bioinformaticians and programmers about its proper use. The trick is to make compression-decompression part of cloud infrastructure in a way that is transparent to applications, which for their part will continue to read and write native-format FASTQ and BAM files.

Economics aside, migration to the cloud also means a number of significant technological changes. Large-scale and cost-effective cloud storage is invariably based on object stores. While the object paradigm is a good fit for the write-once nature of the bulk of genomics data, virtually all bioinformatics tools were written with files in mind and therefore need to be adapted to deal with objects. Furthermore, the high latency of cloud object storage is a bottleneck to high-performance computing. Migration to the cloud will be easier if users are provided with an Infrastructure as a Service (IaaS) solution that stores genomics data in objects but at the same time presents applications with a file system interface. In a Linux environment, this amounts to implementing a VFS-compatible file system that is added to the machine image used by compute instances. To be effective, this file system needs to be distributed, so that it does not rely on a permanently running dedicated server and scales efficiently with the number of compute instances. Smart caching of recently read or written file segments on locally attached disks can minimize the performance penalty of object storage access. Compression can be integrated into this cloud infrastructure solution if objects are stored compressed, while background compression and on-the-fly, random-access-capable decompression ensure that applications remain completely unaware of all that’s happening behind the scenes. When combined with local caching of recently used uncompressed data, the performance penalty involved in compression is insignificant. The result is cloud storage that is on-premises-like, high performance and cost effective at the same time.
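The combination described above (data held as compressed objects, reads served at arbitrary offsets, and a cache of recently used uncompressed data) can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the author's implementation. A plain dict stands in for the object store, zlib stands in for a genomics-aware codec, and the class name, key layout and block size are all hypothetical.

```python
import zlib
from collections import OrderedDict

BLOCK_SIZE = 64 * 1024  # hypothetical uncompressed block size


class BlockCompressedFile:
    """Random-access reads over independently compressed blocks.

    `store` is a dict standing in for an object store (an assumption);
    each block is kept as its own zlib-compressed object.
    """

    def __init__(self, store, name, cache_blocks=8):
        self.store = store
        self.name = name
        self.cache = OrderedDict()  # block index -> uncompressed bytes, in LRU order
        self.cache_blocks = cache_blocks
        self.fetches = 0  # number of reads that went to the object store

    @classmethod
    def write(cls, store, name, data, block_size=BLOCK_SIZE):
        """Split data into blocks and store each one compressed."""
        count = 0
        for start in range(0, len(data), block_size):
            store[f"{name}/{count}"] = zlib.compress(data[start:start + block_size])
            count += 1
        store[f"{name}/meta"] = (len(data), block_size)
        return cls(store, name)

    def _block(self, index):
        """Return one uncompressed block, from the cache when possible."""
        if index in self.cache:
            self.cache.move_to_end(index)
            return self.cache[index]
        self.fetches += 1
        raw = zlib.decompress(self.store[f"{self.name}/{index}"])
        self.cache[index] = raw
        if len(self.cache) > self.cache_blocks:
            self.cache.popitem(last=False)  # evict the least recently used block
        return raw

    def read(self, offset, length):
        """Read bytes like a file would, decompressing only the blocks touched."""
        size, block_size = self.store[f"{self.name}/meta"]
        end = min(offset + length, size)
        parts = []
        while offset < end:
            index, within = divmod(offset, block_size)
            take = min(block_size - within, end - offset)
            parts.append(self._block(index)[within:within + take])
            offset += take
        return b"".join(parts)
```

The caller sees only `read(offset, length)`, which is the property that lets existing tools stay unaware of the compression. A real deployment would put this logic behind a file system interface and cache on local disk rather than in memory.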

Storage Tiering

To mitigate costs, cloud providers now offer multiple tiers of cloud storage service that differ in reliability or trade off storage cost against read/write charges. For example, S3 storage from AWS in the Infrequent Access (IA) tier carries a monthly charge that is 45 percent lower than the Standard tier, but adds read and write fees. To use tiered cloud storage effectively, an implementer must be aware of the structure of raw genomics data and the way it is typically used. Let’s take as an example a sorted, whole-exome or whole-genome BAM file. Once the BAM file is processed through the pipeline, most follow-on analyses will only address a small part of the genome and will therefore only read a small part of the file. Very effective tiering can be realized by having the storage system described above store each BAM file in multiple objects, track the access to each object and manage its storage class accordingly. All this can be combined with blocked compression so that the savings from tiering come on top of the savings from data compression, for an overall reduction in cloud storage cost of close to 90%.
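As a rough illustration of per-object tiering, the Python sketch below tracks when each object was last read and demotes idle objects to a cheaper class. It also includes a helper for the combined effect of compression and tiering on monthly storage charges. The class and function names, the 30-day threshold and the promote-on-read policy are all assumptions made for the example. Only the 45 percent discount comes from the text, and read/write fees are deliberately left out.

```python
from dataclasses import dataclass, field

STANDARD, INFREQUENT = "STANDARD", "INFREQUENT_ACCESS"


@dataclass
class TierManager:
    """Toy access tracker: one entry per stored object (e.g. one BAM block)."""
    idle_days: int = 30  # hypothetical demotion threshold
    last_access: dict = field(default_factory=dict)  # object key -> day last touched
    tier: dict = field(default_factory=dict)  # object key -> storage class

    def put(self, key, today):
        """Register a newly written object in the Standard tier."""
        self.last_access[key] = today
        self.tier[key] = STANDARD

    def read(self, key, today):
        """Record an access and promote the object (a simplifying assumption)."""
        self.last_access[key] = today
        self.tier[key] = STANDARD

    def sweep(self, today):
        """Demote every object that has been idle for at least idle_days."""
        for key, seen in self.last_access.items():
            if today - seen >= self.idle_days:
                self.tier[key] = INFREQUENT


def relative_storage_cost(compression_factor, cold_fraction, cold_discount=0.45):
    """Monthly storage charge relative to uncompressed, all-Standard storage.

    Ignores read/write fees; cold_discount defaults to the 45 percent figure
    quoted for the Infrequent Access tier.
    """
    blended = (1 - cold_fraction) + cold_fraction * (1 - cold_discount)
    return blended / compression_factor
```

For instance, with a hypothetical compression factor of 4 and three quarters of the blocks demoted, `relative_storage_cost(4, 0.75)` evaluates to 0.165625, or roughly one sixth of the baseline storage charge. The inputs are illustrative, not measurements.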

By using these technologies and others that are under development, the promise of effective genomics cloud computing will finally be fulfilled.

Arik Keshet, co-founder and CTO at Geneformics, leads the company’s efforts to optimize data transfer, storage and archiving of genomic data by mitigating the complexity of massive data sets with lossless compression and decompression technology. He can be reached at info@geneformics.com.