Data Wrangling: Priming UK Biobank Data to Drive Discovery

February 16, 2021

By Stan Gloss

February 16, 2021 | TRENDS FROM THE TRENCHES | Meet biopharma’s new superheroes: Ariella Sasson of Bristol Myers Squibb (BMS), Will Salerno of Regeneron and David Sexton of Biogen. They’re all data wranglers but have slightly different roles and responsibilities. They’re also collaborators—and competitors—who have equal access to the rich data sets available through the UK Biobank project.

While “data wrangler” is not anyone’s actual title, the term aptly describes the functional role of those working at the next level of data management, beyond traditional and straightforward data extraction, transformation, and loading processes. Much of the data that wranglers handle is raw, complex, and unstructured, and it comes in a variety of formats. Schemas may be experimental or initially undefined, which further complicates how the data will be managed, especially when it arrives terabytes at a time, as it has for the last decade or so.

The phrase has been used on the internet since 2017 and is associated with the team at the UK Biobank. There, 500,000 volunteers are participating in a multi-year study tracking their health, which can be correlated with the whole genome and exome sequences being generated for each participant. All the data is anonymized and fully compliant with the most stringent data privacy laws. The effort is epic and offers unparalleled insights into the mechanisms of disease and new avenues for the development of novel therapeutics.

Ariella Sasson, principal scientist at BMS, outlines the potential of the UK Biobank dataset. “One day, I hope to be able to go into the dataset and easily ask a question like, ‘Show me all the people with a BMI greater than 25 with a particular disease or biomarker’ and see which sequences and health outcomes are in that dataset,” she said. She sees her role as making it as efficient and painless as possible to do all these desired analyses. [Editor’s note: Since this conversation, Sasson has taken a new position. She is now Global Life Sciences Solutions Architect at Amazon Web Services.]
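A question like the one Sasson describes amounts to a filter over participant records. The following is a minimal sketch in Python, assuming hypothetical field names (`bmi`, `diagnoses`) and made-up records; it is not how the UK Biobank data are actually organized or queried.

```python
from dataclasses import dataclass, field


@dataclass
class Participant:
    # All field names here are hypothetical, chosen only for illustration.
    participant_id: str
    bmi: float
    diagnoses: set = field(default_factory=set)


def select_cohort(participants, min_bmi, diagnosis):
    """Return participants with BMI strictly above min_bmi who carry the diagnosis."""
    return [
        p for p in participants
        if p.bmi > min_bmi and diagnosis in p.diagnoses
    ]


# Made-up records, not real data.
participants = [
    Participant("P001", 27.4, {"type 2 diabetes"}),
    Participant("P002", 22.1, {"type 2 diabetes"}),
    Participant("P003", 31.0, {"asthma"}),
    Participant("P004", 29.8, {"type 2 diabetes", "asthma"}),
]

cohort = select_cohort(participants, min_bmi=25, diagnosis="type 2 diabetes")
cohort_ids = [p.participant_id for p in cohort]
print(cohort_ids)
```

With half a million participants and thousands of fields, the hard part is less the filter itself than getting the data into a shape where such a filter is possible.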


Wrangling Gobs of Data

To enable that vision for pharma researchers like Sasson, the UK Biobank sends human samples to the Regeneron Genetics Center, which over the last five years has sequenced more than a million exomes, including almost all of the UK Biobank participants. These UK Biobank exomes are a public resource funded jointly by the UK Biobank Exome Sequencing Consortium (UKB-ESC), a research partnership between the UK Biobank and biopharma leaders including AbbVie, Alnylam, AstraZeneca, Biogen, BMS, Pfizer, Regeneron, and Takeda. Regeneron’s unique role in this consortium means they are on the hook to make sure the data get delivered correctly and on time.

Enter Will Salerno, Senior Director of Genome and Sequencing Informatics at the Regeneron Genetics Center. Salerno and his production team are the last stop before the data go out the Regeneron door. The volume of data moving back and forth between the RGC, the UK Biobank, and the consortium members is non-trivial, comprising millions of files in dozens of formats totaling close to a petabyte of storage as of 2020. Salerno explained his team’s role as the generators and stewards of the raw sequence data, “We turn raw data into actionable data formats—whether it be variant call files, QC metrics, or whatever we get asked for—then feed that back to our lab so that we can monitor the quality of the data, finally aggregating those data for our internal and external collaborators.”

A prerequisite data-hygiene step is required along with a full copy and data aggregation step to ensure the integrity of the original data source. Once these critical steps are complete, data wranglers enable scientists and other domain experts to analyze combinations of various data to identify genomic variation, associate phenotypes to gene expression in translational studies and so on. Such work is essential within biopharma as it is foundational to modern drug discovery and the cornerstone of precision medicine.

“For every one of our collaborations, we periodically aggregate the data that we sequenced for them, and we assemble that to what we call the data freeze that we make available to them,” Salerno explained. “I think last year, the RGC did about two freezes a week for the entire year, which is just a really impressive rate to put data out.”

Given the tremendous responsibility, quality control, transparency, data security demands, and sheer volume of data that his team handles, Salerno considers the term data wrangler “a little bit too fun.” Plus, he adds, “the data wrangling isn’t limited to handling the data itself. A lot of it is communicating with our partners to make sure that everyone involved understands what’s going on.”


Where Genotype Meets Phenotype

The UK Biobank collects and manages all the phenotypic data: more than 17,000 fields for each of the 500,000 volunteer participants. They then make that data available in the public domain to enable analysis of the genetic predisposition to disease in the context of environmental exposure. To uphold data privacy, the UK Biobank deidentifies then encrypts all the data, locking consortium members out of the gene-phenotype combinations until the members are presented with a key—a linking file—for each new batch of data generated.

The linking file contains the linker codes which are needed to decode the encrypted data and must be applied systematically to correlate sequence with phenotype and health outcomes data for each patient. Although Salerno and his team at Regeneron generate the sequence data, his efforts are firewalled from his colleagues tasked with making new discoveries. Synchronous access for consortium members ensures that members gain access to the data at exactly the same time and hence, have an equal opportunity to study it.
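Conceptually, applying the linking file is a join between two sets of identifiers. The sketch below assumes the linking file reduces to a simple mapping from a sequence sample ID to a phenotype participant ID; the actual file format, ID schemes, and decryption step are not described in this article, and every name and value here is hypothetical.

```python
def link_records(linker, sequences, phenotypes):
    """Join sequence and phenotype records through a linker mapping.

    linker:     {sequence_sample_id: phenotype_participant_id}
    sequences:  {sequence_sample_id: sequence_record}
    phenotypes: {phenotype_participant_id: phenotype_record}

    Returns (linked, unmatched). linked maps participant id to a
    (sequence_record, phenotype_record) pair. unmatched lists sample ids
    that could not be joined, so gaps are reported rather than dropped silently.
    """
    linked = {}
    unmatched = []
    for sample_id, seq_record in sequences.items():
        participant_id = linker.get(sample_id)
        if participant_id is None or participant_id not in phenotypes:
            unmatched.append(sample_id)
            continue
        linked[participant_id] = (seq_record, phenotypes[participant_id])
    return linked, unmatched


# Made-up identifiers and values, for illustration only.
linker = {"SEQ_A": "PH_17", "SEQ_B": "PH_42"}
sequences = {"SEQ_A": "exome_A.vcf", "SEQ_B": "exome_B.vcf", "SEQ_C": "exome_C.vcf"}
phenotypes = {"PH_17": {"bmi": 27.4}, "PH_42": {"bmi": 22.1}}

linked, unmatched = link_records(linker, sequences, phenotypes)
print(linked)
print(unmatched)
```

Without the linker mapping, the two tables share no common key, which is why consortium members cannot combine sequence and phenotype data until the file arrives.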

Sasson and David Sexton, Senior Director Genome Technologies and Informatics at Biogen, eagerly await each new batch of sequence data, applying the critical linking file to match the whole exome data with the rich phenotype data. “Access to the linker file also starts the clock ticking where we [the consortium members] have one year of access to it before it goes into the public domain,” Sexton explains.


Planning Is Critical

The first batch of data exported from the UK Biobank created a few wrinkles for data wranglers and arrived with an unpleasant surprise: a space in one of the directory names. “When you're doing anything on a Linux system in the command line, spaces cause significant amounts of problems because the command line doesn't read the space in context,” Sasson explained. “It just reads it as a separator. Not all code plays nicely with spaces, so it was much easier to just fix it by erasing all the spaces.” The data wranglers worked collaboratively with the UK Biobank team so that there would be no issue with nomenclature in future batches.
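To make the problem concrete, here is a minimal Python sketch. It first uses `shlex.split` to show how shell-style parsing breaks an unquoted path at the space, then renames every file and directory in a throwaway tree so that no name contains a space. The directory and file names are invented for the demonstration, and the helper `remove_spaces_from_names` is hypothetical, not the consortium's actual fix.

```python
import os
import shlex
import tempfile


def remove_spaces_from_names(root):
    """Rename every file and directory under root so no name contains a space.

    Walks bottom-up so children are renamed before their parent directory moves.
    Returns the list of (old_path, new_path) renames performed.
    """
    renames = []
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            if " " not in name:
                continue
            old_path = os.path.join(dirpath, name)
            new_path = os.path.join(dirpath, name.replace(" ", ""))
            if os.path.exists(new_path):
                raise FileExistsError(f"refusing to overwrite {new_path}")
            os.rename(old_path, new_path)
            renames.append((old_path, new_path))
    return renames


# An unquoted space splits one path into two shell-style arguments.
split_args = shlex.split("ls exome batch/sample1.vcf")
print(split_args)

# Demonstrate the rename on a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "exome batch", "run 1"))
    open(os.path.join(root, "exome batch", "run 1", "sample 1.vcf"), "w").close()
    renames = remove_spaces_from_names(root)
    cleaned_exists = os.path.exists(
        os.path.join(root, "exomebatch", "run1", "sample1.vcf")
    )
```

Quoting or escaping every path also works, but only if every script in the pipeline does it consistently, which is why removing the spaces at the source is often the simpler choice.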

“The feedback that we get from the pharma partners is critical to making sure that we catch any sort of issues early, and get them fixed quickly, which will save us all a lot of time and money in the long run,” Salerno said. “Troubleshooting, quality control, fixing those problems, it’s what we do. We're operating here at scales that very few [people], if anyone, has ever operated at before. We're doing it at a speed which I really do think is unprecedented.”

Collaboration is also moving forward at scale. For example, the Regeneron Genetics Center group actively participates in nearly 100 collaborative efforts at a time. Logistically, maintaining data privacy and upholding proprietary restrictions takes a lot of data wrangling. “A very large component of what we do is working directly with our collaborators to answer their questions, to take their feedback, to facilitate specific analyses that they need,” Salerno said.

Data wranglers are told in advance when new batches of data will arrive. Because consortium members have a limited period of data exclusivity, every day counts. Advance planning can confer a competitive advantage even though each consortium member theoretically has the same chance of making a discovery. “We're prioritizing the types of questions that we're going to work on,” Sasson explained. “We do not have unlimited resources. We don't just process data for the sake of processing data. We want to be able to help targets move stages, support FDA submissions, and answer impactful questions.”

She continued, “If the data's too big, it becomes unmanageable and unimaginable, especially if you don't have experience with this kind of data. So, if you bring your question into very narrow focus, you can learn how to manipulate the data, get trustworthy results, maybe learn some of the advantages and disadvantages of the data, and still figure out whether your result is a good result or not. I think you have to define your questions very well, otherwise you can get carried away by the noise.”

This question-first perspective helps guide discovery, Sexton agreed, and this is where his role is crucial. “It does require a certain type of person that helps the biologists with this data,” he said. “Our genome or clinical genomics team spends a lot of time with the biologist interpreting this data for them and helping them to understand.”

But first Salerno recommends reading the manual. “The number one thing I'll tell anyone when they're getting into a new data set is to read the full manual. Good data always comes with a README. Then take that information, compare it with your specific aims and your scientific goals to see what you really need to do in order to add something to the existing body of work.”


No Byte Left Behind

Wranglers are responsible for coordinating the migration of insanely large packets of data from one location to another and for ensuring the integrity of the data that’s moved. Transferring data packets from one cloud to another today is generally seamless, albeit expensive, but data wrangling pre-cloud created a lot of sleepless nights. “The bigger your data set, the more likely you were going to see failures,” Sasson remembered. “There was always the potential of a crash, and so you were always monitoring your system to make sure that your jobs were processing accordingly… I spent a lot of sleepless nights just monitoring the system to make sure everything got processed properly within the timeframe. It was a lot of time spent on the forensic accounting of bioinformatics.”
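The article does not say how integrity was verified. One common approach, shown here only as an assumption, is to compare a cryptographic checksum of each file before and after it is moved. The sketch below uses Python's standard `hashlib` on small temporary files; the file names and contents are made up.

```python
import hashlib
import os
import shutil
import tempfile


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files never need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_transfer(source_path, dest_path):
    """Return True when the destination copy is byte-identical to the source."""
    return sha256_of_file(source_path) == sha256_of_file(dest_path)


with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, "source.bin")
    good_copy = os.path.join(workdir, "good_copy.bin")
    bad_copy = os.path.join(workdir, "bad_copy.bin")

    with open(source, "wb") as handle:
        handle.write(b"ACGT" * 1000)
    shutil.copyfile(source, good_copy)
    # Simulate a truncated transfer by writing slightly less data.
    with open(bad_copy, "wb") as handle:
        handle.write(b"ACGT" * 999)

    good_ok = verify_transfer(source, good_copy)
    bad_ok = verify_transfer(source, bad_copy)
```

In practice the source checksum would typically be computed once and shipped alongside the data, so the receiver can check each file without access to the original.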

Today, while the logistics have been greatly improved, costs are still high. Even data stored in the cloud creates a cost-burden because it has to be moved, sometimes between cloud providers.

“Transfer costs for 100,000 whole exomes are on the order of $4,000. We get half a million whole exomes in every batch, so you multiply that times five, which is about $20,000 for all the transfers. Putting the analyzed data back in is only a fraction of that,” Sexton calculated. “AWS is running the analysis using Hail which is a resource hog so we’re paying for Hail’s inefficiency. We’re paying for both the compute time and the time of the person who’s doing the analysis.”
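Sexton's estimate is a straightforward linear scaling, which can be written out as a short Python calculation. The figures are the ones he quotes; the assumption that cost scales linearly with the number of exomes is implied by his arithmetic rather than stated as a pricing rule, and the function name is illustrative.

```python
def transfer_cost(num_exomes, cost_per_block=4_000, block_size=100_000):
    """Scale a per-block transfer cost linearly to num_exomes (in USD)."""
    return num_exomes / block_size * cost_per_block


batch_cost = transfer_cost(500_000)
print(f"${batch_cost:,.0f}")  # prints $20,000
```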

Given that some of the biopharma consortium members are located around the block from each other, it almost raises the question of why a loaded hard disk can’t be walked down the street. Almost.

There’s also a human toll. Data wrangling is hard, labor-intensive, and stressful work. Working for employers who understand the pressure makes that burden more manageable. It’s about aligning the mindset of the company with the efforts of the wranglers. “This is a marathon. It's not a sprint,” said Sasson.


Big Picture, Big Rewards

Yet the consortium members understand the significance of the UK Biobank project. “This is a very big investment for every pharmaceutical company. This data's precious. They've spent a lot of money on it and recognize the huge potential,” Sasson said.

Salerno credits this perspective with the success of the consortium. “The biggest challenge is making a dataset that is useful to a wide group of people doing a lot of different things,” he said. “Building consensus with this group has been fantastic. Everyone gets the big picture. We understand where every group has to compromise a little bit.”

For both BMS and Biogen, the big picture view is already returning value. At BMS, Sasson said the company sees novel targets rise to significance, “because there's just so many more samples than anything ever published before.” Biogen is applying UK Biobank findings to all internal, in-process programs, Sexton said.

And the returns will only increase, Salerno believes. “Ten years down the road, we're going to look back and say that the UK Biobank study was the model, but we will also see what these data didn't capture, and that will point us toward the next ten years.”