Biopharma’s Dark Data Problem

November 19, 2021

Contributed Commentary by Christopher Bouton, Vyasa

Pharmaceutical R&D is a long, arduous and expensive process, and in many cases the years of research, testing and funding end in failure. The Pharmaceutical Research and Manufacturers of America reports that only 1 in 5,000 drug candidates results in a commercial product, and The Journal of the American Medical Association (JAMA) has reported a median capitalized research and development cost of $985 million per product.

That’s a lot of time and money, but it’s not stopping pharmaceutical companies from investing in drug discovery. According to Statista, worldwide spending on pharmaceutical R&D will reach $230 billion by 2026.  

So with investment dollars flowing in and R&D spending on an upward trajectory, all should be well, right?

Well… not exactly. 

The Dark Data Conundrum  

Like companies in many industries, pharmaceutical companies are facing a data problem. During the big data boom of the early 2000s, companies were told to save as much information as possible. They quickly filled up available storage and purchased more infrastructure to house data, without much strategy behind it. While the possibilities of big data seemed endless, what most didn’t foresee was that data they could not find or make sense of would be rendered mostly useless, collecting dust and becoming an extra line item on IT budgets.

As a result, companies have ended up with a deluge of dark data: siloed or hidden data buried under repetitive or obsolete information. Dark data is largely unstructured, meaning it takes the form of written lab notes, reports saved as PDFs, PowerPoint presentations, or scanned images that can hold a wealth of valuable insights. These data sources are incredibly hard to find and, once identified, can take hours, weeks, or months to search through.

For example, in the discovery and development stage, a research scientist may be tasked with analyzing hundreds of patent filings, clinical trial research, or small compound analytics to identify insights relevant to the project at hand. As you can imagine, this is not only a tedious process but one that is ripe for human error. (Just ask anyone who’s had to stare at spreadsheets all day only to find out they’ve overlooked an entire data set.) Further, it typically pulls highly trained individuals away from more valuable work, all while costing the company money.

Shining a Light on Your Dark Data  

With the massive amounts of information being collected by companies every day, this simply isn’t a problem that can be ignored. Deloitte reports that the creation of genomics data alone has increased from approximately ten megabytes per year in the mid-1980s to over 20 petabytes from 2015–19, an increase of over nine orders of magnitude. As digitalization continues, these numbers will only increase as pharmaceutical companies expand their data streams to include wearables and health apps, social media and more.  
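The "nine orders of magnitude" figure can be checked with quick arithmetic. The sketch below assumes decimal (SI) units and, like the quoted comparison, sets a per-year figure against a multi-year total, so it is only a rough sanity check of the scale involved.

```python
import math

# Assumption: decimal (SI) units, 1 MB = 10**6 bytes and 1 PB = 10**15 bytes
MEGABYTE = 10**6
PETABYTE = 10**15

mid_1980s_bytes = 10 * MEGABYTE   # roughly ten megabytes per year
recent_bytes = 20 * PETABYTE      # over 20 petabytes for 2015-19

ratio = recent_bytes / mid_1980s_bytes
orders_of_magnitude = math.log10(ratio)

print(f"ratio = {ratio:.1e}, orders of magnitude = {orders_of_magnitude:.2f}")
```

The ratio works out to about two billion, a little over nine powers of ten.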

Fortunately, new technologies are stepping up to face this challenge head-on. Advancements in deep learning and artificial intelligence (AI) in particular are changing the way companies approach their data, identifying novel relationships between data sources so that connections within dark data are easier to discover. This comes primarily in the form of natural language processing (NLP), which enables algorithms to learn structures of words or images to quickly detect matches within a data set.

Now the researcher tasked with combing through hundreds of patent filings or lab notes can collect the insights they need in a matter of seconds or minutes instead of days or weeks.  
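To make the idea of detecting matches in unstructured text concrete, here is a minimal sketch in Python. It is not deep learning and not any vendor’s method: it swaps in a much simpler keyword-weighting technique (TF-IDF) as a stand-in, and the file names, document text and the `rank_documents` helper are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_documents(query, documents):
    """Return (name, score) pairs sorted by TF-IDF relevance to the query."""
    tokenized = {name: tokenize(text) for name, text in documents.items()}
    n_docs = len(tokenized)

    # Document frequency: how many documents contain each term
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    query_terms = tokenize(query)
    scores = {}
    for name, tokens in tokenized.items():
        counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if counts[term]:
                tf = counts[term] / len(tokens)
                # Smoothed inverse document frequency: rarer terms weigh more
                idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
                score += tf * idf
        scores[name] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical stand-ins for text extracted from lab notes, PDFs and slides
documents = {
    "lab_notes.txt": "Compound A showed kinase inhibition in the cell assay.",
    "patent_summary.txt": "The patent claims a kinase inhibitor for oncology use.",
    "meeting_slides.txt": "Quarterly budget review and storage costs.",
}

results = rank_documents("kinase inhibitor", documents)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

A keyword approach like this treats "inhibition" and "inhibitor" as unrelated words, which is exactly the kind of gap that learned language models aim to close.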

The ironic part is that most of these ideas aren’t in fact new at all. The concepts of deep learning and AI have been around since the dawn of computers. The difference is that, until recently, we didn’t have the computing power or the available data to take full advantage of them. Companies like NVIDIA are leading the way by providing the infrastructure these capabilities need, while advancements in software deployment, such as Helm charts, make them fast and easy to roll out.

A New Beginning for Biopharma R&D  

The data age is ushering in an exciting era for drug discovery. We saw this just recently with the accelerated development of COVID-19 vaccines. While we certainly don’t need another global health crisis to spur innovation, it’s an excellent example of what having the right data and technology at hand can mean for the industry.

The arrival of deep learning and AI won’t mean that every new drug discovery process will now happen in months instead of years. What it will mean is that pharmaceutical companies have greater access to knowledge that can accelerate the development process as a whole, whether that’s identifying relationships between molecular compounds, uncovering potential side effects in a target patient population or discovering a new treatment method for an emerging disease. All of this will lead to a more knowledgeable and efficient industry and, ultimately, a healthier population.


Dr. Christopher Bouton is founder & CEO of Vyasa, a provider of deep learning AI analytics software. Prior to Vyasa, Bouton founded big data analytics company Entagen which was acquired by Thomson Reuters. He also served as head of integrative data mining at Pfizer. Dr. Bouton holds a Ph.D. in molecular neurobiology from Johns Hopkins University. He can be reached at