Why NLP Is the Key to Interpreting Unstructured Data in Pharma Drug Discovery

July 19, 2021

Contributed Commentary by Sundeep Reddy Saireddygari 

Natural Language Processing (NLP) is a field of AI that draws on computational linguistics and computer science to help machines infer the grammatical relations, semantic annotations, and sentiment of a clause, in order to better understand human language. NLP allows machines to dissect, interpret, and manipulate human language in both text and speech form. Machines break down unstructured data, such as social media posts, through a data pre-processing phase that turns it into structured data ready for analysis.

Within the healthcare sphere, NLP can be applied in a variety of situations, from reducing preventable deaths caused by medical errors to identifying clinical pathologies. For the Pharma industry, this can mean assessing the market potential of a new drug, as well as targeting patient populations more accurately for existing drugs. NLP is much faster than human processing and can quickly standardize mass amounts of both qualitative and quantitative unstructured data into usable information with actionable insights.

Patent analytics, for example, is one area of drug discovery where manual techniques are currently both time-consuming and costly, relying on internet search reports that must then be collected and reviewed. Developments in NLP as a tool for mining patent literature, however, mean that searching the patent landscape for freedom-to-operate and assessing patentability is easier and less subject to human error than ever before.

Efficient Data Collection

There is a wealth of real-world data from the healthcare and pharmaceutical industries. With such vast amounts of heterogeneous data available, companies can find it hard to consume information quickly and efficiently. That includes typical information from patient data records and results from clinical trials, as well as unstructured text from social media, patient forums, emails, reviews, and customer claims. For instance, NLP is now widely used for sentiment analysis of reviews of new drugs. This takes a purely linguistic approach, studying the sentiment orientation of a clause in a review to decide whether it is positive, neutral, or negative. The resulting data is valuable information for clinicians and drug makers looking to gauge public opinion.
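The classification described above can be sketched in a few lines. This is a deliberately minimal lexicon-based illustration: the positive and negative word lists are invented placeholders, and production systems would instead use a trained model or a curated sentiment lexicon.

```python
# Toy lexicon-based sentiment classifier for drug reviews.
# The word sets below are illustrative placeholders only.
POSITIVE = {"effective", "improved", "relief", "helped", "better"}
NEGATIVE = {"nausea", "worse", "painful", "ineffective", "dizzy"}

def review_sentiment(review: str) -> str:
    """Label one review as 'positive', 'negative', or 'neutral'."""
    words = review.lower().split()
    # Score = positive hits minus negative hits.
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(review_sentiment("This drug helped and my symptoms improved"))  # positive
print(review_sentiment("Constant nausea made everything worse"))      # negative
```

Real pipelines also handle negation ("not effective") and intensity, which a bare word count cannot, but the orientation-scoring idea is the same.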

This diverse array of information is used to train AI/ML models by creating processing tags, which are then applied to make predictions from the collected data. Thanks to its versatility, NLP is now used at all levels of drug discovery. NLP can analyze data and make connections faster than its human counterparts, saving both time and money. In addition to helping drug companies prioritize their responses to customers, building customer satisfaction and maximizing the use of customer feedback, it also facilitates a better understanding of potential patent landscapes, as well as improving target identification and clinical pathways for new drugs.

Natural Language Processing (NLP) Breakdown

NLP automatically breaks down the syntax and semantics of the human language to understand it better. Where humans have had a lifetime of training in understanding the nuances of speech, NLP uses training techniques to pick apart the grammatical structure of a sentence and take into account the meaning of words in context with one another. 

This analysis is generated in a four-step process: tokenization, part-of-speech-tagging, stemming and lemmatization, and stop word removal. Sentences are initially tokenized into single clauses or smaller semantic units before part-of-speech-tagging categorizes the words as nouns, adjectives, verbs, and so on. Words are subsequently standardized by being reduced to their root form through stemming and lemmatization, with stop word removal used for filtering out commonly used words such as articles and prepositions.
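The four steps above can be sketched in miniature. The tiny part-of-speech lexicon, stop word list, and suffix-stripping stemmer below are stand-ins for what libraries such as NLTK or spaCy provide; they exist only to make each stage concrete.

```python
import re

# Placeholder resources; real pipelines use trained taggers and
# full stop word lists from NLTK, spaCy, or similar libraries.
STOP_WORDS = {"the", "a", "an", "of", "for", "to", "in"}
POS_LEXICON = {"drug": "NOUN", "reduced": "VERB", "symptoms": "NOUN",
               "patients": "NOUN", "new": "ADJ"}

def tokenize(text):
    """Step 1: split a sentence into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def pos_tag(tokens):
    """Step 2: label each token with a part of speech (toy lookup)."""
    return [(t, POS_LEXICON.get(t, "OTHER")) for t in tokens]

def stem(token):
    """Step 3: crude suffix stripping, standing in for stemming/lemmatization."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def remove_stop_words(tokens):
    """Step 4: drop high-frequency function words."""
    return [t for t in tokens if t not in STOP_WORDS]

sentence = "The new drug reduced symptoms in patients"
tokens = tokenize(sentence)
print(pos_tag(tokens))
print(remove_stop_words([stem(t) for t in tokens]))
```

Note that the crude stemmer reduces "reduced" to the non-word "reduc"; this is characteristic of stemming, whereas lemmatization would return the dictionary form "reduce" by using vocabulary and morphological analysis.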

This transformation of unstructured into structured data is then used to build either a rule-based approach or a machine learning algorithm. A rule-based approach involves linguistic experts or knowledge engineers encoding grammatical rules into algorithms, while the alternative uses tagged unstructured data to train machines to make predictions and perform repetitive tasks by example. In both cases, an NLP algorithm makes it possible to sort through large volumes of unstructured sources, producing insights while correlating and connecting data.
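A rule-based step of the kind described above might look like the following sketch, where a hand-written pattern extracts dosage mentions from free text. The pattern and the clinical note are invented for illustration; a knowledge engineer would maintain many such rules, while the machine learning alternative would learn equivalent tags from annotated examples instead.

```python
import re

# Hand-written rule: match a number followed by a dosage unit,
# e.g. "500 mg" or "250mg". Illustrative only; real rule sets
# cover far more unit spellings and contexts.
DOSAGE_RULE = re.compile(r"(\d+(?:\.\d+)?)\s?(mg|mcg|ml)\b", re.IGNORECASE)

def extract_dosages(text):
    """Return (value, unit) pairs for every dosage mention found."""
    return [(float(value), unit.lower())
            for value, unit in DOSAGE_RULE.findall(text)]

note = "Patient started on 500 mg twice daily, later reduced to 250mg."
print(extract_dosages(note))  # [(500.0, 'mg'), (250.0, 'mg')]
```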

Benefits of NLP 

Gathering information from traditional techniques such as focus groups, interviews, and questionnaires is a financial burden and time-consuming for pharmaceutical companies. NLP provides new ways to address even larger population sizes through the complex analysis of online posts and reviews from patients. Using real-world and up-to-date information from sources like social media and forums also helps companies identify and respond to any concerns or developments, as NLP filters through the unstructured data to produce clear breakdowns of information. 

The benefit of NLP for drug discovery processes is its clear ability to speed up, and reduce errors in, data analysis and development pipelines. This is good news for the millions of patients who, according to the World Health Organization (WHO), are affected every year by inaccurate or delayed diagnoses that add billions in costs to healthcare systems worldwide.

NLP algorithms can assist at all stages of the drug discovery pipeline, from analyzing clinical trial digital pathology data to identifying predictive biomarkers, and as NLP usage becomes more widespread, so too will its applications. Growing adoption and technological advancement reinforce each other, each accelerating the other. NLP is now as common in everyday features like predictive text on our mobile phones as it is in specialized industries such as pharmaceuticals and its deployment in drug discovery.


Sundeep Reddy Saireddygari is a Principal Data Consultant at Gramener, which solves business problems for its clients by identifying insights from data and presenting them as stories. Sundeep advises executives at leading enterprises on getting more from their data investments by architecting data solutions and delivering value. He helps them align their business vision with data science strategy and choose the right initiatives to move them up in data maturity. As a Principal Data Consultant, he drives and manages data science projects and advises client teams on managing and quantifying project ROI. He can be reached at sundeep.saireddygari@gramener.com