Unlocking Life Sciences Data Challenges With Knowledge Graphs

Contributed Commentary by Tom Woodcock, SciBite

November 11, 2022 | The life sciences industry generates huge volumes of data each day. When data is left in its raw form, researchers and data scientists cannot sift through it quickly enough to answer pressing questions. By building a knowledge graph, a visual, interconnected network describing the relationships between existing data and knowledge entities, researchers can answer questions more quickly than ever before, and these nuanced answers support better decision making.

Knowledge graphs combine information to create an interlinked network that describes relationships between different entities. They currently streamline many of our everyday digital experiences, underpinning applications such as Google search, social media websites, and streaming media recommendation engines. With their ability to define complex and overlapping relationships, e.g., visualizing the hundreds of interactions that take place between proteins and molecules at a cellular level, knowledge graphs have rich applications in the life sciences. When used properly, they can give insights into new treatment targets, unravel the mechanisms around a disease, or identify the ripple effects of certain genetic mutations. 

How Are Knowledge Graphs Built?

To build a knowledge graph, named entity recognition (NER), natural language processing (NLP) and machine learning can be used to recognize, understand and connect data. Knowledge graphs represent specific relationships between data and knowledge entities in a machine-processable form called triples. Each triple defines a specific relationship between two things and is either pulled from an existing ontology or extracted automatically. For example, in the sentence “The wasp gene is implicated in Wiskott-Aldrich syndrome”, NER can be used to recognize the term ‘wasp’ as a gene (not an insect) and annotate it with data from HGNC Gene ontology (Label: WAS, ID:HGNC_12731). The triple ‘WAS -(implicated in)- Wiskott-Aldrich syndrome’ could then be extracted and additional triples added from the HGNC ontology, including ‘WAS -(member of)- WAS protein family’.
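
The wasp example can be sketched in a few lines of Python. This is a minimal illustration only: the one-entry lexicon and the single regular expression are hypothetical stand-ins for real NER and NLP components, and the names `Triple`, `GENE_LEXICON`, `ONTOLOGY_TRIPLES` and `extract_triples` are invented for this sketch rather than taken from any library.

```python
import re
from typing import List, NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical mini-lexicon standing in for a real NER model:
# lower-cased surface form -> (preferred label, identifier from the example).
GENE_LEXICON = {"wasp": ("WAS", "HGNC_12731")}

# Triples assumed to be available from the ontology itself.
ONTOLOGY_TRIPLES = [Triple("WAS", "member of", "WAS protein family")]

# One hand-written pattern; a real pipeline would rely on NLP, not a regex.
PATTERN = re.compile(r"(\w+) gene is implicated in (.+?)\.?$")


def extract_triples(sentence: str) -> List[Triple]:
    """Extract an 'implicated in' triple and enrich it from the ontology."""
    match = PATTERN.search(sentence)
    if match is None:
        return []
    surface, disease = match.groups()
    entry = GENE_LEXICON.get(surface.lower())
    if entry is None:
        return []  # the term was not recognized as a gene
    label, _identifier = entry
    triples = [Triple(label, "implicated in", disease)]
    # Add every ontology triple whose subject is the recognized gene.
    triples += [t for t in ONTOLOGY_TRIPLES if t.subject == label]
    return triples


triples = extract_triples("The wasp gene is implicated in Wiskott-Aldrich syndrome")
for t in triples:
    print(t.subject, "-", t.predicate, "-", t.obj)
```

Storing each relationship as a (subject, predicate, object) tuple keeps the extracted triple and the ontology-derived triple in the same shape, so they can be merged into one graph without conversion.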

As a result, knowledge graphs can be used to visualize, describe and map intricate, overlapping relationships. This rich model yields far more relevant results than a keyword search of the literature and helps researchers find insights faster.

Well-formatted data is vital to any knowledge graph, and it must be relevant to the application. Data must be managed and sourced carefully for it to be truly effective. However, research data, study reports, images and other texts often lack meaning without any context, representing a challenge for machines that rely on a critical mass of data before they can start learning. 

Meaningful Data 

The data to build knowledge graphs can come from many sources: clinical trial records, journal articles, public databases like BioGRID and ClinVar, third-party tools and databases, and proprietary and experimental data. To really make the most of the data, knowledge graphs should be designed with the end goal in mind. This includes the use of specialized ontologies to harmonize data sets and make them searchable. 

This stage may require semantic technology to transform unstructured text into structured information, assign it to a category and extract the relationship information. This will enable deeper insights, highlight connections and reduce complexity. By applying domain-specific ontologies and using cross-checked IDs, knowledge graphs can increase scientific rigor. The automated AI process becomes less of a ‘closed box’ and gives researchers increased confidence in the decisions made.
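
As a rough illustration of harmonizing data sets with ontology IDs, the sketch below groups records from two imaginary sources that name the same gene differently. The records and the `SYNONYMS` table are invented for this example, and the table is only a stand-in for a real ontology lookup; the one identifier used is the one from the wasp example earlier in this article.

```python
from collections import defaultdict

# Hypothetical synonym table standing in for an ontology lookup:
# lower-cased gene name -> identifier.
SYNONYMS = {"was": "HGNC_12731", "wasp": "HGNC_12731"}

# Invented records from two imaginary sources.
RECORDS = [
    {"source": "trial_records", "gene": "WASP", "note": "record 1"},
    {"source": "journal_articles", "gene": "WAS", "note": "record 2"},
    {"source": "journal_articles", "gene": "unknown_gene", "note": "record 3"},
]


def harmonize(records):
    """Group records by ontology ID and collect names that cannot be mapped."""
    by_id = defaultdict(list)
    unmapped = []
    for record in records:
        identifier = SYNONYMS.get(record["gene"].strip().lower())
        if identifier is None:
            unmapped.append(record)  # keep for manual review, do not guess
        else:
            by_id[identifier].append(record)
    return dict(by_id), unmapped


by_id, unmapped = harmonize(RECORDS)
print(sorted(r["source"] for r in by_id["HGNC_12731"]))
print([r["gene"] for r in unmapped])
```

Keeping unmapped names in a separate list, rather than dropping them or guessing an ID, is one simple way to keep the process inspectable.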

A key task in building datasets for knowledge graphs is ensuring data is FAIR, that is, findable, accessible, interoperable, and reusable. Without comprehensive, harmonized data that is comparable, building the systems and writing the instructions used to query the knowledge graph will be much more challenging.

Once clean, accurately described and appropriately formatted, data in a knowledge graph becomes interoperable; it can be exchanged and utilized, providing a solid footing on which to build graph models. 

The Road to Success

Knowledge graphs are a dynamic source of information: they can update in real time or as desired, constantly pulling in new information from the defined data sources. This allows them to evolve based on a semantic network of incoming information. By deeply mining data and leveraging latent knowledge, researchers can answer questions such as which targets show potential for an indication, which drugs interact with each other, or whether a drug could be repurposed to treat another disease with a similar biological pathway.
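
The repurposing question can be phrased as a short path query over triples. The sketch below is a toy under stated assumptions: every entity name (`drug_A`, `disease_X`, `pathway_P` and so on) is a placeholder, not real data, and the predicates `treats` and `involves` are chosen for illustration only. A production graph would typically sit in a graph database with its own query language.

```python
from collections import defaultdict

# Placeholder triples; none of these entities are real.
TRIPLES = [
    ("drug_A", "treats", "disease_X"),
    ("drug_B", "treats", "disease_Z"),
    ("disease_X", "involves", "pathway_P"),
    ("disease_Y", "involves", "pathway_P"),
    ("disease_Z", "involves", "pathway_Q"),
]


def build_index(triples):
    """Index triples in both directions for simple lookups."""
    forward = defaultdict(set)   # (subject, predicate) -> objects
    inverse = defaultdict(set)   # (object, predicate) -> subjects
    for subject, predicate, obj in triples:
        forward[(subject, predicate)].add(obj)
        inverse[(obj, predicate)].add(subject)
    return forward, inverse


def repurposing_candidates(triples, disease):
    """Drugs that treat another disease sharing a pathway with `disease`."""
    forward, inverse = build_index(triples)
    candidates = set()
    for pathway in forward.get((disease, "involves"), set()):
        related = inverse.get((pathway, "involves"), set()) - {disease}
        for other_disease in related:
            candidates |= inverse.get((other_disease, "treats"), set())
    # Exclude drugs already linked to the disease itself.
    return candidates - inverse.get((disease, "treats"), set())


print(repurposing_candidates(TRIPLES, "disease_Y"))
```

The query walks from the disease to its pathways, out to other diseases on those pathways, and back to the drugs that treat them. A shared pathway is only a hypothesis generator here; it says nothing about whether the drug would actually work.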

The opportunities to use knowledge graphs in the life sciences are vast. Their power is in identifying and exploiting relationships between data and knowledge entities to find answers, but good data practice and trusted sources are necessary to capitalize on this approach. The use of data in knowledge graphs has the potential to speed up drug discovery, generate insights or predictions on clinical outcomes, and ultimately get treatments to patients faster.

Tom Woodcock is a professional services consultant at SciBite, an Elsevier company. Tom works with big pharma to inform decision making and help them through data disambiguation and harmonization. He offers an invaluable combination of data science and scientific domain expertise, with over 20 years of experience in biological sciences spanning three continents. He has a Master's degree in Molecular and Cellular Biology, and a PhD in Pharmaceutical Science. He can be reached at tom.woodcock@scibite.com.