A New Machine Learning Approach To Document Classification – A Pfizer/SciBite Collaboration

August 8, 2018

By Joe Mullen and Steve Penn

August 8, 2018 | Bio-IT World Best Practices Award Honorable Mention | In May, we announced the winners of the Bio-IT World Best Practices Awards. Each year these awards are given with the goal of celebrating projects that highlight excellence in bioinformatics, basic and clinical research, and IT frameworks for biology and drug discovery. While only five projects are rewarded, we can’t help but recognize the remaining finalists and honorable mentions for their efforts in elevating the critical role of information technology in modern biomedical research. Here is one of the two honorable mentions. – The Editors

SciBite helps clients better understand the complexities of life science data by transforming unstructured text into rich, contextualized machine-readable content. To do this, SciBite leverages ontologies that enable computers to understand millions of concepts relevant to life science. Referred to as VOCabs, these hand-curated ontologies cover over 100 concept types (including gene, disease, phenotype and drug, to name but a few) and sit at the core of SciBite’s technological offerings. At the heart of these technologies sits TERMite (TERM identification, tagging & extraction), a named entity recognition (NER) and extraction engine. SciBite supports an ever-growing number of use cases, including document classification, a use case developed in collaboration with Pfizer.

For large pharma, knowledge transfer is crucial for successful integration of external research projects or commercial acquisitions into the organization. Much of this knowledge is found in a myriad of free-text documents that must be catalogued and integrated with internal data management systems.

During an acquisition, for example, a company will inherit a plethora of regulatory documents. Pharmaceutical companies are required to align these documents to the electronic Common Technical Document (eCTD) hierarchy, outlined by the FDA. The eCTD consists of numerous “modules” which describe hierarchies for classifying different documents, such as the Module 4 (M4) hierarchy (Figure 1), which covers Nonclinical Study Reports.


To successfully integrate regulatory documents into an organization, a two-step process is required: (1) the extraction of key metadata (such as title, study compound and study species) and (2) the alignment of these documents to a hierarchy (such as the eCTD M4 hierarchy). Carrying out this task manually is incredibly cumbersome and time-consuming, and can result in inconsistencies when multiple FTEs are assigned the task. Furthermore, the complexity of the task is exacerbated by the variation in format of these regulatory documents from one company to another.

To address this, Pfizer and SciBite collaborated on the development of ClassifR, a web-based tool that combines SciBite’s NER platform with novel machine learning (ML) approaches to automate the process of metadata extraction and alignment of incoming documents against a user-defined hierarchy, such as the M4 hierarchy. ClassifR provides a web user interface (WUI) as well as a RESTful API for systematic access and integration with other tools.

Within the ClassifR architecture are multiple bespoke modules, developed specifically for this project, including: an optical character recognition (OCR) module for the conversion of scanned PDF documents to text; an ML module for the extraction of titles from PDFs; a TERMite module for the extraction of key document entities; and a document classification algorithm. Ontologies utilised by TERMite may easily be extended with internal data, such as lists of study codes or compound identifiers, ensuring ClassifR extracts all relevant entities found in submitted documents. In the M4 setting, it is imperative that the compounds and experimental species are extracted from a document. However, other areas such as Pharmaceutical Sciences and Clinical may require the extraction of other entity types such as Batch/Lot or Job Number, or Drug Program Code and Project Name. A vital piece of ClassifR functionality is the ability to customize the entity extraction step, so that simple configuration enables ClassifR to meet the requirements of a cross section of acquisition documents.
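To make the idea of extending a vocabulary with internal data concrete, here is a minimal sketch of dictionary-based entity extraction. It is not TERMite or its API: the `VOCABS` structure, the `extend_vocab` and `extract_entities` helpers, and the compound code `XY-0001` are all hypothetical and chosen purely for illustration.

```python
import re
from collections import defaultdict

# Hypothetical mini-vocabularies: entity type -> {synonym: preferred label}.
# They stand in for curated ontologies; the entries are illustrative only.
VOCABS = {
    "SPECIES": {
        "rat": "Rattus norvegicus",
        "rats": "Rattus norvegicus",
        "dog": "Canis lupus familiaris",
    },
    "COMPOUND": {"aspirin": "acetylsalicylic acid"},
}


def extend_vocab(vocabs, entity_type, terms):
    """Return a copy of vocabs with extra terms (e.g. internal codes) added."""
    extended = {etype: dict(entries) for etype, entries in vocabs.items()}
    extended.setdefault(entity_type, {}).update(
        {term.lower(): label for term, label in terms.items()}
    )
    return extended


def extract_entities(text, vocabs):
    """Find whole-word, case-insensitive vocabulary matches in text."""
    found = defaultdict(set)
    for etype, entries in vocabs.items():
        for synonym, label in entries.items():
            pattern = r"\b" + re.escape(synonym) + r"\b"
            if re.search(pattern, text, flags=re.IGNORECASE):
                found[etype].add(label)
    return {etype: sorted(labels) for etype, labels in found.items()}


# "XY-0001" is a made-up internal compound code added at configuration time.
project_vocabs = extend_vocab(VOCABS, "COMPOUND", {"XY-0001": "XY-0001"})
title = "A 4-week oral toxicity study of XY-0001 in rats"
entities = extract_entities(title, project_vocabs)
print(entities)
```

Because `extend_vocab` returns a copy, each project can carry its own entity types (batch numbers, program codes and so on) without altering the shared base vocabularies.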

The document classification algorithm consists of a two-step approach. A textual rule-based method is first applied to the extracted title of a document in an attempt to assign the document to a relevant category of the M4 hierarchy. If rule-based classification is not possible, an ML approach is then used for the classification task. Individual PDF documents or zip folders can be classified using the WUI or RESTful API.
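The rules-first, model-fallback flow described above can be sketched as follows. The article does not give the actual rules or name the ML model, so everything here is an assumption: the two title patterns in `TITLE_RULES` are invented, a tiny multinomial naive Bayes classifier stands in for the ML step, and passing the document body to the fallback is a guess about what the model sees.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical title rules as (pattern, category) pairs.
TITLE_RULES = [
    (re.compile(r"\btoxicity\b", re.I), "4.2.3 Toxicology"),
    (re.compile(r"\bpharmacokinetic", re.I), "4.2.2 Pharmacokinetics"),
]


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Tiny multinomial naive Bayes with add-one smoothing (stand-in model)."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)
        self.doc_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for token in tokenize(text):
                if token in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[token] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


def classify(title, body, model):
    """Try the title rules first; fall back to the model on the body text."""
    for pattern, category in TITLE_RULES:
        if pattern.search(title):
            return category, "rule"
    return model.predict(body), "model"


# Toy training data, invented for this sketch.
model = NaiveBayes().fit(
    ["receptor binding affinity in vitro",
     "plasma clearance and absorption after oral dose"],
    ["4.2.1 Pharmacology", "4.2.2 Pharmacokinetics"],
)
by_rule = classify("13-week oral toxicity study in rats", "", model)
by_model = classify("Study report 12345", "in vitro receptor binding assay", model)
print(by_rule, by_model)
```

Returning the route (`"rule"` or `"model"`) alongside the category makes it easy to audit which documents needed the fallback.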

Although ClassifR was developed against a specific use case, the WUI provides a framework for creating bespoke “projects.” The project creation framework (Figure 2) includes: an interactive taxonomy builder; a means for uploading, viewing and deleting documents against the sub-categories captured in the taxonomy; and the ability to create new classification models from the project data.
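One way to picture a project is as a tree of categories with documents attached to its nodes, from which labelled training examples can be read off. The sketch below assumes that representation; the `Category` class, its methods and the file names are hypothetical and do not describe ClassifR's internal data model.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    """A node in a user-defined taxonomy, holding documents and sub-categories."""

    name: str
    children: dict = field(default_factory=dict)
    documents: list = field(default_factory=list)

    def add_path(self, path):
        """Create any missing sub-categories along path and return the last one."""
        node = self
        for part in path:
            node = node.children.setdefault(part, Category(part))
        return node

    def remove_document(self, doc):
        """Delete a document from this category if it is present."""
        if doc in self.documents:
            self.documents.remove(doc)

    def training_pairs(self, prefix=()):
        """Yield (document, category path) pairs for every stored document."""
        here = prefix + (self.name,)
        for doc in self.documents:
            yield doc, " > ".join(here)
        for child in self.children.values():
            yield from child.training_pairs(here)


# Build a small taxonomy and attach made-up documents to its sub-categories.
root = Category("M4")
root.add_path(["Toxicology", "Repeat-dose toxicity"]).documents.append("report_a.pdf")
root.add_path(["Pharmacology"]).documents.append("report_b.pdf")
pairs = list(root.training_pairs())
print(pairs)
```

The `(document, category path)` pairs are the shape of input a classification model would need when it is retrained from project data.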


ClassifR v1.0 has been successfully deployed internally and has been heavily tested by experienced personnel involved in the manual process of document classification. During this process, both the accuracy and performance of ClassifR were assessed. When classifying unseen acquisition documents (documents used during neither the training nor the testing stages of model creation), ClassifR was shown to achieve a classification accuracy of ~88%. Furthermore, a large number of acquisition documents can now be classified in a matter of minutes, vastly reducing the time taken when compared to traditional manual approaches. We estimate that, with the capability to scale this solution, around 75% of program team time can be saved, significantly driving down the cost of acquisitions, which have on average 3-5 active regulatory submissions to process.
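For readers unfamiliar with the metric, classification accuracy is simply the fraction of held-out documents whose predicted category matches the one assigned by a human reviewer. The sketch below shows the calculation on a toy set of labels; the labels are invented and are not the evaluation data behind the ~88% figure.

```python
from collections import Counter


def accuracy(predicted, actual):
    """Fraction of documents whose predicted category equals the true one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot score an empty evaluation set")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def per_category_accuracy(predicted, actual):
    """Accuracy broken down by true category, to expose weak categories."""
    totals, hits = Counter(actual), Counter()
    for p, a in zip(predicted, actual):
        if p == a:
            hits[a] += 1
    return {category: hits[category] / n for category, n in totals.items()}


# Toy held-out labels, invented for illustration only.
actual = ["tox", "tox", "pk", "pharm"]
predicted = ["tox", "pk", "pk", "pharm"]
overall = accuracy(predicted, actual)
by_category = per_category_accuracy(predicted, actual)
print(overall, by_category)
```

A per-category breakdown is often more useful than the single overall figure, since it shows which parts of the hierarchy would still need manual review.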

ClassifR ensures that critical knowledge inherited by Pfizer during an acquisition is correctly ingested internally. A financial investment in an asset is made for both the tangible asset and the associated knowledge. By ensuring ingestion of documents is managed correctly, ClassifR helps protect and manage intellectual assets inherited as part of an acquisition. ClassifR has shown how ML can be used concomitantly with semantic natural language processing in a real-world, business-orientated use case. Furthermore, there is massive potential for further applications of ClassifR, for example on internal regulatory documentation, where metadata could be captured and used to create a robust, clean, standardised dataset for downstream analytics.

In summary, the implications of the work developed as part of this project are striking. Pfizer and SciBite are pioneering the use of advanced ML and natural language processing to ensure more efficient strategic business investments.

Joe Mullen is a Senior Informatics Scientist at SciBite where he works closely with customers to provide bespoke solutions using SciBite’s text analytics and semantic enrichment technologies. With a background in data integration and data mining within the field of drug discovery, he is passionate about taking text analytics into the future using machine learning approaches. He can be reached at joe@scibite.com.

Steve Penn is Medicinal Sciences Information Strategy Lead at Pfizer. With over 15 years’ experience in the pharmaceutical sector, he has a wide range of experience in leading document, data and knowledge management strategy, migration, deployment and implementation. He can be reached at steven.penn@pfizer.com.