
Making Semantic Sense of Unstructured Data


Sophia’s Digital Librarian software relates documents without taxonomies.

November 15, 2011 | A somewhat stifling aspect of many sophisticated semantic search tools is the need to build ontologies and taxonomies to organize the data. Nonsense, says David Patterson, the co-founder and CEO of Sophia Search, who believes in freedom from taxonomies or ontologies.

“People are so attuned to building taxonomies and ontologies because they think they need to,” says the Northern Irishman. “Our message is, that just isn’t true! We understand the purpose they serve, but one of our long-term goals has been to build a search tool that doesn’t rely on background knowledge structures. Let’s free people up from the overheads and expense of these knowledge structures.”

The Sophia Digital Librarian search tool thrives on finding relationships between documents without the need for any kind of taxonomy. Just set the software loose on a collection of unstructured data, and it does the rest.

Patterson co-founded Sophia with Vladimir Dobrynin, a professor at St. Petersburg State University and Sophia’s chief science officer. Patterson’s background is in artificial intelligence. He was director of an artificial intelligence R&D lab at the University of Ulster, working on data mining, machine learning, and information retrieval. Patterson has collaborated with Dobrynin since 2003. After some five years of R&D, the researchers came up with a prototype version of Sophia. The Sophia product engineering team is based in Belfast, the R&D team in St. Petersburg, Russia, and the sales team in Silicon Valley, California.

Semantic Searching

“The Sophia Digital Librarian is all about content enrichment within organizations or repositories,” says Patterson. “It helps organizations improve the findability of the content they have in their organization or external repositories.” Interest is high among pharma companies, life science organizations and the scientific publishing community.

“The Digital Librarian is built on Sophia’s patented contextual discovery engine,” says Patterson. “We’re empowering organizations to become more innovative and creative through discovery. Search should not just be about retrieving information you expect to find but also about uncovering and discovering new things you weren’t aware of.”

Taxonomies can stifle innovation, says Patterson, by constraining staff to all think conventionally and uniformly. “You’re not allowing your employees to think freely… If we are constrained by the boundaries of conventional thinking, we limit creativity and our ability to discover new things. Sophia removes these constraints and enables users to discover unknown relationships and knowledge from within their content—that’s the real power behind what we do.”

After reading through documents in an organization’s repository, the Sophia Librarian gets to work. “It extracts the meaning, organizes them into topics, understands what they’re about, captures metadata describing their topic and subtopics, and attaches tags to the document to make them more findable,” says Patterson. Tags can be extracted from the document, but the tool also creates semantic tags—relevant words or phrases that do not actually crop up in the source document.

“We also identify the most similar documents among all the documents in the corpus,” says Patterson. “Then we rank the ‘nearest neighbors’ (the most semantically similar documents). This quickly brings to the user a list of related documents and helps users focus and find information relevant to their needs.”
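To make the idea of ranking nearest neighbors concrete, here is a minimal sketch in Python. It is not Sophia's method: the company's contextual discovery engine is patented and not described in this article. The sketch swaps in a common textbook technique, TF-IDF weighting with cosine similarity, and the function names (`tfidf_vectors`, `cosine`, `nearest_neighbors`) and the three-document corpus are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: the number of documents in which each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed inverse document frequency keeps every weight positive.
        vectors.append(
            {t: tf[t] * (math.log((1 + n) / (1 + df[t])) + 1) for t in tf}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_neighbors(docs, query_index, k=3):
    """Rank the other documents by similarity to docs[query_index]."""
    vectors = tfidf_vectors(docs)
    query = vectors[query_index]
    scored = [
        (i, cosine(query, v)) for i, v in enumerate(vectors) if i != query_index
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy corpus, for illustration only.
corpus = [
    "amyloid protein plaques in alzheimer disease",
    "dementia and amyloid protein in neurodegenerative disease",
    "quarterly earnings report for a publishing company",
]
neighbors = nearest_neighbors(corpus, query_index=0, k=2)
```

Here `neighbors` lists the second document first, because it shares several terms with the query document, while the unrelated earnings report scores zero. A purely term-based measure like this one cannot produce tags or matches for words that never appear in a document, which is the gap a semantic approach is meant to close.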

The metadata, which Sophia calls the “semantic profile,” are exposed as XML through a series of web services, making the information available to any search or content management tool. The goal, says Patterson, is to augment current search tools and make them smart.
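As a rough illustration of what a semantic profile serialized as XML might look like, the Python sketch below builds one with the standard library. The article does not publish Sophia's schema, so every element and attribute name here (`semanticProfile`, `topic`, `tags`, `nearestNeighbors`, `documentId`, `score`) is an assumption, as are the sample values.

```python
import xml.etree.ElementTree as ET


def build_semantic_profile(doc_id, topic, tags, neighbors):
    """Serialize a profile (topic, tags, nearest neighbors) to an XML string.

    The element and attribute names are hypothetical, not Sophia's schema.
    """
    profile = ET.Element("semanticProfile", attrib={"documentId": doc_id})
    ET.SubElement(profile, "topic").text = topic

    tags_el = ET.SubElement(profile, "tags")
    for tag in tags:
        ET.SubElement(tags_el, "tag").text = tag

    neighbors_el = ET.SubElement(profile, "nearestNeighbors")
    for neighbor_id, score in neighbors:
        ET.SubElement(
            neighbors_el,
            "neighbor",
            attrib={"documentId": neighbor_id, "score": f"{score:.3f}"},
        )
    return ET.tostring(profile, encoding="unicode")


# Hypothetical sample values, for illustration only.
xml_text = build_semantic_profile(
    "doc-001",
    "neurodegenerative diseases",
    ["dementia", "amyloid protein"],
    [("doc-042", 0.871)],
)

# A consuming search or content management tool could read the tags back.
parsed = ET.fromstring(xml_text)
tag_list = [el.text for el in parsed.findall("./tags/tag")]
```

Because the profile is plain XML, any tool that can parse XML could index the tags and neighbor links alongside its existing fields.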

In a quick demo, Patterson uses Sophia to automatically create a semantic profile for a document taken from the Web, using knowledge extracted from a corpus of 1.8 million news stories spanning 20 years. Sophia has already indexed the articles automatically and created a semantic profile for each one. “Sophia uses knowledge extracted from these documents to come up with a semantic profile (topic, tags and nearest neighbors) for the new document,” he says.

A similar search can be done using an abstract from PubMed on Alzheimer’s disease, for example, retrieving tags related to dementia, amyloid protein, and neurodegenerative diseases among others. The numerical score offers a gauge of the “distance” between the source document and any retrieved article. “Our capabilities relate topics and documents together that wouldn’t have been otherwise known,” says Patterson. “We’re able to make connections years ahead of [traditional means of] discovery.”

In time, it will be possible to filter those results by source or date, for example by selecting just PDFs or documents within a specific date range. The functionality is there, says Patterson; it just requires a web services user interface.
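The kind of filtering described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Sophia's web services interface, which the article does not document. The `Result` record, its fields, and the `filter_results` function are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Result:
    """Hypothetical search result record; the field names are assumptions."""
    title: str
    file_type: str
    published: date
    score: float


def filter_results(results, file_type=None, start=None, end=None):
    """Keep results matching an optional file type and inclusive date range."""
    kept = []
    for r in results:
        if file_type is not None and r.file_type.lower() != file_type.lower():
            continue
        if start is not None and r.published < start:
            continue
        if end is not None and r.published > end:
            continue
        kept.append(r)
    return kept


# Hypothetical sample results, for illustration only.
results = [
    Result("A", "html", date(2011, 3, 1), 0.91),
    Result("B", "pdf", date(2011, 6, 15), 0.84),
    Result("C", "pdf", date(2009, 2, 10), 0.80),
]
pdfs_2011 = filter_results(
    results, file_type="pdf", start=date(2011, 1, 1), end=date(2011, 12, 31)
)
titles = [r.title for r in pdfs_2011]
```

Only result "B" is both a PDF and dated within 2011, so it is the only one kept; the original ranking order is preserved for whatever remains.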

Customer Stories

Sophia’s sales director Jeff Bierach says the company has a number of customers in the U.S. already, and is targeting many more in the life sciences community and publishing sectors. As discussed in a recent Bio•IT World guest commentary (see, “Reevaluating the Role of Research Librarian in Pharma R&D,” Bio•IT World, Sept 2011), many pharma companies have been downsizing their librarian staff for some time.

“Look at what’s been happening in pharma companies over the last 3-5 years,” says Bierach. “Genentech used to have a staff of over 20 librarians—researchers finding information. They have zero now. That’s really a big play for us, to automate a lot of this capability around discovery and document classification. Those are really time consuming tasks that we can automate.”

One big pharma company had spent three man-years building custom queries for PubMed to extract information on over 30 different topics they track. “Based on the last two years of MEDLINE abstracts which we have indexed, our client was able to reproduce those queries in Sophia in less than one week,” says Bierach. “We essentially solved his entire job that took three years to get to where he is now. There’s a huge amount of value there.”

“We’re starting to create semantic profiles for content of a pharma company in San Francisco,” adds Patterson. “They use Microsoft FAST [search tool] and want to add semantic profiles to the content.” With stagnant drug discovery pipelines, Patterson also sees an opportunity in drug repositioning, helping to find correlations at significant cost savings. 

While most content search applications focus on bespoke internal datasets rather than the web, Patterson says Sophia is looking to work with partners, including a Google reseller in Chicago. “We’ll supply a layer between the Google search appliance and Sophia.”  

Another promising target segment is the publishing sector, which could leverage Sophia to ‘up-sell’ related journal articles to customers. Another early project involved helping a publishing company sift through 20 years of legacy content. “Within three days, we were able to see what information was evergreen. The client said that saved him 9 man-months of effort,” says Patterson.  

Version 1.2 of the Sophia Digital Librarian is now in full commercial release, but Patterson stresses that his team is still building functionality. Because the company is relatively small, “we can move very quickly,” says Bierach.  

This article also appeared in the November-December 2011 issue of Bio-IT World magazine.

 


