
OSINT and the Pharmaceutical Enterprise

By James Golden

Nov. 13, 2007 | In its July 2004 report, the 9/11 Commission recommended the creation of an “open-source” intelligence agency, somewhat different from the CIA and NSA. Open Source Intelligence (OSINT) is defined by the Director of National Intelligence as intelligence “produced from publicly available information that is collected, exploited, and disseminated in a timely manner to an appropriate audience for the purpose of addressing a specific intelligence requirement.” OSINT focuses on creating actionable intelligence from public information, allowing other Federal agencies to focus on creating primary intelligence from covert human sources or intercepted electronic communications.

Many OSINT resources are available via the millions of Web pages, blogs and databases on the World Wide Web. Using these open sources to produce quality, actionable intelligence requires dedicated research production processes and advanced search, retrieval, discovery, characterization and analysis software. I believe we have a similar opportunity within the world of biopharma: much of the information needed to create sophisticated intelligence and analysis regarding drug discovery and development, regulatory, and sales and marketing projects is available through OSINT.

Obviously, information technology is essential to the pharmaceutical enterprise. However, IT systems that help create meaningful pharmaceutical OSINT have not kept pace with more process-driven information systems. Every pharma company has invested in tools and techniques for enterprise-wide search, document discovery and management, knowledge management and business intelligence, and possibly even semantic-web type tools such as ontology creation and text mining. Such tools can play an important part in creating a pharmaceutical OSINT system to better inform company decision makers.

However, technology is only one piece of the puzzle. A more central issue is knowing what questions to ask and accurately determining what constitutes an answer.

Question asking (and answering) is a fine art. In our work as consultants, we’re often asked a variety of questions, from “What does Wall Street think of our CEO?” to “How many biomarkers can we actually use as prognostics in our clinical trials?”  Answering each of these questions requires a different approach (we’re fortunate at SAIC to have decades of National Security Intelligence experience to call upon).

In the upcoming months, I’ll explore some of the concepts that are useful in building OSINT systems for drug intelligence. Here, I focus on two important topics: mining the Deep Web and taking advantage of developments in Web 2.0.

Deep Web Mining
To create relevant intelligence from OSINT sources, analysts require a well thought-out process for querying, discovery and analysis, as well as access to software, systems and processes to collect and monitor open sources for relevant information about biopharma business practices. These sources include information regarding genomic targets and biomarkers; disease epidemiology; competitive intelligence; clinical trial information, including trial design, endpoints, and enrollment statistics for a company's own trials and those of its competitors; business intelligence surrounding suppliers, vendors, distributors and partners; regulatory standards and recommendations; sales and marketing data (including physician script data); and investor relations and market sentiment.

To discover content on the Web, search engines typically use web crawlers that follow hyperlinks. This technique is ideal for discovering resources on the surface Web, but is often ineffective at finding Deep Web resources. (The Deep Web, also called the Deepnet, invisible Web or hidden Web, refers to World Wide Web content that is not part of the surface Web indexed by search engines.) For example, these crawlers seldom find dynamic pages generated by database queries, because the number of possible queries is effectively unbounded.
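As a rough illustration of why link-following misses query-driven pages, here is a minimal sketch of a crawler run against a tiny in-memory "site." The page paths, the PAGES dictionary, and the aspirin query are all hypothetical, and a production crawler would fetch pages over HTTP rather than from a dictionary.

```python
from html.parser import HTMLParser

# Hypothetical in-memory "site": path -> HTML. No network access is involved.
PAGES = {
    "/": '<a href="/about">About</a> <a href="/trials">Trials</a>',
    "/about": '<a href="/">Home</a>',
    # The trials page exposes a search form but no link to any result page.
    "/trials": '<form action="/search"><input name="drug"></form>',
    # Generated only in response to a form query; nothing links to it.
    "/search?drug=aspirin": "<p>3 trials found</p>",
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start):
    """Return the set of pages reachable from start by following hyperlinks."""
    seen, frontier = set(), [start]
    while frontier:
        path = frontier.pop()
        if path in seen or path not in PAGES:
            continue
        seen.add(path)
        parser = LinkExtractor()
        parser.feed(PAGES[path])
        frontier.extend(parser.links)
    return seen


surface = crawl("/")            # pages a link-following crawler can find
deep = set(PAGES) - surface     # pages it never reaches
print(sorted(surface))
print(sorted(deep))
```

In this toy setup the query result page ends up in the "deep" set because the crawler ignores forms. Deep Web tools generally have to submit queries themselves to reach such content.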

Deep Web resources may be classified into one or more of the following categories:

•  Dynamic content: dynamic pages returned in response to a query or accessed only through a form.
•  Unlinked content: pages that no other page links to (pages without backlinks, or inlinks), which may prevent Web crawling programs from reaching the content.
•  Limited access content: sites that require registration or otherwise restrict access to their pages (e.g., using the Robots Exclusion Standard to ask crawlers to stay out), which keeps search engines from crawling them and creating cached copies.
•  Scripted content: pages only accessible through links produced by JavaScript or Flash.
•  Non-text content: multimedia (image) files, Usenet archives and documents in non-HTML file formats such as PDF and DOC.
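To make the "limited access" category above more concrete, the following sketch uses Python's standard urllib.robotparser module to check whether a crawler is asked to stay away from a URL under the Robots Exclusion Standard. The robots.txt rules, the example.com URLs, and the "example-osint-bot" agent name are assumptions made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; real sites publish their own rules.
ROBOTS_TXT = """\
User-agent: *
Disallow: /members/
"""

parser = RobotFileParser()
# parse() accepts the file's lines, so no network request is needed here.
parser.parse(ROBOTS_TXT.splitlines())


def may_crawl(url, agent="example-osint-bot"):
    """Return True if the parsed rules allow this agent to fetch the URL."""
    return parser.can_fetch(agent, url)


print(may_crawl("https://example.com/press/release.html"))
print(may_crawl("https://example.com/members/pipeline.html"))
```

Under these assumed rules, the press page is open to crawlers while anything under /members/ is excluded. That is one way such content stays out of search engine indexes even though it sits on a public site.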

Creating a complete OSINT picture for drug portfolio intelligence requires access to Deep Web content. Several vendors supply Deep Web mining applications as part of an overall search strategy. One product I particularly like is Deep Query Manager from BrightPlanet. I’ve had good luck using it to collect data from hard-to-reach places on the Internet.

Web 2.0
“Web 2.0” refers to a perceived second generation of Web-based communities and hosted services such as social networking sites, wikis, and folksonomies that facilitate collaboration and sharing between users. Though the term suggests a new version of the Web, it does not refer to updated Web technical specifications, but to changes in the ways systems developers have used the web platform.

Alluding to the version-numbers that commonly designate software upgrades, the phrase “Web 2.0” hints at an improved form of the World Wide Web; advocates suggest that technologies such as blogs, social bookmarking, wikis, podcasts, RSS feeds (and other forms of many-to-many publishing), social software, and online Web services imply a significant change in web usage.

Web 2.0 can also refer to the transition of web sites from information silos to sources of content and functionality. It likewise describes a social phenomenon: an approach to generating and distributing Web content characterized by open communication, decentralization of authority, freedom to share and re-use, and “the market as a conversation.”

This is an intriguing idea considering how isolated most research labs tend to be. While most drug discovery researchers do collaborate, the number of nodes in those networks is usually small.

The use of Web 2.0 technologies to enable online communities of interest and social networking is critical to OSINT analysts. These communities allow users of varying interests to connect, network, communicate and publish content on many topics, including several that would be relevant to drug industry best practices, portfolio valuation, and related technologies. Web communities such as MySpace and Friendster, and especially scientifically focused communities such as SciLink, are important OSINT sources for collecting biopharmaceutical intelligence.
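RSS feeds, one of the Web 2.0 publishing mechanisms mentioned earlier, are among the easier community sources to monitor programmatically. The sketch below filters a small RSS 2.0 feed for watch terms using only the Python standard library. The feed content, the example.com links, and the watch terms are invented for illustration, and a real monitor would download feeds on a schedule rather than read an inline string.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed, inlined so the example runs without a network.
FEED = """<rss version="2.0">
  <channel>
    <title>Example Community Feed</title>
    <item>
      <title>New biomarker candidate for early-stage trials</title>
      <link>https://example.com/posts/1</link>
    </item>
    <item>
      <title>Conference travel tips</title>
      <link>https://example.com/posts/2</link>
    </item>
  </channel>
</rss>"""

# Assumed watch list; an analyst would tailor this to the question at hand.
WATCH_TERMS = ("biomarker", "clinical trial")


def matching_items(feed_xml, terms):
    """Return (title, link) pairs for items whose title contains any term."""
    root = ET.fromstring(feed_xml)
    hits = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        if any(term in title.lower() for term in terms):
            hits.append((title, link))
    return hits


hits = matching_items(FEED, WATCH_TERMS)
for title, link in hits:
    print(title, link)
```

Simple keyword matching like this is only a starting point. As noted earlier, the harder problem is knowing which questions to ask and what counts as an answer.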

Jim Golden is CTO at SAIC. He can be reached at

