Dec. 17, 2007 | Many OSINT (Open Source Intelligence) resources are available online, but the real art of OSINT analysis lies in framing the right questions and producing the right kind of analysis for decision makers. While many free and commercial search engines and content aggregators are publicly available, answering specific intelligence questions about biopharma portfolios and practices from open sources requires a tailored process of retrieval, discovery, and refinement.
Software for enabling this process includes: (1) OSINT search and content extraction tools (RSS aggregators, Web crawling and scraping); (2) Content classification, categorization, and clustering tools; (3) Entity and relationship extraction tools; (4) Taxonomy and ontology creation and management tools; (5) Presentation software including visualization; (6) Desktop search engines (personal knowledge refinement by individual analysts); and (7) Analytics tools, both quantitative (increased chatter, number of mentions, word co-occurrence, etc.) and qualitative (sentiment, opinion).
Many readers will have ample experience using (if not developing) such tools, whether building a corporate-wide search strategy, modeling data as taxonomies or ontologies, or building semantic-web applications. Many vendors supply these tools, and some of the products are quite cutting-edge. However, a more complete approach to intelligence collection and analysis requires focusing on system integration and the final intelligence product rather than on the newest, coolest piece of technology. Big pharma research informatics groups probably already license several interesting tools. Look at what you've got and think about linking those tools into an overarching process that addresses your specific questions.
The collection and analysis of OSINT requires a focused, synchronized set of processes. Open sources should be categorized by source type, content relevance, source reputation, and coverage quality. We routinely monitor over 150 reputable blogs that contain unique information to identify potential trends. I also keep a spreadsheet of online sources and rank them by their perceived accuracy and utility. When my content harvesting engines pull source data following a query, I check the returned sources against my list, update the list, and iterate on my collection process. The curated list also provides a head start in creating a taxonomy that can be passed to your search engine.
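A minimal sketch of that bookkeeping step, in Python, might look like the following. The `Source` fields, the 1-to-5 rating scale, the additive score, and the example domains are all illustrative assumptions, not a description of the actual spreadsheet or harvesting engines.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Source:
    domain: str
    source_type: str  # e.g. "blog" or "newswire" (hypothetical categories)
    accuracy: int     # perceived accuracy, 1 (low) to 5 (high)
    utility: int      # perceived utility, 1 (low) to 5 (high)

    @property
    def score(self) -> int:
        # Assumed ranking rule: a plain sum of the two ratings.
        return self.accuracy + self.utility


def domain_of(url: str) -> str:
    """Return the host part of a URL, lowercased, without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def triage(harvested_urls, known_sources):
    """Split harvested URLs into ranked known sources and unknown domains."""
    by_domain = {s.domain: s for s in known_sources}
    known, unknown = set(), set()
    for url in harvested_urls:
        d = domain_of(url)
        (known if d in by_domain else unknown).add(d)
    # Highest score first; ties are broken alphabetically by domain.
    ranked = sorted((by_domain[d] for d in known),
                    key=lambda s: (-s.score, s.domain))
    return ranked, sorted(unknown)
```

Calling `triage(urls, sources)` after each harvest returns the already-rated sources in rank order, plus a list of unfamiliar domains that an analyst can review and add to the source list.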
I use several entity and relationship extraction tools to aid in taxonomy creation, and I also collect terms manually while browsing the Web. This approach has been especially fruitful in monitoring trends in biomarker discovery and validation. When traditional sources like Reuters or AP publish articles on personalized medicine, I can compare these articles against my biomarker taxonomy to make sure my automated systems haven't missed anything.
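One way to picture that cross-check is the Python sketch below. The taxonomy fragment, the article identifiers, and the whole-phrase matching rule are assumptions made for illustration; a production extraction tool would handle synonyms, inflections, and abbreviations far more carefully.

```python
import re

# Hypothetical fragment of a biomarker taxonomy: category -> terms.
TAXONOMY = {
    "biomarker types": ["predictive biomarker", "prognostic biomarker"],
    "methods": ["gene expression profiling", "proteomics"],
}


def match_taxonomy(text, taxonomy):
    """Return {category: [terms found]} using case-insensitive phrase matching."""
    found = {}
    for category, terms in taxonomy.items():
        hits = [t for t in terms
                if re.search(r"\b" + re.escape(t) + r"\b", text, re.IGNORECASE)]
        if hits:
            found[category] = hits
    return found


def missed_articles(articles, harvested_ids, taxonomy):
    """IDs of articles that match the taxonomy but were not harvested automatically."""
    return [article_id for article_id, text in articles.items()
            if article_id not in harvested_ids and match_taxonomy(text, taxonomy)]
```

Here `articles` maps an identifier to the article text and `harvested_ids` is the set of identifiers the automated systems already collected; anything returned by `missed_articles` is a candidate gap in the collection process.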
Monitoring and Analysis of OSINT Sources
Monitoring the open source ecosystem for real-time changes to Web content is critical. Relevant information should be received as soon as it's published and organized in a systematic way. Business analysts should be alerted to new blogs, articles, and featured sites with informative content and be able to add and organize blogs and news feeds without being technology experts. They should also be able to view content from one feed at a time, groups of feeds, or all related feeds at once.
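As a rough illustration of detecting changes to Web content, the Python sketch below compares two snapshots of monitored pages. It assumes the page text has already been fetched by some other component, and it uses a content hash as the change signal; both choices are assumptions for the example, and a real surveillance tool would also need to filter out trivial changes such as rotating ads or timestamps.

```python
import hashlib


def fingerprint(content: str) -> str:
    """Hash whitespace-normalized page text so spacing-only edits are ignored."""
    normalized = " ".join(content.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def snapshot(pages):
    """Map each URL to the fingerprint of its current content."""
    return {url: fingerprint(text) for url, text in pages.items()}


def diff_snapshots(old, new):
    """Compare two snapshots and report new, removed, and changed URLs."""
    return {
        "new": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(u for u in old.keys() & new.keys()
                          if old[u] != new[u]),
    }
```

Running `diff_snapshots(snapshot(yesterday), snapshot(today))` on two dictionaries of URL-to-text yields the lists an alerting step could forward to the relevant analysts.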
Successful monitoring should include:
- Automated web page surveillance, with alerts to the relevant analysts when significant changes to sources of interest occur;
- Identification of new (or removed) content from open sources of interest;
- Sorting and filtering of web page changes by the date and time the change was detected, by page content category (blogs, chat, etc.), or by watchlist grouping;
- The ability to rate and flag pages to suit analysis needs.
Analysis should include:
- Search by concept and example, as well as by keyword;
- Discovery: the identification and extraction of items likely to contain relevant information for analysts, e.g. identifying trends, competitors, new entities of interest, etc.;
- Entity and relationship extraction of persons, groups, products, etc., and population of taxonomies and ontologies to understand their relationships and associations to each other and to previously defined networks;
- Visualization of networks and non-intuitive relationships for improved understanding and trend-spotting;
- Calculation of quantitative metrics such as chatter volume, keyword co-occurrence, new network connections, etc.;
- Calculation of qualitative metrics such as sentiment (a person's or organization's feelings or emotional response, as manifested by descriptions in open sources) and changes in direction or opinion;
- Determination of cogent answers to questions that surfaced during structured discovery.
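Two of the quantitative metrics listed above, number of mentions and keyword co-occurrence, can be sketched in a few lines of Python. The watch terms and the decision to count at the document level (a term counts once per document, however often it appears) are assumptions for this example, not a prescribed method.

```python
import re
from collections import Counter
from itertools import combinations


def terms_in(doc, watch_terms):
    """Return the set of watch terms that appear in a document (case-insensitive)."""
    lowered = doc.lower()
    return {t for t in watch_terms
            if re.search(r"\b" + re.escape(t.lower()) + r"\b", lowered)}


def mention_and_cooccurrence(docs, watch_terms):
    """Count documents mentioning each term and documents mentioning each term pair."""
    mentions = Counter()
    pairs = Counter()
    for doc in docs:
        present = sorted(terms_in(doc, watch_terms))
        mentions.update(present)
        # Sorting first gives each pair one canonical (alphabetical) ordering.
        pairs.update(combinations(present, 2))
    return mentions, pairs
```

Computing these counters over successive time windows and comparing them is one simple way to spot increased chatter or newly linked terms.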
Jim Golden is a CTO at SAIC. He can be reached at firstname.lastname@example.org
Subscribe to Bio-IT World magazine.