Freeing Data, Keeping Structure


By Eric K. Neumann

June 14, 2006 | Data are becoming ubiquitous throughout the Web, in parallel with the growth of social venues such as blogs and wikis. It is important to bear in mind that data without a social context are meaningless: scientists propose and test hypotheses collectively, and data must sit within that social context to support the growth of knowledge. The Human Genome Project was the vision of a community inspired to help future communities address many medical needs. Experiments and the data they produce are a direct response to the questions and hypotheses a scientific community puts forward. Data are always connected to people, views, and hypotheses, and the interplay among them, though sometimes unpredictable, is a formal process requiring consistency and logic. That is why the scientific community will need sufficient structure in addition to flexible collaborative spaces, something current blogs do not yet support.

In this era of Web-based tools and services, applications that use Web resources and data can come from any angle without prior indication, as illustrated by the various kinds of data mashups that have become mainstream. Google has added a neighborhood map feature of global proportions that can combine item searches with a geographic locator. Once address semantics are used as a coordinate system, map views of cities can be overlaid with pointers to all the bookstores within walking distance of coffee shops. Eric Miller, who created this demonstration, refers to this as “recombinant data,” since the mashup can keep track of all the various semantic parts such as addresses or geographic coordinates, similar to how gene alleles can be recombined without (usually) losing their functionality.
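The bookstore and coffee shop overlay can be sketched in a few lines once both datasets carry coordinates. This is only an illustration of the idea, not a description of how the demonstration was built: the shop names, coordinates, and the 500-meter "walking distance" threshold are all hypothetical.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    r = 6_371_000  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical records from two unrelated sources; the only thing they
# share is the coordinate system.
bookstores = [
    {"name": "Book Nook", "lat": 42.3736, "lon": -71.1190},
    {"name": "Far Pages", "lat": 42.4500, "lon": -71.2000},
]
coffee_shops = [
    {"name": "Bean There", "lat": 42.3740, "lon": -71.1200},
]

def bookstores_near_coffee(bookstores, coffee_shops, max_m=500):
    """Names of bookstores with at least one coffee shop within max_m meters."""
    return [
        b["name"]
        for b in bookstores
        if any(
            haversine_m(b["lat"], b["lon"], c["lat"], c["lon"]) <= max_m
            for c in coffee_shops
        )
    ]

print(bookstores_near_coffee(bookstores, coffee_shops))
```

Neither list was designed with the other in mind; the join works only because both describe location in the same terms.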

Mashups illustrate the proposition that data should not depend too heavily on any single application. Typically, the phrase “data interoperability” is used to describe this, and several methodologies such as SOA attempt to address the issue. However, I will go one step further and suggest that recombinant data must be captured and defined in a way that is “application independent,” free of any application formatting biases so that they have value on their own. I am not saying data only have value without any application, since that would be outlandish; just that they should not be hampered by the constraints of any individual application. Data should have strong value even for future applications that were not considered when the data were created, a property referred to as “re-usefulness.”

Remixing Research
An interesting real-world application of mashups and re-usefulness comes from the music sampling space. The ccMixter Project allows anyone to choose other people's samples, mix them together (i.e., mash them up), and share the results back through the site. It uses the Creative Commons legal RDF tag system (by whom, commercial vs. noncommercial use, propagative (viral) copyrighting or not), which users can set to the copyright values they want for their aural creations. In ccMixter, the cc:by tag describes who created the specific mix, and any remix automatically carries the tree of cc:by tags from all ancestral works into the new creation, forming a phylogeny of how the remix came about. This is a completely new application that the Creative Commons folks had not even considered (remember “application independence”?), yet it clearly shows how a lightweight type-tag system can be used for novel functionality.
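The way a remix inherits attribution from its ancestors can be pictured with a small sketch. It is an illustration only, not ccMixter's actual implementation: the `Work` class, its `by` field (standing in for the cc:by value), and the sample works are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Work:
    title: str
    by: str  # stands in for the cc:by attribution value
    sources: list = field(default_factory=list)  # works this one samples

def attribution_chain(work):
    """Creators credited for a work and all its ancestors, nearest first.

    Walks the ancestry depth-first; each work is visited once and each
    creator is listed once. Assumes the ancestry contains no cycles.
    """
    visited, seen_creators, order = set(), set(), []
    stack = [work]
    while stack:
        w = stack.pop()
        if id(w) in visited:
            continue
        visited.add(id(w))
        if w.by not in seen_creators:
            seen_creators.add(w.by)
            order.append(w.by)
        # Reverse so that sources are visited in their listed order.
        stack.extend(reversed(w.sources))
    return order

# Hypothetical samples and remixes.
drums = Work("Drum loop", "alice")
vocals = Work("Vocal take", "bob")
remix1 = Work("Remix 1", "carol", [drums, vocals])
remix2 = Work("Remix 2", "dave", [remix1, drums])

print(attribution_chain(remix2))
```

The chain for `remix2` credits its own creator first, then everyone upstream, which is the "phylogeny" the tag tree makes recoverable.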

Since science involves a social process through peer review of papers, the same Creative Commons tags can be applied to scientific publications. Indeed, this is what Science Commons is attempting to foster for the greater research community. Science Commons executive director John Wilbanks hopes that by adding the ability to tag the legal use and distribution of knowledge and data, Web-based science resources can be guaranteed to be openly available for both academic and commercial R&D, and thereby promote innovation in science by lowering the legal and technical costs of sharing and reusing scientific work. Science Commons’ vision fits well with the goals of the Public Library of Science (PLoS), which aims to empower the science community by making published knowledge more accessible. Since rights cover data as well, those rights can also be represented using the RDF model. Such an experiment has been initiated for Science Commons’ NeuroCommons project. The use of RDF for both content and usability should enable large-scale practices of knowledge sharing and building, analogous to music mashups, with very beneficial effects for research.

Recently, the phrase “semantic interoperability” has been used to highlight the additional need to include data semantics in the data model, as exemplified by NCI’s caBIG project (https://cabig.nci.nih.gov). Here, many forms of data, from molecular biology to clinical studies, will be brought together with a large variety of tools in support of analytical R&D. Interestingly, combining application independence with semantic interoperability leads one strongly toward Semantic Web applications as a way to address this problem, since the Semantic Web both defines data semantics (see “RDF — The Web’s Missing Link,” March 2006 Bio•IT World, page 52) and allows information from prior analyses to be associated incrementally.

Where is this taking us? Well, we have hardly explored how to do a scientific mashup or what it means to take advantage of one. One thing is clear: If it is based on recombinant scientific data, a data description language such as RDF is necessary. Otherwise, the mashups will result in mush: unusable piles of unparsable data with unknown provenance! One of the project areas people are currently discussing is a Neuroscience Mashup, where complex sets of data could all be joined by tissue locality as defined by a brain coordinate system. Data about neurological disorders, neurotransmitter receptor types, neural functions, nerve fiber projections, and gene expression could all be co-registered for very powerful analyses and viewing. This scientific mashup would allow collaborating researchers to ask: “What genes are affected in responsive neural cells targeted by p38 inhibitors, and do these same cells go on to form amyloid plaques in Alzheimer’s-affected individuals?” Conceivably, this could have astounding benefits for research and medicine, but we’ll need to begin with a few incremental yet provocative demonstrations.
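To make the co-registration idea concrete, here is a minimal sketch that stores facts as subject-predicate-object triples, in the spirit of RDF, and gathers everything attached to one brain region. It uses plain Python tuples rather than a real RDF toolkit, and the identifiers, predicates, and facts are illustrative placeholders, not curated findings.

```python
# Each fact is a (subject, predicate, object) triple. All names and
# facts below are hypothetical placeholders for illustration.
triples = {
    ("gene:APP", "expressedIn", "region:hippocampus"),
    ("receptor:NMDA", "locatedIn", "region:hippocampus"),
    ("disorder:Alzheimers", "affects", "region:hippocampus"),
    ("gene:TH", "expressedIn", "region:substantia_nigra"),
}

def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )

def co_registered(triples, region):
    """Group every subject attached to a region by the linking predicate."""
    grouped = {}
    for subject, predicate, _ in match(triples, o=region):
        grouped.setdefault(predicate, []).append(subject)
    return grouped

print(co_registered(triples, "region:hippocampus"))
```

Because every fact names its region explicitly, data from different sources can be merged by simply taking the union of their triples; the shared region identifier plays the role of the brain coordinate system.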

Eric K. Neumann is senior director of product strategy at Teranode. E-mail: eneumann@teranode.com.

 

For reprints and/or copyright permission, please contact RMS, 1808 Colonial Village Lane, Lancaster, PA;

(717) 399-1900 ext. 125 or via email to bio-itworld@theygsgroup.com.