
Freeing Data, Keeping Structure

By Eric K. Neumann

June 14, 2006 | Data are becoming ubiquitous throughout the Web, in parallel with the growth of social venues such as blogs and wikis. It is important to bear in mind that data without a social context are meaningless: scientists collectively propose and test hypotheses, and data must have that context to support the growth of knowledge. The Human Genome Project was a vision by a community inspired to help other communities address many medical needs in the future. Experiments and the data they produce are in direct response to the questions and hypotheses a scientific community puts forward. Data are always connected to people, views, and hypotheses, and the interplay between them, though sometimes unpredictable, is a formal process requiring consistency and logic. That is why the scientific community will need sufficient structure in addition to flexible collaborative spaces, something current blogs do not yet support.

In this era of Web-based tools and services, applications that use Web resources and data could come from any angle without any prior indication, as illustrated by the various kinds of data mashups that have become mainstream. Google has added a neighborhood map feature of global proportions that can combine item searches with a geographic locator. Map views of cities can be overlaid with pointers to all the bookstores that are in walking distance to coffee shops once the address semantics are used as a coordinate system. Eric Miller, who created this demonstration, refers to this as “recombinant data,” since the mashup can keep track of all the various semantic parts such as addresses or geographic coordinates, similar to how gene alleles can be recombined without (usually) losing their functionality.
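The "recombinant data" idea can be sketched in a few lines: two independently produced datasets are joined purely through their shared coordinate semantics. The store names, coordinates, and the 1 km "walking distance" threshold below are all invented for illustration; in a real mashup the records would come from separate web services.

```python
import math

# Hypothetical sample data, keyed only by shared coordinate semantics.
bookstores = [
    {"name": "Readmore Books", "lat": 42.3601, "lon": -71.0589},
    {"name": "Page One",       "lat": 42.3770, "lon": -71.1167},
]
coffee_shops = [
    {"name": "Bean There", "lat": 42.3605, "lon": -71.0580},
]

def distance_km(a, b):
    """Great-circle (haversine) distance between two lat/lon points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(a["lat"]), math.radians(b["lat"])
    dp = p2 - p1
    dl = math.radians(b["lon"] - a["lon"])
    h = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(h))

def bookstores_near_coffee(max_km=1.0):
    """Recombine the two datasets through their shared coordinates."""
    return [
        (b["name"], c["name"])
        for b in bookstores
        for c in coffee_shops
        if distance_km(b, c) <= max_km
    ]
```

Because neither dataset was formatted for the other, the join works only because both expose the same semantic part, a geographic coordinate, which is exactly the property that makes the data recombinable.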

Mashups illustrate the proposition that data should not be too dependent on any single application. Typically, the phrase “data interoperability” is used to describe this, and several methodologies such as SOA attempt to address the issue. However, I will go one step further and suggest that recombinant data must be captured and defined in a way that is “application independent,” free of any application formatting biases, so that they have value on their own. I am not saying data have value only in the absence of applications, since that would be outlandish; just that they should not be hampered by the constraints of any individual application. Data should have strong value even for future applications that were not considered when the data were created, a property referred to as “re-usefulness.”
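One minimal way to picture application independence is to store facts as subject-predicate-object statements (the RDF model) and let unrelated consumers pull out only the predicates they care about. All names below are illustrative, not from any real vocabulary.

```python
# Facts stored as subject-predicate-object triples, with no formatting
# bias toward any one consuming application. Names are invented.
triples = {
    ("store:page_one", "has_address", "12 Main St"),
    ("store:page_one", "sells", "books"),
    ("store:page_one", "opens_at", "09:00"),
}

def query(subject=None, predicate=None, obj=None):
    """Match triples against an optional pattern; None is a wildcard."""
    return sorted(
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )

# Two unrelated "applications" reuse the same data without reformatting it:
map_pins = query(predicate="has_address")  # a mapping view
hours    = query(predicate="opens_at")     # an opening-hours widget
```

Neither consumer required the data to be exported or reshaped for it, which is the sense in which the data retain value for applications not yet imagined.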

Remixing Research
An interesting real-world application of mashups and re-usefulness comes from the music sampling space. The ccMixter Project allows anyone to choose samples, mix them together (i.e., mash them up), and share them back through the site. It uses the Creative Commons legal RDF tag system (attribution, commercial vs. noncommercial use, and whether the copyright is propagative, i.e., “viral”), which users can set to the copyright terms they want for their aural creations. In ccMixter, the cc:by tag describes who created a specific mix, and any remix automatically carries forward the tree of cc:by tags from all ancestral works into the new creation, forming a phylogeny of how the remix came about. This is a completely new application that the Creative Commons folks had not even considered (remember “application independence”?), yet it clearly shows how a lightweight type-tag system can be used for novel functionality.
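The accumulating cc:by tree can be sketched as a small data structure: each remix records the works it samples, and attribution is collected recursively over all ancestors. The class, field names, and creator tags below are invented for illustration; they are not ccMixter's actual schema.

```python
# Sketch of ccMixter-style attribution: every remix carries the cc:by
# tags of all ancestral works, forming a "phylogeny" of the mix.
class Mix:
    def __init__(self, title, by, sources=()):
        self.title = title
        self.by = by            # cc:by creator tag for this work
        self.sources = sources  # ancestral works sampled into this mix

    def attribution(self):
        """Collect cc:by tags from this work and all its ancestors."""
        tags = {self.by}
        for src in self.sources:
            tags |= src.attribution()
        return tags

original = Mix("Drum Loop", by="alice")
sampled  = Mix("Bass Line", by="bob")
remix    = Mix("Night Remix", by="carol", sources=(original, sampled))
```

Asking the remix for its attribution yields the creators of every ancestral work as well as its own, with no central registry needed.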

Since science involves a social process through peer review of papers, the same Creative Commons tags can be applied to scientific publications. Indeed, this is what Science Commons is attempting to foster for the greater research community. Science Commons executive director John Wilbanks hopes that by adding the ability to tag the legal use and distribution of knowledge and data, Web-based science resources can be guaranteed to be openly available for both academic and commercial R&D, thereby promoting innovation in science by lowering the legal and technical costs of sharing and reusing scientific work. Science Commons’ vision fits very well with the goals of the Public Library of Science (PLoS), which aims to empower the science community by making published knowledge more accessible. Since rights cover data as well, rights metadata can also take full advantage of the RDF model for its representation. Such an experiment has been initiated for Science Commons’ NeuroCommons project. The use of RDF for both content and usability should enable large-scale practices of knowledge sharing and building, analogous to music mashups, but with very beneficial effects.

Recently, the phrase “semantic interoperability” has come into use to highlight the additional need to include data semantics in the data model, as exemplified by NCI’s caBIG project. Here, many forms of data from molecular biology to clinical studies will be brought together with a large variety of tools in support of analytical R&D. Interestingly, combining application independence with semantic interoperability strongly leads one to consider Semantic Web applications to address this problem, since the Semantic Web both defines data semantics (see “RDF — The Web’s Missing Link,” March 2006 Bio•IT World, page 52) and allows incremental association of information that originated from prior analyses.

Where is this taking us? Well, we have hardly explored how to do a scientific mashup or what it means to take advantage of one. One thing is clear: if it is to be based on recombinant scientific data, a data description language such as RDF is necessary. Otherwise, the mashups will result in mush, unusable piles of unparsable data with unknown provenance! One of the project areas people are currently discussing is a Neuroscience Mashup, in which complex sets of data could all be joined by tissue locality as defined by a brain coordinate system. Data about neurological disorders, neurotransmitter receptor types, neural functions, nerve fiber projections, and gene expression could all be co-registered for very powerful analyses and viewing. This scientific mashup would allow collaborating researchers to ask: “What genes are affected in responsive neural cells targeted by p38 inhibitors, and do these same cells go on to form amyloid plaques in Alzheimer’s-affected individuals?” Conceivably, this could have astounding benefits for research and medicine, but we’ll need to begin with a few incremental yet provocative demonstrations.
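The co-registration idea can be sketched as a join over a shared brain-region key, the toy stand-in here for a real brain coordinate system. All region names, gene symbols, and disorder associations below are invented for illustration.

```python
# Toy sketch of the "Neuroscience Mashup": heterogeneous datasets
# co-registered by a shared brain-region key. All data is invented.
gene_expression = {
    "hippocampus": ["MAPK14", "APP"],
    "cerebellum":  ["GRIN1"],
}
disorders = {
    "hippocampus": ["Alzheimer's disease"],
    "striatum":    ["Huntington's disease"],
}

def co_register(*datasets):
    """Join any number of region-keyed datasets on their shared regions."""
    shared = set.intersection(*(set(d) for d in datasets))
    return {region: tuple(d[region] for d in datasets) for region in sorted(shared)}

merged = co_register(gene_expression, disorders)
```

Each additional dataset (receptor types, fiber projections, and so on) would simply be another region-keyed map passed to the same join, which is what makes a shared coordinate system so powerful as a mashup key.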

Eric K. Neumann is senior director of product strategy at Teranode.


