Semantic Web’s ‘Snap, Crackle, and Pop!’


By Eric K. Neumann

May 15, 2007 | In describing the benefits of the Semantic Web, people often say it is about open standards and data semantics. That is true, but it is too vague for most people to see what they would gain from it. In particular, it does not satisfy those who want to know which problems it can be applied to and what its advantages are. One needs to understand how to apply such tools to real-world informatics problems of data structure, discoverability, and usability.

To this end, a series of new W3C Recommendations is being developed in support of building and using the Semantic Web. What follows is a brief introduction to SPARQL, GRDDL, and POWDER: the “Snap, Crackle, Pop!” of the Semantic Web.

SPARQL
SPARQL is a query language for RDF-structured data. The equivalent of SQL for Web data, it works in a similar way, using SELECT, WHERE, and FROM clauses. The difference in SPARQL is that one constrains a query by using triples with variables (prefixed with ‘?’), rather than table fields, to match the RDF data:

PREFIX ls: <http://lifesci.org/1.0>
SELECT ?gene ?go_process
FROM <myDataSource>
WHERE {
  # each line is a triple pattern; names starting with ? are variables
  ?gene ls:has_Process ?go_process .
  ?go_process ls:associated_with <disease#CardioVascular> .
}

When executed, the above query finds all genes in myDataSource whose GO Processes are associated with cardiovascular diseases. The information returned is whatever satisfies the WHERE clause: a table of data when SELECT is used, and an RDF graph when CONSTRUCT is used. SPARQL is meant to work not just with data stored as RDF (à la triples, see “The Missing Link,” Bio•IT World, March 2006), but also as an interface to existing relational databases (RDB); in such cases, SPARQL is dynamically translated into SQL calls, and the results are returned as RDF. This is especially useful when dealing with current data systems and legacy databases, and means that existing data can be readily ‘exposed’ as Semantic Web resources. By the time this article is published, SPARQL will likely be a Candidate Recommendation, a major step toward becoming a W3C Recommendation.
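To make the idea of matching triples with variables concrete, here is a minimal Python sketch of how the two patterns in the WHERE clause could be evaluated against a small in-memory set of triples. It is only an illustration of the matching principle, not how any SPARQL engine is actually implemented; the gene and GO identifiers in TRIPLES, and the function names match_pattern and select, are invented for this example.

```python
# Toy triple store: each entry is (subject, predicate, object).
# All identifiers below are made up for illustration.
TRIPLES = [
    ("gene:NOS3", "ls:has_Process", "go:vasodilation"),
    ("gene:TP53", "ls:has_Process", "go:apoptosis"),
    ("go:vasodilation", "ls:associated_with", "disease#CardioVascular"),
]


def match_pattern(pattern, triple, bindings):
    """Try to unify one triple pattern with one triple.

    Returns an extended copy of `bindings`, or None if they conflict.
    """
    result = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # Variable: bind it, or check it against an earlier binding.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            # Constant: must match the triple exactly.
            return None
    return result


def select(variables, patterns, triples):
    """Return one row per solution that satisfies every pattern."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [
            extended
            for bindings in solutions
            for triple in triples
            if (extended := match_pattern(pattern, triple, bindings)) is not None
        ]
    return [tuple(solution[v] for v in variables) for solution in solutions]


rows = select(
    ["?gene", "?go_process"],
    [
        ("?gene", "ls:has_Process", "?go_process"),
        ("?go_process", "ls:associated_with", "disease#CardioVascular"),
    ],
    TRIPLES,
)
print(rows)  # [('gene:NOS3', 'go:vasodilation')]
```

Because ?go_process appears in both patterns, only genes whose process is also linked to the disease survive the second pattern, which is the join that a SQL query would express across table fields.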

GRDDL
Before one can query with SPARQL, there need to be RDF data sources. GRDDL (Gleaning Resource Descriptions from Dialects of Languages) is a mechanism that enables HTML and XML files to be transformed into RDF. Since most people have developed a lot of technology around HTML and XML, they are hesitant to throw it away for something new. GRDDL allows the subtle insertion of a “hook” in such documents, so that those wishing to retrieve the file as RDF can perform a transformation on the HTML or XML document itself. Transformations are often performed using a specified XSL translator file referenced from within the head tag of the original document:

<head profile="http://www.w3.org/2003/g/data-view">
    <link rel="transformation" href="http://www-sop.inria.fr/acacia/soft/RDFa2RDFXML.xsl"/>
    <title>The Semantic Web</title>
</head>

The current file is used as the input of the XSLT transform, and the newly generated RDF document is the output, which can be further processed and stored by servers. GRDDL will be a Proposed Recommendation in a few weeks’ time, also on the path to becoming a W3C Recommendation.
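The discovery half of this process, finding the “hook” in a document, can be sketched in a few lines of Python using only the standard library. This is an illustrative sketch and not a complete GRDDL processor: it only looks for the data-view profile on the head tag and collects the transformation links, and it does not fetch or run the XSLT. The class and function names are assumptions made for this example.

```python
from html.parser import HTMLParser

# Profile URI that marks a document as GRDDL-enabled (from the example above).
GRDDL_PROFILE = "http://www.w3.org/2003/g/data-view"


class GrddlLinkFinder(HTMLParser):
    """Collect transformation links from a GRDDL-enabled HTML document."""

    def __init__(self):
        super().__init__()
        self.has_profile = False
        self.transformations = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "head":
            # The profile attribute may hold several space-separated URIs.
            profiles = (attributes.get("profile") or "").split()
            self.has_profile = GRDDL_PROFILE in profiles
        elif tag == "link":
            rels = (attributes.get("rel") or "").split()
            if "transformation" in rels and attributes.get("href"):
                self.transformations.append(attributes["href"])


def find_transformations(html_text):
    """Return the transformation URLs, or [] if the profile is absent."""
    finder = GrddlLinkFinder()
    finder.feed(html_text)
    finder.close()
    return finder.transformations if finder.has_profile else []


PAGE = """<html>
<head profile="http://www.w3.org/2003/g/data-view">
  <link rel="transformation" href="http://www-sop.inria.fr/acacia/soft/RDFa2RDFXML.xsl"/>
  <title>The Semantic Web</title>
</head>
<body></body>
</html>"""

print(find_transformations(PAGE))
```

A client that wants RDF would then apply each returned stylesheet to the page with an XSLT processor of its choice; a client that only wants HTML can ignore the hook entirely.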

RDFa is an additional and related syntax for embedding RDF directly into HTML pages through the inclusion of tag attributes (property, rel) that say what the subject is, and its relations to the object resources or literals (strings):

<body>
    <h1>A life science Semantic Web: are we there yet?</h1>
    <dl about="http://www.doi.org/10.1126/stke.2832005pe22">
        <dt>Title</dt>
        <dd property="dc:title">A life science Semantic Web: are we there yet?</dd>
        <dt>Author</dt>
        <dd rel="dc:creator" href="#a1">
            <span id="a1">
                <link rel="rdf:type" href="[foaf:Person]" />
                <span property="foaf:name">Eric Neumann</span>
                see <a rel="foaf:homepage" href="http://www.eneumann.org">homepage</a>
            </span>
        </dd>
    </dl>
</body>

This is converted by the appropriate translator into the following RDF statements (i.e., subject-verb-object)…

http://www.doi.org/10.1126/stke.2832005pe22
   dc:title “A life science Semantic Web: are we there yet?”

http://www.doi.org/10.1126/stke.2832005pe22
   dc:creator http://www.myOrg.com/references/rdf_sem.html#a1

http://www.myOrg.com/references/rdf_sem.html#a1
   rdf:type foaf:Person
   foaf:name “Eric Neumann”
   foaf:homepage http://www.eneumann.org

Using RDFa, web pages can be read by humans and also processed by software tools that extract the RDFa metadata directly from the page content. This metadata can be collected, stored in databases, and used for further mining. It also allows plug-ins to find metadata such as meeting dates and times and push them into calendars. RDFa serves as a powerful yet simple bridge between the current web and its future manifestation.
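As a rough illustration of how a tool might pull such metadata out of a page, the Python sketch below extracts only the literal-valued statements, pairing each property attribute with the most recent about attribute. It is deliberately simplified and is not a conforming RDFa processor: it ignores rel and href links, does not scope subjects to their enclosing elements, and does not handle nested elements of the same name inside a property element. The class name is invented for this example.

```python
from html.parser import HTMLParser


class LiteralPropertyExtractor(HTMLParser):
    """Extract (subject, predicate, literal) triples from about/property.

    Simplifications: the subject is the most recent `about` value seen,
    and `rel`/`href` links are ignored entirely.
    """

    def __init__(self):
        super().__init__()
        self.subject = None
        self.pending = None      # (tag, predicate) while collecting text
        self.text_parts = []
        self.triples = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if attributes.get("about"):
            self.subject = attributes["about"]
        if attributes.get("property") and self.pending is None:
            # Start collecting the element's text as the literal value.
            self.pending = (tag, attributes["property"])
            self.text_parts = []

    def handle_data(self, data):
        if self.pending is not None:
            self.text_parts.append(data)

    def handle_endtag(self, tag):
        if self.pending is not None and tag == self.pending[0]:
            # Collapse runs of whitespace so wrapped text becomes one string.
            literal = " ".join("".join(self.text_parts).split())
            self.triples.append((self.subject, self.pending[1], literal))
            self.pending = None


PAGE = """<dl about="http://www.doi.org/10.1126/stke.2832005pe22">
  <dt>Title</dt>
  <dd property="dc:title">A life science Semantic Web: are we there yet?</dd>
</dl>"""

extractor = LiteralPropertyExtractor()
extractor.feed(PAGE)
extractor.close()
for triple in extractor.triples:
    print(triple)
```

The resulting tuples have the same subject, predicate, and object shape as RDF statements, so they could be loaded into a store and queried or mined further.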

POWDER
Finally, we need to understand some of the ways RDF metadata should be used with existing resources such as web pages, scientific publications, and data. The POWDER (Protocol for Web Description Resources) Working Group will be developing a mechanism through which structured metadata (“Description Resources”) can be authenticated and applied to groups of Web-based resources. It will allow retrieval of the description resources without necessarily retrieving the full documents they describe, letting people and programs decide from the description alone whether to retrieve or index the content. Metadata here includes authorship, dates, authenticity (trust in the source), legal bindings, disclaimers, and other conditions. These descriptors could have a major impact on how efficiently researchers find and access scientific papers and data. The resource metadata would be available both as part of general queries (e.g., find any documents produced by the NCBO project) and when deciding whether the identified content carries the appropriate label (e.g., for public consumption) or legal conditions (e.g., Creative Commons public sharing).
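Since the POWDER mechanism is still to be defined, the following Python sketch shows only the general idea: a description is attached to a group of resources, and a program decides from the description alone whether a URL is worth fetching. The DescriptionResource structure, the way a group is defined here (hostnames plus a path prefix), the metadata keys, and the example.org addresses are all assumptions made for illustration, and authentication of the descriptions is left out entirely.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass
class DescriptionResource:
    """Hypothetical stand-in for a description of a group of resources."""
    hosts: frozenset      # hostnames the description applies to
    path_prefix: str      # only paths starting with this prefix are covered
    metadata: dict        # e.g. license, audience, author


def describe(url, descriptions):
    """Merge the metadata of every description whose group contains `url`."""
    parts = urlsplit(url)
    merged = {}
    for description in descriptions:
        in_group = (
            parts.hostname in description.hosts
            and parts.path.startswith(description.path_prefix)
        )
        if in_group:
            merged.update(description.metadata)
    return merged


def should_retrieve(url, descriptions, required):
    """Decide from metadata alone, without fetching the document itself."""
    found = describe(url, descriptions)
    return all(found.get(key) == value for key, value in required.items())


DESCRIPTIONS = [
    DescriptionResource(
        hosts=frozenset({"papers.example.org"}),
        path_prefix="/open/",
        metadata={"license": "Creative Commons", "audience": "public"},
    ),
]

wanted = {"audience": "public"}
print(should_retrieve("http://papers.example.org/open/p1.html", DESCRIPTIONS, wanted))    # True
print(should_retrieve("http://papers.example.org/drafts/p2.html", DESCRIPTIONS, wanted))  # False
```

The point of the design is that one description covers many documents, so a crawler or a researcher's tool can filter on license or audience without downloading each paper first.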

For many, these recommendations and best practices will be the pieces they need to begin building Semantic Web systems. The Semantic Web is moving from a vision to a set of foundational components that can be applied directly to a wide range of informatics challenges. Researchers will be able to use these components to organize and connect resource content and its metadata according to reliable principles.

Eric K. Neumann is senior strategist at Teranode. E-mail: eneumann@teranode.com.
