By Kevin Davies
June 10, 2009 | Imagine a search tool for the life sciences literature that could, with one click, pull up a full-text PDF of any paper. That, in essence, is the attraction of Pubget, the first product of a small Cambridge, Mass., start-up.
Following a quiet launch last year, the company has just announced its first 50 partners, including Caltech, Dartmouth, Harvard, MGH, MIT, NIH, Princeton, UCSF, the University of Michigan, and the University of Virginia. A further 200 organizations are waiting to partner as well. Ryan Jones, Pubget president, says the firm has already enrolled “tens of thousands of users at this point, and we’re doubling every month.” A couple of thousand of those users are inside the Harvard hospitals.
The original Pubget product was developed by one of the three co-founders, a clinical pathologist at Beth Israel Hospital (Harvard Medical School) named Ramy Arnaout. He got his PhD in mathematical biology from Oxford, but was frustrated by the challenge of getting full-text PDF access to science journal articles -- even while working inside well-endowed institutions like Harvard and Oxford. Arnaout joined forces with Ian Connor, formerly with Lotus and IBM, and started building the Pubget search tool.
“Pubget is a platform for life science research,” says Jones, who cites familiar statistics that are propelling the project: the rate at which data are growing exceeds Moore’s Law, and on average two new life sciences journals are launched every day. “A scientist’s tasks are shifting from working with test tubes, reagents and diagnostics equipment to, more and more, interfacing with the data that’s already out there.”
Jones, who was previously with a start-up acquired by Microsoft enterprise search, says Pubget is built on three key components. “One is a search engine that has all the content that Medline or the NIH’s PubMed has in it – 20 million research documents.” Pubget’s open-source search engine uses a relevancy algorithm similar to PubMed’s, Jones explains, except that its content is a little fresher. “We took an initial data dump from PubMed, and now we’ve built direct connections to the publishers themselves, so as soon as research is available, we get that feed from the publisher.”
Second, Pubget built a ‘pathing engine’ that understands the location of the full-text PDFs across all 20,000 journal titles. “It knows exactly where on the web that full-text document lives,” says Jones. “We have crawlers that go out and understand at Nature or Cell or Science where those full-text documents live. In very much the same way that Google finds HTML, we can find the PDF.”
The third component is what Jones calls “a credentials engine, which understands the credentials of the subscriptions you have based on where you are… It can go into a library’s holdings page and interpolate what they have rights to.”
What this means is that when scientists use Pubget to search by author, for example, the results are delivered in the form of the full-text PDF, without having to navigate through abstracts or publishers’ electronic portals. “The end user sees us in two ways,” says Jones. “If they are not associated with a larger institution, we are the most thorough resource for free full-text documents. We not only have everything that’s in PubMed Central and the other free resources, but we spider the web for other full-text documents that happen to be out there. If you’re at an institution, we’re the fastest way to take advantage of the subscriptions your institution has provided for you.”
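To make the three components concrete, here is a minimal sketch in Python of how a search index, a pathing table, and a credentials table might combine to return PDF links directly. It is an illustration only, not Pubget's actual code: every name, data record, and URL below (full_text_search, resolve_pdf, HOLDINGS, the example.org addresses) is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Article:
    article_id: str
    title: str
    authors: tuple
    journal: str
    is_free: bool = False  # e.g. openly available full text


# Component 1 (hypothetical): a toy stand-in for the search index.
ARTICLES = [
    Article("a1", "Viral dynamics in vivo", ("Smith J",), "Journal A"),
    Article("a2", "Open methods for microscopy", ("Smith J", "Lee K"),
            "Journal B", is_free=True),
    Article("a3", "An unrelated topic", ("Wong P",), "Journal A"),
]

# Component 2 (hypothetical): the "pathing" table, mapping each article
# to the place on the web where its full-text PDF lives.
PDF_LOCATIONS = {
    "a1": "https://journal-a.example.org/pdf/a1.pdf",
    "a2": "https://journal-b.example.org/pdf/a2.pdf",
    "a3": "https://journal-a.example.org/pdf/a3.pdf",
}

# Component 3 (hypothetical): the "credentials" table, listing the
# journals each institution subscribes to.
HOLDINGS = {
    "Example University": {"Journal A"},
}


def search_by_author(author: str) -> list:
    """Return every indexed article that lists the given author."""
    return [a for a in ARTICLES if author in a.authors]


def resolve_pdf(article: Article, institution: Optional[str] = None) -> Optional[str]:
    """Return the PDF URL if the reader may access it, else None."""
    subscribed = HOLDINGS.get(institution, set())
    if article.is_free or article.journal in subscribed:
        return PDF_LOCATIONS.get(article.article_id)
    return None


def full_text_search(author: str, institution: Optional[str] = None) -> list:
    """Search by author and pair each hit with its accessible PDF, if any."""
    return [(a.title, resolve_pdf(a, institution)) for a in search_by_author(author)]
```

In this sketch, calling full_text_search("Smith J", "Example University") pairs both of that author's papers with a PDF link, while the same search with no institution returns a link only for the freely available paper, mirroring the two user experiences Jones describes.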
Pubget offers various links for added functionality, including a Firefox plug-in to download PDFs; access to the publisher’s web page and the equivalent page in PubMed; email forwarding; and tagging, which uses a cloud-based storage system to label articles and keep them in a ‘locker.’ A widget, which works via RSS, allows continuous updates on topics or authors inside a lab web page.
The First 50
About two-thirds of the first 50 partners are academic organizations; the rest are hospitals and some commercial organizations. Jones says Pubget already has users at all of the top 12 big pharmas, but no formal relationships as yet (“meaning we haven’t turned them on yet”).
Pubget will in time make money in two ways. One will be the provision of premium services. The other will be by aggregating analytics about current life science search topics. “We can help vendors like Agilent or Bio-Rad understand what the community is searching on,” says Jones. “If you do a search on swine flu, and someone did a virus study and in the methods of that study cited a specific type of microscopy, we can present ads relevant to that.” Host institutions can decide if they want those ads presented or, for a fee, they can opt for “a closed, white label site.” Jones says a handful of the first 50 partners are paying.
Jones credits the staff at the Harvard Countway Library for their early assistance. His team was nervous about the reaction of the publishers at first, “but the reaction has been vastly positive.” Those publisher relationships will be nurtured over time. “We strongly believe that search is paramount and that the papers are really the center of science – it’s how scientists communicate with one another. We want to participate most strongly in those two things, search and papers,” he says, adding that Pubget could potentially partner with groups in the social networking space.
Pubget can be found at pubget.com