
Got PubMed? Pubget Searches and Delivers Scientific PDFs

By Kevin Davies

June 10, 2009 | Imagine a search tool for the life sciences literature that could, with one click, pull up a full-text PDF of any paper. That in essence is the attraction of Pubget, the first product of a small Cambridge, Mass. start-up.

Following a quiet launch last year, the company has just announced its first 50 partners, including Caltech, Dartmouth, Harvard, MGH, MIT, NIH, Princeton, UCSF, the University of Michigan, and the University of Virginia. A further 200 organizations are waiting to partner as well. Ryan Jones, Pubget president, says the firm has already enrolled “tens of thousands of users at this point, and we’re doubling every month.” A couple of thousand users are inside the Harvard hospitals.

The original Pubget product was developed by one of the three co-founders, a clinical pathologist at Beth Israel Hospital (Harvard Medical School) named Ramy Arnaout. He got his PhD in mathematical biology from Oxford, but was frustrated by the challenge of getting full-text PDF access to science journal articles -- even while working inside well-endowed institutions like Harvard and Oxford. Arnaout joined forces with Ian Connor, formerly with Lotus and IBM, and started building the Pubget search tool.

“Pubget is a platform for life science research,” says Jones, who cites familiar statistics that are propelling the project: the rate at which data are growing exceeds Moore’s Law, and on average two new life sciences journals are launched every day. “A scientist’s tasks are shifting from working with test tubes, reagents and diagnostics equipment to, more and more, interfacing with the data that’s already out there.”

Jones, who was previously with a start-up acquired by Microsoft enterprise search, says Pubget is built on three key components. “One is a search engine that has all the content that Medline or the NIH’s PubMed has in it – 20 million research documents.” Pubget’s open-source search engine uses a relevancy algorithm similar to PubMed’s, Jones explains, except that its content is a little fresher. “We took an initial data dump from PubMed, and now we’ve based direct connections to the publishers themselves, so as soon as research is available, we get that feed from the publisher.”

Second, Pubget built a ‘pathing engine’ that understands the location of the full-text PDFs across all 20,000 journal titles. “It knows exactly where on the web that full-text document lives,” says Jones. “We have crawlers that go out and understand at Nature or Cell or Science where those full-text documents live. In very much the same way that Google finds HTML, we can find the PDF.”

The third component is what Jones calls “a credentials engine, which understands the credentials of the subscriptions you have based on where you are… It can go into a library’s holdings page and interpolate what they have rights to.”

What this means is that when scientists use Pubget to search by author, for example, the results are delivered in the form of the full-text PDF, without having to navigate through abstracts or publishers’ electronic portals. “The end user sees us in two ways,” says Jones. “If they are not associated with a larger institution, we are the most thorough resource for free full-text documents. We not only have everything that’s in PubMed Central and the other free resources, but we spider the web for other full-text documents that happen to be out there. If you’re at an institution, we’re the fastest way to take advantage of the subscriptions your institution has provided for you.”

Pubget offers various links for functionality, including a Firefox plug-in to download PDFs; access to the publisher’s web page and the equivalent page in PubMed; email forwarding; and tagging (using a virtual cloud-based storage system) to metatag articles and keep them in a ‘locker.’ A widget, which works via RSS, allows continuous updates on topics or authors inside a lab web page.

The First 50

About two thirds of the first 50 partners are academic organizations; the rest are hospitals and some commercial organizations. Jones says Pubget already has users at all of the top 12 big pharmas, but no formal relationships as yet (“meaning we haven’t turned them on yet”).

Pubget will in time make money in two ways. One will be the provision of premium services. The other will be by aggregating analytics about current life science search topics. “We can help vendors like Agilent or Bio-Rad understand what the community is searching on,” says Jones. “If you do a search on swine flu, and someone did a virus study and in the methods of that study cited a specific type of microscopy, we can present ads relevant to that.” Host institutions can decide if they want those ads presented or, for a fee, they can opt for “a closed, white label site.” Jones says a handful of the first 50 partners are paying.

Jones credits the staff at the Harvard Countway Library for their early assistance. His team was nervous about the reaction of the publishers at first, “but the reaction has been vastly positive.” Those publisher relationships will be nurtured over time. “We strongly believe that search is paramount and that the papers are really the center of science – it’s how scientists communicate with one another. We want to participate most strongly in those two things, search and papers,” and potentially partner with groups in the social networking space.

Pubget can be found at


    I found this article rather long after its publication. However, by 2009 publishers were already spending millions on value-added services for the HTML version of the article, such as comment facilities like this one, altmetrics, links to other articles via Crossref and much more. Current-generation articles can contain embedded videos, animated graphs, 3D models, formula rendering etc. It is therefore eerie that systems are created to bypass even the trusted abstract. End users just want downloadable and printable PDFs, it seems, if library access to the journals is not a problem.

    Working in this way is quite an impoverished experience. One also gets “enhanced PDFs” with the same attributes as HTML articles, but those that I have worked with are not downloadable or printable. Alternative metrics involve, among other things, monitoring reader views. This system is not perfect, I agree, but citation factors are not either, and they take far too long. Downloading and sharing PDFs by e-mail or within systems like Mendeley defeats usage monitoring. I still fail to understand how these systems proliferate when the copyright policy of the publisher prohibits storage in an electronic system. Preferably, each time an article is needed, it must be called up online and each user within an organisation must call it up independently.

    Disclaimer: I am not a rep from a publisher, but a librarian observing and wondering. Either these systems are not necessary or the publishers and libraries are out of tune with reader needs and preferences.


For reprints and/or copyright permission, please contact Angela Parsons, 781.972.5467.