Big Data, BGI and GigaScience

September 23, 2011

New journal links manuscripts with a database to better share data.  

Does the world really need yet another journal? The Chinese founders of BGI (formerly the Beijing Genomics Institute) clearly think so. This fall they will publish GigaScience, an open-access peer-reviewed journal dedicated to large-scale data. Laurie Goodman, an American science editor and writer, has worked for BGI for many years, polishing and editing scientific manuscripts, many dealing with sequencing and analysis. With her extensive experience, it was only natural that BGI should ask her to spearhead the launch. Goodman’s former colleague, Kevin Davies, asked her about the journal’s goals and features.

Bio•IT World: Laurie, why is BGI getting into the journal publishing game with GigaScience? 

Goodman:
The heads of BGI have long had an interest in creating a world-class international scientific journal based in China. They felt this was an important step in furthering scientific dialogue between researchers based in China and those throughout the world. BGI, having had its ‘birth’ as part of a huge collaboration (a trait that has continued throughout its history and has played a major role in its rapid growth), feels very strongly that increasing the international dialogue for researchers all over China is essential to the growth of research and technology in China.

The leaders of BGI also feel that having a journal (and hopefully a future publishing division) based in a research institution, much like the Cold Spring Harbor Laboratory Press, will enhance the scientific environment for the young people working at or visiting BGI, both through interactions with the editorial staff based at the institution and with international researchers from different fields.

Why the focus on “big data”? 

It’s become clear that there is a growing niche in the scientific arena of large-scale data creation and sharing that has not had a journal it could call ‘home’. While some of this type of work will always have a place in myriad journals, given the growing number of these studies (and the growing amount of data with special hosting and sharing needs), this has ultimately become a field that requires a place specifically focused on this area.

It’s much like the reason for the birth of Nature Genetics, when there was clearly an increase in the number of great human genetics and genomics papers but not enough space in a journal like Nature or Science to handle them; or the creation of Genome Research, when the needs of the genome community at large (combining data, bioinformatics, novel technologies, and so on) required a new outlet for publication.

Given the work at BGI, with its world-class bioinformatics division and the enormous amount of data it produces (which necessarily required huge data computation, access, and storage facilities), the focus for the journal became a ‘no-brainer’.

When they proposed the idea to me, my first thought was... Argh! Run another journal?! I wasn’t sure, but when what they were proposing actually hit me, there was no way I could say no. It was too exciting, addressing some serious needs in the community, such as finding a way to host data types that can be incredibly useful resources to the community but have no good place to be hosted and properly shared, most importantly in a way that offers some permanence.

What makes GigaScience unique? 

Several things, hopefully! First, the most distinctive feature of GigaScience is that it combines standard manuscript publication with a connected database that will host (and provide sharing tools for) all associated data. Second, all the data in the database will be given a DOI so that they can serve as a separate publication, especially if data generation and quality assessment are the primary focus of that work. This achieves many things:

(A) It provides a way for researchers to track and cite extensive work they have done that the community at large uses. (Currently, if someone is involved in data production, they are penalized by having little to no recognition for their work);

(B) It encourages very rapid data release, rather than having to wait until the paper is written, submitted, peer-reviewed, revised multiple times, and published. We have contacted many journal editors and read many of their editorials and blogs to assess whether publishing data in a citable manner before a future analysis paper would affect publication of that analysis paper. Overwhelmingly, journals have stated that this would in no way impede publication of the analysis paper. (BGI has just released 8 datasets with DOIs, mostly unpublished, so we will be seeing some of the first tests of that); and

(C) It creates permanence for these datasets and makes it easier for researchers to call up all the associated data...

I’d like to stress something here: we expect that all data will also be posted in their currently available public databases. The more places data are available and in different ways, the better for the community at large. Plus interaction between the BGI bioinformaticians and those at the current repositories can only create a more synergistic environment. We are already working closely with the INSDC consortium of databases to make sure our datasets also fit smoothly into their current infrastructure.  

Third, we’ve included an optional section at the end of our papers called ‘Potential Implications,’ which goes beyond the standard Discussion section. Potential Implications allows authors to comment and speculate on things they may have noticed in their work that could have implications for fields far beyond their own. Traditional science papers are very limited in their ability to cross fields and promote much broader ways of thinking. So having a separate, designated section that is forward-thinking and makes it possible for people in potentially very different fields to connect with that work is very worthwhile... Hopefully this section will be a bit of fun and perhaps promote more cross-field collaborations.

How will the peer review system work? 

Good question. We are working toward modifying the peer review system, as there are many obvious problems there. We are having manuscripts peer reviewed, but reviewers will not be asked to give their sense of the ‘importance’ of the work, which can be very ambiguous, and changes nearly daily depending on the ‘science du jour’.  

Personally, I don’t think that is the most problematic part of peer review, but it does impede speed of publication. Look at the rapid work done on the deadly E. coli strain in Germany recently, where the DNA sequences were released pretty much the day of their completion, with rolling releases after that.

Our papers will be editorially assessed for scope, that is, the relative amount of data created or used. Peer review will focus on the accuracy of the data and the validity and strength of the conclusions drawn in the data analysis. Large-data papers are typically multi-faceted in terms of data production, computational analyses, and biological inferences, with many authors of different expertise. We think it makes sense to break the review process apart accordingly: the data part will be looked at by reviewers expert in that type of data production, who know the standard things that MUST be included, while the analysis part of the paper will be reviewed by other relevant authorities. This should speed the process, as people can focus on the parts of the paper they are most qualified to judge, and it limits misinterpretation of other parts of the paper.

We will also try to make peer review more transparent. For instance, we have an opt-out review system: reviewers are named unless they specifically opt out. We want to give ‘some’ protection if reviewers feel they truly need it; currently, in other journals that have this system, most reviewers do not opt out. And we are publishing the reviews with the paper, so that readers can see exactly what the reviewers thought of the paper’s quality, accuracy, and usefulness.

What sort of papers are you looking for? 

We are looking for papers in any of the life sciences that are based on large datasets. These can be data production papers, data analysis papers (using their own currently generated data or utilizing publicly available data resources), and of course papers combining data production and analysis. Our first GigaBlog post provides the basics of how we plan to decide [which data are big enough]. This will be continually updated in future posts, as it would be very difficult to do so in the ‘instructions to authors’ part of the journal website.

We really want to include non-‘omics-type papers, as these typically have sharable data but no public repository. These are papers that might have massive amounts of imaging data, or even things like electroencephalograms from behavior studies in rats, that could be viewed and used by the broader community. Huge cohort studies with a lot of metadata are definitely things we would consider.

What is specifically of interest here is making these potentially shareable resources, which are often very costly to generate, more broadly available. One of the reasons these large datasets lack a public database, leaving them inaccessible or parked on websites that can disappear, is that there is a smaller community of researchers creating and using them, or there can be complex issues in hosting and handling them. We will work closely with authors who see this issue with their data, where there are discussions in the field on potential ways to overcome such problems but no real resources to make this a reality.

BGI has huge data storage and bioinformatics capacities, and we are interested in making these types of studies more useful to the community and in promoting more of this kind of collaborative research. The GigaScience database will evolve, and as researchers or authors see the benefit of putting something in place, we intend to work with these specific communities to create appropriate hosting and use tools within the Giga Database.

Who are the founding editors of GigaScience?  

I’ve been in the publishing field for over 15 years, first with Nature Genetics and then as the founding editor of Genome Research. My career has been heavily based in ‘omics and thus focused on dealing with large-scale data science. I’m familiar with all the inherent issues of open sharing, the difficulties of making data available, the difficulties of citation for huge data resources, and, most importantly, finding ways to truly broaden the use of large-scale data to those in the community who are not bioinformatics- or computer-savvy. (As Sydney Brenner has noted, the only one who has currently ‘read’ the human genome is a computer. Getting these data past only-computer eyes is a huge goal for GigaScience.)

Scott Edmunds, the editor, is a Londoner who came from BioMed Central, an open-access publisher and now also the publisher for GigaScience. Scott is based in our main office in Shenzhen, China. He, too, has focused a great deal on issues of data handling and hosting. At BMC, Scott worked on a variety of specialist and medical journals; before coming to GigaScience he was a senior editor on their genomics, bioinformatics, and genetics journals. I’m really glad I found him!

Alexandra Basford is the assistant editor. She reminds me of me 20 years ago, just coming into the field of scientific journalism. Her research expertise is in cognitive neurobiology, which will help in the outreach to the neuroscience community. She previously worked at PubMed, indexing biomedical and life science articles, handling metadata, MeSH terms, and the like. She has a keen interest in improving the ability of researchers to find and use new information to support and enhance their work.   

This article also appeared in the September-October 2011 issue of Bio-IT World magazine.