Genome in a Bottle Benchmark Published

By Bio-IT World Staff

February 26, 2014 | The Genome in a Bottle Consortium, a project overseen by the National Institute of Standards and Technology (NIST), was created in April 2012 to provide a gold standard for genotyping. The goal is to provide “benchmark” genotype calls of very high confidence for specific reference material, so that organizations can sequence the same material themselves and compare their own genotype calls against the benchmark to find sources of error or bias. This could provide a much-needed performance metric for both sequencing instruments and informatics pipelines, revealing each platform’s strengths and vulnerabilities in calling genetic variants.

The Consortium’s first reference material is a human cell line called NA12878, which can be easily distributed to third parties for re-genotyping on different platforms. In July 2013, NIST, in collaboration with researchers from the Harvard School of Public Health and the Virginia Bioinformatics Institute, made their benchmark genotype calls for NA12878 freely available on the GCAT online resource. (See “Arpeggi Adds Genome in a Bottle Consortium Data to GCAT.”) The data in that benchmark were generated by comparing 14 different analyses of NA12878, representing a total of five different sequencing instruments, seven read mappers that assign short reads to the appropriate place on the genome, and three variant callers that flag SNPs, indels and more complex sources of variation.

Now, the process by which this team created their benchmark has been published in Nature Biotechnology, in a paper by lead author Justin Zook of NIST. This paper reveals how Zook and his colleagues arbitrated between different data sets when deciding which calls were accurate, and spells out the reasons for excluding certain genomic regions – especially regions of large structural variation – from the Genome in a Bottle genotype calls.
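Because some genomic regions are left out of the benchmark, a lab comparing its own calls against it would presumably want to count only sites inside the regions the benchmark covers. The Python sketch below illustrates that idea under stated assumptions: the function names, the dictionary layout and the toy coordinates are all hypothetical, and this is not the Consortium's or GCAT's actual comparison method.

```python
def in_regions(chrom, pos, regions):
    """True if a 1-based position falls inside any (chrom, start, end) region, ends inclusive."""
    return any(c == chrom and start <= pos <= end for c, start, end in regions)


def compare_to_benchmark(benchmark, calls, regions):
    """Compare genotype calls to benchmark calls, restricted to the given regions.

    Both inputs map (chrom, pos, ref, alt) to a genotype string such as "0/1".
    This layout is an assumption made for illustration only.
    """
    bench = {k: g for k, g in benchmark.items() if in_regions(k[0], k[1], regions)}
    mine = {k: g for k, g in calls.items() if in_regions(k[0], k[1], regions)}

    # A call matches only if the same variant is present with the same genotype.
    matched = sum(1 for k, g in mine.items() if bench.get(k) == g)
    return {
        "matched": matched,
        "unmatched_calls": len(mine) - matched,
        "missed_benchmark": sum(1 for k, g in bench.items() if mine.get(k) != g),
    }


# Toy data, invented for this example.
regions = [("chr1", 100, 200)]
benchmark = {
    ("chr1", 120, "A", "G"): "0/1",
    ("chr1", 150, "C", "T"): "1/1",
}
calls = {
    ("chr1", 120, "A", "G"): "0/1",  # agrees with the benchmark
    ("chr1", 150, "C", "T"): "0/1",  # same site, different genotype
    ("chr1", 180, "T", "C"): "0/1",  # not in the benchmark
    ("chr1", 300, "G", "A"): "1/1",  # outside the covered regions, ignored
}

summary = compare_to_benchmark(benchmark, calls, regions)
print(summary)
```

In this toy case the call at position 300 is neither credited nor penalized, since it falls outside the covered regions.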

The paper describes a number of hurdles the team faced in creating their benchmark genotype of NA12878, an account that should benefit future groups looking to repeat this work with other reference materials. One example is the difficulty of resolving complex variants, like a CAGTGA>TCTCT change on chromosome 1, which might be accurately captured on different platforms but represented differently in VCF files. The authors also describe challenges in validating their work, since the benchmark calls can, by design, only be compared against less accurate or more biased genotyping methods. The multiple layers of validation, and the guidance on manually inspecting alignments to resolve differences, could serve as a model for creating benchmark genotype calls accurate enough for use as performance metrics.
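To see why a complex variant can be captured correctly and still look different from one file to the next, consider the Python sketch below. It applies two sets of variant records to the same reference and checks that they yield the same sequence. The ten-base reference, the positions and the two-record split are invented for illustration; only the CAGTGA>TCTCT change comes from the paper as described above, and this is not the method the authors used to reconcile representations.

```python
def apply_variants(reference, variants):
    """Apply non-overlapping (pos, ref, alt) records to a reference string.

    Positions are 1-based, as in VCF. Raises ValueError if records overlap
    or if a record's ref allele does not match the reference.
    """
    pieces = []
    cursor = 0  # 0-based index of the next unconsumed reference base
    for pos, ref, alt in sorted(variants):
        start = pos - 1
        if start < cursor:
            raise ValueError(f"record at position {pos} overlaps an earlier record")
        if reference[start:start + len(ref)] != ref:
            raise ValueError(f"ref allele {ref!r} does not match reference at {pos}")
        pieces.append(reference[cursor:start])  # untouched bases before the variant
        pieces.append(alt)                      # substituted allele
        cursor = start + len(ref)
    pieces.append(reference[cursor:])           # untouched tail
    return "".join(pieces)


# Hypothetical reference: two flanking bases on each side of CAGTGA.
REFERENCE = "GGCAGTGATT"

# One record describing the whole change.
single_record = [(3, "CAGTGA", "TCTCT")]

# The same change split into two adjacent records (hypothetical decomposition).
split_records = [(3, "CAG", "TCT"), (6, "TGA", "CT")]

haplotype_a = apply_variants(REFERENCE, single_record)
haplotype_b = apply_variants(REFERENCE, split_records)

print(haplotype_a, haplotype_b, haplotype_a == haplotype_b)
```

A record-by-record comparison would treat these two call sets as disagreeing, even though the resulting sequences are identical.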