By Aaron Krol
April 25, 2014 | Syngenta, one of the world’s largest crop engineering
companies, was not in the market for a new genome browser. In 2008, Syngenta, like many companies, was using the open source program GBrowse, which is
substantially similar to most commercial platforms. GBrowse has an intuitive
interface for displaying genes and chromosomes, and all the basic tools, like
BLAST, that users need to run in genome browsers. It didn’t seem like a
particular pain point for research.
But then a group of Syngenta scientists did a site visit
to Ceres, a smaller bioengineering company that specializes in creating crop
strains for use as biofuels. There, they were introduced to an in-house browser called Persephone.
“They were looking at something completely different,” says Eric Ganko, a
computational biologist at Syngenta. “But they happened to see this software,
and they were quite interested in how fast you were able to look at a whole
Genome browsers have to render massive amounts of data in an
interactive visual format, which makes heavy demands on memory. It can take ten
or fifteen seconds to load a new file, or to zoom from a specific gene out to a
whole chromosome. “If the research stream is 15 clicks, and every click is 10
to 15 seconds, it may not sound like a lot, but over the course of looking up
your information it can be very time-consuming,” says Tim Swaller, Ceres’ Vice
President of Genomic Technologies.
In fact, in subtle ways, it can begin to
shape the kind of research that gets done. Scientists entering a genome browser
tend to navigate straight to a single gene or region of interest, and not move
far afield, or switch to another data source, without a specific reason.
Carl Meinhof, Ceres’ Manager of Research Informatics, likens
the situation to the early days of Google Earth. “[The map] was all there,” he
says, “but when you panned around, or zoomed in, you always had to wait for
things to load. It was great that something like this existed in the first
place, but in the end, it was really painful to use.”
Timothy Swaller (L), VP of Genomic Technologies, and Carl Meinhof (R), Manager of Research Informatics. Image credit: Ceres
Persephone, however, works essentially in real time. Users
can open a 300 million-nucleotide chromosome and see it appear onscreen in less than a second,
and immediately scale down to the level of introns and exons, or pan between
genic regions. “You can scroll the mouse wheel, and things just react
immediately, just like you would expect from a picture you zoom in and out of,”
says Meinhof. “In other browsers, you deal with delays in reloading data when
you go between regions.”
Related data – from gene expression studies, or large
sets of SNPs – can then be quickly layered on top of the gene or chromosome of
brought up 60 million SNPs in 30 seconds,” says Swaller.
The Seeds of a New Approach
In a way, it’s not surprising that a platform like
Persephone should come out of a company in the field of plant genetics. While
the issues Persephone addresses frustrate many geneticists, they are
particularly acute for those working with crops.
“The funding for plant research is not as high as for
humans, so a lot of the time we deal with less-than-full datasets,” says
Swaller. “We end up working with a lot of scaffold maps. For example, with
wheat there are scaffold maps with hundreds of thousands of scaffolds.” These
genetic maps are much less complete than physical maps of the chromosomes, and
so get split into many more pieces. Where a human geneticist can access the
whole human reference genome in just 23 chromosomes, plant geneticists often
deal with orders of magnitude more files, all of which have to be loaded individually in browsers.
Researchers in the plant field also work more often with
homologous sequences from related species, meaning yet more genetic maps to
load. At Ceres, one of the key crops is sorghum, an organism so incompletely
sequenced that the company uses the corn genome as a reference. This reliance
on cross-species data means that Persephone has to move rapidly
between separate maps, rather than focusing on just one at a time.
For this reason, Persephone has the ability to display more
than one chromosome or genetic map on the same screen, which is not a standard
feature of genome browsers. If you want to match sequences from two species,
says Swaller, “you can bring up syntenic chromosomes, with all the orthologous matches,
in less than a second.” Visualizing these separate sequences side by side offers
a more intuitive picture of genetic homology. Persephone can search for
homologous or orthologous regions between two different datasets, and draw
lines between the two maps to show where matches occur.
Human and mouse chromosomes viewed side by side in Persephone. The lines indicate areas of synteny between chromosomes. Image credit: Ceres
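As a rough illustration of how connecting lines like these could be computed, the Python sketch below scales each gene's position to a vertical pixel offset on its own map and returns one line segment per matched pair. This is not Ceres' code: the function name, the dictionary-based inputs, and the pixel coordinates are all assumptions made for the example, and it presumes the orthologous pairs have already been identified.

```python
def synteny_lines(genes_a, genes_b, ortholog_pairs, len_a, len_b,
                  height=600, x_a=100, x_b=400):
    """Return line segments linking matched genes on two vertical maps.

    genes_a / genes_b: hypothetical dicts of gene name -> position (bases).
    ortholog_pairs:    iterable of (name_on_a, name_on_b) tuples.
    len_a / len_b:     total length of each map, used for scaling.
    height, x_a, x_b:  assumed drawing geometry in pixels.
    """
    lines = []
    for name_a, name_b in ortholog_pairs:
        # Skip pairs that refer to genes missing from either map.
        if name_a not in genes_a or name_b not in genes_b:
            continue
        # Scale each position to a vertical offset on its own map.
        y_a = genes_a[name_a] / len_a * height
        y_b = genes_b[name_b] / len_b * height
        lines.append(((x_a, y_a), (x_b, y_b)))
    return lines
```

Because each map is scaled by its own length, chromosomes of very different sizes can be drawn at the same height and still be linked.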
One quirk of Persephone’s origins in plant genetics is that
it displays maps vertically, a perspective that Swaller says helps make visual
sense of cross-chromosome comparisons. When users want to switch to a more
targeted view, they can simply highlight a region or gene, and see that information in
the more familiar horizontal display that most genome browsers use. In the
horizontal mode, users can scale down to the level of single nucleotides, to
better explore SNPs and other variants.
Viewing multiple maps in the same window can give users a more
comprehensive view of polygenic traits. “Most traits, in general, are
quantitative in nature – multiple genes that are additive in effect,” says
Swaller. “What our scientists want is to be able to view two, three, four
different genes at once. And if those are across different chromosomes, we
still want to add expression data and see all the SNP data for those.”
This combination of speed and the ability to keep multiple datasets
onscreen doesn’t just accelerate computational tasks. By making it more
inviting to skim through the whole genome, it encourages users to do more
exploratory, hypothesis-free research.
“We want to allow people to browse through the data, and
explore the data, without a concept of what the results will be,” says Swaller.
“If you want to see methylation patterns across the genome, you need to bring
up multiple chromosomes at one time, see methylation patterns, maybe see gene
annotations, without any kind of preconceived notion of where you want to end
up.” This can lead to unexpected insights and connections that would not be
made if researchers looked at only the genes they’re currently working on.
“At some point, speed is not just a quantitative improvement
– it’s a qualitative jump,” says Meinhof. “It makes things possible that are
not possible at the slower speed.”
From In-House Tool to Software-as-a-Service
Persephone’s speed is not a product of sophisticated
hardware or parallel computing. At the Molecular Medicine Tri-Conference in San
Francisco this February, Meinhof demonstrated the software on an ordinary
laptop. “It’s fundamentally software design, and the data storage and
transport” that accounts for the speed, he says.
Ceres’ lead software designer comes from the gaming
industry, a perspective that Swaller says made him “very interested in
applications that gave our researchers immediate gratification.” The designers
borrowed techniques from game development for drawing elements onscreen and for
temporarily showing fewer of them during navigation, which helps keep loading times in check.
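Ceres has not published how Persephone does this. One common way to draw fewer elements while a user is zoomed out is to collapse nearby features into a bounded number of bins and draw one summary glyph per bin. The Python sketch below shows that idea only. The function name and its parameters are hypothetical, and it should not be read as the technique Persephone actually uses.

```python
def summarize_features(positions, view_start, view_end, max_glyphs):
    """Collapse features in the visible window into at most max_glyphs bins.

    Returns a list of (bin_start, feature_count) tuples. When the window
    is narrow enough, each bin is one base wide and every feature is
    reported individually.
    """
    if view_end <= view_start or max_glyphs <= 0:
        raise ValueError("need a non-empty view and a positive glyph budget")
    span = view_end - view_start
    # Ceiling division so the bins always cover the whole window.
    bin_width = max(1, -(-span // max_glyphs))
    counts = {}
    for pos in positions:
        if view_start <= pos < view_end:
            index = (pos - view_start) // bin_width
            counts[index] = counts.get(index, 0) + 1
    return [(view_start + i * bin_width, counts[i]) for i in sorted(counts)]
```

The drawing cost then depends on the glyph budget rather than on how many features fall inside the window, which is why this kind of reduction keeps panning and zooming responsive.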
“The other aspect is how data is transported between the
back end and the front end, and the compression being used,” adds Meinhof. “We
can load very large datasets with a relatively small memory footprint. We can
load a few million SNPs, and still be under a gigabyte in memory usage. It can
be done on a mediocre machine.”
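The article does not say which compression scheme Persephone uses. Purely as an illustration of why sorted genomic positions can travel compactly between a back end and a front end, the Python sketch below delta-encodes a sorted list of SNP positions and compresses the result with zlib from the standard library. The function names are hypothetical, and the sketch assumes positions are sorted, non-negative, and that each gap fits in 32 bits.

```python
import struct
import zlib


def pack_positions(positions):
    """Delta-encode sorted positions and compress them for transport."""
    deltas = []
    previous = 0
    for pos in positions:
        # Store the gap to the previous position instead of the position.
        deltas.append(pos - previous)
        previous = pos
    # Little-endian unsigned 32-bit integers, 4 bytes per delta.
    raw = struct.pack(f"<{len(deltas)}I", *deltas)
    return zlib.compress(raw)


def unpack_positions(blob):
    """Reverse pack_positions: decompress, then rebuild absolute positions."""
    raw = zlib.decompress(blob)
    deltas = struct.unpack(f"<{len(raw) // 4}I", raw)
    positions = []
    running = 0
    for delta in deltas:
        running += delta
        positions.append(running)
    return positions
```

Gaps between neighboring SNPs are small and repetitive compared with absolute coordinates, so they tend to compress well. How much is saved depends on the data.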
Still, Persephone was originally built with in-house use at
a medium-size company in mind. Ceres had no inclination to become a software
provider, and only began licensing its genome browser when Syngenta happened to see the platform and expressed
interest in adopting it.
In 2008, when Syngenta first saw Persephone in use,
next-generation sequencing was also beginning to pile unprecedented amounts of
data into companies’ servers. “There are genomes out there that have a million
sequences,” says Ganko. “That’s not something Persephone was necessarily
designed for originally, but it is something they’ve adjusted for over time.”
Ganko manages the reference genomes at Syngenta, and has made a gradual transition to using Persephone as the central repository for his
data. Over the years that Syngenta has licensed the genome browser, Ceres has
worked hard to adapt it to the needs of large clients in the era of
high-throughput sequencing. “When next-generation sequencing came about, we
really revamped the software to handle these very, very large datasets from all
these dispersed databases,” says Swaller.
“It’s definitely come along in terms of speed under heavy
loads,” adds Ganko. “We really have enjoyed working with Ceres’ developers.
It’s not always easy to find people who are good and responsive at development,
but also can understand the biology.”
A screenshot from Persephone showing both the vertical macro view of a chromosome and the horizontal micro view of a target gene. Image credit: Ceres
Syngenta has now fully transitioned to Persephone as both
its primary genome browser and a database for genomic data and annotations. Ganko
is encouraged by the results. “What it can open up is that ad hoc kind of discovery,” he says. “Most
people are going to Persephone with a specific target in mind, a gene or a
marker they’re interested in exploring… [but] when you are able to look at
what’s around you much more easily, you might find out we actually have several
other markers nearby that we could also try. My hope is always that, by making different
types of data available, you might allow for chance discovery.”
Many of the features in Persephone appeal to Syngenta
specifically as a company in plant engineering, including the ability to work more
easily with genetic maps that fall short of whole physical chromosomes. “The
genetic maps is a really big and needed feature,” says Ganko. “Some crops still
don’t have a genome. Things like wheat, and sugarcane, are just so big that we
don’t have good physical sequence, and probably won’t for some time.” The
public databases Persephone draws on, too, were originally concentrated heavily
in the plant space.
Yet Ceres has recognized a broader need for a real-time
genome browser, and the company is now beginning to demo the platform for users
in the human genome space.
Persephone is already available for large customers with the
resources to license the platform and install it internally. However, Ceres is
also working on a software-as-a-service model, where users can run Persephone
through Amazon Web Services, and store their data in the cloud. This is the
architecture that prospective customers are using to demo the software, and
Ceres hopes to fully deploy it as a commercial solution by the end of the year.
“We realize there’s a big community of individual users – at
companies, universities, institutes – that want an application with this
performance and speed,” says Swaller. After years of development with its major client, Ceres now sees a large potential market for its first software platform.