Tackling the Organizational Challenges Slowing Biotech

By Allison Proffitt

November 4, 2016 | “How do you change the way scientists operate in a fundamental way to win them over?” Matthew Trunnell, CIO at the Fred Hutchinson Cancer Institute, asked the crowd at the Converged IT Summit* last week in San Diego during his opening keynote. Funding agencies are offering collaborative funding, people are pushing team science, but nothing changes until people see value in a new way of operating, he said.

While there were plenty of talks wading into the weeds of how to build and structure data centers, much of the conversation at the Summit tackled the more amorphous problems plaguing bio-IT today: building a community that shares data and incentivizes collaboration.

Posting a timeline of his own year-long tenure at the Hutch, Trunnell pointed out that most of his time so far has been spent winning people over. His first year culminated in the launch of the Hutch Data Commonwealth in mid-October, an analytics platform and hub for data science. The Commonwealth consists of four verticals: data acquisition, data engineering, data analytics, and product development.

Launching the Hutch Data Commonwealth took a “tremendous amount of socialization,” he said. The right researchers needed to be identified: those who understood the problem, and may have a solution. The Institute’s Board of Directors needed to be recruited and incorporated into strategic planning. Advisory boards needed to be assembled.

It’s an organizational problem, Trunnell explained. The project has an infrastructure component, but also a research component. Faculty are needed, but applied faculty. The project needs cross-institutional support to survive. And of course, there’s semantics. “Commonwealth” doesn’t really mean anything to Americans, Trunnell pointed out. It’s not threatening; it’s not an institute or a department.

But for all the work required, the burden of proof is not terribly high, Trunnell believes. There’s value purely in making data discoverable, he said. It doesn’t need to be cleaned or annotated.

Of course the goal is combined data that can move research forward. Genomics is great, Trunnell said, but it’s “pretty simple.” (He was previously CIO at the Broad Institute.) Bringing in clinical data is much harder, he believes. Biomedical data includes clinical, research, commercial, and operational data, and in order to move forward, anyone pursuing precision medicine is going to need to bring all of these sources together.

The call for seamless data environments isn’t new. The week before the Converged IT Summit, the authors of the October 17 Cancer Moonshot Task Force report listed five strategic goals. The second was to “unleash the power of data” and included, “enabling a seamless data environment for clinical and research data through shared policies and technologies,” as the second action item.

Far earlier, groups have been working on ways to share data both through public efforts and private ones. These are unwieldy problems, and Trunnell suggested that instead of national—or even global—solutions, we consider regional centers. And who pays for it? At the Hutch, with the support of the leadership, Trunnell is “deferring” that question for a few years. Apparently the socialization has worked.

Born Digital

Vivien Bonazzi, senior advisor for data science technologies at the National Institutes of Health, works with her own Commons, and said she sees the same problems as Trunnell and others. The NIH has had a big data initiative since 2013, the Big Data to Knowledge (BD2K) program, she pointed out, but there is still much work to be done to change how people think about the data.

Organizations that were “born digital” (Bonazzi mentioned Uber and Airbnb) are working, but older institutions are at a disadvantage, she said. In biomedical research, data can no longer be seen as a means to an end, Bonazzi said. That makes the data itself low value, and there are few incentives to share the data because that may impede publication.

But as we move forward, “Organizations will be defined by their digital assets,” Bonazzi said, by data, software, workflows, documentation, and journal articles. “The most successful organizations of the future will be those that can leverage their digital assets and transform them into a digital enterprise.”

The NIH Data Commons (a Commons, not the Commons, Bonazzi emphasized) is a platform to foster a digital ecosystem that rewards FAIR data: data that is Findable, Accessible, Interoperable, and Reusable.

Bonazzi conceded that communally defining findable, accessible, interoperable, and reusable, and then mapping a way to achieve that, is challenging. But the first step, she said, is changing culture to reward data sharing.

The NIH Data Commons is made of three layers: the platform stack which includes software services and tools; the data layer with either large reference datasets or user defined data; and the compute platform, either cloud or a supercomputer.

Many groups are working on related efforts, and Bonazzi highlighted the cancer cloud from Seven Bridges, DNAnexus, and Sage Bionetworks as classic examples. She also mentioned the Genomic Data Commons project, the ongoing work with BD2K, and GA4GH’s efforts.

In its initial phase, the Commons is working on establishing digital object identifiers to track data provenance, promoting machine-readable data and a minimal set of searchable metadata, and setting clear access rules. In the future, Bonazzi said, digital objects could be accessible by standard APIs, and community-based standards, ontologies, and DOI conventions could emerge.

Data In, Data Out

Across speakers, the audience asked again and again how initiatives like these are getting buy-in. How are researchers being convinced to invest their data and time in these new efforts?

Trunnell pays researchers to take part. Researchers are eligible for $10,000-$30,000 internal grants to bring their data into the Commonwealth and let it be annotated. It’s a model Trunnell had used at the Broad as well. “If you let me manage [your data], you don’t have to pay for storage,” he said.

Bonazzi, too, discussed financial incentives: access to and share-ability of FAIR data is part of the NIH grant review criteria, she said.

Patricia Kovatch, associate dean for scientific computing at Mount Sinai School of Medicine, has taken a similar approach with the NIH’s Children's Health Exposure Analysis Resource (CHEAR) project. Researchers apply to be part of it, Kovatch explained, and if accepted they get their data processing at no charge. But everything must be shared after publication.

(It’s worth noting that publication can sometimes take up to 18 months, BioTeam’s Ari Berman pointed out, a lag time he called “ridiculous.”)

The flip side of the coin, of course, is making shared data usable to others. Nancy Wilkins-Diehr at the San Diego Supercomputer Center believes Gateways will be one answer.

Wilkins-Diehr described science gateways as the In-N-Out Burgers of science: they do just a few things, she explained, but do them well.

CIPRES, Cyberinfrastructure for Phylogenetic Research, is one example. Wilkins-Diehr was understandably enthusiastic about a platform that empowers both Nature articles and a 10th-grade science project.

And there’s been a proliferation of gateways, so much so that Wilkins-Diehr and colleagues have built ScienceGateways.org to collect gateways and support the researchers building them. Gateways can fall off grants and suffer from researcher transition, she explained. Wilkins-Diehr is active in national and international efforts to build and support gateways.

What she’s not interested in doing is pruning the number of gateways. There’s no reason to, she said. Well-designed gateways will be used. There’s no need to limit what’s available.

Lightning Round

Questions of management and team building even made their way into Chris Dagdigian’s Trends from the Trenches talk. Delivered with his usual caveats and lightning speed, the BioTeam co-founder ran through the technologies he found over-hyped (data lakes), growing (object storage), and paused (cloud adoption).

Dagdigian also issued an organizational warning to those working to build collaborative teams and accelerate research: make the infrastructure serve the scientists.

Find the scientists who are chained to the lab, he said. They are retention problems. They are your best and most innovative scientists; they are willing to be tied to their instruments to ask good questions. But not for long.

*Converged IT Summit, October 24-26, 2016; San Diego, Calif.