Trends from the Trenches Part 2: Data Structures, Habits, and Teams
By Allison Proffitt
July 20, 2023 | After exploring the infrastructure upheaval of the past year, the annual Trends from the Trenches session at Bio-IT World moved to questions of data and teams. Ari Berman, BioTeam CEO, and Adam Kraut, director of infrastructure and cloud architecture, took on our data structures, habits, and what’s needed to build teams equipped to handle it all.
When it comes to data, AI is, of course, at the top of everyone’s mind. In fact, Adam Kraut sees a lot of comparisons between AI today and the cloud between 2008 and 2014. Then, people viewed the cloud as an easy way out of their capacity planning problem. “In the same way, I think people look at AI as a way out of their data problems,” he said.
But AI is not a quick fix or an easy button. Instead, it’s a developing technology for which most of us will be consumers, not producers. “You are not leading the AI revolution; you are merely a participant,” Kraut said, “and that’s ok.”
Kraut shared advice on the winning and losing strategies—crediting Fernanda Foertter, a past Trends speaker, for much of it. The losing AI strategies, Kraut said, include acquiring AI startups, ignoring ethics and bias in your data, adding complexity to your processes, starting with petabytes of historical data and hoping AI will fix it, deploying AI into customer-facing products first, trying to solve broad problems, and outsourcing competency. Instead, Kraut (and Foertter) recommends hiring data curators, building ways to tag data with existing processes, using data platforms, solving internal problems within the organization, sharing data and building partnerships, and reusing model architectures.
We’re transitioning from the Big Data Age to an Analytics Age, said Berman, where we need to gain value from all the data we’ve been generating. Every company wants to be AI-ready, but AI in life sciences is still really two to five years away from productivity, he said. In the meantime, it seems much of the work before us is making our data ready for AI, even as the various AI technologies—machine learning, natural language processing, neural nets, large language models, robotics, image recognition, and more—mature.
FAIR data goals (making data findable, accessible, interoperable, and reusable) should help with that, but FAIR data efforts, Berman noted, have not so far been successful. There are pockets of excellence, but the problem, in his estimate, is people, not technology. FAIR data is complex and must be a community effort. Unless the community as a whole is incentivized, rather than incentives being tied to individual NCI grants or personal KPIs within a company, progress will remain blocked. The truth is, Berman said, it pays to create a new ontology or metadata cache rather than to share funding with groups of collaborators or refine existing offerings.
Community-wide change will require standards, and while there are plenty of data standards in biomedicine and life sciences, there is a lack of unified standards. It’s very difficult to combine data from multiple sources for deeper analysis; historic data are locked in publications; and different systems store and make data available differently, Berman said.
Data platforms (lakes, oceans, fogs, swamps, warehouses, commons, meshes, fabrics, etc.) are all simply more “silos of excellence,” Berman said. And while the broad research space does need a diversity of tools and approaches, they should also be cross-compatible. The “I” in FAIR, interoperability, isn’t there yet, he said.
Platform-building is still relevant, Kraut added, though we have dozens of data environments. Now we must prioritize interoperability and integration to power insights. Our science will be rate-limited by the connectedness of the data, he said. “The more connected the data is, the better it is going to be applied in an AI/ML context.”
In Kraut’s view, a healthy data ecosystem or platform is characterized by data findability and ease of exploration, a culture of data citizenship and stewardship, common languages and standardization, automation that enables quick iteration, and integrity of data generation with AI/ML in mind. Key to integrity in data generation is data management. Discipline equals freedom, Kraut said, advocating for data dictionaries, data harmonization, data cleaning, and other organizational tasks. Embrace the pain, he advised.
But possibly a more fundamental problem, Berman said, is that we still struggle to understand the value of our data, vacillating between valuing it based on what it cost to generate or how much we could sell it for and valuing it according to its scientific impact, how long we can use it, and how much we’re willing to pay to store it. Without a clear understanding of data value, everything is high value, he warned. We hoard data and can’t prioritize our data investments.
Kraut recommends demonstrating the value of your data by using it to solve your own, real-life struggles first. Users are experts in their problems and builders are experts in solutions, he said, quoting a previous Bio-IT World Conference presentation from Recursion Pharmaceuticals. “This is a stay-in-your-lane thing. We all want to think we’re experts in things we are not,” he said. Instead, he advises building cross-disciplinary teams of experts.
And where do those teams come from? Kraut warns against outsourcing all your expertise. Instead, invest in ongoing, cooperative learning within the organization, aiming for cross-disciplinary knowledge, problem-solving, collaboration, and adaptability. Borrowing an idea from jiujitsu training, he also challenged the audience to keep it playful: enough of a challenge to learn, but enough fun to enjoy. Think broadly, he said, including internal expertise from networking, security, infrastructure, biology, chemistry, data science, and software engineering.
Finally: “We’re a consulting company, but stop hiring so many consultants! Train your people! Insource your skills!” Kraut challenged.