Snthesis Launches with Data Harmonization Lessons Learned From the Web
By Allison Proffitt
June 17, 2021 | As the life sciences industry seeks to manage—and fruitfully mine—more and more data, data harmonization is a bit of a nagging thorn.
Emerson Huitt, founder of Snthesis, saw that firsthand in his sixteen years of software development for data management.
“I saw over and over again that people who are working with life science discovery data are not happy with the solutions that exist to integrate and harmonize that data and in a lot of cases to capture that data as well,” Huitt told Bio-IT World. Both individual researchers and organizations are hampered by clunky tools, opacity in the data, and silos within the organization, Huitt said. “It becomes very difficult to form a cohesive picture of your research when you have a large team of scientists working on data.”
Editor’s Note: Huitt will be speaking at the 2021 Bio-IT World Conference & Expo.
Huitt had been building solutions, “all kind of solving for the same thing,” and knew that there remained a need in the marketplace for a data harmonization and integration platform that was source agnostic and could synthesize an organization’s research and discovery data into a single cohesive dataset for analysis. He also had insight into the ways that the advertising and financial sectors were handling similar problems with data storage and data management tools in the cloud.
Semantic Web technologies have been used since the early 2000s, Huitt said, to label website data for search engines. “Google and DuckDuckGo, for instance, are leveraging these technologies to generate those knowledge cards beside your search results,” he explained. “Essentially what they are doing is extracting data from the webpages and putting it into what’s called a knowledge graph and utilizing that to generate summary information, real information, from underlying data.”
When those technologies are applied to life sciences data, researchers can do more comprehensive searches than with a standard database, Huitt argued. “Instead of just asking questions like, ‘Can I have data that has a value in this column?’ you can ask much more interesting questions like, ‘Which samples that Bob tested last year do we have positive results for in a particular media?’ You can ask those very broad questions very easily, and it makes it possible to extract the exact dataset that scientists are looking for.”
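To make the contrast with a column lookup concrete, here is a minimal sketch of that kind of question posed against a toy knowledge graph. It is not Snthesis's implementation: the triples, the predicate names (`tested_by`, `tested_in_year`, `result`, `medium`), and the `subjects_matching` helper are all invented for illustration, and a production system would more likely use a graph database and a query language such as SPARQL.

```python
# Toy knowledge graph: (subject, predicate, object) triples.
# All sample IDs, predicates, and values are hypothetical.
TRIPLES = [
    ("sample:S1", "tested_by", "Bob"),
    ("sample:S1", "tested_in_year", 2020),
    ("sample:S1", "result", "positive"),
    ("sample:S1", "medium", "LB agar"),
    ("sample:S2", "tested_by", "Bob"),
    ("sample:S2", "tested_in_year", 2020),
    ("sample:S2", "result", "negative"),
    ("sample:S2", "medium", "LB agar"),
    ("sample:S3", "tested_by", "Alice"),
    ("sample:S3", "tested_in_year", 2020),
    ("sample:S3", "result", "positive"),
    ("sample:S3", "medium", "LB agar"),
    ("sample:S4", "tested_by", "Bob"),
    ("sample:S4", "tested_in_year", 2019),
    ("sample:S4", "result", "positive"),
    ("sample:S4", "medium", "LB agar"),
]


def subjects_matching(triples, **conditions):
    """Return the sorted subjects for which every predicate=object condition holds."""
    facts = {}
    for subject, predicate, obj in triples:
        facts.setdefault(subject, set()).add((predicate, obj))
    wanted = set(conditions.items())
    # A subject matches when all wanted (predicate, object) pairs are among its facts.
    return sorted(s for s, pairs in facts.items() if wanted <= pairs)


# "Which samples that Bob tested in 2020 were positive in LB agar?"
hits = subjects_matching(
    TRIPLES,
    tested_by="Bob",
    tested_in_year=2020,
    result="positive",
    medium="LB agar",
)
print(hits)
```

The question spans who ran the test, when, the outcome, and the medium in a single call, without the person asking needing to know which spreadsheet or system each fact originally came from.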
Harmonization, Public and Private
In 2018, Huitt took his breadth of experience and launched Snthesis, “combining those technologies and leveraging them for life science data.” The company has been publicly active this year. It is still small: Huitt expects to grow his team of eight, based in Durham, North Carolina, to 12 by the end of 2021.
The Snthesis platform ingests data from a wide range of sources: LIMS platforms, electronic lab notebooks, structured data extracted from handwritten notes, PDFs, Excel files, and EHR data, as well as output from lab instrumentation (sequencers, etc.). The data are all harmonized automatically.
The platform can also harmonize data between private and public datasets. “Say you’re linking out to BioSample or you want to look at data that’s in something like refine.bio. We can link your private data, so the information that’s in your data lake or your data warehousing systems, and link that with annotations that allow you to connect to that public data while keeping your private data in house,” Huitt explained.
“Harmonize” is the key word here. Snthesis does not manipulate, change, or move existing data. Just like the Google knowledge card example, the platform creates a knowledge graph of the data that is searchable without changing the original data.
The platform manages both text and numerical data. For text data, relationships between data points can be drawn and corrected. “Our platform can automatically correct things like column heading errors. It can also unify columns that are different abbreviations of the same unifying term. For instance, if you have ‘temperature’ and ‘temp’ and ‘degrees C’ in the temperature field, our system can unify all of that data together,” Huitt explained. This sort of qualitative relationship harmonization is particularly important, Huitt said, when you are unifying data between different scientists over time, for instance.
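As a rough illustration of the heading unification Huitt describes, the sketch below maps variant column names onto one canonical term. The synonym table and the function names are hypothetical; the article does not say how Snthesis builds or learns these mappings, and a hand-written lookup is only the simplest possible stand-in.

```python
import re

# Hypothetical synonym table: canonical term -> known variants (lowercase).
CANONICAL = {
    "temperature": {"temperature", "temp", "degrees c"},
    "sample_id": {"sample id", "sampleid", "sample"},
}


def normalize_header(header):
    """Map a raw column heading to its canonical term, or return it unchanged."""
    # Lowercase and collapse whitespace/underscores so "Degrees_C" matches "degrees c".
    key = re.sub(r"[\s_]+", " ", header.strip().lower())
    for canonical, variants in CANONICAL.items():
        if key in variants:
            return canonical
    return header  # Unknown heading: leave as-is for a person to review.


def unify_rows(rows):
    """Rename the keys of each record using normalize_header."""
    return [{normalize_header(k): v for k, v in row.items()} for row in rows]


rows = [
    {"Sample ID": "S1", "Temp": 37},
    {"sampleid": "S2", "degrees C": 25},
    {"sample": "S3", "temperature": 4},
]
unified = unify_rows(rows)
print(unified)
```

Note that the sketch only renames keys in a copy of each record, in keeping with the article's point that the original data are left untouched.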
One particular problem Huitt pointed out is when qualitative data are interpreted as numerical data—for instance how Excel handles some gene names. “We can in some cases reverse that, and in other cases, we call that to the attention of the scientist.”
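The Excel behavior in question is well documented: a gene symbol such as SEPT2 typed into a cell is converted to a date and displayed as 2-Sep. The sketch below shows one way the "reverse it or flag it" split could look. It is an assumption-laden illustration, not Snthesis's method: the `UNAMBIGUOUS_PREFIX` table and the `review_gene_cell` function are invented, and the table simply assumes that some month abbreviations point to a single gene family while others (for example 1-Mar, which both MARCH1 and MARC1 turn into) cannot be restored without a person's judgment.

```python
import re

# Matches the day-month form Excel displays for converted symbols, e.g. "2-Sep".
DATE_LIKE = re.compile(
    r"^(\d{1,2})-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$"
)

# Assumption for illustration only: months treated as having one plausible prefix.
UNAMBIGUOUS_PREFIX = {"Sep": "SEPT", "Dec": "DEC"}


def review_gene_cell(value):
    """Return (status, value); status is 'ok', 'restored', or 'needs_review'."""
    match = DATE_LIKE.match(value.strip())
    if match is None:
        return ("ok", value)  # Does not look like a date; keep it.
    number, month = match.groups()
    prefix = UNAMBIGUOUS_PREFIX.get(month)
    if prefix is None:
        # Date-like but not safely reversible: surface it to the scientist.
        return ("needs_review", value)
    return ("restored", f"{prefix}{int(number)}")


for cell in ["TP53", "2-Sep", "1-Mar"]:
    print(cell, "->", review_gene_cell(cell))
```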
For numerical data, the platform does statistical analysis, flags outliers, and does rule-based analysis of quantitative data. “What that enables us to do is process the bulk of the data in an automated fashion and surface the data that is either erroneous or unusual for humans to intervene. We’re trying to augment human data processing when people are doing complex data integration and really only focus human effort where it’s required.”
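A simple way to picture that triage is a pass that combines one rule-based check with one statistical check and returns only the values a person should look at. The sketch below is illustrative: the article does not specify which statistics or rules Snthesis applies, so the median-absolute-deviation test, the 3.5 threshold, the valid range, and the `flag_for_review` name are all assumptions.

```python
from statistics import median


def flag_for_review(values, valid_range, mad_threshold=3.5):
    """Return (index, value, reason) for values that need human attention."""
    low, high = valid_range
    med = median(values)
    # Median absolute deviation: a spread estimate that resists outliers.
    mad = median(abs(v - med) for v in values)
    flagged = []
    for i, v in enumerate(values):
        if not low <= v <= high:
            # Rule-based check: the value is physically or logically impossible.
            flagged.append((i, v, "out of range"))
        elif mad > 0 and 0.6745 * abs(v - med) / mad > mad_threshold:
            # Statistical check: modified z-score far from the rest of the data.
            flagged.append((i, v, "statistical outlier"))
    return flagged


# Hypothetical incubator readings in degrees Celsius.
readings = [36.9, 37.1, 37.0, 37.2, 36.8, 45.0, -300.0]
flagged = flag_for_review(readings, valid_range=(-273.15, 100.0))
for index, value, reason in flagged:
    print(index, value, reason)
```

Everything not returned by the function passes through untouched, so human effort goes only to the two suspicious readings rather than all seven.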
The promise of the large-scale data that’s available in life sciences is in unlocking analytics on top of that data. In order to do that, particularly if you’re doing machine learning or any sort of advanced analysis of that data, you need a very clean dataset to work with, so the output of your model is actually usable either in a clinical setting or in research to predict outcomes.
Since 2018, the company has been working privately with customers to build out and fine-tune its solution. For instance, Huitt said, Snthesis has been working on “a number of really interesting proofs of concept with large national databases,” helping to harmonize the public datasets with private clinical data.
Huitt expects Snthesis to appeal to both bench scientists and executives. “Scientists tend to use it directly. We have a pretty comprehensive search engine that is built into the platform that’s available for bench scientists to use. That allows them to very comprehensively pull out datasets for their analyses. In my experience throughout my career, bench scientists like to be really hands-on with the data,” he said.
Executives tend to interact with the knowledge graph via an API, either developed with Snthesis for clients or built by internal data science teams. “The real selling point of that is that data that was previously locked up in this large, disaggregated workstream of not well-integrated data becomes readily available,” he said. “It becomes much easier to build that overview, 30,000-foot status dashboards for even the most complicated research pipeline.”
Huitt also expects the Snthesis offering to appeal to both brand new and well-established research groups and companies. Startups with small teams and tons of flat files who are trying to scale a business where most of the data are in spreadsheets will benefit from the harmonization of their data to make strategic growth decisions. Really large customers, on the other hand, are likely burdened by many legacy systems and deep data archives, but harmonization can be equally essential to strategic decision-making for them as well.
“One of the things that’s interesting as we’ve developed this platform is the realization that scientists, historically, have not had good access to a unified view of their data across different teams,” Huitt said. “A lot of the questions that you can facilitate with that unified access to data are things that, especially businesses engaged in research, haven’t been able to answer very well. We find that we’re unlocking a lot of different possibilities for research teams.”