Improving Data Standardization in Neuroscience Research
Contributed Commentary by Shay Neufeld, PhD
June 17, 2022 | Recent advances in the commercialization of neuroscience data acquisition technology have enabled a quickly growing collection of massive brain activity and behavioral datasets. Within these data lie exciting promises of understanding how the brain generates behavior, and how it might be better healed when diseased or disrupted. But interpreting these large, high-dimensional datasets isn’t easy. We need modern, collaborative solutions that help scale data management, analysis, and knowledge sharing.
Acquiring high-resolution files of both brain activity and animal behavior in a variety of experimental settings can quickly amount to terabytes of data for just a single project. While universities and companies can usually provide adequate data storage, what is more difficult is organizing the data in a reliably searchable and retrievable manner. Adding to the complexity is the analysis of the data itself, which often requires computer vision, machine learning, and statistical methods to extract the relevant information. Impressive advances in these fields have yielded powerful new approaches to analyzing brain and behavior data. Unfortunately, it has been difficult to share software implementations of these innovative methods effectively, leaving them inaccessible to most scientists.
These limitations are currently overcome, to a degree, by herculean individual efforts in research labs and companies, where custom database and data science solutions enable new discoveries every year. But there is a cost to this siloed approach: discovery is both slower and costlier than it needs to be. A significant leap forward in neuroscience and neurotherapeutics could be enabled by standardizing research data management and analysis practices.
A Lack of Scalable Solutions
Without an accepted data standard and an ecosystem of software products to support it, scientists are left to manage and organize research data themselves. Sometimes this means large, coordinated efforts by biotech companies and academic initiatives to build custom solutions. Other times it means using a patchwork of products like external hard drives, spreadsheets, Google Drive, and Dropbox. With every researcher having slightly different types of data and philosophies about organizing it, what is guaranteed is that every outcome and solution will be different. As a result, it’s often very difficult to share preclinical neuroscience research data across people and groups, hindering our ability to share these massively valuable datasets for wider analysis and collaboration.
It’s not just sharing data that is jeopardized by a lack of data format standards. As the age of computing continues to bring an explosion of new statistical and machine learning methods to life, there are many talented computational neuroscientists applying these exciting new tools to analyze and interpret research data in both biotech and academic settings. The problem is that because the data are all organized differently, each method implementation makes different assumptions about, and places different requirements on, how its inputs should look.
And unfortunately, the problem doesn’t stop there. Along with expecting differently organized inputs, these custom methods also usually expect different “computing environments,” including software dependencies, access to memory, and CPU/GPU specifications. Even when scientists publish their methods and code open source, these differing inputs and compute requirements can make a method impractical for most other researchers to use.
The consequence is that brilliant innovations in analyzing and interpreting how neural activity relates to behavior are being stifled by an inability to effectively share these methods for broader use.
Toward Better Accessibility, More Collaboration
The road to accelerating discovery in preclinical and academic neuroscience starts with adopting data standards built for its particular needs and characteristics. These include large, noisy files that need computationally intensive processing; high-dimensional time series data with complex patterns and correlations; and many experimental groups and environmental variables. Part of the challenge is technical, since it’s not trivial both to store terabytes of data and to organize their complex relationships.
Thankfully, neuroscience isn’t alone in facing this problem. There are already many powerful solutions developed for managing the varied and vast amounts of data on the internet. Current solutions include relational database products like MySQL and PostgreSQL as well as non-relational (NoSQL) database products like MongoDB and Apache Cassandra. When set up and used properly, these software tools can provide quite effective, performant data management solutions. But in reality, the considerable expertise required to set up, maintain, and use these products has so far limited their broader use in neuroscience research.
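To make the idea concrete, here is a minimal sketch of how a relational database can turn a folder-and-spreadsheet patchwork into searchable experiment metadata. The schema, table, and column names below are purely illustrative assumptions, not part of any existing neuroscience standard, and the example uses Python’s built-in sqlite3 rather than a production server like PostgreSQL.

```python
import sqlite3

# Hypothetical minimal schema for indexing recording sessions.
# Table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sessions (
        session_id   INTEGER PRIMARY KEY,
        subject_id   TEXT NOT NULL,
        modality     TEXT NOT NULL,   -- e.g. 'calcium_imaging', 'ephys'
        recorded_on  TEXT NOT NULL,   -- ISO 8601 date
        file_path    TEXT NOT NULL    -- where the raw data file lives
    )
""")
conn.executemany(
    "INSERT INTO sessions (subject_id, modality, recorded_on, file_path) "
    "VALUES (?, ?, ?, ?)",
    [
        ("mouse_01", "calcium_imaging", "2022-05-01", "/data/m01/s1.tif"),
        ("mouse_01", "ephys",           "2022-05-02", "/data/m01/s2.dat"),
        ("mouse_02", "calcium_imaging", "2022-05-01", "/data/m02/s1.tif"),
    ],
)

# A question like "all imaging sessions for mouse_01" becomes one query
# instead of a manual hunt through folders and spreadsheets.
rows = conn.execute(
    "SELECT file_path FROM sessions WHERE subject_id = ? AND modality = ?",
    ("mouse_01", "calcium_imaging"),
).fetchall()
print(rows)  # [('/data/m01/s1.tif',)]
```

The value of a shared schema is precisely that every lab asking this question would write the same query; the difficulty the article describes is agreeing on, and maintaining, that schema.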
Another part of the challenge is more social, and even linguistic: agreeing on standard formatting and semantic conventions, knowing that whatever is chosen won’t likely fit anyone’s needs perfectly. There are already some early efforts, for example, Neurodata Without Borders (NWB) and DataJoint. It’s probable (and likely better for the field) that a small number of separate standards will emerge rather than just one. Interoperability between the chosen standards will be important to facilitate more collaboration and accessibility to both data and methods.
Adopting data standards could do wonders for facilitating data sharing in neuroscience, but there is perhaps just as much value in the democratization of analysis methods enabled by such data standardization. With common data formatting, computational scientists around the world can write methods that rely on the same assumptions about data inputs, instantly making them more compatible across groups, universities, and companies. Standardizing the compute environment requires some additional effort, but here, too, technology and products exist to make this happen. For example, “containerization” is a solution for defining and reproducing the entire compute environment needed to run a specific method or application. Using containerization makes it much easier to run the underlying code on different machines, which in turn makes the methods more accessible and reproducible across the community.
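As a sketch of what containerizing an analysis method looks like in practice, the Dockerfile below bundles a method’s code together with its pinned dependencies. The base image, file names, and the `analyze.py` script are hypothetical placeholders; the point is only the pattern of freezing the compute environment alongside the code.

```dockerfile
# Hypothetical container for a single analysis method; the image tag,
# dependency list, and script name are illustrative only.
FROM python:3.10-slim

# Pin the exact software dependencies the method was developed against.
COPY requirements.txt /app/requirements.txt
RUN pip install --no-cache-dir -r /app/requirements.txt

# Bundle the analysis code itself.
COPY analyze.py /app/analyze.py
WORKDIR /app

# Anyone with a container runtime can now run the method identically, e.g.:
#   docker run --rm -v /my/data:/data my-analysis /data/session_file
ENTRYPOINT ["python", "analyze.py"]
```

Because the container carries its own dependencies and runtime, a collaborator never has to reconstruct the author’s machine by hand, which is exactly the reproducibility gap described above.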
Adopting standards for data and methods alone won’t be enough. We will need a supporting ecosystem of tools to make these solutions accessible to the broad scientific community. As we make these products and services available, it’s crucial that the formats and standards stay as open as possible. This will ensure that scientists can continue to transparently innovate on, and scrutinize, the analysis methods and interpretations. It’s also critical that the solutions are usable by scientists who lack computer science expertise; such scientists make up the current majority of neurobiologists, the very people who expertly design experiments and acquire these massive datasets. To this end, investing in highly performant, intuitive software products that can centrally organize, analyze, and share data and methods will help advance the entire neuroscience field.
Shay Neufeld, PhD is the Director of Data Products and Analytics at Inscopix, where he oversees a team of computational scientists, data engineers, and software engineers that work together to create software products for managing, processing, analyzing, and visualizing neuroscience and preclinical research data. He can be reached at firstname.lastname@example.org.