The Replication Crisis: How Can Open Science Improve the Scale of Reproducibility?

Contributed Commentary by Evan Floden, Seqera

May 10, 2024 | For years, two of the key principles of science have been replication and the ability to produce consistent results. Yet scientists across fields are still grappling with challenges in ensuring the repeatability and reliability of their findings. This problem has come to be known as the ‘replication crisis’ or ‘reproducibility crisis’, a term that reflects growing concerns about the lack of reproducibility in scientific studies.

Addressing this crisis and ensuring public trust in science requires collaboration among many entities and groups, and a comprehensive review of current practices. At the heart of any solution, however, lies the need for greater transparency and accessibility in the tools and pipelines researchers employ. Meeting that need would represent a huge step toward solving these issues.

The Reproducibility Problem and Building Data Confidence

While replication boosts confidence in scientific results, findings often cannot be reproduced, leading to wasted time and resources. In 2016, a survey by Nature drew attention to the extent of the problem, revealing that over 70% of researchers had tried and failed to reproduce another scientist's experiments. Furthermore, more than half admitted to being unable to replicate their own experiments. The most cited challenges in replicating research were selective reporting, pressure to publish, limited resources, lack of incentives, and incomplete documentation. Reproducibility is also significantly hindered by the tendency to under-report studies that yield disappointing or insignificant results.

Another study, published in 2019, highlighted the importance of software and archival stability in omics computational tools, emphasizing that these factors are crucial for reproducibility in computational biology research. The study found that 49% of 98 software packages were hard to install and that nearly 28% of the omics resources were no longer accessible via their original URLs, reflecting challenges in long-term accessibility that impair reproducibility. The paper recommends using web services for source code hosting as a practical way to improve the availability and longevity of bioinformatics resources, thereby supporting scientific reproducibility.

Research is difficult to reproduce in highly regulated industries like healthcare and life sciences partly because data and processing tools are often isolated in silos. Securely combining large datasets with extensive pipelines is a slow process that hampers the speed of research reproduction and limits analytical capabilities. Additionally, strict data privacy laws require secure handling of these processes, which poses challenges to interoperability and to the free sharing of data across teams, external organizations, and even countries. However, recent significant improvements in tools that enhance accuracy and consistency are proving invaluable for scientists. One potential solution lies in the adoption of open-source data pipelines, which provide visibility into research data, processes, and methodologies, and ensure source code and workflows are openly accessible.

How Open Science and Open-Source Platforms Can Help Tackle the Reproducibility Crisis

With the advent of massive datasets, cloud infrastructure, and AI, questions have been asked about how confident we can be in the results scientists are producing at scale, and how secure these processes really are. As reproducibility remains a key concern in bioinformatics research, this is where open-source platforms can bridge the gap between the cost and complexity of working with data en masse and the need for results from huge datasets to be independently verified and replicated. Access to cloud-agnostic platforms for data flow and job management, together with ready-to-use resources offered to other researchers, can promote the sharing of results within the scientific community. Such sharing is crucial for conducting reproducible research, as it promotes transparency and accessibility in scientific investigations.

The scientific community is immensely important, particularly because it enables international collaboration that sees past borders and offers a platform where scientists are encouraged to share all results, including unsuccessful ones, fostering a collaborative culture. Furthermore, the most advanced data orchestration tools, being reproducible and platform agnostic, facilitate knowledge sharing among scientists, corporations, and even countries, enabling a collective response to major health challenges. This ethos of collaboration and transparency was exemplified during the COVID-19 pandemic, when Nextflow was used as a workflow manager to discover and track the Alpha, Delta, and Omicron SARS-CoV-2 variants.

Repeatable and Scalable Data Pipelines Are Essential for Research Organizations

In a report published last July, the US Government Accountability Office offered six specific recommendations, two for each of three federal science funders: the US National Institutes of Health (NIH), the US National Science Foundation (NSF), and NASA. The recommendations aim to improve scientific reproducibility by increasing transparency and rigor in scientific work. The report argued that these agencies do not widely promote best practices for data collection, sharing, and storage. The overarching aim is to increase the quality of research produced. While these three agencies have since accepted the recommendations, the report underlines a wider need for researchers in healthcare systems to have access to the right tools, which can save time and maximize resources by allowing users to capture all steps in the data processing pipeline.
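As a rough illustration of what "capturing all steps" can mean in practice, the Python sketch below records, for every step of a toy pipeline, the parameters used and a SHA-256 digest of the input and output. Everything in it is hypothetical and chosen for illustration only: the ProvenanceLog class, the run_step method, and the two toy steps are not taken from the GAO report, Nextflow, or any other tool mentioned in this article.

```python
import hashlib
import json


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


class ProvenanceLog:
    """Hypothetical helper that records what each step consumed and produced."""

    def __init__(self):
        self.records = []

    def run_step(self, name, func, data: bytes, **params) -> bytes:
        """Run func(data, **params) and log digests and parameters."""
        result = func(data, **params)
        self.records.append({
            "step": name,
            "params": params,
            "input_sha256": digest(data),
            "output_sha256": digest(result),
        })
        return result

    def to_json(self) -> str:
        """Serialize the log so it can be archived alongside the results."""
        return json.dumps(self.records, indent=2, sort_keys=True)


# Two toy steps standing in for real analysis tools (assumptions).
def uppercase(data: bytes) -> bytes:
    return data.upper()


def trim(data: bytes, length: int) -> bytes:
    return data[:length]


def run_pipeline(raw: bytes) -> ProvenanceLog:
    """Chain the toy steps, logging each one as it runs."""
    log = ProvenanceLog()
    upper = log.run_step("uppercase", uppercase, raw)
    log.run_step("trim", trim, upper, length=4)
    return log
```

Running run_pipeline twice on the same input should yield an identical log, and a second team can compare its own log against the archived one to see at which step, if any, the results diverge. Production workflow managers record far more than this (software versions, containers, compute environment), so the sketch only conveys the basic idea.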

One great example of an organization leading the way in the UK is Genomics England, established to offer diagnostics and technology to enhance genomic healthcare and streamline patient care in the NHS. This organization is revolutionizing the way healthcare can be administered. Pivotal to its success are reproducible data pipelines, which can be deployed swiftly and accurately, significantly aiding diagnosis and treatment.

Ultimately, by streamlining our workflows, simplifying software management, and ensuring scientific reproducibility, we can move closer to the goal of substantially improving the validity and quality of research. Open science, which aligns with the ethos of openness at the core of scientific research, can take us one step closer to solving the replication crisis. As more stakeholders come together to support the community, we can work toward maximizing transparency to ensure the integrity of all data.

Evan Floden is CEO and co-founder of Seqera and the open-source project Nextflow. He holds a Doctorate in Biomedicine from Universitat Pompeu Fabra (ES) for the large-scale deployment of analyses and is the author of 14 peer-reviewed articles. He can be reached at evan@seqera.io.