Life Science datasets

SD ID: Leveraging EOSC to offload updating and standardizing life sciences datasets and to improve studies reproducibility, reusability and interoperability

Organisations & Contacts:

Jordi Rambla, Centre for Genomic Regulation (CRG)

Cedric Notredame, Centre for Genomic Regulation (CRG)

Erik van den Bergh, EBI

Matthew Viljoen, EGI

OVERVIEW: This demonstrator will leverage EOSC resources to enhance the scientific reproducibility of datasets uploaded to the European Genome-phenome Archive (EGA). In doing so, the new dataset will also be made available in a FAIR manner, with metadata added for the attributes chosen as contributing most strongly to the FAIR principles. Pipelines will be developed as part of this demonstrator to automate the latter process. This pilot will have a pragmatic impact by demonstrating how to make analyses portable (tools and workflows), how to increase findability, how to leverage security technologies for sensitive data, how to deploy the workflow into a cloud and how to make data FAIR. It will also have a long-term impact by increasing the usability of EGA-hosted data, assuring potential users that up-to-date versions of an assured quality are available to download.

SCIENTIFIC OBJECTIVES OF THE DEMONSTRATOR:

The European Genome-phenome Archive (EGA) (https://ega-archive.org) is a repository that facilitates access to, and management of, bio-molecular data for long-term archival. The main objectives of this SD are to enhance the reproducibility of data analyses and to explore new added-value services by leveraging EOSC resources. Applying the FAIR principles (Findability, Accessibility, Interoperability and Reusability) to our datasets and their associated information is a mission we have accepted from the community.

  • A set of results data has been reproduced using a portable version of the pipeline.
  • The same result set has been updated by re-analyzing it with a current version.
  • FAIRified metadata on both result sets is available at a testing EGA server and/or at an appropriate repository.

MAIN ACHIEVEMENTS:

  • A set of results data has been reproduced using a portable version of the pipeline.
  • The same result set has been updated by re-analyzing it with a current version of the pipeline and the reference data.
  • FAIRified metadata on both result sets is available at a testing EGA server and/or at an appropriate repository.
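
As an illustration of the last point, the sketch below shows one way metadata for the two result sets could be linked, so that the updated set points back to the one it was derived from. It is a minimal sketch only: the function name, the field names and the identifier strings are all hypothetical placeholders, and they do not reflect the EGA metadata schema or the metadata attributes actually chosen by the demonstrator.

```python
import json


def build_result_record(identifier, title, pipeline_version, reference_data,
                        derived_from=None):
    """Build a minimal, illustrative metadata record for one result set.

    All field names are assumptions made for this sketch, not an EGA schema.
    """
    record = {
        "identifier": identifier,            # placeholder for a persistent identifier
        "title": title,
        "pipeline_version": pipeline_version,
        "reference_data": reference_data,
    }
    if derived_from is not None:
        # Link an updated result set back to the result set it re-analyzes.
        record["derived_from"] = derived_from
    return record


# Hypothetical example values; none of these identifiers or versions are real.
original = build_result_record(
    identifier="example:result-set-1",
    title="Reproduced result set",
    pipeline_version="pipeline-v1",
    reference_data="reference-v1",
)
updated = build_result_record(
    identifier="example:result-set-2",
    title="Updated result set",
    pipeline_version="pipeline-v2",
    reference_data="reference-v2",
    derived_from=original["identifier"],
)

print(json.dumps([original, updated], indent=2))
```

Recording the pipeline and reference data versions next to each identifier is one simple way to let a user tell the reproduced result set apart from the updated one.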

IMPACT: This pilot will have a pragmatic impact by demonstrating how to make analyses portable (tools and workflows), how to increase findability by using persistent identifiers, how to leverage security technologies for sensitive data, how to deploy the workflow into a cloud and how to make data FAIR. It will also have a long-term impact by increasing the usability of EGA-hosted data, assuring potential users that up-to-date versions of an assured quality are available to download.

The success of the project will be monitored using well-defined use cases and by ensuring their reproducibility across sites and platforms. This monitoring will occur through space (i.e. across sites) and time (i.e. reproduction and updating of existing results).
The potential scientific and socio-economic impact is extremely significant at a time when in silico analyses are being routinely deployed in a medical context, with this approach expected to dominate so-called precision medicine in the next decade.
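
One simple way to check reproducibility across sites, as described above, is to compare the output files of two runs by content digest. The following is a minimal sketch under that assumption; the function names and the directory layout are hypothetical and are not part of the demonstrator's tooling. A byte-for-byte comparison is also stricter than many pipelines allow, since outputs that embed timestamps or run identifiers will differ even when the scientific content is the same.

```python
import hashlib
from pathlib import Path


def digest_result_set(directory):
    """Map each file's relative path to the SHA-256 digest of its contents."""
    root = Path(directory)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            name = path.relative_to(root).as_posix()
            digests[name] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def compare_result_sets(reference_dir, reproduced_dir):
    """Report files that are missing, extra, or changed in the reproduced run."""
    reference = digest_result_set(reference_dir)
    reproduced = digest_result_set(reproduced_dir)
    shared = set(reference) & set(reproduced)
    return {
        "missing": sorted(set(reference) - set(reproduced)),
        "extra": sorted(set(reproduced) - set(reference)),
        "changed": sorted(n for n in shared if reference[n] != reproduced[n]),
    }
```

A report with three empty lists means the two runs produced identical files. The same comparison can be repeated over time, for instance against a run made with an updated pipeline, in which case entries under "changed" are expected and show which outputs the update affected.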

RECOMMENDATIONS FOR THE IMPLEMENTATION

Possessing huge amounts of data is a first step, but it is not enough to achieve the main goal: fostering research. The bio-molecular data that repositories currently store needs to be made more useful. The EOSC project is a unique framework for adding this necessary layer of standardization and interoperability while unifying the files and their associated pipelines and making them discoverable.