Find data
Noor and Sam are looking for existing data that they can reuse in their respective projects.

To explore whether any genetic mutations are associated with an increased likelihood of being diagnosed with diabetes, Noor needs to find cohorts of individuals with this condition. Noor may browse datasets on Zenodo or in the EGA browser. Noor can also use a Beacon to query the cohorts of interest directly, and thereby identify the datasets that contain the relevant phenotypic and genomic information.

Sam may also use these resources to find synthetic datasets in order to increase the sample size for testing the tool.

Findability is the ability to discover relevant and useful data, and it is the first essential step toward data reuse in research. Despite its importance, finding data can be difficult, especially in the case of human data.
Zenodo
Zenodo is an open repository – developed under the European OpenAIRE program and operated by CERN – which allows researchers to deposit any research-related, open digital artefacts including datasets.
Try it out! Search Zenodo for “synthetic AND genetic AND human AND cohort”, filtering to only include type “Dataset”. You should find the “CINECA synthetic cohort Africa H3ABioNet v1” dataset.
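The same search can also be expressed programmatically. The following is a minimal sketch, assuming Zenodo's public REST endpoint at https://zenodo.org/api/records accepts the `q`, `type` and `size` query parameters and returns JSON with record titles under `hits.hits[].metadata.title`. The helper names `build_zenodo_search_url` and `extract_titles` are hypothetical and chosen for illustration only; the sketch builds the request URL and parses a response, but does not contact Zenodo itself.

```python
from urllib.parse import urlencode

# Public Zenodo REST endpoint for searching published records
# (assumed here; check the Zenodo API documentation before relying on it).
ZENODO_RECORDS_API = "https://zenodo.org/api/records"


def build_zenodo_search_url(query, record_type="dataset", size=10):
    """Return a search URL for `query`, restricted to one record type.

    Hypothetical helper: `type` and `size` are assumed to be the
    parameter names the Zenodo API uses for filtering and page size.
    """
    params = {"q": query, "type": record_type, "size": size}
    return f"{ZENODO_RECORDS_API}?{urlencode(params)}"


def extract_titles(response_json):
    """Collect record titles from a parsed search response.

    Assumes the response nests records under hits -> hits and each
    record carries its title under metadata -> title.
    """
    hits = response_json.get("hits", {}).get("hits", [])
    return [hit.get("metadata", {}).get("title", "") for hit in hits]


# Same query as in the "Try it out!" exercise above.
url = build_zenodo_search_url("synthetic AND genetic AND human AND cohort")
print(url)
```

To run the search, the printed URL can be opened in a browser or fetched with any HTTP client, and the parsed JSON passed to `extract_titles` to list the matching dataset titles.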
EGA and ENA
The European Genome-phenome Archive (EGA) and the European Nucleotide Archive (ENA) are repositories that provide controlled access and unrestricted access, respectively, to genetic data. While the EGA contains only human data, the ENA contains data from humans as well as from other organisms.
- Try it out! Synthetic datasets can be found in the EGA here – https://ega-archive.org/synthetic-data – including the “CINECA synthetic cohort EUROPE UK1” dataset.
- Try it out! Search ENA for “synthetic AND human” to find dozens of open synthetic human datasets.
Although each of the resources described above has its own interface for searching for synthetic human datasets, searching each one independently to find data of interest can be time-consuming. To overcome this challenge, the Beacon standard, explained in the next section, was developed as a basic data discovery protocol that facilitates federated data discovery across many resources at once.