Datasets
Curated datasets are publications tagged, either computationally or manually by a curator, as being relevant to a specific area of biology.
These are actively maintained and grow with every release.
New datasets can be requested, if relevant to your work, by mailing intact-help@ebi.ac.uk.
Manually selected datasets
-
Affinomics
- Interactions curated for the Affinomics consortium.
This dataset contains interactions derived in the context of the EU Affinomics project (Grant number 241481). It comprises interactions submitted directly by the consortium partners as well as interactions derived from the literature. The current focus is on interactions derived by Proximity Ligation Assay (MI:0813), a method pioneered by the group of consortium partner Ulf Landegren.
-
Alzheimers
- Interaction dataset based on proteins with an association to Alzheimer disease.
This dataset was compiled and curated in collaboration with Perreau V.M., University of Melbourne, Australia. Interactions were investigated in the context of Alzheimer's disease, with a particular focus on the APP (A4) protein. The articles to be curated were selected based on protein annotations and literature scanning.
Publications based on this dataset: PMID: 20391539
-
BioCreative
- Critical Assessment of Information Extraction systems in Biology.
The BioCreative dataset is a large set of publications from the Journal of Biological Chemistry (2006) and the Nature publishing house that were manually curated by IntAct curators. This dataset was used in BioCreative II (Critical Assessment of Information Extraction systems in Biology): Protein-Protein Interaction Task. The task focused on the prediction of protein interactions from full text articles, which are represented in the BioCreative dataset. The dataset provided by IntAct is a resource for text mining development and testing. A data file (source-text.txt) that maps IntAct interactions to the sentence(s) of the publication that allowed an IntAct curator to identify the interaction is available here.
Publications based on this dataset: PMID: 18834496, 18834487, 19208158
-
Cancer
- Interactions investigated in the context of cancer.
This dataset consists of interactions of proteins that are involved in cancer. A literature survey is carried out on an ongoing basis to identify publications of interest. Protein annotations are also considered when choosing the publications to be curated.
-
Cardiac
- Interactions involving cardiac related proteins.
A collection of interactions relating to proteins identified as associated with the cardiovascular system. These annotations create a PPI network which can be used to advance the understanding of protein interactions within the cardiovascular system. The gene lists were assembled by the Cardiovascular Gene Annotation group at University College London, and the dataset was also largely assembled by that group, funded by the British Heart Foundation grant RG/13/5/30112, with the help of IntAct curators located at the EBI. This work is a collaboration with the Cardiac Proteomics and Signalling Laboratory at UCLA, funded by NHLBI Proteomics Center Award HHSN268201000035C.
-
Chromatin
- Epigenetic interactions resulting in chromatin modulation.
Chromatin relevant protein-protein interaction studies have been curated by IntAct curators from peer reviewed literature. These comprise interactions which are involved in modulating, modifying or forming chromatin. This dataset aims at capturing major epigenetic interactions resulting in chromatin modulation. Most of the publications were derived from 'Chromatin Papers ListServe' maintained by Bone J.
-
Coronavirus
- Interactions investigated in the context of Coronavirus.
Dataset of molecular interactions extracted from publications involving viral proteins from the Coronaviridae family and human proteins, along with a certain proportion of other model organisms. The data features mostly protein-protein and some RNA-protein interactions and covers SARS-CoV2 and SARS-CoV primarily, with some interactions from other members of Coronaviridae as well. Interactions between human proteins relevant for SARS-CoV2 infection have also been included.
-
Crohn's disease
- Interactions of proteins identified as having a link to Crohn's disease.
Interactions extracted from publications focused on proteins involved in Crohn's disease.
-
Cyanobacteria
- Interaction dataset based on Cyanobacteria proteins and related species.
This dataset was obtained in a collaborative effort with Franck Chauvat, Corinne Cassier-Chauvat, Jean-Christophe Aude, Magali Michaut, and Pierre Legrain from DBJC, CEA Saclay, Gif-Sur-Yvette, France. Cyanobacteria like Synechocystis sp. can be used as model organisms as they undergo both oxidative respiration and photosynthesis. Cyanobacteria have many features common to bacteria, including a lack of compartmentalisation. This dataset is used to gather articles showing interactions relevant to plant photosynthesis, redox metabolism, and resistance to metal and oxidative stress. Most interactors belong to Cyanobacteria species, with a focus on Synechocystis sp. (strain PCC 6803), TaxID 1148, but some interactors belong to species in which biological events seen in Cyanobacteria may also occur, such as plants for photosynthesis. The dataset also contains a number of hybrid experiments using electron transfer between proteins from different species.
Publications based on this dataset: PMID: 18508856
-
Diabetes
- Interactions investigated in the context of Diabetes.
This dataset consists of interactions of proteins that are involved in diabetes.
-
Huntington's
- Publications describing interactions involved in Huntington's disease.
Interactions extracted from publications focused on proteins involved in Huntington's disease.
-
IBD
- Inflammatory bowel disease
Interactions extracted from publications focused on proteins involved in Inflammatory Bowel Disease.
-
Neurodegeneration
- Publications depicting interactions involved in neurodegenerative diseases
Interactions extracted from publications focused on proteins involved in neurodegenerative diseases.
-
Parkinsons
- Interactions investigated in the context of Parkinsons disease
Interactions were investigated in the context of Parkinson's disease, with a particular focus on the LRRK2 protein, and were derived in the context of The Michael J. Fox Foundation for Parkinson's Research LRRK2 Biology LEAPS Award 2012.
-
Rare diseases
- Interactions investigated in the context of Rare genetic disease
This is a dataset of molecular interactions extracted from publications focused on the study of any rare disease. 'A rare disease' is defined according to the European Union Regulation on Orphan Medicinal Products (1999): a disease that affects not more than 1 person per 2000 in the European population. The dataset is enriched with the experimentally proven impact of clinical mutations on interactions, and also with non-clinical mutations that are found to have an effect on protein functionality.
-
Ulcerative colitis
- Interactions of proteins identified as having a link to ulcerative colitis
Interactions extracted from publications focused on proteins involved in ulcerative colitis.
Computationally maintained datasets
These datasets are computationally maintained but additional papers may be manually added to this set by a curator during the curation process. When datasets are computationally added to a publication, the large scale papers (more than 100 interactions per experiment) are excluded.
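The exclusion rule above can be sketched in a few lines. This is an illustration only, not IntAct's pipeline: the `Experiment` record and the function name are hypothetical, and the sketch assumes a publication counts as large scale as soon as any one of its experiments reports more than 100 interactions.

```python
from dataclasses import dataclass

# Threshold taken from the text: more than 100 interactions per experiment.
LARGE_SCALE_THRESHOLD = 100


@dataclass
class Experiment:
    """Hypothetical record used for this sketch; not an IntAct data structure."""
    publication_id: str
    interaction_count: int


def eligible_publications(experiments):
    """Return publication ids that could be tagged automatically.

    Assumption: a publication is skipped if any of its experiments
    exceeds the threshold. Order of first appearance is preserved.
    """
    large_scale = {
        e.publication_id
        for e in experiments
        if e.interaction_count > LARGE_SCALE_THRESHOLD
    }
    eligible = []
    for e in experiments:
        if e.publication_id not in large_scale and e.publication_id not in eligible:
            eligible.append(e.publication_id)
    return eligible


# Placeholder publication ids, not real PubMed identifiers.
experiments = [
    Experiment("pub-A", 3),
    Experiment("pub-B", 250),
    Experiment("pub-B", 2),
    Experiment("pub-C", 100),
]
print(eligible_publications(experiments))
```

Papers skipped by such a rule can still be added to a dataset manually by a curator, as noted above.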
-
AFCS
- Interactions from the Alliance for Cell Signaling database
This dataset was obtained from the Alliance for Cellular Signaling database. The Alliance for Cellular Signaling (AfCS) consisted of around 20 institutions which were engaged in a collaborative effort to investigate and understand cellular signalling networks (http://www.afcs.org/). The AfCS used high-throughput methods to detect protein-protein interactions between signaling molecules expressed in B cells and cardiac myocytes. The AfCS arranged a collaboration with Myriad Genetics to perform large-scale yeast two-hybrid screens. IntAct acted as a data repository for protein-protein interaction data generated by the AfCS project.
-
Apoptosis
- Interactions involving proteins with a function related to apoptosis
Datasets of apoptosis relevant protein-protein interaction studies are curated by IntAct curators from peer reviewed literature. These datasets are a resource for biologists seeking to understand protein interaction networks and cell death. Small-scale interactions involving proteins annotated with the GO term "Apoptosis" are included in this set.
-
Archaea
- Interaction dataset based on Archaea proteins
Archaea are phylogenetically very different from Bacteria and Eukarya and show many differences in their biochemistry from other forms of life. Because this was considered of interest, curated peer reviewed literature is scanned for interactions involving proteins from this group.
-
PDBe
- Data obtained from the Protein Data Bank in Europe
The Protein Data Bank in Europe (PDBe) is the European project for the collection, management and distribution of data about macromolecular structures, in collaboration with the Worldwide Protein Data Bank (wwPDB). IntAct has incorporated a subset of the data from this database involving heterodimeric protein interactions.
-
NDPK
- Interactions involving proteins containing InterPro domain IPR001564, Nucleoside diphosphate kinase, core.
NDPKs, which play a major role in the synthesis of nucleoside triphosphates other than ATP, also possess other enzymatic activities and are required for cell proliferation, differentiation and development.
Publications based on this dataset: PMID: 19415463
-
Synapse
- Interactions of proteins with an established role in the presynapse.
This dataset has been created for protein-protein interactions involving at least one protein with an established link to the synapse. The list of human, rat and mouse gene names used for computationally maintaining this dataset is available here. Interactions made by orthologous proteins have been added manually by IntAct curators.
-
Virus
- Publications including interactions involving viral proteins.
The MINT and HPIDb databases are major contributors to this dataset.
Species-based datasets
Species specific datasets are generated from the protein-protein interaction data curated from peer reviewed journals and are available here. The data are based on the taxonomy of the proteins taking part in the interaction. Analysis of one such dataset, which involved Arabidopsis proteins, has been discussed in PMID: 20371643.
Mutations influencing interactions dataset:
Download
The latest mutation dataset is available here: mutations.tsv
Introduction
This dataset is the result of the deep-curation policies used within the IMEx Consortium, and it provides a set of protein sequence changes (mutations) and their effect on interaction outcome. The information given in the dataset expands on what can be found in the "Feature(s) interactor A/B" columns in the PSI-MITAB 2.7 standard format and within the attribute "feature" in PSI-MI-XML format.
Dataset description
This dataset contains over 28,000 instances where mutations have been experimentally shown to affect a protein interaction. There are several types of mutations covered, the terms used have been described in the PSI-MI controlled vocabularies, accessible at www.ebi.ac.uk/ols4/ontologies/mi:
- Mutation (MI:0118): A change in a sequence or structure in comparison to a reference entity due to an insertion, deletion or substitution event. This root term is used when there is a mutation present in a protein and the wild type version has not been tested or shown to interact in the referenced paper.
- Mutation causing an interaction (MI:2227): A change in a sequence or structure in comparison to a reference entity due to an insertion, deletion or substitution event that enables an interaction when compared with the wild-type, which does not interact.
- Mutation decreasing interaction (MI:0119): Region of a molecule whose mutation or deletion decreases significantly interaction strength or rate (in the case of interactions inferred from enzymatic reaction).
  - Mutation decreasing interaction rate (MI:1130): Region of a molecule whose mutation or deletion decreases significantly interaction rate (in the case of interactions inferred from enzymatic reaction).
  - Mutation decreasing interaction strength (MI:1133): Region of a molecule whose mutation or deletion decreases significantly interaction strength.
  - Mutation disrupting interaction (MI:0573): Region of a molecule whose mutation or deletion totally disrupts an interaction strength or rate (in the case of interactions inferred from enzymatic reaction).
    - Mutation disrupting interaction rate (MI:1129): Region of a molecule whose mutation or deletion totally disrupts an interaction rate (in the case of interactions inferred from enzymatic reaction).
    - Mutation disrupting interaction strength (MI:1128): Region of a molecule whose mutation or deletion totally disrupts an interaction strength.
- Mutation increasing interaction (MI:0382): Region of a molecule whose mutation or deletion increases significantly interaction strength or rate (in the case of interactions inferred from enzymatic reaction).
  - Mutation increasing interaction rate (MI:1131): Region of a molecule whose mutation or deletion increases significantly interaction rate (in the case of interactions inferred from enzymatic reaction).
  - Mutation increasing interaction strength (MI:1132): Region of a molecule whose mutation or deletion increases significantly interaction strength.
- Mutation with no effect (MI:2226): A change in a sequence or structure in comparison to a reference entity due to an insertion, deletion or substitution event that does not have any effect on an interaction when compared with the wild-type.
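When filtering the dataset it can help to treat these terms as a small hierarchy, so that a query for "decreasing" also picks up the more specific rate and strength terms. The sketch below is one way to do that. The parent links are an assumption of this example, based on MI:0118 being described as the root term and on how the specific terms group under the general ones (including placing the disrupting terms under MI:0119); the PSI-MI ontology itself is the authoritative source.

```python
# Child -> parent links between the mutation terms listed above.
# Assumption of this sketch; check the PSI-MI ontology before relying on it.
PARENT = {
    "MI:2227": "MI:0118",  # mutation causing an interaction
    "MI:0119": "MI:0118",  # mutation decreasing interaction
    "MI:1130": "MI:0119",  # mutation decreasing interaction rate
    "MI:1133": "MI:0119",  # mutation decreasing interaction strength
    "MI:0573": "MI:0119",  # mutation disrupting interaction
    "MI:1129": "MI:0573",  # mutation disrupting interaction rate
    "MI:1128": "MI:0573",  # mutation disrupting interaction strength
    "MI:0382": "MI:0118",  # mutation increasing interaction
    "MI:1131": "MI:0382",  # mutation increasing interaction rate
    "MI:1132": "MI:0382",  # mutation increasing interaction strength
    "MI:2226": "MI:0118",  # mutation with no effect
}


def is_a(term, ancestor):
    """Return True if `term` equals `ancestor` or descends from it."""
    while term is not None:
        if term == ancestor:
            return True
        term = PARENT.get(term)  # None once the root (MI:0118) is passed
    return False


# Example: every term that counts as "decreasing" under this sketch.
decreasing = sorted(t for t in PARENT if is_a(t, "MI:0119"))
print(decreasing)
```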
Output format description
We provide a tab-delimited flatfile download with the full dataset. Every instance of a mutation affecting a given interaction is given a unique identifier (“Feature AC”). Please note that some mutations affect more than one residue position range in non-consecutive order. Each of these non-consecutive ranges is recorded on a separate line, so the number of lines in the file will always differ from the number of unique mutations. Here is a brief description of the different columns used in the file:
- Feature AC: Accession number for that particular mutation feature.
- Feature short label: Human-readable short label summarizing the amino acid changes and their positions, compliant with the Human Genome Variation Society recommendations (see http://varnomen.hgvs.org/recommendations/protein/).
- Feature range(s): Position(s) in the protein sequence affected by the mutation.
- Original sequence: Wild type amino acid residue(s) affected, in one letter code.
- Resulting sequence: Replacement sequence (or deletion) in one letter code.
- Feature type: Mutation type, following the PSI-MI controlled vocabularies as stated above.
- Feature annotation: Specific comments about the feature that can be of interest.
- Affected protein AC: Affected protein identifier (preferably UniProtKB accession, if available).
- Affected protein symbol: As given by UniProtKB.
- Affected protein full name: As given by UniProtKB.
- Affected protein organism: TaxID and species name as given by UniProtKB.
- Interaction participants: Identifiers for all participants in the affected interaction, along with their species and molecule type between brackets.
- PubMedID: Reference to the publication where the interaction evidence was reported.
- Figure legend: Reference to the specific figures in the paper where the interaction evidence was reported.
- Interaction AC: Interaction accession within our databases. This can be used to obtain further information about the interaction.
Further columns will be added once the dataset is enriched with cross-references to other databases such as UniProt and Ensembl.
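Because a mutation with several non-consecutive ranges spans several lines, counting mutations means grouping lines by "Feature AC" rather than counting rows. The following is a minimal sketch of that grouping using Python's standard `csv` module. It assumes the file's header row uses the column names exactly as listed above, which should be checked against the actual download; the sample rows are made-up placeholders, not real IntAct records.

```python
import csv
import io
from collections import defaultdict

# Made-up sample with a subset of the columns described above.
# Accessions, labels and ranges are placeholders for illustration.
SAMPLE = (
    "Feature AC\tFeature short label\tFeature range(s)\tFeature type\n"
    "EBI-0000001\tp.[Ala10Gly;Ser20Thr]\t10-10\tmutation(MI:0118)\n"
    "EBI-0000001\tp.[Ala10Gly;Ser20Thr]\t20-20\tmutation(MI:0118)\n"
    "EBI-0000002\tp.Leu152Glu\t152-152\tmutation decreasing(MI:0119)\n"
)


def ranges_by_feature(handle):
    """Group the range(s) of each mutation under its Feature AC."""
    # QUOTE_NONE keeps any quote characters in free-text columns literal.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    grouped = defaultdict(list)
    for row in reader:
        grouped[row["Feature AC"]].append(row["Feature range(s)"])
    return dict(grouped)


grouped = ranges_by_feature(io.StringIO(SAMPLE))
print(len(grouped), "unique mutations across", sum(map(len, grouped.values())), "lines")
```

To run this on the real file, replace `io.StringIO(SAMPLE)` with an open file handle, for example `open("mutations.tsv", newline="", encoding="utf-8")`.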
Mutations dataset format example
| Feature AC | Feature short label | Feature range(s) | Original sequence | Resulting sequence | Feature type | Feature annotation | Affected protein AC | Affected protein symbol | Affected protein full name | Affected protein organism | Interaction participants | PubMedID | Figure legend | Interaction AC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EBI-1203484 | p.Asp3702_Tyr3703delinsAlaAla | 3702-3703 | DY | AA | mutation decreasing strength(MI:1133) | | uniprotkb:P20659 | TRX | Histone-lysine N-methyltransferase trithorax (EC 2.1.1.43) (Lysine N-methyltransferase 2A) | 7227 - Drosophila melanogaster (Fruit fly) | uniprotkb:P20659 (protein, 7227 - Drosophila melanogaster (Fruit fly)) | 10656681 | 5A | EBI-1203452 |
| EBI-709274 | p.Leu152Glu | 152-152 | L | E | mutation decreasing(MI:0119) | | uniprotkb:O43521 | B2L11 | Bcl-2-like protein 11 (Bcl2-L-11) (Bcl2-interacting mediator of cell death) | 9606 - Homo sapiens | uniprotkb:O43521(protein, 9606 - Homo sapiens);uniprotkb:P97287(protein, 10090 - Mus musculus) | 15694340 | | EBI-709266 |
| EBI-10762087 | p.Pro116_Gly117insXaa | 116-117 | PG | PXG | mutation(MI:0118) | comment:Feature - insertion of crosslinkable amino acid p-benzoyl-L-phenylalanine (pBpa) | uniprotkb:P69411 | RCSF | Outer membrane lipoprotein RcsF | 83333 - Escherichia coli (strain K12) | uniprotkb:P69411 (protein, 83333 - Escherichia coli (strain K12));uniprotkb:P0A940 (protein, 83333 - Escherichia coli (strain K12)) | 25525882 | Fig. 2C Supp Fig.3D | EBI-10761554 |
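The short labels in the example can be related back to the one-letter "Original sequence" and "Resulting sequence" columns. The sketch below handles only the simplest case, a single-residue substitution such as p.Leu152Glu; deletion-insertions and insertions like the other two example rows are deliberately left unparsed. The function name and regular expression are illustrative and are not part of any IntAct tool.

```python
import re

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

# Matches labels of the form p.<Ref><position><Alt>, e.g. p.Leu152Glu.
SUBSTITUTION = re.compile(r"^p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})$")


def parse_substitution(label):
    """Return (original, position, resulting) in one-letter code.

    Returns None for anything that is not a simple substitution
    between two standard amino acids.
    """
    match = SUBSTITUTION.match(label)
    if match is None:
        return None
    ref, position, alt = match.groups()
    if ref not in THREE_TO_ONE or alt not in THREE_TO_ONE:
        return None
    return THREE_TO_ONE[ref], int(position), THREE_TO_ONE[alt]


print(parse_substitution("p.Leu152Glu"))
print(parse_substitution("p.Asp3702_Tyr3703delinsAlaAla"))
```

For the p.Leu152Glu row, the parsed result agrees with the L and E values in the sequence columns and with the 152-152 range.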




