Curation Criteria and Data Submission Guide

This page describes the criteria for publications that are curated into the database. If any of the information in this document is unclear, or if you have questions about why a particular paper was not included in the database, please email gtc@ebi.ac.uk.

Paper curation

The flow chart below is used to assess whether a paper contains sufficient information to be included in the database.

[Figure: accept_paper_flowchart.png, the flow chart used to assess papers for curation]

If the paper fulfils the criteria in the flow chart, it is potentially suitable for curation. Supplementary files are also taken into consideration.

Additional requirements for curation are given below. These criteria aim to standardise incoming papers and to aid programmatic accession:

  • The species in the paper must have a sequenced genome in Ensembl or Ensembl Genomes.
  • The gene targets must be mapped to the sequenced genome.
  • Cell lines must be described in one of the ontologies available in the Ontology Lookup Service. This includes the potential for compound terms for primary cell lines.

If a paper is found to fulfil these criteria, the required information can be entered into an Excel file to be programmatically entered into the database.

If at any point it becomes clear that a paper does not meet the criteria listed here, or lacks key information required to make the data comparable and repeatable, it will be deemed low priority. Low priority papers are not entered into the database, independent of the level of curation already carried out. If further information becomes available, for example through author submission, the paper can be included in the database.

Data Submission Guide

Data can be submitted for the database in the form of an Excel spreadsheet with metadata related to the publication and the experiments contained within it.

The first step is to download the curation form and change the file name to include the surname of the first author, the PubMed ID of the paper and the year of publication. These should be separated by underscores for ease of programmatic analysis. The PubMed ID (PMID) can be found by searching for the paper in the Europe PMC database.

For example:

Jinek, M., Chylinski, K., Fonfara, I., Hauer, M., Doudna, J. A., & Charpentier, E. (2012). A Programmable Dual-RNA–Guided DNA Endonuclease in Adaptive Bacterial Immunity. Science, 337, 816–822.

Would be given the file name:

Jinek_22745249_2012.xlsx

Please do not include the names of multiple first authors or use any accented letters or apostrophes. Please write compound surnames in Pascal case, for example van der Waals would be written as VanDerWaals, and Barré-Sinoussi would be written as BarreSinoussi.
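
The naming rules above can be sketched in a few lines of Python. This is only an illustration, not part of the curation tooling: the function name curation_file_name and its parameters are hypothetical, and it assumes the surname of the first author has already been picked out by hand. Letters that do not decompose into a base letter plus an accent (for example "ø") are left unchanged and would still need to be replaced manually.

```python
import re
import unicodedata


def curation_file_name(surname: str, pmid: str, year: int) -> str:
    """Build '<Surname>_<PMID>_<year>.xlsx' (hypothetical helper)."""
    # Decompose accented letters (e.g. 'é' becomes 'e' plus a combining
    # accent), then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", surname)
    unaccented = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Remove straight and curly apostrophes.
    unaccented = unaccented.replace("'", "").replace("\u2019", "")
    # Split compound surnames on spaces and hyphens.
    parts = [p for p in re.split(r"[\s\-]+", unaccented) if p]
    # Pascal case: capitalise the first letter of each part, keep the rest.
    pascal = "".join(p[0].upper() + p[1:] for p in parts)
    return f"{pascal}_{pmid}_{year}.xlsx"
```

For the example paper, curation_file_name("Jinek", "22745249", 2012) gives Jinek_22745249_2012.xlsx.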

The aim of the curation process is to complete as many of the following fields as possible using the information found in the main text and supplementary information of the publication. Fields in which a response is required are indicated with an asterisk and also highlighted in red in the spreadsheet.

Where the letter D is shown in the second column, this indicates that there is a drop-down box containing options. Drop-down lists have been populated using data from papers that have already been curated, so new options need to be added regularly. Where an answer is required that is not offered by the drop-down box, please email us at gtc@ebi.ac.uk and the option will be added.

1. Publication

This section of the form records information about the paper, including the PubMed ID, which makes it easy to retrieve the paper and capture any additional information required.

First author The full name of the first author as written on the published paper. Only one first author is allowed.
Title Full title of the paper as published.
PubMed ID PMID of the paper.
CRISPR related method papers cited (PMID) This field is for use when the methods section of the paper references other papers for additional details of genome targeting experiments. Multiple PMIDs should be separated by a comma.
Other technique(s) D Recording any additional techniques used to target the same genomic region or protein. This could include morpholinos, siRNA or another genome targeting technique such as Zinc Finger Nucleases.
Data type D

In the event that data has been saved in a repository, this field and the two following should be used to record the location and type of data.

This field describes the type of data that has been placed in a database, for example sequencing data of guide strand RNA sequences. Give all data types for each of the accession numbers.

Archive D The name of the database into which the data was entered. Give a database for each of the accession numbers.
Accession Number D In separate cells give the accession numbers of the data. Each data type should have an accession number; numbers can be repeated if necessary. This field remains blank if the raw data has been saved on the journal website.
Are off-target effects studied? D The extent to which off-target effects are studied can include sequencing experiments around predicted off-target sites or simply discussing the steps taken to prevent off-target binding. If there is no discussion or experimentation regarding off-target effects then this should also be indicated.
Genome-wide/arrayed study included? Whether there is a genome-wide CRISPR study in the publication can be indicated in this field. This section provides programmatic direction that there will be relevant information in Section 4.
Targeted study included? D Whether there is a CRISPR study with selected targets in the publication can be indicated in this field. This section provides programmatic direction that there will be relevant information in Section 3.
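
As an illustration of the comma-separated format used by the "CRISPR related method papers cited (PMID)" field, the sketch below splits a cell value into individual PMIDs and rejects anything that is not purely numeric. The function name parse_cited_pmids is hypothetical and is not part of the submission pipeline.

```python
def parse_cited_pmids(cell_value: str) -> list:
    """Split a comma-separated cell into PMIDs (hypothetical helper)."""
    # Ignore surrounding spaces and empty entries such as a trailing comma.
    pmids = [p.strip() for p in cell_value.split(",") if p.strip()]
    # PubMed IDs are plain numbers, so anything else is flagged.
    bad = [p for p in pmids if not (p.isascii() and p.isdigit())]
    if bad:
        raise ValueError("Not a PubMed ID: " + ", ".join(bad))
    return pmids
```

An empty cell returns an empty list, so the field can be left blank when no method papers are cited.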

2. Study information

This section allows the alignment of information about different studies carried out within the same paper. It is useful to think about a study in terms of an imaginary test tube; if you find that you would not add a reagent or cell line to the same test tube then it represents a separate study. The different species, cell types and experimental variables can all be recorded together in this section.

Internal key

The internal key is a number given to each study as an internal reference, in order to connect data describing the same study between different sections of the accession spreadsheet. Each study described in the paper using a different set of variables should be described in this section and assigned a number sequentially. There is no weight attached to the numbering; it can be carried out in any order that makes sense to the person filling in the spreadsheet.

The internal reference should be set in Section 2. Study Information and then applied to the further description of the studies in Sections 3 and 4.
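
The way the internal key links the sections can be sketched as a simple consistency check. This is a minimal sketch under stated assumptions: the function check_internal_keys is hypothetical, the keys are assumed to have been read from the spreadsheet as integers, and numbering is assumed to start at 1 (the guide only says the numbers are assigned sequentially).

```python
def check_internal_keys(study_keys, targeted_keys, genome_wide_keys):
    """Return a list of problems found with the internal keys.

    study_keys: keys defined in Section 2, one per study.
    targeted_keys / genome_wide_keys: keys used in Sections 3 and 4;
    a key may repeat there, once per row describing that study.
    """
    problems = []
    # Assumption: Section 2 keys run 1, 2, 3, ... with no gaps or repeats.
    if sorted(study_keys) != list(range(1, len(study_keys) + 1)):
        problems.append("Section 2 keys are not sequential from 1")
    defined = set(study_keys)
    sections = (("Section 3", targeted_keys), ("Section 4", genome_wide_keys))
    for section, keys in sections:
        # Every key used later must have been defined in Section 2.
        for key in sorted(set(keys) - defined):
            problems.append(
                f"{section} uses key {key}, which is not defined in Section 2"
            )
    return problems
```

An empty list means every row in Sections 3 and 4 points back to a study described in Section 2.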

Assay type D The desired method of action of the CRISPR associated protein and the experimental output. If the study aims to find which guide strand RNA sequences are present or absent in the population following enzyme mediated NHEJ, then the study would be “Knockout by NHEJ – proliferation.” Tiling of fluorescently labelled protein can be used for imaging, or tags can be attached to the CRISPR associated protein to aid transcriptional repression or activation.
HDR Donor D Where a study has been designed to carry out homology directed repair (HDR) of the strand breaks caused by the nuclease, indicate the form of the template DNA added for repair, for example a single-stranded donor oligonucleotide (ssODN).
Enzyme variant D The enzyme used (e.g. Cas9, Cpf1) and any additional information given. For example, the species from which the sequence was taken, whether the sequence has been humanised (i.e. codon optimised for expression in human cells), whether there are any additional protein domains fused to the CRISPR associated protein (e.g. KRAB for transcriptional repression, fluorescent compounds for FACS analysis or for imaging).
Inducible? D Indicate whether the study has expressed the CRISPR-related protein or sgRNA in an inducible manner.
Multiplexed or multitarget delivery? D

Indicate whether the study has aimed to introduce more than one single guide RNA to the cells or tissue, targeting the same gene or genomic region (multiplex). It can also be used to indicate where the guide strand or strands were designed to target multiple genes concurrently (multitarget).

While it is possible that multiple guide strands may enter cells in any pooled study, this field is used to indicate when the delivery of multiple guide strands was part of the experimental aim.

Organism ID A list of species is provided, giving a reference number which links to a sequenced genome in Ensembl or Ensembl Genomes. The reference number, e.g. 170 for Homo sapiens, should be recorded in this field.
Species D The scientific name of the species used in the study or studies.
Strain The strain of the species can be recorded here. Strains not found in Ensembl or the Ontology Lookup Service (OLS) can be recorded here.
Biosample IDs The IDs of the cell types and strains used in the study should be recorded in this field, for example the EFO ID for HeLa cells is EFO_0001185. IDs should be taken from the Ontology Lookup Service (OLS) and separated by commas when multiple entries are necessary.
Biosample modifications This field is used to record any additions or changes made to cell lines before their use in a targeting experiment. This could include the introduction of a plasmid containing a fluorescent reporter, or an antibiotic selection marker.
Total guide strand RNAs The total number of guide strands used in conjunction with the cell line or strain of organism.
Genes targeted The number of genes for which guide strand RNA sequences were designed.
Positive controls The number of guide strand RNA sequences that were designed to act as positive controls, i.e. the targets of the sequences had a known phenotype and this could be measured to assess the activity of the experimental output.
Negative controls The number of guide strand RNA sequences that were designed to act as negative controls, i.e. the sequences had no known targets in the organism and their output could be used to define the background to the experimental output.
Biological replicates The number of biological replicates carried out in the study, for example non-sister clones or different animals.
Technical replicates The number of times the study was carried out using the same material, for example the same clone or tissue. Where no information is given to the contrary, it will be assumed that replicates are technical replicates.
Enzyme type D The form in which the enzyme is delivered to the cells or organisms, for example as an insert in a plasmid or as a ribonucleoprotein (RNP) complex.
Enzyme delivery D The method by which the protein is delivered to the cells or organisms in the study.
sgRNA type The form in which the sgRNA is delivered to the cells or organisms, for example as an insert in a plasmid or as a ribonucleoprotein (RNP) complex.
sgRNA delivery D The method by which the guide strand RNA sequences are delivered to the cells or organisms in the study.
Drug/condition treated D The growth conditions or drugs used as a selection pressure for the cells to identify the phenotype of the targeted cells, for example low nutrient media or high temperature.
sgRNA design tool D The tool used to design and optimise the guide strand RNA sequences. When in-house design was carried out or the library was purchased this should also be indicated.
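
The comma-separated format of the Biosample IDs field can be checked with a short sketch like the one below. The function split_biosample_ids is hypothetical, and the pattern is an assumption based on the EFO_0001185 example above (an ontology prefix, an underscore, then digits); it only checks the shape of each ID and does not confirm that the term exists in the Ontology Lookup Service.

```python
import re

# Assumed ID shape: letters, an underscore, then digits (e.g. EFO_0001185).
ONTOLOGY_ID = re.compile(r"^[A-Za-z]+_\d+$")


def split_biosample_ids(cell_value: str):
    """Return (all IDs, malformed IDs) from a comma-separated cell."""
    ids = [part.strip() for part in cell_value.split(",") if part.strip()]
    malformed = [i for i in ids if not ONTOLOGY_ID.match(i)]
    return ids, malformed
```

A cell containing a cell line name such as "HeLa" instead of its ontology ID would be reported in the second list.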

3. Guide strand sequences (Targeted)

Where a small number of genes have been targeted it should be possible to record the gene targets, genomic coordinates and guide strand RNA sequences, as well as the efficiencies of the strands and the metric used to assess the efficiency. If a genome-wide screen has been used in the study, this sheet can either be left blank or can be used to record the details of preliminary or follow-up studies carried out on a smaller number of targets.

Internal key

The internal reference number assigned to the study in Section 2. should be repeated here.

Ensembl ID

The Ensembl symbol for the gene or genomic region to which the guide strand RNA directs the CRISPR associated protein. Gene targets must be verified using Ensembl or Ensembl Genomes.

Reported gene name

If the gene is given a different name in the paper this can be recorded here, otherwise this field should be left blank. Only gene synonyms from Ensembl can be accepted.

If a reported name isn’t in Ensembl but can be found via RefSeq, the RefSeq reference should be given as well as the reported gene name, separated by a comma.

Transgene? D

This field indicates whether the gene targeted is endogenous or whether it has been introduced to the biological system.

Gene Symbol

When a gene is a transgene, this is the gene symbol.

Gene Description

When a gene is a transgene, this is the description.

Synonyms

When a gene is a transgene, these are the synonyms.

Genomic coordinates

The genomic coordinates of the guide strand RNA sequence. If this is not provided by the authors, then this field is left blank.

Genome assembly D

The genome assembly that the guide strand RNA sequence has been aligned to.

Guide strand RNA sequence

The 20 base targeting sequence is sufficient for this field. It is also possible to put the full sgRNA or crRNA and tracrRNA sequences if these are given.

This field should always be completed, except when a commercial sgRNA has been used. If a commercially sourced sgRNA has been used and no sequence is given, this field may be left blank provided the fields sgRNA source and Source ID are completed.
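
The rule above (a sequence is required unless both sgRNA source and Source ID are given) can be expressed as a small row check. This is a sketch only: the function check_guide_row is hypothetical, and the restriction to the letters A, C, G, T and U is an assumption about how sequences are written, not a stated requirement of the form.

```python
import re


def check_guide_row(sequence: str, sgrna_source: str = "", source_id: str = ""):
    """Return a list of problems for one row of Section 3."""
    problems = []
    # Remove any spaces or line breaks inside the sequence cell.
    seq = "".join(sequence.split()).upper()
    if not seq:
        # A blank sequence is only acceptable for a commercial sgRNA
        # identified by both its source and a source ID.
        if not (sgrna_source.strip() and source_id.strip()):
            problems.append(
                "No sequence given: sgRNA source and Source ID are both required"
            )
        return problems
    # Assumption: sequences use only the DNA/RNA base letters.
    if not re.fullmatch(r"[ACGTU]+", seq):
        problems.append("Sequence contains characters other than A, C, G, T or U")
    return problems
```

An empty list means the row satisfies the rule; a blank sequence with only one of the two source fields completed is reported as a problem.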

sgRNA label

If identifiers are given to the guide strand RNA sequences in the paper, then these are recorded in this field.

Guide strand efficiency

Any measure of how efficient the guide strand RNA sequence was.

Metric D

The method by which the efficiency of the guide strand RNA sequence was measured.

sgRNA source D

Where a commercially sourced sgRNA has been used, this field should be used to indicate the source from which it was purchased.

This field and the Source ID field are only required when a commercial sgRNA has been used and no sequence is given.

Source ID

An identifier, for example the catalogue number, of the commercial sgRNA used in this study. The sequence of a commercial sgRNA is not usually given, so every effort should be made to ensure that this is a stable identifier.

4. Guide strand sequences (Genome-wide or arrayed)

Where a genome-wide screen has been used, either in pooled format or in an array, the information about the guide strand library and some broad experimental details can be recorded. If only a small number of genes were targeted, then this sheet can be left blank.

Internal key

The internal reference number assigned to the study in Section 2. should be repeated here.

sgRNA library name D

The commercial or lab-generated library used in the screen study.

Sublibrary D

The commercial or lab-generated sublibrary used in the screen study.

sgRNA/ gene

The number of guide strands targeted to each gene in the library or sublibrary used in the screen study.

Library source D

The company from which the library was purchased, or the lab from which it was received.

Genome assembly

The genome assembly to which the sequences were mapped.

Analysis tool D

The computational method used for analysing the data from the screen study.

Analysis tool version

The version of the computational analysis tool used.

Efficiency metric D

The statistical test used by the authors of the study for assessing the outcome of the study.

5. Experimental detail (Genome-wide)

Experimental detail is not a searchable section of the database and is only provided for papers describing genome-wide screens. This data can be provided by researchers who are curating their own publication and wish to share more of the experimental information in a standardised format.

5.1 General Information
Plasmids used

The names of the plasmids that were used for screening, e.g. those used to deliver the CRISPR-related protein sequence.

Production of plasmid library

The protocol used to produce the plasmid library.

5.2 Production of viral libraries
Packaging cell line

The name of the cell line that was used to support the packaging of the RNA strands into the virus.

Packaging plasmids and viral protein

The names of the packaging plasmids and the viral protein used to package the RNA.

Growth Conditions/Cell culture conditions to generate virus library

Copy an overview of the protocols used for cell line culturing.

Percentage of cell confluence at day of transfection

Give the percentage cell confluence on the day the cells were transfected.

Transfection protocol

An overview of the transfection protocol.

Obtaining viral supernatant protocol

Information on how the viral supernatant was obtained and processed.

5.3 Production of stable library cell line and Cas9 information
Growth Conditions/Cell culture conditions

The growth or cell culture conditions used to produce the stable library cell line.

Transduction protocol

Provide the transduction protocol, including the multiplicity of infection (MOI) used for infection if it is given.

Antibiotic selection

Provide the name of the selection marker and the concentration used, if given.

Time period for selection

The length(s) of time between the introduction of CRISPR protein and/or guide strand RNA, allowing the production of the active complex, and assessment of the phenotype of the cells.

Cultivation protocol during selection

The protocol used for cell culture during the selection process.

5.4 Library and Sequencing Preparation
Number of cells harvested per sample

The approximate number of cells harvested for gDNA extraction per sample.

PCR Protocol

A detailed overview of the PCR protocol including enzyme names, cycling times etc.

Amount of input DNA per PCR

The amount of gDNA used as input for each PCR reaction.

Primer(s) for PCR

The name(s) and sequence(s) of the primer(s) used for sequencing.

Sequencing

A description of the sequencing parameters.