How are RNAcentral identifiers assigned?

Knowing how the identifiers are assigned can help you to take advantage of several useful features of the RNAcentral API and the FTP archive.

Assigning RNAcentral identifiers

1. First, the sequence is normalised:

Each sequence is uppercased and all U's are replaced with T's, which converts the RNA sequence to its DNA form. The DNA form is used for consistency with sequence archives such as ENA.
 

2. Next, the MD5 hash of the normalised sequence is computed:

An MD5 hash is a string of 32 hexadecimal characters computed from the full sequence, so identical sequences always produce the same hash. Instead of comparing sequences directly, RNAcentral compares their MD5 hashes, which can be much faster, especially for longer sequences.
 

3. Finally, the MD5 hash is checked against the database to see if it is already present:

If the hash is not present in the database, the sequence is new, so a new URS identifier is created and permanently stored along with the sequence and its MD5 hash.

If the hash is already in the database, then the sequence has been seen before and there is no need to generate a new identifier.
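
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not RNAcentral's actual implementation: the in-memory dictionary stands in for the database, the `UrsRegistry` class and its method names are hypothetical, and the counter-based identifier format is invented for the example.

```python
import hashlib


def normalise(sequence: str) -> str:
    """Step 1: uppercase the sequence and replace U with T (DNA form)."""
    return sequence.upper().replace("U", "T")


def md5_of(sequence: str) -> str:
    """Step 2: MD5 hash (32 hex characters) of the normalised sequence."""
    return hashlib.md5(normalise(sequence).encode("ascii")).hexdigest()


class UrsRegistry:
    """Hypothetical stand-in for the database of known sequences."""

    def __init__(self) -> None:
        self._id_by_md5: dict[str, str] = {}

    def assign(self, sequence: str) -> str:
        """Step 3: reuse the identifier if the hash is known, else create one."""
        digest = md5_of(sequence)
        if digest not in self._id_by_md5:
            # Assumed format for illustration only: "URS" plus a
            # zero-padded hexadecimal counter.
            new_id = "URS{:010X}".format(len(self._id_by_md5) + 1)
            self._id_by_md5[digest] = new_id
        return self._id_by_md5[digest]


registry = UrsRegistry()
first = registry.assign("acguacguacgu")   # new sequence, new identifier
second = registry.assign("ACGTACGTACGT")  # same after normalisation
print(first == second)
```

Because both inputs normalise to the same string, they share one hash and therefore one identifier.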

Now you can take any RNA sequence, compute the MD5 hash of its normalised form, and check whether it is found in RNAcentral using the RNAcentral API or the FTP archive.
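
As a rough sketch of such a lookup, the snippet below builds a query URL from a sequence. The endpoint path and the `md5` query parameter are assumptions made for this example and should be checked against the current RNAcentral API documentation; `lookup_url` is a hypothetical helper name. The code only constructs the URL and does not send a request.

```python
import hashlib
from urllib.parse import urlencode

# Assumed endpoint; verify against the RNAcentral API documentation.
API_BASE = "https://rnacentral.org/api/v1/rna/"


def lookup_url(sequence: str) -> str:
    """Build a URL that queries RNAcentral by the sequence's MD5 hash."""
    # Normalise exactly as described above: uppercase, then U -> T.
    normalised = sequence.upper().replace("U", "T")
    digest = hashlib.md5(normalised.encode("ascii")).hexdigest()
    return API_BASE + "?" + urlencode({"md5": digest})


print(lookup_url("ACGUACGUACGU"))
```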

All sequences in RNAcentral are at least 10 nucleotides long because shorter sequences are not likely to represent biologically relevant ncRNAs.

Distinct sequences get distinct identifiers

Every distinct ncRNA sequence gets its own identifier, so if two sequences are even slightly different, they still get distinct RNAcentral identifiers. This is similar to how UniParc assigns unique identifiers to protein sequences.

For example, sequence URS0000759BE2 (2,547 nucleotides long) is the same as URS0000621DCB (2,546 nucleotides) except that it has one more nucleotide on its 3' end. Although these sequences are almost identical and come from the same genomic location in the same species, they still get distinct identifiers to recognise the fact that the sequences are distinct.
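
The effect of a single extra nucleotide can be seen with a pair of toy sequences (invented for illustration, not the real URS0000759BE2 and URS0000621DCB sequences). Their hashes differ, so a hash-based lookup treats them as two separate entries:

```python
import hashlib


def md5_of(sequence: str) -> str:
    """MD5 hash of the normalised (uppercase, U -> T) sequence."""
    normalised = sequence.upper().replace("U", "T")
    return hashlib.md5(normalised.encode("ascii")).hexdigest()


shorter = "ACGUACGUACGU"   # toy sequence
longer = shorter + "A"     # same sequence with one more 3' nucleotide

# Different hashes mean the sequences would get distinct identifiers.
print(md5_of(shorter) != md5_of(longer))
```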