Finding genes using a sequence

Finding a gene

Background

If you do not have a gene name or an accession number for your sequence of interest, you can use the sequence search facility of the ENA Browser to identify the closest matches in ENA. Sequence searching is also useful for finding potential homologues that are related to your gene of interest.

Scenario

Imagine that you have isolated a gene associated with an autoimmune disease in humans. You know part of the DNA sequence and that mutations in the sequence are associated with a T cell-mediated autoimmune disease, but you do not know the gene it belongs to. You can use the ENA Browser's sequence search to see the closest matches in the ENA database.

Notes

The sequence you identified is:

TGGAAAGATAATTAAAATAAGACATGGGAAATAGGGA
AGCTGATAACGTGGGGGAGAGGTTTTGCTTGTGTTTC
ACCAAGAGAAAATCAGCTTCCTGTTTGGATACCCACT
AAACATTTGAAGTTCTACAATGAACCCATCAGAGATG
CAAAGAAAAGCGCCTCCACGGAG
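Before pasting, it can help to join the wrapped lines into one string and check that only DNA letters are present. The following is a minimal Python sketch of that preparation step; the helper name `clean_sequence` is hypothetical and is not part of any ENA tool.

```python
# The query sequence as it appears above, wrapped over five lines.
RAW_QUERY = """
TGGAAAGATAATTAAAATAAGACATGGGAAATAGGGA
AGCTGATAACGTGGGGGAGAGGTTTTGCTTGTGTTTC
ACCAAGAGAAAATCAGCTTCCTGTTTGGATACCCACT
AAACATTTGAAGTTCTACAATGAACCCATCAGAGATG
CAAAGAAAAGCGCCTCCACGGAG
"""


def clean_sequence(raw: str) -> str:
    """Strip whitespace and line breaks, upper-case, and allow only A, C, G, T.

    Hypothetical helper for illustration; it is not an ENA function.
    """
    seq = "".join(raw.split()).upper()
    unexpected = set(seq) - set("ACGT")
    if unexpected:
        raise ValueError(f"Unexpected characters in sequence: {sorted(unexpected)}")
    return seq


query = clean_sequence(RAW_QUERY)
print(len(query), "bases")
print(query)
```

The single-line string printed at the end can be copied straight into the search box.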


Figure 58. To search ENA, copy/paste your sequence into the 'Sequence Search' box (red arrow) and click search.

Steps

1. Open the ENA Browser in a new window.

2. Copy/paste your sequence into the Sequence Search box.

3. Click ‘Search’ to obtain search results.

 

Results - closest sequence matches

The search returns matches in both ENA and Ensembl. You can look at the sequences and their annotation in ENA, then use the Ensembl results to see how your sequence aligns to the genome(s).

Note: The ENA Browser's sequence search does not search raw sequences in either SRA or the Trace Archive, because the very short, redundant nature of raw sequences makes sequence searching very difficult.


Figure 59. Close-up of the sequence search results page displaying the closest matching sequences in EMBL-Bank. The top hit is AF333072 (red arrow), which shows 100% identity (i.e. all the nucleotides in the query sequence align with the target sequence).
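To make the identity figure concrete, here is a small Python sketch that computes percent identity for two already-aligned sequences of equal length. It is a simplification for illustration only: the function name `percent_identity` is hypothetical, gaps are not handled, and it is not how the ENA search itself calculates its scores.

```python
def percent_identity(query: str, target: str) -> float:
    """Percentage of positions at which two equal-length aligned sequences match.

    Simplified, ungapped comparison for illustration; real search tools
    report identity over a gapped local alignment.
    """
    if len(query) != len(target):
        raise ValueError("Aligned sequences must have the same length")
    if not query:
        raise ValueError("Sequences must not be empty")
    matches = sum(q == t for q, t in zip(query.upper(), target.upper()))
    return 100.0 * matches / len(query)


# Toy example using the first 10 bases of the tutorial query.
query_fragment = "TGGAAAGATA"
print(percent_identity(query_fragment, "TGGAAAGATA"))  # identical fragments
print(percent_identity(query_fragment, "TGGAAAGTTA"))  # one mismatching base
```

A hit with 100% identity means that every aligned base of the query matches the target in this way.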

Steps

1. Scroll down to the EMBL-Bank results.

2. Click on the entry AF333072.

 

Results - analysing the results

Entry AF333072 is described as Homo sapiens HERV-K18, but HERV-K18 is a virus, so what is going on?


Figure 60. Close-up of the results page for EMBL-Bank entry AF333072, showing a graphical overview of the annotation available for this sequence and the source of the sequence.

Notes

[A] The graphical Overview section summarises the features associated with this sequence: gag, pol and env genes.

[B] Source of the sequence is Homo sapiens.

 

Although it is a human sequence, the gag, pol and env genes are usually associated with viruses, so why is this sequence not classified as being viral?

Genomes often contain 'foreign' DNA, such as endogenous retroviruses (ERVs) and transposable elements. ERVs are thought to have arisen from ancient viral infections and, over the course of evolution, have remained permanently integrated within their host genome, so they are passed down to subsequent generations. When you sequence a genome and find these 'foreign elements', how do you classify them? Are they part of the host genome or are they separate entities?

Information

How a sequence is classified depends on the origin of the sequence.

If the virus was isolated and sequenced, then it would be classified as a viral sequence (VRL taxonomic division).

However, if the viral sequence was obtained by sequencing another organism, then it would be classified under the taxonomic division of the host organism.
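The rule above can be written as a short decision function. This is only an illustrative Python sketch: the function `assign_division` and its parameters are hypothetical and do not correspond to any ENA software; only the division codes VRL and HUM come from this tutorial.

```python
def assign_division(sequenced_from_isolated_virus: bool, host_division: str = "") -> str:
    """Return the taxonomic division for a viral sequence.

    Hypothetical helper that mirrors the rule in the text:
    - virus isolated and sequenced on its own -> VRL
    - viral sequence found while sequencing another organism -> host division
    """
    if sequenced_from_isolated_virus:
        return "VRL"
    if not host_division:
        raise ValueError("A host division is required for host-derived sequences")
    return host_division


# An isolated virus is filed under VRL.
print(assign_division(True))
# A viral sequence found within the human genome is filed under HUM.
print(assign_division(False, "HUM"))
```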

 

In entry AF333072, the HERV-K18 endogenous virus was inserted into intron 1 of the human CD48 gene (see Note encircled in red in Figure 59). Because the sequence was obtained by sequencing the human genome, it is classified as human (HUM taxonomic division). On average, the human genome contains 25-50 copies of endogenous HERV type K retroviruses.

Help

When searching for endogenous viral sequences, be careful not to restrict your search to just the VRL taxonomic division. Some sequences may be filed under the VRL division, but others may be filed under the division of their host organism.
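The effect of this restriction can be shown with a toy example in Python. Only the AF333072 record reflects this tutorial; the other two records, the field names and the helper functions are hypothetical and are there purely to illustrate the point.

```python
# Toy records. Only AF333072 comes from the tutorial; the EXAMPLE entries
# are hypothetical and exist purely for illustration.
records = [
    {"accession": "AF333072", "division": "HUM",
     "description": "Homo sapiens HERV-K18"},
    {"accession": "EXAMPLE0001", "division": "VRL",
     "description": "Hypothetical isolated retrovirus"},
    {"accession": "EXAMPLE0002", "division": "HUM",
     "description": "Hypothetical human gene with no viral content"},
]


def by_division(items, division):
    """Keep only the records filed under the given taxonomic division."""
    return [r["accession"] for r in items if r["division"] == division]


def by_keyword(items, keyword):
    """Keep records whose description mentions the keyword, in any division."""
    return [r["accession"] for r in items if keyword.lower() in r["description"].lower()]


# Restricting to VRL misses the endogenous virus filed under its human host.
vrl_only = by_division(records, "VRL")
# Searching every division by keyword finds it.
herv_hits = by_keyword(records, "HERV")
print(vrl_only)
print(herv_hits)
```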