Figure 6. Sequences in public databases are aligned to the genome to determine the positions of genes and their splice variants.
The first step is to obtain sequenced genomes from official centres. These genomes are then annotated by the Ensembl pipeline (also known as the Ensembl genebuild) using automatic annotation and, for some species, manual curation. The human, mouse, and zebrafish gene sets include manual annotation from the HAVANA project. The Ensembl gene set for human, including HAVANA transcripts, is the GENCODE set.
All Ensembl transcripts are based on experimental evidence, drawing on mRNA and protein sequences deposited in public databases (such as UniProtKB and NCBI RefSeq) by the scientific community. The Ensembl gene set also includes automatically annotated pseudogenes, non-coding RNAs, and alternative splicing events for model organisms. The resulting analyses of the genomes are stored in the Ensembl databases and can be accessed via the Ensembl website, through BioMart, or programmatically.
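As an illustration of programmatic access, the following is a minimal Python sketch that looks up a gene and lists its annotated transcripts. It assumes the public Ensembl REST server at `https://rest.ensembl.org` and its `lookup/symbol` endpoint with the `expand` option. The helper names (`lookup_url`, `summarise_transcripts`, `fetch_gene`) are hypothetical and chosen for this example, and the response fields read here (`Transcript`, `id`, `biotype`) are assumptions about the response layout that should be checked against the REST documentation.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public Ensembl REST server; not stated in the text above.
REST_SERVER = "https://rest.ensembl.org"


def lookup_url(species: str, symbol: str) -> str:
    """Build a lookup/symbol URL that also requests the gene's transcripts."""
    path = "/lookup/symbol/{}/{}".format(
        urllib.parse.quote(species), urllib.parse.quote(symbol)
    )
    return REST_SERVER + path + "?expand=1"


def summarise_transcripts(gene: dict) -> list:
    """Return (transcript ID, biotype) pairs from a decoded lookup response.

    Assumes transcripts are listed under the "Transcript" key; a response
    without that key yields an empty list.
    """
    return [(t.get("id"), t.get("biotype")) for t in gene.get("Transcript", [])]


def fetch_gene(species: str, symbol: str) -> dict:
    """Fetch the gene record as JSON. Requires network access."""
    request = urllib.request.Request(
        lookup_url(species, symbol),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Calling `fetch_gene("homo_sapiens", "BRCA2")` and passing the result to `summarise_transcripts` should give one pair per annotated transcript, which is one way to see the splice variants of a gene without using the website. The URL building and the parsing are kept separate from the network call so that each can be checked offline.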
For more information, you can read about the mouse genebuild.