A-AFFY-189 - Affymetrix Porcine Snowball Array

Sus scrofa
Porcine expressed sequences (cDNA) were collated from public data repositories (Ensembl, RefSeq, UniGene and the Iowa State University ANEXdb database) to create a non-overlapping set of reference sequences. A series of sequential BLASTN analyses was performed using the NCBI blastall executable with the -m8 (tabular output) option. The initial subject database comprised 2,012 sequences of manually annotated S. scrofa gene models from Havana, provided by Jane Loveland (The Sanger Institute) on 29 July 2010, plus 21,021 sequences acquired using Ensembl BioMart (Sscrofa build 9, version 59) on 22 July 2010. For each iteration, query sequences that did not have an alignment with a bit score in excess of 50 were added to the subject database prior to the next iteration. The iterations involved the following query datasets:
1. 35,171 pig mRNA sequences from NCBI, downloaded on 15 July 2010: 6,286 added to subject database.
2. 7,882 pig RefSeq sequences from NCBI, downloaded on 15 July 2010: 0 added to subject database.
3. 43,179 pig UniGene sequences from NCBI, downloaded on 15 July 2010 (filtered to include only those longer than 500 bases): 10,125 added to subject database.
4. 121,991 contig sequences, downloaded from Iowa Porcine Assembly v1 ( ) on 30 July 2010 (filtered to include only those longer than 500 bases): 10,536 added to subject database.
5. 2,370 miRNA sequences (pig, cow, human, mouse), downloaded from miRBase on 30 July 2010 (Release 15, April 2010, 14,197 entries): all added without BLASTN analysis.
The final subject database comprised 52,355 expressed sequences.

To facilitate the design of array probes that were uniformly distributed along the entire length of transcripts, transcripts were split into several probe selection regions (PSRs), each of which was then the target for probe selection. The size of each PSR, typically ~150 nucleotides, was determined by the length of the input sequence, with the ultimate aim of obtaining 20-25 probes per transcript. Oligonucleotide design against the ~343,000 PSRs was performed by Affymetrix (High Wycombe, UK). In addition, standard Affymetrix controls for hybridisation, labelling efficiency and non-specific binding were included on the array (a total of 123 probesets), together with complete tiling probesets for 35 porcine-related virus genome sequences (both strands, centre-to-centre gap of 17 nucleotides) for possible future infection-based studies. The final array comprises 1,091,987 probes (47,845 probesets) with a mean coverage of 22 probes/transcript.

Initial annotation of the gene models was obtained from the sequence sources and converted into an annotation set using the AnnotationDbi Bioconductor package. However, following this exercise many probesets were without useful annotation. Therefore the original sequences from which the probes had been designed were searched with BLAST against NCBI RefSeq in order to impute the most likely orthologous gene of the 'unannotated' pig transcripts. In order to have one gene per query sequence, the following annotation pipeline was followed:
1. For each query, the hit with the lowest e-value within each species was chosen.
2. Genes with hits of e-value <1e-9 against H. sapiens were annotated with HGNC names/descriptions; however, genes with matches starting with 'LOC' were not used.
3. Step 2 was repeated using, in order: Sus scrofa, Bos taurus, Pan troglodytes, Mus musculus, Canis lupus familiaris, Pongo abelii, Equus caballus, Rattus norvegicus, Macaca mulatta.
4. Step 3 was repeated using any other species (in no particular order) to which a hit could be obtained.
5. For the remaining probes, LOC gene annotations were used from (in order of priority): Homo sapiens, Sus scrofa, Bos taurus, Pan troglodytes, Mus musculus.
6. Everything else was used, in no particular order.
Out of 47,845 sequences represented on the array, 27,322 probesets have annotations that correspond to a current (15 December 2011) HGNC symbol for a human protein-coding gene, 14,426 of which are unique (out of a total of 19,219 listed by HGNC). The remaining probesets were annotated with the information available for those sequences.
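
The iterative growth of the subject database described above (queries lacking any alignment with a bit score above 50 are carried forward) can be illustrated with a minimal Python sketch. It is not the code used to build the array. It assumes the standard 12-column tabular layout written by blastall -m8, in which the query ID is the first column and the bit score is the last; the function names and the example IDs are hypothetical.

```python
import csv
import io

# Threshold taken from the description: a query is considered "already
# represented" if it has an alignment with a bit score in excess of 50.
BITSCORE_THRESHOLD = 50.0


def queries_with_hit(m8_text, threshold=BITSCORE_THRESHOLD):
    """Return the set of query IDs having at least one alignment above threshold.

    m8_text is the content of a blastall -m8 (tabular) report:
    column 1 is the query ID and column 12 is the bit score.
    """
    matched = set()
    for row in csv.reader(io.StringIO(m8_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query_id, bitscore = row[0], float(row[11])
        if bitscore > threshold:
            matched.add(query_id)
    return matched


def novel_queries(query_ids, m8_text, threshold=BITSCORE_THRESHOLD):
    """Return the queries (in input order) to add to the subject database."""
    matched = queries_with_hit(m8_text, threshold)
    return [q for q in query_ids if q not in matched]


# Illustrative report: qA has a strong hit, qB only a weak one, qC no hit.
example_m8 = "\n".join([
    "qA\ts1\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-100\t350",
    "qB\ts2\t85.0\t40\t6\t0\t1\t40\t5\t44\t0.5\t38.2",
])
to_add = novel_queries(["qA", "qB", "qC"], example_m8)
print(to_add)  # ['qB', 'qC']
```

In each iteration the sequences returned by such a filter would be appended to the subject database, which would then be re-indexed before the next query dataset is searched.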
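
The six-step annotation priority described above can be sketched in Python as follows. This is one reading of the steps, not the original pipeline. It assumes that the <1e-9 e-value cutoff applies to steps 2 to 4, that no cutoff applies in steps 5 and 6 (the description does not say), and that a 'LOC' match is recognised by its gene name prefix. The Hit record, the function names and the example hits are hypothetical.

```python
from collections import namedtuple

# One BLAST hit for a query: species of the subject, gene name, e-value.
Hit = namedtuple("Hit", "species gene evalue")

EVALUE_CUTOFF = 1e-9

# Species order for named genes (steps 2 and 3 of the description).
NAMED_ORDER = [
    "Homo sapiens", "Sus scrofa", "Bos taurus", "Pan troglodytes",
    "Mus musculus", "Canis lupus familiaris", "Pongo abelii",
    "Equus caballus", "Rattus norvegicus", "Macaca mulatta",
]
# Species order for LOC annotations (step 5).
LOC_ORDER = NAMED_ORDER[:5]


def is_loc(gene):
    return gene.startswith("LOC")


def best_hit_per_species(hits):
    """Step 1: keep the lowest e-value hit within each species."""
    best = {}
    for hit in hits:
        current = best.get(hit.species)
        if current is None or hit.evalue < current.evalue:
            best[hit.species] = hit
    return best


def choose_annotation(hits):
    """Return the single hit used to annotate a query, or None if no hits."""
    best = best_hit_per_species(hits)
    if not best:
        return None

    def usable(hit):
        return hit.evalue < EVALUE_CUTOFF and not is_loc(hit.gene)

    # Steps 2-3: named genes from the priority species, in order.
    for species in NAMED_ORDER:
        hit = best.get(species)
        if hit is not None and usable(hit):
            return hit
    # Step 4: named genes from any other species, in no particular order.
    for species, hit in best.items():
        if species not in NAMED_ORDER and usable(hit):
            return hit
    # Step 5: LOC annotations from the priority species, in order.
    for species in LOC_ORDER:
        hit = best.get(species)
        if hit is not None and is_loc(hit.gene):
            return hit
    # Step 6: everything else, in no particular order.
    return next(iter(best.values()))


# Illustrative hits (made-up gene names and e-values).
example_hits = [
    Hit("Homo sapiens", "LOC0000001", 1e-40),
    Hit("Bos taurus", "TLR4", 1e-30),
    Hit("Bos taurus", "TLR2", 1e-12),
]
print(choose_annotation(example_hits).gene)  # TLR4
```

Here the human hit is the strongest but is a 'LOC' match, so it is passed over in favour of the best named cattle gene; it would only be used if no named gene passed the cutoff in any species.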
(Tom Freeman) tom.freeman@roslin.ed.ac.uk
