I joined EMBL-EBI as a software engineer and bioinformatics scientist in 2007 and have since been developing and maintaining a suite of professional bioinformatics software and databases for biological data analysis. My major ongoing projects include 1) PSI-Search Sequence Alignment Algorithm, 2) Non-Redundant Patent Sequence Databases, 3) JDispatcher Bioinformatics Framework and Workflows and 4) EMBL-EBI SOAP/REST Web Services. I am also involved in related software services/support and collaborative projects, e.g. Carp Transcriptome, Clustal Omega, InterProScan5, ENSEMBL Genomes.
Before joining EMBL-EBI, I completed my PhD in bioinformatics and functional genomics at the University of Liverpool. I worked on gene expression and sequence annotation and, as a result, developed a suite of bioinformatics software/algorithms, e.g. ExprAlign, EST-ferret, carpBASE, GoMatrix, GoProfiler.
PSI-Search, similar to PSI-BLAST, combines the Smith-Waterman search algorithm with the PSI-BLAST profile construction strategy to find distantly related protein sequences. Searches are done with SSEARCH, and the selected hits are combined with BLASTPGP to build a position specific scoring matrix (PSSM), which is then used for another search with SSEARCH in the next iteration.
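The iteration described above can be sketched as a simple loop. This is only an illustration of the control flow, not the PSI-Search implementation: the `search` and `build_profile` callables are hypothetical stand-ins for running SSEARCH and BLASTPGP, and the E-value cutoff, iteration limit and convergence rule are assumptions.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Hit = Tuple[str, float]  # (sequence identifier, E-value)


def iterative_profile_search(
    query: str,
    search: Callable[[str, Optional[object]], Sequence[Hit]],
    build_profile: Callable[[str, List[str]], object],
    max_iterations: int = 5,      # assumed limit
    evalue_cutoff: float = 1e-3,  # assumed hit-selection threshold
) -> List[str]:
    """Search, select hits, rebuild the profile, and search again."""
    profile = None  # the first iteration searches with the plain query
    selected: set = set()
    for _ in range(max_iterations):
        # Stand-in for an SSEARCH run (with a PSSM after iteration 1).
        hits = search(query, profile)
        new_selected = {sid for sid, evalue in hits if evalue <= evalue_cutoff}
        if new_selected == selected:
            break  # no change in the selected hits: treat as converged
        selected = new_selected
        # Stand-in for building a PSSM from the selected hits with BLASTPGP.
        profile = build_profile(query, sorted(selected))
    return sorted(selected)


# Toy stand-ins so the sketch can be run without any external program.
def toy_search(query: str, profile: Optional[object]) -> Sequence[Hit]:
    if profile is None:
        return [("A", 1e-10), ("B", 0.5)]
    return [("A", 1e-12), ("B", 1e-4), ("C", 2.0)]


def toy_build_profile(query: str, hit_ids: List[str]) -> object:
    return tuple(hit_ids)


result = iterative_profile_search("MKT", toy_search, toy_build_profile)
```

In the toy run, sequence B is only found once a profile built from A is used, which mirrors how a profile search can reach more distant relatives than the first pass.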
The non-redundant patent sequence databases cover the EMBL-Bank nucleotide patent class and the patent protein databases, and contain value-added annotations from patent documents. The databases were created at two levels using sequence MD5 checksums. Sequences within a level-1 cluster are 100% identical over their whole length. Level-2 clusters were defined by sub-grouping level-1 clusters based on patent family information. Value-added annotations, such as publication number corrections, earliest publication dates and feature collations, significantly enhance the quality of the data, allowing for better tracking and cross-referencing.
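The two-level grouping can be illustrated with a minimal sketch. The record layout `(entry_id, sequence, patent_family_id)` and the normalisation step (removing whitespace, upper-casing) are assumptions made for this example and are not taken from the databases' actual build pipeline.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, str, str]  # (entry_id, sequence, patent_family_id), assumed layout


def md5_checksum(sequence: str) -> str:
    """MD5 of a sequence after an assumed normalisation step."""
    normalised = "".join(sequence.split()).upper()
    return hashlib.md5(normalised.encode("ascii")).hexdigest()


def cluster_two_levels(records: Iterable[Record]) -> Dict[str, Dict[str, List[str]]]:
    """Level 1: identical sequences share a checksum.
    Level 2: entries in a level-1 cluster are sub-grouped by patent family."""
    clusters: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    for entry_id, sequence, family_id in records:
        clusters[md5_checksum(sequence)][family_id].append(entry_id)
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {checksum: dict(families) for checksum, families in clusters.items()}


example_records = [
    ("E1", "ACGT", "F1"),
    ("E2", "acgt", "F1"),  # same sequence and family as E1
    ("E3", "ACGT", "F2"),  # same sequence, different patent family
    ("E4", "TTTT", "F1"),  # different sequence
]
example_clusters = cluster_two_levels(example_records)
```

Here E1, E2 and E3 fall into one level-1 cluster because their sequences are identical, and that cluster splits into two level-2 groups, one per patent family.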
JDispatcher is the EMBL-EBI analysis tool application framework. It includes sequence similarity search services (SSS, e.g. BLAST, FASTA), multiple sequence alignment (MSA, e.g. Clustal Omega, T-Coffee, MUSCLE), protein function analysis (PFA, e.g. InterProScan), pair-wise sequence alignment (PSA) and sequence operations (SO, e.g. SeqCksum). Over 2000 up-to-date genomes, proteomes and other sequence databases are available through the SSS. These services are available over the web and via Web Services interfaces for users who require systematic access or want to interface with customised pipelines and workflows using common programming languages. The framework features novel result visualisations and integration of domain and functional predictions for protein database searches. The major analysis categories under the current JDispatcher include:
- Sequence Similarity/Homology Search: e.g. FASTA & NCBI BLAST;
- Protein Functional Analysis: e.g. InterProScan;
- Multiple Sequence Alignment: e.g. Clustal Omega.
The EBI provides programmatic access to various data resources and analysis tools via SOAP/REST-based Web Services technologies. Web Services are an integration and interoperability technology that helps client and server software from different sources work well together.
Other collaborated projects
Gene Expression Alignment (ExprAlign)
Sequence identification and annotation, and gene expression data analysis, particularly expression alignment (ExprAlign) for non-model species.
Oligo array construction using ESTs
I built a bioinformatics pipeline to optimise the rainbow trout EST collection for oligoarray construction.
Non-model species sequence databases
I developed several annotated sequence databases for non-model species. These included carpBase, squirrelBase, roachBase, troutBase, etc. The following paper describes the carpBase and carpArray.
Bioinformatics tools developed
I have developed a set of tools for the projects above. These tools can also be downloaded from the LEGR web site.
EST-ferret: a user-configurable, automated package for convenient analysis of EST data. It includes the necessary steps for EST cleaning, submission to dbEST, clustering, identification and annotation.
GOprofiler: associating ESTs with Gene Ontology annotations.
GOmatrix: associating gene groups in gene expression data with Gene Ontology categories. It can run thousands of Fisher's exact tests to check significance and provides a GUI for data input.
CORR for ExprAlign: ExprAlign is an approach for clustering gene expression data using Pearson's correlation coefficients. CORR is a C programme that computes millions of Pearson's correlation coefficients in minutes. It performs much better than similar programmes written in MATLAB.
FindOrthologs: a Perl programme to find orthology relationships across three species.
Balsamic – protein 3D structure visualisation (at the University of Leeds)
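The two statistics mentioned for CORR and GOmatrix above can be sketched in a few lines. This is an illustrative pure-Python version, not the code of either tool: the function names are hypothetical, and the enrichment test shown is the one-sided (over-representation) form of Fisher's exact test, which is an assumption about how such a test would typically be applied to Gene Ontology categories.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient between two expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sxy / math.sqrt(sxx * syy)


def enrichment_p_value(population: int, annotated: int, drawn: int, hits: int) -> float:
    """One-sided Fisher's exact test: probability of seeing at least `hits`
    annotated genes in a group of `drawn` genes, given `annotated` genes
    among `population` genes overall (hypergeometric upper tail)."""
    upper = min(annotated, drawn)
    favourable = sum(
        math.comb(annotated, i) * math.comb(population - annotated, drawn - i)
        for i in range(hits, upper + 1)
    )
    return favourable / math.comb(population, drawn)


r_same = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])      # perfectly correlated
r_opposite = pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])  # perfectly anti-correlated
p_all_hits = enrichment_p_value(population=10, annotated=5, drawn=5, hits=5)
```

For the last call, only one of the 252 possible groups of five genes contains all five annotated genes, so the p-value is 1/252.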
Prof William R. Pearson (University of Virginia)
Clustal Omega Team (University College Dublin)
Liverpool LEGR (University of Liverpool)
Andrew Gracey lab (University of Southern California)
Jones P, Binns D, Chang HY, Fraser M, Li W, McAnulla C, McWilliam H, Maslen J, Mitchell A, Nuka G, Pesseat S, Quinn AF, Sangrador-Vegas A, Scheremetjew M, Yong SY, Lopez R, Hunter S. InterProScan 5: genome-scale protein function classification. Bioinformatics. [Epub ahead of print] [ PubMed: 24451626 ]
Pakseresht N, Alako B, Amid C, Cerdeño-Tárraga A, Cleland I, Gibson R, Goodgame N, Gur T, Jang M, Kay S, Leinonen R, Li W, Liu X, Lopez R, McWilliam H, Oisel A, Pallreddy S, Plaister S, Radhakrishnan R, Rivière S, Rossello M, Senf A, Silvester N, Smirnov D, Squizzato S, ten Hoopen P, Toribio AL, Vaughan D, Zalunin V, Cochrane G. Assembly information services in the European Nucleotide Archive. Nucleic Acids Res. (2013) 42:D38-43. [ PubMed: 24214989 ]
McWilliam, H., Li, W., Uludag, M., Squizzato, S., Park, Y.M., Buso, N., Cowley, A.P., Lopez, R. (2013) Analysis tool web services from the EMBL-EBI. Nucleic Acids Research. [ PubMed: 23671338 ]
Li, W., Kondratowicz, B., McWilliam, H., Nauche, S. and Lopez, R. (2013) The Annotation-enriched Non-redundant Patent Sequence Databases. Database, 2013:bat005 [ PubMed: 23396323 | Abstract | Full-text PDF ]
Li, W., McWilliam, H., Goujon, M., Cowley, A., Lopez, R., and Pearson, W.R. (2012) PSI-Search: iterative HOE-reduced profile SSEARCH searching. Bioinformatics 28:1650-1651. [ PubMed: 22539666 | Abstract | Full-text PDF | DOI: 10.1093/bioinformatics/bts240 ]
Sievers, F., Wilm, A., Dineen, D., Gibson, T.J., Karplus, K., Li, W., Lopez, R., McWilliam, H., Remmert, M., Söding, J., Thompson, J.D., Higgins, D.G. (2011) Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol. 7:539 [ PubMed: 21988835 | Abstract | Full-text PDF ]
Goujon, M., McWilliam, H., Li, W., Valentin, F., Squizzato, S., Paern, J., Lopez, R. (2010) A new bioinformatics analysis tools framework at EMBL-EBI. Nucleic Acids Research. 38:W695-W699. [ PubMed: 20439314 | Abstract | Full-text PDF ]
Li W., McWilliam H., Richart de la Torre A., Grodowski A., Benediktovich I., Goujon M., Nauche, S. and Lopez, R. (2010) Non-redundant patent sequence databases with value-added annotations at two levels. Nucleic Acids Research. 38(Database issue):D52-D56. [ PubMed: 19884134 | Abstract | Full-text PDF ]
Li, W., Gracey, A.Y., Vieira Mello, L., Brass, A. and Cossins, A.R. (2009) ExprAlign - the identification of ESTs in non-model species by alignment of cDNA microarray expression profiles. BMC Genomics, 10:560 [ PubMed: 19939286 | Abstract | Full-text PDF ]
McWilliam, H., Valentin, F., Goujon, M., Li, W., Narayanasamy, M., Martin, J., Miyar, T. and Lopez, R. (2009) Web services at the European Bioinformatics Institute-2009. Nucleic Acids Research. 37(Web Server issue):W6-W10. [ PubMed: 19435877 | Abstract | Full-text PDF ]
Williams, D. R., Li, W., Hughes, M. A., Gonzalez, S. F., Vernon, C., Vidal, M. C., Jeney, Z., Jeney, G., Dixon, P., McAndrew, B., Bartfai, R., Orban, L., Trudeau, V., Rogers, J., Matthews, L., Fraser, E. J., Gracey, A. Y., Cossins, A. R. (2008). Genomic resources and microarrays for the common carp Cyprinus carpio L. Journal of Fish Biology. 72:2095-2117. [ Abstract | Full-text PDF ]
Olohan, L. A., Li, W., Wulff, T., Jarmer, H., Gracey, A. Y., Cossins, A. R. (2008). Detection of anoxia-responsive genes in cultured cells of the rainbow trout Oncorhynchus mykiss (Walbaum), using an optimized, genome-wide oligoarray. Journal of Fish Biology. 72:2170-2186. [ Abstract | Full-text PDF ]
Gonzalez, S. F., Chatziandreou, N., Nielsen, M.E., Li, W., Rogers, J., Taylor, R., Santos, Y., Cossins, A. (2007) Cutaneous immune responses in the common carp detected using transcript analysis. Mol Immunol. 44:1664-79. [ PubMed: 17049603 | Abstract | Full-text PDF ]
Williams, D., Epperson, L., Li, W., Hughes, M., Taylor, R. R., Rogers, J., Martin, S., Cossins, A. R., Gracey, A. Y. (2005) Seasonally hibernating phenotype assessed through transcript screening. Physiol Genomics. 24:13-22. [ PubMed: 16249311 | Abstract | Full-text PDF ]
Gracey, A. Y., Fraser, E. J., Li, W., Fang, Y., Taylor, R. R., Rogers, J., Brass, A. (2004). Coping with cold: An integrative, multitissue analysis of the transcriptome of a poikilothermic vertebrate. Proc. Natl. Acad. Sci. USA. 101: 16970-16975. [ PubMed: 15550548 | Abstract | Full-text PDF ]
MRes Bioinformatics, University of Leeds, UK, 2001 (Supervisor – Prof. David Westhead)
BSc Biochemistry, Sun Yat-sen University, China, 1996-2000
European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Wellcome Trust Genome Campus
Cambridge CB10 1SD
Tel: + 44 (0) 1223 492 517