An automated method for cell type discovery

Artist's interpretation of cell clusters in humans and mice. Image credit: Spencer Phillips

18 May 2020 - 16:22


  • A new method based on machine learning identifies cell types from single-cell RNA sequencing data
  • Characterising existing and new cell types helps researchers understand how cells differentiate as they divide in health and disease
  • The method is now implemented in the Single Cell Expression Atlas and the Human Cell Atlas

18 May, Cambridge – Identifying different types of cells within a tissue or an organ can be very challenging and time-consuming. Methods to identify cell types from single-cell RNA sequencing data have been proposed, but they all fall short of discovering potentially new cell types. Single Cell Clustering Assessment Framework (SCCAF) is a new method that bridges this gap. Published today in Nature Methods, this automated method uses machine learning to replicate the manual expert annotations normally used for this task, and it can also characterise new cell types.

All somatic cells in a multicellular organism have the same genome, yet they perform a variety of functions. This functional diversity occurs between cells of different types (skin cells and neurons, for instance), but also between states of the same cell lineage as it differentiates.

Historically, researchers have identified cell types or states based on visible features or the expression of a handful of genes. Single-cell RNA sequencing (scRNA-seq) has brought high-throughput gene expression data into the picture.

A cell’s gene expression pattern (which genes are expressed at what level) serves as a proxy for its function and allows scientists to classify or “cluster” that cell with others that have the same function. Until now, annotating cells from scRNA-seq data has required time-consuming human intervention, with automated methods unable to identify cell types or states that had not been previously annotated by human experts.

Machine learning takes over

Zhichao Miao, of the Brazma Group at EMBL’s European Bioinformatics Institute (EMBL-EBI) and the Teichmann Group at the Wellcome Sanger Institute, has worked with the Gene Expression Team at EMBL-EBI to develop a method that uses machine learning to address these challenges.

Single Cell Clustering Assessment Framework (SCCAF) starts by using a clustering algorithm to group the cells of a sample into many clusters, based on their gene expression patterns. Each cell cluster is split into a “training set” and a “testing set” for the second stage of the analysis. A classifying model then takes over, using the training set to learn to distinguish cell clusters, and predicting likely clusters in the testing set. The model’s accuracy is assessed by comparing its prediction with the original clusters. “The model repeats the training and testing steps, gradually merging indistinguishable clusters, until its accuracy reaches a good enough level,” explains Miao. Finally, SCCAF lists a set of feature genes to characterise each annotated cluster.
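The train, test and merge loop described above can be illustrated with a minimal Python sketch. This is not the SCCAF implementation: it assumes a simple nearest-centroid classifier as a stand-in for the classifying model, a made-up two-gene toy dataset, and an arbitrary accuracy threshold of 0.9. All function names (`split_by_cluster`, `fit_centroids`, `predict`, `evaluate`, `assess_and_merge`) are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def split_by_cluster(labels):
    """Alternate each cluster's members between a training and a testing set."""
    seen = Counter()
    train, test = [], []
    for i, lab in enumerate(labels):
        (train if seen[lab] % 2 == 0 else test).append(i)
        seen[lab] += 1
    return train, test


def fit_centroids(points, labels, train):
    """Stand-in classifier (assumption): one mean expression vector per cluster."""
    groups = defaultdict(list)
    for i in train:
        groups[labels[i]].append(points[i])
    return {lab: tuple(sum(col) / len(pts) for col in zip(*pts))
            for lab, pts in groups.items()}


def predict(centroids, point):
    """Assign a cell to the cluster whose centroid is nearest."""
    def sq_dist(lab):
        return sum((p - c) ** 2 for p, c in zip(point, centroids[lab]))
    return min(sorted(centroids), key=sq_dist)


def evaluate(points, labels):
    """Return test-set accuracy and counts of confused cluster pairs."""
    train, test = split_by_cluster(labels)
    if not test:
        return 1.0, Counter()
    centroids = fit_centroids(points, labels, train)
    confusion, correct = Counter(), 0
    for i in test:
        guess = predict(centroids, points[i])
        if guess == labels[i]:
            correct += 1
        else:
            confusion[tuple(sorted((guess, labels[i])))] += 1
    return correct / len(test), confusion


def assess_and_merge(points, labels, threshold=0.9, max_rounds=20):
    """Merge the most-confused pair of clusters until accuracy is good enough."""
    labels = list(labels)
    for _ in range(max_rounds):
        accuracy, confusion = evaluate(points, labels)
        if accuracy >= threshold or not confusion:
            break
        keep, drop = confusion.most_common(1)[0][0]
        labels = [keep if lab == drop else lab for lab in labels]
    return labels, evaluate(points, labels)[0]


# Toy data (assumption): two well-separated groups of "cells" with two genes.
group_a = [(0.1 * x, 0.1 * y) for x in range(6) for y in range(6)]
group_b = [(10 + 0.1 * x, 10 + 0.1 * y) for x in range(4) for y in range(4)]
points = group_a + group_b
# Deliberate over-clustering: group A is split into two interleaved clusters.
initial = [i % 2 for i in range(len(group_a))] + [2] * len(group_b)

final_labels, final_accuracy = assess_and_merge(points, initial)
print(sorted(set(final_labels)), final_accuracy)  # prints: [0, 2] 1.0
```

In this toy run, the two interleaved clusters carved out of the first group are frequently confused with each other, so they are merged, while the second group is kept as its own cluster. A real analysis would work on thousands of genes and use a stronger classifier; the sketch only shows the shape of the loop.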

SCCAF: fast and reliable

Miao and colleagues have shown their method to be highly reliable. “We’ve tested the method on many existing large-scale datasets of human and mouse gene expression, treating human annotation as a gold standard. Our method can reproduce human annotation in an automated manner,” says Alvis Brazma, Functional Genomics Senior Team Leader at EMBL-EBI. “By minimising human involvement in data processing, we solve the most important bottleneck in high-throughput projects, such as the Human Cell Atlas.”

Not only does SCCAF reproduce and refine existing cell type classification, it also helps reveal new cell types and states from unannotated samples. The new method will be implemented in large-scale projects, including the Single Cell Expression Atlas and the Human Cell Atlas, to expand our knowledge of cell functional diversity.

The Single Cell Expression Atlas and the Human Cell Atlas

The Single Cell Expression Atlas is an online resource maintained by EMBL-EBI to search, visualise, and analyse single-cell RNA sequencing data across species and experiments. It includes data from the Human Cell Atlas, an online map of all human cell types.

Source article

MIAO, Z. (2020). Putative cell type discovery from single-cell gene expression data. Nature Methods. Published online 18 May. DOI: 10.1038/s41592-020-0825-9

Contact the news team

Oana Stroe
Senior Communications Officer
+44 (0)1223 494 369
