Pipeline overview

Trainer: Wendi Bacon

Overview: This activity gives an overview of scRNA-seq pipelines and helps you make careful and critical interpretations of scRNA-seq data.

By the end of the session you will be able to:

Analyse the data to determine:
- Number of cells
- Number of cell clusters (generate a cluster map!)
- Disease-specific clusters
- Disease-specific transcript signatures

2. Demultiplex your data

Both samples were run on the same sequencing lane with two sample indices from Index Read 1.
- Sample index N701 contained cancerous cells
- Sample index N702 contained only healthy cells
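As a rough illustration of what demultiplexing does, the sketch below splits a pooled list of reads by sample index. The tuple layout `(sample_index, cell_barcode, gene)`, the barcode names and the `demultiplex` function are all assumptions made for this example; they are not part of the activity materials or of any real demultiplexing tool.

```python
from collections import defaultdict

# Hypothetical reads: (sample_index, cell_barcode, gene). All values are made up.
reads = [
    ("N701", "cat", "red"),
    ("N702", "dog", "blue"),
    ("N701", "cat", "blue"),
    ("N702", "fox", "red"),
]


def demultiplex(reads):
    """Group reads by sample index, keeping the (cell_barcode, gene) pairs."""
    samples = defaultdict(list)
    for sample_index, barcode, gene in reads:
        samples[sample_index].append((barcode, gene))
    return dict(samples)


by_sample = demultiplex(reads)
print(sorted(by_sample))  # ['N701', 'N702']
```

From here on, N701 and N702 are handled as two separate lists of reads.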

4. Generate a ‘cell matrix’

A “Cell matrix” is like a “Digital Expression Matrix”: reads that contain the same cell barcode are stacked together so that cell-to-cell differences can be analysed.
Each emoji represents a cell barcode.
Organise your ‘reads’ into cells by combining cell barcodes (keep N701 and N702 separate)
- Example:
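In code, stacking reads by cell barcode could look something like the sketch below. It assumes each read of one sample is a `(cell_barcode, gene)` pair and stores the matrix as a dictionary of per-cell gene counts; the names `build_cell_matrix`, `n701_reads` and the barcode labels are hypothetical.

```python
from collections import Counter, defaultdict


def build_cell_matrix(sample_reads):
    """Count transcripts per gene for each cell barcode.

    Returns {cell_barcode: Counter({gene: count})}.
    """
    matrix = defaultdict(Counter)
    for barcode, gene in sample_reads:
        matrix[barcode][gene] += 1
    return dict(matrix)


# Hypothetical reads from one sample (here N701); values are made up.
n701_reads = [("cat", "red"), ("cat", "red"), ("cat", "blue"), ("owl", "blue")]
n701_matrix = build_cell_matrix(n701_reads)
print(n701_matrix["cat"]["red"])  # 2
```

Run this once per sample so that N701 and N702 stay separate.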

5. Filter the cells

Remove any ‘cell barcodes’ (emojis) that appear fewer than 4 times. These likely represent background. You may also consider whether to put a cap on the highest number of transcripts constituting a cell (doublets may have more transcripts).
Setting a cut-off point (i.e. how many genes or transcripts constitute the minimum number to define a cell) can be tricky.
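A minimal sketch of this filter, assuming the cell matrix is a dictionary of per-cell gene counts. The function name `filter_cells`, the optional upper cap of 12 and the example data are assumptions for illustration only; the minimum of 4 comes from the activity.

```python
def filter_cells(matrix, min_transcripts=4, max_transcripts=None):
    """Keep cells whose total transcript count lies within the given bounds."""
    kept = {}
    for barcode, counts in matrix.items():
        total = sum(counts.values())
        if total < min_transcripts:
            continue  # too few transcripts: likely background
        if max_transcripts is not None and total > max_transcripts:
            continue  # unusually many transcripts: possible doublet
        kept[barcode] = counts
    return kept


# Hypothetical cell matrix; values are made up.
n701_matrix = {
    "cat": {"red": 3, "blue": 2},  # 5 transcripts
    "owl": {"blue": 1},            # 1 transcript
    "bee": {"red": 9, "blue": 8},  # 17 transcripts
}
filtered = filter_cells(n701_matrix, min_transcripts=4, max_transcripts=12)
print(sorted(filtered))  # ['cat']
```

Leaving `max_transcripts` as `None` applies only the lower cut-off.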

6. Filter the genes

Remove any ‘genes’ (colours) that appear fewer than 3 times.
If a gene appears so few times in a sample, it’s unlikely to be informative – it is also difficult mathematically to compare expression when a gene appears so rarely.
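The gene filter can be sketched in the same way. This version assumes that "appears fewer than 3 times" means fewer than 3 transcripts in total across all cells of one sample; `filter_genes` and the example data are hypothetical.

```python
from collections import Counter


def filter_genes(matrix, min_count=3):
    """Drop genes whose total count across all cells is below min_count."""
    totals = Counter()
    for counts in matrix.values():
        totals.update(counts)  # adds each cell's counts to the running totals
    keep = {gene for gene, n in totals.items() if n >= min_count}
    return {
        barcode: {gene: n for gene, n in counts.items() if gene in keep}
        for barcode, counts in matrix.items()
    }


# Hypothetical cell matrix; 'pink' appears only twice in total.
matrix = {
    "cat": {"red": 2, "blue": 1, "pink": 1},
    "dog": {"red": 1, "blue": 3, "pink": 1},
}
gene_filtered = filter_genes(matrix, min_count=3)
print(sorted(gene_filtered["cat"]))  # ['blue', 'red']
```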

7. Normalisation

You don’t actually have to do this. In this specific activity, each cell now has the same number of transcripts. However, in a real sample, this would not be true. Imagine trying to compare transcript signatures between cells with drastically different numbers of transcripts! Normalisation adjusts for these differences so that cells can be compared fairly.
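For a real sample, one simple form of normalisation is to rescale every cell to the same total count. The sketch below does only that; the function name `normalise`, the `target` value and the example data are assumptions, and real pipelines typically apply further steps (for example a log transform) that are not shown here.

```python
def normalise(matrix, target=100):
    """Rescale each cell so that its counts sum to `target`."""
    normalised = {}
    for barcode, counts in matrix.items():
        total = sum(counts.values())
        if total == 0:
            normalised[barcode] = dict(counts)  # nothing to rescale
            continue
        normalised[barcode] = {g: n / total * target for g, n in counts.items()}
    return normalised


# Hypothetical cells with the same signature but very different depth.
matrix = {
    "cat": {"red": 1, "blue": 3},
    "dog": {"red": 10, "blue": 30},
}
norm = normalise(matrix, target=100)
print(norm["cat"], norm["dog"])
```

After rescaling, the two cells have the same values, which reflects that their relative expression is identical even though one was sequenced ten times more deeply.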

8. Find Variable Genes

Some genes don’t vary much between cells – and carrying forward a matrix of size cells x genes can make computation a bit of a nightmare! Standard pipelines only take into account genes that vary significantly.
Remove all ‘yellow’ transcripts – according to the super intense algorithm of “I said = so”, these transcripts have been found to not vary.
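Outside this activity, variable genes are chosen from the data rather than by decree. The sketch below uses plain per-gene variance with a fixed threshold, which is a deliberate simplification and not the method of any particular pipeline; `variable_genes`, `min_variance` and the example data are all hypothetical.

```python
from statistics import pvariance


def variable_genes(matrix, min_variance=0.5):
    """Return the genes whose count variance across cells exceeds min_variance."""
    genes = sorted({gene for counts in matrix.values() for gene in counts})
    variances = {
        gene: pvariance([counts.get(gene, 0) for counts in matrix.values()])
        for gene in genes
    }
    return {gene for gene, var in variances.items() if var > min_variance}


# Hypothetical cell matrix: 'yellow' is identical in every cell.
matrix = {
    "cat": {"red": 5, "yellow": 2},
    "dog": {"red": 1, "yellow": 2},
    "fox": {"red": 0, "yellow": 2},
}
print(variable_genes(matrix))  # {'red'}
```

Here 'yellow' has zero variance, so it carries no information for telling cells apart and is dropped.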

9. Scale Data

This step is not always performed, although it can help make it easier to compare different samples with different depths of sequencing. This step scales the variation between genes to make them more easily comparable (otherwise, genes with strong expression differences will dominate the analysis, hiding subtle differences from other genes). With this step, you can also optionally ‘regress’ genes, which is to say, their variation will not contribute to cluster calling.
Green genes here have been found to contribute to cell cycling. We are not interested in this and don’t want it to obscure the genes driving cancer progression. Remove the green genes (‘cell cycle regression’).
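A small sketch of the scaling idea: each gene is centred and divided by its standard deviation across cells (a z-score), so that strongly expressed genes do not dominate. Dropping the 'green' genes mirrors the simplification used in this activity; it is not how regression is actually computed in real pipelines. The function `scale_genes`, its `drop` parameter and the data are assumptions.

```python
from statistics import mean, pstdev


def scale_genes(matrix, drop=()):
    """Z-score each gene across cells, omitting any genes listed in `drop`."""
    genes = sorted({g for counts in matrix.values() for g in counts} - set(drop))
    scaled = {barcode: {} for barcode in matrix}
    for gene in genes:
        values = [matrix[b].get(gene, 0) for b in matrix]
        mu, sd = mean(values), pstdev(values)
        for barcode in matrix:
            value = matrix[barcode].get(gene, 0)
            # A gene with no variation gets 0 everywhere to avoid dividing by zero.
            scaled[barcode][gene] = (value - mu) / sd if sd > 0 else 0.0
    return scaled


# Hypothetical cell matrix; 'green' stands for a cell cycle gene.
matrix = {
    "cat": {"red": 2, "green": 7},
    "dog": {"red": 4, "green": 1},
}
scaled = scale_genes(matrix, drop=("green",))
print(scaled)  # {'cat': {'red': -1.0}, 'dog': {'red': 1.0}}
```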

10. Dimensionality Reduction

Normally dimensionality reduction is a huge part of this protocol. There are only 3 dimensions (i.e. 3 genes) in this data, so you can skip this!

11. Identify cell clusters

Group the cells by the ‘transcript signatures’.
- Example:
  These cells would be in the same cluster

But likely not in the same cluster as this cell:
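The grouping idea can also be sketched in code. This version puts cells in the same cluster only when their transcript signatures match exactly, which works for a toy data set where every cell has the same number of transcripts. It is a stand-in for illustration; real pipelines generally cluster on similarity rather than exact matches. `cluster_by_signature` and the example cells are hypothetical.

```python
from collections import defaultdict


def cluster_by_signature(matrix):
    """Group cell barcodes that share exactly the same gene counts."""
    clusters = defaultdict(list)
    for barcode, counts in matrix.items():
        # The signature is the sorted list of (gene, count) pairs with count > 0.
        signature = tuple(sorted((g, n) for g, n in counts.items() if n > 0))
        clusters[signature].append(barcode)
    return dict(clusters)


# Hypothetical cell matrix; values are made up.
matrix = {
    "cat": {"red": 3, "blue": 1},
    "dog": {"red": 3, "blue": 1},
    "fox": {"blue": 4},
}
clusters = cluster_by_signature(matrix)
print(sorted(clusters.values()))  # [['cat', 'dog'], ['fox']]
```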

12. Plot your cells

Select your Cluster Plot here
Plot the cells using the ‘cell clusters’ you identified in Step 11. Similar cells should be plotted close together. Put a circle around each cell cluster.
- Example:

13. Interpret the results

Answer the following questions:
- Were there any cells you couldn’t classify?
- How many total cells did you find?
- How many cell types (clusters) are in your final map?
- How did you interpret the results?
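To tie the results back to the goals of the session (number of cells, number of clusters, disease-specific clusters), the sketch below compares the clusters found in the two samples. It assumes clusters are stored as `{signature: [cell barcodes]}` per sample; the signature labels, barcodes and the function `candidate_disease_clusters` are hypothetical. A cluster seen only in N701 is a candidate, not a proof, of a disease-specific population.

```python
def candidate_disease_clusters(clusters_n701, clusters_n702):
    """Return clusters whose signature occurs in N701 but not in N702."""
    return {
        signature: cells
        for signature, cells in clusters_n701.items()
        if signature not in clusters_n702
    }


# Hypothetical clusters per sample; labels and barcodes are made up.
clusters_n701 = {"mostly red": ["cat", "dog"], "all blue": ["fox"]}
clusters_n702 = {"all blue": ["owl", "bee"]}

total_cells = sum(
    len(cells)
    for sample in (clusters_n701, clusters_n702)
    for cells in sample.values()
)
cell_types = set(clusters_n701) | set(clusters_n702)
candidates = candidate_disease_clusters(clusters_n701, clusters_n702)

print(total_cells)      # 5
print(len(cell_types))  # 2
print(candidates)       # {'mostly red': ['cat', 'dog']}
```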

14. Check the answer key here.

Single-cell RNA-seq analysis using R