Pipeline overview

Trainer: Wendi Bacon

Overview: This activity gives an overview of scRNA-seq pipelines and helps you make careful, critical interpretations of scRNA-seq data.

Learning outcomes

By the end of the session you will be able to:

  • Identify and describe challenges and limitations in scRNA-seq analysis

Activity goals

  • Analyse the data to determine:
    • Number of cells
    • Number of cell clusters (generate a cluster map!)
    • Disease-specific clusters
    • Disease-specific transcript signatures

Activity steps

1. Examine your data

Example of one read:
  • Cell Barcode: (an emoji)
  • Sample Index: N701
  • Transcript: Pink (i.e. GAPDH)

2. Demultiplex your data

  • Both samples were run on the same sequencing lane with two sample indices on Index Read 1.
    • Sample Index N701 contained cancerous cells
    • Sample Index N702 contained only healthy cells
  • Divide your reads into N701 and N702 (and keep them separate!)
    • Example (image not reproduced): reads grouped by sample index, N702 in one group and N701 in the other
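If you would like to see the same sorting done in code, here is a minimal sketch. The read format (a tuple of sample index, cell barcode and transcript) and the toy values are assumptions made up for illustration; they are not the activity's real data.

```python
from collections import defaultdict

# Toy reads, invented for illustration: (sample_index, cell_barcode, transcript)
reads = [
    ("N701", "cat", "pink"),
    ("N702", "dog", "blue"),
    ("N701", "cat", "blue"),
    ("N702", "fox", "pink"),
]


def demultiplex(reads):
    """Group reads by sample index, keeping each sample's reads separate."""
    by_sample = defaultdict(list)
    for sample_index, cell_barcode, transcript in reads:
        by_sample[sample_index].append((cell_barcode, transcript))
    return dict(by_sample)


samples = demultiplex(reads)
for sample_index in sorted(samples):
    print(sample_index, samples[sample_index])
```

From here on, each sample (N701 and N702) would be processed on its own, just as in the activity.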

3. Generate a ‘cell matrix’

  • A “cell matrix” or “Digital Expression Matrix” is formed when reads that contain the same cell barcode are stacked so that cell-to-cell differences can be analysed. Duplicates (reads sharing the same UMI) are also removed in this step, but there are no duplicates in this magical dataset!
  • Each emoji represents a cell barcode.
  • Organise your ‘reads’ into cells by grouping reads that share a cell barcode (keep N701 and N702 separate)
    • Example (image not reproduced): N701 reads stacked by cell barcode
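The stacking step can be sketched in a few lines of code. This is only an illustration: the UMI field and all toy values are assumptions (the activity's dataset has no duplicates), and a duplicate is taken to mean a read with the same cell barcode, UMI and transcript as another read.

```python
from collections import Counter, defaultdict

# Toy reads for ONE sample, invented for illustration:
# (cell_barcode, umi, transcript)
reads = [
    ("cat", "AAC", "pink"),
    ("cat", "AAC", "pink"),  # same barcode, UMI and transcript: a duplicate
    ("cat", "GTT", "pink"),
    ("cat", "CCA", "blue"),
    ("dog", "AAC", "blue"),
]


def build_cell_matrix(reads):
    """Return {cell_barcode: {transcript: count}} after dropping duplicates."""
    unique_reads = set(reads)  # identical tuples collapse to a single read
    matrix = defaultdict(Counter)
    for cell_barcode, _umi, transcript in unique_reads:
        matrix[cell_barcode][transcript] += 1
    return {cell: dict(counts) for cell, counts in matrix.items()}


cell_matrix = build_cell_matrix(reads)
print(cell_matrix)
```

Each key is one cell and each inner dictionary is that cell's transcript counts, which is the same information as one row of the matrix.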

4. Plot your cells!

  • Make a copy of this template
  • Plot the cells into clusters as best you can (similar cells should be next to each other)
    • Example

5. Filter the cells

  • Consider: what made the previous step so difficult?
  • We will now apply a whole host of analytical tools to make defining clusters and signatures easier than what you just experienced. Go back to your ‘reads’ tab.
  • Remove any ‘cell barcodes’ (emojis) that appear fewer than 4 times. You may also consider whether to put a cap on the highest number of transcripts constituting a cell (doublets may have more transcripts).
    • Cell barcodes with few reads likely represent background. Setting a cut-off point (i.e. how many genes or transcripts constitute the minimum number to define a cell) is important, but can be tricky, particularly with heterogeneous cell sizes.
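As a rough sketch of this filter in code: the function name, the optional upper cap and the toy counts below are assumptions for illustration. The minimum of 4 comes from the step above; the cap of 10 is an arbitrary example value.

```python
def filter_cells(cell_matrix, min_transcripts=4, max_transcripts=None):
    """Keep cells whose total transcript count lies within the cut-offs."""
    kept = {}
    for cell, counts in cell_matrix.items():
        total = sum(counts.values())
        if total < min_transcripts:
            continue  # too few reads: likely background
        if max_transcripts is not None and total > max_transcripts:
            continue  # unusually many reads: possibly a doublet
        kept[cell] = counts
    return kept


# Toy cell matrix, invented for illustration
cell_matrix = {
    "cat": {"pink": 3, "blue": 2},  # 5 transcripts
    "dog": {"blue": 2},             # 2 transcripts
    "fox": {"pink": 6, "blue": 6},  # 12 transcripts
}

filtered = filter_cells(cell_matrix, min_transcripts=4, max_transcripts=10)
print(sorted(filtered))
```

Try changing the two cut-offs and watch which cells survive; this is exactly the judgement call that makes the step tricky with heterogeneous cell sizes.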

6. Filter the genes

  • Remove any ‘genes’ (colours) that appear fewer than 3 times.
    • If a gene appears so few times in a sample, it’s unlikely to be informative – it is also difficult mathematically to compare expression when a gene appears so rarely.
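The gene filter can be sketched the same way. Here a gene's count is assumed to be its total across all cells in one sample; the function name and toy counts are made up for illustration.

```python
from collections import Counter


def filter_genes(cell_matrix, min_count=3):
    """Drop genes whose total count across all cells is below min_count."""
    totals = Counter()
    for counts in cell_matrix.values():
        totals.update(counts)  # adds each gene's count to the running total
    keep = {gene for gene, n in totals.items() if n >= min_count}
    return {
        cell: {gene: n for gene, n in counts.items() if gene in keep}
        for cell, counts in cell_matrix.items()
    }


# Toy cell matrix, invented for illustration
cell_matrix = {
    "cat": {"pink": 3, "purple": 1},
    "fox": {"pink": 2, "blue": 4, "purple": 1},
}

genes_filtered = filter_genes(cell_matrix)
print(genes_filtered)
```

In this toy example ‘purple’ appears only twice in total, so it is removed from every cell.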

7. Normalisation

Just kidding! You don’t actually have to do this. In this specific activity, each cell now has about the same number of transcripts. However, in a real sample, this would not be true – imagine trying to compare transcript signatures between cells with drastically different numbers! Anyway, normalisation helps here… More on that later!
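To make the idea concrete, here is a minimal sketch of one simple form of normalisation: rescaling every cell to the same total. The target total of 10 and the toy counts are arbitrary assumptions; real pipelines typically use a much larger target and also apply a log transform.

```python
def normalise(cell_matrix, target_total=10.0):
    """Rescale each cell's counts so they sum to target_total.

    Assumes every cell has at least one transcript (empty cells should
    already have been filtered out).
    """
    normalised = {}
    for cell, counts in cell_matrix.items():
        total = sum(counts.values())
        normalised[cell] = {
            gene: n * target_total / total for gene, n in counts.items()
        }
    return normalised


# Toy cells with the same proportions but very different depths
cell_matrix = {
    "cat": {"pink": 2, "blue": 2},    # 4 transcripts
    "fox": {"pink": 10, "blue": 10},  # 20 transcripts
}

normalised = normalise(cell_matrix)
print(normalised)
```

After rescaling, the two toy cells have identical signatures, even though one had five times as many transcripts as the other.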

8. Find variable genes

  • Some genes don’t vary much between cells, and carrying forward a matrix of size cells × genes can make computation a bit of a nightmare! Standard pipelines only carry forward genes that vary significantly between cells.
  • Remove all ‘yellow’ transcripts. According to the super intense algorithm of “I said so”, these transcripts have been found to not vary.
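For a slightly less arbitrary algorithm than “I said so”, here is a sketch that measures how much each gene varies across cells and keeps the ones above a threshold. The plain variance, the threshold of 0.5 and the toy counts are simplifying assumptions; real pipelines usually also account for how variance depends on mean expression.

```python
from statistics import pvariance


def variable_genes(cell_matrix, min_variance=0.5):
    """Return (set of variable genes, {gene: variance across cells})."""
    genes = sorted({gene for counts in cell_matrix.values() for gene in counts})
    variances = {
        # a gene missing from a cell counts as 0 in that cell
        gene: pvariance([counts.get(gene, 0) for counts in cell_matrix.values()])
        for gene in genes
    }
    keep = {gene for gene, v in variances.items() if v >= min_variance}
    return keep, variances


# Toy cell matrix, invented for illustration
cell_matrix = {
    "cat": {"pink": 4, "yellow": 2},
    "dog": {"yellow": 2, "blue": 3},
    "fox": {"pink": 4, "yellow": 2},
}

keep, variances = variable_genes(cell_matrix)
print(sorted(keep))
```

In the toy data ‘yellow’ has the same count in every cell, so its variance is zero and it is dropped.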

9. Scale Data

  • This step is not always performed, but it can make it easier to compare samples sequenced to different depths. It scales the variation of each gene so that genes are more easily comparable (otherwise, genes with strong expression differences will dominate the analysis, hiding subtle differences in other genes). With this step, you can also optionally ‘regress out’ genes, which is to say, their variation will not contribute to cluster calling.
  • Green genes here have been found to contribute to cell cycling. We are not interested in this and don’t want it to obscure the genes driving cancer progression. Remove the green genes (‘cell cycle regression’).
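A minimal sketch of scaling, under stated assumptions: each gene is converted to a z-score across cells, and the ‘cell cycle regression’ is simplified to dropping the listed genes outright, as in this activity (true regression in real pipelines is more involved). The function name and toy counts are made up for illustration.

```python
from statistics import mean, pstdev


def scale_genes(cell_matrix, drop_genes=()):
    """Z-score each gene across cells, ignoring any genes in drop_genes."""
    all_genes = {gene for counts in cell_matrix.values() for gene in counts}
    genes = sorted(all_genes - set(drop_genes))
    scaled = {cell: {} for cell in cell_matrix}
    for gene in genes:
        values = [cell_matrix[cell].get(gene, 0) for cell in cell_matrix]
        mu, sigma = mean(values), pstdev(values)
        for cell, value in zip(cell_matrix, values):
            # a gene with no variation carries no signal, so score it 0
            scaled[cell][gene] = 0.0 if sigma == 0 else (value - mu) / sigma
    return scaled


# Toy cell matrix: 'pink' is expressed ten times more strongly than 'blue'
cell_matrix = {
    "cat": {"pink": 10, "blue": 1, "green": 3},
    "dog": {"pink": 30, "blue": 3, "green": 1},
}

scaled = scale_genes(cell_matrix, drop_genes={"green"})
print(scaled)
```

Before scaling, ‘pink’ differs by 20 counts between the two cells and ‘blue’ by only 2; after scaling, both genes contribute equally, which is the point of the step.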

10. Dimensionality Reduction

  • Normally dimensionality reduction is a huge part of this protocol. Lucky for you, there are only 3 dimensions (i.e. 3 genes) in this data, so you can skip this!

11. Identify cell clusters

  • Group the cells by the ‘transcript signatures’.
  • Example:
    • These cells would be in the same cluster
  • But likely not in the same cluster as this cell:
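Grouping by signature can also be sketched in code. The rule used here (a cell's signature is simply its most abundant transcript) is a deliberately crude assumption for illustration; real pipelines typically cluster on the reduced dimensions with graph-based or k-means methods. The toy counts are invented.

```python
from collections import defaultdict


def dominant_gene(counts):
    """Most abundant gene; ties are broken alphabetically for reproducibility."""
    return min(counts, key=lambda gene: (-counts[gene], gene))


def cluster_cells(cell_matrix):
    """Group cells that share the same dominant gene.

    Assumes every cell has at least one transcript.
    """
    clusters = defaultdict(list)
    for cell, counts in cell_matrix.items():
        clusters[dominant_gene(counts)].append(cell)
    return {signature: sorted(cells) for signature, cells in clusters.items()}


# Toy cell matrix, invented for illustration
cell_matrix = {
    "cat": {"pink": 5, "blue": 1},
    "dog": {"pink": 4},
    "fox": {"blue": 6, "pink": 1},
}

clusters = cluster_cells(cell_matrix)
print(clusters)
```

Running this separately on N701 and N702 and comparing the resulting signatures is one way to start looking for disease-specific clusters.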

12. Plot your cells, for real this time!

  • Make a copy of this file
  • Plot the cells using the ‘cell clusters’ you identified in Step 11. Similar cells should be plotted close together. Put a circle around each cell cluster.
    • Example

13. Interpret the results

  • Answer these questions:
    • Were there any cells you couldn’t classify?
    • How many total cells did you find?
    • How many cell types (clusters) are in your final map?
    • How did you interpret the results?

14. Check the answer key here.