Pipeline overview

Trainer: Wendi Bacon

Overview: This activity gives an overview of scRNA-seq pipelines and helps you make careful, critical interpretations of scRNA-seq data.

Learning outcomes

By the end of the session you will be able to:

Identify and describe challenges and limitations in scRNA-seq analysis

Activity goals

Analyse the data to determine:
- Number of cells
- Number of cell clusters (generate a cluster map!)
- Disease-specific clusters
- Disease-specific transcript signatures

Activity steps

1. Examine your data

Make a copy of this activity template
- Key

1 Read
Cell Barcode:
Sample Index: N701
Transcript: Pink (e.g. GAPDH)

2. Demultiplex your data

Both samples were run on the same sequencing lane with two sample indices on Index Read 1.
- Sample Index N701 contained cancerous cells
- Sample Index N702 contained only healthy cells
Divide your reads into N701 and N702 (and keep them separate!)
- Example

[Example image: reads sorted into two groups, labelled N702 and N701]
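For anyone who wants to see the same step outside the spreadsheet, demultiplexing can be sketched in a few lines of Python. This is only an illustration: the read tuples, the barcode names ("cat", "dog", "fox") and the function name `demultiplex` are made up for this example and do not come from the activity template.

```python
from collections import defaultdict

# Toy reads as (sample_index, cell_barcode, transcript). All values are invented.
reads = [
    ("N701", "cat", "pink"),
    ("N702", "dog", "blue"),
    ("N701", "cat", "blue"),
    ("N702", "fox", "pink"),
]

def demultiplex(reads):
    """Split reads into one list per sample index, preserving read order."""
    by_sample = defaultdict(list)
    for sample_index, barcode, transcript in reads:
        by_sample[sample_index].append((barcode, transcript))
    return dict(by_sample)

samples = demultiplex(reads)
for sample_index in sorted(samples):
    print(sample_index, samples[sample_index])
```

Every later step would then be run on `samples["N701"]` and `samples["N702"]` separately, mirroring the instruction to keep the two samples apart.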

3. Generate a ‘cell matrix’

A “cell matrix” or “digital expression matrix” is formed when reads that contain the same cell barcode are stacked so that cell-to-cell differences can be analysed. Duplicate reads (identified by their UMIs) are also removed in this step, but there are no duplicates in this magical dataset!
Each emoji represents a cell barcode.
Organise your ‘reads’ into cells by combining cell barcodes (keep N701 and N702 separate)
- Example

[Example image: reads stacked by cell barcode, each labelled with its sample index (N701 or N702)]
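Stacking reads by cell barcode amounts to counting how often each transcript appears per barcode. The sketch below assumes reads from a single sample stored as (cell_barcode, transcript) pairs; the barcode names and the function `build_cell_matrix` are hypothetical, and no UMI deduplication is attempted because the activity dataset has no duplicates.

```python
from collections import Counter, defaultdict

# Toy reads from ONE sample index: (cell_barcode, transcript). Values are invented.
reads = [
    ("cat", "pink"),
    ("cat", "pink"),
    ("cat", "blue"),
    ("dog", "yellow"),
    ("dog", "pink"),
]

def build_cell_matrix(reads):
    """Count transcripts per cell barcode: {barcode: {transcript: count}}."""
    matrix = defaultdict(Counter)
    for barcode, transcript in reads:
        matrix[barcode][transcript] += 1
    # Convert to plain dicts so the result is easy to print and compare.
    return {barcode: dict(counts) for barcode, counts in matrix.items()}

cell_matrix = build_cell_matrix(reads)
for barcode, counts in cell_matrix.items():
    print(barcode, counts)
```

Each outer key is one cell (one emoji in the activity) and each inner dictionary is that cell's row of the matrix.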

4. Plot your cells!

Make a copy of this template
Plot the cells into clusters as best you can (similar cells should be next to each other)
- Example

5. Filter the cells

Consider: what made the above step so difficult?
We will now apply a whole host of analytical tools to make defining clusters and signatures easier than what you just experienced. Go back to your ‘reads’ tab.
Remove any ‘cell barcodes’ (emojis) that appear fewer than 4 times. You may also consider whether to put a cap on the highest number of transcripts constituting a cell (doublets may have more transcripts).
- Cell barcodes with few reads likely represent background. Setting a cut-off point (i.e. how many genes or transcripts constitute the minimum number to define a cell) is important, but can be tricky, particularly with heterogeneous cell sizes.
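The cell filter can be written as a threshold on the total number of transcripts per barcode. The sketch below uses the activity's cut-off of 4 as the default minimum; the optional upper cap, the example counts and the name `filter_cells` are assumptions made for illustration.

```python
# Toy cell matrix: {cell_barcode: {transcript: count}}. Values are invented.
cell_matrix = {
    "cat": {"pink": 3, "blue": 2},   # 5 transcripts: kept
    "dog": {"pink": 1},              # 1 transcript: likely background
    "fox": {"pink": 6, "blue": 6},   # 12 transcripts: possible doublet
}

def filter_cells(cell_matrix, min_transcripts=4, max_transcripts=None):
    """Keep barcodes whose total transcript count lies within the given bounds."""
    kept = {}
    for barcode, counts in cell_matrix.items():
        total = sum(counts.values())
        if total < min_transcripts:
            continue  # too few transcripts to call this a cell
        if max_transcripts is not None and total > max_transcripts:
            continue  # suspiciously many transcripts, e.g. a doublet
        kept[barcode] = counts
    return kept

print(sorted(filter_cells(cell_matrix)))
print(sorted(filter_cells(cell_matrix, max_transcripts=10)))
```

Changing `min_transcripts` and `max_transcripts` shows how sensitive the final cell count is to these cut-offs.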

6. Filter the genes

Remove any ‘genes’ (colours) that appear fewer than 3 times.
- If a gene appears so few times in a sample, it’s unlikely to be informative – it is also difficult mathematically to compare expression when a gene appears so rarely.
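The gene filter works the same way as the cell filter, but the totals are taken per gene across all cells of a sample. This sketch assumes the cut-off of 3 from the step above; the example counts and the name `filter_genes` are invented for illustration.

```python
from collections import Counter

# Toy cell matrix for one sample: {cell_barcode: {gene: count}}. Values are invented.
cell_matrix = {
    "cat": {"pink": 2, "blue": 1},
    "dog": {"pink": 2, "green": 1},
}

def filter_genes(cell_matrix, min_count=3):
    """Drop genes whose total count across all cells is below min_count."""
    totals = Counter()
    for counts in cell_matrix.values():
        totals.update(counts)  # adds each gene's count to the running total
    keep = {gene for gene, total in totals.items() if total >= min_count}
    return {
        barcode: {gene: n for gene, n in counts.items() if gene in keep}
        for barcode, counts in cell_matrix.items()
    }

filtered_genes = filter_genes(cell_matrix)
print(filtered_genes)
```

Here only "pink" survives, because "blue" and "green" each appear once in the whole sample.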

7. Normalisation

Just kidding! You don’t actually have to do this. In this specific activity, each cell now has about the same number of transcripts. However, in a real sample, this would not be true – imagine trying to compare transcript signatures between cells with drastically different numbers! Anyway, normalisation helps here… More on that later!
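To get a feel for what normalisation does, here is a minimal sketch of one common approach: rescaling every cell so that its counts sum to the same total. The target total, the example cells and the name `normalise` are assumptions for illustration; real pipelines often add further steps such as a log transform.

```python
# Two toy cells with the same expression profile but very different depth.
cell_matrix = {
    "small": {"pink": 1, "blue": 3},
    "big": {"pink": 10, "blue": 30},
}

def normalise(cell_matrix, target_total=100):
    """Rescale each cell so its counts sum to target_total."""
    normalised = {}
    for barcode, counts in cell_matrix.items():
        total = sum(counts.values())
        if total == 0:
            continue  # empty barcodes should already have been filtered out
        normalised[barcode] = {
            gene: count * target_total / total for gene, count in counts.items()
        }
    return normalised

normalised = normalise(cell_matrix)
print(normalised)
```

After rescaling, the two cells have identical values, which is the point: differences in total transcript number no longer mask the shared signature.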

8. Find variable genes

Some genes don’t vary much between cells – and carrying forward a matrix of size cells x genes can make computation a bit of a nightmare! Standard pipelines only take into account genes that vary significantly.
Remove all ‘yellow’ transcripts. According to the super intense algorithm of “I said so”, these transcripts have been found to not vary.
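Outside this activity, "does not vary" has to be measured rather than declared. A minimal sketch, assuming a plain variance threshold stands in for the more elaborate statistics used by real pipelines, is shown below; the example counts, the threshold and the name `variable_genes` are all invented for illustration.

```python
from statistics import pvariance

# Toy cell matrix: {cell_barcode: {gene: count}}. Missing genes count as 0.
cell_matrix = {
    "a": {"pink": 5, "yellow": 2},
    "b": {"yellow": 2, "blue": 4},
    "c": {"pink": 1, "yellow": 2, "blue": 5},
}

def variable_genes(cell_matrix, min_variance=0.5):
    """Return the genes whose count variance across cells exceeds min_variance."""
    genes = {gene for counts in cell_matrix.values() for gene in counts}
    variable = []
    for gene in sorted(genes):
        values = [counts.get(gene, 0) for counts in cell_matrix.values()]
        if pvariance(values) > min_variance:
            variable.append(gene)
    return variable

kept_genes = variable_genes(cell_matrix)
print(kept_genes)
```

In this toy data "yellow" has the same count in every cell, so its variance is zero and it is dropped, just like the yellow transcripts in the step above.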

9. Scale Data

This step is not always performed, although it can make it easier to compare samples sequenced to different depths. It scales the variation of each gene so that genes are more easily comparable (otherwise, genes with strong expression differences will dominate the analysis, hiding subtle differences from other genes). With this step, you can also optionally ‘regress out’ genes, which is to say, their variation will not contribute to cluster calling.
Green genes here have been found to contribute to cell cycling. We are not interested in this and don’t want it to obscure the genes driving cancer progression. Remove the green genes (‘cell cycle regression’).
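Scaling can be sketched as a per-gene z-score: subtract the gene's mean across cells and divide by its standard deviation. The example below also drops a list of unwanted genes, which is how this activity models ‘regression’; it is a simplification, since regression in real pipelines typically fits and subtracts an effect rather than deleting genes. The data and the name `scale` are hypothetical.

```python
from statistics import mean, pstdev

# Toy cell matrix: {cell_barcode: {gene: count}}. Values are invented.
cell_matrix = {
    "a": {"pink": 2, "green": 7},
    "b": {"pink": 4, "green": 1},
    "c": {"pink": 6, "green": 1},
}

def scale(cell_matrix, regress_out=()):
    """Z-score each gene across cells, ignoring genes listed in regress_out."""
    all_genes = {gene for counts in cell_matrix.values() for gene in counts}
    genes = sorted(all_genes - set(regress_out))
    scaled = {barcode: {} for barcode in cell_matrix}
    for gene in genes:
        values = [counts.get(gene, 0) for counts in cell_matrix.values()]
        mu, sd = mean(values), pstdev(values)
        for barcode, counts in cell_matrix.items():
            # A gene with zero spread carries no information; report it as 0.
            scaled[barcode][gene] = (counts.get(gene, 0) - mu) / sd if sd else 0.0
    return scaled

scaled = scale(cell_matrix, regress_out=["green"])
print(scaled)
```

After scaling, every remaining gene has mean 0 and unit spread across cells, so no single highly expressed gene dominates the comparison.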

10. Dimensionality Reduction

Normally dimensionality reduction is a huge part of this protocol. Lucky for you, there are only 3 dimensions (i.e. 3 genes) in this data, so you can skip this!
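For the curious, the sketch below shows the idea behind one widely used reduction, principal component analysis, on a tiny 3-gene matrix. It finds the single direction along which cells differ most, using a hand-written power iteration so that no extra libraries are needed. The data and function name are invented, and a real analysis would use a library implementation on thousands of genes.

```python
import math

# Rows are cells, columns are 3 genes. Gene 3 is constant; genes 1 and 2 rise together.
rows = [[1, 2, 5], [2, 4, 5], [3, 6, 5], [4, 8, 5]]

def first_principal_component(rows, iterations=100):
    """Return (gene means, unit vector of greatest variance) via power iteration."""
    n_cells, n_genes = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n_cells for j in range(n_genes)]
    centred = [[row[j] - means[j] for j in range(n_genes)] for row in rows]
    # Gene-by-gene covariance matrix of the centred data.
    cov = [
        [sum(r[i] * r[j] for r in centred) / n_cells for j in range(n_genes)]
        for i in range(n_genes)
    ]
    vec = [1.0] * n_genes  # arbitrary starting direction
    for _ in range(iterations):
        vec = [sum(cov[i][j] * vec[j] for j in range(n_genes)) for i in range(n_genes)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return means, vec

means, pc1 = first_principal_component(rows)
# Project each cell onto the component: 3 numbers per cell become 1.
projections = [sum((row[j] - means[j]) * pc1[j] for j in range(3)) for row in rows]
print(pc1)
print(projections)
```

The constant third gene gets a weight of zero, and each cell is summarised by a single coordinate that preserves the ordering of the cells along the main axis of variation.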

11. Identify cell clusters

Group the cells by the ‘transcript signatures’.
Example:
- These cells would be in the same cluster

But likely not in the same cluster as this cell:
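In code, the simplest possible version of this grouping is to treat the set of genes a cell expresses as its signature and put cells with identical signatures together. This is a deliberately naive stand-in for the distance-based clustering algorithms used in real pipelines; the data and the name `cluster_by_signature` are invented for illustration.

```python
from collections import defaultdict

# Toy cell matrix after filtering: {cell_barcode: {gene: count}}. Values are invented.
cell_matrix = {
    "c1": {"pink": 3, "blue": 1},
    "c2": {"pink": 2, "blue": 2},
    "c3": {"blue": 4},
}

def cluster_by_signature(cell_matrix):
    """Group cell barcodes by the set of genes they express."""
    clusters = defaultdict(list)
    for barcode, counts in cell_matrix.items():
        signature = tuple(sorted(gene for gene, n in counts.items() if n > 0))
        clusters[signature].append(barcode)
    return dict(clusters)

clusters = cluster_by_signature(cell_matrix)
for signature, barcodes in clusters.items():
    print(signature, barcodes)
```

Note that this ignores how strongly each gene is expressed, so "c1" and "c2" share a cluster even though their pink-to-blue ratios differ; deciding whether that matters is exactly the judgement call you make by eye in this step.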

12. Plot your cells, for real this time!

Make a copy of this file
Plot the cells using the ‘cell clusters’ you identified in Step 11. Similar cells should be plotted close together. Put a circle around each cell cluster.
- Example

13. Interpret the results

Answer these questions:
- Were there any cells you couldn’t classify?
- How many total cells did you find?
- How many cell types (clusters) are in your final map?
- How did you interpret the results?

14. Check the answer key here.

Single-cell RNA-seq and network analysis using Galaxy and Cytoscape
