
Introduction

This page describes the processes involved in running the eHive pipeline used for the analysis of Bisulfite-Seq (BS-Seq) data. It relies heavily on Bismark [1]: it aligns the bisulfite-treated reads to a preprocessed reference genome and identifies the methylation status of all cytosines.
This pipeline is capable of analysing either Whole-Genome Bisulfite Sequencing (WGBS) or Reduced Representation Bisulfite Sequencing (RRBS) data. Please be aware that the FASTQ files are not preprocessed (i.e. trimmed/quality filtered) in any way, so you will need to do this yourself. A very useful program for this is Trim Galore, which has a specific --rrbs option that trims and filters the reads taking into account the peculiarities of the RRBS experimental protocol.
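
For example, a minimal Trim Galore command for single-end RRBS reads (the input file name is illustrative) would be:

#trim adapters and low-quality ends, taking the RRBS protocol into account
trim_galore --rrbs input.fastq.gz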

Finally, note that this documentation describes the entire process involved in running the pipeline and is inspired by the Confluence document written by Susan Fairley describing the NGS alignment pipeline (alignment pipeline).

Environment

This pipeline relies on the Ensembl eHive distributed scheduling system (see here) and on the Reseqtrack codebase, so the first thing you will need to do is download the code from GitHub:

#First, cd into the folder where the code will be cloned 
cd ~/lib
#clone the eHive codebase
git clone https://github.com/Ensembl/ensembl-hive.git
#clone the Reseqtrack codebase
git clone https://github.com/EMBL-EBI-GCA/reseqtrack.git

Then, add this to your .bashrc:

#hive
export PATH=${PATH}:~/lib/ensembl-hive/scripts
export ENSEMBL_HIVE_DIR=~/lib/ensembl-hive
export PERL5LIB=$ENSEMBL_HIVE_DIR/modules/:${PERL5LIB}
#reseqtrack
export RESEQTRACK=~/lib/reseqtrack
export PERL5LIB=$RESEQTRACK/modules/:${PERL5LIB}
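
After sourcing your .bashrc, you can sanity-check the setup; a minimal check, assuming the standard module layout of the two codebases, would be:

#beekeeper.pl should now be on your PATH
which beekeeper.pl
#both codebases should be loadable via PERL5LIB
perl -e 'use Bio::EnsEMBL::Hive::Version; print "eHive OK\n";'
perl -e 'use ReseqTrack::DBSQL::DBAdaptor; print "ReseqTrack OK\n";'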

Set up a ReseqTrack database

This database will track all the files to be analysed by the eHive BS pipeline. It also tracks and categorises the different files generated by the pipeline and monitors completion for each of the samples to be analysed. Finally, it also stores some basic metrics on different aspects of the data (i.e. FASTQC analysis results, alignment metrics, number of duplicates, etc.).

In order to create this database enter the following:

mysql -h hostname -P port -u user -ppassword -e "create database wgbs_pipeline_name"
mysql -h hostname -P port -u user -ppassword wgbs_pipeline_name < $RESEQTRACK/sql/table.sql
mysql -h hostname -P port -u user -ppassword wgbs_pipeline_name < $RESEQTRACK/sql/views.sql

Where wgbs_pipeline_name is the name given to the database; it should be descriptive of the type of data you are working with.
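
You can verify that the schema was created correctly by listing the tables:

mysql -h hostname -P port -u user -ppassword wgbs_pipeline_name -e "show tables"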

Obtain the ENA accession for the data that you wish to analyse

At the moment, the pipeline assumes that the data you want to analyse is already archived in the European Nucleotide Archive (ENA [2]), so you should make sure you have the ENA accession for the data you will be working with. For example, see the following ENA Bisulfite-Seq study (SRP034862). It is worth taking time at this point to ensure that the data is the intended data and that you understand the files you should be getting, their relationship to samples, etc.

Load the meta data from ENA into the ReseqTrack database

A Reseqtrack script is used to access the ENA, extract metadata about the sequence data you want to align, and load that metadata into the Reseqtrack database. Note: you will need an account with the ENA to access the data (read this if you do not have one).

perl $RESEQTRACK/scripts/metadata/load_from_ena.pl -dbhost hostname -dbuser user -dbpass password -dbport port -dbname wgbs_pipeline_name -era_dbname dbname -era_dbuser erauser -era_dbpass password -new_study SRP****** -new_study SRP****** -load_new

Once the data has been loaded, you should do some basic checks. These can be done on the mysql command line, including:

select count(*) from sample;
select count(*) from run;
select count(*) from study;
# also look at the contents of these tables. Does the data seem accurate and as expected?
select distinct sample_source_id from sample;
select * from study;

Note: Check if the library_name column within the experiment table is NULL by using:

SELECT library_name FROM experiment;

If so, set it to 'unspecified' so that the pipeline does not crash, e.g.:
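
UPDATE experiment SET library_name = 'unspecified' WHERE library_name IS NULL;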

Fetch the FASTQ files from ENA

Once the metadata has been loaded successfully, it is time to fetch the sequence data from the ENA. This is done using a Hive pipeline. Please note that, in order to download the FASTQ files using this pipeline, you will need an active account with the ENA. Please read this if that is not the case.

The details of the Hive pipeline are specified in a config file. This is then loaded into the Reseqtrack database, from where the Hive pipeline will be created.

An example of a config file would be:

[GetProjectFastq]

table_name=run
config_module=ReseqTrack::Hive::PipeConfig::GetFastq_conf
config_options=-era_dbpass ***************
config_options=-era_dbuser erauser
config_options=-root_output_dir $SCRATCH/get_project_fastq
config_options=-fastq_output_dir $SCRATCH/project_fastq
config_options=-fastq_type FASTQ
config_options=-lsf_queue production
config_options=-era_dbname ERAPRO
config_options=-clobber 1
config_options=-get_fastq_module ReseqTrack::Tools::GetFastq::ENAFtp

Here, $SCRATCH points to the directory where the FASTQ files will be dumped.
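
If $SCRATCH is not already defined in your environment, set it (and create the directories) before loading the config; the path below is just an example:

#choose a scratch area with enough free space for the FASTQ files
export SCRATCH=/path/to/scratch
mkdir -p $SCRATCH/get_project_fastq $SCRATCH/project_fastq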

The file above is loaded into the Reseqtrack database using the load_pipeline_from_conf.pl script.

perl $RESEQTRACK/scripts/pipeline/load_pipeline_from_conf.pl -dbhost host -dbuser user -dbpass pass -dbport port -dbname wgbs_pipeline_name -file $USERWORK/config/get_fastq.config -read

You can check that the pipeline has loaded by looking in the pipeline table of the reseqtrack database.

Once the details of the Hive pipeline are present in the reseqtrack database, you can use the appropriate reseqtrack script to create the Hive database for the getFASTQ pipeline.
Note, there is no requirement for the Hive and Reseqtrack databases to be on the same host. The pipeline name should be as specified in the config file and database. $ENSEMBL_HIVE_DIR, mentioned in the command below, was specified in the environment section above. 

perl $RESEQTRACK/scripts/pipeline/init_hive_db.pl -dbhost host -dbuser user -dbpass pass -dbport port -dbname wgbs_pipeline_name -hive_host host -hive_user user -hive_pass pass -hive_port port -pipeline_name GetProjectFastq -ensembl_hive_dir $ENSEMBL_HIVE_DIR

The above command should create the Hive database as specified. Thereafter, the Hive database should have all the information required to run the Hive GetProjectFastq pipeline. The pipeline is started using another Reseqtrack script that will, in turn, start a beekeeper. This beekeeper is responsible for interfacing with the compute farm that runs the pipeline and controls the number of jobs submitted.

Also, please note that this script has MANY options to control how many jobs are submitted initially and its subsequent behaviour. Please look at the options and read the documentation. This version of the command will start the whole pipeline, and the -loop option will keep looping over the jobs until things can progress no further. Use of the -hive_log_dir option will record log output from the Hive; this is not the default behaviour of the system.

perl $RESEQTRACK/scripts/pipeline/run_pipeline.pl -dbhost host -dbuser user -dbpass pass -dbport port -dbname wgbs_pipeline_name -hive_user user -hive_pass pass -pipeline_name GetProjectFastq -hive_log_dir $SCRATCH/get_project_fastq/log -loop

Once the Hive pipeline is running, you can monitor progress by checking the progress table in the Hive database directly.

select * from progress;

At the end of this process, the files should have been downloaded successfully to the specified location.
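
You can also confirm that the downloaded files are now tracked in the Reseqtrack database; the file type below matches the -fastq_type value set in the config:

select count(*) from file where type='FASTQ';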

BS-Seq pipeline

This pipeline assumes that the FASTQ files have been correctly downloaded and are tracked in the Reseqtrack database (see the previous step to learn how the files are downloaded).
The pipeline will accept either gzipped or uncompressed FASTQs.

Pipeline overview

Here is the list of the most relevant commands that are executed when this pipeline is run:

Quality of FASTQ file

The pipeline will routinely run FASTQC on all FASTQ files. As previously mentioned, reads need to be trimmed or filtered before being passed to the pipeline. This is the command used to run FASTQC:

fastqc -q input.fastq.gz

Mapping

In order to align the bisulfite-treated reads to the genome, this pipeline uses Bismark, which is run as follows:

bismark -n 1 reference.fa input.fastq.gz --rg_tag --rg_id xxxx --rg_sample xxxx -o /path/to/output --multicore x

Where reference.fa is the reference Fasta file that was indexed using Bismark's own tool (see below); --multicore sets the number of parallel instances of Bismark to run concurrently; --rg_id sets the ID field in the @RG header line of the output BAM file and --rg_sample sets the SM field in the @RG header line. --rg_tag needs to be present in order to use the --rg_id and --rg_sample options.
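
To double-check that the read group information made it into the output, you can print the @RG header line of the resulting BAM (the file name is illustrative):

samtools view -H run.bam | grep '^@RG'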

Then we sort the generated BAM file using biobambam:

bamsort SO=coordinate I=run.bam O=run.sorted.bam index=0 indexfilename= fixmates=1

Merging (Optional)

For this step, the pipeline uses Biobambam's bammerge. The step is optional: it happens only when there is more than one run for a sample, and the merge produces a sample-level BAM file.

This merge is done as follows:

bammerge I=run1.sorted.bam I=run2.sorted.bam index=0 indexfilename= SO=coordinate  > sample.merge.bam 

And then we calculate some simple stats on the merged file by using samtools flagstat:

samtools flagstat sample.merge.bam > sample.merge.bam.flagstat

Duplicate marking

For this, we use Biobambam bammarkduplicates:

bammarkduplicates I=sample.merge.bam O=sample.merge.mrkdup.bam rmdup=0 index=0

Notice that at this point the duplicates present in the file are not removed, just marked (rmdup=0).
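
For a quick count of the reads that were flagged as duplicates (1024 is the SAM flag value for PCR or optical duplicates):

samtools view -c -f 1024 sample.merge.mrkdup.bam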

Again, we get some stats on sample.merge.mrkdup.bam by running samtools flagstat:

samtools flagstat sample.merge.mrkdup.bam > sample.merge.mrkdup.bam.flagstat

Remove Duplicates (Optional)

This step uses samtools view to discard the PCR duplicates. Please be aware that this step is skipped if you are analysing RRBS data (see here).

samtools view -b -o sample.merge.filtered.bam -F 1024 sample.merge.mrkdup.bam

Then, we run samtools flagstat on sample.merge.filtered.bam:

samtools flagstat sample.merge.filtered.bam > sample.merge.filtered.bam.flagstat

Identification of the methylation sites

For this, the pipeline uses bismark_methylation_extractor (see here). The command line used in this step is:

bismark_methylation_extractor -s --bedGraph --cutoff X --multicore X sample.merge.filtered.bam

Where --bedGraph is the option used to generate an output file in the bedGraph format, --cutoff is the minimum number of times a methylation state has to be seen at a given nucleotide before its methylation percentage is reported, and --multicore sets the number of cores used to run the program.
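
Once the extractor has finished, you can inspect the first lines of the bedGraph output; the exact file name depends on the input BAM, and recent Bismark versions gzip this file. The columns are chromosome, start, end and methylation percentage:

zcat sample.merge.filtered.bedGraph.gz | head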

Diagram of the pipeline

Below you can see a diagram including all the steps in the pipeline.

 

Prerequisites

Bismark prerequisites

Reference genome

In order to use Bismark, the genome needs to be preprocessed. For this, use Bismark's own tool:

bismark_genome_preparation <path_to_genome_folder>

Where path_to_genome_folder is the folder containing the Fasta file with the genome that needs to be processed. 

Once bismark_genome_preparation has finished, you will have a subfolder named Bisulfite_Genome/ containing the required files.
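
For reference, after the preparation step the genome folder should look roughly like this (the FASTA file name will vary):

ls <path_to_genome_folder>
#genome.fa  Bisulfite_Genome/
ls <path_to_genome_folder>/Bisulfite_Genome
#CT_conversion/  GA_conversion/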

The path you need to use for setting up the pipeline is the same <path_to_genome_folder> used with bismark_genome_preparation (see below).

Bowtie2

Bismark relies on Bowtie/Bowtie2 for aligning the reads to the genome, so you will need to download and install it (see here).

At the time of writing, the BS-Seq pipeline only supports Bowtie2 (and this will not change unless there is strong evidence showing that Bowtie outperforms Bowtie2; email me at ernesto@ebi.ac.uk if you want to discuss this).

Other programs

Additionally, this pipeline also requires the following programs (their paths are set in the pipeline config file shown below):

  • Biobambam2
  • FastQC
  • SAMtools

Setting up the pipeline

Once you have fetched the FASTQ files and the metadata for these files is in the Reseqtrack database (read the previous sections), you need to create the BS-sequencing Hive pipeline. For this, create a config file with the basic configuration options for the pipeline (see the example below):

[wgbs_pipeline]
table_name=sample
config_module=ReseqTrack::Hive::PipeConfig::WGBS_conf
config_options=-require_experiment_columns instrument_platform=ILLUMINA
config_options=-type_fastq FASTQ
config_options=-reference path_to_genome_folder
config_options=-root_output_dir $SCRATCH'/#sample_source_id#'
config_options=-final_output_dir $SCRATCH'/#sample_source_id#'
config_options=-run_bam_type WGBS_RUN_BAM
config_options=-unfilt_bam_type WGBS_UNFILT_BAM
config_options=-unfilt_flagstat_type WGBS_UNFILT_FLAGSTAT
config_options=-mapper_report_type WGBS_MAPPER_REPORT
config_options=-dedup_bam_type WGBS_DEDUP_BAM
config_options=-dedup_flagstat_type WGBS_DEDUP_FLAGSTAT
config_options=-fastqc_summary_type WGBS_RUN_FASTQC_SUMMARY
config_options=-fastqc_report_type WGBS_RUN_FASTQC_REPORT
config_options=-fastqc_zip_type WGBS_RUN_FASTQC_ZIP
config_options=-mbiastxt_type WGBS_MBIAS_TXT
config_options=-mbiaspng_type WGBS_MBIAS_PNG
config_options=-bedgraph_type WGBS_BEDGRAPH
config_options=-chhcontext_type WGBS_CHHCONTEXT
config_options=-cpgcontext_type WGBS_CPGCONTEXT
config_options=-chgcontext_type WGBS_CHGCONTEXT
config_options=-splitting_type WGBS_SPLITTING
config_options=-biobambam_dir /path/to/bin/biobambam2-2.0.10/bin/
config_options=-fastqc_exe /path/to/bin/FastQC/fastqc
config_options=-bismark_exe /path/to/bin/bismark_v0.15.0/bismark
config_options=-bismark_methcall_exe /path/to/bin/bismark_v0.15.0/bismark_methylation_extractor
config_options=-samtools_exe /path/to/bin/samtools-1.3/samtools
config_options=-lsf_queue production-rh6
config_options=-multicore 2
config_options=-cutoff 5
config_options=-filter_duplicates 1

Where the value in the line: 

config_options=-reference path_to_genome_folder

is the path used when running the bismark_genome_preparation tool in the previous step.

The config file is loaded into the Reseqtrack database using the load_pipeline_from_conf.pl script.

perl $RESEQTRACK/scripts/pipeline/load_pipeline_from_conf.pl -dbhost host -dbuser user -dbpass pass -dbport port -dbname wgbs_pipeline_name -file $USERWORK/config/wgbs.config -read

You can check that the pipeline configuration options have been loaded by looking in the pipeline table of the reseqtrack database.

After this, you can use the appropriate reseqtrack script to create the Hive database for the BS-seq pipeline. Note, there is no requirement for the Hive and Reseqtrack databases to be on the same host. The pipeline name should be as specified in the config file and database.
$ENSEMBL_HIVE_DIR, mentioned in the command below, was specified in the environment section above. 

perl $RESEQTRACK/scripts/pipeline/init_hive_db.pl -dbhost host -dbuser user -dbpass pass -dbport port -dbname wgbs_pipeline_name -hive_host host -hive_user user -hive_pass pass -hive_port port -pipeline_name wgbs_pipeline -ensembl_hive_dir $ENSEMBL_HIVE_DIR

Now the pipeline is ready to be run; to start it, enter the following:

perl $RESEQTRACK/scripts/pipeline/run_pipeline.pl -dbhost host -dbuser user -dbpass pass -dbport port -dbname wgbs_pipeline_name -hive_user user -hive_pass pass -pipeline_name wgbs_pipeline -hive_log_dir $SCRATCH/wgbs_project_pipeline/log -loop

Checking the progress of the pipeline

Once the Hive pipeline is running, you can monitor progress through Hive scripts or by checking the progress table in the Hive database directly.

SELECT * FROM progress;

Check the status column to monitor how the different analyses (i.e. pipeline analysis steps) are progressing. If the status is FAILED, there was some kind of issue that you will need to investigate further.

Clues on the failure reason can be obtained by checking the message generated by the worker:

SELECT * FROM msg WHERE analysis_id=(id of the analysis that has failed);

The analysis_id can be obtained by doing:

SELECT analysis_id FROM analysis_base WHERE logic_name=(name of the analysis that has failed); 
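
Since the msg table carries the analysis_id, the two lookups above can also be combined into a single query:

SELECT a.logic_name, m.msg FROM msg m JOIN analysis_base a ON m.analysis_id = a.analysis_id;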

Output files

This pipeline uses the Reseqtrack MySQL database to track the location and metadata details of the different output files. This tracking database was created in a previous step (see here). 

I will list below the most relevant output files generated by the pipeline and how to query the Reseqtrack database to get information on their location. 

The first thing, of course, is to connect to the Reseqtrack database:

mysql -h hostname -P port -u user -ppassword wgbs_pipeline_name

FASTQC

Use this query to find the location of all the files generated by FASTQC:

SELECT name FROM file WHERE type LIKE 'WGBS_RUN_FASTQC_%';

Also, the information present in the summary file generated by FASTQC is stored directly in the database. To access this data, use the following query:

SELECT file.name,attribute.attribute_name,attribute.attribute_value FROM attribute INNER JOIN file ON attribute.other_id=file.file_id where attribute.attribute_name like 'FASTQC%';

Bismark's run-level BAM

This BAM file is generated by Bismark and there will be exactly one per run in the database. To find their locations use:

SELECT name FROM file WHERE type = 'WGBS_RUN_BAM';

Also, a report file is generated by Bismark containing some useful stats on the alignment process and on the methylation status of the cytosines analysed. This file is stored by the pipeline and can be found using:

SELECT name FROM file WHERE type = 'WGBS_MAPPER_REPORT';

The information contained in this report is also stored in the database and can be found by using:

SELECT collection.name,attribute.attribute_name,attribute.attribute_value FROM attribute INNER JOIN collection ON attribute.other_id=collection.collection_id where collection.type='WGBS_RUN_BAM';

Sample-level BAM

This BAM file is the result of merging the different run-level BAMs into a single sample-level BAM. This merge is performed by Biobambam's bammerge.

In order to get the path to this Sample-level BAM you need to enter:

SELECT name FROM file WHERE type='WGBS_UNFILT_BAM';

Also, SAMtools flagstat is run over this BAM in order to get some stats on the file. This flagstat file can be accessed using:

SELECT name FROM file WHERE type='WGBS_UNFILT_FLAGSTAT';

Number of duplicates

This pipeline uses Biobambam bammarkduplicates to mark the duplicates that are present in the sample-level BAM generated in the previous step.

In order to get the duplication rate of this file, the pipeline will run SAMtools flagstat on it. The location of the flagstat output can be found using:

SELECT name FROM file WHERE type='WGBS_DUP_FLAGSTAT';

Also, this output is stored in the 'attribute' table and can be accessed by directly querying the database using:

SELECT collection.name,attribute.attribute_name,attribute.attribute_value FROM attribute INNER JOIN collection ON attribute.other_id=collection.collection_id where collection.type='WGBS_UNFILT_BAM';

Final BAM file

This BAM file is ready to be used by the methylation caller (bismark_methylation_extractor); it will or will not contain the read duplicates depending on whether the duplicate removal step was run.

To get the location of this file use:

SELECT name FROM file WHERE type='WGBS_DEDUP_BAM';

Again, in order to have some stats on this BAM file, the pipeline will run SAMtools flagstat on it.

In order to access this flagstat file you can use:

SELECT name FROM file WHERE type='WGBS_DEDUP_FLAGSTAT';

Again, this output is stored in the 'attribute' database table and can be accessed by directly querying the database using:

SELECT collection.name,attribute.attribute_name,attribute.attribute_value FROM attribute INNER JOIN collection ON attribute.other_id=collection.collection_id where collection.type='WGBS_DEDUP_BAM';

Accessing the Methylation calls

In order to access the output files generated by bismark_methylation_extractor, use the following SQL queries to find the location of each file:

  • Final Cytosine Methylation Report

bismark_methylation_extractor generates a report with the number of methylated Cs in the different contexts. To access this file you can use:

SELECT name FROM file WHERE type='WGBS_SPLITTING';
  • Methylation output in the bedgraph format
SELECT name FROM file WHERE type='WGBS_BEDGRAPH';
  • Methylation calls in the CpG context
SELECT name FROM file WHERE type='WGBS_CPGCONTEXT';
  • Methylation calls in the CHG context
SELECT name FROM file WHERE type='WGBS_CHGCONTEXT';
  • Methylation calls in the CHH context
SELECT name FROM file WHERE type='WGBS_CHHCONTEXT';
  • M-bias plot

This plot is generated by bismark_methylation_extractor and shows the methylation proportion at each possible position in the read. The extractor generates the plot in .png format and also writes the data used to draw the plot to a text file. To access these files use:

SELECT name FROM file WHERE type='WGBS_MBIAS_PNG';
SELECT name FROM file WHERE type='WGBS_MBIAS_TXT';

Comments on this pipeline

Bismark alignment

This process aligns the bisulfite-converted reads to the reference. It is quite resource-hungry and also really slow. To speed the alignment up, it is advisable to use Bismark's multicore option in the configuration file, which sets the number of parallel instances of Bismark to run concurrently. To modify this value, open $USERWORK/config/wgbs.config and look for the line:

config_options=-multicore 2

and increase the multicore value. Please be aware that by increasing this value you will also increase the amount of resources required. For example, the Bismark authors suggest that:

setting --multicore 4 for e.g. the GRCm38 mouse genome will probably use ~20 cores and eat ~40GB of RAM

Increasing this value will also speed up bismark_methylation_extractor, which is the program used by the pipeline to assess the methylation status of each site.

Reduced Representation Bisulfite Sequencing (RRBS) data

In its default mode, this pipeline will first mark the PCR duplicates and then remove them. Because RRBS data is generated by digesting the genomic DNA with MspI (see explanation here), there will be a disproportionate number of reads starting at the same position that are not PCR duplicates but 'real' reads. So, in order not to discard all these reads, you will need to edit $USERWORK/config/wgbs.config and set the filter_duplicates value to 0:

config_options=-filter_duplicates 0

And modify the following line:

config_options=-duplicate_flag_value 0

so that duplicate_flag_value is left undefined.

By modifying the config file in this way, the duplicate reads will be marked but not removed from the BAM file.

Defining the samples to be analysed

To start the pipeline with a particular set of samples, modify the configuration file ($USERWORK/config/wgbs.config) by adding:

config_options=-require_sample_columns sample_source_id='#expr(["SRS523504","SRS523505"])expr#'

Where SRS523504 and SRS523505 are the IDs of the samples to be analysed.

After saving the changes in $USERWORK/config/wgbs.config, you will need to run load_pipeline_from_conf.pl, init_hive_db.pl and run_pipeline.pl as explained here.

Downloading metadata for a particular ENA study

This step is only relevant if you do not have an ENA account to programmatically download all the metadata for a particular study. The process described below replaces the one described here and explains how to populate the ReseqTrack database with the required study metadata.

For illustrative purposes, I will use the ENA study with id SRP034862 as an example of how to download the metadata.

First, you need to create 4 different tsv files containing the details of the study and of the different samples, experiments and runs comprising it. Examples of the different types of files can be found here (study.tsv, sample.tsv, experiment.tsv, run.tsv).

Then, you need to run a ReseqTrack script to load the metadata from the different files into the database:

perl $RESEQTRACK/scripts/metadata/load_from_file.pl -dbhost hostname -dbport port -dbuser user -dbname wgbs_pipeline_name -dbpass password -load_new -studies study.tsv -experiments experiment.tsv -samples sample.tsv -runs run.tsv

Where $RESEQTRACK points to the location of the Reseqtrack codebase that was downloaded using git, and the database connection parameters are the ones used to create the Reseqtrack database, as explained here.

After running this script you should check that all the samples, experiments and runs correspond to what you were expecting by doing:

select count(*) from sample;
select count(*) from run;
select count(*) from study;
# also look at the contents of these tables. Does the data seem accurate and as expected?
select distinct sample_source_id from sample;
select * from study;

Populating the Reseqtrack database with the metadata for the FASTQ files to be analysed

This step is only relevant if you do not have an ENA account to download the FASTQ files for a particular study using the pipeline described here. Below you will find how to populate the Reseqtrack database with the metadata details for the FASTQ files from a particular study.

For illustrative purposes, I will use the ENA study with id SRP034862 as an example of how to do this:

First, you need to download the FASTQ files for each of the runs in the study (see here).
Then, you need to write a tsv file with 2 columns (see an example here). The first column contains the run_source_ids and the second column the file paths for each of the run FASTQs. These run_source_ids should already be present in the database (see the previous section to learn how to populate the database with this information).
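
A minimal example of such a file (the run accessions and paths are purely illustrative; the two columns are tab-separated):

SRR1234567	/path/to/fastq/SRR1234567.fastq.gz
SRR1234568	/path/to/fastq/SRR1234568.fastq.gz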

Then, you need to use the following command line to upload the FASTQ metadata.

perl $RESEQTRACK/scripts/file/load_files4Run.pl -list_file file.paths.txt -dbhost hostname -dbuser user -dbpass password -dbport port -dbname wgbs_pipeline_name -type FASTQ -host XXXX -run

Where $RESEQTRACK points to the location of the Reseqtrack codebase that was downloaded using git, and the database connection parameters are the ones used to create the Reseqtrack database, as explained here.
file.paths.txt is the file containing the paths to the FASTQs, as described above.

Questions

For any queries, contact me at ernesto@ebi.ac.uk.

 

 

 

 


 
