This page describes the processes involved in running the eHive pipeline used to analyse Bisulfite-Seq (BS-Seq) data. The pipeline relies heavily on Bismark: it aligns the bisulfite-treated reads to a preprocessed reference genome and identifies the methylation status of all cytosines.
This pipeline can analyse either Whole-Genome Bisulfite Sequencing (WGBS) or Reduced Representation Bisulfite Sequencing (RRBS) data. Please be aware that the FASTQ files are not preprocessed (i.e. trimmed or quality filtered) in any way, so you will need to do this yourself. A very useful program for this is Trim Galore, which has a specific --rrbs option that trims and filters the reads taking into account the peculiarities of the RRBS experimental protocol.
Finally, note that this documentation describes the entire process involved in running the pipeline and is inspired by the Confluence document written by Susan Fairley describing the NGS alignment pipeline (alignment pipeline).
This pipeline relies on the Ensembl eHive distributed scheduling system (see here) and on the Reseqtrack codebase, so the first thing you will need to do is download the code from GitHub:
Then, add this to your
Set up a ReseqTrack database
This database tracks all the files that will be analysed by the eHive BS pipeline. It also tracks and categorises the different files generated by the pipeline and monitors completion for each of the samples to be analysed. Finally, it stores some basic metrics on different aspects of the data (i.e. FASTQC analysis results, alignment metrics, number of duplicates, etc.).
In order to create this database enter the following:
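The exact commands depend on your MySQL setup; a minimal sketch, in which the connection details are placeholders and the schema file name table.sql is an assumption (check the sql/ directory of your ReseqTrack checkout):

```shell
# Create the empty tracking database; connection details are placeholders
mysql -h host -u user -ppass -P port -e "CREATE DATABASE wgbs_pipeline_name"

# Load the ReseqTrack table definitions; the schema file name is an
# assumption - check the sql/ directory of your ReseqTrack checkout
mysql -h host -u user -ppass -P port wgbs_pipeline_name < $RESEQTRACK/sql/table.sql
```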
wgbs_pipeline_name is the name given to the database and should be descriptive of the type of data you are working with.
Obtain the ENA accession for the data that you wish to analyse
At the moment, the pipeline assumes that the data you want to analyse is already archived in the European Nucleotide Archive (ENA), so you should make sure you have the ENA accession for the data you will be working with. For example, see the following ENA Bisulfite-Seq study (SRP034862). It is worth taking time at this point to ensure that the data is the intended data and that you understand the files you should be getting, their relationship to samples, etc.
Load the metadata from ENA into the ReseqTrack database
A Reseqtrack script is used to access the ENA, extract metadata about the sequence data you want to align and load that metadata into the Reseqtrack database. Note: you will need an ENA account to access the data (read this if you do not have one).
Once the data has been loaded, you should do some basic checks. These can be done on the mysql command line, including:
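For example, a sketch of some sanity checks; the table names run, sample and experiment are taken from the ReseqTrack schema and the connection details are placeholders:

```shell
# Count the loaded runs, samples and experiments and check the
# totals against what you expect from the ENA study
mysql -h host -u user -ppass -P port wgbs_pipeline_name -e "
  SELECT COUNT(*) AS n_runs        FROM run;
  SELECT COUNT(*) AS n_samples     FROM sample;
  SELECT COUNT(*) AS n_experiments FROM experiment;"
```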
Note: Check whether the library_name column within the experiment table is NULL by using:
If so, set it to unspecified so the pipeline does not crash.
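A sketch of the check and the fix; the connection details are placeholders and the experiment_id column name is an assumption from the ReseqTrack schema:

```shell
# Check for NULL library_name values in the experiment table
mysql -h host -u user -ppass -P port wgbs_pipeline_name -e \
  "SELECT experiment_id FROM experiment WHERE library_name IS NULL;"

# If any rows come back, set them to 'unspecified'
mysql -h host -u user -ppass -P port wgbs_pipeline_name -e \
  "UPDATE experiment SET library_name = 'unspecified' WHERE library_name IS NULL;"
```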
Fetch the FASTQ files from ENA
Once the metadata has been loaded successfully, it is time to fetch the sequence data from the ENA. This is done using a Hive pipeline. Please note that you will need an active ENA account to download the FASTQ files with this pipeline; please read this if that is not the case.
The details of the Hive pipeline are specified in a config file. This is then loaded into the Reseqtrack database, from where the Hive pipeline will be created.
An example config file would be:
$SCRATCH points to the directory where the FASTQ files will be dumped.
The file above is loaded into the Reseqtrack database using the load_pipeline_from_conf.pl script.
perl $RESEQTRACK/scripts/pipeline/load_pipeline_from_conf.pl -dbhost host -dbuser user -dbpass pass -dbport port -dbname wgbs_pipeline_name -file $USERWORK/config/get_fastq.config -read
You can check that the pipeline has loaded by looking in the pipeline table of the Reseqtrack database.
Once the details of the Hive pipeline are present in the Reseqtrack database, you can use the appropriate Reseqtrack script to create the Hive database for the GetProjectFastq pipeline.
Note, there is no requirement for the Hive and Reseqtrack databases to be on the same host. The pipeline name should be as specified in the config file and database. $ENSEMBL_HIVE_DIR, mentioned in the command below, was specified in the environment section above.
perl $RESEQTRACK/scripts/pipeline/init_hive_db.pl -dbhost host -dbuser user -dbpass pass -dbport port -dbname wgbs_pipeline_name -hive_host host -hive_user user -hive_pass pass -hive_port port -pipeline_name GetProjectFastq -ensembl_hive_dir $ENSEMBL_HIVE_DIR
The above command should create the Hive database as specified. Thereafter, the Hive database should have all the information required to run the Hive GetProjectFastq pipeline. The pipeline is started using another Reseqtrack script that will, in turn, start a beekeeper. The beekeeper is responsible for interfacing with the compute farm that runs the pipeline and controls the number of jobs submitted.
Also, please note that this script has MANY options to control how many jobs are submitted initially and the later behaviour of the script; please look at the options and read the documentation. This version of the command starts the whole pipeline, and the -loop option keeps looping over the jobs until things can progress no further. The -hive_log_dir option records log output from the Hive; this is not the default behaviour of the system.
perl $RESEQTRACK/scripts/pipeline/run_pipeline.pl -dbhost host -dbuser user -dbpass pass -dbport port -dbname user_project_db -hive_user user -hive_pass pass -pipeline_name GetProjectFastq -hive_log_dir $SCRATCH/get_project_fastq/log -loop
Once the Hive pipeline is running, you can monitor progress by checking the progress table in the Hive database directly.
SELECT * FROM progress;
At the end of this process, the files should have been downloaded successfully to the specified location.
This pipeline assumes that the FASTQ files have been correctly downloaded and are tracked in the Reseqtrack database (see the previous step to learn how the files are downloaded).
This pipeline will accept either gzipped or uncompressed FASTQs.
Here is the list of the most relevant commands that are executed when this pipeline is run:
Quality of the FASTQ files
The pipeline routinely runs FASTQC on all FASTQ files. As previously mentioned, reads need to be trimmed and filtered before being given to the pipeline. This is the command used to run FASTQC:
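The pipeline's exact invocation may differ; a hedged sketch of a typical FASTQC call, where the output directory and input file name are placeholders:

```shell
# Run FASTQC on one FASTQ file, keeping only the zipped report
fastqc --noextract --outdir $SCRATCH/fastqc run.fastq.gz
```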
In order to align the bisulfite-treated reads to the genome, this pipeline uses Bismark, which is run as follows:
reference.fa is the reference Fasta file that was indexed using Bismark's own tool (see below);
--multicore value is used to set the number of parallel instances of Bismark to be run concurrently;
--rg_id is used to set the ID field in the @RG header line of the output BAM file and
--rg_sample is used to set the SM field in the @RG header line.
--rg_tag needs to be present in order to use the other two options.
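Putting the options above together, a sketch of a paired-end Bismark invocation; the file names, output directory, read-group values and the multicore value are placeholders, and --bowtie2 selects the only aligner the pipeline supports:

```shell
# Align bisulfite-treated paired-end reads with Bismark/Bowtie2,
# tagging the output BAM with read-group information
bismark --bowtie2 --multicore 2 \
        --rg_tag --rg_id run_id --rg_sample sample_name \
        --output_dir $SCRATCH/bismark \
        path_to_genome_folder \
        -1 run_1.fastq.gz -2 run_2.fastq.gz
```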
Then we sort the generated BAM file using
For this step the pipeline uses Biobambam bammerge. This step is optional: it happens only when there is more than one run for a sample, and the merge produces a sample-level BAM file.
This merge is done as follows:
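A sketch of the merge; the file names are placeholders and the exact invocation may differ between biobambam versions:

```shell
# Merge the run-level BAMs into a single sample-level BAM
bammerge run1.bam run2.bam > sample.merge.bam
```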
And then we calculate some simple stats on the merged file using samtools flagstat.
For this, we use Biobambam bammarkduplicates. Notice that at this point the duplicates present in the file are not removed, just marked.
Again, we get some stats on sample.merge.mrkdup.bam by running:
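The duplicate-marking and flagstat steps can be sketched as follows; the metrics and flagstat output file names are placeholders:

```shell
# Mark - but do not remove - duplicate reads with biobambam
bammarkduplicates I=sample.merge.bam O=sample.merge.mrkdup.bam \
                  M=sample.merge.mrkdup.metrics

# Collect basic alignment stats on the duplicate-marked BAM
samtools flagstat sample.merge.mrkdup.bam > sample.merge.mrkdup.flagstat
```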
Remove Duplicates (Optional)
This step uses samtools to filter out and discard the PCR duplicates. Please be aware that this step is skipped if you are analysing RRBS data (see here).
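A sketch of the duplicate-removal step, here using samtools view to drop reads carrying the duplicate flag; the file names are placeholders:

```shell
# Discard reads marked as duplicates (BAM flag 0x400 = 1024)
samtools view -b -F 1024 sample.merge.mrkdup.bam > sample.merge.rmdup.bam
```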
Then, we run samtools flagstat on the resulting BAM file.
Identification of the methylation sites
For this, the pipeline uses bismark_methylation_extractor (see here). The command line used in this step is:
--bedGraph is the option used to generate an output file in the bedGraph format,
--cutoff is the minimum number of times a methylation state has to be seen for that nucleotide before its methylation percentage is reported, and
--multicore is used to set the number of cores to be used for running this program.
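Putting the options above together, a sketch of the extractor call for a paired-end BAM; the input file name, cutoff and core count are placeholders:

```shell
# Extract per-cytosine methylation calls plus a bedGraph summary
# (-p tells the extractor the BAM contains paired-end alignments)
bismark_methylation_extractor -p --comprehensive \
    --bedGraph --cutoff 1 --multicore 2 \
    sample.final.bam
```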
Diagram of the pipeline
Below you can see a diagram including all the steps in the pipeline.
In order to use Bismark, the genome needs to be preprocessed. For this, use Bismark's own tool:
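A sketch of the preparation step; --bowtie2 matches the aligner the pipeline uses, and the genome path is a placeholder:

```shell
# Build the bisulfite-converted Bowtie2 indexes for the genome
bismark_genome_preparation --bowtie2 path_to_genome_folder
```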
path_to_genome_folder is the folder containing the Fasta file with the genome that needs to be processed.
Once bismark_genome_preparation has finished, you will have a subfolder named Bisulfite_Genome/ containing the required files.
The path you need to use when setting up the pipeline is the same <path_to_genome_folder> used with bismark_genome_preparation (see below).
Bismark relies on Bowtie/Bowtie2 to align the reads to the genome, so you will need to download and install it (see here).
At the time of writing, the BS-Seq pipeline only supports Bowtie2 (and this will not change unless there is strong evidence showing that Bowtie outperforms Bowtie2; email me at email@example.com if you want to discuss this).
Additionally, this pipeline also requires the following programs:
Setting up the pipeline
Once you have fetched the FASTQ files and the metadata for these files is already in the Reseqtrack database (read the previous sections), you have to create the BS-sequencing Hive pipeline. For this, create a config file with the basic configuration options for the pipeline (see an example below):
Where the value in the line:
is the path used when running the bismark_genome_preparation tool in the previous step.
The config file is loaded into the Reseqtrack database using the load_pipeline_from_conf.pl script.
perl $RESEQTRACK/scripts/pipeline/load_pipeline_from_conf.pl -dbhost host -dbuser user -dbpass pass -dbport port -dbname wgbs_pipeline_name -file $USERWORK/config/wgbs.config -read
You can check that the pipeline configuration options have been loaded by looking in the pipeline table of the Reseqtrack database.
After this, you can use the appropriate reseqtrack script to create the Hive database for the BS-seq pipeline. Note, there is no requirement for the Hive and Reseqtrack databases to be on the same host. The pipeline name should be as specified in the config file and database.
$ENSEMBL_HIVE_DIR, mentioned in the command below, was specified in the environment section above.
perl $RESEQTRACK/scripts/pipeline/init_hive_db.pl -dbhost host -dbuser user -dbpass pass -dbport port -dbname wgbs_pipeline_name -hive_host host -hive_user user -hive_pass pass -hive_port port -pipeline_name wgbs_pipeline -ensembl_hive_dir $ENSEMBL_HIVE_DIR
Now the pipeline is ready to run; for this, enter the following:
perl $RESEQTRACK/scripts/pipeline/run_pipeline.pl -dbhost host -dbuser user -dbpass pass -dbport port -dbname wgbs_pipeline_name -hive_user user -hive_pass pass -pipeline_name wgbs_pipeline -hive_log_dir $SCRATCH/wgbs_project_pipeline/log -loop
Checking the progress of the pipeline
Once the Hive pipeline is running, you can monitor progress through Hive scripts or by checking the progress table in the Hive database directly.
SELECT * FROM progress;
Check the status column to monitor how the different analyses (i.e. pipeline analysis steps) are progressing. If the status is FAILED, there was some kind of issue that you will need to investigate further.
Clues on the failure reason can be obtained by checking the message generated by the worker:
The analysis_id can be obtained by doing:
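A sketch of both queries against the Hive database; the analysis_base table and the msg view are from the eHive schema, and the connection details, database name and analysis_id value are placeholders:

```shell
# Map each pipeline step (logic_name) to its analysis_id
mysql -h hive_host -u user -ppass -P port hive_db -e \
  "SELECT analysis_id, logic_name FROM analysis_base;"

# Show the worker messages recorded for one analysis
mysql -h hive_host -u user -ppass -P port hive_db -e \
  "SELECT * FROM msg WHERE analysis_id = 10;"
```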
This pipeline uses the Reseqtrack MySQL database to track the location and metadata details of the different output files. This tracking database was created in a previous step (see here).
I will list below the most relevant output files generated by the pipeline and how to query the Reseqtrack database to get information on their location.
The first thing, of course, is to connect to the Reseqtrack database:
Use this query to find the location of all files generated by FASTQC:
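A sketch of such a query; the file table is part of the ReseqTrack schema, but the type pattern is an assumption (check the distinct type values in your own database):

```shell
# List the locations of all FASTQC output files tracked in the database
mysql -h host -u user -ppass -P port wgbs_pipeline_name -e \
  "SELECT name FROM file WHERE type LIKE '%FASTQC%';"
```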
Also, the information in the summary file generated by FASTQC is stored directly in the database. To access this data, use the following query:
Bismark's run-level BAM
This BAM file is generated by Bismark and there will be exactly one for each run in the database. To find their location use:
Also, a report file is generated by Bismark containing some useful stats on the alignment process and on the methylation status of the cytosines analysed. This file is stored by the pipeline and can be found using:
The information contained in this report is also stored in the database and can be found by using:
This BAM file is the result of merging the different run-level BAMs into a single sample-level BAM. This merge is performed by Biobambam bammerge.
In order to get the path to this Sample-level BAM you need to enter:
SAMtools flagstat is run over this BAM in order to get some stats on the file. This flagstat file can be accessed using:
Number of duplicates
This pipeline uses Biobambam bammarkduplicates to mark the duplicates present in the sample-level BAM generated in the previous step.
In order to get the duplication rate of this file, the pipeline runs SAMtools flagstat on it. The location of the flagstat output can be found using:
Also, this output is stored in the 'attribute' table and can be accessed by directly querying the database using:
Final BAM file
This BAM file is ready to use by the methylation caller (bismark_methylation_extractor) and it will or will not contain the read duplicates depending on whether the duplicate removal step is run.
To get the location of this file use:
Again, in order to get some stats on this BAM file, the pipeline runs SAMtools flagstat on it.
In order to access this flagstat file you can use:
Again, this output is stored in the 'attribute' database table and can be accessed by directly querying the database using:
Accessing the Methylation calls
To access the output files generated by bismark_methylation_extractor, use the following SQL queries to find the location of the files:
- Final Cytosine Methylation Report
bismark_methylation_extractor generates a report with the different number of methylated Cs in different contexts. In order to access this file you can use:
- Methylation output in the bedgraph format
- Methylation calls in the CpG context
- Methylation calls in the CHG context
- Methylation calls in the CHH context
- M-bias plot
This plot is generated by bismark_methylation_extractor and shows the methylation proportion at each position along the read. The extractor generates the plot in .png format and also writes the underlying data to a text file. To access these files use:
Comments on this pipeline
This process aligns the bisulfite-converted reads to the reference. It is quite resource-hungry and slow. To speed the alignment up, it is advisable to use Bismark's multicore option in the configuration file, which sets the number of parallel instances of Bismark to run concurrently. To modify this value, open $USERWORK/config/wgbs.config and look for the line:
and increase the multicore value. Please be aware that by increasing this value, you will also increase the amount of resources required. For example, Bismark authors suggest that:
setting --multicore 4 for e.g. the GRCm38 mouse genome will probably use ~20 cores and eat ~40GB of RAM
Increasing this value will also speed up bismark_methylation_extractor, which is the program used by the pipeline to assess the methylation status of each site.
Reduced Representation Bisulfite Sequencing (RRBS) data
In its default mode, this pipeline first marks the PCR duplicates and then removes them. Because RRBS data is generated by digesting the genomic DNA with MspI (see explanation here), there will be a disproportionate number of reads starting at the same position that are not PCR duplicates but 'real' reads. So, in order not to discard all these reads, you will need to edit $USERWORK/config/wgbs.config and set the filter_duplicates value to 0:
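For example, the relevant line would end up looking something like this; the key=value form is an assumption, so match whatever format the rest of your wgbs.config uses:

```
filter_duplicates=0
```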
Also, check the following line and make sure that duplicate_flag_value is undefined.
By modifying the config file in this way, the duplicate reads will be marked but not removed from the BAM file.
Defining the samples to be analysed
To start the pipeline with a particular set of samples, modify the configuration file ($USERWORK/config/wgbs.config) by adding:
Where SRS523504,SRS523505 are the ids of the samples to be analysed.
Downloading metadata for a particular ENA study
This step is only relevant if you do not have an ENA account to programmatically download all the metadata for a particular study. The process described here replaces what is described here and explains how to populate the ReseqTrack database with the required study metadata.
For illustrative purposes, I will use the ENA study with id SRP034862 as an example of how to download the metadata.
First, you need to create 4 different tsv files containing the details of the study and the different samples, experiments and runs comprising it. Examples of the different types of files can be found at (study.tsv, sample.tsv, experiment.tsv, run.tsv).
Then you need to run a ReseqTrack script to load the metadata from the different files into the database:
$RESEQTRACK points to the location of the Reseqtrack codebase that was downloaded using git, and the database connection parameters are the ones used to create the Reseqtrack database, as explained here.
After running this script you should check that all the samples, experiments and runs correspond to what you were expecting by doing:
Populating the Reseqtrack database with the metadata for the FASTQ files to be analysed
This step is only relevant if you do not have an ENA account to download the FASTQ files for a particular study using the pipeline described here. Below you will find how to populate the Reseqtrack database with the metadata details for the FASTQ files from a particular study.
For illustrative purposes, I will use the ENA study with id SRP034862 as an example on how to do this:
First, you need to download the FASTQ files for each of the runs in the study (see here).
Then, you need to write a tsv file with 2 columns (see an example here). The first column contains the run_source_ids and the second column the file paths for each run's FASTQ. These run_source_ids should already be present in the database (see the previous section to learn how to populate the database with this information).
Then, you need to use the following command line to upload the FASTQ metadata.
$RESEQTRACK points to the location of the Reseqtrack codebase that was downloaded using git, and the database connection parameters are the ones used to create the Reseqtrack database, as explained here.
file.paths.txt is the file containing the file paths to the FASTQs that is linked above.
For any particular query contact me at firstname.lastname@example.org