================================================================================
PeakAnnotator Overview
================================================================================

////////////////////////////////////////////////////////////////////////////////
INSTALLATION FROM SOURCE
////////////////////////////////////////////////////////////////////////////////

Go to the PeakAnnotator.src folder.
Compile the source files using

    g++ -o PeakAnnotator *.cpp

An executable file named PeakAnnotator will be generated.

////////////////////////////////////////////////////////////////////////////////
USAGE
////////////////////////////////////////////////////////////////////////////////

To launch the program, open a terminal window, go to the folder where the
PeakAnnotator executable file is located, and type:

    ./PeakAnnotator

in order to get the three utilities of the program:

    >PeakAnnotator
    NDG    for each peak finds its closest downstream gene on both strands
    TSS    for each peak finds the distance to its closest TSS
    ODS    finds overlaps between two position files

Type PeakAnnotator to get help about the options specific to each utility.

*** utility:
This can be one of "NDG, TSS, ODS"

1. NDG - For each locus, search for its Nearest Downstream Genes on both the
   forward and reverse strands. If the position of the locus is within a gene,
   the program describes in which part of that gene the locus is located.
2. TSS - For each locus, find its closest TSS (transcription start site). In
   order to do this, the program searches both upstream and downstream for the
   closest genes to the genomic coordinate.
3. ODS - Compare two position files to identify overlapping and unique genomic
   locations. Uses random regions matched for chromosome and length to
   calculate an enrichment over random and a p-value.

*** Peak File
The file lists the genomic coordinates output by a peak calling program (or
obtained in some other way).
The format should be tab/space delimited, where each locus is described by its
"chromosome", "start" and "end" location.

PLEASE REMOVE ANY HEADER LINES FROM THE FILE IF THESE ARE PRESENT
THIS FILE SHOULD BE SORTED ACCORDING TO CHROMOSOME AND START LOCATION

*** Annotation File
The file lists the features/genes of interest and their location in the genome.
This file should be in BED format, which can be obtained using the UCSC table
browser. The BED format is defined in
"http://genome.ucsc.edu/FAQ/FAQformat#format1".
The annotation file requires three fields: chromosome, start and end locations.
However, if the features of interest are genes, it is highly recommended that
the annotation file include the nine additional optional BED fields. These can
be output by the UCSC table browser by selecting
"BED - browser extensible data".

Requirements for BED file format - NDG utility:
The following fields (columns) are recommended for the NDG utility:
chrom, chromStart, chromEnd, name, strand, thickStart, thickEnd, blockCount,
blockSizes, blockStarts.

Requirements for BED file format - TSS utility:
The following fields (columns) are required:
chrom, chromStart, chromEnd, strand.
Please note that according to the BED format, lower-numbered fields (columns)
must always be present if higher-numbered fields are used. Hence, although the
field "name" is not required for TSS, it should be specified in the file
(inserting any character in column number 4 in the file is sufficient).

*** Output File
An output file name must be specified.

*** Symbol File
This is an optional parameter for the "NDG" and "TSS" utilities. The symbol
file maps accession numbers to gene symbols; these can be obtained using the
BioMart feature of Ensembl or from the UCSC table browser.

*** chrSizeFile
This is an optional parameter for the "ODS" utility.
This file specifies the size of each chromosome. If it is provided, a
randomization test is performed to calculate the intersection p-value and the
enrichment over random.

*** numRandomDatasets
Number of random datasets to generate when calculating the overlap p-value
(default 1000). Random regions, matched by chromosome and length to the first
regions file, are intersected with the second.

////////////////////////////////////////////////////////////////////////////////
OUTPUT FILES
////////////////////////////////////////////////////////////////////////////////

The output of the "NDG" utility is two tab delimited files:
**************************************************************

A. "OutputFileName" as specified in the command line
This file describes the closest downstream genes for each genomic locus, and
contains the following fields:
1. Chromosome
2. Start
3. End - These first three columns describe the location of the peak in the
   genome.
4. # Overlapped_Genes - Number of transcripts overlapping the genomic locus.
   More details about these genes are reported in the second output file
   described below.
5. Downstream_FW_Gene - ID of the closest downstream gene on the forward strand.
(6. Symbol - If a symbol file is specified, this field will contain the symbol
   of the closest downstream gene on the forward strand.)
7. Distance - Peak distance to its closest downstream gene on the forward
   strand.
8. Downstream_REV_gene - ID of the closest downstream gene on the reverse
   strand.
(9. Symbol - If a symbol file is specified, this field will contain the symbol
   of the closest downstream gene on the reverse strand.)
10. Distance - Peak distance to its closest downstream gene on the reverse
   strand.

B. "Overlap_OutputFileName"
This file describes the transcripts overlapping the peaks, if any such are
found.
1. Chromosome
2. Start
3. End - These first three columns describe the location of the peak in the
   genome.
4. OverlapGene - Overlapping gene ID
(5. Symbol - If a symbol file is specified, this field will contain the
   overlapping gene symbol.)
6. Overlap_Begin - In which part of the gene the peak's start position falls
7. Overlap_Center - In which part of the gene the peak's central position falls
8. Overlap_End - In which part of the gene the peak's end position falls

The output of the "TSS" utility is a tab-delimited file:
*****************************************************

"OutputFileName" as specified in the command line
This file contains the following fields:
1. Chromosome
2. Start
3. End - These first three columns describe the location of the peak in the
   genome.
4. Distance - The distance from the peak to its closest TSS.
5. GeneStart - The start location of the closest gene on the genome.
6. GeneEnd - The end location of the closest gene on the genome.
7. ClosestTSS_ID - ID of the closest gene.
(8. Symbol - If a symbol file is specified, this field will contain the symbol
   of the closest gene.)
9. Strand - Strand of the closest gene.

The output of the "ODS" utility is three tab delimited files:
******************************************************************

A. "OutputFileName" as specified in the command line
Each line in this file describes an overlap event between two genomic loci, and
has the following fields:
1. Chromosome
2. peakFile1_Start - Start location of the first genomic locus
3. peakFile1_End - End location of the first genomic locus
4. peakFile1_Name - Name of the first genomic locus (if it exists in the input
   file)
5. peakFile2_Start - Start location of the second genomic locus
6. peakFile2_End - End location of the second genomic locus
7. peakFile2_Name - Name of the second genomic locus (if it exists in the input
   file)

B+C. "PeakFileName.unique" - one file for each genomic input file, which
describes the peaks unique to that file.
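
Note on preparing input: the Peak File section asks for a file without header
lines, sorted by chromosome and start location. The shell commands below are
one possible way to do that with standard tools; they are a sketch and not part
of PeakAnnotator. The file names peaks_raw.txt and peaks_sorted.txt and the
sample coordinates are hypothetical. It is also an assumption that plain
byte-order sorting of chromosome names (chr1, chr10, chr2, ...) is acceptable
to the program.

```shell
# Hypothetical example input: one header line followed by unsorted peaks.
printf 'chrom\tstart\tend\n'  > peaks_raw.txt
printf 'chr2\t500\t700\n'    >> peaks_raw.txt
printf 'chr1\t9000\t9400\n'  >> peaks_raw.txt
printf 'chr1\t150\t420\n'    >> peaks_raw.txt

# Keep only lines whose 2nd and 3rd columns are whole numbers (drops headers),
# then sort by chromosome name (column 1) and numerically by start (column 2).
# awk and sort both split on tabs or spaces by default.
# Assumption: byte-order chromosome sorting (LC_ALL=C) suits PeakAnnotator.
awk '$2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/' peaks_raw.txt \
  | LC_ALL=C sort -k1,1 -k2,2n > peaks_sorted.txt

# Show the cleaned, sorted peak file.
cat peaks_sorted.txt
```

The awk filter drops any line whose start or end column is not a whole number,
which removes typical header lines. Check the result by eye if your file has an
unusual header or extra leading columns.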