Sequence Mapping, Part. 1: Reference Based Assembly
Reference Based Transcriptome Analysis (Genome Mapping)
Table of contents
- Preliminary Preparation:
- Step 1. Genome Data Preparation
- Step 2. Indexing
- Step 3. Genome Mapping
- Step 4. Bam Sorting (End of Genome Mapping)
- Step 5. Generating Transcript Assembly (Start of Quantification)
- Step 6. Merging Assemblies
- Step 7. Differential Expression Analysis
- Overall Alignment Rate Calculation
Preliminary Preparation:
Before we start, there are a couple of tools (hyperlinked) to download: HISAT2, Cufflinks, Samtools, and Bowtie2. Make sure you have the correct binary and the latest version when downloading and follow the instructions to install the tools.
Step 1. Genome Data Preparation
We also need to prepare reference genome data. You can get the genome data from databases such as NCBI, UCSC genome browser, Ensembl, etc. The data we need is a genome sequence file (genome.fa
) and an annotation file (genes.gtf
).
You can obtain those files for your reference model species, e.g. Homo sapiens or Mus musculus etc, from the Illumina website here. You should download this on your server rather than on your local PC.
For this demonstration, my model specie is Mus musculus (Mouse) using data from UCSC database. To download the data, use the following command below replacing the bracket with the correct information:
wget [paste the link address of UCSC Mus musculus]
Once you have completed the download, use tar -xvzf [name of the file including '.tar.gz']
command by replacing the bracket with the correct file name to unzip the file. As a result, a new directory would have been generated called Mus musculus
.
Step 2. Indexing
Before mapping the reads, we need to index the reference genome. Why are we doing this? By indexing, it will reduce the time and memory while mapping the genome.
Locate the
genome.fa
file in theMus musculus
directory usingcd
command.For example, mine would look like this:
cd sra_data/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta
Note: The
WholeGenomeFasta
directory is where thegenome.fa
file is locatedMake a new directory name
genome_index
.We will now commence indexing creating the output file in the new directory,
genome_index
:cd genome_index hisat2-build -p 8 /home/compbio/sra_data/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa [Name of Index file] ### Note: ### # -p: number of cpu we will be using when indexing
Note: You should replace the bracket with the output file name of your choice. I'm going to call my index file as
musculus_index
for this demonstration.There should be a total of 8 index files in
.ht2
format generated when the process has completed.
Step 3. Genome Mapping
To map the sequence reads using the HISAT2 tool, the command format looks like below, however, there is an extra step to add on:
hisat2 -p 8 -x [Index File Name] -1 [forward paired fastq file] -2 [reverse paired fastq file] -S [output file name].sam --dta-cufflinks
Note: the fastq files it's referring to are the Paired Fastq Files (forward and reverse) we generated from Trimmomatic.
While HISAT2 is mapping, an alignment summary will be generated for each sample/file. This report is not saved automatically but rather created as a standard error ("stderr").
Therefore we need to use the following command 2>&1 |tee [name of alignment file].summary
to save the stderr report and the mapped file:
hisat2 -p 8 -x musculus_index -1 [forward paired fastq file] -2 [reverse paired fastq file] -S [output file name].sam --dta-cufflinks 2>&1 |tee [name of alignment file].summary
So to give you an example:hisat2 -p 8 -x musculus_index -1 /home/compbio/sra_data/SRR8238941_1_P.fastq.gz -2 /home/compbio/sra_data/SRR8238941_2_P.fastq.gz -S SRR8238941.sam --dta-cufflinks 2>&1 |tee SRR8238941.summary
As a result, you should have generated a .sam
file and a .summary
file. Once the alignment summary report has been generated, you will need to calculate the Overall Alignment Rate which I will demonstrate at the end of the article.
Step 4. Bam Sorting (End of Genome Mapping)
In this step, we will be converting the .sam
file format into a bam file format using the following command:
samtools sort -@8 -o [output file name].sort.bam [file name from mapping].sam
Note: -@
command is a parameter for the number of CPUs to use during this step.
To demonstrate one example the code will look like this:samtools sort -@ 8 -o SRR8238941.sort.bam SRR8238941.sam
Step 5. Generating Transcript Assembly (Start of Quantification)
For this step, it will be based on the Tuxedo protocol. Use the following command format to commence transcription assembly.
cufflinks -p 8 -G [pathway of the Annotation file] -o [output_directory_name] [file name from Bam sorting].sort.bam
# -p: number of CPU
# -G: Annotation file (should write the pull pathway where this file is located)
# -o: cufflinks creates a new directory to store the output files therefore you just need to name it
To show you an example, mine would look like this:cufflinks -p 8 -G /home/compbio/sra_data/Mus_musculus/UCSC/mm10/Annotation/Genes/genes.gtf -o SRR8238941_trspt SRR8238941.sort.bam
Note: you will need to write a full pathway for -G
command since the annotation file (genes.gtf
) is in a separate directory.
Once you have finished you should have a new directory with genes.fpkm_tracking, isoforms.fpkm_tracking, skipped.gtf, transcripts.gtf
files generated for each file.
Step 6. Merging Assemblies
We will be merging the cufflink-assembled files into a single annotation file so that we can compare the expression level between samples. You also need Python version 2.7 installed or the tool will not work, so make sure you have conda on your Linux/Ubuntu environment beforehand.
Compile the pathways for the assembled files (
transcripts.gtf
) from the previous step using the following command:vi assemblies.txt
As an example for one file mine will look like this:
/home/compbio/sra_data/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome_index/gene_alignment/bam/SRR8238941_trspt/transcripts.gtf
Make sure you save the txt file using the correct vi editor commands.
Execute the below command to merge the assembled files
cuffmerge -p 8 -g [pathway to genes.gtf] -s [pathway to genome.fa] -o merged_asm assemblies.txt # -p: number of cpu # -g Annotation file # -s: genome sequence file # -o: directory where cuffmerge files will be stored
To demonstrate my command will look like this:
cuffmerge -p 8 -g /home/compbio/sra_data/Mus_musculus/UCSC/mm10/Annotation/Genes/genes.gtf -s genome.fa -o merged_asm assemblies.txt
Note: I wrote the full pathway to
genes.gtf
file because it's located in a different directory. You need to provide a full pathway if the file(s) you are working with is not located in the same directory.Once it's done, the
cuffmerge
will create amerged_asm
directory where themerge.gtf
file is generated, which is the transcriptome assembly file.
Step 7. Differential Expression Analysis
At last, we will be finding differentially expressed genes (DEGs) between samples using cuffdiff
command. The command format looks like this:
cuffdiff -o diff_out -b genome.fa -p 8 -L Ctrl,Ra,Rv -u merged.gtf file1.sort.bam file2.sort.bam
# -o: name of output directory where your DEG file(s) will be stored
# -b: genome sequence file
# -p: number of cpu
# -L: sample name
# -u: multi-read-correction; reads that are mapped on multiple locations in the genome are calculated using weight read mapping
If your data (or sample) has replication you will need to use this file naming format with commas:file1.sort.bam,file1-1.sort.bam file2.sort.bam,file2-1.sort.bam file3.sort.bam,file3-1.sort.bam
To figure out what the sample names are for each corresponding file, go back to the journal article or here. Each file should tell you whether it is a control or a different sample.
Here is an example for mine:cuffdiff -o diff_out -b /home/compbio/sra_data/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome_index/genome.fa -p 8 -L Ctrl,Ra,Rv -u /home/compbio/sra_data/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome_index/merge_assemble/merged.gtf SRR8238941.sort.bam,SRR8238942.sort.bam SRR8238943.sort.bam,SRR8238944.sort.bam
Since the article states there are 3 major types of samples (Control, Ra, Rv) I specified the sample name as Ctrl,Ra,Rv
. Also notice how my bam files are grouped by same sample types. For instance SRR8238941.sort.bam,SRR8238942.sort.bam
are grouped and separated by commas, not spaced, because they are control group samples.
Once the process has been completed, you will find your several output files in the diff_out
directory.
If you are focused on gene expression level the files you want to refer to are gene_exp.diff
and genes.read_group_tracking
.
gene_exp.diff
: contains gene expression level for each sample, p & q values of differentially expressed genes.genes.read_group_tracking
file contains expression levels of each replicate.
Overall Alignment Rate Calculation
Referring back to Step 3. Gene Mapping after alignment summary reports ( [file name].summary
) have been generated, and the next step is to calculate the overall alignment rate. When taking a look at one of the alignment summary files you should see a report like below:
37537336 reads; of these:
37537336 (100.00%) were paired; of these:
986245 (2.63%) aligned concordantly 0 times
33036211 (88.01%) aligned concordantly exactly 1 time
3514880 (9.36%) aligned concordantly >1 times
— —
986245 pairs aligned concordantly 0 times; of these:
112422 (11.40%) aligned discordantly 1 time
— —
873823 pairs aligned 0 times concordantly or discordantly; of these:
1747646 mates make up the pairs; of these:
1077358 (61.65%) aligned 0 times
579867 (33.18%) aligned exactly 1 time
90421 (5.17%) aligned >1 times
98.56% overall alignment rate
<<End of Report>>
The following is the calculation how we derived 98.65% overall alignment rate, refer to bolded section in the report :
98.56% = {33036211 + 3514880 + 112422 + (579867 + 90421) / 2} / 37537336 x 100
Congrats, you have made it!
In the next article, I will be demonstrating De Novo Assembly. In the meantime, carefully go over each step and try to run through the workflow with other datasets.