Sequence Mapping, Part. 1: Reference Based Assembly

Sequence Mapping, Part. 1: Reference Based Assembly

Reference Based Transcriptome Analysis (Genome Mapping)

Preliminary Preparation:

Before we start, there are a couple of tools (hyperlinked) to download: HISAT2, Cufflinks, Samtools, and Bowtie2. Make sure you have the correct binary and the latest version when downloading and follow the instructions to install the tools.

Step 1. Genome Data Preparation

We also need to prepare reference genome data. You can get the genome data from databases such as NCBI, UCSC genome browser, Ensembl, etc. The data we need is a genome sequence file (genome.fa) and an annotation file (genes.gtf).

You can obtain those files for your reference model species, e.g. Homo sapiens or Mus musculus etc, from the Illumina website here. You should download this on your server rather than on your local PC.

For this demonstration, my model specie is Mus musculus (Mouse) using data from UCSC database. To download the data, use the following command below replacing the bracket with the correct information:

wget [paste the link address of UCSC Mus musculus]

Once you have completed the download, use tar -xvzf [name of the file including '.tar.gz'] command by replacing the bracket with the correct file name to unzip the file. As a result, a new directory would have been generated called Mus musculus.

Step 2. Indexing

Before mapping the reads, we need to index the reference genome. Why are we doing this? By indexing, it will reduce the time and memory while mapping the genome.

  1. Locate the genome.fa file in the Mus musculus directory using cd command.

    For example, mine would look like this:
    cd sra_data/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta

    Note: The WholeGenomeFasta directory is where the genome.fa file is located

  2. Make a new directory name genome_index .

  3. We will now commence indexing creating the output file in the new directory, genome_index:

     cd genome_index
     hisat2-build -p 8 /home/compbio/sra_data/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa [Name of Index file]
    
     ### Note: ###
     # -p: number of cpu we will be using when indexing
    

    Note: You should replace the bracket with the output file name of your choice. I'm going to call my index file as musculus_index for this demonstration.

    There should be a total of 8 index files in .ht2 format generated when the process has completed.

Step 3. Genome Mapping

To map the sequence reads using the HISAT2 tool, the command format looks like below, however, there is an extra step to add on:

hisat2 -p 8 -x [Index File Name] -1 [forward paired fastq file] -2 [reverse paired fastq file] -S [output file name].sam --dta-cufflinks

Note: the fastq files it's referring to are the Paired Fastq Files (forward and reverse) we generated from Trimmomatic.

While HISAT2 is mapping, an alignment summary will be generated for each sample/file. This report is not saved automatically but rather created as a standard error ("stderr").

Therefore we need to use the following command 2>&1 |tee [name of alignment file].summary to save the stderr report and the mapped file:

hisat2 -p 8 -x musculus_index -1 [forward paired fastq file] -2 [reverse paired fastq file] -S [output file name].sam --dta-cufflinks 2>&1 |tee [name of alignment file].summary

So to give you an example:
hisat2 -p 8 -x musculus_index -1 /home/compbio/sra_data/SRR8238941_1_P.fastq.gz -2 /home/compbio/sra_data/SRR8238941_2_P.fastq.gz -S SRR8238941.sam --dta-cufflinks 2>&1 |tee SRR8238941.summary

As a result, you should have generated a .sam file and a .summary file. Once the alignment summary report has been generated, you will need to calculate the Overall Alignment Rate which I will demonstrate at the end of the article.

Step 4. Bam Sorting (End of Genome Mapping)

In this step, we will be converting the .sam file format into a bam file format using the following command:

samtools sort -@8 -o [output file name].sort.bam [file name from mapping].sam

Note: -@ command is a parameter for the number of CPUs to use during this step.
To demonstrate one example the code will look like this:
samtools sort -@ 8 -o SRR8238941.sort.bam SRR8238941.sam

Step 5. Generating Transcript Assembly (Start of Quantification)

For this step, it will be based on the Tuxedo protocol. Use the following command format to commence transcription assembly.

cufflinks -p 8 -G [pathway of the Annotation file] -o [output_directory_name] [file name from Bam sorting].sort.bam

# -p: number of CPU
# -G: Annotation file (should write the pull pathway where this file is located)
# -o: cufflinks creates a new directory to store the output files therefore you just need to name it

To show you an example, mine would look like this:
cufflinks -p 8 -G /home/compbio/sra_data/Mus_musculus/UCSC/mm10/Annotation/Genes/genes.gtf -o SRR8238941_trspt SRR8238941.sort.bam

Note: you will need to write a full pathway for -G command since the annotation file (genes.gtf) is in a separate directory.

Once you have finished you should have a new directory with genes.fpkm_tracking, isoforms.fpkm_tracking, skipped.gtf, transcripts.gtf files generated for each file.

Step 6. Merging Assemblies

We will be merging the cufflink-assembled files into a single annotation file so that we can compare the expression level between samples. You also need Python version 2.7 installed or the tool will not work, so make sure you have conda on your Linux/Ubuntu environment beforehand.

  1. Compile the pathways for the assembled files (transcripts.gtf) from the previous step using the following command:

     vi assemblies.txt
    

    As an example for one file mine will look like this:
    /home/compbio/sra_data/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome_index/gene_alignment/bam/SRR8238941_trspt/transcripts.gtf

    Make sure you save the txt file using the correct vi editor commands.

  2. Execute the below command to merge the assembled files

     cuffmerge -p 8 -g [pathway to genes.gtf] -s [pathway to genome.fa] -o merged_asm assemblies.txt
    
     # -p: number of cpu
     # -g Annotation file 
     # -s: genome sequence file
     # -o: directory where cuffmerge files will be stored
    

    To demonstrate my command will look like this:
    cuffmerge -p 8 -g /home/compbio/sra_data/Mus_musculus/UCSC/mm10/Annotation/Genes/genes.gtf -s genome.fa -o merged_asm assemblies.txt

    Note: I wrote the full pathway to genes.gtf file because it's located in a different directory. You need to provide a full pathway if the file(s) you are working with is not located in the same directory.

  3. Once it's done, the cuffmerge will create a merged_asm directory where the merge.gtf file is generated, which is the transcriptome assembly file.

Step 7. Differential Expression Analysis

At last, we will be finding differentially expressed genes (DEGs) between samples using cuffdiff command. The command format looks like this:

cuffdiff -o diff_out -b genome.fa -p 8 -L Ctrl,Ra,Rv -u merged.gtf file1.sort.bam file2.sort.bam

# -o: name of output directory where your DEG file(s) will be stored
# -b: genome sequence file
# -p: number of cpu
# -L: sample name 
# -u: multi-read-correction; reads that are mapped on multiple locations in the genome are calculated using weight read mapping

If your data (or sample) has replication you will need to use this file naming format with commas:
file1.sort.bam,file1-1.sort.bam file2.sort.bam,file2-1.sort.bam file3.sort.bam,file3-1.sort.bam

To figure out what the sample names are for each corresponding file, go back to the journal article or here. Each file should tell you whether it is a control or a different sample.

Here is an example for mine:
cuffdiff -o diff_out -b /home/compbio/sra_data/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome_index/genome.fa -p 8 -L Ctrl,Ra,Rv -u /home/compbio/sra_data/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome_index/merge_assemble/merged.gtf SRR8238941.sort.bam,SRR8238942.sort.bam SRR8238943.sort.bam,SRR8238944.sort.bam

Since the article states there are 3 major types of samples (Control, Ra, Rv) I specified the sample name as Ctrl,Ra,Rv. Also notice how my bam files are grouped by same sample types. For instance SRR8238941.sort.bam,SRR8238942.sort.bam are grouped and separated by commas, not spaced, because they are control group samples.

Once the process has been completed, you will find your several output files in the diff_out directory.

If you are focused on gene expression level the files you want to refer to are gene_exp.diff and genes.read_group_tracking.

gene_exp.diff : contains gene expression level for each sample, p & q values of differentially expressed genes.
genes.read_group_tracking file contains expression levels of each replicate.


Overall Alignment Rate Calculation

Referring back to Step 3. Gene Mapping after alignment summary reports ( [file name].summary ) have been generated, and the next step is to calculate the overall alignment rate. When taking a look at one of the alignment summary files you should see a report like below:

37537336 reads; of these:

37537336 (100.00%) were paired; of these:
986245 (2.63%) aligned concordantly 0 times
33036211 (88.01%) aligned concordantly exactly 1 time
3514880 (9.36%) aligned concordantly >1 times

— —

986245 pairs aligned concordantly 0 times; of these:
112422 (11.40%) aligned discordantly 1 time

— —

873823 pairs aligned 0 times concordantly or discordantly; of these:

1747646 mates make up the pairs; of these:
1077358 (61.65%) aligned 0 times
579867 (33.18%) aligned exactly 1 time
90421 (5.17%) aligned >1 times

98.56% overall alignment rate
<<End of Report>>

The following is the calculation how we derived 98.65% overall alignment rate, refer to bolded section in the report :

98.56% = {33036211 + 3514880 + 112422 + (579867 + 90421) / 2} / 37537336 x 100

Congrats, you have made it!
In the next article, I will be demonstrating De Novo Assembly. In the meantime, carefully go over each step and try to run through the workflow with other datasets.

Did you find this article valuable?

Support ShortLong-Seq Reads Bioinformatics by becoming a sponsor. Any amount is appreciated!