Identifying Sequence Differences: Variant Calling
Variant calling identifies differences in genomic sequence, such as SNPs (single-nucleotide polymorphisms), insertions, deletions, and structural variants, between a sample (or subject) of interest and a reference genome (or transcriptome), using high-throughput DNA sequencing data.
The diagram below is a visual representation of the overall workflow.
The main focus of this pipeline is to first align the sequence reads, detect the variants, and then filter and report them in Variant Call Format (VCF), which is later used for further downstream analysis, e.g. visualization.
Don't get too caught up in the specific tools, since people use different tools for certain steps, but the main tools for the calling process here are Picard and GATK.
In this write-up, you will learn to align the sequence reads to identify positions where the sequences differ and then analyze them.
You will need the following tools installed beforehand: Bowtie2 (or HISAT2), samtools, Picard, and GATK.
Go ahead and download the five data files from here. I will be using ERX4058018 to ERX4058022 for this demo. For the first couple of steps, from quality control up to genome mapping, the workflow is the same as in reference-based assembly.
Step 1. Download & Quality Control
I will not go into detail for this step because by now you should be able to conduct quality control on the raw sequence data (fastq files) using FastQC and Trimmomatic. If you need a refresher, please refer back to my previous articles.
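If it helps, the cleanup for all five accessions can be sketched as a single loop. This is an echo dry run so you can inspect the commands before running them; the input file names, adapter-free trim settings, and output naming are assumptions, so adjust them to your own setup and drop `echo` to actually execute:

```shell
# Dry-run sketch of paired-end trimming for every accession
# (assumed file names and trim settings; remove "echo" to run for real).
for acc in ERX4058018 ERX4058019 ERX4058020 ERX4058021 ERX4058022; do
  echo trimmomatic PE ${acc}_1.fastq ${acc}_2.fastq \
    ${acc}_1_P.fastq ${acc}_1_U.fastq ${acc}_2_P.fastq ${acc}_2_U.fastq \
    SLIDINGWINDOW:4:20 MINLEN:36
done
```

The `_P` (paired) outputs are the files used in the mapping step below; the `_U` (unpaired) files can usually be set aside.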
Step 2. Reference Genome Data Prep
Once you have cleaned up the sequence data files, the next step is to map the sequence reads onto the reference genome. Note that the demo data is from Homo sapiens samples, so you will need a human reference genome. You can download the file from here; download the UCSC version (hg38) using the `wget` command.
After downloading the reference genome file, use the `tar -xvzf` command to decompress it. As a result, a new directory called `Homo_sapiens` is generated.
Step 3. Indexing
1. Locate the `genome.fa` file in the `Homo_sapiens` directory using the `cd` command. For example, mine looks like this: `cd /home/compbio/data/Sapien_ref/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta`
2. Make a new directory, calling it `genome_index`.
3. Create the index files in the `genome_index` directory using the following commands:
cd genome_index
hisat2-build -p 8 /home/compbio/data/Sapien_ref/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa [Name of Output Index file]
### Note: ###
# -p: number of CPUs to use when indexing
Note: You should replace the bracket with the output file name of your choice. I'm going to call my index file `sapien_index` for this demo. There should be a total of 8 index files in `.ht2` format as a result.
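As a quick sanity check you can count the index files. The snippet below fabricates eight empty `.ht2` files in a scratch directory so that it runs standalone; in practice you would run just the `ls` line inside your real `genome_index` directory:

```shell
# Count the HISAT2 index files; hisat2-build emits 8 of them.
# (Mock .ht2 files are created here so the snippet is self-contained.)
mkdir -p ht2_demo && cd ht2_demo
for i in 1 2 3 4 5 6 7 8; do touch sapien_index.${i}.ht2; done
ls sapien_index*.ht2 | wc -l    # should print 8
cd .. && rm -r ht2_demo
```

If the count is not 8, the index build was interrupted and should be rerun.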
Step 4. Genome Mapping
To map the sequence reads you can use either HISAT2 or Bowtie2.
If you are using HISAT2 for your mapping the command will look like this:
hisat2 -p 8 -x [Index File Name] -1 [Forward Paired fastq file] -2 [Reverse Paired fastq file] -S [output file name].sam --dta-cufflinks 2>&1 |tee [name of Output alignment file].summary
To give you an example of how I did it:
hisat2 -p 8 -x sapien_index -1 /home/comp/data/raw_data/ERX4058021_1_P.fastq -2 /home/comp/data/raw_data/ERX4058021_2_P.fastq -S ERX4058021.sam --dta-cufflinks 2>&1 | tee ERX4058021.summary
Once it has finished, you should have a `.sam` file and a `.summary` file for each accession.
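The `.summary` files are plain text, so you can pull the headline mapping statistic out of each one with `grep`. The snippet below writes a mock one-line summary (the 92.31% figure is fabricated) so that it runs standalone; point the `grep` at your real `.summary` files:

```shell
# Extract the overall alignment rate from a HISAT2 summary file.
# (A fabricated summary line is written first so the snippet is runnable.)
printf '92.31%% overall alignment rate\n' > ERX4058021.summary
grep "overall alignment rate" ERX4058021.summary
rm ERX4058021.summary
```

A low overall alignment rate is worth investigating before moving on, since everything downstream depends on the mapping.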
If you are using the Bowtie2 tool to align the sequences, the command will look like this:
bowtie2 -x [pathway to Bowtie reference genome]/genome -p 64 -1 [Forward Paired fastq file] -2 [Reverse Paired fastq file] -S [output file name].sam 2>&1|tee [output file name].log
### Note: ###
# -p : number of CPU core
So your command should look similar to mine:
bowtie2 -x /home/compbio/data/Sapien_ref/Homo_sapiens/UCSC/hg38/Sequence/BowtieIndex/genome -p 64 -1 /home/comp/data/raw_data/ERX4058021_1_P.fastq -2 /home/comp/data/raw_data/ERX4058021_2_P.fastq -S ERX4058021.bowtie.sam 2>&1 | tee ERX4058021.bowtie2.log
Note: HISAT2 is splice-aware and is used more for RNA-seq data, whereas Bowtie2 is used more for DNA-seq data.
Step 5. Bam Sorting
We will now convert the `.sam` file into sorted BAM format using the following command:
samtools sort -@8 -o [output file name].sort.bam [file name from mapping].sam
# -@ : number of CPU to use
To provide an example, your code should look similar to this:
samtools sort -@8 -o ERX4058021.sam.bam ERX4058021.sam
Step 6. Create Sequence Dictionary
- Make a new directory; I named mine `SeqDict`. Locate the `genome.fa` file in the `WholeGenomeFasta` directory, then copy the `genome.fa` file over to `SeqDict` using the `cp` command. Like this (example):
cp genome.fa /home/compbio/data/Sapien_ref/Homo_sapiens/UCSC/hg38/Sequence/SeqDict
Note that you need to write the full path of `SeqDict` to transfer the copied `genome.fa` file to the designated directory.
- In the `SeqDict` directory, we will create a sequence dictionary using the Picard tool. The command format looks like this:
java -jar [pathway of the picard tool]/picard.jar CreateSequenceDictionary R=genome.fa O=genome.dict
Once it has been completed you will need to index the genome.fa file:
samtools faidx genome.fa
Once you have finished this step, you should have genome.dict, genome.fa, and genome.fa.fai files in that directory.
Step 7. Picard: Add or Replace Read Groups
java -jar [pathway of the picard tool]/picard.jar AddOrReplaceReadGroups I=[pathway of sorted .bam file] O=[output file name].sam.grouped.bam RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=[sample name]
Note: RGSM is the sample name recorded in the read group, so use your accession (or sample) ID rather than a file path.
To give you an example:
java -jar /home/compbio/tools/picard.jar AddOrReplaceReadGroups I=/home/compbio/data/Sapien_ref/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome_index/ERX4058018.sam.bam O=ERX4058018.sam.grouped.bam RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=ERX4058018
As a result, you should have `.sam.grouped.bam` files.
Step 8. Picard: Mark Duplicate Reads
Since the demo data is a spliced-read (RNA-seq) file, we will use the following command:
java -jar [pathway to picard tool]/picard.jar MarkDuplicates -I [file name].grouped.bam -O [output file name].sam.grouped.marked.bam -M [output file name].sam.grouped.marked.metrics.txt -MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP 5000000 -OPTICAL_DUPLICATE_PIXEL_DISTANCE 100
Your code should look similar to the example below
java -jar /home/compbio/tools/picard.jar MarkDuplicates -I ERX4058018.sam.grouped.bam -O ERX4058018.sam.grouped.marked.bam -M ERX4058018.sam.grouped.marked.metrics.txt -MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP 5000000 -OPTICAL_DUPLICATE_PIXEL_DISTANCE 100
Once this process is done, we need to index the bam files
java -jar [pathway to picard tool]/picard.jar BuildBamIndex INPUT=[file name].sam.grouped.marked.bam OUTPUT=[output file name].sam.grouped.marked.bai
Step 9. HaplotypeCaller
gatk HaplotypeCaller -R [genome.fa file] -I [.sam.grouped.marked.bam file] -O [output file name].sam.vcf
To give you an example:
gatk HaplotypeCaller -R genome.fa -I ERX4058018.sam.grouped.marked.bam -O ERX4058018.sam.vcf
As a result, you should have `.vcf` format files.
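To call variants on all five accessions, the per-sample command can be wrapped in a loop. This sketch is an echo dry run so you can eyeball the commands before committing; remove `echo` to execute (it assumes `genome.fa` and the marked BAM files sit in the working directory):

```shell
# Dry-run loop over every accession; drop "echo" to actually call variants.
for acc in ERX4058018 ERX4058019 ERX4058020 ERX4058021 ERX4058022; do
  echo gatk HaplotypeCaller -R genome.fa \
    -I ${acc}.sam.grouped.marked.bam -O ${acc}.sam.vcf
done
```

HaplotypeCaller is the slowest step of the pipeline, so expect this loop to take a while on whole-genome human data.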
Step 10. Filtering Variants
- Extract SNP
gatk SelectVariants -R [genome.fa file] -V [.sam.vcf file] -select-type SNP -O [output file name].sam.snp.vcf
- Filtering SNPs
gatk VariantFiltration -R genome.fa -V [sam.snp.vcf file] --filter-expression "QD < 2.0 || FS > 60.0 || MQ <40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filter-name "my_snp_filter" -O [output file name].sam.snp.filtered.vcf
How do you know what parameters/filter values to use?
You might have to consult other papers to decide what reasonable filter values would be, but this link may help you as a guide.
- Removing filtered SNPs
VariantFiltration only annotates the FILTER column; failing records stay in the file, so we strip them out with `grep`:
cat [.sam.snp.filtered.vcf file] | grep -v -w "my_snp_filter" > [output file name].sam.snp.filtered.removed.vcf
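To see what this removal step actually does, here is a toy run on a fabricated three-record VCF body (file name and records are made up for illustration): records whose FILTER column was set to `my_snp_filter` are dropped, while `PASS` records survive.

```shell
# Toy demonstration: records flagged "my_snp_filter" are removed,
# records marked PASS survive. (Mock VCF body, no header, for brevity.)
printf 'chr1\t100\t.\tA\tG\t50\tPASS\n'          >  demo.snp.filtered.vcf
printf 'chr1\t200\t.\tC\tT\t10\tmy_snp_filter\n' >> demo.snp.filtered.vcf
printf 'chr1\t300\t.\tG\tA\t60\tPASS\n'          >> demo.snp.filtered.vcf
grep -v -w "my_snp_filter" demo.snp.filtered.vcf
rm demo.snp.filtered.vcf
```

One caveat on a real VCF: the `##FILTER=<ID=my_snp_filter,...>` header line also contains the word and gets dropped by `grep -v -w`, which is harmless for most downstream tools but worth knowing.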
Congrats, you have mastered variant calling! From here, further downstream analysis depends on your objective, whether that is looking at coding regions, non-coding regions, etc.
Hope this helped, see you in the next write-up!