Identifying Sequence Differences: Variant Calling
Variant calling identifies differences in genomic sequence, such as SNPs (single-nucleotide polymorphisms), insertions, deletions, and structural variants, between a sample (or subject) of interest and a reference genome (or transcriptome), using high-throughput DNA sequencing data.
The diagram below is a visual representation of the overall workflow.
The main focus of this pipeline is to first align the sequence reads, detect the variants, and then filter and report them in Variant Call Format (VCF), which is later used for further downstream analysis, e.g. visualization.
Don't get too caught up in the specific tools, since people use different tools for certain steps, but the main tools for the calling process here are Picard and GATK.
In this write-up, you will learn to align the sequence reads to identify positions where the sequences differ and then analyze them.
You will need the following tools installed beforehand: Bowtie2 (or HISAT2), samtools, Picard, and GATK.
Go ahead and download the five data files from here. I will be using ERX4058018 to ERX4058022 for this demo. For the first couple of steps, from quality control up to genome mapping, the workflow is the same as in reference-based assembly.
Step 1. Download & Quality Control
I will not go into detail for this step because by now you should be able to conduct quality control on the raw sequence data (fastq files) using FastQC and Trimmomatic. If you need a refresher, please refer back to my previous articles.
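If it helps, the cleanup for all five accessions can be sketched as a single loop. This is an echo dry run so you can inspect the commands before running them; the input file names, adapter-free trim settings, and output naming are assumptions, so adjust them to your own setup and drop `echo` to actually execute:

```shell
# Dry-run sketch of paired-end trimming for every accession
# (assumed file names and trim settings; remove "echo" to run for real).
for acc in ERX4058018 ERX4058019 ERX4058020 ERX4058021 ERX4058022; do
  echo trimmomatic PE ${acc}_1.fastq ${acc}_2.fastq \
    ${acc}_1_P.fastq ${acc}_1_U.fastq ${acc}_2_P.fastq ${acc}_2_U.fastq \
    SLIDINGWINDOW:4:20 MINLEN:36
done
```

The `_P` (paired) outputs are the files used in the mapping step below; the `_U` (unpaired) files can usually be set aside.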
Step 2. Reference Genome Data Prep
Once you have cleaned up the sequence data files, the next step is to map the sequence reads onto the reference genome. Note that the demo data is from Homo sapiens samples, so you will need a human reference genome. You can download the file from here; download the UCSC version (hg38) using the `wget` command.
After downloading the reference genome file, use the `tar -xvzf` command to decompress it. As a result, a new directory called `Homo_sapiens` is generated.
Step 3. Indexing
1. Locate the `genome.fa` file in the `Homo_sapiens` directory using the `cd` command. For example, mine looks like this: `cd /home/compbio/data/Sapien_ref/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta`
2. Make a new directory, calling it `genome_index`.
3. Create the index files in the `genome_index` directory using the following commands:
cd genome_index
hisat2-build -p 8 /home/compbio/data/Sapien_ref/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa [Name of Output Index file]
### Note: ###
# -p: number of CPUs to use when indexing
Note: You should replace the bracket with the output file name of your choice. I'm going to call my index file `sapien_index` for this demo. There should be a total of 8 index files in `.ht2` format as a result.
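As a quick sanity check you can count the index files. The snippet below fabricates eight empty `.ht2` files in a scratch directory so that it runs standalone; in practice you would run just the `ls` line inside your real `genome_index` directory:

```shell
# Count the HISAT2 index files; hisat2-build emits 8 of them.
# (Mock .ht2 files are created here so the snippet is self-contained.)
mkdir -p ht2_demo && cd ht2_demo
for i in 1 2 3 4 5 6 7 8; do touch sapien_index.${i}.ht2; done
ls sapien_index*.ht2 | wc -l    # should print 8
cd .. && rm -r ht2_demo
```

If the count is not 8, the index build was interrupted and should be rerun.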
Step 4. Genome Mapping
To map the sequence reads you can use either HISAT2 or Bowtie2.
If you are using HISAT2 for your mapping the command will look like this:
hisat2 -p 8 -x [Index File Name] -1 [Forward Paired fastq file] -2 [Reverse Paired fastq file] -S [output file name].sam --dta-cufflinks 2>&1 |tee [name of Output alignment file].summary
To give you an example of how I did it:
hisat2 -p 8 -x sapien_index -1 /home/comp/data/raw_data/ERX4058021_1_P.fastq -2 /home/comp/data/raw_data/ERX4058021_2_P.fastq -S ERX4058021.sam --dta-cufflinks 2>&1 | tee ERX4058021.summary
Once it has finished, you should have a `.sam` file and a `.summary` file for each accession.
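The `.summary` files are plain text, so you can pull the headline mapping statistic out of each one with `grep`. The snippet below writes a mock one-line summary (the 92.31% figure is fabricated) so that it runs standalone; point the `grep` at your real `.summary` files:

```shell
# Extract the overall alignment rate from a HISAT2 summary file.
# (A fabricated summary line is written first so the snippet is runnable.)
printf '92.31%% overall alignment rate\n' > ERX4058021.summary
grep "overall alignment rate" ERX4058021.summary
rm ERX4058021.summary
```

A low overall alignment rate is worth investigating before moving on, since everything downstream depends on the mapping.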
If you are using the Bowtie2 tool to align the sequences, the command will look like this:
bowtie2 -x [pathway to Bowtie reference genome]/genome -p 64 -1 [Forward Paired fastq file] -2 [Reverse Paired fastq file] -S [output file name].sam 2>&1|tee [output file name].log
### Note: ###
# -p : number of CPU core
So your command should look similar to mine:
bowtie2 -x /home/compbio/data/Sapien_ref/Homo_sapiens/UCSC/hg38/Sequence/BowtieIndex/genome -p 64 -1 /home/comp/data/raw_data/ERX4058021_1_P.fastq -2 /home/comp/data/raw_data/ERX4058021_2_P.fastq -S ERX4058021.bowtie.sam 2>&1 | tee ERX4058021.bowtie2.log
Note: HISAT2 is splice-aware and is used more for RNA-seq data, whereas Bowtie2 is used more for DNA-seq data.
Step 5. Bam Sorting
We will now convert the `.sam` file into sorted BAM format using the following command:
samtools sort -@8 -o [output file name].sort.bam [file name from mapping].sam
# -@ : number of CPU to use
To provide an example, your code should look similar to this:
samtools sort -@8 -o ERX4058021.sam.bam ERX4058021.sam
Step 6. Create Sequence Dictionary
- Make a new directory; I named mine `SeqDict`. Locate the `genome.fa` file in the `WholeGenomeFasta` directory, then copy the `genome.fa` file over to `SeqDict` using the `cp` command. Like this (example):
cp genome.fa /home/compbio/data/Sapien_ref/Homo_sapiens/UCSC/hg38/Sequence/SeqDict
Note that you need to write the full path of `SeqDict` to transfer the copied `genome.fa` file to the designated directory.
- In the `SeqDict` directory, we will create a sequence dictionary using the Picard tool. The command format looks like this:
java -jar [pathway of the picard tool]/picard.jar CreateSequenceDictionary R=genome.fa O=genome.dict
Once it has been completed you will need to index the genome.fa file:
samtools faidx genome.fa
Once you have finished this step, you should have genome.dict, genome.fa, and genome.fa.fai files in that directory.
Step 7. Picard: Add or Replace Read Groups
java -jar [pathway of the picard tool]/picard.jar AddOrReplaceReadGroups I=[pathway of sorted .bam file] O=[output file name].sam.grouped.bam RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=[sample name]
Note: RGSM is the sample name recorded in the read group, so use your accession (or sample) ID rather than a file path.
To give you an example:
java -jar /home/compbio/tools/picard.jar AddOrReplaceReadGroups I=/home/compbio/data/Sapien_ref/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome_index/ERX4058018.sam.bam O=ERX4058018.sam.grouped.bam RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=ERX4058018
As a result, you should have `.sam.grouped.bam` files.
Step 8. Picard: Mark Duplicate Reads
Since the demo data is a spliced-read (RNA-seq) file, we will use the following command:
java -jar [pathway to picard tool]/picard.jar MarkDuplicates -I [file name].grouped.bam -O [output file name].sam.grouped.marked.bam -M [output file name].sam.grouped.marked.metrics.txt -MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP 5000000 -OPTICAL_DUPLICATE_PIXEL_DISTANCE 100
Your code should look similar to the example below
java -jar /home/compbio/tools/picard.jar MarkDuplicates -I ERX4058018.sam.grouped.bam -O ERX4058018.sam.grouped.marked.bam -M ERX4058018.sam.grouped.marked.metrics.txt -MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP 5000000 -OPTICAL_DUPLICATE_PIXEL_DISTANCE 100
Once this process is done, we need to index the bam files
java -jar [pathway to picard tool]/picard.jar BuildBamIndex INPUT=[file name].sam.grouped.marked.bam OUTPUT=[output file name].sam.grouped.marked.bai
Step 9. HaplotypeCaller
gatk HaplotypeCaller -R [genome.fa file] -I [.sam.grouped.marked.bam file] -O [output file name].sam.vcf
To give you an example:
gatk HaplotypeCaller -R genome.fa -I ERX4058018.sam.grouped.marked.bam -O ERX4058018.sam.vcf
As a result, you should have `.vcf` format files.
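To call variants on all five accessions, the per-sample command can be wrapped in a loop. This sketch is an echo dry run so you can eyeball the commands before committing; remove `echo` to execute (it assumes `genome.fa` and the marked BAM files sit in the working directory):

```shell
# Dry-run loop over every accession; drop "echo" to actually call variants.
for acc in ERX4058018 ERX4058019 ERX4058020 ERX4058021 ERX4058022; do
  echo gatk HaplotypeCaller -R genome.fa \
    -I ${acc}.sam.grouped.marked.bam -O ${acc}.sam.vcf
done
```

HaplotypeCaller is the slowest step of the pipeline, so expect this loop to take a while on whole-genome human data.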
Step 10. Filtering Variants
- Extract SNP
gatk SelectVariants -R [genome.fa file] -V [.sam.vcf file] -select-type SNP -O [output file name].sam.snp.vcf
- Filtering SNPs
gatk VariantFiltration -R genome.fa -V [sam.snp.vcf file] --filter-expression "QD < 2.0 || FS > 60.0 || MQ <40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filter-name "my_snp_filter" -O [output file name].sam.snp.filtered.vcf
How do you know what parameters/filter values to use?
You might have to consult other papers to decide what reasonable filter values would be, but this link may help you as a guide.
- Removing filtered SNPs
VariantFiltration only annotates the FILTER column; failing records stay in the file, so we strip them out with `grep`:
cat [.sam.snp.filtered.vcf file] | grep -v -w "my_snp_filter" > [output file name].sam.snp.filtered.removed.vcf
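To see what this removal step actually does, here is a toy run on a fabricated three-record VCF body (file name and records are made up for illustration): records whose FILTER column was set to `my_snp_filter` are dropped, while `PASS` records survive.

```shell
# Toy demonstration: records flagged "my_snp_filter" are removed,
# records marked PASS survive. (Mock VCF body, no header, for brevity.)
printf 'chr1\t100\t.\tA\tG\t50\tPASS\n'          >  demo.snp.filtered.vcf
printf 'chr1\t200\t.\tC\tT\t10\tmy_snp_filter\n' >> demo.snp.filtered.vcf
printf 'chr1\t300\t.\tG\tA\t60\tPASS\n'          >> demo.snp.filtered.vcf
grep -v -w "my_snp_filter" demo.snp.filtered.vcf
rm demo.snp.filtered.vcf
```

One caveat on a real VCF: the `##FILTER=<ID=my_snp_filter,...>` header line also contains the word and gets dropped by `grep -v -w`, which is harmless for most downstream tools but worth knowing.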
Congrats, you have mastered variant calling! From here, further downstream analysis depends on your objective, whether that is looking at coding regions, non-coding regions, etc.
Hope this helped, see you in the next write-up!