Sequence Data Quality Control, Part 2: Trimmomatic
Using Trimmomatic to Clean High Throughput Sequence Data
Now that you have assessed the quality of your sequence data using FastQC, it's time to separate the good from the bad. The last step in the quality control process is trimming.
Today I will show you how to use Trimmomatic, a Java-based software tool for trimming poor-quality bases from the reads in your sequence data. Note: make sure a recent version of Java is installed on your Linux system.
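Before downloading anything, it can help to confirm that Java is actually available. The snippet below is a minimal sketch of that check; check_java is a name made up for this example, and the message it prints when Java is missing is only a suggestion.

```shell
# Report whether a Java runtime is on the PATH; Trimmomatic needs one to run.
check_java() {
    if command -v java >/dev/null 2>&1; then
        # java -version writes to stderr, so redirect it to stdout
        java -version 2>&1 | head -n 1
    else
        echo "java not found: install a Java runtime before continuing"
    fi
}

check_java
```

If the first line printed is a version string, you are ready to continue.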
Open your terminal, then go to the Trimmomatic website to get the software.
Under 'Downloading Trimmomatic', next to the latest version of the tool, there should be a hyperlink labeled 'binary'. Hover over it, right-click, and choose 'Copy link address'.
Go to the directory where you would like to install the tool. Mine will be installed in sra_data, continuing from our previous demonstration.
Use the command below to download the tool, making sure you are getting the latest version.
wget <insert the binary link address>
For me, the Trimmomatic version as of writing this article is 0.39 so my command will look like this:
wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.39.zip
If a newer version is available when you read this, use its link instead.
Once the download has completed, run the command below, replacing the bracketed text with the name of your Trimmomatic zip file:
unzip [insert your Trimmomatic zip file name]
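Put together, the download and extraction steps for version 0.39 might look like the sketch below. The variable names and the download_trimmomatic function are made up for this example, and the extracted directory name (Trimmomatic-0.39) is an assumption based on the paths used later in this article. The function is defined but not called, so nothing is downloaded until you uncomment the last line.

```shell
# Link copied from the 'binary' hyperlink (version 0.39 at the time of writing).
TRIM_URL="http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.39.zip"
TRIM_ZIP="$(basename "$TRIM_URL")"   # file name wget will save: Trimmomatic-0.39.zip
TRIM_DIR="${TRIM_ZIP%.zip}"          # assumed directory after unzip: Trimmomatic-0.39

# Download, extract, and list the extracted directory; stops at the first failure.
download_trimmomatic() {
    wget "$TRIM_URL" && unzip "$TRIM_ZIP" && ls "$TRIM_DIR"
}

echo "Will download $TRIM_ZIP and extract it to $TRIM_DIR/"
# Uncomment to run the download from your install directory:
# download_trimmomatic
```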
Once the extraction is complete, add the Trimmomatic directory to your PATH environment variable. Please refer to the previous article if you need a refresher on setting PATH.
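As a rough sketch, the PATH step could look like this for the current shell session. The install location is taken from the example paths in this article, so adjust it to your own. Keep in mind that java -jar expects a file path and does not search PATH for the jar, so this sketch also stores the jar location in a variable; TRIMMOMATIC_DIR and TRIMMOMATIC_JAR are hypothetical names chosen for this example.

```shell
# Assumed install location, taken from the example paths in this article.
TRIMMOMATIC_DIR="/home/compbio/sra_data/Trimmomatic-0.39"

# Add the directory to PATH for the current shell session only.
export PATH="$PATH:$TRIMMOMATIC_DIR"

# java -jar needs a file path, so keep the jar location in a variable too.
export TRIMMOMATIC_JAR="$TRIMMOMATIC_DIR/trimmomatic-0.39.jar"

echo "$TRIMMOMATIC_JAR"
```

With the variable set, the jar can be referenced as "$TRIMMOMATIC_JAR" from any directory.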
Using the command provided on the Trimmomatic website, fill in the correct input file names (forward and reverse, since our files contain paired-end reads) and name the output files following the command guide.
If an error message appears, try the command format below instead, which uses the full path to the Trimmomatic jar file. Replace the bracketed text with your own information:
java -jar [insert full pathway of Trimmomatic directory]/trimmomatic-0.39.jar PE -phred33 [input_forward.fq.gz] [input_reverse.fq.gz] [output_forward_paired.fq.gz] [output_forward_unpaired.fq.gz] [output_reverse_paired.fq.gz] [output_reverse_unpaired.fq.gz] ILLUMINACLIP:[pathway to Trimmomatic adapters directory]/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
For example, my command will look like this:
java -jar /home/compbio/sra_data/Trimmomatic-0.39/trimmomatic-0.39.jar PE -phred33 SRR8238941_1.fastq.gz SRR8238941_2.fastq.gz SRR8238941_1_P.fastq.gz SRR8238941_1_UP.fastq.gz SRR8238941_2_P.fastq.gz SRR8238941_2_UP.fastq.gz ILLUMINACLIP:/home/compbio/sra_data/Trimmomatic-0.39/adapters/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
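Because the six file names all derive from the same sample ID, the command is easy to mistype. The sketch below builds it from the sample ID instead. build_trim_cmd is a made-up helper for this example, the install path is the one assumed in this article, and the function only prints the command so you can inspect it before running it.

```shell
# Print the paired-end Trimmomatic command for one sample.
# Usage: build_trim_cmd SAMPLE_ID [MINLEN]
build_trim_cmd() {
    sample="$1"              # e.g. SRR8238941
    minlen="${2:-36}"        # MINLEN value, defaults to 36
    trim_dir="/home/compbio/sra_data/Trimmomatic-0.39"   # assumed install path

    # printf repeats the '%s ' format for every argument, joining them with spaces.
    printf '%s ' \
        java -jar "$trim_dir/trimmomatic-0.39.jar" PE -phred33 \
        "${sample}_1.fastq.gz" "${sample}_2.fastq.gz" \
        "${sample}_1_P.fastq.gz" "${sample}_1_UP.fastq.gz" \
        "${sample}_2_P.fastq.gz" "${sample}_2_UP.fastq.gz" \
        "ILLUMINACLIP:$trim_dir/adapters/TruSeq3-PE.fa:2:30:10" \
        LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15
    printf '%s\n' "MINLEN:$minlen"
}

build_trim_cmd SRR8238941
```

Passing a second argument, for example build_trim_cmd SRR8238941 50, changes only the MINLEN value at the end of the printed command.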
Note: MINLEN:36 is a parameter that tells Trimmomatic to discard reads that are shorter than 36 bases after trimming. For reads shorter than 101 bp I use MINLEN:36; for 150 bp reads I use MINLEN:50.
When the command finishes, you should see the following message:
TrimmomaticPE: Started with arguments:
-phred33 SRR8238941_1.fastq.gz SRR8238941_2.fastq.gz SRR8238941_1_P.fastq.gz SRR8238941_1_UP.fastq.gz SRR8238941_2_P.fastq.gz SRR8238941_2_UP.fastq.gz ILLUMINACLIP:/home/compbio/sra_data/Trimmomatic-0.39/adapters/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
Multiple cores found: Using 4 threads
Using PrefixPair: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
ILLUMINACLIP: Using 1 prefix pairs, 0 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Input Read Pairs: 11568748 Both Surviving: 11084813 (95.82%) Forward Only Surviving: 471921 (4.08%) Reverse Only Surviving: 9407 (0.08%) Dropped: 2607 (0.02%)
TrimmomaticPE: Completed successfully
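The summary line is worth a quick sanity check: the four outcome categories should add up to the number of input read pairs. The sketch below does this for the line shown above; get_count is a made-up helper, and the summary text is pasted in by hand rather than captured from Trimmomatic.

```shell
summary='Input Read Pairs: 11568748 Both Surviving: 11084813 (95.82%) Forward Only Surviving: 471921 (4.08%) Reverse Only Surviving: 9407 (0.08%) Dropped: 2607 (0.02%)'

# Extract the integer that follows a given label in the summary line.
get_count() {
    printf '%s\n' "$summary" | sed -n "s/.*$1: \([0-9][0-9]*\).*/\1/p"
}

total=$(get_count 'Input Read Pairs')
both=$(get_count 'Both Surviving')
fwd=$(get_count 'Forward Only Surviving')
rev=$(get_count 'Reverse Only Surviving')
dropped=$(get_count 'Dropped')

# The four categories should account for every input pair.
echo "sum of categories: $((both + fwd + rev + dropped)) (input pairs: $total)"

# Recompute the share of pairs in which both reads survived.
pct=$(awk -v b="$both" -v t="$total" 'BEGIN { printf "%.2f%%", 100 * b / t }')
echo "both surviving: $pct"
```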
In the directory where your data is stored, you should see two new files (paired and unpaired) for each forward and reverse input file.
Read pairs in which both the forward and reverse reads survived trimming (that is, passed QC) are stored in the paired fastq.gz files. We will use those paired files in the next part of our high-throughput sequencing data analysis.
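Since the paired files hold only pairs in which both reads survived, the forward and reverse paired files should contain the same number of reads. The sketch below checks this, assuming the file names from my example and standard four-line FASTQ records; count_reads is a made-up helper for this example.

```shell
# Count reads in a gzipped FASTQ file, assuming four lines per read.
count_reads() {
    lines=$(gzip -dc "$1" | wc -l)
    echo $(( lines / 4 ))
}

fwd_file="SRR8238941_1_P.fastq.gz"
rev_file="SRR8238941_2_P.fastq.gz"

if [ -f "$fwd_file" ] && [ -f "$rev_file" ]; then
    fwd_n=$(count_reads "$fwd_file")
    rev_n=$(count_reads "$rev_file")
    echo "forward paired: $fwd_n reads, reverse paired: $rev_n reads"
    [ "$fwd_n" -eq "$rev_n" ] || echo "warning: paired files differ in read count"
else
    echo "paired output files not found in $(pwd)"
fi
```

Run it from the directory that holds the Trimmomatic output; decompressing large files can take a little while.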