Sequence Data Quality Control, Part. 2: Trimmomatic

Using Trimmomatic to Clean High Throughput Sequence Data

Now that you have assessed the quality of your sequence data using FastQC, it's time to separate the good reads from the bad. The last step in the quality control process is trimming.

Today I will show you how to use Trimmomatic, a Java-based tool that trims low-quality bases and adapter sequences from your reads and drops reads that end up too short. Note: make sure a recent version of Java is installed on your Linux system.
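Before downloading anything, it can help to confirm that Java is actually available. The snippet below is a minimal sketch, not part of Trimmomatic itself: check_java is a hypothetical helper name, and it only reports whether a java executable is found on your PATH.

```shell
# Print the first line of `java -version` output, or a message if Java is missing.
# check_java is a hypothetical helper name used only for this illustration.
check_java() {
    if command -v java >/dev/null 2>&1; then
        # `java -version` writes to stderr, so redirect it to stdout.
        java -version 2>&1 | head -n 1
    else
        echo "java not found on PATH"
        return 1
    fi
}

check_java || echo "Install a Java runtime before running Trimmomatic."
```

If the version line is printed, you are ready to continue with the steps below.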

  1. Open your terminal, then go to the Trimmomatic website to get the software.

  2. Under 'Downloading Trimmomatic', next to the latest version of the tool, there should be a hyperlink labeled 'binary'. Hover over it, then right-click > 'Copy link address'.

  3. Go to the directory where you would like to install the tool. Mine will be installed in sra_data, continuing from our previous demonstration.

  4. Use the command below to download the tool, making sure you are downloading the latest version.

    wget <insert the binary link address>
    

    At the time of writing, the latest Trimmomatic version is 0.39, so my command looks like this: wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.39.zip

    If a newer version is available when you read this, use its link instead.

  5. Once the download has completed, replace the bracketed text with your Trimmomatic zip file name and run the command below:

    unzip [insert your Trimmomatic zip file name]
    
  6. Once the extraction is complete, add the Trimmomatic directory to your PATH environment variable. Please refer to the previous article if you need a refresher on setting PATH.

  7. Using the command provided on the Trimmomatic website, enter your input file names (forward and reverse, since our files are paired-end reads) and name the output files following the command guide.

    If an error message appears, try the command format below instead, which uses the full path to the Trimmomatic jar file. Replace the bracketed text with your own information:

    java -jar [full path to Trimmomatic directory]/trimmomatic-0.39.jar PE -phred33 [input_forward.fq.gz] [input_reverse.fq.gz] [output_forward_paired.fq.gz] [output_forward_unpaired.fq.gz] [output_reverse_paired.fq.gz] [output_reverse_unpaired.fq.gz] ILLUMINACLIP:[path to Trimmomatic adapters directory]/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
    

    For example, my command looks like this:
    java -jar /home/compbio/sra_data/Trimmomatic-0.39/trimmomatic-0.39.jar PE -phred33 SRR8238941_1.fastq.gz SRR8238941_2.fastq.gz SRR8238941_1_P.fastq.gz SRR8238941_1_UP.fastq.gz SRR8238941_2_P.fastq.gz SRR8238941_2_UP.fastq.gz ILLUMINACLIP:/home/compbio/sra_data/Trimmomatic-0.39/adapters/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

    Note: MINLEN:36 tells Trimmomatic to drop any read that is shorter than 36 bases after trimming. If the reads are shorter than 101 bp, I use MINLEN:36; for 150 bp reads, I use MINLEN:50.

  8. When the run finishes, you should see output similar to the following:

    TrimmomaticPE: Started with arguments:

    -phred33 SRR8238941_1.fastq.gz SRR8238941_2.fastq.gz SRR8238941_1_P.fastq.gz SRR8238941_1_UP.fastq.gz SRR8238941_2_P.fastq.gz SRR8238941_2_UP.fastq.gz ILLUMINACLIP:/home/compbio/sra_data/Trimmomatic-0.39/adapters/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

    Multiple cores found: Using 4 threads

    Using PrefixPair: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'

    ILLUMINACLIP: Using 1 prefix pairs, 0 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences

    Input Read Pairs: 11568748 Both Surviving: 11084813 (95.82%) Forward Only Surviving: 471921 (4.08%) Reverse Only Surviving: 9407 (0.08%) Dropped: 2607 (0.02%)

    TrimmomaticPE: Completed successfully

    In the directory where your data is stored, you should see two new files (paired and unpaired) for each of the forward and reverse input files, four output files in total.
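Since the command in step 7 is long and easy to mistype, here is a minimal sketch of a shell function that assembles it from two values. The function name build_trim_cmd is hypothetical, and it assumes the same file naming as my example (sample_1.fastq.gz and sample_2.fastq.gz as inputs, _P and _UP suffixes for outputs) and version 0.39 of the jar. It only prints the command so you can inspect it before running anything.

```shell
# Hypothetical helper: print the Trimmomatic PE command for one sample.
# $1 = full path to the Trimmomatic directory, $2 = sample prefix.
# Assumes inputs are named <sample>_1.fastq.gz and <sample>_2.fastq.gz.
build_trim_cmd() {
    local trim_dir="$1"
    local sample="$2"
    # printf repeats the '%s ' format for every argument, joining them with spaces.
    printf '%s ' \
        java -jar "${trim_dir}/trimmomatic-0.39.jar" PE -phred33 \
        "${sample}_1.fastq.gz" "${sample}_2.fastq.gz" \
        "${sample}_1_P.fastq.gz" "${sample}_1_UP.fastq.gz" \
        "${sample}_2_P.fastq.gz" "${sample}_2_UP.fastq.gz" \
        "ILLUMINACLIP:${trim_dir}/adapters/TruSeq3-PE.fa:2:30:10" \
        LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
    printf '\n'
}

# Dry run: show the command for the sample used in this article.
build_trim_cmd /home/compbio/sra_data/Trimmomatic-0.39 SRR8238941
```

Once the printed command looks right, you can copy it into the terminal and run it from the directory that holds your fastq.gz files.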

Read pairs in which both the forward and reverse reads survived (that is, passed QC) are stored in the paired fastq.gz files. We will use those paired files in the next part of our high-throughput sequencing data analysis.
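As a quick sanity check on the summary line from step 8, the four outcome counts (both surviving, forward only, reverse only, dropped) should add up to the number of input read pairs. The sketch below pulls the counts out of that line with awk. The function name summarize_trim is hypothetical, and the field positions assume the summary line has exactly the layout shown in step 8.

```shell
# Summary line copied from the Trimmomatic output in step 8.
summary='Input Read Pairs: 11568748 Both Surviving: 11084813 (95.82%) Forward Only Surviving: 471921 (4.08%) Reverse Only Surviving: 9407 (0.08%) Dropped: 2607 (0.02%)'

# Hypothetical helper: check that the outcome counts account for every input pair.
summarize_trim() {
    echo "$1" | awk '{
        # Whitespace-separated fields: $4 = input pairs, $7 = both surviving,
        # $12 = forward only, $17 = reverse only, $20 = dropped.
        accounted = $7 + $12 + $17 + $20
        printf "pairs_in=%d accounted=%d both_pct=%.2f\n", $4, accounted, 100 * $7 / $4
    }'
}

summarize_trim "$summary"
```

For the run in this article, pairs_in and accounted are both 11568748, so every input pair is accounted for.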
