RNAseq Analysis: Count Matrix

We learned how to run quality control, align sequences and generate a DEG analysis using the Cufflinks tool. What if you just want to create a count matrix? In this write-up, I'm going to show you how to make one with featureCounts.

Image credit: Lafzi et al. Tutorial: guidelines for the experimental design of single-cell RNA sequencing studies, Nature Protocols 2018.

Things you need beforehand:

  1. R/RStudio: you will need R version 4.x on Linux (Ubuntu), Windows or Mac.

    You can download and follow the installation instructions here (R) and here (RStudio). Make sure your Linux is up to date. If you need some visual guidance on installation, this link is a good place to start.

  2. BAM files

  3. The gene annotation file (.gtf)

  4. featureCounts: it ships as part of the Subread package, which you can install with Conda

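Install featureCounts

  1. Open up your Terminal and install the Subread package, which includes featureCounts, using this command. Conda installs it into whichever environment is currently active, so activate a dedicated environment first if you want to keep it separate.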

     conda install -c bioconda subread
    

    Check that it installed correctly by typing this command

     featureCounts -v
    
  2. To generate the count matrix, the command format looks like this

     featureCounts -a /path/to/genome_annotation_file/genes.gtf -o [output file name].out.txt -T 8 /path/to/bam_files
    
     ### Note: ###
     # -p specifies that the input contains paired-end reads, so fragments (or templates)
     #    can be counted instead of reads. This is only applicable for paired-end reads.
     #    In Subread 2.0.2 and later, also add --countReadPairs to actually count fragments.
     # -O assigns reads to all their overlapping meta-features.
     # -T number of threads
     # -a is the genome annotation file
     # -o output file name
     # -s specifies strand-specific read counting. 0 for unstranded reads, 1 for stranded reads and 2 for reversely stranded reads.
     #   This depends on the library used in the sequencing protocol.
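
As a rough sketch of how the optional flags in the note fit together, the script below builds a featureCounts command for a paired-end, reverse-stranded library. The library type, the placeholder paths and the RUN switch are assumptions made for illustration, not settings from this write-up, and --countReadPairs assumes Subread 2.0.2 or newer. By default the script only prints the command so you can check it before running anything.

```shell
#!/bin/sh
set -eu

# Placeholder paths (assumptions): point these at your own files.
GTF="/path/to/genome_annotation_file/genes.gtf"
BAM_DIR="/path/to/bam_files"

# Flags used below:
#   -p --countReadPairs  paired-end input, count fragments instead of reads
#   -s 2                 reverse-stranded library (assumption for this sketch)
#   -T 8                 eight threads
# The command is stored in the positional parameters so that paths
# containing spaces stay intact.
set -- featureCounts -a "$GTF" -o count.out.txt -T 8 \
       -p --countReadPairs -s 2 "$BAM_DIR"/*.sort.bam

# Print the command for review.
cmd_line=$(printf '%s ' "$@")
printf '%s\n' "$cmd_line"

# RUN is a hypothetical switch for this sketch: set RUN=1 to execute.
if [ "${RUN:-0}" = "1" ]; then
  "$@"
fi
```

If your library is unstranded or single-end, drop the -s 2 or the -p --countReadPairs part accordingly.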
    

    We will use the BAM files we generated in the earlier alignment step. To give you an example, my command looks like this

    featureCounts -a /home/compbio/sra_data/Mus_musculus/UCSC/mm10/Annotation/Genes/genes.gtf -o count.out.txt -T 8 /home/compbio/sra_data/*.sort.bam

    Note: the *.sort.bam wildcard matches every sorted BAM file in that directory, so all of them are counted in a single run. This saves time compared with processing the files one by one, especially when you have a lot of data to analyze.

    Once it's finished you should see the count table (count.out.txt in my example) and a summary file with .summary appended to the same name (count.out.txt.summary).
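
Before moving on, it can be worth checking how many reads were actually assigned to genes. The summary file that featureCounts writes next to the count table is tab-delimited, with one row per status (Assigned, Unassigned_NoFeatures and so on) and one column per BAM file. The sketch below is one possible way to turn it into a percentage per sample; assigned_rate is a hypothetical helper, not part of Subread, and the numbers in the toy file are invented purely to show the layout.

```shell
#!/bin/sh
set -eu

# assigned_rate FILE: hypothetical helper (not part of Subread).
# For each BAM column in a featureCounts summary file, print the share of
# the "Assigned" row relative to the sum of all status rows.
assigned_rate() {
  awk 'BEGIN { FS = "\t" }
       NR == 1 { n = NF; for (i = 2; i <= n; i++) name[i] = $i; next }
       {
         for (i = 2; i <= n; i++) {
           total[i] += $i
           if ($1 == "Assigned") assigned[i] = $i
         }
       }
       END {
         for (i = 2; i <= n; i++) {
           pct = (total[i] > 0) ? 100 * assigned[i] / total[i] : 0
           printf "%s\t%.1f%%\n", name[i], pct
         }
       }' "$1"
}

# Toy summary file with invented numbers, only to show the layout.
tmp=$(mktemp -d)
{
  printf 'Status\tctrl1.sort.bam\ttreat1.sort.bam\n'
  printf 'Assigned\t800\t900\n'
  printf 'Unassigned_NoFeatures\t150\t50\n'
  printf 'Unassigned_Ambiguity\t50\t50\n'
} > "$tmp/count.out.txt.summary"

assigned_rate "$tmp/count.out.txt.summary"
```

On your own data, call assigned_rate with the path to your real summary file. A very low assigned percentage often points to a mismatched annotation file or a wrong -s setting.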

  3. Export the count table (.out.txt file) to your local PC. It is a tab-delimited text file, so you can open it using Excel.

  4. Depending on what analysis you are going to do, you may want to clean up some columns or rows in that file. For example, the first row is a comment line that records the command you ran, and you will not need it. For a simple DEG analysis, you only need the Geneid column and the sample columns; the Chr, Start, End, Strand and Length columns can be removed. The sample columns are named after the paths to the BAM files, so you do want to rename those to your actual sample names. Once you have renamed and cleaned it up, you can save the file in .txt, .csv or Excel format.
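
If you would rather do this cleanup on the command line than by hand, the steps above can be sketched with awk. This is a minimal sketch, not the only way to do it: clean_counts is a hypothetical helper name, it assumes the standard featureCounts layout (a comment line, then Geneid, Chr, Start, End, Strand, Length, then one column per BAM file), and it assumes the BAM files end in .sort.bam as in the example command. The toy input uses invented gene IDs and counts.

```shell
#!/bin/sh
set -eu

# clean_counts IN OUT: hypothetical helper (not part of Subread).
# Drops the comment line, keeps Geneid plus the sample columns (7 onward),
# and shortens each BAM path in the header to a bare sample name.
clean_counts() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^#/ { next }                       # skip the comment line
       !seen_header {                      # header row: rename sample columns
         seen_header = 1
         printf "%s", $1
         for (i = 7; i <= NF; i++) {
           name = $i
           sub(/^.*\//, "", name)          # strip the directory part
           sub(/\.sort\.bam$/, "", name)   # strip the assumed file suffix
           printf "%s%s", OFS, name
         }
         printf "\n"
         next
       }
       {                                   # data rows: gene ID and counts only
         printf "%s", $1
         for (i = 7; i <= NF; i++) printf "%s%s", OFS, $i
         printf "\n"
       }' "$1" > "$2"
}

# Toy input with invented gene IDs and counts, only to show the layout.
tmp=$(mktemp -d)
{
  printf '# Program:featureCounts; Command:(toy example)\n'
  printf 'Geneid\tChr\tStart\tEnd\tStrand\tLength\t/home/compbio/sra_data/ctrl1.sort.bam\t/home/compbio/sra_data/treat1.sort.bam\n'
  printf 'geneA\tchr1\t100\t500\t+\t401\t10\t25\n'
  printf 'geneB\tchr2\t200\t900\t-\t701\t0\t7\n'
} > "$tmp/count.out.txt"

clean_counts "$tmp/count.out.txt" "$tmp/count_matrix.txt"
cat "$tmp/count_matrix.txt"
```

The result is a tab-delimited matrix with one Geneid column and one column per sample. If you want names other than the file names, you still need to edit the header row yourself.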

Congrats, you have made your count matrix! I hope this write-up has helped you in some way. The next write-up will be about doing DEG analysis using DESeq2.
