High-Throughput Sequence Data Pipeline, Part 2: Sequence Mapping
Reference vs. De novo Transcriptome Assembly
Continuing our series, this part covers sequence mapping. Referring back to the diagram below, once the sequence data has been cleaned (quality control), the next step is to align the sequence reads.
What does aligning mean? It means "arranging the sequences of DNA or RNA or protein to identify regions of similarity that may have a consequence of functional, structural, or evolutionary relationships between sequences," according to the Encyclopedia of Bioinformatics and Computational Biology (2019).
There are different types of protocols and analyses for aligning sequence reads. If the sample species has a reference genome, you map (align) the sequence reads to that genome and construct the transcripts from the alignment. This process is known as reference-based transcriptome assembly, or genome mapping.
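To make the idea of mapping reads to a reference concrete, here is a minimal Python sketch. It is a toy under stated assumptions, not how real aligners work: the function name `map_reads`, the reference string, and the reads are all made up for illustration, and it only finds exact matches. Real aligners index the genome and tolerate mismatches, gaps, and reads that span splice junctions.

```python
def map_reads(reference, reads):
    """Return {read: [0-based start positions]} for exact matches in the reference.

    Toy illustration only: exact string matching, forward strand only.
    """
    hits = {}
    for read in reads:
        positions = []
        start = reference.find(read)
        while start != -1:
            positions.append(start)
            # Keep searching one base further along to catch repeated matches.
            start = reference.find(read, start + 1)
        hits[read] = positions
    return hits


# Hypothetical example data (not from the series' example dataset).
reference = "ACGTTGCATGCCGATAGC"
reads = ["TTGCA", "GCCGA", "AAAAA"]

mapping = map_reads(reference, reads)
for read, positions in mapping.items():
    print(read, positions)
```

A read with an empty position list, such as `AAAAA` here, is the toy equivalent of an unmapped read.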
But what if there is no reference genome, or only a limited one, to map against? In that case, you first need to assemble the sequence reads, based on their overlapping regions, into transcripts (or contigs or scaffolds). This process is called de novo transcriptome assembly.
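The idea of joining reads by their overlapping regions can be sketched in a few lines of Python. This is a simplified greedy merge written for illustration, assuming error-free reads from a single strand; the function names, the `min_len` parameter, and the reads are hypothetical. Production assemblers generally rely on graph-based methods and handle sequencing errors, which this sketch does not.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(reads)
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            # Nothing left to merge: what remains are the contigs.
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical reads that overlap by three bases each.
reads = ["ATGGCG", "GCGTAC", "TACCTA"]
contigs = greedy_assemble(reads)
print(contigs)
```

With these three reads, the overlaps `GCG` and `TAC` chain everything into a single contig. Reads that share no overlap of at least `min_len` bases would be left as separate contigs.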
If you are a visual learner, here's a diagram describing the two methods:
Once the sequence reads have been mapped, the next step is to quantify the read counts associated with the genes and then identify genes that are expressed differentially (differential gene expression analysis). I will discuss this process in part 3 of our series.
In the next article, I'll demonstrate how to do genome mapping, also known as reference-based transcriptome assembly, using the same example data from our previous demonstrations.
Image Reference: Martin, J. A., & Wang, Z. (2011). Next-generation transcriptome assembly. Nature Reviews Genetics, 12(10), 671–682. doi:10.1038/nrg3068