Today I will introduce you to two methods for RNA and protein analysis at massive scale - RNA sequencing (RNA-seq) and tandem mass tag mass spectrometry (TMT-MS). Both methods produce an enormous amount of data that needs to be properly analysed in order to extract the biological meaning of the phenomena you're investigating.
Transcriptome and central dogma of molecular biology
Before we dive into the description of RNA sequencing methodology, first we need to explain certain basics of molecular biology.
According to the central dogma of molecular biology, genetic information contained within the DNA molecule is transcribed into pre-mRNA, which is further processed and spliced into mature mRNA. Mature mRNA is then exported from the nucleus to the cytoplasm, where it is translated into a protein product, which is modified, folded into its final conformation and transported into the right cellular compartment.
Now that we know what mRNA is, we can define the transcriptome, which represents the complete set of mRNA transcripts contained in one cell.
RNA sequencing allows RNA analysis through direct determination of cDNA sequence, and it's based on high-throughput next-generation DNA sequencing (NGS) technologies.
Briefly, mRNA is isolated from samples (in our case, a transgenic cell line I personally made) and reverse-transcribed into cDNA (complementary DNA). This collection of cDNA is called a cDNA library and contains cDNA fragments with adapters attached on both ends. After that, each molecule is sequenced and millions of short reads are produced.
After the sequencing process is finished, filtering of reads and adapter trimming are performed, followed by de novo assembly of transcripts (if a reference genome is not available) or mapping (alignment) of sequencing reads to a reference genome or transcriptome.
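To make the filtering and adapter trimming step a bit more concrete, here is a toy sketch in Python. It is not the pipeline used for my data: the adapter sequence, the reads and the `min_length` and `min_overlap` thresholds are made up for illustration. It also only handles exact matches, while real trimming tools tolerate sequencing errors and use base quality scores.

```python
def trim_adapter(read, adapter, min_overlap=3):
    """Remove an exact adapter match from the 3' end of a read (toy version)."""
    # Case 1: the whole adapter is present somewhere in the read.
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    # Case 2: the read ends with only the first k bases of the adapter.
    longest = min(len(adapter), len(read)) - 1
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read


def filter_reads(reads, adapter, min_length=9):
    """Trim every read, then drop reads that ended up too short."""
    trimmed = (trim_adapter(r, adapter) for r in reads)
    return [r for r in trimmed if len(r) >= min_length]


# Hypothetical example data (not from the experiment described here).
ADAPTER = "AGATCGGAAGAGC"
reads = [
    "ACGTACGTACAGATCGGAAGAGC",  # full adapter at the end
    "ACGTACGTTTAGATC",          # partial adapter at the end
    "GGGGCCCC",                 # no adapter, but too short to keep
]
clean_reads = filter_reads(reads, ADAPTER)
print(clean_reads)
```

In this example the first two reads lose their adapter sequence and are kept, while the third is discarded only because it is shorter than `min_length`.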
Besides the steps above, many other "adjustments" of the raw data have to be performed before the expression values are finally obtained:
Screenshot of my RNA-seq data
In the data I'm analyzing, expression values are represented as TPM (Transcripts Per Million). This means that the data are normalized to gene length and sequencing depth, which allows us to compare expression values between different samples.
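For readers who like to see the arithmetic, TPM can be sketched in a few lines of Python: first divide each gene's read count by its length in kilobases, then rescale so that all values in the sample add up to one million. The gene names, counts and lengths below are invented purely for illustration.

```python
def tpm(counts, lengths_bp):
    """Convert raw read counts of one sample to TPM.

    counts     -- dict: gene name -> number of mapped reads
    lengths_bp -- dict: gene name -> gene (transcript) length in base pairs
    """
    # Step 1: normalize for gene length (reads per kilobase).
    rpk = {gene: counts[gene] / (lengths_bp[gene] / 1000) for gene in counts}
    # Step 2: normalize for sequencing depth ("per million" scaling factor).
    scale = sum(rpk.values()) / 1_000_000
    return {gene: value / scale for gene, value in rpk.items()}


# Hypothetical counts and lengths, not taken from the real data set.
counts = {"geneA": 100, "geneB": 200, "geneC": 300}
lengths = {"geneA": 1000, "geneB": 2000, "geneC": 3000}

tpm_values = tpm(counts, lengths)
print(tpm_values)
```

Here the longer genes have more reads only because they are longer, so after length normalization all three genes end up with the same TPM, and the values within a sample always sum to one million.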
In my RNA-seq data I have 9616 different genes for which I have to analyze expression values and compare them to control values!
To filter the data, I perform a standard Student's t-test to obtain a p value for each gene, which helps me discard differences that are not statistically significant.
After filtering for p < 0.05, I was left with 5102 statistically significant, differentially expressed genes - still sounds like a lot, don't you think? :)
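Below is a minimal sketch of this kind of filtering, using only the Python standard library. It assumes a two-sample Student's t-test with pooled variance on a few replicates per condition. The gene names and expression values are invented, and the p value is obtained by simple numerical integration of the t distribution, so in practice you would more likely call a statistics library such as SciPy (`scipy.stats.ttest_ind`).

```python
import math
from statistics import mean, variance


def t_pdf(x, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Two-sided p value: 1 - 2 * area under the density between 0 and |t|.

    The area is approximated with Simpson's rule (steps must be even).
    """
    a = abs(t)
    if a == 0:
        return 1.0
    h = a / steps
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1 - 2 * total * h / 3)


def students_t_test(a, b):
    """Two-sample Student's t-test assuming equal variances.

    Assumes the replicates are not all identical (non-zero pooled variance).
    """
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, two_sided_p(t, df)


# Hypothetical expression values: gene -> (control replicates, treated replicates)
expression = {
    "geneA": ([10.0, 11.0, 9.0], [20.0, 21.0, 19.0]),
    "geneB": ([5.0, 6.0, 7.0], [5.5, 6.5, 6.3]),
}

p_values = {gene: students_t_test(ctrl, trt)[1] for gene, (ctrl, trt) in expression.items()}
significant = {gene: p for gene, p in p_values.items() if p < 0.05}
print(significant)
```

In this made-up example only `geneA` passes the p < 0.05 cutoff, because its control and treated replicates barely overlap, while `geneB` shows almost no difference between conditions.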
Proteome analysis by TMT-MS
So we have determined the expression values for mRNA extracted from our model system, but is that enough to draw conclusions on what's going on in our cells over-expressing our gene of interest?
Well, not exactly, because if you remember the central dogma, mRNA is translated into protein in cells - so it would be useful to know the amount of each protein product as well.
One of the ways to quantify proteins within the sample is to use tandem mass tag mass spectrometry (TMT-MS).
Tandem mass tags (TMT or TMTs) are chemical labels used for the identification and quantification of biological macromolecules (e.g. proteins). TMTs are isobaric mass tags, meaning that all tags in a set have the same total mass. Each tag pairs a lighter or heavier reporter region with a correspondingly heavier or lighter balancing region, so that the entire tag, when attached to a peptide, adds the same mass shift regardless of which sample it labels. When the labeled peptides are fragmented in the mass spectrometer, the reporter regions are released as ions of different masses, and their intensities reveal the relative amount of each peptide in each sample.
What we get as output after raw data analysis is the normalized relative abundance (%) of each protein:
Screenshot of my TMT data
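One simple way to picture a relative abundance in percent is to divide the reporter ion intensity measured for a protein in each sample by the total across all samples. The sketch below does exactly that. It is an assumption made for illustration, not necessarily how the software behind my data calculates its values, and the sample names and intensities are invented.

```python
def relative_abundance(intensities):
    """Express each sample's reporter intensity as a percentage of the total.

    intensities -- dict: sample (TMT channel) name -> summed reporter ion intensity
    """
    total = sum(intensities.values())
    if total == 0:
        # Protein not quantified in any channel.
        return {sample: 0.0 for sample in intensities}
    return {sample: 100 * value / total for sample, value in intensities.items()}


# Hypothetical reporter ion intensities for one protein.
protein_x = {"control": 1000.0, "overexpression": 3000.0}

abundance = relative_abundance(protein_x)
print(abundance)
```

With these numbers the protein comes out at 25% in the control and 75% in the over-expressing sample, so the percentages across samples always add up to 100.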
Initially, I got 7260 detected proteins, and after I performed Student's t-test and obtained p values for each protein (same as with the RNA-seq data), I ended up with 2844 statistically significant (p < 0.05), differentially expressed proteins.
In the next Lab Diaries post, I will explain which method I use for the analysis of such large-scale RNA-seq and TMT-MS data, and how we obtain biological meaning from such an enormous amount of data.
Until then, relax and keep steemSTEM! ;)
 Griffith, M., Walker, J. R., Spies, N. C., Ainscough, B. J., & Griffith, O. L. (2015). Informatics for RNA sequencing: a web resource for analysis on the cloud. PLoS computational biology, 11(8), e1004393.
 Tandem mass tag