Specific information pertaining to your RNASeq project 20191121_RetinaTest is shown in Figure 1 below. Check first to ensure that the designations in the SampleName and Factor columns match with the associated fastq file in column FileName1. It is critically important that these designations are correct as any contrasts made by the differential gene expression (DGE) analyses are conditioned on these values. If sequencing included a paired-end read, an additional column named FileName2 will be present and should also be inspected for accuracy. If technical replicates were including in your experiment, grouping information will be contained in the replicate column. The two columns labeled SampleLong and Experiment are not used currently and may be ignored.
The directory named deliverable contains numerous files of expression and DGE results including quality control (QC) metrics and data files. These files can be accessed directly and many are also are accessible from hyperlinks contained within this html document. Content description and explanations of these files are given in the appropriate sections below.
The designated contrasts made during the DGE analysis are shown in Figure 2 below. Each specific contrast is considered independent from the entire dataset, which is referred to as a sample. Please check to ensure that all of the contrasts requested are present.
Due to the imperfect nature of the sequencing process and limitations of the optical instruments (batch effects, sequencing error, GC bias, duplicate reads, copy number variation, mapability), base calling has inherent uncertainty associated with it. The magnitude of uncertainty in each base call is represented by an error probability or Phred (pronounced like “fred”) score. This parameter is denoted formally as \(Q\) and is proportional to the probability \(p\) that a base call is incorrect, where \(Q = -10log~10~(p)\).
Any biological inference or conclusion depends on the overall quality of your data. These following quality control (QC) statistics computed on your data are designed to help you evaluate the overall technical quality of your experiment. You should always examine if your raw sequence data exhibit good quality and lack obvious problems or biases, which may affect how you can ultimately use it in your research.
The trimming software skewer (Jiang et al. 2014) was used to preprocess raw fastq files. Skewer implements an efficient dynamic programming algorithm designed to remove adapters from Illumina-generated sequence reads.
During sequencing library construction, short oligonucleotides are ligated to the ends DNA fragments to be sequenced, so that they can be combined with primers for PCR amplification. While sequencing the fragments, if the read length is greater than that of the target DNA, the adapter sequence next to the unknown DNA sequence of interest is also sequenced, sometimes only partially. To recover the true target DNA sequence, it is required to identify the adapter sequence and remove it. Moreover, to improve further the quality of the sequence reads, skewer also trims the 3’ end of the fragment until a Phred quality of 20 is reached. This is equivalent to a base call accuracy of 99%. More information regarding Phred scores can be obtained from Ewing et al. (1998) and in Ewing and Green (1998).
After adapter trimming, the BRC RNASeq pipeline computes eight QC statistics on your read data including: (1) combined per cycle base quality, (2) per cycle base frequencies, (3) per cycle average base quality, (4) relative 3k-mer diversity, (5) Phred quality distribution, (6) mean quality distribution, (7) read length distribution and (8) read occurrence distribution.
The QC results for the FASTQ data are accessible from the link in Table 1 below.
Experiment | Analysis | Report |
---|---|---|
20191121_RetinaTest | RQC | FASTQ quality |
To gauge the overall quality of the read data, scan down the rows for each quality metric (a-h, described below). Clicking on any image will enlarge it. Graphs of summarized data should be fairly uniform for all samples in each QC metric. If one or more metrics in a sample [library] are considerably different, or if they are all highly variable, it may be indicative of a problem that warrants additional attention. Important note: because the number of reads generated in a typical RNASeq experiment is substantial (ca. 30M), read-level QC metrics are based on a randomly sampled subset comprising 5% of the total read number.
(a) combined per cycle base quality: This graph shows Phred-scaled quality scores of reads with respect to counts. The Y-axis depicts the quality scores for each base at each sequencing cycle (X-axis; 1 cycle = 1 base). Higher qualities indicate more reliable base calls. The solid blue-gray and jade-green lines depict the mean and Q50 quantile Phred scores, respectively. The two dashed jade-green lines represent the Q25 (lower) and Q75 (upper) quantiles.
Ideally, a single reddish band would occur at a quality score of 40 along the full range of sequenced reads, except for the first few cycles. In practice, this is not often realized as the quality of calls on NGS sequencing platforms degrade as the number of cycles increases. Therefore, it is not uncommon to observe a reduction in base quality (bands become increasingly yellow) toward the end of a read. As a general rule, Phred scores > 30 are sufficient for use in RNASeq analyses. This is equivalent to a base call accuracy of 99.9%.
(b) per cycle base frequencies: This graph depicts the frequency of each nucleotide base position across the length of the read. The relative frequency of each base should reflect that of the entire genome. During RNASeq library construction, targets are primed with random hexamers and will influence the base composition during early sequence cycles, especially in cycles 1-10. Enrichment with random hexamers is not entirely random as there is an incomplete sampling of all possible hexamers. This effect is known and does not affect the base content of the read. In other words, it does not represent any individually biased sequences.
(c) per cycle average base quality: This graph is similar to the representation of the per cycle base quality. On the Y-axis, the Phred scaled quality is plotted as a function of cycle number (X-axis). In contrast to the per cycle base quality graph, average qualities for each nucleotide base are plotted separately. Quality scores > 30 are good for RNASeq analyses and equivalent to 99.9% base call accuracy.
(d) trinucleotide frequency: The relative frequency of trinucleotides (3-mer) may indicate the presence of over-represented sequences in your library. This is based on the assumption that any small sequence should not have a composition bias within a diverse library of sequences. However, this can vary significantly depending on the organism or targets sequenced. If one or more 3-mers dominate the remaining 3-mers, it may indicate the presence of a contaminating sequence, perhaps an adapter dimer (rare) or biased sequence composition (more common).
(e) Phred quality distribution: This graph is related to graphs 1 (per cycle base quality) and 2 (per cycle base frequencies) but instead depicts the percentage of reads above a minimum Phred quality.
(f) read quality distribution: This graph shows the distribution of mean quality in terms of read percentage. A right skewed distribution of Phred scores greater than 30 is preferred, and equivalent to at least a 99.9% base call accuracy.
(g) read length distribution: When the RNASeq libraries were sequenced, a targeted read length was was chosen, e.g. 50 bp, 75 bp, 100 bp etc. Ideally, all reads would reflect this length but multiple phenomena can occur that result in a mixture of shorter read lengths. This graph shows the distribution of read lengths as a proportion of all reads. A successful sequencing run should produce a majority of reads equal to the number of cycles performed.
(h) read occurrence distribution: This graph summarizes the proportion over all reads that are represented among each sequence exactly N times. Occurrences represented by 0.1k, 1k, 10k, and greater than 10k denote ranges corresponding to 11-100, 101-1,000, 1,001-10,000, and greater than 10,000 instances.
Unbiased library preparation and sequencing should yield a majority of unique reads (N=1) and rapidly decrease to zero. If strong spikes occur in any of the defined ranges, it may reflect a technical issue, often when sample preparation resulted in biased PCR amplification of specific targets. In RNASeq, replicate reads may be due to artifacts of sample preparation, rather than differential representation of target sequence in the original unamplified sample. Over-representation (duplication) is often observed in low input library preparations.
Condensed to a single page, QC results for readpair 1 are also located here: Readpair1. If sequencing included a paired-end read, the QC results for readpair 2 are located here: Readpair2. These fastq quality statistics are estimated from a randomly sampled portion (20%) of the total reads contained in the fastq file.
To identify the transcripts present in a specific sample as well as associated expression levels, the genomic origin of the sequenced cDNA fragments must be determined. The assignment of sequencing reads to the most likely locus of origin is called read alignment or mapping. One challenge of short-read alignment is to map millions of reads accurately in a reasonable amount of time, but in the presence of sequencing errors, genomic variation, incomplete sequencing annotation etc. Many alignment programs exist and employ various strategies meant to find a balance between mapping fidelity, error tolerance, and time.
One challenge of aligning RNASeq data is the spliced alignment of exon-exon-spanning reads, possibly when multiple different transcripts (and isoforms) of the same gene also exist. Some alignment programs address this problem by aligning only to the transcriptome. Although some alternatives exist, this approach is limited to known transcripts and thus requires accurate sequence annotation. In many cases, reads will also overlap with more than one isoform and introduce mapping ambiguity. The most effective current solution uses existing gene annotation for the placement of spliced reads in addition to attempting to identify novel splice events based on reads that cannot be aligned to the reference genome and/or transcriptome.
The trimmed single- or paired-end reads are aligned against the selected reference genome sequence using STAR (Spliced Transcripts Alignment to a Reference) (Dobin et al. 2013). STAR is a splice-junction aware RNASeq alignment algorithm that uses suffix arrays and a mapping algorithm similar to those used in whole-genome alignment tools to align transcripts to a genomic reference. STAR is substantially faster than other RNASeq aligners, and appears to outperform other aligners in both sensitivity and specificity using both simulated and real (replicated) RNASeq data (Engström et al. 2013).
While informative, the sequence-only metrics reported above are alone insufficient to assess the usability of RNASeq data. Data summaries derived from read mapping to the genome (transcriptome) provide additional QC information. For example, sequencing depth should be saturated (or nearly so) before performing many basic RNASeq analyses including expression profiling, alternative splicing, novel isoform identification and transcriptome assembly. As part of this report, mapping statistics for each fastq file were also computed to help gauge various QC metrics including comparisons at the (1) sample-level (2) chromosome-level and (3) aggregate gene-level.
Sample-level information provides the broadest level of mapping QC and is shown in Figure 3. It includes the number of fastq reads used in mapping, the total number of alignments and percentages distributed across primary and secondary alignments. This report is best used to gauge how well reads from each sample mapped (aligned) to the specific reference genome chosen. An alignment of RNASeq reads is usually considered successful if the primary mapping rate is > 70 percent. Poor primary mapping (unique alignments) may indicate an inappropriate or poorly assembled reference genome. It may also suggest significant repeated regions, which would be reflected in an increase in secondary mapping (non-unique alignments).
The BRC recommends that you check visually the results of the sample-level mappings to ensure the reads align to the expected regions, preferably without too many mismatches. The Integrative Genomics Viewer (IGV) application is a software that can be downloaded with an academic email address from IGV. It is platform independent and only requires an up-to-date Java installation.
Indexed BAM files of all alignments are available in the directory deliverable/data. To view the coverages, first start IGV and load your genome of interest (IGV has most genomes available for download from within the IGV browser). Second, select Load from File… from the File menu. A dialog box will open. Navigate to and select the BAM file corresponding to the specific sampleName you are interested in viewing.
Chromosome-level information in Table 2 (Chromosome) provides more detailed measures of mapping QC across each chromosome and/or contig included in the reference genome. These files reported metrics include:
Sample | Analysis | chromosomeLevel |
---|---|---|
SRR1804235 | AQC | SRR1804235 |
SRR1804236 | AQC | SRR1804236 |
SRR1804237 | AQC | SRR1804237 |
SRR1804238 | AQC | SRR1804238 |
SRR1804239 | AQC | SRR1804239 |
SRR1804240 | AQC | SRR1804240 |
SRR1804241 | AQC | SRR1804241 |
SRR1804242 | AQC | SRR1804242 |
Aggregate gene-level alignment comparisons Table 3 (aggregate) provides useful diagnostic information that pertain to specific aspects of the RNASeq mapping quality. Metrics reported include: (1) gene body coverage, (2) GC content, (3) read duplication, (4) splicing events, (5) splice junctions, (6) splice junction saturation, (7) library strandedness, and for paired-end reads (8) read-pair lengths.
Experiment | Analysis | Report |
---|---|---|
20191121_RetinaTest | AQC | Aggregate quality |
(1) gene body coverage: This metric reports the aggregated distribution of read coverage in all annotated genes. Genes are scaled first to a uniform width of 100 nucleotide bins. Then, the number of reads covering each bin position is recorded and displayed as a percentile in the 5’ to 3’ direction. Ideally, there should be a uniform upside-down “U” shape spanning the entire 5’ to 3’ range. A hallmark of degraded RNA, which often occurs during sample acquisition and initial extraction, is reflected as a 3’ bias on the gene body coverage plot. In other words, the 5’ end of the plot will appear “eroded”. In cases where a stranded RNASeq library was created (almost exclusively at UW-Madison Biotechnology Center), there may be a slight read increase biasing the 5’ portion of the gene body. However, this is usually not discernible.
The first RNA QC metric you should have obtained relies on the Agilent Bioanalyzer 2100 (or similar device). Intact Eukaryotic/Prokaryotic rRNA should yield clear large subunit 28S/23S and small subunit 18S/16S rRNA bands. The large subunit rRNA band is approximately twice as intense as the small subunit rRNA band (2:1 ratio). As rRNA degrades, the 2:1 ratio of high quality rRNA decreases, and low molecular weight rRNA begins to accumulate. Under normal circumstances, it is very difficult, if not impossible, to gauge directly the integrity of mRNA, which typically appears as a very broad smear close to the detection baseline. However, if rRNA showed signs of degradation in the initial QC, it is highly likely that mRNA will too and be reflected in the gene body coverage plot.
(2) GC composition: Nucleotide GC-content varies throughout the genome and is often associated with gene functionality. For a variety of technical reasons, it is well known that GC-rich and GC-poor fragments tend to be under-represented in RNASeq experiments making it difficult to infer true expression levels from biased read count measures. However, for DGE analyses in the same species, the GC content is usually ignored, as the same gene is compared but in only a different condition: the GC content remains the same. At the isoform level, there could be larger GC differences if one isoform had significantly altered GC content than another. Here, we include the distribution of GC content across all mapped reads. The vertical blue dashed line intersecting the x-axis denotes the average G+C content. The curve represents a Gaussian fit to the data. If the curve and histogram bars do not coincide well (sensu lato) or the histogram appears multi-modal, the sequence composition may be biased. This often occurs when an unrelated contaminating sequence is present (e.g. bacteria in a cell culture) or in intentionally enriched library preparations (e.g. miRNA).
(3) read duplication: Current NGS workflows usually involve PCR steps at some point, which may create PCR artifacts and over-amplification of reads. Apart from technical PCR duplication, a strong source for biological read duplicates exists in RNASeq experiments. More precisely, for a given gene length and sequencing depth there exists an expression level beyond which it is not possible to sample the gene sequence without duplicating an already existing read. Fortunately, with sufficient total RNA inputs (anything > 10 ng) read duplication does not pose a problem. In this instance, the number of reads per base assigned to a gene in an acceptable RNASeq data set is expected to be proportional to the abundance of its transcripts in the sample. For transcripts with low expression we expect read duplication to happen rarely by chance, while for highly expressed transcripts - and depending on the total sequencing depth - we expect read duplication to happen more often. However, read duplication becomes an important QC aspect in low-input (< 10 ng total RNA) library preparations.
An evaluation of any trend toward duplication can be evaluated by plotting the normalized number of counts per gene (e.g. RPK, reads per kilobase) and the fraction of duplicated reads. A typical normal-input (100-1000 ng total RNA) RNASeq experiment targeting approximately 25-30 million reads per sample is expected to show low duplication rates for low abundance transcripts (bottom left of plot), with the duplication rate rising as the expression level approaches the 1 read/bp boundary (vertical red dashed line). Beyond that threshold, highly expressed transcripts are saturated (or nearly so) with reads and contain duplicates.
Note that some duplication is inherent to RNASeq experiments. The key QC interpretation here is that a majority of genes should be positioned to the left of the 1 read/bp threshold and be below 50% duplication. In libraries derived from low input total RNA or from low complexity sources, higher duplication rates will be evident even in transcripts with low abundance.
(4, 5) splicing annotation: The identification of novel splice junctions is an important aspect in RNASeq analyses. While significant progress has been made in this area of RNASeq, the short-read nature of many NGS platforms complicates matters. For example, most short-read mapping algorithms identify the most parsimonious combination of exons, which may not reflect accurately the true isoform origin as assumptions about transcript structures may or may not be correct. Importantly, isoforms with low expression may have limited number of reads that spanning their splice junctions. Even those with additional coverage are more likely to be false positives, unless read coverage is dense. Therefore, novel splice junctions will show a bias toward strongly expressed genes. Until read lengths increase substantially (e.g. Pacific Biosciences or Oxford Nanopore platforms), the alignment of spliced reads will persist as one of the most prevalent challenges in RNASeq data analysis.
By comparison to the reference gene model used to align the sequenced reads, this graph shown the proportion of all detected splice junctions obtained from the STAR alignment into know, complete novel and partial novel categories. The splicing annotation is performed at two levels: the splice event (3) and splice junction (4). A splice junction “event” is an occurrence of a mapped read-pair bridging a splice junction. Some read-pairs may contain multiple splice junction events, and some read-pairs may contain none. A known category reflects the junction described in the gene model; both the 5’ and 3’ splice sites are annotated. In complete novel, both splice sites are novel (not present in the gene model). In partial novel, either one of the 5’ or 3’ splice sites is novel and the other is known.
(6) splice junction saturation: This plot allows you to determine if the sequencing depth obtained is sufficient to perform alternative splicing analyses. Check first for a plateau in the known junctions curve. Since fixed splice junctions can be predetermined from the reference gene model, all known splice junctions should be present in a saturated RNASeq dataset. If not, alternative splicing analysis is difficult because not only are known sites incompletely represented, but lower abundance sites would also be missing, increasing the number of false negatives.
(7) strandedness: This table provides information regarding whether or not your RNASeq library preparation was stranded or unstranded. Strand-specific RNASeq libraries allow you to identify the coding DNA strand for transcripts; strand information is lost in unstranded RNASeq library protocols. If the frequency of both patterns is approximately equivalent, then the library sequenced was unstranded.
For paired-end RNASeq data, there are two different ways to strand reads:
For single-end RNASeq data, there are also two different ways to strand reads:
This information is merely provided for convenience, as it is not crucial to have stranded RNASeq libraries for typical DGE analyses. However, use of un-stranded libraries affects expression estimates for genes with over-lapping genomic loci and transcribed from opposite strands. This phenomenon is not uncommon in humans (ca. 20% of transcribed genes are affected). In contrast, stranded RNASeq libraries provide more accurate estimates of transcript expression for such genes (Zhao et al. 2015). If your libraries were prepared at UWBC GEC, stranded library preparations were used. Stranded RNASeq libraries are essential to analyze non-coding RNAs (ncRNA; e.g. miRNA, lncRNA), which play vital regulatory roles at the genomic, transcriptional and translational levels.
Recall that transcription of “sense” strand DNA produces antisense transcripts and often results in the production of ncRNA complementary to their associated sense transcripts. Antisense transcription has been identified in bacteria, fungi, protozoa, plants, invertebrates and mammals. Areas of intense antisense transcriptional activity have been identified near nucleosome-free regions such promoters, increasing the likelihood that antisense transcripts carry out important regulatory functions. By manipulating the input cDNA during the template preparation, strand-specific preparations capture this information and substantially increase the worth of any RNASeq experiment.
(8) read-pair inner distance: [Paired-end RNASeq only]. This metric reports the estimated inner distance distribution between paired-end reads. The average inner distance should be consistent with the size selection utilized during library construction. This distance is especially important if RNASeq data is used to detect structure variation (e.g. alternative splicing events). Moreover, in certain paired-end RNASeq sequencing designs that attempt to maximize read lengths, the distribution of inner distances will indicate whether or not the sequencing overlaps the alternative read. Overlap is denoted by read fragment sizes < 0.0. While having read overlap for DGE-type analyses is not a problem per se, it does reduce slightly the amount of information contained in your data.
Mapped paired-end reads for both genes and transcripts (isoforms) are counted in each sample using RSEM (RNASeq by Expectation Maximization), described in Li and Dewey (2011). RSEM uses a statistical model to take into account the uncertainty associated with read mapping, especially in a transcriptome where multiple isoforms may exist. In contrast to R/FPKM (reads/fragments per kilobase per million mapped reads), RSEM implements a TPM (transcripts per million mapped reads) metric to compare differences between sequenced sample libraries. The primary rationale is that libraries are not all of the same size and it is necessarily the case that an increase in expression of any particular gene in one library will lead to the exclusion of other genes.
Aggregate TPM adjusted expected read counts are written to MS Excel spreadsheet (.xlsx) files rsem_GENES and rsem_ISOFORMS. Comma-delimited (.csv) files are also available. Comma-delimited files containing additional information, including gene/transcript length, effective length, TPM, and FPKM, are also available in the directory deliverable/expression. Lastly, blinded rlog transformed counts, are also present.
Real RNASeq datasets will be comprised of both expressed and non-expressed genes, which are defined by an annotation model (see section 8, Annotation, for additional information.). Genes with zero or low-abundance counts create problems in RNASeq DGE analyses because small counts do not contain enough information for reliable statistical inference (Bourgon, Gentleman, and Huber 2010). For analysis methods that require a false discovery rate analysis (FDR; see below), the proportion of genes with very low expression out of the total set of all genes being tested influence considerably the power of detection after multiple testing correction; low-expression genes are usually indistinguishable from sampling noise. Removing (filtering) genes that are expressed at or near zero prior to any DGE analysis reduces the severity of the correction and improves the power of detection. Moreover, if not expressed in any condition, they exhibit no biologically meaningful information, at least within the confines of the experimental context.
Our current workflow uses filterByExpr, a function available in edgeR. This function filters those genes failing to meet a threshold of at least 10 read counts or more in a specific number of samples, which is defined as the smallest group sample size in the contrast.
Important: the filtering is performed independently for samples present in each contrast.
Plots depicting a density distribution of log2-transformed read counts before and after application of the filtering function are available in Table 4. The distributions are color-coded by the defined contrast group. The vertical dashed line depicts the log2-CPM threshold used in the filtering step (a CPM value of 1 is equivalent to a log2-CPM value of 0). Beyond showing the performance of the [independent] filtering, the log2-CPM plots also serve to indicate per-sample expression distributions, which may be useful as a quality control metric. In general, the distributions should be approximately the same within a specific group. Substantial shifts along either axis may reflect technical (e.g. library size [see below], sample inversion, variation in duplication levels) or biological (e.g. treatment effects) phenomena.
Contrast | Analysis | geneLevel | isoformLevel |
---|---|---|---|
E08R-E18C | CPM | E08R-E18C | E08R-E18C |
E16R-E08R | CPM | E16R-E08R | E16R-E08R |
E16R-E18C | CPM | E16R-E18C | E16R-E18C |
E16R-E18R | CPM | E16R-E18R | E16R-E18R |
E18R-E18C | CPM | E18R-E18C | E18R-E18C |
To compare gene expression between two conditions (treatments), the fraction of reads assigned to each gene and/or transcript relative to the total number of reads and with respect to the entire group of RNASeq samples compared must be estimated. While the number of sequenced reads is known in RNASeq experiments, it is still only a sample of the total mRNA captured in the original library and varies, often considerably, among samples. The purpose of normalization is to eliminate systematic effects not associated with the biological differences of interest.
A key assumption in RNASeq experiments is that the majority of genes are not differentially expressed. This dictates that all samples should have a similar range and distribution of expression values. However, during the processes of sample preparation and sequencing, extrinsic factors not of biological origin may affect perceived expression levels of individual samples in RNASeq experiments. For example, samples processed in the first stage of an experiment may have been sequenced more deeply than samples processed at another time. Moreover, the total read count is often dependent on a limited number of highly expressed transcripts. Thus, some form of sample normalization is required to ensure that the distribution of gene expression in each sample is similar statistically across the entire experiment.
Samples in your RNASeq experiment were normalized by the method of trimmed mean of M-values (TMM), where the product of this factor and the library sizes defines the effective library size (Robinson and Oshlack 2010). For each sample contrast, simple boxplots of per sample expression distributions were constructed before and after TMM normalization. Successful application of the TMM procedure should yield more uniform distributions centered on a common median. These figures are stored in the directory deliverable/DGE/edgeR_NRM. They can also be accessed directly from Table 5 below.
Contrast | Analysis | geneLevel | isoformLevel |
---|---|---|---|
E08R-E18C | NRM | E08R-E18C | E08R-E18C |
E16R-E08R | NRM | E16R-E08R | E16R-E08R |
E16R-E18C | NRM | E16R-E18C | E16R-E18C |
E16R-E18R | NRM | E16R-E18R | E16R-E18R |
E18R-E18C | NRM | E18R-E18C | E18R-E18C |
Caution: In edgeR, a pseudo-count is a type of normalized count of transcripts. The pseudo-counts represent the equivalent counts that would have been observed if the library sizes had been equal, assuming the fitted normalization TMM model. Therefore, do not interpret the pseudo-counts as general-purpose normalized counts, e.g. library size factors, normalizing for sequencing depth and gene length with RPKM, FPKM, TPM etc.
For an expression analysis seeking to discover differentially expressed genes using RNASeq, the goal is to distinguish differences in abundance of gene transcripts between experimental conditions. A gene is declared differentially expressed if expression levels between at least two experimental conditions is statistically significant. To achieve this, a comparison of variability between replicates and between conditions is required. However, reliable detection of differentially expressed genes is not trivial and cannot be relegated only to a numerical statistical test. We advocate the use of visual information as an indispensable component of a robust RNASeq analysis.
The following visualizations can be very important for troubleshooting fundamental aspects of any RNASeq experiment. Perhaps most importantly, these methods can often detect sample mislabeling (inversion) very quickly. Inverted samples can have profound [negative] influences on a differential gene expression analysis. In fact, sample inversion is a primary explanation for why histograms of raw p-values computed in DGE analyses acquire a peculiar shape (see below).
Another issue easily detected (but not always easily classified) are so-called batch effects. A batch effect is a source of variation due to partitioning of experiments into subgroups, most often during the collection phase (often the field or wet-lab phase). Batch effects commonly include - but are not limited to - variation in sample collection and storage techniques, personnel, and reagents. Importantly, batch effects can be taken into account only if they do not confound with another technical or biological factor included in the experimental design. Therefore, rigorous experimental control is essential for robust RNASeq experiments (Leek et al. 2010).
One of the most important exploratory plots to examine prior to any differential gene expression (DGE) analysis is an unsupervised multidimensional scaling (MDS) plot. An MDS plot is a visualization of distances among a set of objects. In RNASeq, an MDS plot will show variation among samples such that those with higher dissimilarity are further apart. It is different from a principle components analysis (PCA; PCA is a particular instance of MDS) in that MDS projects data to (usually) a 2D space (i.e. utilize only the first two dimensions) and focuses on distance relationships among scaled objects. In contrast, PCA project a multi-dimensional space onto the directions of maximum variation between samples and genes.
Most importantly, both methods transform a large set of variables (genes) into a smaller one that still contains most of the information. Despite subtle differences, inspection of MDS patterns, with respect to your sample groupings, can provide clues to the quality of upstream processing procedures (e.g. sample collection; see below), which may affect downstream analyses and/or interpretation. The scaled axes on the sample-level MDS plots (immediately below) are computed from contributions of all independently filtered genes (see Section 4.2) and representative of Euclidean distances between samples. The two dimensions depicted are ordered based on how well they fit your samples. In contrast, at the contrast-level (Table 6), the axes are re-scaled and report distances as the leading log2 fold-change between the samples for the genes that comprise those samples. This particular analysis uses only the top 500 most variable genes.
If your experiment is well controlled (e.g. uniform sample acquisition and processing yielding a measurable effect), expect to observe the greatest sources of variation between the groups (factors) you are contrasting.
Ideally, each factor in the MDS plot will cluster and be separated from other conditions. This indicates that differences between groups (effect size) are larger than differences within groups. In other words, the between-group variance of gene expression is greater than the within-group variance and therefore can be detected reliably. If grouped samples are widely scattered or if one or more samples of an otherwise [expected] cohesive group are distant from each other, these samples can be examined further for sources of error (e.g. sample inversion) or additional and possibly unknown variation (e.g. batch effects, experimental design or possibly true biological variation). If present, technical replicates should located be very close to one another. If you have more than two groups in your experimental design, be sure to examine carefully the contrast-level plots for higher-resolution details.
The plots in Figure 4 and Figure 5 show the MDS for all samples together at the gene- and transcript-level, respectively. Ellipses show the 95 percent confidence interval of groups containing at least four (4) samples. The mean value of each group is depicted by a larger unlabeled dot. The directory deliverable/DGE/edgeR_MDS contains individual MDS plots computed for each contrast. They can also be accessed directly from Table 6.
Contrast | Analysis | geneLevel | isoformLevel |
---|---|---|---|
E08R-E18C | MDS | E08R-E18C | E08R-E18C |
E16R-E08R | MDS | E16R-E08R | E16R-E08R |
E16R-E18C | MDS | E16R-E18C | E16R-E18C |
E16R-E18R | MDS | E16R-E18R | E16R-E18R |
E18R-E18C | MDS | E18R-E18C | E18R-E18C |
Hierarchical clustering is a complementary approach to ordination methods like PCA or MDS. In RNASeq, it seeks to determine whether samples display greater variability between experimental conditions than between replicates of the same condition. While many methods exist to view clustering relationships, heatmaps (sometimes called Clustered Image Maps; CIMs) work well for RNASeq data. This type of representation is based on a hierarchical clustering of dissimilarities in expected read counts across samples.
Sample relationships are graphically depicted in a 2D image, where each entry is colored on the basis of its dissimilarity to other samples, and where the rows and columns are reordered according to the hierarchical clustering. A dendrogram shows the hierarchical arrangement of the samples produced by the hierarchical clustering (UPGMA).
The heatmaps in Figure 6 and Figure 7 represent graphically the entire expression data set (gene- and transcript-level, respectively) where the individual values contained in a matrix are represented as colors, themselves representative of a measure of expression dissimilarity. To create the heatmap, the expected read counts were normalized and transformed with a regularized logarithm (rlog) function. Next, the Pearson correlation was computed for each pair of samples and Euclidean distances of the correlation distances determined.
As the Euclidean distance is proportional to the Pearson correlation coefficient, the heatmap will cluster together samples that have positively correlated expression values because large positive correlations correspond to small distances. The branches of the dendrograms on the CIM plots should reflect sample clustering with correlated expression patterns. Although computationally distinct, patterns between the MDS and CIM plots may be visible.
A MDS or PCA plot is a two-dimensional (2D) scatter plot in which the data is plotted using the two most descriptive principal components (PCs). The key difference between 2D and 3D is the number of PCs selected. Principal components are constructed to capture decreasing amounts of variation in the data: PC1 describes the most variation, PC2 describes the second most variation etc. As a result, the first two or three PCs may capture most of the variation and the remaining components can be discarded without losing much information.
Ideally, in RNASeq-based differential expression analyses, PC1 and PC2 would capture the the entirety of expression differences between and within conditions, respectively. Unfortunately, expression counts derived from RNASeq data are affected by additional technical and biological sources of variation unrelated to the biological effect of interest. Moreover, these additional sources of variation are often unknown but still contribute to the total variation observed. A scree plot shows the cumulative proportion of explained variation for each component. In Figure 8 and Figure 9 below,
the number of PCs that explain a majority of the expression data was determined by the elbow method. The basic assumption is that the few largest PCs should capture the majority of any biological signal, here differences in gene expression, explaining significantly more variation than the remaining and increasingly smaller PCs. Graphically, this will appear as a sharp reduction in the cumulative variance explained (red line). The presence of strong expression-level variation in the first few PCs is desirable and will shift the elbow to the left (dashed vertical line). If all the components appear relatively flat, very little expression-level variation was detected.
Important: In figures 6 and 7, the scree plot is calculated using all samples. Contrast-level scree plots are available below.
Contrast | Analysis | geneLevel | isoformLevel |
---|---|---|---|
E08R-E18C | SRE | E08R-E18C | E08R-E18C |
E16R-E08R | SRE | E16R-E08R | E16R-E08R |
E16R-E18C | SRE | E16R-E18C | E16R-E18C |
E16R-E18R | SRE | E16R-E18R | E16R-E18R |
E18R-E18C | SRE | E18R-E18C | E18R-E18C |
While 2D plots of PC1 and PC2 often suffice for general inspection, all real RNASeq datasets exhibit additional sources of technical and biological variation and generally benefit from inclusion of PC3, shown below. As an added convenience, the 3D plot can be rotated to provide multiple ordination perspectives of the expression data. This capability is very useful to identify features like intra- and inter-condition cohesiveness, potential outlier samples, effect sizes etc. Simply place the cursor on the object, click and hold, and move the object in any direction. To help with orientation, two intersecting planes (0,0) are shown. The contributions of each component are shown and the plot axes.
To assess DGE, variability in the data must be estimated (modeled) appropriately. For count data obtained from RNASeq experiments, edgeR uses a negative binomial (NB) model. The reason for this is that RNASeq data usually has more variation than can be accounted for by using a Poisson model. This phenomenon is called over-dispersion. EdgeR estimates the mean \(\mu\) count number for each gene, which corresponds to the abundance of transcripts in the sample library. The mean for a gene is estimated as the library size x abundance. This model also has a second parameter, which reflects the degree of dispersion. This parameter determines how the variance for each gene is modeled. For each gene the variance is estimated as \(V =\mu*( 1 + dispersion*\mu)\). Each gene has a distinct value for \(\mu\). Under the “common dispersion model” the same dispersion value is used when modelling the variance for each gene. Under the “tagwise model” (see below)a different dispersion value is computed independently for each gene. Note that using a common dispersion model does not mean the variance is the same for each gene!
Interpretive caution: It is important to understand that common dispersion is completely different from correlation between two replicates (cf. MDS analysis). In edgeR, the square root of the dispersion parameter estimated with a negative binomial model is the BCV. Larger values for the BCV indicate that gene expression is more variable relative to the mean. For example, if dispersion = 0.25, then BCV = 0.50. This means expression values vary up and down by 50% between biological replicates.
The biological coefficient of variation (BCV) is the coefficient of variation (CV) with which the true abundance (unknown) of a gene varies between replicate RNASeq samples. Numerical results for the specified contrasts are shown in Table 8 below.
Contrast | Dispersion | BCV | Contrast | Dispersion | BCV |
---|---|---|---|---|---|
E16R-E08R | 0.01854 | 0.1362 | E16R-E08R | 0.08934 | 0.2989 |
E16R-E18C | 0.01694 | 0.1302 | E16R-E18C | 0.09930 | 0.3151 |
E16R-E18R | 0.00723 | 0.0850 | E16R-E18R | 0.07558 | 0.2749 |
E18R-E18C | 0.01670 | 0.1292 | E18R-E18C | 0.10139 | 0.3184 |
E08R-E18C | 0.02983 | 0.1727 | E08R-E18C | 0.11857 | 0.3443 |
Plots of BCV showing the common, trended and tagwise dispersion as a function of average logCPM for each contrast are located in the directory deliverable/DGE/edgeR_BCV They can also be accessed directly from Table 9 below.
The BCV reflects the coefficient of variation with which the true abundance of the gene varies among replicate RNASeq libraries. The common dispersion is the “squared coefficient of variation”, where the coefficient of variation gives the amount of variability in the true abundance for each gene between replicates (which cannot be measured directly). Quite simply, the higher the estimate of the common dispersion the higher the variability between replicates.
In contrast to common dispersion, use of (moderated) tagwise dispersion is based on the belief that every gene may not have the same common value for dispersion. Thus, it may be more realistic to estimate a distinct dispersion for each gene. However, one problem with this approach is that RNASeq experiments usually have small sample sizes (e.g. perhaps 2 or 3 samples in each of two groups; the BRC recommends at least 4 samples per group). This means that dispersion for each gene may be estimated on 3 or 5 degrees of freedom, which provides very unreliable estimates.
On the other hand, the common dispersion is estimated from perhaps many thousands of genes, so the estimate should be reliable. EdgeR moderates (“squeezes”) the tagwise dispersion estimates towards the common dispersion value. The result of this procedure provides dispersion estimates larger than the common dispersion value smaller and estimates smaller than the common dispersion value larger. This makes the tagwise dispersion estimates more stable and improves inference of DGE more than using unmoderated tagwise dispersion estimates.
In practice, it is not necessarily better to perform the tagwise procedure than to use only common dispersion. Moderated tagwise dispersion tends to rank more highly as differentially expressed genes that have more consistent counts within groups, whereas using a common dispersion model tends to rank genes with one extremely large count in one library as differentially expressed. At BRC, our default analysis uses tagwise dispersion estimates.
Contrast | Analysis | geneLevel | isoformLevel |
---|---|---|---|
E08R-E18C | BCV | E08R-E18C | E08R-E18C |
E16R-E08R | BCV | E16R-E08R | E16R-E08R |
E16R-E18C | BCV | E16R-E18C | E16R-E18C |
E16R-E18R | BCV | E16R-E18R | E16R-E18R |
E18R-E18C | BCV | E18R-E18C | E18R-E18C |
In other words, it represents the CV attributable to only biological replicates if sequencing depth was infinite; the technical CV decreases as the size of the counts increases whereas the BCV does not. Thus, the BCV is usually the primary source of uncertainty, at least for for the most highly expressed genes. Reliable estimation of the BCV is therefore required to assess DGE in RNASeq experiments.
The primary goal of any differential gene expression analysis is to quantitatively measure differences in the levels of transcripts between two or more treatments or groups. At a naive level, this could be as simple as comparing the counts for reads mapping to each transcript. In reality, simply counting reads is neither sufficient for identifying DE genes or for quantifying the degree of any differences between treatment groups. The reason for this is due largely because there is no way to know how accurately such counts reflect what was actually present in the original sample. Moreover, there may be little or no indication of how widely expression varies normally in a given group with or without any treatment.
To appropriately [statistically] interpret differences in read counts, information regarding the variances that are associated with different levels of the experimental design must be estimated. Many types of variance confound counts in RNASeq data including sampling, technical, and biological variance. Hence, DGE experiments must be designed to accommodate and accurately measure both transcript counts and the variances associated with those counts.
Two fundamental tasks are required of all differential gene expression (DGE) analyses. First, an estimation of the magnitude of differential expression between two or more conditions based on read counts from biologically replicated samples must be ascertained. This procedure requires calculation of the fold-change of read counts, taking into account the differences in sequencing depth and variation across samples and groups. Second, an estimation of the significance of the expression difference and a correction for multiple testing is required.
Analysis of differentially expressed genes is performed with a glm using the edgeR package (Robinson, McCarthy, and Smyth 2010). The purpose of this test for differential expression is to test whether the data provides sufficient evidence to conclude that this value is really different from zero.
Results from each independent contrast at the gene and isoform level are written to a MS Excel spreadsheet (.xlsx), located in the directory deliverable/DGE These files are named edgeRglm_GENE_X-Y.xlsx and edgeRglm_ISOFORM_X-Y.xlsx, where X= treatment designation and Y= control designation respectively. Comma-delimited files are also available.
Differential gene expression data from all contrasts are available in an interactive gene list (Table 10) that can be viewed in any web-browser. It contains summary information from the DGE analysis, which can be sorted by logFC and adjusted p-value. Ensembl geneIDs, gene symbols, and descriptions are also available. The suite of geneIDs is linked to the appropriate Ensembl site for your organism. Lastly, the entire list is searchable, making it easier to find specific information, especially if your genome is well annotated.
Contrast | Analysis | geneLevel | isoformLevel |
---|---|---|---|
E08R-E18C | LRT | E08R-E18C | E08R-E18C |
E16R-E08R | LRT | E16R-E08R | E16R-E08R |
E16R-E18C | LRT | E16R-E18C | E16R-E18C |
E16R-E18R | LRT | E16R-E18R | E16R-E18R |
E18R-E18C | LRT | E18R-E18C | E18R-E18C |
Results from each independent contrast at the gene and isoform level are written to a MS Excel spreadsheet (.xlsx), located in the directory deliverable/DGE These files are named edgeRglm_GENE_X-Y.xlsx and edgeRglm_ISOFORM_X-Y.xlsx, where X= treatment designation and Y= control designation respectively. Comma-delimited files are also available.
Important: In order to decide which genes are differentially expressed, the adjusted p-value - NOT the raw p-value - is typically defined to be 0.05. This is because tests are performed on thousands of genes/transcripts, and the chances of finding one that is differentially expressed increases substantially with the number of tests performed. To control the false discovery rate (FDR), a Benjamini-Hochberg correction is applied (Reiner, Yekutieli, and Benjamini 2003; Benjamini and Hochberg 1995).
FDR-controlling procedures are designed to control the expected proportion of rejected null hypotheses that are in fact false. The FDR is measure of statistical significance after correcting for multiple testing. It estimates the proportion of false discoveries in the final list of gene expression comparisons. In other words, if a statistical test called significant all genes with a p-value less than or equal to a given gene’s significance threshold, the FDR indicates the fraction of expected false positives.
This value is reported in the adjusted p-value column the interactive DGE results (Table 10). Among the various approaches for multiple testing correction, FDR estimation offers a balance between statistical stringency and rate of type II errors and therefore is widely used for high-throughput genomics data analysis.
Proper interpretation of the large data tables generated by the DGE method requires effective tools. Visualization is one way to achieve an overall summary of the DGE results. A smearplot (related to a MA-plot) provides a useful overview for a two-group comparison (contrast) in RNASeq experiments. A smearplot represents each gene with a gray • . Relative to the contrast direction, red • and blue • dots denote up- and down-regulated expression respectively, at a adjusted p-value (FDR) significance threshold of 0.05. The gray dots reflect those genes with no evidence of statistically significant differential expression.
The Y-axis (\(log_{2}\) foldchange) is the effect size estimate. It indicates how much expression has changed between the two groups defined by the contrast. This value is reported on a base 2 logarithmic scale. For example, a \(log_{2}\) fold change of 1 means that a genes expression is increased by a multiplicative factor of \(2^{1} = 2\). Similarly, a \(log_{2}\) fold change of -1 indicates a decrease in expression by a factor of \(2^{-1} = 0.5\). The two solid gray lines denote the boundary of a two-fold change.
Important: The comparison direction of each contrast (e.g. condition_01 - condition_02 or treatment_01 - control) is critically important in terms of identifying correctly the direction of transcriptional expression change. Each contrast is always indicated on relevant graphs and tabular information included in this report.
To interpret the logFC, recall that \(log_{2}FC = log_{2}(B) - log_{2}(A) = log_{2}(B/A)\). If you have two expression values, \(A\) and \(B\), the fold change from \(A\) to \(B\) is just \(B/A\). For example, if expression for a gene in condition \(A = 10\) and \(15\) in condition \(B\), then \(log_{2}FC = log_{2}(15)-log_{2}(10) = log_{2}(15/10) = 0.585\); the gene in condition \(B\) has more transcripts (cf. up-regulated) as the sign of \(log_{2}FC\) is positive. If the expression values were reversed, then \(log_{2}FC = log_{2}(10)-log_{2}(15) = log_{2}(10/15) = -0.585\); the gene in condition A has fewer transcripts (cf. down-regulated).
For contrasts reported here, we use the convention \(treatment-reference\). Thus, \(log_{2}FC = log_{2}(treatment) - log_{2}(reference) = log_{2}(treatment/reference)\).
Smearplots for each contrast are stored in the directory deliverable/DGE/edgeR_LFC They can also be accessed directly from Table 11.
Contrast | Analysis | geneLevel | isoformLevel |
---|---|---|---|
E08R-E18C | LFC | E08R-E18C | E08R-E18C |
E16R-E08R | LFC | E16R-E08R | E16R-E08R |
E16R-E18C | LFC | E16R-E18C | E16R-E18C |
E16R-E18R | LFC | E16R-E18R | E16R-E18R |
E18R-E18C | LFC | E18R-E18C | E18R-E18C |
Important: Acquiring transcriptome expression profiles requires that genomic elements be defined in the context of the genome. Therefore, when performing a typical RNASeq analysis, a genome annotation must be selected. An annotated genome provides a comprehensive catalog of genomic functional elements, a crucial component required to interpret a RNASeq experiment. The annotation source used in your RNASeq experiment is listed in Table 12.
## Ensembl site unresponsive, trying useast mirror
## Ensembl site unresponsive, trying asia mirror
dataset | description | version | host | biomaRt |
---|---|---|---|---|
ggallus_gene_ensembl | Chicken genes (bGalGal1.mat.broiler.GRCg7b) | bGalGal1.mat.broiler.GRCg7b | www.ensembl.org | ensembl |
The annotation of genes/transcripts presented in the interactive table was current at the time this report was created. But compared to the annotation used in the DGE analysis, there may be some differences between the two sources. A consequence of frequent updates, alterations often occur in gene names and geneIDs. If required, a copy of the actual annotation file used in the transcriptome analysis is available in the deliverable/data folder.
The scientific community invests substantial effort to produce accurate annotation of an ever-increasing number of genomes from a myriad of organisms. Multiple genome annotations for many organisms are publicly available, especially for the so-called “model” organisms. Currently, at least six curative methods supporting one or more genome annotations exist including AceView Genes, Ensembl Genes, H-InvDB Genes, RefSeq Genes, UCSC Known Genes, and Vega Genes.
However, the complexity of these annotations differ widely because of variations in annotation strategies and information sources used in their construction. For example, some annotation strategies rely on algorithmic prediction, often resulting in more complex gene models that contain more predictive or exploratory genomic elements. Other annotation strategies rely primarily on evidence-based methods that require more manual curation, leading to simpler gene models with fewer genes and isoforms.
It remains unclear how different choices of genome annotation may affect downstream expression estimates in RNASeq experiments, but it is likely that variation in genome annotation complexity can be non-trivial. First, less complex annotation should result in a higher percentage of uniquely mapped reads. But at the same time, the total number of reads mapping to the annotated genome is smaller. Second, more complex annotations increase the number of ambiguous mappings, which increases the ambiguity in expression quantification.
This phenomenon may ultimately increase expression variation, which may propagate to fold-change statistics and, subsequently, DGE detection. Any relationship between genome annotation complexity and gene expression quantification would be helpful in guiding the selection of a genome annotation for RNASeq studies. Unfortunately, no guidelines for selecting a specific genome annotation for RNASeq are available currently, and the effect of genome annotation choice on downstream data analysis still remains unknown.
At UW-Madison BRC, our default is to use Ensembl annotation. The annotation techniques used by ensembl are published and opensource, strike a balance of annotation complexity, include nearly all publicly available genomes, and have excellent community support including regular updates (Aken et al. 2016).
Differential gene expression experiments test each annotated gene by defining a null hypothesis that no differential gene expression exists across one or more sample groups. Even in Prokaryotes, this may involve testing thousands of genes. But as more and more genes are tested, the false positive (type I error) rate increases - the essence of the multiple testing (comparison) problem.
Enormous effort has been directed toward extending multiple comparison procedures to large-scale simultaneous inference questions intrinsic to RNASeq differential expression experiments. Given a list of independent p-values obtained from a given test, (Reiner, Yekutieli, and Benjamini 2003; Benjamini and Hochberg 1995) defined the concept of false discovery rate (FDR) and created a model to control the expected FDR below a specified level. Instead of controlling the error rate per se, they advocated to control the proportion of Type I error.
Their procedure is the most widely used correction in differential gene expression analyses and is most often defined to be 0.05. This means that the proportion of false positives expected from, for example, tests of differential gene expression is 5%. For example, if a FDR correction yields 100 genes with a FDR cutoff of 0.05, the expectation is that 5 of the 100 genes are false positives. Having this information is much better than attempting to determine which genes are true positives.
While application of the FDR correction is a critically important component of any differential gene expression experiment, its use without knowledge of the underlying p-value distribution does little to inform a researcher about the experiment performance as a whole. As an important diagnostic of multiple testing results, examination of a histogram of uncorrected p-values is useful to determine if the experiment and/or hypothesis test exhibits characteristics required for legitimate application of the FDR correction.
A histogram of unadjusted p-values, acquired here from the differential expression tests, is a bar graph of the number of p-values that fall within a specified number of non-overlapping sub-intervals (bins) ranging from 0 to 1; the RNASeq pipeline used at BRC uses 50 bins. This graphic assessment (think of it as a sanity check) is usually able to differentiate among a sufficiently powered RNASeq experiment and several (undesirable) alternatives.
In the simplest case, if all null hypotheses cannot be rejected (i.e. no differential expression is observed), p-values will follow a uniform distribution, which corresponds to a flat histogram.
This means that no statistically discernible effects were detected in the defined test. At best only a tiny percentage - if any - of the null hypotheses might be rejected after FDR correction. Under ideal conditions, the red line depicts the number of counts we would expect if all test results (p-values) were distributed uniformly as nulls. Technically this is a measure of \(\pi\)0, which is 1 in the BH model. There are other models in which \(\pi\)0 is estimated as more conservative, but these are not used widely in RNASeq data. The observation that the histogram is is not uniformly distributed (but pretty good) is due to the fact that read count data is far from ideal. Thus, estimates of the overall proportion of true null and true alternative hypotheses must be made. Identifying the true alternatives is the job of the Benjamini and Hochberg FDR correction.
Agreeably, this p-value distribution is not the outcome desired for most RNASeq designs and in fact, is generally uncommon for designs incorporating nearly any treatment (but true alternative detection remains an important aspect of any experiment design; see below).
In most RNASeq experiments, a success is determined by the presence of a discernible effect on transcriptome expression patterns, inferred by differential gene expression. In this case, uncorrected p-values will be comprised of true nulls (p > 0.05) and true alternatives (p < 0.05) and the shape of the histogram will change.
In this instance, the p-values are less uniformly distributed. The flat portion of the distribution contains p-values corresponding to the null hypothesis (true negatives). An enrichment of small p-values occurs near 0, with many exceeding \(\pi\)0. These are the alternative p-values, suggesting evidence against the null hypothesis (true positives). There are at least some alternative hypotheses even exhibit larger p-values: these are the hypotheses undetectable with the specified test (false negatives). While encouraging, this information alone still cannot distinguish if a p-value < 0.05 represents a true positive or a false positive until the FDR procedure is applied. But for RNASeq experiments in which an expectation of alternatives is proposed to exist, this is the most desirable distribution of p-values.
Distributions like the one shown above reflect an experimental design with moderate power, a hallmark of a good experimental design with perhaps 5-6 biological replicates and minimal technical variation. Very high power contrasts have even larger proportions of alternatives. In these cases, the number of differentially expressed genes is substantial, with exceedingly low FDR-corrected p-values.
In marked contrast, a p-value distribution from a low power contrast is shown below.
Just like the distribution in (Figure 13: Desired p-value distribution, sufficient power), there are alternative p-values exceeding our expectation at \(\pi\)0, but with many fewer occurrences. This type of shape suggest a statistically under-powered contrast, which is not uncommon in RNASeq experiments. Often due to insufficient biological replication, low power contrasts typically exhibit insufficient evidence to reject any null hypotheses, even when evidence indicates that some of the tests may be true alternatives. With insufficient power, many of the true positives turn out to have moderate, or even large p-values and can be found throughout all bins of the histogram.
Although perhaps disappointing, the small enrichment of p-values < 0.05 above our expectation implies that there is at least some merit contained in this experiment. Although the current design may have failed to find alternatives due to low power, perhaps another experiment with an improved experimental design (increased biological replication, reduced technical variation), might be successful. This advice is in marked contrast to the conclusion reached after acquiring the histogram with a shape depicted in (Figure 12: P-value distribution, uniform), which suggests almost no potential to conduct another experiment investigating the same biological question, as there is simply no discernible effect to be found.
The following examples illustrate aberrant p-value distributions, suggesting issues other than levels of statistical power may be influencing the results. It is critically important to understand that if these types of distributions are obtained, a high probability exists that something wrong occurred in the experiment. In these cases, the FDR correction does nothing to help. If not identified and corrected appropriately, flawed interpretations of the results may occur.
The first example shows a distribution with a deficit of small p-values, gradually increasing to form a peak and either beginning to decrease again or increasing further, perhaps reminiscent of a hill. Statistically, this type of distribution suggests that the estimated variance of the null expression distribution is over-dispersed, i.e. the variance of the null distribution is too high.
In this case, distributions with either shape are indicative of some systematic deviation from the theoretical null distributions of the test statistics has occurred. In RNASeq experiments either pattern may indicate the presence of confounding hidden variables, including so-called batch effects. Sources may be biological and/or technical. Age, lineage, gender, tissue type and ecotype are common biological sources. Technical sources often include laboratory personnel (very common), reagents, pipettors and sample collection times. While these sources of variation are always present to some degree, the point is to practice good laboratory practice (GLP) to minimize their effects as best as possible.
If you are certain no batch effect exists (not always easy to confirm) another cause may be attributed to sample inversion (i.e. mistakenly assigning samples to the incorrect contrast group) or including unrelated samples in any contrast group), However, with appropriate design and modeling considerations, these effects can be attenuated and/or eliminated. Examination of sample- and contrast-level MDS and CIM plots can also help determine sample inversion problems.
Lastly, a bimodal (U) shape may occur exhibiting enrichment of both small and large p-values. Statistically, this type of distribution suggests that the estimated variance of the null expression distribution is under-dispersed, i.e. the variance of the null distribution is too low.
This type of distribution often occurs often when low-count genes are not filtered appropriately (or at all). Since a majority of low-count genes are filtered in BRC’s analytical workflows, if a U-shaped histogram does occur, it may be due to an abundance of genes just above the filtering cutoff. P-values of these genes are at or near 1.0 and give rise to higher counts for the rightmost portion of the “U”. The effect may be exacerbated by poor sequencing coverage resulting from many phenomena including insufficient read coverage (too few reads), poor read and mapping qualities, exogenous DNA contamination from other species and high levels of rRNA, which reduces the proportion of mRNA sequences in the library.
While not an desirable result, provided some enrichment of p-values < 0.05 exist and there is a uniform distribution leading up to the p-value enrichment close to 1.0, the bimodal distribution will correspond approximately to the expected null distribution (Reiner, Yekutieli, and Benjamini 2003). The risk here is that you will acquire substantially more false negatives. It is a far better course of action to remedy the issues(s) responsible for the aberrant distribution.
Histogram plots of uncorrected and Benjamini-Hochberg corrected p-value for each contrast are contained in the directory deliverable/DGE/edgeR_PVD They can also be accessed directly from Table 13 below.
Contrast | Analysis | geneLevel | isoformLevel |
---|---|---|---|
E08R-E18C | PVD | E08R-E18C | E08R-E18C |
E16R-E08R | PVD | E16R-E08R | E16R-E08R |
E16R-E18C | PVD | E16R-E18C | E16R-E18C |
E16R-E18R | PVD | E16R-E18R | E16R-E18R |
E18R-E18C | PVD | E18R-E18C | E18R-E18C |
A common method of visualizing gene expression data is to display it as a heatmap so that you can simultaneously visualize clusters of samples (i.e. groups) and features (i.e. genes). This can be useful for identifying genes that are commonly regulated, or have biological signatures associated with a particular condition. Often, heatmaps are combined with clustering methods, which group genes and samples together based on the similarity of their gene expression pattern.
Heatmaps for RNASeq data are displayed usually as a grid where each row represents a gene and each column represents a sample. The color and intensity of each grid cell (sample x gene) is used to represent changes in transcript expression. Genes that are upregulated (high expression value) are colored differently than genes that are downregulated (low expression value), thus providing a simultaneous visual representation of gene expression levels across multiple different samples.
Contrast | Analysis | Heatmap |
---|---|---|
E08R-E18C | HMZ | E08R-E18C |
E16R-E08R | HMZ | E16R-E08R |
E16R-E18C | HMZ | E16R-E18C |
E16R-E18R | HMZ | E16R-E18R |
E18R-E18C | HMZ | E18R-E18C |
In this RNASeq report, a subset of up to 50 of the most differentially expressed genes with a FDR corrected p-value < 0.05 are selected. Next, both samples and genes are clustered using Euclidean distances. An additional elbow function is applied to estimate the number of gene clusters present. Calculated relationships are depicted by dendrograms drawn at the top (samples) and to the left (genes) of the heatmap. The gradation of color is determined by a Z-score that is computed and scaled across rows of genes normalized by TMM. The Z-score of a given expression value is the number of standard-deviations away from the mean of all the expression values for that gene.
When a gene is expressed differentially in samples among two groups, then the Z-scores will be (mostly) positive in one group and (mostly) negative in the other group, hence the contrast of colors. Clusters of genes with similar or very different expression values are easily visible when scaled in this manner. Depending on the degree of expression change, there should be sharp contrasts between groups; within-group Z-scores should be fairly uniform, hence color gradations should be relatively uniform. However, outliers are often readily apparent and distinguished by discontinuities in both color pattern and cluster relationship. Comparison of the Z-score heatmap with the sample- and contrast-level MDS and CIM plots described earlier may also be informative.
Although dependent on experimental design and the level of biological effect, your RNASeq experiment probably identified altered expression of dozens to hundreds (and often many more) of genes. Beyond the individual genes themselves and their [known or unknown] function, it is reasonable to ask in which metabolic or signaling pathways they may be involved with. This so-called pathway analysis (c.f. over-representation analysis [ORA]) has become the first choice to identify predominant biological themes in a collection of differentially expressed genes (DEG).
The basic assumption of this analysis is that genes involved in the same biological process, function, or localization likely exhibit correlated behaviors in terms of expression levels. Application of statistical tests to find correlation reduces the complexity of the primary list of DEGs and increases the overall biological explanatory power. Since the dimensionality of the DEG list is reduced by grouping them into pathways, components of biological processes, pathways and networks may often become more clear.
The KEGG database (Kyoto Encyclopedia of Genes and Genomes) (http://www.kegg.jp) is a collection of manually curated pathway maps representing hypotheses of the molecular interaction and reaction networks for many biological processes. The aim of its use is to provide a measure of support (usually a score and associated p-value) of pathways enriched for a set of genes. In other words, given a group of pathways, are differentially expressed genes in a given pathway more abundant than expected by chance?
To conduct this test, first an input gene list is created using some threshold measure. In this instance, a minimum of 10 and a maximum of 500 genes, sorted by increasing FDR-corrected p-value are selected. Next, for each KEGG pathway, selected genes that are part of the specific pathway are counted. Importantly, this step requires translation of ensembl geneIDs to Entrez IDs and is achieved with biomaRt. While a majority of genes will be translated, not all ensembl geneIDs have a corresponding Entrez ID that KEGG will recognize. Therefore, the list used to construct the enrichment is representative only of the translatable geneIDs. In general, KEGG pathway analysis is useful only for well-annotated genomes.
Lastly, the list of genes in every pathway is tested for over- or under-representation with respect to the input list of DE genes. The number of “background” genes is determined by counting the number of protein_coding genes [Biotype] in the annotation model. The most commonly used tests are based on the hypergeometric, chi-square or binomial distributions. For this report, we used the methodological framework described in (Yu et al. 2012).
Contrast | Analysis | Pathways |
---|---|---|
E08R-E18C | KPY | E08R-E18C |
E16R-E08R | KPY | E16R-E08R |
E16R-E18C | KPY | E16R-E18C |
E16R-E18R | KPY | E16R-E18R |
E18R-E18C | KPY | E18R-E18C |
Any significantly enriched KEGG pathways are ordered from most to least significant. The number of genes in the specified KEGG pathway are denoted by the size of a circle. The color filling each circle corresponds to the specific KEGG pathway description. The dashed line represent significance at p = 0.05. Remember that this report uses an over-representation analysis [ORA] and not a functional class scoring (e.g. GSEA) approach (Subramanian et al. 2005).
Moreover, be aware that dozens of over-representation applications, many of them web-based, exist. Unfortunately, many are not well documented (e.g. what version of the KEGG database or specific enrichment test used). Be careful about using such tests particularly as they often do not update very frequently. The KEGG pathway enrichment analysis presented with this RNASeq report updates the functional information from KEGG prior to performing the test.
Analysis and interpretation of genomic data generated by sequencing technologies, such as RNASeq, are among the most complex problems in genomic sciences. Since its initial deployment in ca. 2008, RNASeq technology has matured considerably but still remains a technically demanding type of NGS procedure. Significantly, the number of questions addressable with RNASeq data is essentially limitless; vastly more analytic facets related to transcriptome analysis exist than do ways of generating the RNASeq data itself. One consequence of this reality is that analysis and interpretation of RNASeq data is far from routine. Analytic “best practices” in RNASeq remain very dynamic and reflect our current limited understanding of transcriptional processes including technical aspects of data collection.
In brief, the process used in the UW-Madison BRC RNASeq pipeline was to first demultiplex, filter, and trim sequencing reads. Next, reads were mapped to the specified genome and transcriptome, the latter derived from the accompanying ensembl annotation model. Uniquely mapping reads (primary) were annotated and counted to estimate abundance. Each sample was normalized and statistical analyses were applied to identify differential expression between one or more contrasts, of which the raw p-values were adjusted by a FDR correction.
As with any experiment that is intended to test a null hypothesis of no difference between or among groups of individuals, making sense of RNASeq data depends on the scientific hypothesis of interest. A primary goal of any DGE analysis is to highlight genes that have changed significantly in abundance across different experimental conditions. You now posses a ranked list of genes with FDR corrected p-values, fold changes, and in many cases, functionality.
The adjusted statistical significance (FDR) of individual transcripts in which you are most interested will undoubtedly dictate a measure of “success” or “failure” regarding your RNASeq study. For example, if you had no expectations a priori for any differential gene expression and simply wanted a candidate list of genes that were or were not differentially expressed, no matter the outcome, you would likely deem the experiment a success.
Alternatively, if you were interested in testing for differential gene expression in specific sets of transcripts known previously to differ in abundance among treatment groups (e.g. determined by qPCR or perhaps reported in the literature), but the RNASeq analysis showed no evidence of any DGE, you may hastily - and perhaps erroneously - conclude that the RNASeq experiment was a “failure”.
To evaluate how confident you can be in the results of the DGE analysis, you must go beyond the DGE list itself, and evaluate additional aspects of the analyses and perform some basic checks. The following list is by no means exhaustive, and not all will apply to every RNASeq experimental design, but it is a good place to begin.
The effect size will to some degree also be affected by sampling variance, and must therefore depend on read coverage and the number of biological replicates. In RNASeq experiments, there are also important biological reasons to expect different effect sizes when comparing transcript counts or specific types of samples. Some treatments will affect transcript expression more directly than others, so large differences between treatments for these transcripts may be expected.
Some transcriptional responses to a given treatment may be more variable among individuals, owing to greater genetic variation at regions of the genome responsible for mediating the response. Such a phenomenon will ultimately manifest itself as a smaller effect size. In some instances, the effect size may be so small that reconsideration of your experimental design is warranted, perhaps requiring adjustments of sample sizes, treatments and/or collection procedures, or adoption of more complicated designs that include time-series or paired-sample elements.
Many other phenomena associated with treatments and effect sizes vary widely in biological systems and often at multiple levels (e.g. cellular hierarchies, biochemical cascades, temporal dependencies). Moreover, factors including treatment and sample type(s), and/or the ability to control the environment of test subjects may differ tremendously from experiment to experiment, cf. batch effects. This can mean that published data from other experimental systems - even simulated data - may not be informative enough to establish a reliable RNASeq design a priori. In such cases, perhaps the best solution is to design a simpler experiment that focuses first on establishing magnitudes of effect sizes.
In nearly all cases, creating lists of DE genes is not the final step of a RNASeq analysis. At a minimum, further biological insight into an experiment may be gained by examination of expression changes in gene sets, particularly if the results from your RNASeq DGE analysis are substantially different from expectations based on [accurate] prior knowledge. For example, the starting point for many additional downstream and exploratory analyses includes gene ontogeny (GO) and biochemical pathway (KEGG) enrichment inquiries. A simple, but robust, KEGG enrichment analysis was already computed for you in this report. You may wish to explore additional methodologies like GSEA (Subramanian et al. 2005). Additional insight may also be gained by establishing patterns of expression changes within gene sets by integrating the RNASeq data with data from other sources, such as microarray or metabolomics data.
Please feel free to contact our office should you have any questions regarding your data.
This report was generated by computational pipelines developed at the Bioinformatics Resource Center (BRC), part of the Advanced Genome Analysis Resource unit of the Biotechnology Center at the University of Wisconsin - Madison.
Bioinformatics Resource Center
Genetics & Biotechnology Center, Room 1274
425 Henry Mall, Madison WI 53706
Email: brc@biotech.wisc.edu
Phone: (608) 265-8262
R version 4.1.1 (2021-08-10)
Platform: x86_64-conda-linux-gnu (64-bit)
locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=en_US.UTF-8, LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8 and LC_IDENTIFICATION=C
attached base packages: grid, stats4, parallel, stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: rgl(v.0.107.14), kableExtra(v.1.3.4), tufte(v.0.10), captioner(v.2.2.3), pander(v.0.6.4), png(v.0.1-7), xtable(v.1.8-4), pheatmap(v.1.0.12), biomaRt(v.2.48.0), hwriter(v.1.3.2), ReportingTools(v.2.32.0), tibble(v.3.1.4), rbamtools(v.2.16.17), refGenome(v.1.7.7), RSQLite(v.2.2.5), doBy(v.4.6.11), plyr(v.1.8.6), RColorBrewer(v.1.1-2), batchtools(v.0.9.15), edgeR(v.3.34.0), limma(v.3.48.0), gridExtra(v.2.3), ggrepel(v.0.9.1), gtable(v.0.3.0), ggthemes(v.4.2.4), ggplot2(v.3.3.5), ShortRead(v.1.50.0), GenomicAlignments(v.1.28.0), SummarizedExperiment(v.1.22.0), Biobase(v.2.52.0), MatrixGenerics(v.1.4.0), matrixStats(v.0.60.1), Rsamtools(v.2.8.0), GenomicRanges(v.1.44.0), Biostrings(v.2.60.0), GenomeInfoDb(v.1.28.0), XVector(v.0.32.0), IRanges(v.2.26.0), S4Vectors(v.0.30.0), BiocParallel(v.1.26.0), BiocGenerics(v.0.38.0), tufterhandout(v.1.2.1), knitr(v.1.33) and rmarkdown(v.2.10)
loaded via a namespace (and not attached): utf8(v.1.2.2), R.utils(v.2.10.1), tidyselect(v.1.1.1), AnnotationDbi(v.1.54.0), htmlwidgets(v.1.5.3), munsell(v.0.5.0), base64url(v.1.4), withr(v.2.4.2), colorspace(v.2.0-2), Category(v.2.58.0), filelock(v.1.0.2), OrganismDbi(v.1.34.0), highr(v.0.9), rstudioapi(v.0.13), GenomeInfoDbData(v.1.2.6), bit64(v.4.0.5), curry(v.0.1.1), vctrs(v.0.3.8), generics(v.0.1.0), xfun(v.0.25), biovizBase(v.1.40.0), BiocFileCache(v.2.0.0), R6(v.2.5.1), locfit(v.1.5-9.4), AnnotationFilter(v.1.16.0), bitops(v.1.0-7), cachem(v.1.0.6), reshape(v.0.8.8), DelayedArray(v.0.18.0), assertthat(v.0.2.1), BiocIO(v.1.2.0), scales(v.1.1.1), nnet(v.7.3-16), ggbio(v.1.40.0), ensembldb(v.2.16.0), rlang(v.0.4.11), genefilter(v.1.74.0), systemfonts(v.1.0.2), splines(v.4.1.1), rtracklayer(v.1.52.0), lazyeval(v.0.2.2), dichromat(v.2.0-0), broom(v.0.7.9), brew(v.1.0-6), checkmate(v.2.0.0), BiocManager(v.1.30.16), yaml(v.2.2.1), reshape2(v.1.4.4), GenomicFeatures(v.1.44.0), backports(v.1.2.1), Hmisc(v.4.5-0), RBGL(v.1.68.0), tools(v.4.1.1), ellipsis(v.0.3.2), Rcpp(v.1.0.7), base64enc(v.0.1-3), progress(v.1.2.2), zlibbioc(v.1.38.0), purrr(v.0.3.4), RCurl(v.1.98-1.4), prettyunits(v.1.1.1), rpart(v.4.1-15), cluster(v.2.1.2), magrittr(v.2.0.1), data.table(v.1.14.0), ProtGenerics(v.1.24.0), hms(v.1.1.0), evaluate(v.0.14), XML(v.3.99-0.7), jpeg(v.0.1-8.1), compiler(v.4.1.1), crayon(v.1.4.1), R.oo(v.1.24.0), htmltools(v.0.5.2), GOstats(v.2.58.0), Formula(v.1.2-4), tidyr(v.1.1.3), geneplotter(v.1.70.0), DBI(v.1.1.1), dbplyr(v.2.1.1), MASS(v.7.3-54), rappdirs(v.0.3.3), Matrix(v.1.3-4), R.methodsS3(v.1.8.1), pkgconfig(v.2.0.3), foreign(v.0.8-81), microbenchmark(v.1.4-7), xml2(v.1.3.2), svglite(v.2.0.0), annotate(v.1.70.0), webshot(v.0.5.2), AnnotationForge(v.1.34.0), rvest(v.1.0.1), stringr(v.1.4.0), VariantAnnotation(v.1.38.0), digest(v.0.6.27), graph(v.1.70.0), htmlTable(v.2.2.1), Deriv(v.4.1.3), GSEABase(v.1.54.0), restfulr(v.0.0.13), curl(v.4.3.2), rjson(v.0.2.20), jsonlite(v.1.7.2), lifecycle(v.1.0.0), PFAM.db(v.3.13.0), viridisLite(v.0.4.0), BSgenome(v.1.60.0), fansi(v.0.4.2), pillar(v.1.6.2), lattice(v.0.20-44), GGally(v.2.1.2), KEGGREST(v.1.32.0), fastmap(v.1.1.0), httr(v.1.4.2), survival(v.3.2-13), GO.db(v.3.13.0), glue(v.1.4.2), bit(v.4.0.4), Rgraphviz(v.2.36.0), stringi(v.1.7.4), blob(v.1.2.2), DESeq2(v.1.32.0), latticeExtra(v.0.6-29), memoise(v.2.0.0) and dplyr(v.1.0.7)
Application | Version |
---|---|
skewer | 0.2.2b |
STAR | 2.5.3a |
RSEM | 1.2.31 |