1 Sample definitions

1.1 Input data and environment settings

Specific information pertaining to your RNASeq project 20191121_RetinaTest is shown in Figure 1 below. Check first to ensure that the designations in the SampleName and Factor columns match with the associated fastq file in column FileName1. It is critically important that these designations are correct as any contrasts made by the differential gene expression (DGE) analyses are conditioned on these values. If sequencing included a paired-end read, an additional column named FileName2 will be present and should also be inspected for accuracy. If technical replicates were including in your experiment, grouping information will be contained in the replicate column. The two columns labeled SampleLong and Experiment are not used currently and may be ignored.

The directory named deliverable contains numerous files of expression and DGE results including quality control (QC) metrics and data files. These files can be accessed directly and many are also are accessible from hyperlinks contained within this html document. Content description and explanations of these files are given in the appropriate sections below.

The designated contrasts made during the DGE analysis are shown in Figure 2 below. Each specific contrast is considered independent from the entire dataset, which is referred to as a sample. Please check to ensure that all of the contrasts requested are present.

2 FASTQ preprocessing

Due to the imperfect nature of the sequencing process and limitations of the optical instruments (batch effects, sequencing error, GC bias, duplicate reads, copy number variation, mapability), base calling has inherent uncertainty associated with it. The magnitude of uncertainty in each base call is represented by an error probability or Phred (pronounced like “fred”) score. This parameter is denoted formally as \(Q\) and is proportional to the probability \(p\) that a base call is incorrect, where \(Q = -10log~10~(p)\).

Any biological inference or conclusion depends on the overall quality of your data. These following quality control (QC) statistics computed on your data are designed to help you evaluate the overall technical quality of your experiment. You should always examine if your raw sequence data exhibit good quality and lack obvious problems or biases, which may affect how you can ultimately use it in your research.

2.1 Read trimming

The trimming software skewer (Jiang et al. 2014) was used to preprocess raw fastq files. Skewer implements an efficient dynamic programming algorithm designed to remove adapters from Illumina-generated sequence reads.

During sequencing library construction, short oligonucleotides are ligated to the ends DNA fragments to be sequenced, so that they can be combined with primers for PCR amplification. While sequencing the fragments, if the read length is greater than that of the target DNA, the adapter sequence next to the unknown DNA sequence of interest is also sequenced, sometimes only partially. To recover the true target DNA sequence, it is required to identify the adapter sequence and remove it. Moreover, to improve further the quality of the sequence reads, skewer also trims the 3’ end of the fragment until a Phred quality of 20 is reached. This is equivalent to a base call accuracy of 99%. More information regarding Phred scores can be obtained from Ewing et al. (1998) and in Ewing and Green (1998).

2.2 FASTQ quality report

After adapter trimming, the BRC RNASeq pipeline computes eight QC statistics on your read data including: (1) combined per cycle base quality, (2) per cycle base frequencies, (3) per cycle average base quality, (4) relative 3k-mer diversity, (5) Phred quality distribution, (6) mean quality distribution, (7) read length distribution and (8) read occurrence distribution.

The QC results for the FASTQ data are accessible from the link in Table 1 below.

Table 1: FASTQ read quality control
Experiment Analysis Report
20191121_RetinaTest RQC FASTQ quality

To gauge the overall quality of the read data, scan down the rows for each quality metric (a-h, described below). Clicking on any image will enlarge it. Graphs of summarized data should be fairly uniform for all samples in each QC metric. If one or more metrics in a sample [library] are considerably different, or if they are all highly variable, it may be indicative of a problem that warrents additional attention. Important note: becuase the number of reads generated in a typical RNASeq experiment is substantial (ca. 30M), read-level QC metrics are based on a randomly sampled subset comprising 5% of the total read number.

(a) combined per cycle base quality: This graph shows Phred-scaled quality scores of reads with respect to counts. The Y-axis depicts the quality scores for each base at each sequencing cycle (X-axis; 1 cycle = 1 base). Higher qualities indicate more reliable base calls. The solid blue-gray and jade-green lines depict the mean and Q50 quantile Phred scores, respectively. The two dashed jade-green lines represent the Q25 (lower) and Q75 (upper) quantiles.

Ideally, a single reddish band would occur at a quality score of 40 along the full range of sequenced reads, except for the first few cycles. In practice, this is not often realized as the quality of calls on NGS sequencing platforms degrade as the number of cycles increases. Therefore, it is not uncommon to observe a reduction in base quality (bands become increasingly yellow) toward the end of a read. As a general rule, Phred scores > 30 are sufficient for use in RNASeq analyses. This is equivalent to a base call accuracy of 99.9%.

(b) per cycle base frequencies: This graph depicts the frequency of each nucleotide base position across the length of the read. The relative frequency of each base should reflect that of the entire genome. During RNASeq library construction, targets are primed with random hexamers and will influence the base composition during early sequence cycles, especially in cycles 1-10. Enrichment with random hexamers is not entirely random as there is an incomplete sampling of all possible hexamers. This effect is known and does not affect the base content of the read. In other words, it does not represent any individually biased sequences.

(c) per cycle average base quality: This graph is similar to the representation of the per cycle base quality. On the Y-axis, the Phred scaled quality is plotted as a function of cycle number (X-axis). In contrast to the per cycle base quality graph, average qualities for each nucleotide base are plotted separately. Quality scores > 30 are good for RNASeq analyses and equivalent to 99.9% base call accuracy.

(d) trinucleotide frequency: The relative frequency of trinucleotides (3-mer) may indicate the presence of over-represented sequences in your library. This is based on the assumption that any small sequence should not have a compositional bias within a diverse library of sequences. However, this can vary significantly depending on the organism or targets sequenced. If one or more 3-mers dominate the remaining 3-mers, it may indicate the presence of a contaminating sequence, perhaps an adapter dimer (rare) or biased sequence composition (more common).

(e) Phred quality distribution: This graph is related to graphs 1 (per cycle base quality) and 2 (per cycle base frequencies) but instead depicts the percentage of reads above a minimum Phred quality.

(f) read quality distribution: This graph shows the distribution of mean quality in terms of read percentage. A right skewed distribution of Phred scores greater than 30 is preferred, and equivalent to at least a 99.9% base call accuracy.

(g) read length distribution: When the RNASeq libraries were sequenced, a targeted read length was was chosen, e.g. 50 bp, 75 bp, 100 bp etc. Ideally, all reads would reflect this length but multiple phenomena can occur that result in a mixture of shorter read lengths. This graph shows the distribution of read lengths as a proportion of all reads. A successful sequencing run should produce a majority of reads equal to the number of cycles performed.

(h) read occurrence distribution: This graph summarizes the proportion over all reads that are represented among each sequence exactly N times. Occurrences represented by 0.1k, 1k, 10k, and greater than 10k denote ranges corresponding to 11-100, 101-1,000, 1,001-10,000, and greater than 10,000 instances.

Unbiased library preparation and sequencing should yield a majority of unique reads (N=1) and rapidly decrease to zero. If strong spikes occur in any of the defined ranges, it may reflect a technical issue, often when sample preparation resulted in biased PCR amplification of specific targets. In RNASeq, replicate reads may be due to artifacts of sample preparation, rather than differential representation of target sequence in the original unamplified sample. Over-representation (duplication) is often observed in low input library preparations.

Condensed to a single page, QC results for readpair 1 are also located here: Readpair1. If sequencing included a paired-end read, the QC results for readpair 2 are located here: Readpair2. These fastq quality statistics are estimated from a randomly sampled portion (20%) of the total reads contained in the fastq file.

3 Preliminary data inspection

For any differential gene expression analysis to succeed, the expression values of individual genes should be fairly similar across biological replicates. The following two simple visualization tools can be very important for troubleshooting fundamental aspects of any RNASeq experiment. Perhaps most importantly, both can detect sample mislabeling (inversion) very quickly. Such an occurrence can have profound influences on a differential gene expression analysis. In fact, sample inversion is a primary explanation for why histograms of raw p-values computed in DGE analyses acquire a peculiar shape (see below).

Another issue easily detected (but not always easily classified) are so-called batch effects. A batch effect is a source of variation due to partitioning of experiments into subgroups, most often during the collection phase (often the field or wet-lab phase). Batch effects commonly include - but are not limited to - variation in sample collection and storage techniques, personnel, and reagents. Importantly, batch effects can be taken into account only if they do not confound with another technical or biological factor included in the experimental design. Therefore, rigorous experimental control is essential for robust RNASeq experiments (Leek et al. 2010).

3.1 Read filtering

Genes with zero or low-abundance counts create problems in RNASeq DGE analyses because small counts do not contain enough information for reliable statistical inference (Bourgon, Gentleman, and Huber 2010). For analysis methods that require a false discovery rate (FDR; see below), the proportion of genes with very low expression out of the total set of all genes being tested influence considerably the power of detection after multiple testing correction; low-expression genes are usually indistinguishable from sampling noise. Removing (filtering) genes that are expressed at or near zero prior to any DGE analysis reduces the severity of the correction and improves the power of detection. In this RNASeq pipeline, low-abundance genes are defined as those with an average read count below a threshold of 1.0 in two or more samples.

3.2 Gene expression normalization

To compare gene expression between two conditions (treatments), the fraction of reads assigned to each gene and/or transcript relative to the total number of reads and with respect to the entire group of RNASeq samples compared must be estimated. While the number of sequenced reads is known in RNASeq experiments, it is still only a sample of the total mRNA captured in the original library and varies, often considerably, among samples. The purpose of normalization is to eliminate systematic effects not associated with the biological differences of interest.

A key assumption in RNASeq experiments is that the majority of genes are not differentially expressed. This dictates that all samples should have a similar range and distribution of expression values. However, during the processes of sample preparation and sequencing, extrinsic factors not of biological origin may affect perceived expression levels of individual samples in RNASeq experiments. For example, samples processed in the first stage of an experiment may have been sequenced more deeply than samples processed at another time. Moreover, the total read count is often dependent on a limited number of highly expressed transcripts. Thus, some form of sample normalization is required to ensure that the distribution of gene expression in each sample is similar statistically across the entire experiment.

Samples in your RNASeq experiment were normalized by the method of trimmed mean of M-values (TMM), where the product of this factor and the library sizes defines the effective library size (Robinson and Oshlack 2010). For each sample contrast, simple boxplots of per sample expression distributions were constructed before and after TMM normalization. Successful application of the TMM procedure should yield more uniform distributions centered on a common median. These figures are stored in the directory deliverable/DGE/edgeR_NRM. They can also be accessed directly from Table 2 below.

Table 2: Contrast-level normalization plots
Contrast Analysis geneLevel isoformLevel
E08R-E18C NRM E08R-E18C E08R-E18C
E16R-E08R NRM E16R-E08R E16R-E08R
E16R-E18C NRM E16R-E18C E16R-E18C
E16R-E18R NRM E16R-E18R E16R-E18R
E18R-E18C NRM E18R-E18C E18R-E18C

Caution: In edgeR, a pseudo-count is a type of normalized count of transcripts. The pseudo-counts represent the equivalent counts that would have been observed if the library sizes had been equal, assuming the fitted normalization TMM model. Therefore, do not interpret the psuedo-counts as general-purpose normalized counts, e.g. library size factors, normalizing for sequencing depth and gene length with RPKM, FPKM, TPM etc.

3.3 Unsupervised clustering of samples: MDS

One of the most important exploratory plots to examine prior to any differential gene expression (DGE) analysis is an unsupervised multidimensional scaling (MDS) plot. An MDS plot is a visualization of a principle components analysis and shows similarities [dissimilarities] between samples as sources of variation in the data. It is different from a principle components analysis (PCA) in that MDS project data to (usually) a 2D space and focuses on relationships among scaled objects. In contrast, PCA project a multi-dimensional space to the directions of maximum variation using the correlation between samples and variables (i.e. genes). Despite subtle differences, inspection of MDS (or PCA) patterns, with respect to your sample groupings, can provide clues to upstream processing procedures (e.g. sample collection; see below), which may affect downstream analyses and/or interpretation.

It also provides an indication prior to performing a formal statistical analysis to identify which contrasts may yield significant differences. Distances on the plot may be interpreted as leading log2 fold-change, meaning the log2 fold-change between the samples for the genes that distinguish those samples. If your experiment is well controlled (e.g. uniform sample acquisition and processing yielding a measurable effect), expect to observe the greatest sources of variation between the groups (factors) you are contrasting.

Even if technically partitioned variation is minimal, to produce a discernible expression effect, there must still be present a source of biological variation.

Ideally, each factor in the MDS plot will cluster well within the primary condition of interest and be separated from other conditions. This indicates that differences between groups (effect size) are larger than differences within groups. In other words, the between-group variance of gene expression is greater than the within-group variance and therefore can be detected. If grouped samples are widely scattered or if one or more samples of an otherwise [expected] cohesive group are distant from each other, these samples can be examined further for sources of error (e.g. sample mix-up) or additional and possibly unknown variation (e.g. batch effects, experimental design or possibly true biological variation). If present, technical replicates should be very close to one another.

The plots in Figure 3 and Figure 4 show the MDS for all samples together at the gene- and transcript-level, respectively. The directory deliverable/DGE/edgeR_MDS contains individual MDS plots computed for each contrast. They can also be accessed directly from Table 3.

Table 3: Contrast-level multidimensional scaling plots
Contrast Analysis geneLevel isoformLevel
E08R-E18C MDS E08R-E18C E08R-E18C
E16R-E08R MDS E16R-E08R E16R-E08R
E16R-E18C MDS E16R-E18C E16R-E18C
E16R-E18R MDS E16R-E18R E16R-E18R
E18R-E18C MDS E18R-E18C E18R-E18C

3.4 Unsupervised clustering of samples: Heatmap

Hierarchical clustering is a complementary approach to MDS and also seeks to determine whether samples display greater variability between experimental conditions than between replicates of the same treatment. While many ways exist to view clustering relationships, heatmaps (sometimes called Clustered Image Maps; CIMs) work well for RNASeq data. This type of representation is based on a hierarchical clustering of dissimilarities in read counts across samples.

Sample relationships are graphically depicted in a 2D image, where each entry is colored on the basis of its dissimilarity to other samples, and where the rows and columns are reordered according to the hierarchical clustering. A dendrogram shows the hierarchical arrangement of the samples produced by the hierarchical clustering (UPGMA).

The heatmaps in Figure 5 and Figure 6 represent graphically the entire expression data set (gene- and transcript-level, respectively) where the individual values contained in a matrix are represented as colors, themselves representative of expression dissimilarity. To create the heatmap, the estimated read counts were log2 transformed into pseudo-counts and each gene was scaled to have mean zero and standard deviation of one. The Euclidean distance between the expression values of each sample was computed.

As the rows of the logCPM matrix were standardized, The Euclidean distance is proportional to the Pearson correlation coefficient between any two genes. Therefore, the heatmap will cluster together samples that have positively correlated expression values, because large positive correlations correspond to small distances. As in the MDS plot, branches of the dendrograms should reflect sample clustering with correlated expression patterns.