UW Biotechnology Center Bioinformatics Resource Center
Please direct questions to:


1 Summary

1.1 Project summary

PI Name Name
Project Description Mouse fecal samples 16S (V3-V4)
Submission 16S_Illumina
Number of samples 8
Report generation date 2024-11-12
Number of Features 542
Total Frequency 1615142

1.2 Sequence data

Table 1.1: Read counts at each stage of sequence data processing.
Sample ID  Input    Filtered  Retained (%)  Denoised  Retained (%)  Non-chimeric  Retained (%)
C109       210,517  186,907   88.78         186,321   88.51         185,196       87.97
C110       258,515  230,223   89.06         226,183   87.49         214,232       82.87
D111       251,222  223,516   88.97         219,611   87.42         211,114       84.03
D112       321,828  287,150   89.22         282,927   87.91         269,369       83.70
G058       290,016  254,425   87.73         253,183   87.30         249,013       85.86
G059       309,802  274,197   88.51         270,486   87.31         265,006       85.54
H060       321,432  285,725   88.89         281,822   87.68         271,868       84.58
H061       30,370   24,285    79.96         23,528    77.47         23,489        77.34

The table above summarizes the number of reads retained at each processing step. Input lists the total raw reads per sample. Filtered shows remaining reads after removing reads with low-quality scores. Denoised indicates the number of reads retained after applying the DADA2 algorithm, which corrects sequencing errors and removes artifacts by inferring biological sequences from the data. Non-chimeric contains reads present after excluding potential chimeric reads formed during PCR amplification. Retention percentages at each step are calculated relative to the initial number of reads.
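As a minimal sketch of how the retention percentages are derived, the snippet below recomputes them for two samples from Table 1.1. The dictionary layout and the function name `retention_percentages` are illustrative assumptions, not part of the pipeline.

```python
# Read counts for two samples, copied from Table 1.1.
counts = {
    "C109": {"input": 210_517, "filtered": 186_907,
             "denoised": 186_321, "non_chimeric": 185_196},
    "H061": {"input": 30_370, "filtered": 24_285,
             "denoised": 23_528, "non_chimeric": 23_489},
}


def retention_percentages(sample):
    """Return the percent of input reads retained at each step (2 decimals)."""
    total = sample["input"]
    return {
        step: round(100 * reads / total, 2)
        for step, reads in sample.items()
        if step != "input"
    }


# Every percentage is relative to the raw input, not to the previous step.
retained = {sample_id: retention_percentages(c) for sample_id, c in counts.items()}
for sample_id, pct in retained.items():
    print(sample_id, pct)
```

Because every step is divided by the same input count, the percentages can only stay level or fall from left to right within a row.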

Figure 1.1: Initial quantity of unique and duplicate sequence reads for each sample.

The barplot above illustrates the overall number of sequencing reads for each sample. Overall read counts are broken down by unique (light blue) and duplicate reads (dark blue).

Figure 1.2: Sequencing quality (Phred) scores across read positions for the forward and reverse reads of each sample.

The line plot above displays the sequencing quality for each sample, expressed as Phred scores on the y-axis. This visualization offers insight into the overall performance of the sequencing run and highlights samples with low sequencing quality. Each line represents the forward or reverse read for a given sample, which allows the quality of paired-end reads to be compared. The scores reflect the accuracy of base calls across the read length, shown on the x-axis. High quality scores (typically above 30, shaded green) indicate a high level of accuracy, while scores below this threshold (shaded yellow and red) may suggest potential quality issues. Reports for individual samples can be found in the directory ../fastqc and a combined report is located in the directory ../multiqc.
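For reference, a Phred score Q corresponds to a base-call error probability of 10^(-Q/10). The short sketch below applies this standard definition; the function name is an illustrative choice.

```python
def phred_to_error_probability(q):
    """Convert a Phred quality score Q to the error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


# Print the error probability and implied base-call accuracy for common scores.
for q in (10, 20, 30, 40):
    p = phred_to_error_probability(q)
    print(f"Q{q}: error probability {p:.4f}, accuracy {100 * (1 - p):.2f}%")
```

A score of 30 therefore corresponds to roughly one expected error per 1,000 base calls.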

2 Microbial diversity

2.1 Taxonomic classification

Figure 2.1: Relative abundance and NCBI taxonomy ID for each taxon identified.

The table above contains the relative abundance of taxa across samples. Each row represents a distinct taxonomic group. The first column gives the NCBI taxonomy ID (if present). Subsequent columns give the taxonomic hierarchy (domain, phylum, class, order, family, and genus), followed by one column per sample ID holding the relative abundance of each taxon in that sample. Higher values indicate taxa with greater abundance in a given sample. A total of 65 genera and 10 phyla were classified. Both relative and absolute abundance data at each taxonomic level are located in the directory ../analysis/rel_table.
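Relative abundance is each taxon's read count divided by the total read count of the sample. The sketch below illustrates this with hypothetical counts (the genus names and numbers are invented for illustration and are not taken from this report). Whether the report tables express the result as a fraction or a percentage is not assumed here.

```python
# Hypothetical genus-level read counts for one sample (illustration only).
genus_counts = {"Genus_A": 600, "Genus_B": 300, "Genus_C": 100}


def relative_abundance(counts):
    """Scale raw read counts so that the values for one sample sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {taxon: n / total for taxon, n in counts.items()}


rel = relative_abundance(genus_counts)
for taxon, fraction in rel.items():
    print(f"{taxon}: {fraction:.3f}")
```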

2.2 Microbial community composition at phylum level

Figure 2.2: Stacked barplot depicting relative abundance of major phyla.

The barplot above illustrates the distribution of taxa at the phylum level across samples. The number of taxa shown is restricted to a maximum of 30. The height of each color segment reflects the proportional abundance of that taxon within the overall community. Samples are ordered by the values of the Genotype, Sex, and DOB variables from the metadata.

2.3 Microbial community composition at genus level

Figure 2.3: Heatmap depicting taxa abundance using absolute read counts.

The heatmap above illustrates the abundance of taxa across samples. The number of genera shown is restricted to a maximum of 30 across all samples. Each cell in the heatmap represents the unnormalized absolute read count of a specific genus, with color gradients indicating different levels of abundance. Samples are ordered by the values of the Genotype, Sex, and DOB variables from the metadata.

2.4 Alpha diversity

Figure 2.4: Boxplot of three different alpha diversity indices.

The boxplots above show three alpha diversity metrics (the Chao1, Shannon, and Inverse Simpson indices) across different groups or treatment conditions. Each boxplot displays the median, interquartile range, and potential outliers, providing a visual summary of the diversity patterns and variability within the sampled communities. Boxes are color-coded based on the values in the Genotype variable from metadata.
The Chao1 index is a species richness estimator that starts from the number of observed taxa and adjusts for rare, undetected taxa. It provides an estimate of the total species richness, including species that may have been missed due to undersampling. The Shannon index combines species richness and evenness, measuring the diversity of the community by considering not only how many species are present but also how evenly distributed they are. A higher Shannon index reflects a community where species are more evenly represented, whereas lower values indicate a community dominated by a few species. The Inverse Simpson index emphasizes dominance and gives more weight to common species, with higher values indicating greater diversity and less dominance by a single species. It is particularly sensitive to the presence of dominant taxa, making it useful for understanding the balance of species in the community. Additional alpha diversity information can be found in the directory ../analysis/alpha_diversity.
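The three indices can be sketched with their textbook definitions, as below. This is an illustration only: the bias-corrected form of Chao1 and the natural logarithm in the Shannon index are assumptions, and the pipeline's implementation may use different variants. The example communities are invented.

```python
import math


def chao1(counts):
    """Bias-corrected Chao1: S_obs + F1*(F1-1) / (2*(F2+1)).

    F1 and F2 are the numbers of taxa seen exactly once and exactly twice.
    """
    observed = [c for c in counts if c > 0]
    f1 = sum(1 for c in observed if c == 1)
    f2 = sum(1 for c in observed if c == 2)
    return len(observed) + f1 * (f1 - 1) / (2 * (f2 + 1))


def shannon(counts):
    """Shannon index H = -sum(p_i * ln p_i) over taxa with nonzero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def inverse_simpson(counts):
    """Inverse Simpson index 1 / sum(p_i^2)."""
    total = sum(counts)
    return 1 / sum((c / total) ** 2 for c in counts if c > 0)


# Two hypothetical communities with four observed taxa each.
even = [25, 25, 25, 25]      # perfectly even
dominated = [97, 1, 1, 1]    # one dominant taxon, three singletons

for name, community in (("even", even), ("dominated", dominated)):
    print(name, chao1(community), round(shannon(community), 3),
          round(inverse_simpson(community), 3))
```

Both communities contain four observed taxa, yet the dominated one scores lower on Shannon and Inverse Simpson, while its three singletons raise the Chao1 estimate above the observed richness.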

2.5 Beta diversity

Figure 2.5: Beta diversity visualized using principal coordinates analysis (PCoA) scatter plot.

The scatter plot above illustrates beta diversity of samples, based on the Bray-Curtis dissimilarity metric. The Bray-Curtis metric measures differences in species abundance between samples, taking into account both the presence/absence and the abundance of shared taxa. This metric is particularly sensitive to changes in dominant species, making it effective for highlighting differences in community structure. Each point in the scatter plot represents a sample, and the positioning reflects pairwise dissimilarities in microbial community composition. The axes represent the principal coordinates that explain the greatest variance in the dataset, with the percentage of variance explained by each axis indicated.
Points are color-coded based on the values in the Genotype variable from metadata. Confidence ellipses are shown around groups with three or more members, representing the variation within each group. Clustering of points suggests similarities in community composition and ecological characteristics, while greater distances between points indicate more pronounced differences in structure. This visualization helps identify patterns of similarity and dissimilarity, providing insights into how environmental or treatment conditions affect microbial community composition. Additional beta diversity information can be found in the directory ../analysis/beta_diversity.
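For two samples with count vectors x and y over the same taxa, the Bray-Curtis dissimilarity is the sum of |x_i - y_i| divided by the sum of (x_i + y_i). The sketch below implements this standard formula on invented count vectors. It is not the pipeline's code.

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity: sum(|x_i - y_i|) / sum(x_i + y_i)."""
    if len(x) != len(y):
        raise ValueError("samples must be counted over the same taxa")
    denominator = sum(a + b for a, b in zip(x, y))
    if denominator == 0:
        raise ValueError("both samples are empty")
    return sum(abs(a - b) for a, b in zip(x, y)) / denominator


# Hypothetical count vectors over the same three taxa.
sample_a = [10, 5, 0]
sample_b = [5, 5, 10]

print(bray_curtis(sample_a, sample_a))   # identical samples
print(bray_curtis([10, 0], [0, 10]))     # no shared taxa
print(round(bray_curtis(sample_a, sample_b), 4))
```

The value ranges from 0 for identical samples to 1 for samples that share no taxa. The ordination plots are built from the matrix of these pairwise values.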

Figure 2.6: Beta diversity visualized using non-metric multidimensional scaling (nMDS) scatter plot.

The scatter plot above illustrates beta diversity among samples based on community composition. nMDS is an ordination technique that arranges samples in a reduced-dimensional space while preserving the rank order of dissimilarities.
Each point represents a sample, with the distances between points reflecting differences in community composition. Closer points indicate more similar communities, while greater distances represent more dissimilar communities. Points are color-coded based on the values in the Genotype variable from metadata. Confidence ellipses are shown around groups with three or more members, representing the variation within each group.

2.6 Phylogeny

Figure 2.7: Phylogenetic relationships among major phyla

Figure 2.8: Phylogenetic relationships among selected genera

The phylogenetic trees above illustrate the evolutionary relationships among phyla (top) and genera (bottom). A maximum of 100 taxa were included in each plot. The branching structure reflects the inferred evolutionary history, with shorter branch lengths between tips indicating more closely related taxa. Each tip of the tree is labeled with a genus or phylum, and bubble sizes at the tips correspond to the relative abundance of each taxon across samples, integrating both phylogenetic and ecological information. Larger bubbles represent more abundant taxa, whereas smaller bubbles highlight those present at lower abundances. Bubbles are color-coded based on the values in the Genotype variable from metadata.
This visualization emphasizes the phylogenetic diversity within the community, as well as the distribution of taxon abundances at different taxonomic levels. The trees provide insight into which taxa are more evolutionarily distinct and how this relates to their ecological prevalence in the dataset. Variations in abundance patterns among closely or distantly related genera may suggest ecological or evolutionary factors influencing community structure. Additional phylogeny information can be found in the directory ../analysis/phylogeny.

2.7 Rarefaction

Figure 2.9: Feature count rarefaction curves

The line plot above shows rarefaction data for QIIME2-classified features present in each sample across different sequencing depths. Each curve represents the cumulative number of features observed as the sampling effort increases, providing insight into community richness within the samples. The asymptotic behavior of the curves indicates whether sampling has reached saturation, meaning additional sequencing is unlikely to reveal new taxa. Curves that plateau suggest sufficient sampling depth, while those continuing to rise indicate that further sampling could uncover additional taxa.

Figure 2.10: Alpha diversity rarefaction curves

The line plot above shows alpha diversity rarefaction curves for Chao1, Shannon, and Inverse Simpson indices across varying sample depths. Each line represents the diversity estimate as a function of sequencing depth, with Chao1 reflecting species richness (including rare species), Shannon representing both richness and evenness, and Inverse Simpson accounting for species dominance. The curves indicate how diversity estimates stabilize or fluctuate as sequencing effort increases, providing insight into whether sufficient sampling depth has been achieved. Comparing the three indices offers a comprehensive view of community structure, with Chao1 capturing potential unseen species, Shannon balancing richness and evenness, and Inverse Simpson highlighting dominant taxa within the community. Lines are color-coded based on the values in the Genotype variable from metadata. Additional rarefaction information can be found in the directory ../analysis/rarefaction.
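Rarefaction can be sketched as repeated random subsampling of reads without replacement, recording how many distinct features appear at each depth. The snippet below is a simplified illustration on invented counts. The function name, the number of iterations, and the fixed seed are assumptions and do not reflect the pipeline's settings.

```python
import random


def rarefy_observed_features(counts, depth, iterations=10, seed=0):
    """Mean number of distinct features among `depth` reads drawn without replacement."""
    # Expand per-feature counts into one entry per read, labeled by feature index.
    reads = [feature for feature, n in enumerate(counts) for _ in range(n)]
    if depth > len(reads):
        raise ValueError("depth exceeds the number of reads in the sample")
    rng = random.Random(seed)
    total = 0
    for _ in range(iterations):
        total += len(set(rng.sample(reads, depth)))
    return total / iterations


# Hypothetical sample: 100 reads spread unevenly over 7 features.
sample_counts = [50, 30, 10, 5, 3, 1, 1]

for depth in (1, 10, 25, 50, 100):
    print(depth, rarefy_observed_features(sample_counts, depth))
```

At the full depth every feature is recovered. A curve that has already flattened before that point suggests that the sample was sequenced deeply enough to capture its richness.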

3 Methods

3.1 Summary

Microbiome analysis was conducted using Quantitative Insights Into Microbial Ecology 2 (QIIME2) [1]. QIIME2 is a bioinformatics platform that provides a comprehensive set of tools for processing, analyzing, and interpreting high-throughput sequencing data from microbial communities. Initial assessment of the sequencing data quality was performed using FastQC [2], a widely used tool that generates detailed reports and visualizations to identify potential issues such as sequencing errors and adapter contamination. Sequencing reads underwent denoising and quality filtering using the DADA2 algorithm [3]. DADA2 models and corrects Illumina-sequenced amplicon errors to infer the true biological sequences, known as amplicon sequence variants (ASVs), present in the data. This step helps to reduce noise and improve the accuracy of downstream analyses.

Sequence variants obtained from the denoised data were aligned and masked using MAFFT, a multiple sequence alignment program commonly used for aligning nucleotide and amino acid sequences. A phylogenetic tree of the ASVs was subsequently constructed using FastTree. This tree provides insight into the evolutionary relationships among the identified microbial sequences. Taxonomy was assigned to the ASVs using a naive Bayes classifier pretrained on the SILVA database, a comprehensive reference database containing ribosomal RNA (rRNA) gene sequences and associated taxonomy. This step enables the classification of microbial taxa based on their genetic signatures.

Alpha diversity was calculated using the Chao1, Shannon, and Inverse Simpson indices. Beta diversity was calculated using the Bray-Curtis dissimilarity applied to ASV data. Bray-Curtis dissimilarity measures the compositional dissimilarity between microbial communities. Ordination plots were generated to visualize and interpret the relationships among samples based on their beta diversity. Alpha rarefaction curves were generated for all samples using the observed features metric. This analysis assesses sequencing depth and coverage by plotting the number of observed features (ASVs) against the sequencing depth, which helps judge whether diversity metrics are biased by differences in sequencing depth.

Figure 3.1: Diagram of bioinformatic pipeline used for analysis

3.2 Packages and versions

Table 3.1: Software used for analysis including version and database information
Component             Software / Version
QIIME2                2024.5
Denoising             DADA2
Alignment             PyNAST
Reference Database    silva-138-99-nb-classifier.qza
Taxonomy Assignment   classify-sklearn
Phylogeny Generation  FastTree

This table lists the software packages used in the analysis, with version and database information where available. The analysis was conducted using QIIME2, an open-source platform for microbial community analysis, alongside DADA2, which performs high-resolution sample inference by denoising amplicon sequence data. The table identifies the core package used for each step, including denoising, alignment, taxonomic classification, and phylogeny generation, to support reproducibility of the computational workflow.

4 References

  1. Bolyen E, Rideout JR, Dillon MR, Bokulich NA, Abnet CC, Al‐Ghalith GA, Alexander H, Alm EJ, Arumugam M, Asnicar F, Bai Y, Bisanz JE, Bittinger K, Brejnrod A, Brislawn C, Brown CT, Callahan BJ, Caraballo‐Rodríguez AM, Chase J, Cope EK, Da Silva RR, Diener C, Dorrestein PC, Douglas GM, Durall DM, Duvallet C, Edwardson CF, Ernst M, Estaki M, Fouquier J, Gauglitz JM, Gibbons SM, Gibson DL, González A, Gorlick K, Guo J, Hillmann B, Holmes S, Holste H, Huttenhower C, Huttley GA, Janssen S, Jarmusch AK, Jiang L, Kaehler B, Kang KB, Keefe CR, Keim P, Kelley ST, Knights D, Koester I, Kościółek T, Kreps J, Langille MGI, Lee J, Ley RE, Liu Y, Loftfield E, Lozupone C, Maher M, Marotz C, Martin BD, McDonald D, McIver LJ, Melnik AV, Metcalf JL, Morgan S, Morton J, Naimey AT, Navas-Molina JA, Nothias LF, Orchanian SB, Pearson T, Peoples SL, Petras D, Preuss ML, Pruesse E, Rasmussen LB, Rivers AR, Robeson MS, Rosenthal P, Segata N, Shaffer MJ, Shiffer A, Sinha R, Song SJ, Spear JR, Swafford AD, Thompson LR, Torres PJ, Trinh P, Tripathi A, Turnbaugh PJ, Ul-Hasan S, Van Der Hooft JJJ, Vargas F, Vázquez-Baeza Y, Vogtmann E, Von Hippel M, Walters WA, Wan Y, Wang M, Warren J, Weber KC, Williamson CHD, Willis A, Xu Z, Zaneveld J, Zhang Y, Zhu Q, Knight R, Caporaso JG. 2019. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nature Biotechnology 37:852–857. https://doi.org/10.1038/s41587-019-0209-9
  2. Andrews S. FastQC: A quality control analysis tool for high throughput sequencing data. GitHub. https://github.com/s-andrews/FastQC
  3. Callahan BJ, McMurdie PJ, Rosen MJ, Han A, Johnson AJA, Holmes S. 2016. DADA2: High-resolution sample inference from Illumina amplicon data. Nature Methods 13:581–583. https://doi.org/10.1038/nmeth.3869

Please acknowledge the BRC in your manuscript or presentation. If you think our analysis contributes intellectually to your research, please consider authorship for our bioinformaticians.