UW Biotechnology Center Bioinformatics Resource Center
Please direct questions to: brc@biotech.wisc.edu
PI Name | Name |
Project Description | Mouse fecal samples 16S (V3-V4) |
Submission | 16S_Illumina |
Number of samples | 8 |
Report generation date | 2024-11-12 |
Number of Features | 542 |
Total Frequency | 1615142 |
Sample ID | Input | Filtered | Retained after filtering (%) | Denoised | Retained after denoising (%) | Non-chimeric | Retained after chimera removal (%) | Sample ID |
---|---|---|---|---|---|---|---|---|
C109 | 210,517 | 186,907 | 88.78 | 186,321 | 88.51 | 185,196 | 87.97 | C109 |
C110 | 258,515 | 230,223 | 89.06 | 226,183 | 87.49 | 214,232 | 82.87 | C110 |
D111 | 251,222 | 223,516 | 88.97 | 219,611 | 87.42 | 211,114 | 84.03 | D111 |
D112 | 321,828 | 287,150 | 89.22 | 282,927 | 87.91 | 269,369 | 83.70 | D112 |
G058 | 290,016 | 254,425 | 87.73 | 253,183 | 87.30 | 249,013 | 85.86 | G058 |
G059 | 309,802 | 274,197 | 88.51 | 270,486 | 87.31 | 265,006 | 85.54 | G059 |
H060 | 321,432 | 285,725 | 88.89 | 281,822 | 87.68 | 271,868 | 84.58 | H060 |
H061 | 30,370 | 24,285 | 79.96 | 23,528 | 77.47 | 23,489 | 77.34 | H061 |
The table above summarizes the number of reads
retained at each processing step. Input lists the total raw
reads per sample. Filtered shows remaining reads after removing
reads with low-quality scores. Denoised indicates the number of
reads retained after applying the DADA2 algorithm, which corrects
sequencing errors and removes artifacts by inferring biological
sequences from the data. Non-chimeric contains reads present
after excluding potential chimeric reads formed during PCR
amplification. Retention percentages at each step are calculated
relative to the initial number of reads.
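The retention percentages can be reproduced directly from the read counts. The following is a minimal sketch that uses the values for two samples from the table above; the function and variable names are illustrative and are not part of the pipeline.

```python
# Read counts copied from the table above for two samples, in the order
# (input, filtered, denoised, non-chimeric).
READ_COUNTS = {
    "C109": (210_517, 186_907, 186_321, 185_196),
    "H061": (30_370, 24_285, 23_528, 23_489),
}


def retention_percentages(counts):
    """Return the percent of input reads remaining after each later step."""
    input_reads = counts[0]
    return [round(100 * step / input_reads, 2) for step in counts[1:]]


for sample_id, counts in READ_COUNTS.items():
    filtered, denoised, non_chimeric = retention_percentages(counts)
    print(
        f"{sample_id}: filtered={filtered}% "
        f"denoised={denoised}% non-chimeric={non_chimeric}%"
    )
```

Every percentage is divided by the Input column rather than by the preceding step, which is why the three values for a sample can only stay level or decrease from left to right.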
Figure 1.1: Initial quantity of unique and duplicate sequence reads for each sample.
The barplot above illustrates the overall number of sequencing
reads for each sample. Overall read counts are broken down by unique
(light blue) and duplicate reads (dark blue).
Figure 1.2: Sequencing quality (Phred) scores along the read length for the forward and reverse reads of each sample.
The line plot above displays the sequencing quality scores for
each sample, shown as Phred scores on the y-axis. This
visualization summarizes the overall performance of the
sequencing run and highlights samples with low sequencing quality. Each line represents the forward or reverse read for a
given sample, allowing sequencing quality to be compared across the
paired-end data. These scores reflect the accuracy of base calls across the
read length, which is represented on the x-axis. High-quality scores (typically
above 30, shaded green) indicate a high level of accuracy, while scores
below this threshold (shaded yellow and red) may suggest potential
quality issues. Reports for individual samples can be found in the
directory ../fastqc and a combined report is located in the
directory ../multiqc.
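For reference, a Phred score Q is related to the estimated probability P that a base call is wrong by Q = -10 log10(P). The short sketch below converts between the two; the function names are illustrative.

```python
import math


def phred_to_error_probability(q):
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Convert a base-call error probability to a Phred quality score."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    p = phred_to_error_probability(q)
    print(f"Q{q}: error probability {p:.4%}, accuracy {1 - p:.4%}")
```

Under this definition, the threshold of 30 mentioned above corresponds to an expected error rate of 1 in 1,000 base calls.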
Figure 2.1: Relative abundance and NCBI taxonomy ID for each taxon identified.
The table above contains the relative abundance of taxa across
samples. Each row represents a distinct taxonomic group. The first
column corresponds to the NCBI taxonomy ID (if present). Subsequent
columns represent the taxonomic hierarchy (domain, phylum,
class, order, family, and genus), making the taxonomic
relationships explicit. A total of 65 genera and 10 phyla were classified. Sample IDs are provided with corresponding relative
abundance values for each taxon. Higher values indicate taxa with
greater abundance in a given sample. Both relative and absolute abundance data at each taxonomic level are
located in the directory ../analysis/rel_table.
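As an illustration of how relative abundance is derived from read counts, the sketch below divides each taxon's count by the sample total. The counts and taxon labels are hypothetical, and whether the report expresses values as proportions or percentages is an assumption left open here.

```python
def relative_abundance(counts):
    """Convert per-taxon read counts for one sample into proportions."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {taxon: n / total for taxon, n in counts.items()}


# Hypothetical read counts for a single sample (not taken from this report).
example_counts = {"taxon_a": 600, "taxon_b": 300, "taxon_c": 100}

for taxon, fraction in relative_abundance(example_counts).items():
    print(f"{taxon}: {fraction:.1%}")
```

Because each sample is scaled by its own total, relative abundances are comparable across samples with very different read depths, such as H061 and the other samples in this run.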
Figure 2.2: Stacked barplot depicting relative abundance of major phyla.
The barplot above illustrates the distribution of
taxa at the phylum level across different samples. The number of
taxa is restricted to a maximum of 30. The height of each
color segment reflects the proportional abundance of each taxon within
the overall community. Samples are organized based on the values in
the Genotype, Sex, and DOB variables from metadata.
Figure 2.3: Heatmap depicting taxa abundance using absolute read counts.
The heatmap above illustrates the abundance of taxa
across samples. The number of genera is restricted to a maximum
of 30 genera across all samples. Each cell in the heatmap
represents the unnormalized absolute read count of a specific genus, with color gradients
indicating different levels of abundance. Samples are organized based on the values in
the Genotype, Sex, and DOB variables from metadata.
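The report does not state how the 30 genera shown are chosen. One common approach, sketched below as an assumption, is to rank genera by their total read count across all samples and keep the top N. The table layout, genus labels, and function name are hypothetical.

```python
def top_n_taxa(count_table, n=30):
    """Return the n taxa with the highest total count across samples.

    count_table maps taxon -> {sample_id: read_count}. Ties are broken
    alphabetically so the result is deterministic.
    """
    totals = {
        taxon: sum(per_sample.values())
        for taxon, per_sample in count_table.items()
    }
    ranked = sorted(totals, key=lambda taxon: (-totals[taxon], taxon))
    return ranked[:n]


# Hypothetical counts for three genera in two samples.
example_table = {
    "genus_a": {"S1": 120, "S2": 80},
    "genus_b": {"S1": 5, "S2": 10},
    "genus_c": {"S1": 300, "S2": 250},
}

print(top_n_taxa(example_table, n=2))
```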
Figure 2.4: Boxplot of three different alpha diversity indices.
Boxplots of three alpha diversity metrics (the Chao1, Shannon, and
Inverse Simpson indices) across different groups or treatment conditions.
The boxplots display the median, interquartile range, and potential
outliers, providing a visual summary of the diversity patterns and
variability within the sampled communities. Boxes are color-coded based on the values in the Genotype variable from metadata.
The Chao1 index is a species richness estimator,
which accounts for the number of observed taxa and adjusts for the
presence of rare, undetected taxa. It provides an estimate of the total
species richness, including species that may have been missed due to
undersampling. The Shannon index combines both species richness and
evenness, measuring the diversity of the community by considering not
only how many species are present but also how evenly distributed they
are. A higher Shannon index reflects a community where species are more
evenly represented, whereas lower values indicate a community dominated
by a few species. The Inverse Simpson index emphasizes dominance and
gives more weight to common species, with higher values indicating
greater diversity and less dominance by a single species. It is
particularly sensitive to the presence of dominant taxa, making it
useful for understanding the balance of species in the community. Additional alpha diversity information can be found in the
directory ../analysis/alpha_diversity.
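The three indices can be computed from a vector of per-taxon read counts for a single sample. The sketch below uses the textbook definitions, with the bias-corrected form of Chao1 and the natural logarithm for Shannon. These choices are assumptions: the report does not specify which variants or which software produced its values.

```python
import math


def shannon(counts):
    """Shannon index H = -sum(p_i * ln(p_i)) over taxa with nonzero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def inverse_simpson(counts):
    """Inverse Simpson index 1 / sum(p_i ** 2)."""
    total = sum(counts)
    return 1 / sum((c / total) ** 2 for c in counts if c > 0)


def chao1(counts):
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1)).

    F1 and F2 are the numbers of taxa observed exactly once and twice.
    """
    observed = sum(1 for c in counts if c > 0)
    singletons = sum(1 for c in counts if c == 1)
    doubletons = sum(1 for c in counts if c == 2)
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))


# Hypothetical communities: one perfectly even, one dominated by a single taxon.
even = [10, 10, 10, 10]
skewed = [37, 1, 1, 1]

for name, counts in (("even", even), ("skewed", skewed)):
    print(
        f"{name}: Chao1={chao1(counts):.2f} "
        f"Shannon={shannon(counts):.3f} "
        f"InvSimpson={inverse_simpson(counts):.3f}"
    )
```

Both example communities contain four observed taxa, but the even community scores higher on Shannon and Inverse Simpson, while the skewed community has a higher Chao1 estimate because its three singletons suggest undetected rare taxa.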
Figure 2.5: Beta diversity visualized using principal coordinates analysis (PCoA) scatter plot.
The scatterplot above illustrates beta diversity of
samples, based on the Bray-Curtis dissimilarity metric. The Bray-Curtis
metric measures differences in species abundance between samples,
focusing on both the presence/absence and the abundance of shared taxa.
This metric is particularly sensitive to changes in dominant species,
making it effective for highlighting differences in community structure.
Each point in the scatter plot represents a sample, and the
positioning reflects pairwise dissimilarities in microbial community
composition. The axes represent the principal coordinates that explain
the greatest variance in the dataset, with the percentage of variance
explained by each axis indicated.
Points are color-coded based on the values in
the Genotype variable from metadata. Confidence ellipses are shown around groups with three or
more members, representing the variation within each group. Clustering of points suggests similarities in community
composition and ecological characteristics, while greater distances
between points indicate more pronounced differences in structure. This
visualization helps identify patterns of similarity and dissimilarity,
providing insights into how environmental or treatment conditions affect
microbial community composition. Additional beta diversity information
can be found in the directory ../analysis/beta_diversity.
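The Bray-Curtis dissimilarity between two samples is one minus twice the summed minimum count of each taxon, divided by the total count of both samples. The sketch below implements that definition on hypothetical count vectors. Whether the pipeline applies it to raw, rarefied, or proportional counts is not stated in the report, so that choice is left to the caller.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two equal-length abundance vectors.

    Returns 0.0 for identical samples and 1.0 when no taxa are shared.
    """
    if len(a) != len(b):
        raise ValueError("abundance vectors must have the same length")
    total = sum(a) + sum(b)
    if total == 0:
        raise ValueError("both samples are empty")
    shared = sum(min(x, y) for x, y in zip(a, b))
    return 1 - 2 * shared / total


def pairwise_bray_curtis(samples):
    """Return a {(id_1, id_2): dissimilarity} dict for every pair of samples."""
    ids = sorted(samples)
    return {
        (i, j): bray_curtis(samples[i], samples[j])
        for pos, i in enumerate(ids)
        for j in ids[pos + 1:]
    }


# Hypothetical counts for three taxa in three samples.
example_samples = {"S1": [6, 7, 4], "S2": [10, 0, 6], "S3": [6, 7, 4]}

for pair, value in pairwise_bray_curtis(example_samples).items():
    print(pair, round(value, 3))
```

A matrix of these pairwise values is the input to an ordination such as PCoA, which then places similar samples close together in the plot.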
Figure 2.6: Beta diversity visualized using non-metric multidimensional scaling (NMDS) scatter plot.
The scatter plot above illustrates beta diversity among
samples based on community composition. NMDS is an ordination technique
that arranges samples in reduced-dimensional space while preserving the
rank order of dissimilarities.
Each point represents a sample, with the distances between points
reflecting differences in community composition. Closer points indicate
more similar communities, while greater distances represent more
dissimilar communities. Points are color-coded based on the values in the Genotype variable from metadata. Confidence ellipses are shown around groups with three or
more members, representing the variation within each group.
Figure 2.7: Phylogenetic relationships among major phyla
Figure 2.8: Phylogenetic relationships among selected genera
Phylogenetic trees above illustrate the evolutionary
relationships among phyla (top) and genera (bottom). A maximum of 100 taxa were included in each
plot. The branching structure reflects the inferred evolutionary
history, with shorter branch lengths between tips indicating more
closely related genera. Each tip of the tree is labeled with a genus or
phylum and bubble sizes at the tips correspond to the relative abundance
of each taxon across samples, integrating both phylogenetic and
ecological information. Larger bubbles represent more abundant taxa,
whereas smaller bubbles highlight those present at lower abundances. Bubbles are color-coded based on the values in
the Genotype variable from metadata.
This visualization emphasizes the phylogenetic diversity within the
community, as well as the distribution of taxon abundances at different taxonomic
levels. The tree provides insights into which taxa are more
evolutionarily distinct and how these relate to their ecological
prevalence in the dataset. Variations in abundance patterns
among closely or distantly related genera may suggest ecological or
evolutionary factors influencing community structure. Additional
phylogeny information can be found in the directory
../analysis/phylogeny.
Figure 2.9: Alpha diversity rarefaction curves
Figure 2.10: Feature count rarefaction curves
The line plot above shows alpha diversity rarefaction curves
for Chao1, Shannon, and Inverse Simpson indices across varying sample
depths. Each line represents the diversity estimate as a function of
sequencing depth, with Chao1 reflecting species richness (including
rare species), Shannon representing both richness and evenness, and
Inverse Simpson accounting for species dominance. The curves indicate
how diversity estimates stabilize or fluctuate as sequencing effort
increases, providing insight into whether sufficient sampling depth
has been achieved. Comparing the three indices offers a comprehensive
view of community structure, with Chao1 capturing potential unseen
species, Shannon balancing richness and evenness, and Inverse Simpson
highlighting dominant taxa within the community. Lines are color-coded based on the values in the Genotype variable from metadata. Additional rarefaction information can be found in the
directory ../analysis/rarefaction.
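The idea behind a rarefaction curve can be sketched by subsampling reads from one sample at increasing depths and counting the distinct taxa observed at each depth. The example below is a simplified illustration with hypothetical counts. It draws a single random ordering of the reads, whereas rarefaction tools typically repeat the subsampling several times per depth and average the results.

```python
import random


def rarefaction_curve(counts, depths, seed=0):
    """Return (depth, observed_taxa) pairs for one sample.

    counts maps taxon -> read count. Reads are shuffled once and the first
    `depth` reads are inspected, so the curve never decreases with depth.
    Depths larger than the total read count are skipped.
    """
    reads = [taxon for taxon, n in counts.items() for _ in range(n)]
    rng = random.Random(seed)
    rng.shuffle(reads)
    curve = []
    for depth in sorted(depths):
        if depth > len(reads):
            break
        curve.append((depth, len(set(reads[:depth]))))
    return curve


# Hypothetical sample with one dominant and several rarer taxa.
example_counts = {"taxon_a": 50, "taxon_b": 30, "taxon_c": 15, "taxon_d": 5}

for depth, observed in rarefaction_curve(example_counts, [0, 10, 25, 50, 100]):
    print(f"depth={depth:>3} observed taxa={observed}")
```

A curve that flattens before the maximum depth suggests that additional sequencing would recover few new taxa for that sample.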
Microbiome analysis was conducted using Quantitative
Insights Into Microbial Ecology 2 (QIIME2) [1]. QIIME2 is a
bioinformatics tool designed for the analysis and interpretation of
microbial community data. It provides a comprehensive set of tools for
processing and analyzing high-throughput sequencing data from microbial
communities. Initial assessment of the sequencing data quality was
performed using FastQC [2]. FastQC is a widely used tool for evaluating
the quality of high-throughput sequencing data. It generates detailed
reports and visualizations to identify potential issues such as
sequencing errors, adapter contamination, and other quality-related
metrics. Sequencing reads underwent denoising and quality filtering
using the DADA2 algorithm [3]. DADA2 is a bioinformatics pipeline that
models and corrects errors in Illumina-sequenced amplicon data to
infer the true biological sequences, known as amplicon sequence variants (ASVs), present
in the data. This step helps to reduce noise and improve the accuracy of
downstream analyses.
Sequence variants obtained from the denoised data were
aligned and masked using MAFFT. MAFFT is a multiple sequence alignment
program commonly used for aligning nucleotide and amino acid sequences.
A phylogenetic tree of the amplicon sequence variants (ASVs) was
subsequently constructed using FastTree. This tree provides insights into
the evolutionary relationships among the identified microbial sequences.
Taxonomy was assigned to the ASVs using a naive Bayes classifier
pretrained on the SILVA database. The SILVA database is a comprehensive
reference database containing ribosomal RNA (rRNA) gene sequences and
associated taxonomy. This step enables the classification of microbial
taxa based on their genetic signatures.
Alpha diversity was calculated using Chao1, Shannon, and inverse
Simpson indices. Beta diversity was calculated using the Bray-Curtis
dissimilarity applied to ASV data. Bray-Curtis dissimilarity measures
the compositional dissimilarity between microbial communities. Ordination
plots were generated to visualize and interpret the relationships among
samples based on their beta diversity. Alpha rarefaction curves were
generated for all samples, using the observed features metric. This
analysis helps to assess the sequencing depth and coverage by plotting
the number of observed features (ASVs) against the sequencing depth.
The rarefaction analysis helps confirm that comparisons of diversity metrics are not driven
by differences in sequencing depth.
Figure 3.1: Diagram of bioinformatic pipeline used for analysis
Component | Software or version |
---|---|
QIIME2 | 2024.5 |
Denoising | DADA2 |
Alignment | MAFFT |
Reference Database | silva-138-99-nb-classifier.qza |
Taxonomy Assignment | classify-sklearn |
Phylogeny Generation | FastTree |
This table lists the software packages and
corresponding versions used in the analysis. The analysis was
conducted using QIIME2, an open-source platform for microbial
community analysis, alongside DADA2, which performs
high-resolution sample inference by denoising amplicon sequence
data. The table provides details on the core packages used for
each step, including sequence processing, quality filtering,
taxonomic classification, and diversity analysis, ensuring
reproducibility in the computational workflow.
Please acknowledge BRC in your manuscript or presentation. If
you think our analysis contributes to your research intellectually,
please consider authorship for our bioinformaticians.