UW Biotechnology Center Bioinformatics Resource Center
Please direct questions to: brc@biotech.wisc.edu
PI Name | Name |
Project Description | Mouse fecal samples shotgun |
Submission | WGS_Illumina |
Number of samples | 8 |
Report generation date | 2024-11-12 |
Species identified | 127 |
\(~\)
SampleID | Raw | Trimmed | Trimmed retained (%) | Filtered | Filtered retained (%) |
---|---|---|---|---|---|
C109 | 111,843,417 | 99,846,432 | 89.27 | 98,720,036 | 88.27 |
C110 | 90,457,933 | 79,098,890 | 87.44 | 77,974,534 | 86.20 |
D111 | 75,252,117 | 67,225,199 | 89.33 | 65,920,407 | 87.60 |
D112 | 75,281,562 | 66,790,513 | 88.72 | 65,453,826 | 86.95 |
G058 | 65,318,143 | 58,329,423 | 89.30 | 56,432,454 | 86.40 |
G059 | 66,321,451 | 57,883,013 | 87.28 | 54,979,500 | 82.90 |
H060 | 72,512,627 | 62,524,240 | 86.23 | 51,392,279 | 70.87 |
H061 | 78,616,179 | 71,171,535 | 90.53 | 62,733,110 | 79.80 |
The table above summarizes the number of sequencing
reads at three stages of processing: raw, trimmed, and filtered.
Raw represents the initial number of sequencing reads
generated for each sample. Trimmed shows the number
of reads remaining after removing adapter sequences, primers,
and low-quality bases from the ends of reads using Trimmomatic.
Filtered contains the number of reads following
alignment to reference genomes for decontamination using
Bowtie2. The percentages of reads retained at each step are
calculated relative to the original number of reads. Trimmed and
filtered reads used for downstream analysis can be found in the
directory ../bowtie2.
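As a quick check of how the retained percentages relate to the read counts, the calculation described above can be sketched as follows. The function name is hypothetical and is not part of the BRC pipeline; the counts are those reported for sample C109.

```python
def retained_percent(raw_reads, remaining_reads):
    """Percentage of reads remaining, relative to the raw read count."""
    if raw_reads <= 0:
        raise ValueError("raw_reads must be positive")
    return round(100.0 * remaining_reads / raw_reads, 2)


# Read counts for sample C109, taken from the table above.
raw, trimmed, filtered = 111_843_417, 99_846_432, 98_720_036

print("Trimmed retained (%):", retained_percent(raw, trimmed))
print("Filtered retained (%):", retained_percent(raw, filtered))
```

Both percentages use the raw count as the denominator, so the filtered value is the cumulative retention after trimming and decontamination, not the retention of the filtering step alone.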
\(~\)
Figure 1.1: Initial quantity of unique and duplicate sequence reads for each sample.
The barplot above illustrates the overall number of sequencing
reads for each sample. Overall read counts are broken down by unique
(light blue) and duplicate reads (dark blue).
\(~\)
Figure 1.2: Per-base sequencing quality (Phred) scores for the forward and reverse reads of each sample.
The line plot above displays the sequencing quality scores for
each sample, with Phred scores on the y-axis and position along the
read on the x-axis. This visualization offers insight into the overall
performance of the sequencing run and highlights samples with low
sequencing quality. Each line represents the forward or reverse read for a
given sample, which allows quality to be compared across paired-end
sequencing data. Phred scores reflect the accuracy of base calls. High-quality scores (typically
above 30, shaded green) indicate a high level of accuracy, while scores
below this threshold (shaded yellow and red) may suggest potential
quality issues. Reports for individual samples can be found in the
directory ../fastqc and a combined report is located in the
directory ../multiqc.
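To make the threshold of 30 concrete, the standard Phred definition relates a quality score Q to the probability P that a base call is wrong by Q = -10 log10(P). The short sketch below converts between the two; the function names are illustrative only.

```python
import math


def phred_to_error_probability(q):
    """Probability that a base call with Phred score q is incorrect."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred score corresponding to a base-call error probability p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(q, phred_to_error_probability(q))
```

Under this definition, a score of 30 corresponds to an expected error rate of 1 in 1,000 base calls, and a score of 20 to 1 in 100.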
\(~\)
Figure 1.3: Binning and taxonomic classification of MAGs.
This table presents a summary of metagenome-assembled
genomes (MAGs). Following assembly, contigs are clustered into draft genome
bins. Refinement steps to optimize genome completeness and minimize
contamination were conducted. A total of 638 refined MAGs were
produced. Following refinement, quality assessment and taxonomic
classification of MAGs were performed. A more comprehensive table can be
found at ../analysis/checkm_gtdbtk.tsv.
Refined MAGs can be found in the directory ../metawrap.
The column labeled ‘Bin ID’ lists the name for each MAG bin. Marker lineage
contains the taxonomic lineage assigned based on specific marker genes
found within the MAG. The subsequent columns labeled 0, 1, 2, 3, 4, 5+
identify the distribution of unique and shared single-copy markers
identified. These markers are used to estimate MAG completeness and
contamination. Completeness indicates the estimated percentage of the
genome that is represented in the MAG. Higher completeness values
suggest that most of the expected genes for that genome are present.
Contamination represents the percentage of the genome that may contain
sequences from other organisms, signaling potential contamination in the
binning process. Lower contamination is ideal, as high values suggest
non-target or mixed-species sequences within the MAG. Strain
heterogeneity measures variability within a MAG that could indicate
mixed strains or species. High strain heterogeneity may reflect issues
with the binning process or genomic diversity within a species. The
column labeled Classification provides the hierarchical taxonomy assigned
to the MAG by GTDB-Tk.
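One way to use the completeness and contamination columns described above is to shortlist bins that pass a quality cutoff. The sketch below is an illustration under stated assumptions: the column headers ("Bin ID", "Completeness", "Contamination") are assumed to match the table, the three rows are made-up values rather than results from this project, and the cutoffs (more than 90% complete, less than 5% contaminated) follow a commonly used convention for high-quality draft genomes, not a criterion stated in this report.

```python
import csv
import io

# Made-up rows for illustration only; not values from this project.
EXAMPLE_TSV = (
    "Bin ID\tCompleteness\tContamination\n"
    "bin.1\t98.5\t0.8\n"
    "bin.2\t72.1\t3.4\n"
    "bin.3\t95.0\t7.9\n"
)


def high_quality_bins(tsv_text, min_completeness=90.0, max_contamination=5.0):
    """Return the IDs of bins above the completeness and below the contamination cutoff."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        row["Bin ID"]
        for row in reader
        if float(row["Completeness"]) > min_completeness
        and float(row["Contamination"]) < max_contamination
    ]


print(high_quality_bins(EXAMPLE_TSV))
```

To apply the same filter to the project table, the text of ../analysis/checkm_gtdbtk.tsv could be passed in place of the example string, after confirming that its column headers match the assumed names.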
\(~\)
Figure 2.1: Relative abundance and NCBI taxonomy ID for each taxon identified.
The table above contains relative abundance of taxa across
samples. Each row represents a distinct taxonomic group. The first
column corresponds to the NCBI taxonomy ID (if present). Subsequent
columns represent the taxonomic hierarchy, including domain, phylum,
class, order, family, genus, and species, allowing for a clear understanding of taxonomic
relationships. A total of 127 species, 80 genera, and 11 phyla were classified. Sample IDs are provided with corresponding relative
abundance values for each taxon. Higher values indicate taxa with
greater abundance in a given sample.
\(~\)
Figure 2.2: Stacked barplot depicting relative abundance of major phyla.
The barplot above illustrates the distribution of
taxa at the phylum level across different samples. The number of
taxa is restricted to a maximum of 30. The height of each
color segment reflects the proportional abundance of each taxon within
the overall community. Samples are organized based on the values in
the Genotype, Sex, and DOB variables from metadata. Interactive barplots can be viewed using the file
../analysis/taxa/taxa-bar-plots.qzv and the website
https://view.qiime2.org/. Merged taxonomic output can be found in the directory
../analysis. Merged output is divided by taxonomic
rank and presented both as read counts and as relative abundance. MetaPhlAn output can be found with the prefix
metaphlan_*. Kraken2 and Bracken output can be found with the prefix
kraken2_* and bracken_*, respectively.
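The relationship between the read-count and relative-abundance versions of the merged output can be sketched as a simple per-sample normalization. This is a minimal illustration, assuming that relative abundance is expressed as a percentage of the classified reads in a sample; the taxon names and counts below are hypothetical, and the exact normalization used by the pipeline is not stated in this report.

```python
def to_relative_abundance(read_counts):
    """Convert per-taxon read counts for one sample into percentages of the total."""
    total = sum(read_counts.values())
    if total == 0:
        # A sample with no classified reads has no defined composition.
        return {taxon: 0.0 for taxon in read_counts}
    return {taxon: 100.0 * count / total for taxon, count in read_counts.items()}


# Hypothetical counts for a single sample.
example_counts = {"Bacteroidota": 600, "Bacillota": 300, "Pseudomonadota": 100}
print(to_relative_abundance(example_counts))
```

Because each sample is scaled by its own total, relative abundances are comparable across samples with different sequencing depths, whereas the raw read counts are not.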
\(~\)
Figure 2.3: Heatmap depicting taxa abundance using absolute read counts.
The heatmap above illustrates the abundance of taxa
across samples. The heatmap is restricted to a maximum of
30 species across all samples. Each cell in the heatmap
represents the unnormalized absolute read counts of a specific species, with color gradients
indicating different levels of abundance. Samples are organized based on the values in
the Genotype, Sex, and DOB variables from metadata.
\(~\)
Figure 2.4: Boxplot of three different alpha diversity indices.
Boxplots of three alpha diversity metrics (Chao1, Shannon, and
Inverse Simpson indices) across different groups or treatment conditions.
The boxplots display the median, interquartile range, and potential
outliers, providing a visual summary of the diversity patterns and
variability within the sampled communities. Boxes are color-coded based on the values in the Genotype variable from metadata.
The Chao1 index is a species richness estimator,
which accounts for the number of observed taxa and adjusts for the
presence of rare, undetected taxa. It provides an estimate of the total
species richness, including species that may have been missed due to
undersampling. The Shannon index combines both species richness and
evenness, measuring the diversity of the community by considering not
only how many species are present but also how evenly distributed they
are. The Inverse Simpson index emphasizes dominance and
gives more weight to common species, with higher values indicating
greater diversity and less dominance by a single species.
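The three indices described above can be written out directly from their textbook definitions. The sketch below is illustrative only: the report does not state which implementation produced the boxplots, the Chao1 function uses the bias-corrected form S_obs + F1(F1 - 1) / (2(F2 + 1)), where F1 and F2 are the numbers of taxa observed exactly once and twice, and the example counts are hypothetical.

```python
import math


def chao1(counts):
    """Bias-corrected Chao1 richness estimate from integer taxon counts."""
    observed = [c for c in counts if c > 0]
    singletons = sum(1 for c in observed if c == 1)
    doubletons = sum(1 for c in observed if c == 2)
    return len(observed) + singletons * (singletons - 1) / (2 * (doubletons + 1))


def shannon(counts):
    """Shannon index H = -sum(p * ln p) over taxa with non-zero counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one read")
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def inverse_simpson(counts):
    """Inverse Simpson index 1 / sum(p^2)."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one read")
    return 1.0 / sum((c / total) ** 2 for c in counts if c > 0)


# Hypothetical counts for one sample.
example = [5, 1, 1, 2]
print(chao1(example), shannon(example), inverse_simpson(example))
```

For a perfectly even community of four taxa, the Shannon index equals ln(4) and the Inverse Simpson index equals 4, which is why higher values of both are read as greater diversity.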
Figure 2.5: Beta diversity visualized using principal coordinates analysis (PCoA) scatter plot.
The scatter plot above illustrates beta diversity of
samples, based on the Bray-Curtis dissimilarity metric. The Bray-Curtis
metric measures differences in species abundance between samples,
focusing on both the presence/absence and the abundance of shared taxa.
The axes represent the principal coordinates that explain
the greatest variance in the dataset, with the percentage of variance
explained by each axis indicated.
Points are color-coded based on the values in
the Genotype variable from metadata. Confidence ellipses are shown around groups with three or
more members, representing the variation within each group. Clustering of points suggests similarities in community
composition and ecological characteristics, while greater distances
between points indicate more pronounced differences in structure. This
visualization helps identify patterns of similarity and dissimilarity,
providing insights into how environmental or treatment conditions affect
microbial community composition.
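For reference, the Bray-Curtis dissimilarity between two samples with abundances x and y over the same taxa is the sum of |x_i - y_i| divided by the sum of (x_i + y_i). A minimal sketch, with hypothetical abundance vectors:

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two abundance vectors over the same taxa."""
    if len(x) != len(y):
        raise ValueError("abundance vectors must have the same length")
    total = sum(x) + sum(y)
    if total == 0:
        # Two empty samples are treated here as identical; this is a convention.
        return 0.0
    return sum(abs(a - b) for a, b in zip(x, y)) / total


# Hypothetical abundances of two taxa in two samples.
print(bray_curtis([6, 4], [2, 8]))
```

The value ranges from 0 for identical samples to 1 for samples that share no taxa. The PCoA plot is an ordination of the matrix of these pairwise values.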
Figure 2.6: Beta diversity visualized using non-metric multidimensional scaling (nMDS) scatter plot.
The scatter plot above illustrates beta diversity among
samples based on community composition. nMDS is an ordination technique
that arranges samples in reduced-dimensional space while preserving the
rank order of dissimilarities.
Each point represents a sample, with the distances between points
reflecting differences in community composition. Closer points indicate
more similar communities, while greater distances represent more
dissimilar communities. Points are color-coded based on the values in the Genotype variable from metadata. Confidence ellipses are shown around groups with three or
more members, representing the variation within each group.
\(~\)
Metagenomic analysis was conducted through a
structured pipeline with quality control, taxonomic classification and metagenome assembly/binning stages:
Quality Control: Raw sequence reads underwent a series of quality
control steps to ensure data reliability. Initially, low-quality bases
and adapters were trimmed using Trimmomatic to maintain high read
integrity [1]. Following trimming, contamination was screened by
aligning reads against a custom reference database using Bowtie2, and
alignments were processed with SAMtools to remove any reads potentially
originating from contaminant sources [2][3]. The quality of raw,
trimmed, and filtered reads was then evaluated using FastQC, which
allowed for visualization of read quality metrics to confirm the
effectiveness of preprocessing steps and identify any residual issues [4].
Taxonomic Classification and Abundance Estimation: Kraken2 was used to perform an initial classification of
reads to taxonomic ranks [5]. This was followed by abundance estimation via Bracken,
which refines Kraken2’s classifications to provide more accurate
species-level abundance profiles [6]. MetaPhlAn was applied to classify microbial taxa based on
unique clade-specific markers [7].
Metagenome Assembly and Binning: Reads were assembled and binned to
reconstruct genome bins, representing draft genomes for
microbial community members. Initial metagenomic assembly was performed
using SPAdes [8], and assembly quality was evaluated using QUAST [9]. Assembly outputs were binned and refined using
MetaWRAP, which employs multiple binning algorithms to produce optimized
bins by consolidating and improving the initial bins [10]. The quality of each bin was subsequently assessed using
CheckM, which evaluates bin completeness and contamination levels to
ensure that high-quality, near-complete bins are retained for downstream
analysis [11]. Finally, each bin was assigned taxonomy through the
GTDB-Tk tool, which classifies bins to standardized taxonomic ranks
based on the Genome Taxonomy Database, providing taxonomic resolution
for the assembled bins [12]. \(~\)
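The decontamination step in the Quality Control stage above depends on deciding, from each alignment record, whether a read may derive from a reference genome. The report does not state the exact criterion. The sketch below assumes a conservative rule in which a read is kept only if both it and its mate failed to align, using the standard SAM flag bits (0x4 read unmapped, 0x8 mate unmapped, 0x100 secondary alignment). This rule matches the commonly used samtools selection `-f 12 -F 256`, but it is an assumption, not the documented BRC setting.

```python
READ_UNMAPPED = 0x4    # SAM flag bit: this read did not align
MATE_UNMAPPED = 0x8    # SAM flag bit: the mate did not align
SECONDARY = 0x100      # SAM flag bit: secondary alignment record


def keep_as_non_contaminant(flag):
    """Keep a paired read only if neither mate aligned to a reference genome."""
    both_unmapped = bool(flag & READ_UNMAPPED) and bool(flag & MATE_UNMAPPED)
    return both_unmapped and not (flag & SECONDARY)


# Typical flag values: 77 and 141 are unaligned pairs, 99 is a properly
# aligned first mate, and 73 is an aligned read whose mate is unaligned.
for flag in (77, 141, 99, 73):
    print(flag, keep_as_non_contaminant(flag))
```

Under this rule, a pair is discarded if either mate aligns, which removes more reads than filtering each mate independently but lowers the chance that reads matching a reference genome carry through to downstream analysis.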
Figure 3.1: Diagram of bioinformatic pipeline used for analysis
\(~\)
Package | Version | Database |
---|---|---|
Bowtie2 | 2.5.1 | |
FastQC | 0.11.9 | |
samtools | 1.16.1 | |
Trimmomatic | 0.39 | |
MetaPhlAn | 4.1.1 | mpa_vJun23_CHOCOPhlAnSGB_202403 |
Kraken2 | 2.1.3 | k2_standard_20240605 |
SPAdes | 3.15.2 | |
QUAST | 5.2.0 | |
MetaWRAP | v1.3–a7eb9af | |
CheckM | 1.2.2 | checkm_data_2015_01_16 |
GTDB-Tk | 2.4.0–pyhdfd78af_1 | gtdbtk_r220_data |
This table provides an overview of the software
packages and their respective versions used in the metagenomic
analysis. Each package played a specific role in different steps
of the workflow, including sequence quality control, taxonomic
classification, and diversity analysis. Listing the package
versions ensures reproducibility of the analysis and allows for
comparison with other studies, offering transparency in the
computational processes applied to the metagenomic data.
\(~\)
Genome ID | FASTA filename |
---|---|
hg38 | GCF_000001405.26_GRCh38_genomic.fna.gz |
phiX | GCF_000819615.1_ViralProj14015_genomic.fna.gz |
GRCm39 | GCF_000001635.27_GRCm39_genomic.fna.gz |
This table presents the genomes utilized as reference
sequences for filtering metagenomic reads during processing. Each entry
lists the internal genome ID and FASTA filename.
\(~\)
Please acknowledge BRC in your manuscript or presentation. If
you think our analysis contributes intellectually to your research,
please consider authorship for our bioinformaticians.
\(~\)