UW Biotechnology Center Bioinformatics Resource Center
Please direct questions to:


1 Summary

1.1 Project summary

PI Name Name
Project Description Mouse fecal samples shotgun
Submission WGS_Illumina
Number of samples 8
Report generation date 2024-11-12
Species identified 127


1.2 Sequence data

Table 1.1: Read counts at each stage of sequence data processing.
Sample ID  Raw          Trimmed     Retained (%)  Filtered    Retained (%)
C109       111,843,417  99,846,432  89.27         98,720,036  88.27
C110       90,457,933   79,098,890  87.44         77,974,534  86.20
D111       75,252,117   67,225,199  89.33         65,920,407  87.60
D112       75,281,562   66,790,513  88.72         65,453,826  86.95
G058       65,318,143   58,329,423  89.30         56,432,454  86.40
G059       66,321,451   57,883,013  87.28         54,979,500  82.90
H060       72,512,627   62,524,240  86.23         51,392,279  70.87
H061       78,616,179   71,171,535  90.53         62,733,110  79.80

The table above summarizes the number of sequencing reads at three stages of processing: raw, trimmed, and filtered. Raw represents the initial number of sequencing reads generated for each sample. Trimmed shows the number of reads remaining after removing adapter sequences, primers, and low-quality bases from the ends of reads using Trimmomatic. Filtered contains the number of reads following alignment to reference genomes for decontamination using Bowtie2. The percentages of reads retained at each step are calculated relative to the original number of reads. Trimmed and filtered reads used for downstream analysis can be found in the directory ../bowtie2.
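As a quick check of how the Retained (%) columns are derived, the sketch below recomputes both percentages for sample C109 from the counts in Table 1.1. The function name `percent_retained` is illustrative only and is not part of the pipeline.

```python
def percent_retained(reads_remaining: int, raw_reads: int) -> float:
    """Return the reads remaining as a percentage of the raw read count."""
    if raw_reads <= 0:
        raise ValueError("raw_reads must be positive")
    return round(100.0 * reads_remaining / raw_reads, 2)


# Read counts for sample C109, copied from Table 1.1.
raw, trimmed, filtered = 111_843_417, 99_846_432, 98_720_036

# Both percentages use the raw count as the denominator.
trimmed_pct = percent_retained(trimmed, raw)
filtered_pct = percent_retained(filtered, raw)
print(f"C109 trimmed: {trimmed_pct}%  filtered: {filtered_pct}%")
```

Because both percentages are relative to the raw count, the filtered value is never larger than the trimmed value for the same sample.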


Figure 1.1: Initial quantity of unique and duplicate sequence reads for each sample.

The barplot above illustrates the overall number of sequencing reads for each sample. Overall read counts are broken down by unique (light blue) and duplicate reads (dark blue).



Figure 1.2: Sequencing quality (Phred) scores across the read length for the forward and reverse reads of each sample.

The line plot above displays the sequencing quality scores for each sample, expressed as Phred scores on the y-axis. This visualization summarizes the overall performance of the sequencing run and highlights samples with low sequencing quality. Each line represents the forward or reverse read for a given sample, allowing the quality of paired-end reads to be compared. The scores reflect the accuracy of base calls across the read length, shown on the x-axis. High quality scores (typically above 30, shaded green) indicate a high level of accuracy, while scores below this threshold (shaded yellow and red) may suggest potential quality issues. Reports for individual samples can be found in the directory ../fastqc and a combined report is located in the directory ../multiqc.
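To make the threshold of 30 concrete, the sketch below applies the standard Phred definition, Q = -10 log10(p), where p is the probability that a base call is wrong. The function names are illustrative and do not come from FastQC or MultiQC.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Convert a base-call error probability (0 < p <= 1) to a Phred score."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)


# Print the error rate and accuracy implied by a few common scores.
for q in (10, 20, 30, 40):
    p = phred_to_error_probability(q)
    print(f"Q{q}: error probability {p:g}, accuracy {100 * (1 - p):.2f}%")
```

Under this definition a score of 30 corresponds to one expected error per 1,000 base calls.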


1.3 Metagenomic assembly and binning

Figure 1.3: Binning and taxonomic classification of MAGs.

This table presents a summary of metagenome-assembled genomes (MAGs). Following assembly, contigs are clustered into draft genome bins. Refinement steps were then conducted to optimize genome completeness and minimize contamination. A total of 638 refined MAGs were produced. Following refinement, quality assessment and taxonomic classification of MAGs were performed. A more comprehensive table can be found at ../analysis/checkm_gtdbtk.tsv. Refined MAGs can be found in the directory ../metawrap.

The column labeled ‘Bin ID’ lists the name for each MAG bin. Marker lineage contains the taxonomic lineage assigned based on specific marker genes found within the MAG. The subsequent columns labeled 0, 1, 2, 3, 4, and 5+ give the number of single-copy marker genes found that many times in the MAG. These markers are used to estimate MAG completeness and contamination. Completeness indicates the estimated percentage of the genome that is represented in the MAG. Higher completeness values suggest that most of the expected genes for that genome are present. Contamination represents the percentage of the genome that may contain sequences from other organisms, signaling potential contamination in the binning process. Lower contamination is ideal, as high values suggest non-target or mixed-species sequences within the MAG. Strain heterogeneity measures variability within a MAG that could indicate mixed strains or species. High strain heterogeneity may reflect issues with the binning process or genomic diversity within a species. The column labeled Classification provides the hierarchical taxonomy assigned to the MAG by GTDB-Tk.
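The sketch below shows how the marker-count columns relate to completeness and contamination. It is a deliberately simplified approximation: CheckM groups markers into collocated marker sets and weights them, so its reported values will differ. The function name and the example counts are hypothetical, and the 5+ column is treated as exactly five copies, which is a lower bound.

```python
def estimate_quality(marker_copy_counts: dict[int, int]) -> tuple[float, float]:
    """Approximate completeness and contamination (both in percent).

    marker_copy_counts maps the number of copies observed (0 to 5, with 5
    standing in for the '5+' column) to the number of marker genes seen
    that many times. Simplification: every marker gets equal weight.
    """
    total = sum(marker_copy_counts.values())
    if total == 0:
        raise ValueError("no marker genes supplied")
    # A marker counts toward completeness if it was found at least once.
    present = total - marker_copy_counts.get(0, 0)
    # Each copy beyond the first counts toward contamination.
    extra_copies = sum(
        (copies - 1) * n for copies, n in marker_copy_counts.items() if copies > 1
    )
    return 100.0 * present / total, 100.0 * extra_copies / total


# Hypothetical MAG: 100 markers, 5 missing, 93 single-copy, 2 duplicated.
completeness, contamination = estimate_quality({0: 5, 1: 93, 2: 2, 3: 0, 4: 0, 5: 0})
print(f"completeness ~{completeness:.1f}%, contamination ~{contamination:.1f}%")
```

The values in ../analysis/checkm_gtdbtk.tsv remain the authoritative ones; this approximation is only meant to show why many markers in the 0 column lower completeness and many in the 2 or higher columns raise contamination.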


2 Microbial diversity

2.1 Taxonomic classification

Figure 2.1: Relative abundance and NCBI taxonomy ID for each taxon identified.

The table above contains relative abundance of taxa across samples. Each row represents a distinct taxonomic group. The first column corresponds to the NCBI taxonomy ID (if present). Subsequent columns represent the taxonomic hierarchy, including domain, phylum, class, order, family, genus, and species, allowing for a clear understanding of taxonomic relationships. A total of 127 species, 80 genera, and 11 phyla were classified. Sample IDs are provided with corresponding relative abundance values for each taxon. Higher values indicate taxa with greater abundance in a given sample.


2.2 Microbial community composition at phylum level

Figure 2.2: Stacked barplot depicting relative abundance of major phyla.

The barplot above illustrates the distribution of taxa at the phylum level across samples. The number of taxa is restricted to a maximum of 30. The height of each color segment reflects the proportional abundance of each taxon within the overall community. Samples are organized based on the values in the Genotype, Sex, and DOB variables from metadata. Interactive barplots can be viewed using the file ../analysis/taxa/taxa-bar-plots.qzv and the website https://view.qiime2.org/. Merged taxonomic output can be found in the directory ../analysis. Merged output is divided by taxonomic rank and presented both as read counts and as counts normalized to relative abundance. MetaPhlAn output can be found with the prefix metaphlan_*. Kraken2 and Bracken output can be found with the prefixes kraken2_* and bracken_*, respectively.
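The normalization from read counts to relative abundance can be sketched as follows. The counts and the function name are hypothetical, and expressing abundance as a fraction of 1 (rather than a percentage) is an assumption; check the merged output files for the scale actually used.

```python
def to_relative_abundance(counts: dict[str, int]) -> dict[str, float]:
    """Divide each taxon's read count by the sample total."""
    total = sum(counts.values())
    if total == 0:
        # A sample with no classified reads has no defined composition.
        return {taxon: 0.0 for taxon in counts}
    return {taxon: n / total for taxon, n in counts.items()}


# Hypothetical phylum-level read counts for a single sample.
sample_counts = {"Bacillota": 600, "Bacteroidota": 300, "Verrucomicrobiota": 100}
relative = to_relative_abundance(sample_counts)
for taxon, fraction in relative.items():
    print(f"{taxon}: {fraction:.2f}")
```

Normalizing each sample separately makes samples with different sequencing depths comparable, which is why the stacked bars all reach the same height.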


2.3 Microbial community composition at species level

Figure 2.3: Heatmap depicting taxa abundance using absolute read counts.

The heatmap above illustrates the abundance of taxa across samples. The number of species is restricted to a maximum of 30 species across all samples. Each cell in the heatmap represents the unnormalized absolute read count of a specific species, with color gradients indicating different levels of abundance. Samples are organized based on the values in the Genotype, Sex, and DOB variables from metadata.


2.4 Alpha diversity

Figure 2.4: Boxplot of three different alpha diversity indices.

The boxplots above show three alpha diversity metrics (the Chao1, Shannon, and Inverse Simpson indices) across groups or treatment conditions. The boxplots display the median, interquartile range, and potential outliers, providing a visual summary of the diversity patterns and variability within the sampled communities. Boxes are color-coded based on the values in the Genotype variable from metadata.
The Chao1 index is a species richness estimator, which accounts for the number of observed taxa and adjusts for the presence of rare, undetected taxa. It provides an estimate of the total species richness, including species that may have been missed due to undersampling. The Shannon index combines species richness and evenness, measuring the diversity of the community by considering not only how many species are present but also how evenly distributed they are. The Inverse Simpson index emphasizes dominance and gives more weight to common species, with higher values indicating greater diversity and less dominance by a single species.
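The three indices can be computed from a vector of per-species read counts as sketched below. This is an illustrative implementation, not the code used to produce Figure 2.4; it assumes the bias-corrected form of Chao1 and the natural logarithm for Shannon, and the example counts are hypothetical.

```python
import math
from collections import Counter


def chao1(counts: list[int]) -> float:
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1))."""
    observed = [c for c in counts if c > 0]
    frequency = Counter(observed)
    f1, f2 = frequency[1], frequency[2]  # singletons and doubletons
    return len(observed) + f1 * (f1 - 1) / (2 * (f2 + 1))


def shannon(counts: list[int]) -> float:
    """Shannon index H = -sum(p_i * ln(p_i)) over observed species."""
    total = sum(counts)
    if total == 0:
        raise ValueError("counts must contain at least one read")
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def inverse_simpson(counts: list[int]) -> float:
    """Inverse Simpson index 1 / sum(p_i ** 2)."""
    total = sum(counts)
    if total == 0:
        raise ValueError("counts must contain at least one read")
    return 1.0 / sum((c / total) ** 2 for c in counts)


# Hypothetical communities: one perfectly even, one with rare species.
even = [10, 10, 10, 10]
uneven = [5, 3, 1, 1, 2]
print(chao1(even), shannon(even), inverse_simpson(even))
print(chao1(uneven), shannon(uneven), inverse_simpson(uneven))
```

For the even community, Chao1 equals the observed richness because there are no singletons, and the Inverse Simpson index equals the number of species. The two singletons in the uneven community push its Chao1 estimate above the five species observed.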

2.5 Beta diversity

Figure 2.5: Beta diversity visualized using principal coordinates analysis (PCoA) scatter plot.

The scatter plot above illustrates beta diversity of samples, based on the Bray-Curtis dissimilarity metric. The Bray-Curtis metric measures differences in species abundance between samples, focusing on both the presence/absence and the abundance of shared taxa. The axes represent the principal coordinates that explain the greatest variance in the dataset, with the percentage of variance explained by each axis indicated.
Points are color-coded based on the values in the Genotype variable from metadata. Confidence ellipses are shown around groups with three or more members, representing the variation within each group. Clustering of points suggests similarities in community composition and ecological characteristics, while greater distances between points indicate more pronounced differences in structure. This visualization helps identify patterns of similarity and dissimilarity, providing insights into how environmental or treatment conditions affect microbial community composition.
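For reference, the Bray-Curtis dissimilarity between two samples is the sum of absolute differences in abundance divided by the total abundance of both samples. The sketch below is a minimal standard-library illustration with hypothetical counts; the function name is not taken from the pipeline.

```python
def bray_curtis(sample_a: list[float], sample_b: list[float]) -> float:
    """Bray-Curtis dissimilarity: sum(|a_i - b_i|) / sum(a_i + b_i)."""
    if len(sample_a) != len(sample_b):
        raise ValueError("samples must list the same taxa in the same order")
    denominator = sum(sample_a) + sum(sample_b)
    if denominator == 0:
        # Two empty samples are treated as identical.
        return 0.0
    numerator = sum(abs(a - b) for a, b in zip(sample_a, sample_b))
    return numerator / denominator


# Hypothetical abundances of three taxa in two samples.
a = [6, 7, 4]
b = [10, 0, 6]
distance = bray_curtis(a, b)
print(f"Bray-Curtis dissimilarity: {distance:.4f}")
```

The value ranges from 0 for identical samples to 1 for samples that share no taxa. A matrix of these pairwise values is what an ordination such as PCoA takes as input.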

Figure 2.6: Beta diversity visualized using non-metric multidimensional scaling (nMDS) scatter plot.

The scatter plot above illustrates beta diversity among samples based on community composition. nMDS is an ordination technique that arranges samples in a reduced-dimensional space while preserving the rank order of the dissimilarities between them.
Each point represents a sample, with the distances between points reflecting differences in community composition. Closer points indicate more similar communities, while greater distances represent more dissimilar communities. Points are color-coded based on the values in the Genotype variable from metadata. Confidence ellipses are shown around groups with three or more members, representing the variation within each group.


3 Methods

3.1 Summary

Metagenomic analysis was conducted through a structured pipeline with quality control, taxonomic classification and metagenome assembly/binning stages:
Quality Control: Raw sequence reads underwent a series of quality control steps to ensure data reliability. Initially, low-quality bases and adapters were trimmed using Trimmomatic to maintain high read integrity [1]. Following trimming, contamination was screened by aligning reads against a custom reference database using Bowtie2, and alignments were processed with SAMtools to remove any reads potentially originating from contaminant sources [2][3]. The quality of raw, trimmed, and filtered reads was then evaluated using FastQC, which allowed for visualization of read quality metrics to confirm the effectiveness of preprocessing steps and identify any residual issues [4].
Taxonomic Classification and Abundance Estimation: Kraken2 was used to perform an initial classification of reads to taxonomic ranks [5]. This was followed by abundance estimation via Bracken, which refines Kraken2’s classifications to provide more accurate species-level abundance profiles [6]. MetaPhlAn was applied to classify microbial taxa based on unique clade-specific markers [7].
Metagenome Assembly and Binning: Metagenomic assembly was performed to reconstruct genome bins, representing draft genomes for microbial community members. Initial metagenomic assembly was performed using SPAdes [8]. Initial assembly quality was evaluated using QUAST [9]. Assembly outputs were refined using MetaWRAP, which employs multiple binning algorithms to produce optimized bins by consolidating and improving initial assemblies [10]. The quality of each bin was subsequently assessed using CheckM, which evaluates bin completeness and contamination levels to ensure that high-quality, near-complete bins are retained for downstream analysis [11]. Finally, each bin was assigned taxonomy through the GTDB-Tk tool, which classifies bins to standardized taxonomic ranks based on the Genome Taxonomy Database, providing taxonomic resolution for the assembled bins [12].

Figure 3.1: Diagram of bioinformatic pipeline used for analysis


3.2 Packages and versions

Table 3.1: Software used for analysis including version and database information
Package Version Database
Bowtie2 2.5.1
FastQC 0.11.9
samtools 1.16.1
Trimmomatic 0.39
MetaPhlAn 4.1.1 mpa_vJun23_CHOCOPhlAnSGB_202403
Kraken2 2.1.3 k2_standard_20240605
SPAdes 3.15.2
QUAST 5.2.0
MetaWRAP v1.3–a7eb9af
CheckM 1.2.2 checkm_data_2015_01_16
GTDB-Tk 2.4.0–pyhdfd78af_1 gtdbtk_r220_data

This table provides an overview of the software packages and their respective versions used in the metagenomic analysis. Each package played a specific role in the workflow, including sequence quality control, taxonomic classification, and metagenome assembly and binning. Listing the package versions supports reproducibility of the analysis and allows for comparison with other studies.


Table 3.2: Genomes used for contamination filtering
Genome ID FASTA filename
hg38 GCF_000001405.26_GRCh38_genomic.fna.gz
phiX GCF_000819615.1_ViralProj14015_genomic.fna.gz
GRCm39 GCF_000001635.27_GRCm39_genomic.fna.gz

This table presents the genomes utilized as reference sequences for filtering metagenomic reads during processing. Each entry lists the internal genome ID and FASTA filename.


4 References

  1. Bolger A, Lohse M, Usadel B. 2014. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30:2114–2120. https://doi.org/10.1093/bioinformatics/btu170.
  2. Langmead B, Salzberg SL. 2012. Fast gapped-read alignment with Bowtie 2. Nature Methods 9:357–359. https://doi.org/10.1038/nmeth.1923.
  3. Li H, Handsaker RE, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis GR, Durbin R. 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25:2078–2079. https://doi.org/10.1093/bioinformatics/btp352.
  4. Andrews S. FastQC: a quality control analysis tool for high throughput sequencing data. GitHub. https://github.com/s-andrews/FastQC.
  5. Wood DE, Lu J, Langmead B. 2019. Improved metagenomic analysis with Kraken 2. Genome Biology 20. https://doi.org/10.1186/s13059-019-1891-0.
  6. Lu J, Breitwieser FP, Thielen P, Salzberg SL. 2017. Bracken: estimating species abundance in metagenomics data. PeerJ Computer Science 3:e104. https://doi.org/10.7717/peerj-cs.104.
  7. Blanco-Míguez A, Beghini F, Cumbo F, McIver LJ, Thompson KN, Zolfo M, Manghi P, Dubois L, Huang K, Thomas AM, Nickols WA, Piccinno G, Piperni E, Punčochář M, Valles‐Colomer M, Tett A, Giordano F, Davies R, Wolf J, Berry S, Spector TD, Franzosa EA, Pasolli E, Asnicar F, Huttenhower C, Segata N. 2023. Extending and improving metagenomic taxonomic profiling with uncharacterized species using MetaPhlAn 4. Nature Biotechnology 41:1633–1644. https://doi.org/10.1038/s41587-023-01688-w.
  8. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, Pyshkin AV, Sirotkin AV, Vyahhi N, Tesler G, Alekseyev MA, Pevzner PA. 2012. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of Computational Biology 19:455–477. https://doi.org/10.1089/cmb.2012.0021.
  9. Gurevich A, Saveliev V, Vyahhi N, Tesler G. 2013. QUAST: quality assessment tool for genome assemblies. Bioinformatics 29:1072–1075. https://doi.org/10.1093/bioinformatics/btt086.
  10. Uritskiy GV, DiRuggiero J, Taylor J. 2018. MetaWRAP—a flexible pipeline for genome-resolved metagenomic data analysis. Microbiome 6. https://doi.org/10.1186/s40168-018-0541-1.
  11. Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. 2015. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Research 25:1043–1055. https://doi.org/10.1101/gr.186072.114.
  12. Chaumeil P-A, Mussig AJ, Hugenholtz P, Parks DH. 2019. GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy Database. Bioinformatics 36:1925–1927. https://doi.org/10.1093/bioinformatics/btz848.

Please acknowledge BRC in your manuscript or presentation. If you think our analysis contributes intellectually to your research, please consider authorship for our bioinformaticians.