User ID: UWBC
Submission ID: M000000
Sample ID: Prokaryote_test
Sanitized sample ID: Prokaryote_test
Barcode ID: barcode54
Run ID: run4440P
Forward primer: ACGTAACGCCAACCTTTTGC
Reverse primer: CTGGTTTTCTGGTGCCATGC
Observed reads: 4064
Filtered reads: 2940
Number of primary alignments to reference: 2707
Number of aligned nucleotides: 2656289
Reference length: 2054
Consensus length: 2060
The consensus sequence includes PASSing variants listed in the tables below.
CHROM | POS | REF | ALT | QUAL | FILTER | P | F |
---|---|---|---|---|---|---|---|
reference | 139 | G | C | 30.16 | PASS | FALSE | TRUE |
reference | 140 | A | T | 28.86 | PASS | FALSE | TRUE |
reference | 141 | A | G | 22.68 | PASS | TRUE | FALSE |
reference | 246 | G | T | 35.92 | PASS | FALSE | TRUE |
reference | 283 | C | CACAGCT | 24.97 | PASS | FALSE | TRUE |
POS | BC | Sample | gt_GT | gt_DP | gt_AF | gt_GT_alleles | IUPAC |
---|---|---|---|---|---|---|---|
139 | barcode54 | Prokaryote_test | 1/1 | 1206 | 0.9320 | C/C | C |
140 | barcode54 | Prokaryote_test | 1/1 | 1206 | 0.9320 | T/T | T |
141 | barcode54 | Prokaryote_test | 1/1 | 1206 | 0.9420 | G/G | G |
246 | barcode54 | Prokaryote_test | 1/1 | 1291 | 0.9744 | T/T | T |
283 | barcode54 | Prokaryote_test | 1/1 | 1283 | 0.9260 | CACAGCT/CACAGCT | CACAGCT/CACAGCT |
sample ID: Prokaryote_test
These results pertain to the consensus sequence including all PASSing variants listed below.
query | sTaxonID | Accession | Description | E_value | Bit_Score | Qstart | Qend | Sstart | Send | Identity | Aln_length | Gaps | Mismatches |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
consensus | 562 | CP054214 | Escherichia coli strain EcPF40 chromosome, co<…> | 0 | 3716 | 1 | 2060 | 4058103 | 4060162 | 100 | 2060 | 0 | 0 |
consensus | 562 | CP054828 | Escherichia coli strain SCU-397 chromosome | 0 | 3716 | 1 | 2060 | 1366159 | 1364100 | 100 | 2060 | 0 | 0 |
consensus | 562 | CP053251 | Escherichia coli strain SCU-204 chromosome, c<…> | 0 | 3716 | 1 | 2060 | 4153650 | 4155709 | 100 | 2060 | 0 | 0 |
consensus | 562 | CP054343 | Escherichia coli strain SCU-164 chromosome | 0 | 3716 | 1 | 2060 | 2491559 | 2489500 | 100 | 2060 | 0 | 0 |
consensus | 562 | CP054556 | Escherichia coli strain LWY24 chromosome, com<…> | 0 | 3716 | 1 | 2060 | 3854610 | 3852551 | 100 | 2060 | 0 | 0 |
consensus | 562 | CP053384 | Escherichia coli strain SCU-107 chromosome, c<…> | 0 | 3716 | 1 | 2060 | 3757759 | 3755700 | 100 | 2060 | 0 | 0 |
consensus | 562 | CP026932 | Escherichia coli strain CFS3273 chromosome, c<…> | 0 | 3716 | 1 | 2060 | 827677 | 825618 | 100 | 2060 | 0 | 0 |
consensus | 562 | CP026929 | Escherichia coli strain CFS3246 chromosome, c<…> | 0 | 3716 | 1 | 2060 | 746936 | 744877 | 100 | 2060 | 0 | 0 |
consensus | 562 | CP047609 | Escherichia coli strain NMBU_ W06E18 chromoso<…> | 0 | 3716 | 1 | 2060 | 3482001 | 3484060 | 100 | 2060 | 0 | 0 |
consensus | 562 | CP044403 | Escherichia coli strain NMBU-W10C18 chromosom<…> | 0 | 3716 | 1 | 2060 | 4353508 | 4355567 | 100 | 2060 | 0 | 0 |
[Figure: read quality control graphs for observed reads and filtered reads]
Name | Length | Motif | Repetitions | Start | End |
---|---|---|---|---|---|
reference | 2054 | No SSRs detected | NA | NA | NA |
The use of Oxford Nanopore Technologies third-generation sequencing technology as a substitute for Sanger-based sequencing provides many advantages including: (1) direct PCR to sequence data without additional steps; (2) sequence composition of individual DNA molecules (from both DNA strands) rather than a per-sample average consensus; (3) allele identification, including allele phasing; (4) a drop-in replacement technology at a reduced price without need to modify existing laboratory workflows. These enhancements increase the information content in each sample well beyond that obtainable with Sanger-based sequencing.
Sanger sequencing is consensus based. The output sequence is determined by the average signal generated for each extension-terminating di-deoxy nucleotide incorporated into all DNA molecules during each cycle. If the target molecule was not the most abundant sequence (c.f. the brightest band on the gel) the Sanger reaction would often generate an uninterpretable consensus sequence because different DNA molecules were sequenced.
In contrast, amplicon sequencing with Oxford Nanopore Technologies PromethION platform acquires sequence information independently for each DNA molecule (both DNA strands are sequenced in approximately equal proportions), with no explicit coverage limit. Importantly, the number of molecules sequenced will be proportional to the abundance of DNA molecules in your PCR sample. So even if your sample contains multiple amplification products resulting from non-specific or even multiple target sites, our workflow is able to partition them accordingly. The ability to separate a mixture of potentially different DNA molecules increases the information content available and broadens the functionality of the resulting data.
This information designates user and sample identification, as provided by you during sample submission to UWBC-NGS, and a barcode tag (assigned by UWBC-NGS), which together uniquely identify each sample in a given run. The item named “Sanitized sample ID:” reflects textual alterations applied to your sample name to remove certain characters that may interfere with our workflows. The sanitized version of your sample name is prepended to the barcode for easy identification. See FAQ entry #1 for more information.
Reports the number of observed (raw) reads and of filtered reads (pre-alignment filtering for adapters, barcodes, length and chimeras).
This simply reports the reference sequence length, the number of primary alignments and the aggregate number of aligned nucleotides. By default, alignments are filtered with -F 3844, thereby removing all reads that are unmapped, secondary or supplementary (the same flag value also excludes reads marked as QC-failed or duplicate).
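To make the -F 3844 filter concrete, the sketch below decodes the value into its individual bits using the bit definitions from the SAM specification. The function names are hypothetical and exist only for illustration; the workflow's actual filtering command is not shown in this report.

```python
# Flag bits as defined in the SAM specification.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "read1",
    0x80: "read2",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(value):
    """Return the names of all SAM flag bits set in `value`."""
    return [name for bit, name in SAM_FLAGS.items() if value & bit]


def passes_filter(read_flag, exclude=3844):
    """Mimic an exclusion filter: keep a read only if none of the excluded bits are set."""
    return (read_flag & exclude) == 0


print(decode_flag(3844))
# ['unmapped', 'secondary', 'qc_fail', 'duplicate', 'supplementary']
```

A primary alignment on the reverse strand (flag 16) passes this filter, whereas a supplementary alignment (flag 2048) does not.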
Lastly, the reference and consensus sequence lengths are reported, including a MAFFT alignment of the consensus to the reference. NB: prior to read alignment, all non-IUPAC characters were removed from the reference sequence. The consensus sequence contains variants identified in the workflow, specifically those designated as PASS in the FILTER column of the observed variant and genotype tables. Both sequences are saved in FASTA format in the directory errorCorrection with filenames reference.fasta and
Summary: This region contains information for each variant including CHROM (the reference sequence you provided), POS, REF, ALT, QUAL and FILTER. The REF and ALT columns indicate the reference and alternate allelic states. When multiple alternate allelic states are present they are delimited with commas. The QUAL column attempts to summarize the quality of each variant over all samples. The FILTER column indicates whether a variant has passed the quality assessment implemented by Clair3.
Importantly, the consensus portion of the workflow will integrate all PASSing variants discovered; no other variant filtering is performed. As no variant caller is perfect, it is imperative that you review each variant individually. Viewing the alignments with IGV is an excellent means to inspect the sequence alignments first-hand. Columns P (pileup) and F (full) indicate the particular model used by Clair3.
QUAL is the Phred-scaled probability that the site has no variant and is computed as QUAL = -10*log10 (posterior genotype probability of a homozygous-reference genotype (GT=0/0)). More simply, QUAL = GP(GT=0/0), where GP = posterior genotype probability in a Phred scale. In this report, QUAL = 20 means there is 99% probability that there is a variant at the site. As the QUAL score increases, the probability of the called variant increases. NB: QUAL values reported are unfiltered.
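The Phred relationship described above can be checked with a few lines of arithmetic. This is a minimal sketch of the conversion only; the function names are hypothetical, and it assumes QUAL is exactly -10*log10 of the homozygous-reference probability as stated in the text.

```python
import math


def qual_to_variant_probability(qual):
    """Convert a Phred-scaled QUAL into the probability that the site is variant."""
    p_reference = 10 ** (-qual / 10)  # probability of the 0/0 genotype
    return 1.0 - p_reference


def reference_probability_to_qual(p_reference):
    """Inverse conversion: Phred-scale a homozygous-reference probability."""
    return -10 * math.log10(p_reference)


# QUAL values taken from the variant table in this report, plus the QUAL = 20 example.
for qual in (20, 22.68, 30.16, 35.92):
    print(f"QUAL = {qual}: P(variant) = {qual_to_variant_probability(qual):.4f}")
```

For QUAL = 20 the function returns 0.99, matching the 99% figure quoted above.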
A RefCall entry occurs when a candidate variant is proposed and then specifically rejected as a non-variant site. Despite the status of non-variant, the Clair3 model still provides an estimate of its confidence expressed as QUAL and GQ. The inclusion of RefCall sites allows you to have a full record of every variant that was considered. This is useful for quality control as you can distinguish between cases where there was not enough coverage (AD tag) to nominate a candidate variant and cases where some other issue made the variant much more difficult to call (e.g. in homo- or hetero-polymeric regions). Essentially, the reporting of RefCalls allows users to select different filtering criteria in the event they want to increase recall.
The P and F columns reflect two steps in calling variants. In the former case (P), a heuristic identifies positions that are potentially variant. Clair3 then creates a pileup of all observations at that site and uses information derived from the pileup to decide whether or not that site is a variant. In the latter case (F), a neural network classifies those positions as true variants and if so, genotypes them using a full model (see Clair3 documentation). When the allele depth is very large (as is generally the case in amplicon sequencing) only the pileup model is used and all the Full-model instances are false.
Genotypes: The gt_GT (genotype), gt_DP (depth), gt_AF (frequency) and gt_GT_alleles (allele content) designations report the inferred genotype, coverage depth, allele frequency and allele content for each variant. For heterozygous genotypes, the IUPAC code is provided. NB: all PASSing variants are incorporated into the consensus sequence and reflect the IUPAC codes when appropriate.
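As an illustration of how a genotype maps to an IUPAC code, the sketch below uses the standard two-base IUPAC ambiguity codes. The function name is hypothetical, and returning multi-base (indel) alleles unchanged is an assumption made to mirror the position 283 row in the genotype table; it is not taken from the workflow's code.

```python
# Standard IUPAC codes: single bases map to themselves, base pairs to ambiguity codes.
IUPAC_CODES = {frozenset(base): base for base in "ACGT"}
IUPAC_CODES.update({
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
})


def iupac_for_genotype(gt_alleles):
    """Return the IUPAC code for a genotype string such as 'A/G' or 'C/C'."""
    alleles = gt_alleles.upper().replace("|", "/").split("/")
    # Assumption: alleles longer than one base have no single-letter code,
    # so the genotype string is returned unchanged.
    if any(len(allele) != 1 or allele not in "ACGT" for allele in alleles):
        return gt_alleles
    return IUPAC_CODES.get(frozenset(alleles), gt_alleles)


print(iupac_for_genotype("C/C"))  # homozygous: C
print(iupac_for_genotype("A/G"))  # heterozygous: R
```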
This table contains a BLAST summary of the consensus sequence, reporting up to ten (10) significant matches (hits) in the NCBI nt database. BLAST statistics for each hit are available by sliding the window contents to the left. Direct access to NCBI information for each hit is available by clicking on the highlighted link provided in the Accession column. NB: the nt database used in the homology search is local to our infrastructure. While reasonably current (20221003), it does not reflect the day-to-day updates made at NCBI. For the most current information, we encourage you to use NCBI’s BLAST interface, available at https://blast.ncbi.nlm.nih.gov/Blast.cgi.
Quality control metrics for observed (unfiltered; top set of three graphs), and filtered (adapter, barcode, length and chimera removal; bottom set of three graphs) reads are shown. The minimum read length threshold is 0.10 * reference length for references < 1000 bp or 0.20 * reference length for references >= 1000. An upper limit to read length was defined as 1.50 * reference length. Passing reads are sorted using a 2:1 weighted preference (length:quality) and the top 90% are kept for further analysis.
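The read length thresholds above reduce to simple arithmetic on the reference length. The following sketch computes them; the function names are hypothetical, and treating both bounds as inclusive is an assumption. The 2:1 length:quality ranking and the 90% cut are not reproduced here because the exact scoring is not described in this report.

```python
def read_length_bounds(reference_length):
    """Return (minimum, maximum) accepted read length for a given reference length."""
    min_fraction = 0.10 if reference_length < 1000 else 0.20
    return min_fraction * reference_length, 1.50 * reference_length


def within_bounds(read_length, reference_length):
    """True if the read length falls inside the accepted window (assumed inclusive)."""
    minimum, maximum = read_length_bounds(reference_length)
    return minimum <= read_length <= maximum


minimum, maximum = read_length_bounds(2054)  # reference length from this report
print(f"accepted read lengths: {minimum:.1f} to {maximum:.1f} nt")
```

For the 2054 bp reference in this report, the window is roughly 411 to 3081 nucleotides.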
The mean peak of filtered read lengths is expected to be approximately 0.5 * reference length but can vary. Mean quality should be at least 14. NB: Nanopore quality values (QV) are not directly comparable to Illumina (or Pacific Biosciences) Phred scores; they tend to be over-estimated. However, Nanopore quality scores are correlated with error rates. The phrase “Upper length read exclusions:” refers to the number of reads longer than shown on the graph. These are excluded to maintain sensible graphical perspectives.
The 2D scatterplot reflects the distribution of both read length and quality. The red lines denote means for each. The vertical black line indicates the reference length and is labeled with its length (nucleotides).
The filtered read-alignment coverage graph plots coverages separated by both forward- and reverse-aligning reads. The total and mean coverages (with respect to alignment direction) are also shown. For amplicon sequences not containing indels, the forward and reverse coverages should appear as a relatively symmetrical “X” accompanied by generally flat total and mean coverages, an indication of random tagmentation. NB: 5’ and 3’ ends will often have slightly reduced [total & mean] coverages as very short fragments will be filtered.
Small and abrupt changes in coverage often indicate homopolymeric regions but can also be caused by indels ranging from small (a few nucleotides) to large (hundreds of nucleotides). In sequences containing indels, more attention must be paid to 1) the variant calling results, 2) the read alignments to the reference sequence (use IGV with the supplied self-contained XML session file) and 3) the genetic composition (zygosity, including ploidy) of the region. A common occurrence is a drop to zero coverage, extending to the end of the reference sequence. This phenomenon is often caused by a deletion that extends beyond the reference sequence. It could also be due to insufficient flanking coverage, possibly because the event occurred too close to the primer sequence (see FAQ #11).
Curiously, we have sometimes observed strand-specific biases where coverages in the forward and reverse strands are very different. We do not have a clear explanation for this occurrence, but it may be due to asymmetric PCR amplification, perhaps attributable to unbalanced primer concentrations.
The coverage and homopolymer site graph shows a total coverage plot with homopolymeric [GATC] regions super-imposed. At a typical coverage, the effects of homopolymers are usually reflected as precipitous coverage reductions. A barplot of base-level homopolymeric occurrences for the reference sequence is provided. See FAQ #10 for a means to view directly the sequence alignments and whether or not homopolymeric regions are affecting variant calls.
Performing this follow-up analysis is crucial to a legitimate variant analysis using nanopore data. In other words, unless you have a well-validated system, do not simply rely on the default PASSing cutoff. Inspection of variants is almost always necessary and a non-trivial task to automate. Homopolymers (and heteropolymers - see FAQ #5-8) are a feature of nanopore data that influences directly the accuracy of base calling and hence variant calling. Information provided in this report will allow you to make an informed judgement especially when viewing directly the alignments.
As heteropolymeric (c.f. simple sequence repeats; SSRs) regions may also influence basecalling accuracy, they are also reported in the lower table for the reference sequence. Both homopolymeric and heteropolymeric regions are included as an IGV track. See FAQ #10.
This section contains additional information regarding the UWBC Oxford Nanopore Amplicon Sequencing service.
If you have any questions regarding the contents of this report or how to interpret results from any analyses contained therein, please contact us at the email provided below. Feedback regarding the report and how the standard analysis performed with your sample(s) is useful to us and appreciated.
Computer and software systems differ in their ability to recognize and process textual information. Many non-alphanumeric characters (there are 33 in ASCII) are not accepted as valid input. While UWBC has established sample naming guidelines, clients generally do not follow them and software from third-party vendors changes behavior without warning. It is inefficient for us to request changes and definitely not good laboratory practice for us to alter the names you supplied. As a viable compromise, we prepend to the Oxford Nanopore Technologies amplicon sequencing report a “sanitized” version of your sample name (one that removes unacceptable characters) to the unique barcode. As a convenience, the modified designation is also linked to the unaltered sample name in the read summary portion of this report. Of course, you are free to rename the report and sequence files as you wish. We recommend that you perform the renaming action on a copy of the original file(s).
Library preparation of amplicons uses a Tn5 transposase to randomly cleave template molecules and attach barcodes to both ends of the DNA molecule at the cleavage site. While multiple cleavages on the same DNA molecule do occur, we have observed that many fragments are cut only once. Once reads are trimmed of adapters and barcodes, the filtered fragment distribution matches better the expected amplicon size but will still always be smaller than the observed amplicon size. Note that non-target amplicons may also affect the observed size distribution (see FAQ #9). Examine information contained in the read quality control section for sample-specific information.
There is no upper limit to the size - if you can amplify large targets, we can handle them. In general, the entire sequence including most - if not all - of the PCR primer sequence is recovered. There does appear to be a lower size limit of approximately 500 - 600 bp, the results of which exhibit increasingly reduced coverage at the 5’ and 3’ sequence ends. We have observed excellent results sequencing amplicons of approximately 1000 bp up to 22.9 Kb.
NB: If you have amplicons in the size-range of 500 bp or smaller, we recommend redesigning your forward and/or reverse primers accordingly to generate an amplicon with a minimum size of approximately 1000 bp. Since the Amplicon Sequencing service is very inexpensive, the adage “it doesn’t hurt to try” applies here.
Yes. However, we cannot perform the phasing function automatically as variant calls usually require additional filtering and scrutiny. We do provide you with the alignment file (see FAQ #10), which should be reviewed to verify PASSing variants and eliminate those deemed to be false positives.
NB: the reference file provided was used to perform the variant analysis. If you would like your sequences phased, please contact the BRC at brc@biotech.wisc.edu or complete a service request on our website at https://bioinformatics.biotech.wisc.edu. Phasing requires at least two variant sites be present.
A k-sized homopolymer is a consecutive repetition of k times the same nucleotide base [GATC], with k ≥ 2. See also FAQ #6.
Reduced coverage usually does not become apparent until 5 or more homopolymer bases. For the current Oxford Nanopore Technologies chemistry (10.4), the minimum length we chose was 4 nucleotides as indel effects begin to occur at this length, albeit at low frequency.
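The homopolymer definition and the 4-nucleotide minimum can be expressed as a short regular-expression scan. This is a sketch for illustration, not the workflow's implementation; the function name and the 1-based inclusive coordinates are assumptions.

```python
import re


def find_homopolymers(sequence, min_length=4):
    """Return (start, end, base, length) for each homopolymer run of at least min_length.

    Coordinates are 1-based and inclusive (an assumption for this sketch).
    """
    # One base followed by at least (min_length - 1) copies of the same base.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_length - 1))
    return [
        (match.start() + 1, match.end(), match.group(1), match.end() - match.start())
        for match in pattern.finditer(sequence.upper())
    ]


# Forward primer from this report: contains a single TTTT run.
print(find_homopolymers("ACGTAACGCCAACCTTTTGC"))  # [(15, 18, 'T', 4)]
```

Lowering min_length to 2 reports every run covered by the general definition, which is useful for seeing how quickly short runs accumulate in a typical sequence.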
Nanopore sequencing is affected by low complexity regions due to sustained small variations in the electrical signal of the pore when the nucleotide base remains the same (or has repetitive structure; see FAQ #8). Moreover, the translocation speed of DNA through the pore is not constant. These phenomena make it more difficult to determine the exact length of a homo (or hetero) polymer. In nearly all cases, the targeted coverage depth should be high enough to make an unambiguous interpretative decision. Updates to both the sequencing chemistry and basecalling algorithms occur frequently and continue to reduce these problems.
While the precise terminology requires refinement, other more complicated repetitive regions also exist, which we elect to call n-mer repeats. These are commonly known as simple-sequence (SSR) or short-tandem (STR) repeats. For example, di-nucleotide repeat (a consecutive repetition of length 2k with k times the doublet of bases XY with X ≠ Y and k ≥ 2) or a tri-nucleotide repeat (a consecutive repetition of length 3k with k times the triplet of bases XYZ such that k ≥ 2 and X = Y ⇒ Y ≠ Z ), etc. are common in genomes of most organisms. These types of heteropolymer repeats also influence the accuracy of basecalling, most often manifesting themselves as regions of reduced coverage just like homopolymers.
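The di- and tri-nucleotide repeat definitions above can also be sketched as a regular-expression scan. The function name, the 1-based inclusive coordinates and the default of k ≥ 2 are assumptions taken from the definition in the text; the threshold the workflow actually uses when reporting SSRs is not stated in this report and is likely higher.

```python
import re


def find_ssrs(sequence, motif_sizes=(2, 3), min_repeats=2):
    """Return (start, end, motif, repetitions) for tandem repeats of short motifs.

    Motifs made of a single repeated base are skipped, since those are homopolymers.
    Coordinates are 1-based and inclusive (an assumption for this sketch).
    """
    hits = []
    for size in motif_sizes:
        # The negative lookahead rejects motifs in which all bases are identical;
        # group 2 captures the motif, which must then recur at least (min_repeats - 1) times.
        pattern = re.compile(
            r"(?!([ACGT])\1{%d})([ACGT]{%d})\2{%d,}" % (size - 1, size, min_repeats - 1)
        )
        for match in pattern.finditer(sequence.upper()):
            length = match.end() - match.start()
            hits.append((match.start() + 1, match.end(), match.group(2), length // size))
    return sorted(hits)


print(find_ssrs("GGACACACTT"))     # [(3, 8, 'AC', 3)]
print(find_ssrs("TTCAGCAGCAGTT"))  # [(3, 11, 'CAG', 3)]
```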
A reference sequence provides a means to a) separate a potentially complex mixture of different (non-target) amplicons b) identify homopolymeric and heteropolymeric regions and c) establish the presence (or absence) of variants specific to your region(s) of interest, with respect to the reference itself. More precisely, the reference serves as a hypothesis to which sequenced reads from your sample are tested for identity, which ultimately reflect the consensus sequence.
NB: the reference need not be “exact” per se, but rather homologous to your target region. Importantly, the reference supplied should be inclusive of the forward (F) and reverse (R) primers. Because this workflow contains a variant calling step, any statistically significant variation will be identified and reported as the consensus sequence. This feature is very useful when you may not have sufficient knowledge of the nucleotide content or structure of your target region because you are working in a different organism that is genetically less characterized. Or perhaps it is a technical issue such that no matter what PCR optimization(s) employed, you can never acquire a single amplicon. In any case, provided the PCR amplification generates amplicons with sufficient efficiency (i.e. they can be resolved on an electrophoretic gel), this workflow should successfully identify the presence of any variants.
While a reference sequence is preferred, we are creating a version of the workflow that does not rely on an extrinsic reference but rather aggregates into clusters closely related amplicons, from which a consensus-based reference will be defined. Our Pacific Biosciences amplicon sequencing workflow (SASSy) does this already by design. This feature will be made an option for Oxford Nanopore Technologies amplicon sequencing in the near future. Either approach is very different from an assembly-only perspective, which, depending on the nature of your target sequence and how the assembly is performed, is likely to provide erroneous results. One important reason for this is that the rapid barcoding library preparation technology cleaves the amplicon into two (primarily) or possibly more fragments and cannot be unambiguously re-assigned easily. Features such as insertions, deletions and phasing potential will be affected.
Yes. We recommend Integrative Genomics Viewer (IGV; https://software.broadinstitute.org/software/igv/). Choose the latest version for your computer platform. Unless you know for certain that Java is installed correctly on your computer (it is often not and IGV will complain), choose a download bundled with Java. It will not interfere with any existing Java installations. For Mac and PC platforms, simply open the IGV application (hint: keep pace with the updates as they occur often) and drag the file named IGV_session_<barcode>_<sample>.xml into the primary window. It should open automatically.
Alternatively, open the IGV application and then select “File” on the menu bar followed by “Open session…”. When the dialog box opens, navigate to the folder containing the amplicon assembly (it will have the format
Yes, although it depends on the size and positional context (location in the PCR fragment) of the indel. Detection of structural variants (SVs) is non-trivial and nuanced. It depends heavily on the aligner and variant caller used. The methodology we use responds well to small indels, from one (1) bp to just under 50 bp, that are flanked by at least 200 bp of reference sequence with sufficient coverage. If your project design involves indels, and the default workflow is not providing acceptable results, please contact us as we can make adjustments specific to your needs. NB: indels located near the ends of your amplicon generally cannot be mapped well and usually manifest themselves as false negatives in the variant analysis phase.
The presence of insertions or deletions in either the reference or consensus sequence will change the indexing provided in the sequence alignment.
No, not at this time.
Yes. Please see Plasmid Sequencing and Assembly at the UWBC NGS site: https://dnaseq.biotech.wisc.edu/services/plasmids/
Yes. When creating reports such as this one, we strive to create a balance between primary and ancillary information that will serve a majority of our clients. If you have specific suggestions, please let us know at brc@biotech.wisc.edu.
This report was generated by computational pipelines developed at the Bioinformatics Resource Center (BRC), part of the Advanced Genome Analysis Resource unit of the Biotechnology Center at the University of Wisconsin - Madison.
Bioinformatics Resource Center
Genetics & Biotechnology Center, Room 2130
425 Henry Mall, Madison WI 53706
Email: brc@biotech.wisc.edu
Website: https://bioinformatics.biotech.wisc.edu