This report describes the structure and composition of one or more contigs assembled from, and polished with, long-read sequence data. It also provides metadata for each assembly, including simple summary statistics (e.g. type and number of contigs, input sequence lengths and quality, G+C content, N50). Finally, it compares the assembled contigs with reference sequences in the NCBI Reference Sequence (RefSeq) database.
The following four graphs provide important information regarding the characteristics of the sequence reads used in genome assembly. Summary statistics for Figures 1, 2, and 3 are denoted by colored and dashed vertical lines and described above the respective figure. Note that the reported summary statistics are on a linear scale.
Table 1 (below) lists general features of all contigs created during sequence assembly. The graph used to construct each contig is available in the directory graph. A single circular contig with a length of 1 Mb or more suggests a successful and complete assembly of the primary chromosome. Substantially smaller contigs, even if circular, are often - but not always - spurious assembly constructs that can be safely ignored. This phenomenon occurs commonly in Class II genomes, which often contain many interspersed repetitive regions. In Class I genomes, shorter circular elements are often true plasmids. However, please be aware that this assembly workflow does not specifically target plasmid-derived sequences for assembly. The UW Biotechnology Center is implementing pulsed-field gel electrophoresis (PFGE) as an independent means to identify both the primary chromosome size and smaller extra-chromosomal elements like plasmids. If plasmids are found to exist, specific changes may be made to accommodate assembly of plasmid sequences.
contig | length | coverage | circular |
---|---|---|---|
contig_1 | 4787802 | 118 | yes |
The following results contain structural (Figure 6) and NCBI RefSeq comparative genome information (Figure 7) for each contig generated in the assembly process. High-quality PDF versions suitable for printing are available here.
Each polished contig is compared to a BRC-curated subset of NCBI RefSeq prokaryotic accessions. The five best matches, sorted by BLAST+ v2.8.0 bit score (blastn), are reported in Table 2. The higher the bit score, the greater the similarity between the contig sequence and the matched NCBI accession. Hyperlinks to the corresponding NCBI genome resource (sacc) and the specific sequence version (saccver) are provided for your convenience. Be cautioned that entries in the scientific name (ssciname) column come from NCBI and are often abridged, incomplete, or simply erroneous. As NCBI sequence databases (including taxonomic metadata) are updated daily, we recommend additional and periodic homology searching if needed.
As our assembly workflows use long reads both to assemble (Oxford Nanopore Technologies; ONT) and to error-correct (Pacific Biosciences; PacBio) genomes and/or additional extrachromosomal elements, we have very strong confidence in the structural and sequence quality of our genome assemblies. Nevertheless, we do not recommend drawing conclusions based on consistency with NCBI sequence accessions alone, as long-read assemblies are not immune to error. Thus, while assembly congruency with accessioned reference sequences (or even with different assembly reconstruction methods) is desirable, it should not be expected, nor should an assembly be accepted or rejected on that basis without further investigation. In fact, due to the plasticity inherent to most, if not all, genomes (particularly in prokaryotes), the term “reference” is misleading: there are no true references, only relative ones. To help resolve these complex issues, the UW Biotechnology Center is evaluating alternative mapping techniques to simultaneously - and independently - assess assembly quality.
qseqid | staxid | sacc | saccver | ssciname | pident | length | mismatch | evalue | bitscore |
---|---|---|---|---|---|---|---|---|---|
8739_c_0 | 481805 | CP000946 | CP000946.1 | Escherichia coli ATCC 8739 | 99.999 | 3170013 | 15 | 0 | 5854000 |
8739_c_0 | 562 | CP007390 | CP007390.1 | Escherichia coli | 99.453 | 292007 | 1573 | 0 | 530400 |
8739_c_0 | 562 | CP007265 | CP007265.1 | Escherichia coli | 99.453 | 292009 | 1573 | 0 | 530400 |
8739_c_0 | 562 | CP007391 | CP007391.1 | Escherichia coli | 99.470 | 258455 | 1348 | 0 | 469700 |
8739_c_0 | 562 | CP027140 | CP027140.1 | Escherichia coli | 99.634 | 255923 | 911 | 0 | 467400 |
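
The ranking described for Table 2 can be reproduced from tabular BLAST output with a few lines of Python. This is a minimal sketch, not the workflow's own code: it assumes the hits are stored as tab-separated text with columns in the same order as the Table 2 header, and the function name `top_hits` is hypothetical.

```python
import csv
import io

# Column order assumed to match the Table 2 header.
COLUMNS = ["qseqid", "staxid", "sacc", "saccver", "ssciname",
           "pident", "length", "mismatch", "evalue", "bitscore"]

def top_hits(tsv_text, n=5):
    """Return the n hits with the highest bit score, best first."""
    reader = csv.DictReader(io.StringIO(tsv_text),
                            fieldnames=COLUMNS, delimiter="\t")
    rows = list(reader)
    for row in rows:
        # Convert the numeric fields used for ranking and inspection.
        row["bitscore"] = float(row["bitscore"])
        row["pident"] = float(row["pident"])
        row["length"] = int(row["length"])
    return sorted(rows, key=lambda r: r["bitscore"], reverse=True)[:n]

# Three rows copied from Table 2, deliberately out of order.
example = "\n".join([
    "8739_c_0\t562\tCP007391\tCP007391.1\tEscherichia coli\t99.470\t258455\t1348\t0\t469700",
    "8739_c_0\t481805\tCP000946\tCP000946.1\tEscherichia coli ATCC 8739\t99.999\t3170013\t15\t0\t5854000",
    "8739_c_0\t562\tCP007390\tCP007390.1\tEscherichia coli\t99.453\t292007\t1573\t0\t530400",
])

best = top_hits(example, n=2)
for hit in best:
    print(hit["saccver"], hit["bitscore"], hit["pident"])
```

Sorting on bit score alone mirrors the report; ties (such as the two hits with a bit score of 530400 in Table 2) keep their input order because Python's sort is stable.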
Some basic features to check for include: (1) a genome size commensurate with expectation and/or PFGE results; (2) good coverage for both ONT and PacBio reads; (3) generally obvious GC/AT skew changes at the boundaries of the two replichores, which correspond to the DNA replication origin and terminus; and (4) the presence of additional “essential” genes. The origin boundary is marked by the gene dnaA, which encodes a protein that activates initiation of DNA replication in nearly all bacteria (the two genes dnaN and gyrB are usually found near dnaA as well). Customized gene lists are available for inclusion on request.
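To make the skew check concrete, the sketch below computes windowed GC skew, (G - C) / (G + C), and its running sum. It is an illustration only: the window size, the function names, and the toy sequence are assumptions, and the report does not state how its own skew track is calculated. In many bacterial genomes the running sum changes direction near the replication origin and terminus.

```python
def gc_skew(seq, window=1000):
    """Return (start, skew) pairs for non-overlapping windows.

    skew = (G - C) / (G + C); windows without G or C get 0.0.
    """
    seq = seq.upper()
    out = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        skew = (g - c) / (g + c) if (g + c) else 0.0
        out.append((start, skew))
    return out

def cumulative_skew(skews):
    """Running sum of the per-window skew values."""
    total = 0.0
    result = []
    for _, value in skews:
        total += value
        result.append(total)
    return result

# Toy sequence (not real data): a G-rich half followed by a C-rich half.
toy = "GGGC" * 50 + "CCCG" * 50
skews = gc_skew(toy, window=100)
running = cumulative_skew(skews)
print(skews)
print(running)
```

In the toy example the skew flips sign halfway through, so the running sum rises and then falls; on a real contig the turning points are the positions to compare against the annotated dnaA locus.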
The following table contains a brief summary of features identified in contig 8739_c_0. Depending on the particular species, a reasonable estimate for the number of coding sequences (CDS) is a few thousand. Also reported are identified transfer RNA (tRNA), ribosomal RNA (rRNA; also depicted in red on the annotation ring of the assembly and comparative features plots), and transfer-messenger RNA (tmRNA).
bases | CDS | tRNA | tmRNA | rRNA |
---|---|---|---|---|
4746155 | 4384 | 88 | 1 | 22 |
Error rate and error pattern influence the level of accuracy achievable at the resolution of single nucleotides. Compared to the relatively low error of short-read sequence data, the higher error rates of long-read sequence data pose challenges to consensus-based error correction models and must be addressed to achieve accurate genome assemblies at coarse (contiguity) and fine (nucleotide) levels. Consensus error correction is performed with PacBio reads that were mapped against the assembly created from ONT reads. Each platform has distinct error characteristics, such as homopolymer error in PacBio and context-specific (systematic) error in ONT reads. However, when used together, they provide the means to generate extremely low error-rate assemblies. The number of corrections in each of three error classes is shown in the table below.
type | Freq |
---|---|
insertion | 2647 |
deletion | 35404 |
substitution | 11201 |
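
As a quick reading aid, the counts in the table above can be turned into proportions. This is simple arithmetic on the reported values, not part of the assembly workflow.

```python
# Correction counts copied from the table above.
counts = {"insertion": 2647, "deletion": 35404, "substitution": 11201}

total = sum(counts.values())
shares = {kind: n / total for kind, n in counts.items()}

for kind, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{kind:<12} {counts[kind]:>6} {share:6.1%}")
```

Deletions account for the largest share of the corrections made to this assembly.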
Coverage. Shown below is a plot of the PacBio read coverage (0 - 100X) and consensus-correction confidence (Phred score < 93) for three types of ONT/PacBio alignment discrepancies: deletions, substitutions, and insertions. Of all the corrections made, only those with a confidence < 93 are shown. Substitutions are the easiest corrections to make, and are usually sparsely represented in this graph. In contrast, insertion and deletion events have more complicated error profiles and are therefore more difficult to correct. The use of PacBio reads for correction yields excellent results in most cases. For comparison, a consensus accuracy score of 50 is among the lower confidence values achieved with this method, yet it still equates to 99.999% accuracy (i.e. Pr(error) = 0.00001). The maximum reported Phred score of 93 equates to greater than 99.9999999% accuracy (Pr(error) of about 5 × 10^-10).
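The relationship between a Phred score Q and the error probability is Pr(error) = 10^(-Q/10). The short sketch below applies that standard definition so any confidence value in the plot can be converted to an accuracy; the function names are illustrative and not part of the workflow.

```python
import math

def phred_to_error(q):
    """Error probability for Phred score q: 10 ** (-q / 10)."""
    return 10 ** (-q / 10)

def error_to_phred(p):
    """Phred score for error probability p: -10 * log10(p)."""
    return -10 * math.log10(p)

for q in (20, 50, 93):
    p = phred_to_error(q)
    print(f"Q{q}: Pr(error) = {p:.3g}, accuracy = {100 * (1 - p):.10f}%")
```

Each increase of 10 in the score corresponds to a tenfold drop in the error probability, which is why even the lower scores in the plot represent very confident corrections.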
Position. The plot shown below is very similar to the one above, but substitutes base position for read coverage. Regardless of the specific type of correction made, the distribution of confidence should be random with respect to base position and relatively sparse overall. If it is not, this may indicate an assembly error and should be investigated further.
Detected variants. The Arrow algorithm was used with PacBio reads to error-correct the ONT assembly. The results of this correction are available here as a GFF file. Below are two graphs summarizing the corrections made.
This figure shows the relative proportion of specific sequences in three variant types for the ONT reference assembly. Note: the period (.) character denotes an insertion event.
A comparative analysis of each assembled contig and the highest-scoring NCBI match is made using MUMmer4. Each contig-versus-NCBI-reference MUM comparison was filtered to require an exact match length of at least 2 kb. A plot of percent similarity for the filtered data is also presented. It is important to understand that these two graphs describe the similarity between a contig assembly and a BLAST match to an accession in the NCBI RefSeq database. Long stretches of sequence contiguity between the contig and the NCBI accession are desirable in terms of assembly consistency with that specific RefSeq accession. However, levels of consistency between long- and short-read assemblies can and do vary considerably. The increased use of long reads has brought a substantially higher level of quality to genome assemblies. It is also providing increasing evidence that short-read assemblies (used for a majority of accessioned NCBI RefSeq prokaryotic reference genomes) are often an unreliable means of genome reconstruction, particularly in terms of structural contiguity.
Assembly comparison to NCBI RefSeq accessions. Genome information depicted in this plot shows BLAST-derived similarity to the top-five NCBI RefSeq hits shown above in Table 2. The ring order mirrors that shown in the contig table and is arranged from the smallest bit score (innermost lavender-colored ring) to the largest bit score (outermost blue-colored ring, adjacent to the red assembled contig ring). This information is useful for identifying assembly inconsistencies with NCBI RefSeq accessions, particularly those based on short-read assemblies. One of the most common discrepancies is an assembly gap in a region of very high GC or AT content; such regions are poorly sequenced by, for example, Illumina short-read sequencers.
The dotplot is generated with MUMmer4 by computing maximal exact matches, clustering the matches, and extending alignments between the contig and the single best-match NCBI sequence. It is filtered to show only exact matches greater than 2 kb (i.e. regions with at least one aligned segment larger than 2 kb). A useful reference to help interpret the output of dotplots is given here. Note that the contig is designated as the query sequence and is located on the vertical “Y”-axis. Additionally, apparent indel events at the beginning and end of the diagonal often reflect sequence start-point differences and should be ignored.