1 Introduction

This report describes the structure and composition of one or more contigs assembled from, and polished with, long-read sequence data. It also provides metadata about each assembly, including simple summary statistics (e.g. type and number of contigs, input sequence lengths and quality, G+C content, N50). Finally, it compares the assembled contigs to reference sequences in the NCBI Reference Sequence (RefSeq) database.

2 Input sequence quality information

The following graphs summarize the characteristics of the sequence reads used in genome assembly. Summary statistics for Figures 1, 2, and 3 are marked by colored, dashed vertical lines and described above the respective figure. Note that the reported summary statistics are on a linear scale.

2.1 ONT read length distribution

Black = mean read length; green = N50; blue = most frequent read length

Figure 1: Read lengths for sample 8739
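The N50 marked on the read length plot can be reproduced from a list of read lengths. The following is a minimal sketch using the common definition of N50 (the length L such that reads of length L or longer contain at least half of all bases). The function name and the example lengths are hypothetical and are not taken from this sample.

```python
def n50(lengths):
    """Return the N50 of a collection of read lengths (in bp).

    N50 is the length L such that reads of length >= L together
    contain at least half of all sequenced bases.
    """
    if not lengths:
        raise ValueError("no read lengths given")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest read downwards until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Hypothetical read lengths (bp), for illustration only.
read_lengths = [2000, 3000, 4000, 5000, 6000]
print("mean:", sum(read_lengths) / len(read_lengths))
print("N50:", n50(read_lengths))
```

Because long reads contribute more bases than short ones, the N50 is typically larger than the mean read length, which is why the two are drawn as separate lines.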

2.2 ONT read quality distribution

Black = mean read quality (Phred scale); blue = most frequent read quality

Figure 2: Read qualities for sample 8739

2.3 ONT read GC distribution

Black (left) = A+T percentage; black (right) = G+C percentage

Figure 3: Read GC for sample 8739
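As a rough illustration of the quantity plotted here, the G+C percentage of a single read can be computed as below. This sketch assumes that ambiguous bases (such as N) are excluded from the denominator; whether the report's own pipeline does the same is not stated. The function name is hypothetical.

```python
def gc_percent(seq):
    """Return the G+C percentage of a DNA sequence.

    Only A, C, G and T are counted; ambiguous bases such as N are
    ignored (an assumption made for this sketch).
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    called = gc + at
    if called == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return 100.0 * gc / called


# Hypothetical read, for illustration only.
print(gc_percent("ATGCGCNNAT"))
```

The A+T percentage is simply 100 minus this value under the same assumption.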

2.4 ONT read length x quality

Each black dot represents a single read, plotted by its sequence length and quality. The histograms and corresponding heatmap show the overall distribution of read lengths and qualities used as input to assemble the genome. Due to selective and hierarchical filtering, not all reads are used in the assembly process.

Figure 4: Read length x quality for sample 8739

2.5 PB read length distribution

Black = mean read length; green = N50; blue = most frequent read length. Note: minimum read length is 2000 bp.

Figure 5: PB read lengths for sample 8739

3 Sequence assembly contig information

Table 1 (below) lists general features of all contigs created during sequence assembly. The graph used to construct each contig is available in the directory graph. A single circular contig with a length in the 1+ Mbase range suggests a successful and complete assembly of the primary chromosome. Substantially smaller contigs, even if circular, are often - but not always - spurious assembly constructs that can be safely ignored. This phenomenon occurs commonly in Class II genomes, which often contain many interspersed repetitive regions. In Class I genomes, shorter circular elements are often true plasmids. However, please be aware that this assembly workflow does not specifically target plasmid-derived sequences for assembly. The UW Biotechnology Center is implementing pulsed-field gel electrophoresis (PFGE) as an independent means to identify both the primary chromosome size and smaller extra-chromosomal elements like plasmids. If plasmids are found to exist, specific changes may be made to accommodate assembly of plasmid sequences.

contig	length	coverage	circular
contig_1	4787802	118	yes
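The rule of thumb above (a circular contig of 1 Mb or more suggests a complete primary chromosome) can be applied to the contig table programmatically. This is a minimal sketch that assumes the table is tab-separated with the four columns shown. The function name and the exact 1,000,000 bp cutoff are illustrative choices, not part of the assembly workflow.

```python
import csv
import io

# Contents of the contig table above, as tab-separated text.
CONTIG_TABLE = (
    "contig\tlength\tcoverage\tcircular\n"
    "contig_1\t4787802\t118\tyes\n"
)


def candidate_chromosomes(tsv_text, min_length=1_000_000):
    """Return names of circular contigs at least min_length bp long."""
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        row["contig"]
        for row in rows
        if row["circular"] == "yes" and int(row["length"]) >= min_length
    ]


print(candidate_chromosomes(CONTIG_TABLE))
```

Contigs that fail this check are not necessarily spurious; as noted above, shorter circular elements may be true plasmids.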

3.1 Contig structure and homology to accessioned NCBI RefSeq sequences

The following results contain structural (Figure 6) and NCBI RefSeq comparative genome information (Figure 11) for each contig generated in the assembly process. High-quality PDF versions suitable for printing are available here.

Each polished contig is compared to a BRC-curated subset of NCBI RefSeq Prokaryotic accessions. The five best matches, sorted by BLAST+ v2.8.0 (blastn) bit score, are reported in Table 2. The higher the bit score, the greater the sequence similarity between the contig and the matched NCBI accession. Hyperlinks to the corresponding NCBI genome resource (sacc) and the specific sequence version (saccver) are provided for your convenience. Be cautioned that entries populating the scientific nomenclature (ssciname) column come from NCBI and are often abridged, incomplete, or simply erroneous. As NCBI sequence databases (including taxonomic metadata) are updated daily, we recommend additional and periodic homology searching if needed.

As our assembly workflows use long reads to both assemble (Oxford Nanopore Technologies; ONT) and error-correct (Pacific Biosciences; PacBio) genomes and/or additional extrachromosomal elements, we have very strong confidence in the structural and sequence quality of our genome assemblies. Nevertheless, we do not recommend drawing conclusions based on consistency with NCBI sequence accessions alone, as long-read assemblies are not immune from error. Thus, while assembly congruency with accessioned reference sequences (or even different assembly reconstruction methods) is desirable, it should not be expected, nor should an assembly be accepted or rejected on that basis without further investigation. In fact, due to the plasticity inherent to most, if not all, genomes (particularly in Prokaryotes), the term “reference” is misleading, as there are no true references, only relative ones. To help resolve these complex issues, the UW Biotechnology Center is evaluating alternative mapping techniques to simultaneously - and independently - assess assembly quality.

3.2 Contig 8739_c_0

qseqid	staxid	sacc	saccver	ssciname	pident	length	mismatch	bitscore
8739_c_0	481805	CP000946	CP000946.1	Escherichia coli ATCC 8739	99.999	3170013	15	5854000
8739_c_0	562	CP007390	CP007390.1	Escherichia coli	99.453	292007	1573	530400
8739_c_0	562	CP007265	CP007265.1	Escherichia coli	99.453	292009	1573	530400
8739_c_0	562	CP007391	CP007391.1	Escherichia coli	99.470	258455	1348	469700
8739_c_0	562	CP027140	CP027140.1	Escherichia coli	99.634	255923	911	467400
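The ranking used for this table (hits sorted by bit score, best first) can be sketched as follows. The column order is taken from the table header above. The function name `top_hits` is hypothetical, and the sketch assumes one tab-separated hit per line, as in the table.

```python
# Column order taken from the table header above.
COLUMNS = ["qseqid", "staxid", "sacc", "saccver", "ssciname",
           "pident", "length", "mismatch", "bitscore"]


def top_hits(lines, n=5):
    """Parse tab-separated hit lines and return the n best by bit score."""
    hits = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        hit = dict(zip(COLUMNS, fields))
        hit["bitscore"] = float(hit["bitscore"])
        hits.append(hit)
    # Highest bit score first.
    return sorted(hits, key=lambda h: h["bitscore"], reverse=True)[:n]


# Two rows copied from the table above, deliberately out of order.
example = [
    "8739_c_0\t562\tCP027140\tCP027140.1\tEscherichia coli\t99.634\t255923\t911\t467400",
    "8739_c_0\t481805\tCP000946\tCP000946.1\tEscherichia coli ATCC 8739\t99.999\t3170013\t15\t5854000",
]
best = top_hits(example, n=1)[0]
print(best["saccver"], best["bitscore"])
```

Sorting on bit score rather than percent identity favors long, high-identity alignments over short, near-perfect ones.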

3.2.1 Assembly features

The FASTA file containing the polished assembly is available in the directory assembly. Assembly information for Prokaryotes is depicted with CIRCOS, which shows many important features. Starting from the innermost ring, the figure depicts: 1. Oxford Nanopore Technologies read coverage; 2. Pacific Biosciences read coverage for error-correction; 3. sequence GC content; 4. sequence GC skew (useful for quickly identifying isochores); 5. contig assembly; 6/7. annotation of coding regions (genes on forward and reverse strand) including ORFs as determined by PROKKA; 8. sequence length (clockwise, in Kb). Along the outer edge are locations of specific genes anchored by orange lines. This small subset of genes, including dnaA, are considered essential minimal replication genes, as defined by Gil et al., 2004. Labels anchored by blue lines show regions of the assembly that may exhibit evidence of structural variation (sv). These designations include svDEL (deletion), svDUP (duplication), svINS (insertion), svINV (inversion), and svTRA (translocation). Although possible in any of the three classes, categories DEL, INS, and DUP are characteristic of Class II genomes.

Figure 6: Features for polished assembly 8739_c_0

Some basic features to check for include: 1. a genome size commensurate with expectation and/or PFGE results; 2. good coverage for both ONT and PacBio reads; 3. generally obvious GC/AT skew changes at the boundaries of the two replichores, which correspond to the DNA replication origin and terminus. The origin boundary is typically marked by the gene dnaA, which encodes a protein that activates initiation of DNA replication in nearly all bacteria (the two genes dnaN and gyrB are usually associated with dnaA too); and 4. the presence of additional “essential” genes. Customized gene lists are available for inclusion on request.
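The GC skew changes mentioned in item 3 can be explored with a simple windowed calculation. The sketch below uses the conventional definition, skew = (G - C) / (G + C), over non-overlapping windows. The window size and function name are assumptions made for illustration and may differ from what is used to draw the CIRCOS ring.

```python
def gc_skew(seq, window):
    """Return GC skew, (G - C) / (G + C), for non-overlapping windows.

    A trailing partial window is dropped. Windows containing no G or C
    are reported as 0.0.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        skews.append((g - c) / (g + c) if (g + c) else 0.0)
    return skews


# Toy sequence: G-rich first half, C-rich second half.
print(gc_skew("GGGGCCCC", window=4))
```

On a real bacterial chromosome, a sign change in the windowed skew is what would be expected near the replication origin and terminus; the toy sequence above only demonstrates the arithmetic.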

3.2.2 Annotation features

The following table contains a brief summary of features identified in contig 8739_c_0. Depending on the particular species, a reasonable estimate for the number of coding sequences (CDS) is a few thousand. Also reported are identified transfer RNA (tRNA), ribosomal RNA (rRNA; also depicted in red on the annotation ring of the assembly and comparative features plots), and transfer-messenger RNA (tmRNA).

bases	CDS	tRNA	tmRNA	rRNA
4746155	4384	88	1	22

3.2.3 Correction features

Error rate and error pattern influence the level of accuracy achievable at the resolution of single nucleotides. Compared to the relatively low error rates of short-read sequence data, the higher error rates of long-read sequence data pose challenges to consensus-based error correction models and must be addressed to achieve accurate genome assemblies at coarse (contiguity) and fine (nucleotide) levels. Consensus error correction is performed with PacBio reads that were mapped against the assembly created from ONT reads. Each platform has distinct error characteristics, such as homopolymer error in PacBio and context-specific (systematic) error in ONT reads. However, when used together, they provide the means to generate extremely low error-rate assemblies. Counts for the three error types are shown in the table below.

type	Freq
insertion	2647
deletion	35404
substitution	11201
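To put the counts above in proportion, the share of each correction type can be computed directly from the table. This is a minimal sketch; the function name is hypothetical and the counts are copied from the table.

```python
# Correction counts copied from the table above.
counts = {"insertion": 2647, "deletion": 35404, "substitution": 11201}


def proportions(counts):
    """Return each category's share of the total as a fraction of 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no corrections to summarize")
    return {kind: n / total for kind, n in counts.items()}


shares = proportions(counts)
for kind, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{kind}: {100 * share:.1f}%")
```

For this contig, deletions make up roughly 72% of the 49,252 corrections, substitutions about 23%, and insertions about 5%.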

Coverage. Shown below is a plot of the PacBio read coverage (0 - 100X) and consensus-correction confidence (< 93; Phred score) for three types of ONT/PacBio alignment discrepancies: deletions, substitutions, and insertions. Of all the corrections made, only those with a confidence < 93 are shown. Substitutions are the easiest corrections to make and are usually sparsely represented in this graph. In contrast, insertion and deletion events have more complicated error profiles and are therefore more difficult to correct. The use of PacBio reads for correction yields excellent results in most cases. For comparison, a consensus accuracy score of 50 is among the lower confidence values achieved with this method, yet it still equates to 99.999% accuracy (i.e. Pr(error) = 0.00001). The maximum reported Phred score of 93 equates to greater than 99.9999999% accuracy (Pr(error) of roughly 5 x 10^-10).

Figure 7: Coverage/correction confidence for polished assembly 8739_c_0
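The relationship between a Phred score and accuracy in the Coverage paragraph follows from the standard definition Q = -10 log10(Pr(error)). The sketch below applies that definition; the function names are hypothetical.

```python
def phred_to_error(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def phred_to_accuracy_percent(q):
    """Convert a Phred quality score to percent accuracy."""
    return 100 * (1 - phred_to_error(q))


for q in (50, 93):
    print(f"Q{q}: Pr(error) = {phred_to_error(q):.2e}")
```

Each increase of 10 in the Phred score corresponds to a tenfold reduction in error probability, so Q50 corresponds to one expected error in 100,000 calls.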

Position. The plot shown below is very similar to the one above, but substitutes base position for read coverage. Regardless of the specific type of correction made, the distribution of confidence should be random with respect to the base position and relatively sparse overall. If not, this may be indicative of an assembly error and should be investigated further.

Figure 8: Position/correction confidence for polished assembly 8739_c_0

Detected variants. The Arrow algorithm was used with PacBio reads to error-correct the ONT assembly. The results of this correction are available here in a gff file. Below are two graphs summarizing the corrections made.

This figure shows the relative proportion of specific sequences in three variant types for the ONT reference assembly. Note: the period (.) character denotes an insertion event.

Figure 9: Base(s) for the reference site in the polished assembly 8739_c_0

This figure shows the relative proportion of specific sequences in three variant types in the PacBio variant. Note: the period (.) character denotes a deletion event.

Figure 10: Base(s) for the variant site in the polished assembly 8739_c_0

3.2.4 Comparative features

A comparative analysis of each assembled contig and the highest scoring NCBI match is made using MUMmer4. Each contig x NCBI reference MUM comparison was filtered to require an exact match length of at least 2 Kb. A plot of percent similarity for filtered data is also presented. It is important to understand that these two graphs provide comparative information regarding the similarity of a contig assembly and a BLAST match for an accession in the NCBI RefSeq database. Long stretches of sequence contiguity between the contig and NCBI accession are desirable in terms of contig assembly consistency with respect to a specific RefSeq accession. However, levels of consistency among long- and short-read assemblies can and do vary considerably. The increased use of long reads for genome assembly has brought a substantially higher level of quality in genome assemblies. It is also providing increasing amounts of evidence that short-read assemblies (used for a majority of accessioned NCBI RefSeq Prokaryotic reference genomes) are often an unreliable means of genome reconstruction, particularly in terms of structural contiguity.

Assembly comparison to NCBI RefSeq accessions. Genome information depicted in this plot shows BLAST-derived similarity to the top five NCBI RefSeq hits shown above in Table 2. The ring order mirrors that shown in the contig table and is arranged from the smallest bit score (innermost lavender-colored ring) to the largest bit score (outermost blue-colored ring, adjacent to the red assembled contig ring). This information is useful to identify assembly inconsistencies with NCBI RefSeq accessions, particularly those based on short-read assemblies. One of the most common discrepancies is assembly gaps resulting from very high GC/AT base content, regions that are poorly sequenced by, for example, Illumina short-read sequencers.

Figure 11: Assembly features of 8739_c_0 compared to the top five NCBI Prokaryotic RefSeq matches

3.2.5 Dotplot to CP000946.1

The dotplot is generated with MUMmer4 by computing maximal exact matching, match clustering, and alignment extension between the contig and the single best-match NCBI sequence and filtered to show only exact matches greater than 2 Kb (i.e. regions with at least one aligned segment larger than 2 Kb). A useful reference to help interpret the output of dotplots is given here. Note that the contig is designated as the query sequence and located on the vertical “Y”-axis. Additionally, apparent indel events at the beginning and end of the diagonal often reflect sequence start-point differences and should be ignored.

Figure 12: Assembly homology to best NCBI match CP000946.1

3.2.6 Similarity to CP000946.1

This plot represents the position of each aligned contig segment (> 2 Kb) and its percent similarity to the corresponding position on the single best-match NCBI reference.

Figure 13: Assembly similarity to best NCBI match CP000946.1

3.2.7 Synteny to CP000946.1

This plot is essentially an amalgamation of the two previous figures. It depicts non-overlapping, highly conserved segments of DNA - synteny blocks - between the assembled contig and its single best-match NCBI reference. The differently colored ribbons show shared synteny blocks with sizes > 2 Kb between the two genomes. The width of the ribbon depicts the relative size of the synteny block, the scale of which is indicated on the outer ring in 50 Kb increments. The green and orange bars depict the direction of synteny blocks on the positive and negative strands, respectively. Ideally, there should be a small number of large synteny blocks. However, depending on the number of structural variants detected (insertions, deletions, duplications, and translocations), the complexity of the graph can increase considerably. For example, it is not uncommon to have numerous very thin (i.e. single pixel width) ribbons traversing the graph. These often correspond to small repetitive elements like insertion sequences.

Figure 14: Synteny block of assembly and best NCBI match CP000946.1

Assembly report for sample 8739

UW-Madison Biotechnology Center

Bioinformatics Resource Center

Date: 11 February, 2019