1 Sample Definitions
1.1 Input Data
Specific information pertaining to the input data for your project is shown in Table 1 below. Check first to ensure that the designations in the Sample Name column match with the associated fastq file(s) in column File Name(s). It is critically important that these designations are correct. If sequencing included a paired-end read, both fastq files should be present in the list of file names used. If data were your data uses a Unique Molecular Index (UMI) to mark duplicate sequencing fragments, then an additional column will describe the source of UMI information.
Sample | File Directory | File Name | Read | UMI |
---|---|---|---|---|
A-12878-A1 | 30M | A-12878-A1_R1.fastq.gz | R1 | none |
A-12878-A1 | 30M | A-12878-A1_R2.fastq.gz | R2 | none |
A-12878-C3 | 30M | A-12878-C3_R1.fastq.gz | R1 | none |
A-12878-C3 | 30M | A-12878-C3_R2.fastq.gz | R2 | none |
A-12878-C5 | 30M | A-12878-C5_R1.fastq.gz | R1 | none |
A-12878-C5 | 30M | A-12878-C5_R2.fastq.gz | R2 | none |
I-12878-C4 | 30M | I-12878-C4_R1.fastq.gz | R1 | none |
I-12878-C4 | 30M | I-12878-C4_R2.fastq.gz | R2 | none |
T-12878-C1 | 30M | T-12878-C1_R1.fastq.gz | R1 | none |
T-12878-C1 | 30M | T-12878-C1_R2.fastq.gz | R2 | none |
The deliverable directory contains numerous files consisting of alignment maps, variant calls, and quality control metrics. These files contain the core information from your project. Content description and explanations of these files are given in the appropriate sections below. Many of these files contain metadata which contain additional information which we do not explicitly pull into this report. Information such as the sequencing flowcell, instrument ID, and cluster are embedded in the names of the sequence alignments.
1.2 Fastq Format
A FASTQ file normally uses four lines per sequence.
Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description.
Line 2 is the raw sequence letters.
Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.
For Illumina FASTQ files the sequence identifier contains information that can be used to identify metadata about the run.
@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<sample number>
Element | Requirements | Description |
---|---|---|
@ | @ | Each sequence identifier line starts with @ |
<instrument> | Characters allowed: a–z, A–Z, 0–9, _ | Instrument ID |
<run number> | Numerical | Run number on instrument |
<flowcell ID> | Characters allowed: a–z, A–Z, 0–9 | A Unique string of characters that identifies the flowcell. |
<lane> | Numerical | Lane number |
<tile> | Numerical | Tile number |
<x_pos> | Numerical | X coordinate of cluster |
<y_pos> | Numerical | Y coordinate of cluster |
<read> | Numerical | Read number. 1 can be single read or Read 2 of paired-end |
<is filtered> | Y or N | Y if the read is filtered (did not pass), N otherwise |
<control number> | Numerical | 0 when none of the control bits are on, otherwise it is an even number. |
<sample number> | Numerical | Sample number from sample sheet |
2 Quality Control
Raw sequence data may contain residual sequence that is artificially introduced as a result of the laboratory methods used to prepare the DNA for sequencing. These sequence artifacts may introduce bias an increase the complexity of downstream data processing. Additionally, sequencing quality can be affected by various factors such as variation in chemical reagents, instrument platform, temperature or humidity changes, initial DNA quality or concentration. Due to the imperfect nature of the sequencing process and limitations of the optical instruments (batch effects, sequencing error, GC bias, duplicate reads, copy number variation, mapability), base calling has inherent uncertainty associated with it. The magnitude of uncertainty in each base call is represented by an error probability or Phred (pronounced like “fred”) score. This parameter is denoted formally as \(Q\) and is proportional to the probability \(p\) that a base call is incorrect, where \(Q = -10log~10~(p)\).
Any biological inference or conclusion depends on the overall quality of your data. Therfore a quality control (QC) step is essential for providing high quality results. The following QC metrics calculated from your data are designed to help you evaluate the overall technical quality of your experiment. You should always review these metrics to ensure your data are of good quality and lack obvious problems or biases, which may affect how you can ultimately use these data.
2.1 Read Trimming
During sequencing library construction, short oligonucleotides are ligated to the ends DNA fragments to be sequenced, so that they can be combined with primers for PCR amplification. While sequencing the fragments, if the read length is greater than that of the target DNA, the adapter sequence next to the unknown DNA sequence of interest is also sequenced, sometimes only partially. To recover the true target DNA sequence, it is required to identify the adapter sequence and remove it. Moreover, to improve further the quality of the sequence reads, skewer also trims the 3’ end of the fragment until a Phred quality of 20 is reached. This is equivalent to a base call accuracy of 99%. More information regarding Phred scores can be obtained from Ewing et al. (1998) and in Ewing and Green (1998).
The Bioinformatics Resource Center performs a quality control step to remove any sequencing adapters and low quality bases. The trimming software skewer (Jiang et al. 2014) was used to pre-process raw fastq files. Skewer implements an efficient dynamic programming algorithm designed to remove adapters and/or primers from sequence reads. Reads that are trimmed too short to be of usable length are discarded from downstream analysis along with the corresponding read pair, in the case of paired-end sequencing.
Sample | Input Reads | Passing | Data Source |
---|---|---|---|
A-12878-A1 | 30000000 | 29968027 (99.89%) | 30M/A-12878-A1_R1.fastq.gz |
A-12878-C3 | 30000000 | 29939682 (99.80%) | 30M/A-12878-C3_R1.fastq.gz |
A-12878-C5 | 30000000 | 29929623 (99.77%) | 30M/A-12878-C5_R1.fastq.gz |
I-12878-C4 | 30000000 | 29907377 (99.69%) | 30M/I-12878-C4_R1.fastq.gz |
T-12878-C1 | 30000000 | 29934501 (99.78%) | 30M/T-12878-C1_R1.fastq.gz |
2.2 Sequence Quality Statistics
After quality filtering, BRC computes QC statistics on your read data including: read length and GC content distribution.
The graphs of summarized data should be fairly uniform for all samples in each QC metric. If one or more metrics in a sample [library] are considerably different, or if they are all highly variable, it may be indicative of a problem that warrants additional attention.
3 Sequence Alignment
The assignment of sequencing reads to the most likely locus of origin is called read alignment or mapping. One challenge of short-read alignment is to map millions of reads accurately in a reasonable amount of time, but in the presence of sequencing errors, genomic variation, incomplete sequencing annotation etc. Many alignment programs exist and employ various strategies meant to find a balance between mapping fidelity, error tolerance, and time. The BRC uses the latest algorithm in the Burrows-Wheeler Aligner (BWA-MEM) to map the sequence data to the reference genome, Homo sapiens (1kg_v37).
3.1 Read Mapping
The trimmed single- or paired-end reads are aligned against the selected reference genome sequence using BWA-MEM (Li 2013). BWA is a software package for mapping low-divergent sequences against a large reference genome. The BWA-MEM algorithm works by seeding alignments with maximal exact matches (MEMs) and then extending seeds with the affine-gap Smith-Waterman algorithm (SW) through local sequence alignment. BWA may produce multiple primary alignments for different part of a query sequence. This is a crucial feature for long sequences. However, some downstream tools do not work with split alignments.
BWA also finds all hits containing one more mismatches. If the best hit is highly repetitive, then BWA finds all equally best hits only. Base quality is NOT considered in evaluating hits. In the paired-end mode, BWA pairs all hits that it finds. It further performs Smith-Waterman alignment for unmapped reads to rescue reads with a high error rate, and for high-quality anomalous pairs to fix potential alignment errors. The final output from sequence alignment will be a Sequence Alignment Map (SAM)) file. BRC provides the binary (BAM) equivalent of these files, which has a smaller file size but contains the same information as SAM. The binary BAM file can be converted back into a human readable text format by using the SAMTOOLS view command.
Col | Field | Description |
---|---|---|
1 | QNAME | Query (pair) NAME |
2 | FLAG | bitwise FLAG |
3 | RNAME | Reference sequence NAME |
4 | POS | 1-based leftmost POSition/coordinate of clipped sequence |
5 | MAPQ | MAPping Quality (Phred-scaled) |
6 | CIGAR | extended CIGAR string |
7 | MRNM | Mate Reference sequence NaMe (‘=’ if same as RNAME) |
8 | MPOS | 1-based Mate Posistion |
9 | ISIZE | Inferred insert SIZE |
10 | SEQ | query SEQuence on the same strand as the reference |
11 | QUAL | query QUALity (ASCII-33 gives the Phred base quality) |
12 | OPT | variable OPTional fields in the format TAG:VTYPE:VALUE |
3.2 Regions of Interest
The input reads were mapped to the full reference genonme. However, many experiments use capture or amplicon enrichment methods to selectivly target regions of the geneome that are most relavent to the project. For most projects these regions represent a the exons plus some padding bases containing the surrounding introninc sequence. We report metrics for only the regions of interest specifed. The metrics will be highly skewed if we were to report on regions of the genome for which there were no data. It is also important to know that we call variants from only the regions of interst. We provide the a gzipped file with the merged regions of interest for this project.
Regions of interest | Number of targets | Total bp targeted | Genome Size |
---|---|---|---|
exome_roi.bed | 192823 | 34008842 | 3137454505 |
3.3 Mapping statistics
While informative, the sequence-only metrics reported above are alone insufficient to assess the usability of sequencing data. Data summaries derived from read mapping to the a reference provide additional QC information. For example, sequencing depth should be sufficient to perform basic downstream analyses. As part of this report, mapping statistics for each fastq file were also computed to help gauge various QC metrics including comparisons at the (1) sample-level (2) chromosome-level and (3) aggregate gene-level.
Sample-level information provides the broadest level of mapping QC and is shown in Table 3. It includes the total number of alignments and percentages distributed across primary and secondary alignments. This report is best used to gauge how well reads from each sample mapped (aligned) to the specific reference assembly chosen. Poor primary mapping (unique alignments) may indicate an inappropriate or poorly assembled reference genome. It may also suggest significant repeated regions, which would be reflected in an increase in secondary mapping (non-unique alignments).
The BRC recommends that you check visually the results of the sample-level mappings to ensure the reads align to the expected regions. The Integrative Genomics Viewer (IGV) application is a software that can be downloaded with an academic email address from IGV. It is platform independent and only requires an up-to-date Java installation.
Indexed BAM files of all alignments are available in the delivery directory. To view the coverages, first start IGV and load your genome of interest (IGV has most genomes available for download from within the IGV browser). Second, select Load from File… from the File menu. A dialog box will open. Navigate to and select the BAM file corresponding to the specific sampleName you are interested in viewing.
Sample | 1x | 5x | 10x | 20x | 30x | 40x | 50x |
---|---|---|---|---|---|---|---|
A-12878-A1 | 98.28 | 97.75 | 96.83 | 93.59 | 88.44 | 81.85 | 74.36 |
A-12878-C3 | 98.55 | 98.21 | 97.76 | 95.82 | 91.69 | 85.4 | 77.53 |
A-12878-C5 | 98.56 | 98.02 | 96.68 | 90.43 | 80.13 | 68.24 | 56.69 |
I-12878-C4 | 98.26 | 98 | 97.79 | 96.64 | 92.17 | 83.19 | 71.08 |
T-12878-C1 | 97.95 | 97.24 | 96.45 | 94.59 | 92.24 | 89.22 | 85.08 |
3.4 Post Alignment Processing
After sequencing reads are aligned to the reference, they are then sorted by position using samtools (Li et al. 2009). Position sorted SAM/BAM files can be indexed for random access so that the entire file does not have to be read into memory when displaying sequence alignments for a given locus. Libraries with high sequencing depth may sample the same fragment of DNA multiple times leading to duplication of reads. Furthermore, the PCR amplification step in the library preparation may preferentially amplify certain fragments due to their length and/or sequence context. BRC marks these duplicate reads using the Picard markduplicates function. Picard uses the position of the fragment ends to determine if alignments are likely to come from the same original fragment of input DNA.
Some projects use of a Unique Molecular Index (UMI). UMIs tag each molecule of input DNA with a unique barcode of random nucleotides so that the UMI barcode can be used to distinguish if alignments with the same positional ends actually come from a two separate input DNA molecules or if the alignments are duplicate reads from a single input fragment. The use of UMIs is important if you wish to have accurate depth of coverage for projects with extremely high sequencing depth or projects that utilize non-random fragment selection (i.e. amplicon based enrichment or enzymatic shearing) during library construction. BRC uses Je, a tool that uses the Picard API, to process the UMI information if applicable (Girardot et al. 2016).
Sample | Total | Mapped Reads | Non-Duplicate | On-Target | Mean Depth |
---|---|---|---|---|---|
A-12878-A1 | 0 | 57,167,303 | 45,488,367 | 33446632 (73.53%) | 94.646 |
A-12878-C3 | 0 | 58,229,969 | 47,645,184 | 32346054 (67.89%) | 92.817 |
A-12878-C5 | 0 | 58,601,929 | 44,536,039 | 24267507 (54.49%) | 68.177 |
I-12878-C4 | 0 | 55,601,268 | 47,397,109 | 25047244 (52.85%) | 72.415 |
T-12878-C1 | 0 | 57,953,614 | 50,601,430 | 32473984 (64.18%) | 94.698 |
The "On-Target" percentage and "Mean Depth" count only the sequence reads that overlap with the regions of interest specified for this report (see above). Therefore these values are highly dependent on what regions of interest are defined for this project. For example, if the regions of interest represent a subset of genes from an exome test you can expect the On-Target percentage to be a reflection of how much of your total sequenced data corresponds to your subsetted genes. Likewise, the mean coverage represents the mean coverage depth for only the regions defined in the regions of interest.
4 Variant Detection
BRC uses the GATK HaplotypeCaller (McKenna et al. 2010) to detect small variants from the prepared alignment files. HaplotypeCaller will scan the alignment file for a regions which show signs of variation from the reference. The variant caller then uses a local de-novo assembly using a De Bruijn-like graph to reassemble the active region and identify possible haplotypes. The program then realigns each haplotype against the reference using the Smith-Waterman algorithm. HaplotypeCaller then uses the read data to build a likelihood matrix for each potential variant site using the PairHMM algorithm. For each potentially variant site, the program applies Bayes' rule, using the likelihoods of alleles to calculate the likelihoods of genotypes per sample given the read data that was observed for the sample. The most likely genotype is then assigned to the sample.
4.1 Variant Call Format (VCF)
A variant call format (VCF) file is the most common way to communicate a list of variations from the reference genome. VCF is a flexible text based format that is often stored in a gzip compressed format (*.vcf.gz extension). A detailed description of the nuances of this file format can be found at here. Briefly, the format contains meta-information lines, a header line, and then data lines. Each data line is tab-delimited with 8 fixed fields per record, described in the table below. The INFO field is the fixed field that contains a large amount of metadata about the variant. The data contained in the INFO field is variable and will be specified in the VCF header section.
Following the 8 fixed fields, there are optional fields describing the genotype information for each sample (the same types of genotype data must be present for all samples). A FORMAT field is given specifying the order of the colon-separated genotype data for each sample field. See the VCF documentation for more information about genotype fields.
Col | Field | Description |
---|---|---|
1 | CHROM | Identifies the chromosome on the reference |
2 | POS | The reference position, with the 1st base having position 1. |
3 | ID | Semi-colon separated list of unique identifiers where available |
4 | REF | Reference base(s) |
5 | ALT | Comma separated list of alternate non-reference alleles |
6 | QUAL | Phred-scaled quality score for the assertion made in ALT |
7 | FILTER | A semicolon-separated list of codes for failed filters (PASS or . indicate that the variant passed quality checks) |
8 | INFO | Additional information encoded as a semicolon-separated series of short keys with optional values |
9 | FORMAT | A colon-separated field that specifies the data types and order for the following <SAMPLE> fields |
10+ | <SAMPLE> | The genotype information for the sample specifed in the header line |
4.2 Small Variant Statistics
The Single nucleotide variant (SNV) constitutes the largest class of variation from a reference genome. The most frequent variations in a population are known as polymorphisms. Thus, this type of variant is sometimes referred to as a single nucleotide polymorphism (SNP). A typical human genome has almost 4 million SNPs. Only a small fraction of these occur the coding region of the genome, commonly known as the exome.
Another common class of genomic variation is the small insertion, deletion, or InDel (insertion and deletion). These variants can have implications for significant changes to the transcripts and/or proteins if they occur in a coding region or splice site. If the length of an InDel is not a multiple of three nucleotides, the reading frame of the the transcript will be altered with a frame shift. The statistics for small InDel mutations are shown in the table below.
Sample | Filtered | Passed | SNPs | Indels | Het/Hom ratio | Indel/SNP ratio |
---|---|---|---|---|---|---|
A-12878-A1 | 915 | 22147 | 21504 | 642 | 1.54 (13444/8702) | 0.03 (642/21504) |
A-12878-C3 | 1099 | 22269 | 21587 | 682 | 1.57 (13594/8675) | 0.03 (682/21587) |
A-12878-C5 | 1161 | 22183 | 21524 | 657 | 1.56 (13512/8669) | 0.03 (657/21524) |
I-12878-C4 | 1286 | 22211 | 21500 | 710 | 1.57 (13583/8627) | 0.03 (710/21500) |
T-12878-C1 | 1283 | 21989 | 21296 | 693 | 1.56 (13402/8587) | 0.03 (693/21296) |
4.3 Large Variant Detection
A significant portion of genetic differences between genomes comes in the form of gross changes to the number and/or position of genomic regions. Copy number variations (CNVs) are a class of genomic variant that leads to a gain or loss of a relatively large fragment of the genome. Many CNVs are the result of structural variations. In general, a structural variation can be a deletion, duplication, insertion, inversion, or translocation event that changes the chromosome on a relatively large scale. Large genomic variants can be more difficult to detect with short read technologies. Many methods have been developed to try to address this problem, several of which use a matched control sample (which may not be available for every project).
4.4 Variant Annotation
Following variant detection, BRC annotates the resulting list of mutations with the variant annotation tool SNPeff (Cingolani et al. 2013). SnpEff uses the known gene annotations of the reference genome to predict coding effects such as: synonymous or non-synonymous amino acid changes, gains or losses of start/stop codons, and frame shifts due to InDels. Annotations also include introns, upstream/downstream untranslated regions, splice sites, or intergenic regions. The tool classifies the predicted mutation according to its predicted impact (HIGH, MODERATE or LOW) on gene function. The information from the SNPeff annotation can be found in the INFO field for each variant in the VCF file.
5 Interpreting Results
Analysis and interpretation of genomic data generated by sequencing technologies are among the most complex problems in genomic sciences. One consequence of this reality is that analysis and interpretation of sequencing data is far from routine. Analytic “best practices” remain very dynamic and reflect our current limited understanding of genomics including technical aspects of data collection.
In brief, the process used in the UW-Madison BRC pipeline was to first demultiplex, filter, and trim sequencing reads. Next, reads were mapped to a reference genome and duplicate alignments were marked to prevent artificial bias during variant calling. Variations from the reference were then identified and annotated according to the known reference annotation.
5.1 Did it work?
The identification of mutations in which you are most interested will undoubtedly dictate a measure of “success” or “failure” regarding your experiment. For example, if you had no expectations a priori for any specific mutations and simply wanted a candidate list of variants found in your sample, no matter the outcome, you would likely deem the experiment a success.
Alternatively, if you were interested in testing for specific mutations among a known group of candidate genes, but the analysis showed no evidence of any mutations in your candidate list, you may hastily - and perhaps erroneously - conclude that the experiment was a “failure”.
To evaluate how confident you can be in the results, you must go beyond the variant list itself, and evaluate additional aspects of the analyses and perform some basic checks. Make sure that you review the QC metrics and check that you had sufficient coverage in the regions of interest. There are limitations to calling variants when insufficient data is available for the entire region of interest. It is also possible that the experiment methodology was not appropriate to detect the specific type of mutations needed. For example, detecting structural variants using short read sequencing data is complicated and does not always allow for the detection of complex variations from the reference.
5.2 Where do I go from here?
In nearly all cases, creating a list of variants is not the final step of the analysis. The American College of Medical Genetics and Genomics (ACMG) has published recommendations on how to best interpret the pathogenicity of variants (Richards et al. 2015). At a minimum, further biological insight into an experiment may be gained by examination of genes identified with suspected deleterious mutations, particularly if the prior knowledge of genes or gene pathways involved may highlight specific variants that deserve further scrutiny. You may want to consider the expected inheritance pattern for the phenotype. For a simple recessive phenotype with a Mendelian inheritance pattern, you would expect to find a homozygous mutation or multiple deleterious heterozygous mutations in the responsible gene. Complex traits may be expected to have contributions from several loci. You may consider using Phenotype Ontology terms to narrow your candidate gene search to a list of genes which are known to be associated with the observed phenotype. You may wish to explore additional methodologies like Gene Set Enrichment Analysis (GESA) (Subramanian et al. 2005). Additional insight may also be gained by establishing connections with other experimental data such as RNAseq.
Please feel free to contact our office should you have any questions regarding your data.
6 Credits
6.1 Bioinformatics Resource Center
Bioinformatics Resource Center
Genetics & Biotechnology Center, Room 2130
425 Henry Mall, Madison WI 53706
Email: brc@biotech.wisc.edu
Website: https://bioinformatics.biotech.wisc.edu
6.2 Software
Platform: x86_64-pc-linux-gnu (64-bit)
6.3 References
Cingolani P, Platts A, Wang le L, Coon M, Nguyen T, Wang L, Land SJ, Lu X, R. D., Cingolani, P., Platts, A., Wang, L. L. L., Coon, M., Nguyen, T., … Ruden, D. M. (2013). A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118 ; iso-2; iso-3. Fly, 5(1), 29–30. https://doi.org/10.1016/S1877-1203(13)70353-7
Ewing, B., Hillier, L., Wendl, M. C., & Green, P. (1998). Base-Calling of Automated Sequencer Traces Using Phred. I. Accuracy Assessment. Genome Research, 8(3), 175–185. https://doi.org/10.1101/gr.8.3.175
Ewing, B., & Green, P. (1998). Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Research, 8(3), 186–194. https://doi.org/10.1101/gr.8.3.186
Girardot, C., Scholtalbers, J., Sauer, S., Su, S. Y., & Furlong, E. E. M. (2016). Je, a versatile suite to handle multiplexed NGS libraries with unique molecular identifiers. BMC Bioinformatics, 17(1), 4–9. https://doi.org/10.1186/s12859-016-1284-2
Jiang, H., Lei, R., Ding, S. W., & Zhu, S. (2014). Skewer: A fast and accurate adapter trimmer for next-generation sequencing paired-end reads. BMC Bioinformatics, 15(1), 1–12. https://doi.org/10.1186/1471-2105-15-182
Li H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997.
Li, H., Handsaker, B., et al. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics (Oxford, England), 25(16), 2078–9. https://doi.org/10.1093/bioinformatics/btp352
McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., … DePristo, M. A. (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research, 20(9), 1297–303. https://doi.org/10.1101/gr.107524.110
Richards, S., Aziz, N., Bale, S., Bick, D., & Das, S. (2015). Standards and guidelines for the interpretation of sequence variants : a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology, (December 2014), 1–20. https://doi.org/10.1038/gim.2015.30
Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., … Mesirov, J. P. (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43), 15545–50. https://doi.org/10.1073/pnas.0506580102