# UW Bioinformatics Resource Center Tassel v2 pipeline GBS report

Report generated: 01/09/2020

# 0Genotype By Sequencing

Genotype by Sequencing (GBS): is a molecular technique that allows genomewide genotyping of a population of organisms by performing reduced representation sequencing of their genome. GBS is a robust, simple and affordable means of SNP discovery and mapping. Users of GBS often aim to find Single Nucleotide Polymorphisms (SNPs) within a population to: promote understanding of the relationships between the organisms within, or utilize the genotyping information for genomic selection, prediction or trait analysis.

## 0.1 Methods

Generating a reduced representation of a genome requires using restriction endonucleases to cleave the double stranded genomic DNA at any position the enzymes unique cut site is found. By utilizing one or more restriction endonucleases, the genomic DNA can be cleaved into manageable fragments. Utilizing different restriction enzymes or combinations of restriction enzyme(s) allow a GBS user to sequence different sites throughout the genome. After enzymatic cleavage the sticky ends of DNA that are left freely floating are used as anchors for the ligation of barcoded adaptors.

The barcoded adaptors serve several important functions:

1. A way to indentify each sample uniquely from the others in pooled DNA sequencing
2. A DNA template to use for PCR amplification of the DNA before sequencing
3. A structural template to adhere other sequencing instrument specific adaptors to

After barcode ligation the DNA is amplified via PCR, purified and quantified before sequencing. Upon sequencing, generally many samples are pooled into a sequencing lane to provide cost effective sharing of the massive sequencing depth provided by todays sequencing instruments.

Your sequencing data has been produced on our NovaSeq 6000. Information about sequencing density, speed and accuracy for this instrument can be found here.

# 1Sample Definitions

## 1.1 Input Data

Specific information pertaining to the input data for your project is shown in Table 1 below.

Date: 01/09/2020 /mnt/grl/brc/data/Client/0190620/V2 /mnt/genomes/Spinacia_oleracea/primary.fa ApeKI

The deliverable directory contains numerous files consisting of alignment maps, variant calls, and quality control metrics. These files contain the core information from your project. Content description and explanations of these files are given in the appropriate sections below. Many of these files contain metadata which contain additional information which we do not explicitly pull into this report. Information such as the sequencing flowcell, instrument ID, and cluster are embedded in the names of the sequence alignments.

## 1.2 Fastq Format

Your raw GBS data can be found in the file in the FASTQ format. The Illumina file naming structure is traditionally:

A random sample might be named:

Plate-2-Maize_S2_L004_R1_001.fastq.gz

After decompression, FASTQ files have four lines per sequence:

• Line 1: begins with a '@' character and is followed by a sequence identifier and an optional description.
• Line 2: is the raw sequence letters.
• Line 3: begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
• Line 4: encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.

For Illumina FASTQ files the sequence identifier contains information that can be used to identify metadata about the run.

@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<sample number>
ElementRequirementsDescription
@@Each sequence identifier line starts with @
<instrument>Characters allowed: a–z, A–Z, 0–9, _Instrument ID
<run number>NumericalRun number on instrument
<flowcell ID>Characters allowed: a–z, A–Z, 0–9A Unique string of characters that identifies the flowcell.
<lane>NumericalLane number
<tile>NumericalTile number
<x_pos>NumericalX coordinate of cluster
<y_pos>NumericalY coordinate of cluster
<is filtered>Y or NY if the read is filtered (did not pass), N otherwise
<control number>Numerical0 when none of the control bits are on, otherwise it is an even number.
<sample number>NumericalSample number from sample sheet

# 2 Quality Control

Raw sequence data may contain residual sequence that is artificially introduced as a result of the laboratory methods used to prepare the DNA for sequencing. These sequence artifacts may introduce bias an increase the complexity of downstream data processing. Additionally, sequencing quality can be affected by various factors such as variation in chemical reagents, instrument platform, temperature or humidity changes, initial DNA quality or concentration. Due to the imperfect nature of the sequencing process and limitations of the optical instruments (batch effects, sequencing error, GC bias, duplicate reads, copy number variation, mapability), base calling has inherent uncertainty associated with it. The magnitude of uncertainty in each base call is represented by an error probability or Phred (pronounced like “fred”) score. This parameter is denoted formally as $$Q$$ and is proportional to the probability $$p$$ that a base call is incorrect, where $$Q = -10log~10~(p)$$.

Any biological inference or conclusion depends on the overall quality of your data. Therfore a quality control (QC) step is essential for providing high quality results. The following QC metrics calculated from your data are designed to help you evaluate the overall technical quality of your experiment. You should always review these metrics to ensure your data are of good quality and lack obvious problems or biases, which may affect how you can ultimately use these data.

During sequencing library construction, short oligonucleotides are ligated to the ends DNA fragments to be sequenced, so that they can be combined with primers for PCR amplification. While sequencing the fragments, if the read length is greater than that of the target DNA, the adapter sequence next to the unknown DNA sequence of interest is also sequenced, sometimes only partially. To recover the true target DNA sequence, it is required to identify the adapter sequence and remove it. Moreover, to improve further the quality of the sequence reads, skewer also trims the 3’ end of the fragment until a Phred quality of 20 is reached. This is equivalent to a base call accuracy of 99%. More information regarding Phred scores can be obtained from and in .

The Bioinformatics Resource Center performs a quality control step to remove any sequencing adapters and low quality bases. The trimming software skewer was used to pre-process raw fastq files. Skewer implements an efficient dynamic programming algorithm designed to remove adapters and/or primers from sequence reads. Reads that are trimmed too short to be of usable length are discarded from downstream analysis along with the corresponding read pair, in the case of paired-end sequencing.

# 3Tassel GBS Pipeline Version 2

The Tassel Version 2 GBS pipeline is an extension of the Java program Tassel. Tassel GBS is a scalable high throughput GBS data analysis platform with moderate computing resource requirements .

The new pipeline stores data to an embedded SQLite database. All steps of the pipeline either read from or write to this database. It is initially created in the GBSSeqToTagDBPlugin step. When running this pipeline, each subsequent step utilizes/adds data from the database created in the first step. A diagram of the database is presented below.

image credit: Tassel GBS v2 Bitbucket

## 3.1 GBSSeqtoTagDBPlugin

GBSSeqToTagDBPlugin takes fastq files as input, identifies tags and the taxa in which they appear, and stores this data to a local database. It keeps only good reads having a barcode and a cut site and no N’s in the useful part of the sequence. It trims off the barcodes and truncates sequences that (1) have a second cut site, or (2) read into the common adapter.

See the table below for a list of your input files and their associated raw and good barcoded reads:

BHKT72DMXXS24_2_fastq.gz 328727547 314900055 95.794
BHKT72DMXXS25_2_fastq.gz 346133972 337261358 97.437

Raw Reads: is the unprocessed reads within the file before any analysis.

Good Barcoded Reads: are sequence reads with a perfect match to one of the barcodes provided in a barcode key file and with no N’s in the sequence following the barcode up to the trim length.

See the table below to determine the read representation per sample:

TB01:2019062415002104871283
TB02:2019062415002113840882
TB03:2019062415002123141617
TB04:2019062415002131220516
TB05:201906241500214403584
TB06:2019062415002153332922
TB07:2019062415002163356085
TB09:2019062415002183206050
TB10:2019062415002191763673
TB11:20190624150021102537969
TB12:20190624150021113643434
TB13:2019062415002112904588
TB14:20190624150021131922513
TB15:20190624150021143311405
TB16:20190624150021152159178
TB17:20190624150021163736498
TB18:20190624150021173036391
TB19:20190624150021181038867
TB20:20190624150021194383077
TB21:20190624150021203541881
TB22:20190624150021212879401
TB23:20190624150021221966082
TB24:20190624150021232645742
TB25:20190624150021241720659
TB269:20190624150021633996310
TB26:20190624150021252093917
TB27:20190624150021264019942
TB281:20190624150021614297906
TB287:20190624150021424301044
TB288:20190624150021431394822
TB28:201906241500212731361
TB292:20190624150021444693220
TB293:20190624150021457144642
TB29:20190624150021282095624
TB30:20190624150021291473205
TB311:20190624150021643242522
TB312:20190624150021653942637
TB313:20190624150021664693147
TB314:20190624150021672697100
TB315:20190624150021684542589
TB317:20190624150021691417591
TB318:20190624150021701249588
TB319:20190624150021713262468
TB31:20190624150021301319527
TB320:20190624150021911927806
TB321:20190624150021723336375
TB322:20190624150021734053412
TB323:20190624150021903570
TB324:20190624150021742941917
TB32:20190624150021311803268
TB33:20190624150021323090496
TB342:20190624150021463898621
TB344:20190624150021473892773
TB34:20190624150021332527338
TB351:20190624150021483546380
TB352:20190624150021496602342
TB355:20190624150021923543571
TB356:20190624150021891076192
TB359:20190624150021753780938
TB35:20190624150021342861343
TB360:20190624150021763531495
TB361:20190624150021774536715
TB362:20190624150021783845707
TB363:20190624150021794377290
TB364:20190624150021802443930
TB365:20190624150021813295198
TB366:20190624150021825336955
TB367:20190624150021831017752
TB368:20190624150021842119291
TB369:20190624150021853487084
TB36:20190624150021353289158
TB371:20190624150021862763307
TB375:20190624150021943362717
TB37:20190624150021365446743
TB382:20190624150021932024661
TB385:20190624150021872118758
TB386:20190624150021882522771
TB387:20190624150021503041348
TB388:20190624150021514596016
TB389:20190624150021523734361
TB38:20190624150021373180931
TB390:20190624150021535009953
TB391:20190624150021542882213
TB392:20190624150021553993264
TB39:20190624150021383939640
TB400:2019062415002103297436
TB401:2019062415002112450620
TB402:2019062415002122269217
TB403:2019062415002132043302
TB404:2019062415002141934339
TB405:2019062415002151620578
TB406:2019062415002163072910
TB407:2019062415002173949526
TB408:2019062415002182728633
TB409:2019062415002193019801
TB40:20190624150021395146874
TB410:20190624150021103293726
TB411:20190624150021111875492
TB412:20190624150021123249439
TB413:20190624150021132913555
TB414:20190624150021142071057
TB415:20190624150021151421149
TB416:20190624150021162978207
TB417:20190624150021172778422
TB418:20190624150021183286214
TB419:20190624150021193634284
TB41:20190624150021403604561
TB420:20190624150021203569161
TB421:20190624150021212395300
TB422:20190624150021223024827
TB423:20190624150021232770340
TB424:20190624150021566907389
TB425:20190624150021573920483
TB427:20190624150021585905164
TB428:20190624150021242675804
TB429:20190624150021254837039
TB42:20190624150021413598196
TB430:20190624150021264358571
TB431:20190624150021272892719
TB432:20190624150021282722583
TB433:20190624150021292702163
TB434:20190624150021302393586
TB435:20190624150021312061431
TB437:20190624150021322331902
TB438:20190624150021333008786
TB439:2019062415002134558757
TB43:20190624150021424395619
TB440:20190624150021353902577
TB443:20190624150021366149376
TB444:20190624150021374628001
TB445:20190624150021382960481
TB446:20190624150021395211966
TB447:20190624150021402340850
TB448:20190624150021414919473
TB44:20190624150021433661816
TB451:20190624150021951051117
TB453:20190624150021594219761
TB454:20190624150021604092276
TB45:20190624150021444168867
TB46:20190624150021453281510
TB47:20190624150021463998183
TB48:20190624150021473264880
TB49:20190624150021483983888
TB50:20190624150021495438900
TB51:20190624150021504072843
TB52:20190624150021513902759
TB53:20190624150021524558727
TB54:20190624150021534345295
TB55:20190624150021544012202
TB56:20190624150021552905396
TB57:20190624150021563934828
TB58:20190624150021571416179
TB59:20190624150021586208739
TB60:20190624150021592530028
TB61:20190624150021604539002
TB62:20190624150021614370473
TB63:20190624150021624066966
TB64:20190624150021634761181
TB65:20190624150021643230965
TB66:20190624150021653967260
TB67:20190624150021662896394
TB68:20190624150021673316771
TB69:20190624150021681153914
TB70:20190624150021692232678
TB71:20190624150021702040634
TB72:20190624150021712796892
TB73:20190624150021722094934
TB74:20190624150021733095355
TB75:20190624150021743640028
TB76:2019062415002175887970
TB77:20190624150021761954851
TB78:20190624150021774320311
TB79:20190624150021783946669
TB80:20190624150021793703259
TB81:20190624150021802535635
TB82:20190624150021812804720
TB83:20190624150021823994439
TB84:20190624150021831980138
TB85:20190624150021842678242
TB86:20190624150021853340099
TB87:20190624150021863009971
TB88:20190624150021873732187
TB89:20190624150021883138272
TB90:20190624150021894309035
TB91:20190624150021902754301
TB92:20190624150021912325574
TB93:20190624150021924015065
TB94:20190624150021932046390
TB95:20190624150021943085905
TB96:20190624150021951710394
blank2:2019062415002162254
blank:20190624150021790212
Total: 606654479

Demultiplexed Reads are sequence reads with a perfect match to one of the barcodes provided in a barcode key file and with no N’s in the sequence following the barcode up to the trim length.

Interpretation: The list above provides the Demultiplexed Reads assigned to each sample. Therefore, the table values should provide a clue as to which samples are most well represented in your GBS experiment (>1x106 Demultiplexed Reads). Low counts (i.e. <1x105) likely indicates sample dropout. Sample dropout occasionally occurs in GBS due to the large number of samples that are pooled on our patterned Illumina flow cells and the error inherent in measuring DNA libraries.

## 3.2 TagExportToFastqPlugin

TagExportToFastqPlugin retrieves distinct tags stored in the database and reformats them to a FASTQ file that can be read by the Bowtie 2 or BWA aligner program. This plugin leads to one of the main practical advantages of the Tassel GBS pipeline. Namely, the collapse of redundant reads into tags of a predefined tag size. This read collapse drastically shortens the time required for tag/read alignment because of the reduced number of tags compared to the input reads. We utilized 64 bp tags during the analysis of your data. Even a single bp difference in reads of similar sequence will lead to different tags being stored in the database. The output file of TagExportToFastqPlugin is the input to the DNA sequence aligner Bowtie 2.

## 3.3 GBS Tag Alignment

The BRC uses the Bowtie 2 alignment software for alignment of the FASTQ data transfered from the TagExportToFastqPlugin. Bowtie 2 is a fast and memory efficient read aligner that supports gapped, local and paired-end alignment modes . To narrow down the locations a read can map in a large genome, Bowtie 2 extracts multiple "seeds" (smaller DNA sequences) from each read/tag and aligns them in an ungapped manner to the FM index in an effort to align the entire read. This multiseed heuristic increases the speed of alignment but can reduce accuracy.

Bowtie 2 deals with inaccuracy of read mapping via two methods:

1. Using an alignment scoring scheme which penalizes inaccurate alignments (gaps,insertions and mismatches relative to the reference) and rewards accuracy (correctly matched bases). Simply put, the higher the score the more similar the read/tag is to the reference sequence. If a read's alignment score exceeds the minimum score threshold (adjustable), the read's alignment is considered "valid". The scoring scheme is implemented differently for local vs end-to-end (default) alignments. In the Tassel v2 GBS pipeline we employ the --very-sensitive-local method during mapping of tags to the genome to ensure high accuracy of each tags mapping location.
2. Outputting a Mapping Quality (MAPQ) for each read alignment. The MAPQ in Bowtie 2 represents the aligners confidence that the location the read is aligned to is the read's true point of origin. MAPQ ranges from (lowest: 0 --> highest: 42) and is reported as: Q = -10 log10 p, where p is an estimate of the probability that the alignment does not correspond to the read’s true point of origin.
For more information on the inner workings of Bowtie 2, see the Bowtie 2 documentation.

The output of each Bowtie 2 run is a Sequence Alignment Map (SAM) file, the details of which can be found below:

ColFieldDescription
1QNAMEQuery (pair) NAME
2FLAGbitwise FLAG
3RNAMEReference sequence NAME
4POS1-based leftmost POSition/coordinate of clipped sequence
5MAPQMAPping Quality (Phred-scaled)
6CIGARextended CIGAR string
7MRNMMate Reference sequence NaMe (‘=’ if same as RNAME)
8MPOS1-based Mate POSition
9ISIZEInferred insert SIZE
10SEQquery SEQuence on the same strand as the reference
11QUALquery QUALity (ASCII-33 gives the Phred base quality)
12OPTvariable OPTional fields in the format TAG:VTYPE:VALUE

This data has been provided to you in the GBS_ folder as an alignment.bam.

See below for a table indicating genome mapping of the tags in your population:

Total TagsSecondary Tag Alignment% Secondary Tag AlignmentSupplementary Tag Alignment% Supplementary Tag AlignmentMapped Tags% Mapped Tags
1175237 0 0.0 0 0.0 599318 50.996

## 3.4 SamToGBSdbPlugin

SAMToGBSdbPlugin reads a SAM file to determine the potential positions of Tags against the reference genome. The plugin updates the current database with information on tag cut positions.

## 3.5 DiscoverySNPCallerPluginV2

DiscoverySNPCallerPluginV2 takes a GBSv2 database file as input and identifies SNPs from the aligned tags. Tags positioned at the same physical location are aligned against one another, SNPs are called from the aligned tags, and the SNP position and allele data are written to the database.

## 3.6 SNPQualityProfilerPlugin

This plugin scores all discovered SNPs for various coverage, depth and genotypic statistics for a given set of taxa.

## 3.7 ProductionSNPCallerPluginV2

This plugin converts data from fastq and keyfile to genotypes, then adds these to a genotype file in VCF or HDF5 format. VCF is the default output.

# 4Variant Detection

## 4.1 Variant Call Format (VCF)

A variant call format (VCF) file is the most common way to communicate a list of variations from the reference genome. VCF is a flexible text based format that is often stored in a gzip compressed format (*.vcf.gz extension). A detailed description of the nuances of this file format can be found at here. Briefly, the format contains meta-information lines, a header line, and then data lines. Each data line is tab-delimited with 8 fixed fields per record, described in the table below. The INFO field is the fixed field that contains a large amount of metadata about the variant. The data contained in the INFO field is variable and will be specified in the VCF header section.

Following the 8 fixed fields, there are optional fields describing the genotype information for each sample (the same types of genotype data must be present for all samples). A FORMAT field is given specifying the order of the colon-separated genotype data for each sample field. See the VCF documentation for more information about genotype fields.

ColFieldDescription
1CHROMIdentifies the chromosome on the reference
2POSThe reference position, with the 1st base having position 1.
3IDSemi-colon separated list of unique identifiers where available
4REFReference base(s)
5ALTComma separated list of alternate non-reference alleles
6QUALPhred-scaled quality score for the assertion made in ALT
7FILTERA semicolon-separated list of codes for failed filters (PASS or . indicate that the variant passed quality checks)
8INFOAdditional information encoded as a semicolon-separated series of short keys with optional values
9FORMAT A colon-separated field that specifies the data types and order for the following <SAMPLE> fields
10+ <SAMPLE> The genotype information for the sample specifed in the header line

The table below provides the number of SNPs that were found per contig:

Chromosome/ContigNumber of SNPs
122115
221824
340542
439876
521850
619540
Total SNPs: 165747

Interpretation: The table above indicates the number of SNPs discovered per Chromosome/Contig. These results indicate the unfiltered SNP marker sites detected by the Tassel v2 GBS pipeline . To generate high quality filtered SNPs you will need to perform traditional site and sample filtering. Depending on your scientific question and organism of interest you may consider following the GATK best practices for variant filtering. A few commonly used variant filtering programs you might consider using are the GUI based program Tassel or the command line programs bcftools, GATK, and vcftools.

# 5Sample Relationships

## 5.1 Principal Component Analysis

Principal Component Analysis (PCA)

Above we used the PCA method to visualize the genetic distance and relatedness of the organisms in the population you provided for GBS. To perform an accurate PCA analysis it is necessary to filter SNP sites, cull weak samples and and impute any missing data. We pre-filter you SNP data for: SNP sites with genotype information at >=60% of the samples in your population. Next, we cull any samples containing inadequate genotype information (<=80% of the pre-filtered SNP sites). Finally, we select only SNP sites containing sufficient genotype data in the population for imputation (i.e. sites with data for >=80% of samples). We then perform whole genome phasing and impute any missing data using the extensively tested Beagle algorithm on these high confidence SNP marker sites before generation of the PCA. During de novo SNP discovery, SNP phasing and imputation are impossible due to the lack of contiguous stretches of variants. Therefore, in de novo analyses, we accept only SNP sites which have data in the entire culled sample population. Information on the number of SNPs and filtering steps used to generate this plot can be found in the ./hapMap/PCA_analysis/Filter.log. Above we have utilized only the first three principal components (which often explain much of a GBS populations genetic variance) for visualization. If you would like further principal component data on your samples, principal components 1-5 calculated by the sklearn PCA function are present in the ./hapMap/PCA_analysis/PCA1.txt file.

Each point on the PCA represents an individual in your GBS population. If you hover over a point with the mouse, the chart will display the corresponding sample name and principal components (PC1,PC2,PC3 are x, y, and z respectively). Additionally, you can drag the 3D plot to better visualize relationships of the different clusters. Pictures from any angle can be saved by clicking the appropriate options in the upper right corner. Each point is colored based on the unsupervised clustering of the first three principal components. We have optimized the agglomerative clustering of the PCA components in your data by testing the clustering with up to 10 data clusters and picking the optimal number clusters such that: if another cluster was added, less than 10% of the samples would be present within. Although we have optimized the PCA to show the largest differences in the population occasionally the automated clustering method doesn't recognize patterns that are visible upon closer inspection. Additionally, if all samples in your population are very closely related, the clustering (defined by color) may have little meaning. The interpretation of the relationships of samples will require viewing the members of each cluster and their relationship in distance to nearby individuals within and outside their cluster. Finally, one of the strongest measures of the accuracy of the PCA is likely your own knowledge of the relationships of the organisms being sequenced. Using known outgroups within your GBS population will provide a reasonable measure of how much distance on the PCA is equivalent to a sizable genetic distance.

How does PCA work?

The PCA method is performed via eigenvalue decomposition of the data covariance or correlation matrix after a data normalization step. Effectively, PCA is a rotation of the axes of the original coordinate system to a new orthongonal axes (principal axes), such that the new axes coincide with the directions of maximum variation for the original observations. More generally, PCA is a mathmatical transformation on a large number of correlated variables into a small number of uncorrelated variables (principal components), i.e. a process of dimensionality reduction. The first principal component accounts for the largest amount of variation possible, the second accounts for less and so on and so forth until all the variance within the dataset is explained. For further information on PCA see the PCA wikipedia page.

## 5.2 Principal Component Variance

Principal Component Variance

This diagnostic graph displays the variance explained by each of the first 15 Principal Components (PCs 1-15) from Section 5.1. These metrics are valuable because they describe exactly how much variance in the dataset can be explained by each dimension in above 3D PCA. For example: if 75% of the genetic variance detected in your population can be explained by the first 3 principal components, you would have much more confidence in the relationships displayed in the PCA plot than if the first 3 PCs could only explain 25% of the genetic variance. Although no clear cutoffs exist, high levels of dataset variance (>50%) explained in the first 3 PCs inspire much greater confidence in the underlying relationships displayed than low levels of described variance.

## 5.3 Maximum Likelihood Phylogeny

This phylogenetic tree was assembled by converting VCF data into phylip format with the vcf2phylip program and performing Maximum-likelihood based phylogenetic inference with RAxML. Only SNP sites that had data present for >= 80% of the population were included in this analysis. The ete3 toolkit was used to plot the resulting newick tree file. Each node represents a genetic division in the population and the edge/branch length (noted on each branch) corresponds to the accumulated genetic distance. All data used to produce this phylogeny and the 400 dpi png file can be found in ./VCF/Phylogeny.

# 6Genome Imputation

Imputation of Genomic Sites:

In addition to the variants discovered by the pipeline, we are providing you with a fully phased and imputed variant call set (./VCF/Imputed_Sites/SNPs.mergedAll.MAFgt0.05.imputed.vcf.gz) that was generated via full genome imputation with the Beagle algorithm . Genotype imputation allows users to accurately fill in the gaps in genotype data inherent to GBS variant calling. Although imputation algorithms can very accurately phase and impute sites, the accuracy of the algorithm is highly dependent on population specific parameters including: genetic relatedness, the length of Identical By Descent (IBD) haplotype blocks, predicted recombination rates and the number of nearby markers considered during imputation. If the population is closely related the IBD haplotype blocks will be larger than if the population is diverse. Imputation algorithms move along each contig looking for sections of shared haplotype in the population and fill in the missing sites with either the variant identified to have a similar haplotype in a reference panel or the highest probability variant based on the population haplotype data and the algorithms statistical model. Before using the imputed markerset for any downstream analyses users are highly encouraged to:

1. Read the literature regarding best practices for genotype imputation in their organism of study.
2. Evaluate the parameters within ./VCF/Imputed_Sites/SNPs.mergedAll.imputed.log to ensure they conform to the proper standards.
3. Perform validation of the imputed SNP sites against a known reference panel.

# 7Interpreting Results

Analysis and interpretation of genomic data generated by sequencing technologies are among the most complex problems in genomic sciences. One consequence of this reality is that analysis and interpretation of sequencing data is far from routine. Analytic “best practices” remain very dynamic and reflect our current limited understanding of genomics including technical aspects of data collection.

The process the UW-BRC pipeline uses to analyze your GBS data can be stated as the following:

After VCFs are output from the pipeline, we perform filtering, imputation and generate population analyses (PCA and Phylogenetic Trees) to provide you with insight into the genetic relationships in your GBS population

## 7.1 Did it work?

Due to the many possible reasons a scientist might have for performing GBS we cannot a priori determine if your experiment was successful for your purposes. However, several metrics may be of interest to you. The first of which is the number of input reads (BarcodedReads) detected per sample. Although no clear rules exist, for a sizable eukaryotic genome (>1 Gb), samples which have in excess of 1x106 reads and low library duplication levels will tend to have good representation of the genotyped sites. Another metric of interest is the total number of SNPs called, and the chromosomes/contigs they are detected on. If you have less SNP sites than you might like on a specific chromosome/contig you should consider the following question. Was the area I wanted to cover reasonably well represented by the enzyme cut site I used in GBS? If not then trying a different enzyme or enzyme combination to cut your genome may yield better representation of that region. If the genome you are using had many contigs and you don't see them in the list, do not worry, for visualization purposes we have only included the top 30 contigs (all SNPs from the missing contigs are still present in your VCFs).

To evaluate how confident you can be in the results, you must go beyond the variant list itself, and evaluate additional aspects of the analyses and perform some basic checks.

# 8Credits

## 8.1 Bioinformatics Resource Center

Bioinformatics Resource Center
Genetics & Biotechnology Center, Room 2130
425 Henry Mall, Madison WI 53706
Email: brc@biotech.wisc.edu
Website: https://bioinformatics.biotech.wisc.edu

## 8.2 Software

Platform: x86_64-pc-linux-gnu (64-bit)

## 8.3 References

Browning, S., Browning, B. (2007). Rapid and accurate haplotype phasing and missing data inference for whole genome association studies by use of localized haplotype clustering. Am J Hum Genet, 81, 1084-1097.

Browning, B., Browning, S. (2016). Genotype imputation with millions of reference samples. Am J Hum Genet, 98, 116-126.

Catchen, J., Amores, A., Hohenlohe, P., Cresko, W., Postlethwait, J. (2011) Stacks: Building and Genotyping Loci De Novo From Short-Read Sequences. G3, 1(3),171-182.

Ewing, B., Hillier, L., Wendl, M. C., & Green, P. (1998). Base-Calling of Automated Sequencer Traces Using Phred. I. Accuracy Assessment. Genome Research, 8(3), 175–185. https://doi.org/10.1101/gr.8.3.175

Ewing, B., & Green, P. (1998). Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Research, 8(3), 186–194. https://doi.org/10.1101/gr.8.3.186

Glaubitz, C., Casstevens, T., Liu, F., Elshire, R., Sun, Q., & Buckler, E. (2014) TASSEL-GBS: A High Capacity Genotyping by Sequencing Analysis Pipeline. Plos One, 9(2):e90346. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0090346

Jiang, H., Lei, R., Ding, S. W., & Zhu, S. (2014). Skewer: A fast and accurate adapter trimmer for next-generation sequencing paired-end reads. BMC Bioinformatics, 15(1), 1–12. https://doi.org/10.1186/1471-2105-15-182

Langmead, B., & Salzberg, S. (2012). Fast gapped-read alignment with Bowtie 2. Nature Methods, 9 357–359

Li H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997.

Li, H., Handsaker, B., et al. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics (Oxford, England), 25 (16), 2078–9. https://doi.org/10.1093/bioinformatics/btp352