The Genome Analysis Toolkit (GATK)
The GATK is the industry standard for identifying single nucleotide polymorphisms (SNPs) and indels in germline DNA and RNAseq data. Its scope is now expanding to include somatic short variant calling, and to tackle copy number (CNV) and structural variation (SV). In addition to the variant callers themselves, the GATK also includes many utilities to perform related tasks such as processing and quality control of high-throughput sequencing data, and bundles the popular Picard toolkit.
gatk.broadinstitute.org/hc/e…
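To make the SNP/indel distinction concrete, here is a minimal Python sketch that labels a variant from its VCF REF and ALT alleles. The function name `classify_variant` and the example alleles are illustrative assumptions, not part of GATK, and the sketch only covers the simple cases (anything symbolic or same-length multi-base is labelled "other").

```python
def classify_variant(ref: str, alt: str) -> str:
    """Label one REF/ALT pair as SNP, insertion, deletion, or other."""
    # Symbolic alleles (e.g. <DEL>) and the spanning-deletion allele "*"
    # are outside the scope of this sketch.
    if alt.startswith("<") or alt == "*":
        return "other"
    if len(ref) == 1 and len(alt) == 1 and ref != alt:
        return "SNP"
    if len(ref) < len(alt):
        return "insertion"
    if len(ref) > len(alt):
        return "deletion"
    # Same length but not a single-base change, e.g. a multi-nucleotide variant.
    return "other"


# Hypothetical example alleles, chosen only for illustration.
examples = [("A", "G"), ("A", "ATT"), ("GCT", "G"), ("AC", "GT")]
for ref, alt in examples:
    print(ref, alt, classify_variant(ref, alt))
```

Insertions and deletions together are what the description above calls indels.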
Genome Analysis Toolkit (GATK) PathSeq
We present an updated implementation of the PathSeq pipeline (pmc.ncbi.nlm.nih.gov/article…) that makes substantial improvements on the original version. First, computational efficiency has been improved by incorporating faster computational approaches. Second, unlike the original version, Genome Analysis Toolkit (GATK) PathSeq permits users to configure the workflow for multiple use cases such as different library types (i.e. whole-genome and RNA sequencing), sample types (e.g. blood, tissue, sputum), or host species. Third, the tool suite is implemented in Java with the GATK engine (pmc.ncbi.nlm.nih.gov/article…) and the Apache Spark framework (usenix.org/legacy/event/hotc…), enabling parallelized data processing on local workstations, computing clusters, and Google Cloud computing services (cloud.google.com).
In summary, we have developed an adaptable and easily configurable pipeline for identification of microbial sequences in next-generation sequencing data. This tool allows for customized analyses of biological samples with substantially reduced computational time.
pmc.ncbi.nlm.nih.gov/article…
The Genome Analysis Toolkit 4 (GATK4)
GATK4 aims to bring together well-established tools from the GATK and Picard codebases under a streamlined framework, and to enable selected tools to be run in a massively parallel way on local clusters or in the cloud using Apache Spark.
github.com/broadinstitute/ga…
Picard
Picard is a set of command line tools for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF.
broadinstitute.github.io/pic…
Picard is implemented using the HTSJDK Java library to support accessing file formats that are commonly used for high-throughput sequencing data, such as SAM and VCF.
github.com/broadinstitute/pi…
HTSJDK
A Java API for high-throughput sequencing data (HTS) formats.
github.com/samtools/htsjdk
Apache Spark™
Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
spark.apache.org/
At the heart of the Genome Analysis Toolkit (GATK) is an industrial-strength infrastructure and engine that handle data access, conversion and traversal, as well as high-performance computing features. This includes parallelization using Apache Spark and optimized usage of cloud infrastructure. On top of that lives a rich ecosystem of specialized tools that you can use out of the box, individually or chained into scripted workflows, to perform anything from simple data diagnostics to complex reads-to-variants analyses.
Genome Analysis Toolkit v4.6.1.0 (GATK) Tool Set
gatk.broadinstitute.org/hc/e…
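As a rough illustration of tools "chained into scripted workflows", the Python sketch below assembles two GATK command lines in which the output of HaplotypeCaller (in GVCF mode) becomes the input of GenotypeGVCFs. The file names (`ref.fasta`, `sample1.bam`) and the helper names are hypothetical; the commands are only printed, not executed, and a production workflow would typically include further steps.

```python
import shlex


def haplotype_caller_cmd(reference: str, bam: str, gvcf: str) -> list[str]:
    # Per-sample calling in GVCF mode (-ERC GVCF).
    return ["gatk", "HaplotypeCaller",
            "-R", reference, "-I", bam, "-O", gvcf, "-ERC", "GVCF"]


def genotype_gvcfs_cmd(reference: str, gvcf: str, vcf: str) -> list[str]:
    # Genotyping step that consumes the GVCF produced above.
    return ["gatk", "GenotypeGVCFs", "-R", reference, "-V", gvcf, "-O", vcf]


def build_workflow(reference: str, bam: str, sample: str) -> list[list[str]]:
    """Return the two commands in order, wired together by file name."""
    gvcf = f"{sample}.g.vcf.gz"
    vcf = f"{sample}.vcf.gz"
    return [
        haplotype_caller_cmd(reference, bam, gvcf),
        genotype_gvcfs_cmd(reference, gvcf, vcf),
    ]


# Hypothetical inputs, for illustration only.
steps = build_workflow("ref.fasta", "sample1.bam", "sample1")
for cmd in steps:
    print(shlex.join(cmd))
    # To actually run a step: subprocess.run(cmd, check=True)
```

Building each command as a list keeps arguments unambiguous if the script is later switched from printing to `subprocess.run`.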
The goal of the Allele-Specific filtering workflow is to treat each allele separately in the annotation, recalibration and filtering phases.
gatk.broadinstitute.org/hc/e…
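The idea of treating each allele separately can be shown with a toy Python sketch. It assumes a multi-allelic site whose ALT alleles and a per-allele annotation (here `AS_QD`, one value per ALT allele) are given as comma-separated strings, and it applies a simple threshold to each allele on its own. The threshold, the example values, and the function name are assumptions for illustration; the real workflow relies on recalibration rather than a single hard cutoff.

```python
def per_allele_filter(alts: str, as_qd: str, min_qd: float = 2.0) -> dict[str, str]:
    """Decide PASS/FAIL for each ALT allele from its own annotation value."""
    alt_alleles = alts.split(",")
    values = as_qd.split(",")
    if len(alt_alleles) != len(values):
        raise ValueError("expected one annotation value per ALT allele")

    result = {}
    for allele, value in zip(alt_alleles, values):
        # "." marks a missing value in VCF; this sketch treats it as failing.
        passed = value != "." and float(value) >= min_qd
        result[allele] = "PASS" if passed else "FAIL"
    return result


# Hypothetical site with two ALT alleles and made-up annotation values.
print(per_allele_filter("G,T", "14.2,1.1"))
```

With a site-level annotation, a well-supported allele and a poorly supported one at the same position would share a single verdict; evaluating them individually lets one pass while the other is filtered.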
DRAGEN-GATK
Combining Illumina's hardware-accelerated data analysis platform with the Broad Institute's variant discovery pipelines.
gatk.broadinstitute.org/hc/e…
Under the hood, Illumina's DRAGEN (Dynamic Read Analysis for GENomics) uses FPGAs (Field-Programmable Gate Arrays) to deliver phenomenal speed-ups to their GATK-based germline short variant discovery pipeline. This reduces the end-to-end runtime from over 23 hours down to about 22 minutes on average for a single whole genome sample, starting from unmapped reads and delivering a GVCF and/or a filtered VCF.
gatk.broadinstitute.org/hc/e…
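For a sense of scale, the quoted runtimes can be turned into an approximate speed-up factor. The short Python calculation below uses exactly 23 hours as the baseline; since the description says "over 23 hours" and "about 22 minutes", the result is a rough lower bound rather than a measured figure.

```python
# Runtimes as quoted above, converted to minutes.
baseline_minutes = 23 * 60      # "over 23 hours", so this is a lower bound
accelerated_minutes = 22        # "about 22 minutes"

speedup = baseline_minutes / accelerated_minutes
saved_minutes = baseline_minutes - accelerated_minutes

print(f"Approximate speed-up: {speedup:.1f}x")
print(f"Time saved per sample: {saved_minutes // 60} h {saved_minutes % 60} min")
```

This works out to roughly a 60-fold reduction in end-to-end runtime per whole genome sample.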