This workshop will introduce participants to variant calling using two methods: Genome analysis toolkit (GATK) and Deep Variant. We will develop a complete workflow for calling variants from whole-genome data for multiple individuals.
On day 1 we will have a hands-on workshop taking attendees through the complete GATK germline variant calling pipeline using Arabidopsis thaliana as the dataset. The heavy computational steps (alignment, MarkDuplicates, BQSR, per-sample GVCF generation) are pre-computed by the instructor and provided as ready-to-use files. Attendees run HaplotypeCaller on one sample themselves, then perform joint genotyping on all 6 pre-built GVCFs. Toward the end we will introduce Deepvariant which will be then covered in detail on Day2.
Day 1: GATK germline variant calling pipeline using Arabidopsis thaliana
3 Hours | Single-Sample Hands-On Guide
by Sivanandan Chudalayandi
Tutorial Setup Instructions
Steps to prepare for the tutorial session:
-
Login to Ceres Open OnDemand. For more information on login procedures for web-based SCINet access, see the SCINet access user guide.
-
Open a command-line session by clicking on “Clusters” -> “Ceres Shell Access” on the top menu. This will open a new tab with a command-line session on Ceres’ login node.
-
Create a workshop working directory by running the following commands. Note: you do not have to edit the commands with your username as it will be determined by the $USER variable.
mkdir -p /90daydata/shared/$USER/variant_calling cd /90daydata/shared/$USER/variant_calling cp -r /project/scinet_workshop2/foundations_bioinf_2026/variant_calling/files/* . -
Launch VS Code:
- Under the Interactive Apps menu, select VS Code
- Specify the following input values on the page:
- Account: scinet_workshop2
- Queue: ceres
- QoS: 400thread
- Number of cores: 32
- Memory required: 50 G
- Number of hours: 6
- Optional Slurm Parameters:
--reservation=foundations_workshop - Working Directory:
/90daydata/shared/$USER/variant_calling
- Click Launch. The screen will update to the Interactive Sessions page. When your VS Code session is ready, the top card will update from Queued to Running and a Connect to VS Code button will appear. Click Connect to VS Code.
Overview
This guide walks through the complete GATK germline variant calling pipeline using Arabidopsis thaliana (TAIR10) and a single sample (SRR1945435) from the 1001 Genomes Project. All heavy pre-computation has been done by the instructor. You are encouraged to run the commands yourself during or after the workshop.
Dataset: Arabidopsis thaliana TAIR10, all 5 nuclear chromosomes
Sample: SRR1945435 (Col-0 accession, ~21x coverage)
Environment: GATK 4.x, bwa-mem2, samtools, bcftools
Directory Structure
playground/
├── 10_Reference
├── 11_Raw
|
├── 12_Align
├── 13_Markdup
├── 14_BQSR
├── 15a_GVCF
│ ├── SRR1945435.chr4a.g.vcf.gz
│ ├── SRR1945435.chr4a.g.vcf.gz.tbi
│ ├── SRR1945436.chr4a.g.vcf.gz
│ ├── SRR1945436.chr4a.g.vcf.gz.tbi
│ ├── SRR1945437.chr4a.g.vcf.gz
│ ├── SRR1945437.chr4a.g.vcf.gz.tbi
│ ├── SRR1945438.chr4a.g.vcf.gz
│ ├── SRR1945438.chr4a.g.vcf.gz.tbi
│ ├── SRR1945439.chr4a.g.vcf.gz
│ ├── SRR1945439.chr4a.g.vcf.gz.tbi
│ ├── SRR1945440.chr4a.g.vcf.gz
│ └── SRR1945440.chr4a.g.vcf.gz.tbi
├── 15_GVCF
│ ├── SRR1945435.chr4.g.vcf.gz
│ └── SRR1945435.chr4.g.vcf.gz.tbi
├── 16_Joint
├── known_sites
│ ├── 1001genomes_snps_indels_chr4.vcf.gz
│ └── 1001genomes_snps_indels_chr4.vcf.gz.tbi
#### General Structure:
workshop/
├── 00_Reference/ # TAIR10 reference genome and indexes
├── 01_Raw/ # Raw FASTQ files
├── 02_Align/ # Sorted BAMs
├── 03_Markdup/ # MarkDuplicates BAMs and metrics
├── 04_BQSR/ # Recalibrated BAMs and tables
├── 05_GVCF/ # Per-sample GVCFs
├── 06_Joint/ # Joint genotyping outputs
├── known_sites/ # 1001 Genomes known variants VCF
├── tmp/ # Temporary files
└── logs/ # Log files
Environment Setup
# Load modules (HPC)
module load bwa_mem2
module load samtools
module load gatk
module load bcftools
# Or activate conda environment
conda activate gatk_workshop
# Set working directory
WORKDIR=/path/to/workshop
cd ${WORKDIR}
# Key variables used throughout — set these once
REF=00_Reference/TAIR10.fna
SAMPLE=SRR1945435
KNOWN=known_sites/1001genomes_snps_pass.vcf.gz
CHRS=(NC_003070.9 NC_003071.7 NC_003074.8 NC_003075.7 NC_003076.8)
Reference Preparation, Alignment & BQSR
-
Reference Genome Indexing
Three separate indexes are required by different tools. Each is built once and reused for all samples.
samtools faidx — random access index
Required by GATK and samtools to jump to any genomic position without reading the whole file.
samtools faidx 00_Reference/TAIR10.fna # Output # 00_Reference/TAIR10.fna.faiGATK CreateSequenceDictionary — contig catalogue
Required by GATK and Picard. Records chromosome names and lengths.
gatk CreateSequenceDictionary \ -R 10_Reference/TAIR10.Chr4.fna \ -O 10_Reference/TAIR.Chr4.dict # Output # 00_Reference/TAIR10.dictbwa-mem2 index — alignment search structure
Required by bwa-mem2 for alignment. Different from the bwa index — index files must share the reference filename as prefix.
bwa-mem2 index 10_Reference/TAIR10.Chr4.fna # Output — all must be present # 10_Reference/TAIR10.Chr4.0123 # 10_Reference/TAIR10.Chr4.amb # 10_Reference/TAIR10.Chr4.ann # 10_Reference/TAIR10.Chr4.bwt.2bit.64 # 10_Reference/TAIR10.Chr4.pacVerify all index files
ls -lh 00_Reference/TAIR10.Chr4.fna* -
Known Sites VCF Preparation
BQSR requires a set of known variant sites to distinguish true variants from sequencing errors. We use the 1001 Genomes Project SNP VCF.
Arabidopsis caveat: No curated truth set like dbSNP exists for Arabidopsis. The 1001 Genomes VCF was itself called with GATK — there is a circularity here. It is acceptable in practice but something that is worth noting.
Check contig naming
TAIR10 uses NCBI accession names (
NC_003075.7). The 1001 Genomes VCF uses integers (1, 2, 3, 4, 5). These must match.# Reference contig names grep ">" 00_Reference/TAIR10.fna | head -5 # VCF contig names zcat known_sites/1001genomes_snp-short-indel_only_ACGTN.vcf.gz \ | grep -v "^#" | cut -f1 | sort -uCreate chromosome rename map
cat > known_sites/chr_rename.txt << 'EOF' 1 NC_003070.9 2 NC_003071.7 3 NC_003074.8 4 NC_003075.7 5 NC_003076.8 EOFFilter, rename, compress and index
bcftools view \ -v snps \ -f PASS \ --threads 8 \ known_sites/1001genomes_snp-short-indel_only_ACGTN.vcf.gz | \ bcftools annotate \ --rename-chrs known_sites/chr_rename.txt \ --threads 8 \ -Oz -o known_sites/1001genomes_snps_pass.vcf.gz tabix -p vcf known_sites/1001genomes_snps_pass.vcf.gzAlways Verify
tabix known_sites/1001genomes_snps_pass.vcf.gz NC_003075.7:1-100000 | head -5 -
Alignment
Align raw reads to the reference with read groups. Read groups are mandatory for GATK — without them HaplotypeCaller will refuse to run.
Tag Meaning Required ID
Read group identifier — must be unique
Yes
SM
Sample name — used by GATK to identify the individual
Yes
PL
Sequencing platform
Recommended
LB
Library — used by MarkDuplicates
Recommended
bwa alignment
bwa-mem2 mem -t8 \ -R '@RG\tID:SRR1945435\tSM:SRR1945435\tPL:ILLUMINA\tLB:lib1' \ 10_Reference/TAIR10.Chr4 \ 11_Raw/SRR1945435_1.fastq.gz 11_Raw/SRR1945435_2.fastq.gz \ | samtools sort -@8 -o 12_Align/SRR1945435.bam #Index the sorted BAM samtools index 12_Align/SRR1945435.bamVerify alignment
# Overall alignment statistics samtools flagstat 12_Align/SRR1945435.bam # Check read groups are present samtools view -H 12_Align/SRR1945435.bam | grep "^@RG" # Check mean coverage samtools depth -r NC_003075.7 12_Align/SRR1945435.bam | \ awk '{sum+=$3; n++} END {printf "Mean depth chr4: %.1fx\n", sum/n}' -
Mark Duplicates
PCR duplicates inflate variant counts and must be flagged before calling. Duplicates are marked, not removed — they remain in the BAM but are excluded by HaplotypeCaller.
gatk MarkDuplicates \ -I 12_Align/SRR1945435.bam \ -O 13_Markdup/SRR194543.markdup.bam \ -M 13_Markdup/SRR194543.metrics.txt # Index samtools index 13_Markdup/${SAMPLE}.markdup.bamInspect duplication metrics
cat 13_Markdup/SRR194543.metrics.txtKey metric:
PERCENT_DUPLICATION— values above 30% suggest over-amplification during library preparation. -
Base Quality Score Recalibration (BQSR)
Sequencers systematically mis-estimate base quality scores. BQSR corrects this by modelling errors against known variant sites.
Key concept: BQSR does not change the reads — it corrects the quality scores. Two steps: model the errors, then apply the correction.
Phred quality score for reference
Phred Score Error Rate Accuracy 1 error in... Q10
10%
90%
10 bases
Q20
1%
99%
100 bases
Q23
0.50%
99.50%
200 bases
Q28.7
0.135%
99.865%
740 bases
Q30
0.10%
99.90%
1,000 bases
Q40
0.01%
99.99%
10,000 bases
Step 1 — BaseRecalibrator (build the model)
gatk BaseRecalibrator \ -I 13_Markdup/SRR194543.markdup.bam \ -R 10_Reference/TAIR10.Chr4.fna \ --known-site known_sites/1001genomes_snps_indels_chr4.vcf.gz \ -O 14_BQSR/SRR1945435.recal.tableStep 1 takes about 30 mins
Step 2 — ApplyBQSR (apply the correction)
gatk ApplyBQSR \ -I 13_Markdup/SRR194543.markdup.bam \ -R 10_Reference/TAIR10.Chr4.fna \ --bqsr-recal-file 14_BQSR/SRR1945435.recal.table \ --tmp-dir tmp/ \ -O 14_BQSR/SRR1945435.recal.bam # index the recal bam file samtools index 14_BQSR/SRR1945435.recal.bamStep 1 takes about 30 mins
A Note: The code below runs BQSR (both step 1 and step2) in parallel across all 5 chromosomes — adapt it to your own samples by updating the paths at the top of the script.
# Run one job per chromosome for CHR in "${CHRS[@]}"; do gatk BaseRecalibrator \ -I 03_Markdup/${SAMPLE}.markdup.bam \ -R ${REF} \ --known-sites ${KNOWN} \ -L ${CHR} \ --tmp-dir tmp/ \ -O 04_BQSR/${SAMPLE}.recal.${CHR}.table \ > logs/${SAMPLE}.${CHR}.bqsr.log 2>&1 & done wait # Gather per-chromosome tables into one gatk GatherBQSRReports \ -I 04_BQSR/${SAMPLE}.recal.NC_003070.9.table \ -I 04_BQSR/${SAMPLE}.recal.NC_003071.7.table \ -I 04_BQSR/${SAMPLE}.recal.NC_003074.8.table \ -I 04_BQSR/${SAMPLE}.recal.NC_003075.7.table \ -I 04_BQSR/${SAMPLE}.recal.NC_003076.8.table \ -O 04_BQSR/${SAMPLE}.recal.tableStep 3 — AnalyzeCovariates (visualize before/after)
For plotting we need R in the path
module load r/4.4.1 R --vanilla -e "install.packages('gsalib', repos='http://cran.r-project.org')- First we need to run BaseRecalibrator on the recalibrated BAM (second pass)
gatk BaseRecalibrator \ -I 14_BQSR/SRR1945435.recal.bam \ -R 10_Reference/TAIR10.Chr4.fna \ --known-sites known_sites/1001genomes_snps_indels_chr4.vcf.gz \ -O 14_BQSR/SRR1945435.recal2.table- Generate before/after plot
gatk AnalyzeCovariates \ -before 14_BQSR/SRR1945435.recal.table \ -after 14_BQSR/SRR1945435.recal2.table \ -plots 14_BQSR/SRR1945435.AnalyzeCovariates.pdfOpen
SRR1945435.AnalyzeCovariates.pdf— the reported vs empirical quality plot should show points much closer to the diagonal after recalibration.
HaplotypeCaller & Per-Sample GVCF
-
HaplotypeCaller — Concepts
HaplotypeCaller’s core differentiator is local de novo assembly:
- Active region detection — identifies regions with elevated mismatch rate
- De Bruijn graph assembly — assembles reads into candidate haplotypes
- Smith-Waterman alignment — re-aligns reads to candidate haplotypes
- PairHMM likelihoods — computes genotype likelihoods for each haplotype pair
- Genotype assignment — assigns most likely genotype using Bayes’ theorem
-
GVCF Mode
GVCF mode (
-ERC GVCF) records evidence at every site — not just variant sites. This is essential for joint genotyping.Format Records Use case VCF
Variant sites only
Final output
GVCF
All sites including reference blocks
Joint genotyping input
A reference block (
<NON_REF>) confirms a site is homozygous reference — it is not missing data. This distinction is critical. -
Run HaplotypeCaller — Single Sample
This is the command you run yourself. Pre-built GVCFs are available as a fallback if needed.
Quick test — run on a 2 MB region of chromosome 4 only (~10-20 minutes)
gatk HaplotypeCaller \ -R 10_Reference/TAIR10.Chr4.fna \ -I 14_BQSR/SRR1945435.recal.bam \ -L NC_003075.7:1000000-3000000 -ERC GVCF \ --native-pair-hmm-threads 4 \ --tmp-dir tmp/ \ -O 15_GVCF/SRR1945435.chr4.g.vcf.gzInspect the GVCF
# Look at the header zcat 05_GVCF/SRR1945435.chr4.g.vcf.gz | grep "^#" | tail -10 # Look at variant records zcat 05_GVCF/SRR1945435.chr4.g.vcf.gz | grep -v "^#" | \ | grep -v "END=" | head # Look at a reference block zcat 05_GVCF/$SRR1945435.chr4.g.vcf.gz | grep -v "^#" | \ | grep "END=" | head # Count variant records zcat 05_GVCF/SRR1945435.chr4.g.vcf.gz | grep -v "^#" | \ | grep -v "END=" | wc -lUnderstand the GVCF fields
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE Chr4 100 . A T,<NON_REF> . . . GT:AD:DP:GQ:PL 0/1:15,12,0:27:99:350,0,480,395,516,911Field Meaning GT
Genotype — 0/0 ref, 0/1 het, 1/1 hom-alt
AD
Allele depth — reads supporting each allele
DP
Total depth at this site
GQ
Genotype quality — confidence in the GT call
PL
Phred-scaled likelihoods for each possible genotype
-
Run HaplotypeCaller — All Chromosomes
This is for all chromosomes; don’t run it during workshop Run one job per chromosome in parallel, then merge into a single GVCF.
# Run all chromosomes in parallel for CHR in "${CHRS[@]}"; do gatk HaplotypeCaller \ -R ${REF} \ -I 04_BQSR/${SAMPLE}.recal.bam \ -ERC GVCF \ -L ${CHR} \ --native-pair-hmm-threads 2 \ --tmp-dir tmp/ \ -O 05_GVCF/${SAMPLE}.${CHR}.g.vcf.gz \ > logs/${SAMPLE}.${CHR}.hc.log 2>&1 & done wait # Check variant counts per chromosome for CHR in "${CHRS[@]}"; do N=$(zcat 05_GVCF/${SAMPLE}.${CHR}.g.vcf.gz | \ grep -v "^#" | grep -v "<NON_REF>" | wc -l) echo "${CHR}: ${N} variant records" done -
Merge per-chromosome GVCFs into one
gatk MergeVcfs \ -I 05_GVCF/${SAMPLE}.NC_003070.9.g.vcf.gz \ -I 05_GVCF/${SAMPLE}.NC_003071.7.g.vcf.gz \ -I 05_GVCF/${SAMPLE}.NC_003074.8.g.vcf.gz \ -I 05_GVCF/${SAMPLE}.NC_003075.7.g.vcf.gz \ -I 05_GVCF/${SAMPLE}.NC_003076.8.g.vcf.gz \ -O 05_GVCF/${SAMPLE}.g.vcf.gz \ > logs/${SAMPLE}.merge.log 2>&1 # Verify zcat 05_GVCF/${SAMPLE}.g.vcf.gz | grep -v "^#" | \ grep -v "END" | wc -l
Joint Genotyping & Variant Filtering
-
Joint Genotyping — Concepts
We are better off with separate GVCFs:
- Scalability
Running HaplotypeCaller on 1000 BAMs simultaneously is impractical — memory, I/O, and runtime all scale badly. Separate GVCFs let you process one sample at a time on whatever hardware you have. - The N+1 benefit
With separate GVCFs you can add sample 1001 without touching the other 1000. With a joint GVCF you’d need to rerun HaplotypeCaller on all samples together. - Error isolation
If one sample fails you rerun only that sample. With a joint GVCF approach a single bad sample requires rerunning everything. - Parallelisation
Separate per-sample GVCFs are trivially parallelisable — one job per sample. A single joint HaplotypeCaller run cannot be split across nodes.
We will do this in a two-step process
6 × .g.vcf.gz → GenomicsDBImport → GenotypeGVCFs → cohort.vcf.gz - Scalability
-
GenomicsDBImport
Consolidates all 6 GVCFs into an efficient database. This step is pre-computed — it is slow and provided ready for you.
# Pre-built — shown for reference gatk GenomicsDBImport \ -V 05_GVCF/SRR1945435.g.vcf.gz \ -V 05_GVCF/SRR1945436.g.vcf.gz \ -V 05_GVCF/SRR1945437.g.vcf.gz \ -V 05_GVCF/SRR1945438.g.vcf.gz \ -V 05_GVCF/SRR1945439.g.vcf.gz \ -V 05_GVCF/SRR1945440.g.vcf.gz \ --genomicsdb-workspace-path 06_Joint/arabidopsis_db \ -L NC_003070.9 \ -L NC_003071.7 \ -L NC_003074.8 \ -L NC_003075.7 \ -L NC_003076.8 \ --tmp-dir tmp/ \ > logs/genomicsdb.log 2>&1 -
GenotypeGVCFs
This is the command you run yourself.
gatk GenomicsDBImport \ -V 15a_GVCF/SRR1945435.chr4a.g.vcf.gz \ -V 15a_GVCF/SRR1945436.chr4a.g.vcf.gz \ -V 15a_GVCF/SRR1945437.chr4a.g.vcf.gz \ -V 15a_GVCF/SRR1945438.chr4a.g.vcf.gz \ -V 15a_GVCF/SRR1945439.chr4a.g.vcf.gz \ -V 15a_GVCF/SRR1945440.chr4a.g.vcf.gz \ --genomicsdb-workspace-path 16_Joint/arabidopsis_db \ -L NC_003071.7 \ --tmp-dir tmp/gatk GenotypeGVCFs \ -R 10_Reference/TAIR10.Chr4.fna \ -V gendb://16_Joint/arabidopsis_db \ -O 16_Joint/cohort_raw.vcf.gz \ --tmp-dir tmp/Inspect the raw cohort VCF
# Summary statistics bcftools stats cohort_raw.vcf.gz | grep "^SN" # Total variant count zcat cohort_raw.vcf.gz | grep -v "^#" | wc -l # SNPs vs indels bcftools view -v snps cohort_raw.vcf.gz | grep -v "^#" | wc -l bcftools view -v indels cohort_raw.vcf.gz | grep -v "^#" | wc -l # Look at the multi-sample genotype columns zcat cohort_raw.vcf.gz | grep -v "^#" | head -5 | \ cut -f1-9,10-15Understanding the multi-sample VCF
CHROM POS REF ALT FORMAT SRR1945435 SRR1945436 SRR1945437 ... Chr4 500 A T GT:AD:GQ 0/1:10,8:99 1/1:0,20:99 0/0:15,0:99Each sample has its own genotype column — this is what joint genotyping produces. A
0/0call is a confirmed reference genotype, not missing data. -
Variant Filtering
Raw calls need filtering to remove false positives caused by strand bias, low depth, and mapping artifacts. With 6 samples we use hard filtering — VQSR requires ≥30 samples.
Key annotation thresholds for Arabidopsis
Annotation Meaning SNP filter Indel filter QD
Quality by depth — normalises QUAL by depth
< 2.0
< 2.0
FS
Fisher strand bias — phred-scaled p-value
60.0
200.0
MQ
Mapping quality of reads supporting variant
< 40.0
—
SOR
Strand odds ratio — alternative strand bias metric
3.0
10.0
MQRankSum
Mapping quality difference ref vs alt reads
< -12.5
—
ReadPosRankSum
Position of alt allele in reads
< -8.0
< -20.0
Filter SNPs
gatk VariantFiltration \ -R ${REF} \ -V 06_Joint/cohort_raw.vcf.gz \ --select-type-to-include SNP \ --filter-expression "QD < 2.0" --filter-name "QD2" \ --filter-expression "FS > 60.0" --filter-name "FS60" \ --filter-expression "MQ < 40.0" --filter-name "MQ40" \ --filter-expression "SOR > 3.0" --filter-name "SOR3" \ --filter-expression "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" \ --filter-expression "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" \ -O 06_Joint/cohort_snps_filtered.vcf.gz \ > logs/filter_snps.log 2>&1Filter indels
gatk VariantFiltration \ -R ${REF} \ -V 06_Joint/cohort_raw.vcf.gz \ --select-type-to-include INDEL \ --filter-expression "QD < 2.0" --filter-name "QD2" \ --filter-expression "FS > 200.0" --filter-name "FS200" \ --filter-expression "SOR > 10.0" --filter-name "SOR10" \ --filter-expression "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum-20" \ -O 06_Joint/cohort_indels_filtered.vcf.gz \ > logs/filter_indels.log 2>&1Merge filtered SNPs and indels
gatk MergeVcfs \ -I 06_Joint/cohort_snps_filtered.vcf.gz \ -I 06_Joint/cohort_indels_filtered.vcf.gz \ -O 06_Joint/cohort_filtered.vcf.gz \ > logs/merge_filtered.log 2>&1Compare raw vs filtered
# Raw counts echo "=== Raw ===" bcftools stats 06_Joint/cohort_raw.vcf.gz | grep "^SN" # PASS counts only echo "=== Filtered (PASS only) ===" bcftools view -f PASS 06_Joint/cohort_filtered.vcf.gz | \ bcftools stats | grep "^SN" # Per-sample PASS variant counts echo "=== Per-sample variant counts ===" bcftools query \ -f '[%SAMPLE\t%GT\n]' \ -s SRR1945435,SRR1945436,SRR1945437,SRR1945438,SRR1945439,SRR1945440 \ -i 'FILTER="PASS"' \ 06_Joint/cohort_filtered.vcf.gz | \ grep -v "0/0\|\./\." | \ awk '{count[$1]++} END {for(s in count) print s, count[s]}' | \ sort -
Inspect Results in IGV
# Extract PASS variants for IGV bcftools view -f PASS \ 06_Joint/cohort_filtered.vcf.gz \ -Oz -o 06_Joint/cohort_pass.vcf.gz tabix -p vcf 06_Joint/cohort_pass.vcf.gzLoad in IGV:
- Genome → Load from file →
00_Reference/TAIR10.fna - File → Load from file →
04_BQSR/SRR1945435.recal.bam - File → Load from file →
06_Joint/cohort_pass.vcf.gz
What to look for:
- A good SNP call: reads clearly split between two alleles, balanced strand representation, high mapping quality
- A filtered call: strand bias (all alt reads on one strand), soft-clipped reads at the variant site, low depth
- Genome → Load from file →
-
Transition/Transversion Ratio (Ti/Tv)
A useful quality metric for SNP calls. For Arabidopsis whole-genome data the expected Ti/Tv ratio is ~2.0–2.5. Values well below 2.0 suggest excess false positives.
bcftools stats -f PASS \ 06_Joint/cohort_filtered.vcf.gz | \ grep "Ts/Tv"
Summary — What You Have Built
Raw FASTQs
↓ bwa-mem2 (with read groups)
Sorted BAM
↓ gatk MarkDuplicates
Markdup BAM + metrics
↓ gatk BaseRecalibrator + ApplyBQSR
Recalibrated BAM + BQSR table
↓ gatk HaplotypeCaller -ERC GVCF
Per-sample GVCF (all sites)
↓ gatk GenomicsDBImport
Genomics database (6 samples)
↓ gatk GenotypeGVCFs
Raw cohort VCF
↓ gatk VariantFiltration
Final filtered VCF ✓
Quick Reference — Key Commands
# Check alignment
samtools flagstat sample.bam
# Check coverage
samtools depth -r NC_003075.7 sample.bam | awk '{s+=$3;n++} END{print s/n}'
# Count GVCF variants
zcat sample.g.vcf.gz | grep -v "^#" | grep -v "<NON_REF>" | wc -l
# Count VCF PASS variants
bcftools view -f PASS cohort_filtered.vcf.gz | grep -v "^#" | wc -l
# Check Ti/Tv ratio
bcftools stats -f PASS cohort_filtered.vcf.gz | grep "Ts/Tv"
# Inspect a region in terminal
bcftools view cohort_filtered.vcf.gz NC_003075.7:1-100000 | grep -v "^#" | head
Further Reading
- GATK Best Practices
- 1001 Genomes Project
- TAIR10 Reference
- Next workshop: DeepVariant — A different variant calling approach
Day 2: Deep Variant
Pre-workshop instructions
To help minimize technical issues and delays at the start of the workshop, please try the following tests prior to the workshop.
- Logging on to Atlas Open OnDemand (OOD): Please confirm you can successfully log in to Atlas OOD with your SCINet account (see instructions here). If you are successful, you will be able to see the Atlas OOD home page.
- Atlas Shell Access: When on Atlas OOD, click on the top navigation bar: “Clusters” > “Atlas Shell Access”. A new tab will appear that looks like a shell terminal (e.g., like PowerShell). Please confirm you do not receive any error messages or requests to re-authenticate and that the final line looks like “[firstname.lastname@atlas-login-1 ~]$”.
Tutorial Setup Instructions
Steps to prepare for the tutorial session:
-
Login to Atlas Open OnDemand at https://atlas-ood.hpc.msstate.edu/. For more information on login procedures for web-based SCINet access, see the SCINet access user guide.
-
Open a command-line session by clicking on “Clusters” -> “Atlas Shell Access” on the top menu. This will open a new tab with a command-line session on Atlas’ login node.
-
Request resources on a compute node by running the following command:
srun --reservation=foundations_workshop -A scinet_workshop1 -t 05:00:00 -n 1 --mem 8G --pty bash- For those accessing post-workshop: Remove
--reservation=foundations_workshopand change-A scinet_workshop1to your own project account. For additional information, see our SLURM guide.
- For those accessing post-workshop: Remove
-
Create a workshop working directory and copy the workshop materials into it by running the following commands. Note: you do not have to edit the commands with your username as it will be determined by the $USER variable.
mkdir -p /90daydata/shared/$USER/deepvariant
cd /90daydata/shared/$USER/deepvariant
cp /project/scinet_workshop1/deepvariant/Sample_Data/assembly.fasta .
mkdir PE_directory
cd PE_directory
cp -r /project/scinet_workshop1/deepvariant/Sample_Data/samplename_R*.fastq.gz .
cd ..
# Create a directory for trimmed reads
mkdir Trimmed
# Create a directory for mapped reads
mkdir Mapped
# Create a directory for variants
mkdir Variants
# Create a directory for intermediate files
mkdir Int
# Activate your conda environment
module load miniconda3
module load samtools
module load apptainer
source activate /project/scinet_workshop1/deepvariant/Software/condaenvs/deepvariant
-
Trimming
# Trim adapter artifacts from your reads trim_galore --paired \ --basename samplename \ --output_dir Trimmed \ --cores 24 \ PE_directory/samplename_R1.fastq.gz \ PE_directory/samplename_R2.fastq.gz -
Mapping
# Index your reference assembly bwa-mem2 index assembly.fasta # Index reference assembly using Samtools (for later) samtools faidx assembly.fasta >assembly.fasta.fai # For each of your trimmed and paired reads: bwa-mem2 mem –t 48 assembly.fasta \ Trimmed/samplename_val_1.fq.gz \ Trimmed/samplename_val_2.fq.gz | \ samblaster | \ samtools sort -@ 48 –o Mapped/samplename.bam -
Call variants
apptainer exec Software/sifs/deepvariant_1.6.0.sif \ python3 Software/deepvariant/scripts/run_deepvariant.py \ --num_shards=48 \ --model_type WGS \ --output_vcf Variants/samplename.vcf.gz \ --output_gvcf Variants/samplename.g.vcf.gz \ --ref assembly.fasta \ --reads Mapped/samplename.bam \ --sample_name samplename \ --intermediate_results_dir Int/samplename_int \ --make_examples_extra_args "normalize_reads=true” \\ --dry_run -
Joint Genotyping
#!/bin/bash #SBATCH --time=08:00:00 # walltime limit (HH:MM:SS) #SBATCH --nodes=1 # number of nodes #SBATCH --ntasks-per-node=48 # 20 processor core(s) per node X 2 threads per core #SBATCH --partition=atlas # standard node(s) #SBATCH --job-name="step4:joint_genotyping" #SBATCH --mail-user=$USER@usda.gov # email address #SBATCH --mail-type=BEGIN #SBATCH --mail-type=END #SBATCH --mail-type=FAIL #SBATCH --output="deepvariant-std-%j-%N.out" # job standard output file (%j replaced by job id) #SBATCH --error="deepvariant-std-%j-%N.err" # job standard error file (%j replaced by job id) #SBATCH --account=scinet_workshop1 module load miniconda3 module load bcftools module load htslib conda activate /project/scinet_workshop1/deepvariant/Software/condaenvs/glnexus # Scalable gVCF merging and joint variant calling glnexus_cli \ --threads 48 \ --config DeepVariant_WGS \ Variants/*.g.vcf.gz > Variants/cohort.bcf conda deactivate # Convert raw bcf results to vcf format bcftools convert -Oz -o Variants/cohort.vcf.gz Variants/cohort.bcf # Set GT (DP < 1 then GT = ./.) then re-calculate AF and fill AN,AC INFO tags # Only keep standard FORMAT fields: GT:DP:AD:GQ:PL bcftools +setGT Variants/cohort.bcf --threads 48 -Ob -- -t q -n . -e 'FMT/DP>=1' | \ bcftools +fill-tags --threads 48 - -Ob -- -t AF,AN,AC | \ bcftools annotate --threads 48 - -Ov -x FORMAT/RNC -o Variants/cohort.clean.vcf # End -
Variant Filtration
#!/bin/bash #SBATCH --time=08:00:00 # walltime limit (HH:MM:SS) #SBATCH --nodes=1 # number of nodes #SBATCH --ntasks-per-node=48 # 20 processor core(s) per node X 2 threads per core #SBATCH --partition=atlas # standard node(s) #SBATCH --job-name="step5:variant_filtration" #SBATCH --mail-user=$USER@usda.gov # email address #SBATCH --mail-type=BEGIN #SBATCH --mail-type=END #SBATCH --mail-type=FAIL #SBATCH --output="deepvariant-std-%j-%N.out" # job standard output file (%j replaced by job id) #SBATCH --error="deepvariant-std-%j-%N.err" # job standard error file (%j replaced by job id) #SBATCH --account=scinet_workshop1 module load htslib module load plink 2 # Many ways to filter variants... Here are a couple examples: # Example 1 - AWK chmod +x Scripts/filter_vcf.awk ./Scripts/filter_vcf.awk \ -v dphom=2 \ -v dphet=4 \ -v minqual=20 \ -v mindp=100 \ -v minhomn=1 \ -v minhomp=0.9 \ -v tol=0.2 \ -v minmaf=0.01 \ -v minpresent=0.35 \ Variants/cohort.clean.vcf > Variants/cohort.clean.filt.vcf # "filter_vcf" parameters: # dphom ... minimum read depth to accept a homozygous genotype call # dphet ... minimum read depth to accept a heterozygous genotype call # minqual ... minimum SNP quality (6th column of the VCF) # mindp ... SNPs with a total depth (parsed from DP subfield of the INFO field) will be discarded. # minhomn ... SNPs with fewer than minhomn homozygous calls for the REF and/or the ALT allele will be discarded. # minhomp ... SNPs with a fraction of heterozygous calls among all present genotye calls exceeding 1 - minhomp will be discarded. # tol ... Genotypes with DV/DP <= tol will be called 0/0; genotypes with DV/DP >= 0.5 - tol & DV/DP <= 0.5 + tol will be called 0/1; and genotypes with DV/DP >= 1 - tol will be called 1/1. # minmaf ... SNPs with a minor allele frequency below minmax will be discarded. # minpresent ... SNPs with a fraction of present data less than minpresent will be discarded. # Compress results bgzip --threads 48 Variants/cohort.clean.filt.vcf tabix -p vcf Variants/cohort.clean.filt.vcf.gz # Example 2 - Plink2 # Filter plink2 \ --vcf Variants/cohort.clean.vcf \ --geno 0.35 \ --maf 0.05 \ --vcf-min-qual 16 \ --min-alleles 2 \ --max-alleles 2 \ --vcf-half-call missing \ --allow-extra-chr \ --recode vcf \ --out Variants/cohort.clean.filt # Convert filtered .vcf to numeric [snps, inds] ".traw" file extension: plink2 \ --vcf Variants/cohort.clean.filt.vcf.gz \ --allow-extra-chr \ --recode A-transpose \ --out Variants/cohort.clean.filt.numeric # End -
Merge and Filter
module load miniconda3 module load jemalloc module load bcftools module load htslib conda activate /project/scinet_workshop1/deepvariant/Software/condaenvs/glnexus # Use GLNexus for joint calling .g.vcf samples: glnexus_cli --threads 48 --config DeepVariantWGS *.g.vcf.gz > cohort.bcf # Convert raw bcf results to vcf format: bcftools convert -Oz -o cohort.vcf.gz cohort.bcf # Fill tags and drop DP<=1 calls bcftools +setGT cohort.bcf --threads 46 -Ob -- -t q -n . -e 'FMT/DP>=1' | \ bcftools +fill-tags --threads 46 - -Ob -- -t AF,AN,AC | \ bcftools annotate --threads 46 - -Ov -x FORMAT/RNC -o cohort.clean.vcf # Filter plink2 --vcf cohort.clean.vcf --geno 0.5 --vcf-min-qual 20 --min-alleles 2 --max-alleles 2 --vcf-half-call missing --allow-extra-chr --recode vcf --out cohort.clean.diploid # Make numeric (0,1,2) plink2 --vcf cohort.clean.diploid.vcf --allow-extra-chr --recode A-transpose –out cohort.clean.diploid.Atranspose # See results head -n 20 cohort.clean.diploid.Atranspose.trawStop the interactive job on the compute node by running the command exit.