Skip to main content

Genome Assembly Validation and Improvement



In this workshop, participants will learn how to understand and validate a genome assembly. The participants will appreciate why genome assembly is often an iterative process, where you start with a draft and constantly improve using techniques such as polishing and scaffolding. Participants will also be introduced to gene annotation.

Introduction to Hi-C Data and Genome Scaffolding with Juicer, 3D-DNA, and Juicebox

  • What is Hi-C?
    • Hi-C is a chromosome conformation capture technique that measures the 3D spatial organization of genomes.
    • It captures physical proximity between different regions of the genome by crosslinking, digesting, and ligating DNA, followed by paired-end sequencing.
    • Hi-C reads represent pairs of DNA fragments that were close together in the nucleus, even if they are distant in the linear genome sequence.
    • Hi-C data can reveal:
      • Chromosome territories
      • Topologically associating domains (TADs)
      • Fine-scale contacts are valuable for genome assembly
  • Why Use Hi-C for Genome Scaffolding?
    • Proximity information from Hi-C contacts can be used to:
      • Order contigs along chromosomes
      • Orient contigs correctly
      • Detect misassemblies in initial genome drafts
    • Hi-C offers long-range linking information (>1 Mb) that is difficult to obtain from short-read sequencing alone.
  • How does juicer and 3D-DNA use Hi-C Data for Scaffolding
    • Juicer is a pipeline for processing raw Hi-C reads into a contact map.
      • Aligns Hi-C read pairs to the draft genome assembly (e.g., using BWA or other aligners).
      • Filters and deduplicates valid Hi-C contacts.
      • Generates .hic files — compressed, indexed contact maps.
      • Provides quality control statistics on the data (e.g., contact matrices at various resolutions).
    • 3D-DNA is a pipeline that uses Juicer outputs to scaffold and improve genome assemblies.
      • Takes the initial assembly and Hi-C contact map as input.
      • Identifies and breaks potential misassemblies based on inconsistent Hi-C signals.
      • Clusters contigs into chromosomes using contact frequencies.
      • Orders and orients contigs within each chromosome.
      • Outputs a new assembly with chromosome-length scaffolds.
      • Optionally, the assembly can be manually curated using Juicebox Assembly Tools to correct or refine scaffolding.

Getting started:

srun --reservation=wk2_workshop -A scinet_workshop2 -t 08:00:00 -p ceres -N1 -c8 --pty bash

mkdir -p /90daydata/shared/$USER/genome_assembly
cd /90daydata/shared/$USER/genome_assembly
cp -r /project/scinet_workshop2/Bioinformatics_series/wk2_workshop/day2/ .
cd day2

Run juicer pipeline

  • Setting up subfolders

    mkdir references
    mkdir fastq
    mkdir splits
    
  • The reference folder should contain the genome and its BWA-mem index

    cp ../../02_Files/Genome.fasta references/
    
  • Build Index

    module purge
    module load bwa
    time bwa index references/Genome.fasta
    
  • Make the chrome.sizes file

    module load samtools
    samtools faidx references/Genome.fasta
    cut -f 1,2 references/Genome.fasta.fai > chrom.sizes
    
  • Copy the fastq files to the fastq folder and explore
    Note: Juicer requires that the fastq files be named with this extension _R1.fastq and _R2.fastq. Any other naming scheme will fail, and the files must not be compressed.

    cp ../02_Files/*.fastq fastq
    wc -l fastq/*
    ../02_Files/new_Assemblathon.pl references/Genome.fasta
    
  • Change to splits folder and split the fastq files

    cd splits/
    split -a 3 -l 240000 -d --additional-suffix=_R1.fastq ../fastq/2MAtHiCDedup_R1.fastq&
    split -a 3 -l 240000 -d --additional-suffix=_R2.fastq ../fastq/2MAtHiCDedup_R2.fastq &
    
  • Run juicer

    cd /90daydata/shared/$USER/genome_assembly/day2/01_TestJuicer/
    module purge
    module load juicer
    export _JAVA_OPTIONS=-Djava.io.tmpdir=${TMPDIR}
    time JUICER juicer.sh -d $PWD -p chrom.sizes -s none -z references/Genome.fasta -t 8 --assembly
    

    Essential output files reside within the aligned folder.

    tree aligned/
    aligned/
    ├── header
    ├── inter.txt
    ├── inter_30.txt
    ├── inter_30_hists.m
    ├── inter_hists.m
    ├── merged1.txt
    ├── merged30.txt
    ├── merged_dedup.bam
    └── merged_nodups.txt
    
Juicer2 Output Files — Basic Definitions
File Definition Use

header

Text file with metadata about the Hi-C run (e.g., genome version, parameters).

Used internally by Juicer; small summary of the processing run.

merged1.txt

List of Hi-C read pairs at 1 bp resolution (raw, not normalized).

Can be used to build low-resolution contact maps.

merged30.txt

List of Hi-C read pairs at 30 bp resolution (binning shortens file size).

Used for making higher-level, binned contact maps more efficiently.

merged_dedup.bam

BAM file of aligned, deduplicated Hi-C reads (valid pairs only).

Useful for visualization in genome browsers like IGV, JBROWSE, etc.

inter.txt

Contact statistics between contigs/scaffolds (raw form).

Basic QC: shows how reads link different parts of the genome.

inter_hists.m

MATLAB script with histograms of Hi-C contact distributions.

Helps visualize contact decay with distance; used for QC.

inter_30.txt

Same as inter.txt, but based on reads binned at 30 bp resolution.

Faster/lighter version for QC and distance plotting.

inter_30_hists.m

MATLAB script plotting histograms for 30 bp binned contacts.

Visual QC of contact maps at lower resolution.

merged_nodups.txt

Full list of valid Hi-C contacts, deduplicated, in tab-separated text format.

Main input for 3D-DNA scaffolding and for building .hic files for Juicebox visualization.

Troubleshooting common problems

  • Deduplication is not finishing:

    If this occurs, it is likely due to a low complexity region that has too many reads and thus too much depth for deduplication and becomes a memory hog without any progress.

    You can create a blacklist for these regions in your genome. Typically you can find those regions by running a repeat finder on your genome, and then masking those regions so reads will not map there. Then before you run 3ddna you can swap your genome back to the unmasked version. Tandem repeats and low complexity repeats are usually the culprit, ribosomal DNA may cause problems as well.

  • My juicer script wont submit any jobs:

    Juicer has not been modified suitably to use your HPC system. If this is not the issue, then it is likely that you must make -q and -l match the names of your queue’s. We are using the CPU version of Juicer today, so we should not see any of these issues.

  • My alignments are not finishing:

    Create a larger number of split fastq files within splits/.

Run 3D-DNA pipeline

  • Set up folder

    module purge
    module load dna_3d
    module load parallel
    module load java
    mkdir 3D_DNA/
    cd 3D_DNA/
    
  • Softlink the two main input files for 3D-DNA

    ln -s ../aligned/merged_nodups.txt 
    ln -s ../references/Genome.fasta
    
  • Main 3D-DNA pipeline command

    time run-asm-pipeline.sh Genome.fasta merged_nodups.txt
    # approx 25 mins
    

Appendix: more about 3d_DNA

There are many ways to optimize 3D-dna. Those options can be seen with sh run-asm-pipeline.sh –help . In my experience I have found the manual scaffolding with Juicebox to be much easier when I use the initial scaffolding that 3D-dna produces: Genome.0.hic and Genome.0.assembly. So I typically use the default parameters.

Basic commands
ml dna_3d; bash run-asm-pipeline.sh -h 

3D de novo assembly: version 190716

USAGE: ./run-asm-pipeline.sh [options] path_to_input_fasta path_to_input_mnd

DESCRIPTION: This is a script to assemble draft assemblies (represented in input by draft fasta and deduplicated list of alignments of Hi-C reads to this fasta as produced by the Juicer pipeline) into chromosome-length scaffolds. The script will produce an output fasta file, a Hi-C map of the final assembly, and a few supplementary annotation files to help review the result in Juicebox.

ARGUMENTS:
  • path_to_input_fasta
    Specify file path to draft assembly FASTA file.

  • path_to_input_mnd
    Specify path to deduplicated list of alignments of Hi-C reads to the draft assembly FASTA, as produced by the Juicer pipeline (i.e., the merged_nodups file).


OPTIONS:
  • -m | --mode haploid/diploid
    Runs in specific mode, either haploid or diploid (default is haploid).

  • -i | --input input_size
    Specifies threshold input contig/scaffold size (default is 15000).
    Contigs/scaffolds smaller than input_size will be ignored.

  • -r | --rounds number_of_edit_rounds
    Specifies number of iterative rounds for misjoin correction (default is 2).

  • -s | --stage stage
    Fast forward to later assembly steps.
    Accepted values: polish, split, seal, merge, finalize.

  • -h | --help
    Shows this help message. Use --help for a full set of options.

for advanced parameters

ml dna_3d; bash run-asm-pipeline.sh –help

Different ways to run 3d-dna
  • Set the temp directory for java before you run 3D-dna

    _JAVA_OPTIONS=-Djava.io.tmpdir=${TMPDIR}
    
  • Default run of 3D-dna

    module load dna_3d;module load parallel;module java;bash run-asm-pipeline.sh Genome.fasta merged_nodups.txt 
    
  • 3D-dna run that will scaffold all contigs >1000bp
    Default skips anything smaller than has anything smaller than 15,000 skipped.

    module load dna_3d;module load parallel;module load java;bash run-asm-pipeline.sh -i 1000 Genome.fasta merged_nodups.txt 
    
  • 3D-dna run that allows a larger variation of repeat coverage

    module load dna_3d;module load parallel; module load java;bash run-asm-pipeline.sh --editor-repeat-coverage 20 Genome.fasta merged_nodups.txt 
    
  • 3D-dna run that trys to assemble unmapped reads with Lastz, if installed

    module load dna_3d;module load parallel;module load java; bash run-asm-pipeline.sh -m diploid Genome.fasta merged_nodups.txt
    
  • 3D-dna run that only uses reads with a mapping quality greater than 30

    module load dna_3d;module load parallel;module load java; bash run-asm-pipeline.sh -q 30 Genome.fasta merged_nodups.txt
    
  • 3D-dna run that increases stringency for misjoin detection for less fragmentation

    module load dna_3d;module load parallel;module load java; bash run-asm-pipeline.sh -q 30 --editor-coarse-stringency 90 --editor-repeat-coverage 20 --splitter-coarse-stringency 80 --splitter-coarse-resolution 250000 --editor-coarse-resolution 250000 --editor-fine-resolution 25000 --editor-coarse-region 500000 Genome.fasta merged_nodups.txt
    
  • 3D-dna run that was the best for this arabidopsis dataset

    module load dna_3d;module load parallel;module load java; bash run-asm-pipeline.sh --editor-coarse-stringency 90 --editor-coarse-resolution 125000 --editor-coarse-region  250000 --editor-repeat-coverage 30 Genome.fasta merged_nodups.txt
    

Running JuiceBox on Ceres Desktop

mkdir /90daydata/shared/$USER/genome_assembly/day2/03_JuiceBox
cd /90daydata/shared/$USER/genome_assembly/day2/03_JuiceBox

##Download Juicebox
wget https://github.com/aidenlab/JuiceboxGUI/releases/download/v3.1.4/juicebox.jar
module load java
java -Xms512m -Xmx2048m -jar juicebox.jar
module load dna_3d;module load java;module load parallel; bash run-asm-pipeline-post-review.sh --sort-output -s seal -i 100 -r Genome.0.review.assembly Genome.fasta merged_nodups.txt

Workshop Materials

Workshop materials and recordings available for USDA employees at the links below


    • Thursday, May 1, 2025, 1 – 5 PM ET
      • Instructor: Rick Masonbrink, Sivanandan Chudalayandi, and Satheesh Viswanathan
      • Prerequisites:
        • Familiarity with basic command-line concepts.
      • Workshop recording