Generating whole-genome data from a single insect: PacBio HiFi genome assembly and HiC scaffolding pipelines for reference quality genomes using SCINet
By: Scott Geib | 01/10/2021 USDA-ARS, U.S. Pacific Basin Agricultural Research Center, Hilo, HI
The damage that insect pests cause to human health and agriculture is enormous. For instance, invasive insects cost a minimum of US$70.0 billion per year globally, while associated health costs can exceed US$6.9 billion per year. Studying pest insect genomes can help control these pests, because insect genomic data are useful resources for developing alternative and eco‐friendly pest control policies. Such analyses allow entomologists to discover the molecular diversity in insect populations that underlies pest population outbreaks. But many existing insect genome sequences are not of high quality, which means that critical functional data may be missing from them. In addition, until recently sequencing costs made it impractical to sequence large numbers of genomes, so the genomes of many insect pests have yet to be sequenced.
However, technologies that produce cheaper, higher-quality genome assemblies are emerging, including PacBio HiFi sequencing. These single-molecule reads are processed as circular consensus sequencing (CCS) reads, yielding relatively long reads (5-20 kb or more) with exceptionally high accuracy. This technology improves the quality and completeness of genome sequences, capturing more contiguous (i.e., not separated by gaps) DNA sequences than older sequencing technologies. By performing DNA extraction and library preparation in my lab in Hilo, HI, and sequencing in collaboration with the agency sites that operate PacBio Sequel II platforms (Stoneville, MS & Clay Center, NE), we generated HiFi data for a number of insect pests as part of the Ag100Pest sequencing initiative.
Many insect species are physically small and yield relatively little DNA, which can inhibit DNA sequencing. To overcome this, we used library preparation methods adapted for low (as low as ~250 ng) and ultralow (as low as 5 ng) DNA quantities, allowing us to generate whole-genome data from single insects. Despite the low amount of input DNA, we are still able to generate extremely contiguous assemblies, and the data have allowed us to test several different diploid genome assemblers developed for HiFi reads (HiCanu, HiFiASM, IPA).
Even for the relatively complex genomes of non-model insects (insects that are not closely related to well-studied species), we can build highly accurate assemblies. In some cases, we have obtained contiguous DNA sequences spanning entire chromosomes or chromosome arms, along with accurate resolution of repeat regions, including centromere and telomere components.
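Contiguity of an assembly is commonly summarized with the N50 statistic: the contig length at which contigs of that length or longer contain at least half of the assembled bases. The following is a minimal sketch of that calculation. The function name and the contig lengths are hypothetical values chosen for illustration, not results from the Ag100Pest assemblies.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    if not contig_lengths:
        raise ValueError("contig_lengths must not be empty")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths in Mb (illustrative only).
fragmented = [10] * 10          # ten equal contigs, 100 Mb total
near_chromosome = [60, 30, 10]  # three contigs, 100 Mb total

print("fragmented N50 (Mb):", n50(fragmented))
print("near-chromosome N50 (Mb):", n50(near_chromosome))
```

Both toy assemblies contain the same number of bases, but the second has a much larger N50 because most of its sequence sits in a few long contigs.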
Another great advantage of this strategy is the extremely short assembly time and the relatively low computational requirements. On a single node of the Ceres system with standard memory, a 500 Mb genome can be assembled in a few hours using the HiFiASM assembler. HiCanu and IPA required somewhat longer compute times, and HiCanu required multiple compute nodes, but all were significantly faster than previous assemblies with “standard” PacBio CLR data. Additionally, sequencing coverage as low as 20X was enough to yield these outcomes, although higher coverage brings some benefit: coverage up to 60X was shown to improve contiguity. By combining these assemblies with HiC data, we have been able to scaffold dozens of insect genomes to chromosome scale. In many cases, the HiFi assembly used as input to HiC scaffolding was already near chromosome scale, so the HiC data serve more to validate the correctness of the assembly than to join hundreds of contigs into chromosomes, as was needed with previous assembly technologies. Through these advances in genome sequencing and assembly methods, hundreds of reference quality assemblies can be performed on Ceres or Atlas each year to meet the needs of researchers across the agency.
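To put the coverage figures above in concrete terms, mean coverage is the total number of sequenced bases divided by the genome size. The sketch below estimates the HiFi yield needed for a 500 Mb genome at 20X and 60X. The mean read length of 15 kb is an assumption picked from within the 5-20 kb range quoted earlier, and the function names are illustrative.

```python
import math


def required_yield_bases(genome_size_bp, coverage):
    """Total sequenced bases needed to reach a target mean coverage."""
    return math.ceil(genome_size_bp * coverage)


def estimated_read_count(genome_size_bp, coverage, mean_read_length_bp):
    """Approximate number of reads needed, given a mean read length."""
    return math.ceil(genome_size_bp * coverage / mean_read_length_bp)


GENOME_SIZE_BP = 500_000_000   # 500 Mb genome, as in the text
MEAN_READ_LENGTH_BP = 15_000   # assumption: 15 kb mean HiFi read length

for coverage in (20, 60):
    bases = required_yield_bases(GENOME_SIZE_BP, coverage)
    reads = estimated_read_count(GENOME_SIZE_BP, coverage, MEAN_READ_LENGTH_BP)
    print(f"{coverage}X: {bases / 1e9:.0f} Gb of HiFi data, about {reads:,} reads")
```

Under these assumptions, 20X coverage corresponds to roughly 10 Gb of HiFi data and 60X to roughly 30 Gb.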
SCINet Note: This project utilized computational resources on the Ceres HPC. The Ag100Pest assembly team has developed pipelines for pre-assembly filtering, HiFi assembly, post-assembly scaffolding with HiC, and final assembly filtering (to remove microbial and mitochondrial components). With the resources available on Ceres, we can now go from raw data to assembly in less than a day, and through curation to a final genome in less than a week (the final assembly requires some manual review). This is revolutionary for non-model genomics, and allows projects to expand in scope into pan-genome studies, strain-level characterizations, and more.