Faster, Better Metagenome Analysis: One genome, one contig for metagenome samples sequenced with PacBio HiFi reads

By: Derek Bickhart | 04/15/2021
USDA-ARS, Research Geneticist (Animals), Madison, WI

All plant and animal species must interact with microbes to survive and thrive. Examples of these interactions include the symbiosis of nitrogen-fixing bacteria in plant roots or the presence of beneficial microbes in the gastrointestinal tracts of animals. Even slight changes in microbial systems can have substantial effects on animal and plant productivity.

The study of all microbial genomes in a system is called “metagenomics.” Many agriculturally relevant metagenomes are incredibly complex and can contain more unique DNA sequence than their plant or animal host. The abundance of any one microbial species can vary from sample to sample, making it hard to identify and characterize species that are present at very low levels. It is also difficult to resolve a “core” genome for each microbial species because the genome is constantly changing through the transfer of DNA from one microbial species to another, a phenomenon known as “horizontal gene transfer.” Despite these issues, recent advances in technology have made metagenome assembly and characterization much more practical.

Improved length and quality of DNA sequence “reads” provide the best resolution of complex microbial genomes. Recent improvements in a technology known as “circular consensus sequencing” (CCS), by the company PacBio, have resulted in reads that are longer than 5,000 bases and have error rates below 1% per base. Previous error rates were 15-17%. While this improvement may not seem substantial, the reliability of CCS reads allows us to confidently detect low-level microbial strains in a sample. Since each read is derived from a single DNA molecule, there are several interesting features of microbial genome structure that we can investigate using CCS technology. For example, we can detect, on a single-molecule basis, certain types of horizontal gene transfer events in which foreign DNA has integrated into a microbial genome. This can include the detection of viruses that infect specific strains of microbes by identifying specific cells where viral genetic information has been integrated into the host genome. All of these analyses are incredibly complex and require sifting through mountains of data, so researchers at ARS need high-performance computing to be able to identify these features within a reasonable timeframe.
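
To put the error rates above in perspective, the short Python sketch below estimates the expected number of erroneous bases in a single read. It is only an illustration: it assumes a read of exactly 5,000 bases (the article says reads are longer than this) and treats each quoted error rate as a simple per-base average. The name expected_errors is a hypothetical helper, not part of any PacBio software.

```python
# Illustrative arithmetic only; the read length and rates are taken from the
# figures quoted in the text, and the per-base-average model is an assumption.
READ_LENGTH = 5_000  # bases (the text says reads are longer than this)

ERROR_RATES = {
    "previous reads (low end)": 0.15,
    "previous reads (high end)": 0.17,
    "CCS reads (upper bound)": 0.01,
}


def expected_errors(read_length, error_rate):
    """Expected number of erroneous bases in one read of the given length."""
    return read_length * error_rate


for label, rate in ERROR_RATES.items():
    errors = expected_errors(READ_LENGTH, rate)
    print(f"{label}: about {errors:.0f} errors per {READ_LENGTH:,}-base read")
```

Under these assumptions, a 5,000-base read at the previous error rates would carry roughly 750 to 850 errors, while a CCS read would carry fewer than 50. That gap is what makes it practical to tell closely related microbial strains apart.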

SCINet Note: The Ceres cluster has been instrumental in our attempts to characterize agricultural metagenomes by providing the computational power necessary for each project. A complicated metagenome can require over 100 billion bases of DNA sequence to be generated in order to identify the large majority of the microbes in the system. These billions of bases of DNA are spread among hundreds of millions of DNA sequence reads, which are akin to puzzle pieces of their original microbial genomes. To recreate the original genome from these pieces, we use computer algorithms to compare them all against each other. Just as with a complex jigsaw puzzle, we can identify some pieces that naturally fit together and then stitch them into a larger portion of the puzzle. Imagine, though, that you have hundreds of millions of pieces and have no idea how the end result should look! We calculated that if we used just one CPU core, it would take more than a year for our algorithm to assemble the final metagenome from one of our samples. Using the distributed computing power of the Ceres cluster, we can reduce that time to several days, leaving us with more time to analyze the results for interesting biological information. Already, the methods we have developed have been used by the research community and have been recognized with a 2020 Federal Laboratory Consortium for Technology Transfer (FLC) Midwest Regional Award for outstanding research.
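
The jigsaw analogy above can be made concrete with a toy example. The Python sketch below is not the assembler used in this work. It is a minimal greedy overlap merge, assuming error-free reads from a single strand, and the function names and the min_overlap parameter are hypothetical. It compares every read against every other read, joins the pair whose ends overlap the most, and repeats.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that matches a prefix of `right`."""
    max_len = min(len(left), len(right))
    for length in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Toy assembler: repeatedly merge the pair of reads with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        # All-against-all comparison: the cost grows with the square of the read count.
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                length = overlap_length(left, right, min_overlap)
                if length > best_len:
                    best_len, best_i, best_j = length, i, j
        if best_len == 0:
            break  # no remaining pieces fit together
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Three made-up overlapping "reads" drawn from one short made-up sequence.
reads = ["ATGGCGTAC", "GCGTACGTT", "TACGTTAGC"]
print(greedy_assemble(reads))
```

Because every piece is compared against every other piece, the work grows roughly with the square of the number of reads. That scaling hints at why hundreds of millions of reads call for a cluster rather than a single core. Production assemblers use far more sophisticated methods and must also tolerate sequencing errors and repeated sequences.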