One of the secrets to bacterial survival is a set of genes packaged together in a small genetic construct called a plasmid. When an environment abruptly changes, a bacterial population can’t wait for the glacial pace of evolution to save itself. Instead, it often borrows the genes it needs to fight back against antibiotic stress or host inflammation from plasmids, non-chromosomal and usually circular DNA molecules that can be quickly shared between organisms. 

But plasmids are small, even relative to bacteria, and their genes are often hard to distinguish from other genes that commonly occur in bacterial chromosomes. Though certain plasmids have been captured and repurposed as laboratory tools, their role in nature and human health remains largely unknown. With a CDAC Data Science Discovery seed grant, University of Chicago and Toyota Technological Institute at Chicago (TTIC) researchers will use machine learning to capture the diversity of naturally occurring, often elusive plasmids.

As researchers strive to understand the microbiome, the rich population of microorganisms associated with humans and the natural environment, plasmids have been a persistent blind spot. Common methods to access microbial genomes, such as cultivation and genome-resolved metagenomics, typically miss these tiny structures, leaving the story of how bacteria rapidly adapt to and thereby affect their environment incomplete.

“How we are going to bridge this gap is not so clear; what is clear is that we have not been studying plasmids that occur in naturally occurring microbial communities very effectively,” said A. Murat Eren (Meren), assistant professor of medicine at UChicago Medicine and a Fellow of the Marine Biological Laboratory. “But we want to focus on plasmids because they often promote conditional fitness as microbial habitats go through phases of change, and most questions we are interested in today require us to understand how microbes are able to maintain their presence in harsh environments, such as the gut environments of humans who suffer from inflammatory bowel disease.”

To erase this blind spot, Meren and collaborator Michael Yu of TTIC will turn to machine learning. With data from known plasmids and available metagenomic samples, the team will build a new computational model capable of detecting plasmids and their genes, adding fresh value to existing data and creating a powerful new tool for exploring the role of plasmids in health and the environment.

Metagenomics takes attendance from a microbial ecosystem by sequencing all the DNA from a sample, then using these data to identify which microorganisms are present, along with their gene content and relative abundances. However, these calculations can be incomplete and inaccurate, as many metagenomics approaches do not sequence genomes in their whole form but rather in small DNA fragments that must be re-assembled like a big puzzle, a hard task even for state-of-the-art algorithms.
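
To make "relative abundance" concrete, here is a minimal sketch of one common way to estimate it: count the sequencing reads assigned to each genome, divide by genome length so that larger genomes are not over-counted, and normalize so the values sum to one. The function name, the genome labels, and the numbers are all hypothetical illustrations, and length normalization is an assumption of this sketch, not a description of the method the researchers use.

```python
def relative_abundance(read_counts, genome_lengths):
    """Estimate length-normalized relative abundance for each genome.

    read_counts: dict mapping genome name -> number of reads assigned to it
    genome_lengths: dict mapping genome name -> genome length in base pairs
    (Both inputs are hypothetical; real pipelines derive them from read mapping.)
    """
    # Reads per base pair: a rough proxy for how many copies of each genome
    # were present in the sample.
    coverage = {
        name: read_counts[name] / genome_lengths[name] for name in read_counts
    }
    total = sum(coverage.values())
    # Normalize so the abundances across all genomes sum to 1.
    return {name: cov / total for name, cov in coverage.items()}


# Made-up example: genome_A receives three times as many reads as genome_B,
# but it is also three times longer, so the two come out equally abundant.
example = relative_abundance(
    {"genome_A": 300, "genome_B": 100},
    {"genome_A": 3_000_000, "genome_B": 1_000_000},
)
```

The example illustrates why raw read counts alone can mislead: without the length correction, genome_A would appear three times as abundant as genome_B.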

On top of that, plasmids are even more challenging to identify than bacterial chromosomes because they are shared among multiple microbial populations and may be present only transiently. In a sense, finding plasmids is the kind of classic “needle in the haystack” problem that machine learning excels at, except scientists aren’t totally sure yet what distinguishes the needle from the surrounding hay.

“We don’t even know how to recognize plasmids, to be honest,” Yu said. “By examining catalogs of known plasmids, we want to identify certain genetic architectures or signatures that might be indicative of a plasmid genome rather than the main bacterial genome. There’s a great classification problem there.”

Once trained, this classifier could be used to reanalyze the thousands of available metagenomic datasets (collected from healthy and unhealthy humans, marine and terrestrial habitats, and many other natural environments) to discover and measure the abundance of previously hidden plasmids. The rules used by the classifier to sniff out those plasmids could themselves hold new biological knowledge, as the model “learns” particular genes or combinations of genes that signify a plasmid. More broadly, the results could influence how researchers think of the genetic units that make plasmids plasmids, and the microbiome as a whole.
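
The idea of a model that "learns" which genes signify a plasmid can be illustrated with a deliberately simple sketch. It is not the team's model, whose design the article does not describe. It assumes each sequence has already been reduced to a set of gene-family labels, and it scores a new sequence by summing per-gene log-odds learned from labeled examples. The function names, gene labels, and toy training data are all invented for illustration.

```python
import math
from collections import Counter


def train_gene_log_odds(plasmids, chromosomes, pseudocount=1.0):
    """Learn, for each gene label, the log-odds of seeing it in a plasmid
    versus a chromosome. Each argument is a list of sets of gene labels."""
    plasmid_counts = Counter(g for genes in plasmids for g in genes)
    chrom_counts = Counter(g for genes in chromosomes for g in genes)
    log_odds = {}
    for gene in set(plasmid_counts) | set(chrom_counts):
        # Smoothed fraction of sequences in each class that carry the gene;
        # the pseudocount avoids dividing by zero or taking log(0).
        p = (plasmid_counts[gene] + pseudocount) / (len(plasmids) + 2 * pseudocount)
        c = (chrom_counts[gene] + pseudocount) / (len(chromosomes) + 2 * pseudocount)
        log_odds[gene] = math.log(p / c)
    return log_odds


def plasmid_score(genes, log_odds):
    """Sum the learned log-odds; genes never seen in training contribute 0.
    A positive score leans plasmid, a negative score leans chromosome."""
    return sum(log_odds.get(g, 0.0) for g in genes)


# Toy, made-up training data (hypothetical gene-family labels).
known_plasmids = [
    {"rep_initiator", "relaxase", "beta_lactamase"},
    {"rep_initiator", "relaxase"},
    {"rep_initiator", "beta_lactamase"},
]
known_chromosomes = [
    {"ribosomal_protein", "dna_gyrase"},
    {"ribosomal_protein", "dna_gyrase", "beta_lactamase"},
    {"ribosomal_protein"},
]

weights = train_gene_log_odds(known_plasmids, known_chromosomes)
candidate_score = plasmid_score({"rep_initiator", "relaxase"}, weights)
```

In this toy setup the learned weights are themselves readable: a gene with a large positive weight is one the model treats as a plasmid signature, which is the sense in which a classifier's rules could carry biological information. A real classifier would need far richer features and would have to cope with genes that appear on both plasmids and chromosomes.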

“Much of the field describes microbiomes just by which species or taxonomic groups are present, and to what abundance in each of these different samples,” Yu said. “But with a better understanding of plasmids and other accessory mechanisms, such as mobile elements or phages, the barriers between species or taxonomic groups start to break down. We want a more fluid set of biological features that can encapsulate the very dynamic, promiscuous nature of the genetic content inside of a metagenome.”

That new perspective could unlock new health applications of microbiome data as well. Despite decades of research, Meren said, current microbiome assays don’t work well as diagnostic tools for many medical conditions, including those that are associated with inflammation. Determining more accurately which genes plasmids carry in complex habitats, how those genes fluctuate, and how they affect bacterial survival in different environments could help researchers design more sensitive and clinically relevant diagnostic targets.

“If various genes encoded by mobile genetic elements such as plasmids to survive particular stresses are very abundant in your gut, that could indicate that there’s a systemic problem,” Meren said. “Or their absence could tell us that the need for those genes decreased, which may be an indicator of decreasing stress levels. So there could be immediate opportunities to use these insights as diagnostic tools or markers to keep an eye on the progress of disease, which we so sorely need.”