Exploring the current paradigm of gene regulation

How much tissue-specific information is contained in enhancer sequences?

November 30, 2021

How do cells know when to activate a certain gene? This information is encoded in the sequence of the DNA, but our understanding of this code is incomplete. Researchers have now tested how much information can be extracted from sequence data to predict which gene is active in which tissue.

A good storyteller knows exactly which anecdotes will bring their stories’ characters to life. By telling the right story at the right time, our genome even manages to give rise to hundreds of different cell types, each with a characteristic life story that breathes an individual identity into every cell.

DNA snippets scattered across the genome harbor the code that directs the script of a cell’s life, successively switching genes on and off. Sequences called enhancers play an outstanding role in this process. They attract transcription factor proteins that start the expression of genes, thereby “enhancing” their activity. In some cases, they are located far away from the gene they activate.

Researchers Philipp Benner and Martin Vingron from the Max Planck Institute for Molecular Genetics (MPIMG) set out to decipher the instructions of the activation patterns in distinct cell types and embryonic tissues of the mouse.

With a series of statistical and bioinformatic analyses, the scientists identified several hundred tissue-specific DNA subsequences or “codewords” in enhancers that guide transcription factors, not only confirming sequences already known from other studies, but also identifying many new ones. The results have been published in several articles in NAR Genomics and Bioinformatics and the Journal of Computational Biology.

Training a model

“Today, researchers assume that all the information is in the DNA sequence, including information for specific cell types, tissues, and organs,” says Martin Vingron, Director at the MPIMG. According to the prevailing theory, transcription factor proteins recognize “codewords” in enhancers that are specific for a certain cell type, allowing the genome to tell a cell’s story by jumping to the right chapters. “We wanted to see how far this approach would take us and test its limits,” says Vingron.

The researchers developed a program that identifies the DNA sequences a cell recognizes in order to activate genes in a tissue-specific way. They achieved this by training a statistical model on existing experimental data that told it which enhancer is active in which tissue. Specifically, they used sequencing data from eight tissues of the embryonic mouse, including heart, lung, brain, and liver.
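The idea of training a model on labeled enhancers can be illustrated with a minimal sketch. It is not the researchers' program: it assumes a simple naive Bayes classifier over short subsequences (k-mers) with equal prior weight for every tissue, and the sequences, tissue labels, and function names are made up for illustration.

```python
from collections import Counter
import math


def kmers(seq, k=3):
    """Return all overlapping subsequences of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def train(labeled_seqs, k=3):
    """Count k-mers per tissue from (sequence, tissue) pairs."""
    counts = {}
    for seq, tissue in labeled_seqs:
        counts.setdefault(tissue, Counter()).update(kmers(seq, k))
    return counts


def classify(seq, counts, k=3):
    """Pick the tissue whose k-mer profile gives the highest log-likelihood."""
    vocab_size = 4 ** k  # number of possible DNA k-mers, used for smoothing
    best_tissue, best_score = None, -math.inf
    for tissue, c in counts.items():
        total = sum(c.values())
        # Add-one smoothing so unseen k-mers do not zero out the score.
        score = sum(
            math.log((c[km] + 1) / (total + vocab_size)) for km in kmers(seq, k)
        )
        if score > best_score:
            best_tissue, best_score = tissue, score
    return best_tissue


# Hypothetical toy training set, not real enhancer sequences.
training = [
    ("TTGATAAGCC", "heart"),
    ("ACGATAAGTT", "heart"),
    ("GGTGACCTTT", "liver"),
    ("CATGACCTAA", "liver"),
]
model = train(training)
print(classify("CCGATAAGAA", model))  # heart
print(classify("AATGACCTGG", model))  # liver
```

A real analysis would use thousands of enhancers per tissue and evaluate the model on sequences it has not seen during training.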

Learning to predict

By comparing sequence data between the tissues, the program learned to recognize sequence patterns in enhancers that are characteristic of certain tissues.

This told the researchers how much cell type-specific regulatory information is actually contained in the DNA sequence of enhancers, explains Philipp Benner, who is a postdoctoral researcher in Vingron’s lab: “The better our algorithm can classify any given enhancer, the more information it contains about the tissue or cell types that it is responsible for.”

The statistical classifiers can also identify DNA subsequences that might underlie cell type-specific gene activation. In fact, Benner found several hundred new codewords in addition to patterns that have been identified in other studies.
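One simple way to picture how candidate codewords could be read off sequence data is to rank k-mers by how strongly they are enriched in one tissue's enhancers compared with the rest. The sketch below assumes a smoothed log-odds score; this is an illustrative stand-in rather than the published method, and the sequences and function names are hypothetical.

```python
from collections import Counter
import math


def kmer_counts(seqs, k):
    """Count all overlapping k-mers across a list of sequences."""
    counts = Counter()
    for s in seqs:
        counts.update(s[i:i + k] for i in range(len(s) - k + 1))
    return counts


def enriched_kmers(target_seqs, background_seqs, k=4, top=3):
    """Rank k-mers by smoothed log2 odds of target versus background frequency."""
    t = kmer_counts(target_seqs, k)
    b = kmer_counts(background_seqs, k)
    t_total, b_total = sum(t.values()), sum(b.values())
    vocab_size = 4 ** k  # number of possible DNA k-mers, used for smoothing
    scores = {}
    for km in set(t) | set(b):
        p_target = (t[km] + 1) / (t_total + vocab_size)
        p_background = (b[km] + 1) / (b_total + vocab_size)
        scores[km] = math.log2(p_target / p_background)
    # Highest score first; ties are broken alphabetically for reproducibility.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


# Hypothetical toy sequences, not real enhancers.
heart = ["TTGATAAGCC", "ACGATAAGTT"]
other = ["GGTGACCTTT", "CATGACCTAA"]
for kmer, score in enriched_kmers(heart, other):
    print(kmer, round(score, 2))
```

On this toy input, the highest-ranked k-mers are the ones that occur in both heart sequences and in neither background sequence.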

“Overall, we established a strong and, most importantly, an interpretable model,” says Benner.

Reaching the limits

“With our advanced methods, the predictions are promising but far from perfect,” says Vingron. “Our results indicate that we might really have only a fragmentary understanding of the actual cell type-specific regulatory code.”

It is possible that not all of the required information is contained in the DNA sequence of enhancers and that some of it is distributed elsewhere in the genome. Some cross-references in the storybook of the genome might still hide in other regulatory sequences, like promoter regions that are in close proximity to the gene itself.

Parts of the project were funded by the Berlin Institute for the Foundations of Learning and Data (BIFOLD) of the German Federal Ministry of Education and Research (BMBF).
