Research
Our projects
Epigenetic gene regulation
Epigenetic data such as histone modifications (HMs) help predict enhancers, as we demonstrated in earlier work, including our CRUP algorithm, which predicts enhancers from HM ChIP-seq data. In a typical use case, a lab that needs to locate enhancers in a new cell type only has to generate three HM ChIP-seq datasets for the algorithm to produce a reliable enhancer prediction.
We subsequently approached the problem of predicting enhancer–promoter interactions in the same way. Can these interactions be predicted from only a small set of experimental data that a lab would need to produce for a new cell type? Our new CENTRE algorithm does exactly this. It combines publicly available DNA data with a relatively small set of cell-type-specific data. This set includes the same three HMs used by CRUP, plus RNA-seq data to describe gene activity. With this information, our algorithm predicts interacting enhancer–promoter pairs as well as, or better than, leading methods that rely on a wide array of input data and are therefore costlier to apply to a new cell type [Rapakoulia et al., Bioinformatics, 2023].
DNA accessibility, as measured, for example, by ATAC-seq, is another highly informative feature regarding the regulatory potential of a region. In fact, cell types are characterized equally by their DNA accessibility patterns and their gene expression profiles. While single cells can now be classified into their cell type using scRNA-seq, it remains challenging to determine cell type from scATAC-seq profiles. We developed scATACat, an algorithm that infers a cell’s type from scATAC-seq data using reference cell types defined by their typical DNA accessibility profiles [Altay et al., NAR Genomics and Bioinformatics, 2024]. We see this as a step toward identifying accessible regions that define cell-type identity and in which we expect to find regulatory signals driving specificity. Along these lines, we also collaborated with Prof. Petra Knaus (Freie Universität) to analyze accessibility profiles of cells under shear stress.
In ongoing work, we are integrating accessibility into enhancer prediction. This has led us to examine in detail the characteristics of enhancers close to promoters (“proximal enhancers”). Furthermore, we are developing machine learning algorithms that use epigenetic data together with transcription factor binding motifs to improve predictions of cell-type-specific transcription factor binding.
Single-cell data analysis: clustering, visualization, batch correction
Motivated by collaborations involving single-cell transcriptomics, we developed methods to address problems that we found had received too little attention in the literature. In 2022, we published the concept of Association Plots [Gralinska et al., J. Mol. Biol., 2022; J. Royal Statistical Society Series C, 2023], which comprehensively visualize the genes associated with a particular cluster of cells. Genes are represented by dots: the further to the right a gene lies, the stronger its association with the cluster. These plots are based on the geometry of correspondence analysis, in which cell clusters and their respective marker genes lie along an axis emanating from the origin of the coordinate system.
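As a rough illustration of this geometry, the sketch below places a gene relative to a cluster axis through the origin. The horizontal coordinate is the length of the gene vector's projection onto the axis, so genes pointing along the cluster direction land further to the right. Using the distance from the axis as the vertical coordinate is an assumption made for illustration; the function name and the toy vectors are hypothetical, and the input coordinates would in practice come from a correspondence analysis that is not shown here.

```python
import math
from typing import Sequence, Tuple


def association_coordinates(
    gene: Sequence[float], cluster_axis: Sequence[float]
) -> Tuple[float, float]:
    """Return (x, y) for one gene relative to a cluster axis through the origin.

    x: signed length of the projection of the gene vector onto the axis.
    y: distance of the gene from the axis (an assumed choice for this sketch).
    """
    if len(gene) != len(cluster_axis):
        raise ValueError("gene and cluster_axis must have the same dimension")
    axis_norm = math.sqrt(sum(c * c for c in cluster_axis))
    if axis_norm == 0.0:
        raise ValueError("cluster_axis must not be the zero vector")
    x = sum(g * c for g, c in zip(gene, cluster_axis)) / axis_norm
    squared_length = sum(g * g for g in gene)
    # Guard against tiny negative values caused by floating-point rounding.
    y = math.sqrt(max(squared_length - x * x, 0.0))
    return x, y


if __name__ == "__main__":
    # Hypothetical toy coordinates, not real correspondence-analysis output.
    axis = [1.0, 1.0, 0.0]
    genes = {"marker": [2.0, 2.0, 0.1], "unrelated": [0.3, -0.3, 1.5]}
    for name, vector in genes.items():
        x, y = association_coordinates(vector, axis)
        print(f"{name}: x={x:.2f}, y={y:.2f}")
```

In this toy example the "marker" gene lies almost on the cluster axis and receives a large x, while the "unrelated" gene is orthogonal to the axis and stays near x = 0.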
Building on this idea, we developed CAbiNet, a tool that co-clusters cells and genes and presents the results so that the marker genes for a given cluster are visible within that cluster. Figure 2 below shows an example of what we call a biMAP: a UMAP overlay colored by cell cluster, with marker genes embedded within the corresponding clusters [Zhao, Kohl, et al., NAR, 2023]. In the interactive version, users can mouse over the dots embedded in a cluster to see the corresponding gene name.
For each cluster, we can now also generate an Association Plot, allowing us to visualize both the quality of the cluster and how strongly its marker genes characterize it. In ongoing work, we are developing an alternative algorithm that visually conveys the decisions made during clustering. This approach largely overcomes the curse of dimensionality by making high-dimensional geometry intuitively understandable.
Protein–protein interaction and intrinsically disordered regions
Motivated by discussions – and later collaboration – with Denes Hnisz’s group, we investigated whether the sequences of intrinsically disordered regions (IDRs) can inform us about which proteins physically interact. Predicting protein–protein interactions from sequence alone has been a longstanding challenge in bioinformatics. We set out to develop a machine learning approach aimed at predicting interactions based specifically on the IDRs of the participating proteins [Kibar et al., Proteins, 2023].
During the development of this algorithm, we learned a few unexpected lessons. First, when comparing predictions based on entire protein sequences with those based only on the IDRs, we found that the IDR-based predictions were as good as, or even better than, those using full sequences. This supports the view that IDRs play a key role in protein interactions. Second, during evaluation, we realized that the problem itself is generally ill-defined in the literature. To address this, we proposed two distinct problem formulations: the “symmetric” and “asymmetric” cases. In the symmetric case, neither of the two sequences appears in the training set. In the asymmetric case, one of the two sequences does. As one anonymous reviewer noted: “This is an interesting and important study that adds significantly to the field … the authors are congratulated for their study and for the clear demarcation between the asymmetric and symmetric problems.” These new definitions help explain discrepancies in reported method performance and now allow the two cases to be addressed more systematically.
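The two formulations can be made concrete with a small sketch of how an evaluation set might be partitioned. This is an illustration under stated assumptions, not the procedure used in the paper: the function name, the idea of choosing a set of held-out proteins up front, and the protein identifiers are all hypothetical. A test pair is labeled by how many of its two proteins occur anywhere in the training pairs.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]


def split_pairs(pairs: List[Pair], held_out: Set[str]) -> Dict[str, List[Pair]]:
    """Split protein pairs into training, asymmetric-test and symmetric-test sets.

    Pairs that involve no held-out protein form the training set. Every other
    pair is a test pair and is labeled by how many of its proteins are seen
    in the training pairs: exactly one -> asymmetric, none -> symmetric.
    """
    train: List[Pair] = []
    test: List[Pair] = []
    for a, b in pairs:
        if a in held_out or b in held_out:
            test.append((a, b))
        else:
            train.append((a, b))

    # Proteins that actually occur in at least one training pair.
    seen = {protein for pair in train for protein in pair}

    asymmetric = [pair for pair in test if sum(p in seen for p in pair) == 1]
    symmetric = [pair for pair in test if sum(p in seen for p in pair) == 0]
    return {"train": train, "asymmetric": asymmetric, "symmetric": symmetric}


if __name__ == "__main__":
    # Hypothetical identifiers for illustration only.
    pairs = [("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P4", "P5"), ("P1", "P5")]
    groups = split_pairs(pairs, held_out={"P4", "P5"})
    for name, members in groups.items():
        print(name, members)
```

Labeling by membership in the actual training pairs, rather than by the held-out set alone, matters when a protein that was not held out happens to occur only in test pairs: such a protein is still unseen by the model, so its pairs fall into the symmetric case.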
Building on our shared interest with the Hnisz lab in IDRs, we also developed a novel algorithm to predict sequences that may constitute IDRs. Starting from the observation that aromatic side chains in IDRs often follow a near-periodic spacing, we modeled this feature by measuring the “non-randomness” of that spacing. Specifically, we collected the distances between consecutive aromatic residues and compared their distribution to a random expectation in which aromatic residues occur independently along the sequence, the discrete analogue of a Poisson process. Under this null model, the distances follow a geometric distribution. The deviation between the observed and expected distributions is quantified using a Kolmogorov–Smirnov test. This simple approach works surprisingly well and has allowed us to rapidly screen large protein sequence datasets for potential IDRs. For the full study, see [Naderi et al., Nature Cell Biology, 2024].
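The spacing comparison described above can be sketched in a few lines. The code below is a minimal illustration, not the published implementation: it assumes that the aromatic residues are F, W and Y, that the per-position probability p is estimated as the aromatic fraction of the sequence, and that only the Kolmogorov–Smirnov distance (not a p-value) is needed. All function names are hypothetical. Because the geometric distribution is discrete and p is estimated from the same sequence, standard Kolmogorov–Smirnov p-value tables would be approximate at best, which is why the sketch stops at the statistic.

```python
from collections import Counter
from typing import List

# Assumption for this sketch: Phe, Trp and Tyr count as aromatic.
AROMATIC = set("FWY")


def aromatic_gaps(seq: str) -> List[int]:
    """Distances between consecutive aromatic residues in a protein sequence."""
    positions = [i for i, aa in enumerate(seq.upper()) if aa in AROMATIC]
    return [b - a for a, b in zip(positions, positions[1:])]


def ks_statistic_vs_geometric(gaps: List[int], p: float) -> float:
    """Largest absolute difference between the empirical CDF of the gaps and
    the geometric CDF F(k) = 1 - (1 - p)**k for k = 1, 2, ...
    """
    if not gaps:
        raise ValueError("need at least one gap")
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    counts = Counter(gaps)
    n = len(gaps)
    cumulative = 0
    distance = 0.0
    # Both CDFs are step functions that jump only at integers, so checking
    # the integers up to the largest observed gap is sufficient.
    for k in range(1, max(gaps) + 1):
        cumulative += counts.get(k, 0)
        expected = 1.0 - (1.0 - p) ** k
        distance = max(distance, abs(cumulative / n - expected))
    return distance


if __name__ == "__main__":
    # Hypothetical, perfectly periodic toy sequence: one aromatic every 6 residues.
    seq = "FGSGSG" * 10
    gaps = aromatic_gaps(seq)
    p = sum(aa in AROMATIC for aa in seq) / len(seq)
    print(f"gaps={gaps}, p={p:.3f}, D={ks_statistic_vs_geometric(gaps, p):.3f}")
```

A strictly periodic spacing concentrates all gaps on a single value and therefore yields a large distance from the geometric expectation, whereas independently placed aromatic residues should yield a small one.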
The role of PHF13 in chromatin structure
Sarah Kinkley
From 2019 to 2024, Sarah Kinkley led a lab in our department focused on understanding the regulation of chromatin architecture and genome integrity. One of the group’s interests was exploring the impact and function of specific H3K4me3 epigenetic readers – namely the paralogs PHF13 and PHF23 – on epigenome regulation, chromatin structure, and genome integrity. The group also aimed to decipher the role of RNA–DNA hybrids (R-loops) as potential drivers of oncogenesis. To this end, they performed various screens to assess the incidence of R-loops, their role as a precursor state in oncogenesis, and their involvement in synthetic lethality.
PHF13 is an H3K4me3 epigenetic reader. Having developed CRISPR knock-out, degron, tagged, and inducible cell lines, the group examined the functional domains of this protein and how they affect PHF13’s genomic functions. The group found that PHF13 can oligomerize in two distinct ways: one via its ordered N-terminal and C-terminal domains, and another via its intrinsically disordered regions (IDRs). This differential oligomerization promoted PHF13 phase transitions, influencing its role in gene regulation and higher-order chromatin compaction.
Oligomerization via its ordered regions resulted in a multivalent, ordered chromatin protein that could extend across nucleosomes, drive global chromosome compaction, and promote strong changes in gene expression, consistent with polymer–polymer phase separation. Oligomerization via PHF13’s IDRs promoted the formation of liquid-like condensates and also influenced gene expression, albeit targeting different genes and with weaker amplitude [Rossi et al., Nucleic Acids Research, in revision].
Another major interest of the group was to decipher the role of R-loops (RNA–DNA hybrids) as potential drivers of oncogenesis. RNA–DNA hybrids are highly genotoxic when aberrantly formed or inefficiently resolved. As a result, cells have evolved many dedicated enzymes and mechanisms to limit their formation and eliminate these structures. However, many of the factors regulating these structures are frequently mutated or disrupted in cancer, suggesting that RNA–DNA hybrids may represent a common precursor state to oncogenesis.
Unfortunately, tools for high-throughput, in vivo visualization of these structures are lacking, which hampers studies of these questions. To address this deficit, the group worked on high-precision tools that enable comprehensive, real-time, genome-wide visualization of RNA–DNA hybrids. With these tools, the aim was to perform a series of screens to investigate the incidence of R-loops, their role as a precursor to oncogenesis, and their involvement in synthetic lethality.
Regulatory changes in evolutionary genomics
Stefan Haas
Clade-specific genomic rearrangements are an important mechanism in evolutionary processes, contributing to novel clade-specific phenotypes by restructuring regulatory domains. In collaboration with the Mundlos group, we previously identified major regulatory genomic changes in moles linked to the mole-specific development of ovotestes in females [Schindler et al., Development, 2023]. In a complementary approach, we used CRUP to analyze enhancer activity and putative target genes in gonadal tissues of moles and mice, based on the idea that a clade-specific phenotype is accompanied by multiple regulatory adaptations that support its robust development.
We screened for regulatory units with an increased number of ovotestis-specific enhancers potentially regulating development-related genes. This screen led us to the TAD of the transcription factor SALL1, which is expressed in mole ovotestis but not in mouse gonads. The regulatory domain of SALL1 contains five strongly ovotestis-specific enhancers, four of which drive metanephros-specific expression in moles only. Intriguingly, these enhancers are widely conserved in mice as well; however, their activity in distinct tissues only partially overlaps with that in moles. This project shows that, during evolution, an entire group of enhancers can become functionally rewired to a new tissue context while still maintaining functional similarity in other tissues across clades.


