CORDIS - EU research results

Genetic Determinants of the Epigenome

Periodic Reporting for period 4 - Gen-Epix (Genetic Determinants of the Epigenome)

Reporting period: 2020-12-01 to 2022-06-30

The Problem: The funded project sought to test the hypothesis that a gross feature of DNA sequence – namely base composition – can be interpreted as a signal by proteins that recognise a simple sequence motif. A screen revealed numerous candidate proteins, several of which are implicated in pluripotency, development and cancer. We chose to follow up one of these proteins, SALL4, which is expressed in stem cells. Initially, postdoctoral researcher Dr Timo Quante, in collaboration with the group of Michiel Vermeulen (Nijmegen), carried out the screen using mouse stem cells. Our team subsequently generated mutant cell lines in which SALL4 was unable to bind AT-rich DNA and found dramatic effects on the expression of AT-rich genes. We showed that SALL4 preferentially regulates expression of genes within AT-rich domains of the genome, as predicted by our hypothesis.

Relevance: SALL4 is of broad biomedical interest for several reasons:
• It is mutated in the human skeletal disorder Okihiro syndrome.
• It is a primary target of the drug thalidomide, which leads to its unscheduled degradation.
• It is an essential inhibitor of differentiation that safeguards the pluripotency of stem cells.
• It is over-expressed in many cancers and is a potential target for anti-cancer therapeutics.

High-level outcome: The existence of long domains in the mammalian genome with distinct, evolutionarily conserved base compositions that correlate with gene expression levels has been known for decades, but our SALL4 study is, to our knowledge, the first to provide evidence for a specific biological function. This research programme helped us to build a comprehensive mechanistic picture of the ways in which DNA sequence-mediated transcriptional modulation establishes, defines, and stabilises cell states, and how defects in this system lead to disease.
In 2016 we hypothesised that proteins recognising runs of A/T could read DNA base composition as a signal to modulate gene expression. We identified SALL4 as a candidate and showed that it contains an A/T motif binding domain (zinc finger cluster ZFC4) which is essential for repression of genes that promote ES cell differentiation. Genes are repressed to a degree that correlates with their average AT base composition. Accordingly, over-expression of SALL4 leads to hyper-repression of the most AT-rich genes, as predicted by our hypothesis. Many of the most strongly repressed genes are involved in neuronal differentiation. As differentiation commences, SALL4 is down-regulated, relieving the repression of differentiation genes in stem cells and facilitating the differentiation programme. Consistent with this, many of the defects seen in Sall4-null ESCs, including precocious differentiation, are mimicked by preventing SALL4 binding to AT-rich motifs.

We determined the DNA sequence binding preferences of ZFC4 using a SELEX protocol. This technique involves identification of enriched DNA motifs after repeated cycles of protein-DNA binding and PCR amplification. In the course of this work we introduced improvements to the current protocol, which were recently published as a methods paper.
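The enrichment step of a SELEX analysis can be illustrated with a minimal sketch. It is not the published protocol: the function names, the k-mer length of 6, the pseudocount and the toy reads are all assumptions made for illustration. The idea is simply to compare how often each short sequence (k-mer) occurs in the starting library and in the library recovered after selection.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every overlapping k-mer across a list of DNA reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_enrichment(initial_reads, selected_reads, k=6, pseudocount=1.0):
    """Ratio of k-mer frequency after selection to frequency before.

    The pseudocount (an assumed value) avoids division by zero for
    k-mers that are absent from one of the two libraries.
    """
    before = kmer_counts(initial_reads, k)
    after = kmer_counts(selected_reads, k)
    kmers = set(before) | set(after)
    denom_before = sum(before.values()) + pseudocount * len(kmers)
    denom_after = sum(after.values()) + pseudocount * len(kmers)
    enrichment = {}
    for kmer in kmers:
        freq_before = (before[kmer] + pseudocount) / denom_before
        freq_after = (after[kmer] + pseudocount) / denom_after
        enrichment[kmer] = freq_after / freq_before
    return enrichment


# Toy, made-up reads: not real SELEX data.
initial_reads = ["ACGTGCAC", "GGCATCGA", "TTGACCGT"]
selected_reads = ["GCATATTA", "ATATTACG", "CATATTAG"]

scores = kmer_enrichment(initial_reads, selected_reads, k=6)
top_kmer = max(scores, key=scores.get)
print(top_kmer, round(scores[top_kmer], 2))
```

In this toy input the k-mer shared by all three selected reads scores highest. A real analysis would compare successive selection cycles and use far deeper sequencing.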

We confirmed the importance of an N-terminal domain of SALL4 which recruits the NuRD corepressor complex. Mutation of this short region so that it no longer interacts with the NuRD complex results in a severe embryonic lethal phenotype in mice, resembling complete absence of SALL4. We speculated that NuRD might be substituted by a different corepressor complex that also depends upon histone deacetylases for its activity. However, replacement of the NuRD binding domain by a SIN3A recruitment domain did not rescue this defect. We conclude that only the NuRD corepressor will do, for reasons that are unclear at the molecular level.

We precisely defined a domain that leads to multimerization of SALL4 with itself and with other SALL protein family members. We now wish to determine whether multimerization is essential or dispensable for SALL4 function by further molecular genetic experiments. We are in any case following up the functional relevance of SALL1, which co-exists with SALL4 in embryonic stem cells.

A major output of this project was the publication by Pantier et al. (Molecular Cell 2021). The enhanced SELEX protocol was recently published in STAR Protocols (Pantier et al., 2022). A manuscript describing the structure of the ZFC4 domain complexed with AT-rich DNA is under revision at the journal Life Science Alliance. The analysis of multimerization by SALL4 is in progress and publication is expected within the next 1-2 years.
A/T and G/C base pairs form large and relatively homogenous AT-rich or GC-rich regions that usually encompass several genes together with their intergenic sequences. Base compositional domains are often evolutionarily conserved and coincide with other genomic features including early/late-replicating regions, lamina-associated domains and topologically associating domains. Despite these interesting correlations, it was not known whether conserved AT-rich and GC-rich domains are passive by-products of evolution or whether DNA base composition can play an active biological role. Our project provides the first clear evidence that the latter proposal is correct.

To identify novel AT-binding proteins, we used a DNA pulldown-mass spectrometry screen in mouse embryonic stem cells (ESCs), which are pluripotent and can be differentiated in culture. Our top hit was SALL4, a multi-zinc-finger protein that restrains differentiation of ESCs and participates in several physiological processes, including neuronal development, limb formation and gametogenesis. In humans, failure of SALL4 function is the cause of two severe developmental disorders: the recessive genetic disorder Okihiro syndrome and embryopathies due to treatment during pregnancy with the drug thalidomide. Despite its biomedical importance, the molecular functions of SALL4 were poorly understood. This project demonstrated that many of the defects seen in Sall4-null ESCs, including precocious differentiation, are mimicked by inactivation of its AT-binding domain. A major function of SALL4 is to sense DNA base composition and thereby restrain transcription of genes that promote differentiation.

Our results uncover a novel regulatory mechanism that uses base composition to regulate gene expression programmes. Vertebrate genomes are on average relatively AT-rich (60% A/T) and therefore the short A/T motifs to which SALL4 binds occur throughout the genome with frequencies that vary probabilistically according to local base composition. As base composition is a constant feature of the genome, regulation is achieved by varying the availability of the base composition reader itself. Accordingly, as cells enter differentiation, expression of SALL4 drops, suggesting that differentiation is triggered by loss of SALL4-mediated inhibition of key developmental genes. Global regulation of this kind confers the ability to modulate expression of multi-gene blocks using relatively few base composition readers and is potentially more economical than controlling each gene by a separate mechanism. Our findings demonstrate that base compositional domains are not merely a biologically irrelevant by-product of genome evolution but constitute a signal that impacts gene expression.
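The dependence of motif frequency on local base composition can be illustrated with a minimal sketch. It rests on simplifying assumptions that are not part of the project's analysis: bases are treated as independent, a "motif" is modelled as any run of k consecutive A or T bases with k = 6, and the 65% and 55% A/T domains are hypothetical values chosen around the 60% genome average.

```python
def at_fraction(seq):
    """Fraction of A/T among the unambiguous bases (A, C, G, T) of a sequence."""
    seq = seq.upper()
    at = sum(1 for base in seq if base in "AT")
    acgt = sum(1 for base in seq if base in "ACGT")
    return at / acgt if acgt else 0.0


def expected_at_run_windows(region_length, at_frac, k=6):
    """Expected number of length-k windows made up only of A/T.

    Assumes independent bases, so a single window is all-A/T with
    probability at_frac ** k (an illustrative null model only).
    """
    windows = max(region_length - k + 1, 0)
    return windows * at_frac ** k


# Hypothetical 100 kb domains on either side of the 60% A/T average.
at_rich = expected_at_run_windows(100_000, 0.65)
at_poor = expected_at_run_windows(100_000, 0.55)
fold = at_rich / at_poor
print(f"AT-rich: {at_rich:.0f}  AT-poor: {at_poor:.0f}  fold: {fold:.2f}")
```

Under these assumptions a ten-point difference in A/T content changes the expected density of 6-base A/T runs roughly 2.7-fold. A modest shift in base composition can therefore translate into a much larger difference in the number of potential binding sites.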
Generalised mode of action of proteins that interpret DNA sequence features to tune gene expression