Deep representational learning of the evolutionary DNA code in the vertebrate pallium

Project information

EvoDCode

Grant agreement ID: 101202429

DOI

10.3030/101202429

EC signature date 27 February 2025

Start date 1 April 2025

End date 31 March 2027

Funded under

Marie Skłodowska-Curie Actions (MSCA)

Total cost

No data

EU contribution

€ 200 400,00

Coordinated by

VIB VZW
Belgium

Periodic Reporting for period 1 - EvoDCode (Deep representational learning of the evolutionary DNA code in the vertebrate pallium)

Reporting period: 2025-04-01 to 2027-03-31

Vertebrate genomes generally consist of billions of nucleotides and encode tens of thousands of genes. Yet, while these genes could theoretically be expressed in almost any combination, evolution has resulted in a large but finite number of stable configurations with distinct downstream functions. These configurations form the basis of cell types: groups of cells that share a core identity. However, our understanding of how an organism's cell types are encoded in the genome is still lacking. While genes themselves can be homologous between species to a certain extent, regulatory DNA is much less conserved. Gene expression is regulated by enhancer elements, which bind transcription factors and can be located far from the target gene. Investigating the regulatory logic that encodes cell types in the genome requires a systematic approach to sequence phylogeny: we first identify which sequences are active in each cell type across a wide variety of species, and then ask what the defining features of these sequences are.
The rapidly advancing field of artificial intelligence holds great promise for comparative genomics. Convolutional neural networks (CNNs) and DNA language models have already been used successfully to model gene regulatory logic in an interpretable way, revealing the transcription factor binding sites within cis-regulatory elements and their co-regulatory relationships. These models require large amounts of training data, and the introduction of the single-cell Assay for Transposase-Accessible Chromatin with sequencing (scATAC-seq) has made it possible to collect such data stratified by cell type.
The overarching goal of this proposal is to better understand how the genome sequence underlies cell identity in the pallium across vertebrate species. I hypothesize that gene regulatory logic can be learned directly from the genomic sequence and used to model and predict cell types. I focus specifically on the pallium, a part of the brain that is strongly tied to species-specific behaviour and has undergone strong divergent evolution. In humans the dorsal pallium is expanded into the cerebral cortex, which provides most of our expanded cortical abilities, while the avian dorsal pallium consists of only a single cortical layer (Wulst). Our understanding of how these large differences arose is still limited. Using modern single-cell epigenomic methods, we can study how evolutionary changes impact gene regulation by sampling across a wide set of vertebrate species and using this data to model cell type evolution.
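As a rough illustration of how a CNN reads DNA, the sketch below one-hot encodes a sequence and slides a single filter along it, which is what one channel of a first convolutional layer computes before any non-linearity. It is not the project's model: the function names, the toy sequence and the hand-written filter for the 4-mer "TGCA" are all assumptions made for this example, whereas in a trained sequence-to-function model the filter weights are learned from accessibility data.

```python
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows (A, C, G, T).
    Bases outside ACGT, such as N, become all-zero rows."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def scan(seq, kernel):
    """Slide a (width x 4) kernel along the sequence and return one score
    per window, like a single first-layer convolutional filter."""
    x = one_hot(seq)
    width = len(kernel)
    scores = []
    for start in range(len(x) - width + 1):
        score = sum(
            x[start + i][j] * kernel[i][j]
            for i in range(width)
            for j in range(4)
        )
        scores.append(score)
    return scores

# Hypothetical toy filter: rewards an exact match to the 4-mer "TGCA".
toy_kernel = one_hot("TGCA")
scores = scan("AATGCATT", toy_kernel)

# Index of the window with the highest filter response.
best = max(range(len(scores)), key=scores.__getitem__)
```

Here the filter responds most strongly at the window that starts at index 2, where the sequence matches the motif. Interpreting a trained model amounts to asking which such patterns its filters have learned.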

Since this grant did not run for the duration of the proposed timeline, not all aims and milestones put forward in the proposal have been achieved. The work will be finished under a different source of funding. For WP1, the generation of a cross-species dataset, we have met the aims described in our application. We collected brain samples from 48 vertebrate species (mammals, birds, lizards and ray-finned fish), nearly double the 25 species originally planned. From these 48 species we have generated 43 single-cell multiome datasets and 25 HyDrop-ATAC datasets (task 1.1). For 35 of the species, the data have been processed and clustered (task 1.3). We have also generated 10 Nova-ST datasets (task 1.2).
WP2 is ongoing, and some of its tasks have been completed. Notably, tasks 2.1 and 2.2 are partially completed: 27 species-specific sequence-to-function models have been trained (task 2.1), and co-embeddings of the different species' scRNA-seq libraries have been generated using SATURN (task 2.2). However, the proposed cross-species model architecture (tasks 2.1/2.2) is not yet finished. The work in tasks 2.3/2.4 has not been done yet. Similarly, WP3 was planned for later in the timeline and has not been conducted yet. Measured against the timeline in the original grant proposal, we are on track with regard to deliverables.

Given the limited run time of this grant, we have not yet obtained all the results that we expected from this project. The project will be funded for another two years under an EMBO grant, which will make it possible to analyze this large collection of datasets in depth. In addition to the single-cell datasets, the main result of the project thus far has been the generation of the 27 sequence-to-function models, mostly for non-model organisms. This is an important step, since the development of this field has focused primarily on human, mouse and Drosophila. We show that such models are species-agnostic and can also be trained well on less deeply annotated reference genomes.
