Deciphering the Language of DNA to Identify Regulatory Elements and Classify Transcripts Into Functional Classes

What is the problem/issue being addressed?
We have developed new deep learning language models for genomic sequences based on the transformer architecture. These models can be used to learn the structure of DNA, identify its functional elements, and predict their function.
Moreover, we have collected and curated benchmark datasets (publicly available in the genomic_benchmarks GitHub repository) that can be used to evaluate any new methods developed in the future. We have uploaded our neural network models to the public Hugging Face Hub repository.
Finally, during our outreach activities, we have taught newcomers to machine learning in general, and deep learning in particular, how to fine-tune pre-trained neural networks for their downstream tasks.

Why is it important for society?
Understanding DNA functional elements and their roles contributes to understanding molecular mechanisms, including those underlying disease.
More generally, transformer embeddings provide a way to understand DNA on a deeper level. They map categorical variables (sequences of A, C, G, and T) into a continuous space. This can be useful for a variety of tasks, including prediction, clustering, and similarity measurement.
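To make the idea of mapping a categorical sequence into a continuous space concrete, here is a minimal sketch in Python. It is not the project's model: the overlapping 3-mer tokenization, the 8-dimensional vectors, the mean pooling, and all function names are illustrative assumptions, and the vectors are random placeholders where a trained transformer would supply learned ones.

```python
import math
import random


def kmer_tokens(seq, k=3):
    """Split a DNA sequence into overlapping k-mers (k=3 is an assumption)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_embedding_table(k=3, dim=8, seed=0):
    """Give every possible k-mer a random vector.

    Placeholder only: a trained language model would learn these vectors.
    """
    rng = random.Random(seed)
    kmers = [""]
    for _ in range(k):
        kmers = [prefix + base for prefix in kmers for base in "ACGT"]
    return {kmer: [rng.gauss(0.0, 1.0) for _ in range(dim)] for kmer in kmers}


def embed_sequence(seq, table, k=3):
    """Mean-pool the k-mer vectors into one fixed-length vector per sequence."""
    tokens = kmer_tokens(seq, k)
    if not tokens:
        raise ValueError("sequence is shorter than k")
    dim = len(next(iter(table.values())))
    pooled = [0.0] * dim
    for token in tokens:
        # Assumes the sequence contains only A, C, G and T.
        for j, value in enumerate(table[token]):
            pooled[j] += value
    return [x / len(tokens) for x in pooled]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


table = build_embedding_table()
a = embed_sequence("ACGTACGTAC", table)
b = embed_sequence("ACGTACGTAC", table)
c = embed_sequence("TTTTGGGGCC", table)
print("same sequence:     ", cosine_similarity(a, b))
print("different sequence:", cosine_similarity(a, c))
```

Once sequences are vectors of equal length, the prediction, clustering, and similarity tasks mentioned above reduce to standard operations on those vectors.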

Objectives
The goal is to provide easy-to-use neural network models that even people who are not machine learning experts can fine-tune for a specific task, e.g. classification of genomic sequences or identification of elements in the genome.

This project enabled me to return from industry to research as planned. It also partially covered the cost of computation (neural network training). Thanks to the collaboration with my supervisor, I was given a chance to develop my soft skills and reach maturity as an independent researcher. As proof of that, I have received my first grant as a principal investigator (PI), which will start in January 2023, and acquired two PhD students (starting in Fall 2022). The fellowship opened the door to CEITEC for me in particular and to Brno’s Biology/Genetics community in general (I had never had an academic job in this city before).

In the grant proposal, we originally suggested using traditional LSTM and CNN neural networks, and we started with these models in 2020. Once we had put together our benchmark datasets, we soon realized that Transformer models outperformed our first attempts.

The field was soon revolutionized by large models, both for DNA (DNABERT) and proteins (Facebook’s ESM, Rostlab’s ProtBert, DeepMind’s AlphaFold). We have built on those results and trained Transformer models based on the DeBERTa architecture not only for the human genome but also for mouse, zebrafish, Drosophila, and C. elegans.

In collaboration with Joanna Sulkowska (Warsaw University), we have started to fine-tune protein models for the problem of knot identification. The main results are listed in part 3.

We have exceeded our original dissemination goals (e.g. organising not one, but three workshops).

Three publications have been supported by the grant: one published in Frontiers in Genetics, the second currently under review at BMC Genomic Data, and the third currently available as a preprint on bioRxiv. All three are publicly available.
1) Klimentova, E., Polacek, J., Simecek, P., & Alexiou, P. (2020). PENGUINN: Precise exploration of nuclear G-quadruplexes using interpretable neural networks. Frontiers in Genetics, 11, 568546.

2) Gresova, K., Martinek, V., Cechak, D., Simecek, P., & Alexiou, P. (2022). Genomic Benchmarks: A Collection of Datasets for Genomic Sequence Classification. bioRxiv.

3) Martinek, V., Cechak, D., Gresova, K., Alexiou, P., & Simecek, P. (2022). Fine-Tuning Transformers For Genomic Tasks. bioRxiv.

During the project, we have collected genomic benchmark datasets that can be used to measure the quality of current and future DNA classification methods (https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks).
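As a rough illustration of how a benchmark of this kind is used, the Python sketch below fits a trivial baseline on a training split and reports accuracy on a held-out test split. Everything in it is hypothetical: the sequences and labels are invented toy data, not taken from the genomic_benchmarks repository, and the GC-content threshold classifier (which assumes the positive class is GC-rich) merely stands in for whatever method is being evaluated.

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def fit_threshold(train):
    """Place the decision threshold midway between the two class means.

    Assumes label 1 sequences have the higher GC content (toy assumption).
    """
    pos = [gc_content(s) for s, label in train if label == 1]
    neg = [gc_content(s) for s, label in train if label == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def predict(seq, threshold):
    """Predict class 1 when GC content exceeds the threshold."""
    return 1 if gc_content(seq) > threshold else 0


def accuracy(test, threshold):
    """Share of test sequences whose predicted label matches the true label."""
    correct = sum(predict(s, threshold) == label for s, label in test)
    return correct / len(test)


# Invented toy splits; a real benchmark ships fixed train and test sets
# so that different methods are compared on identical data.
train = [("GCGCGGCC", 1), ("GGCCGCAT", 1), ("ATATTAAT", 0), ("TTAAGATA", 0)]
test = [("CCGGGCTA", 1), ("ATTTAACG", 0)]

threshold = fit_threshold(train)
print("threshold:", threshold)
print("test accuracy:", accuracy(test, threshold))
```

The important part is the protocol rather than the classifier: the test split is never used for fitting, so the reported score is comparable across methods.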

We have trained language models on the genomes of Homo sapiens, Mus musculus, Danio rerio, Drosophila melanogaster, Caenorhabditis elegans and Arabidopsis thaliana (https://github.com/ML-Bioinfo-CEITEC/DNA-pretraining) and have shown those models to achieve state-of-the-art (SOTA) results on the genomic benchmarks mentioned above.

We have also experimented with several approaches to predicting the presence of a knot in a protein structure (https://github.com/ML-Bioinfo-CEITEC/pknots_experiments).

All of those results can help provide insight into gene and protein function and may contribute to the identification of novel drug targets. At a minimum, they can be used by bioinformaticians for genomic and protein sequence classification.

Periodic Reporting for period 1 - LanguageOfDNA (Deciphering the Language of DNA to Identify Regulatory Elements and Classify Transcripts Into Functional Classes)
