Integrated Connectedness for a New Representation of Biology

Periodic Reporting for period 3 - ICON-BIO (Integrated Connectedness for a New Representation of Biology)

Reporting period: 2020-07-01 to 2021-12-31

We address the problem of designing new algorithms for mining and extracting new biomedical knowledge from systems-level, multi-scale, heterogeneous molecular (“omics”) data. The objective is to improve biological understanding and contribute to precision medicine. In particular, we develop new machine learning (ML) and network science methods and apply them to the omics data. We look for new functions of genes, new disease-related genes, new biomarkers, better patient stratification and repurposing of known drugs for different patient groups, including cancer and Covid-19 patients. This is of importance to society, as it may lead to improved health and wellbeing for all.
Details:
1. Developing new models for data fusion for new conceptual paradigms in biology (Work Package 1, WP1):
We introduce algorithms that represent biological macromolecules as vectors in d-dimensional space by decomposing the molecular network matrices with Non-negative Matrix Tri-Factorization (NMTF), an ML technique. We identify new cancer-related genes and validate 80% of our novel cancer-related gene predictions in the literature and by patient survival curves, demonstrating that 93.3% of them have potential clinical relevance as biomarkers of cancer. We published this in one of the top journals in our field, Bioinformatics, with an impact factor of 6.937:
i. A. Xenos, N. Malod-Dognin, S. Milinković, N. Pržulj, Linear functional organization of the omic embedding space, Bioinformatics 37 (21), 3839-3847, 2021
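As an illustration of the decomposition step, the following minimal sketch factorizes a non-negative matrix X into three non-negative factors, X ≈ G1 · S · G2ᵀ, using plain multiplicative updates. It is not the published implementation: the function name `nmtf`, the ranks `k1` and `k2`, the random test matrix and the absence of any orthogonality constraint are all assumptions made for this example. The rows of G1 can be read as d-dimensional embedding vectors of the row entities (here d = k1).

```python
import numpy as np


def nmtf(X, k1, k2, n_iter=200, eps=1e-9, seed=0):
    """Approximate X (n x m, non-negative) as G1 @ S @ G2.T.

    Hypothetical helper: unconstrained multiplicative updates that keep
    all three factors non-negative. Returns the factors and the Frobenius
    reconstruction error recorded after every iteration.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    G1 = rng.random((n, k1))
    S = rng.random((k1, k2))
    G2 = rng.random((m, k2))
    errors = []
    for _ in range(n_iter):
        # Each factor is rescaled by (negative gradient part) / (positive part).
        G1 *= (X @ G2 @ S.T) / (G1 @ S @ G2.T @ G2 @ S.T + eps)
        G2 *= (X.T @ G1 @ S) / (G2 @ S.T @ G1.T @ G1 @ S + eps)
        S *= (G1.T @ X @ G2) / (G1.T @ G1 @ S @ G2.T @ G2 + eps)
        errors.append(np.linalg.norm(X - G1 @ S @ G2.T))
    return G1, S, G2, errors


# Toy data standing in for a molecular network matrix (assumption).
X = np.random.default_rng(1).random((30, 20))
G1, S, G2, errors = nmtf(X, k1=4, k2=3)
print(f"reconstruction error: {errors[0]:.3f} -> {errors[-1]:.3f}")
```

In this sketch, row i of `G1` is the 4-dimensional embedding of entity i; similarity between entities can then be estimated with ordinary vector operations on those rows.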
2. New methods for Non-Negative Matrix Tri-factorization (NMTF) and related problems (WP2):
To uncover molecular mechanisms and drug indications for specific cancer types, we develop an integrative framework able to harness a wide range of diverse molecular and pan-cancer data. It captures the underlying biology predictive of drug response. To integrate the data, we use three types of matrix factorizations: non-negative matrix factorization (NMF), NMTF, and symmetric NMTF (SNMTF). We published this study as:
ii. T. Gaudelet, N Malod-Dognin, and N. Pržulj, Integrative Data Analytic Framework to Enhance Cancer Precision Medicine, Network and Systems Medicine, 4 (1), 60-73, 2021
In addition, we develop four methods to solve the SNMTF problem. They are based on four theoretical approaches known from the literature: the fixed point method (FPM), the block-coordinate descent with projected gradient (BCD), the gradient method with exact line search (GM-ELS) and the adaptive moment estimation method (ADAM). For each of these methods we offer a software implementation: for the first two we use Matlab and for the other two Python with the TensorFlow library. We published it as:
iii. R Hribar, T Hrga, G Papa, G Petelin, J Povh, N Pržulj, V Vukašinović, Four algorithms to solve symmetric multi-type non-negative matrix tri-factorization problem, Journal of Global Optimization 82 (2), 283-312, 2021
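To make the symmetric problem concrete, the sketch below minimizes ||A − G · S · Gᵀ||² over non-negative G and S for a single symmetric matrix A, using projected gradient steps with a simple backtracking rule. It is a simplified stand-in and not one of the four published algorithms: the single-matrix setting, the function names, the step-size rule and the planted toy data are assumptions chosen for illustration.

```python
import numpy as np


def snmtf_objective(A, G, S):
    """Squared Frobenius error of the symmetric tri-factorization."""
    return np.linalg.norm(A - G @ S @ G.T) ** 2


def snmtf_projected_gradient(A, k, n_iter=300, seed=0):
    """Hypothetical solver: projected gradient with step halving.

    A is assumed symmetric and non-negative. A step is accepted only if
    it does not increase the objective, so `history` is non-increasing.
    """
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    G = rng.random((n, k))
    S = rng.random((k, k))
    S = (S + S.T) / 2.0  # start from a symmetric middle factor
    step = 1.0
    history = [snmtf_objective(A, G, S)]
    for _ in range(n_iter):
        R = A - G @ S @ G.T           # residual (symmetric)
        grad_G = -4.0 * R @ G @ S     # gradient w.r.t. G for symmetric A, S
        grad_S = -2.0 * G.T @ R @ G   # gradient w.r.t. S
        while True:
            # Gradient step followed by projection onto the non-negative orthant.
            G_new = np.maximum(G - step * grad_G, 0.0)
            S_new = np.maximum(S - step * grad_S, 0.0)
            f_new = snmtf_objective(A, G_new, S_new)
            if f_new <= history[-1] or step < 1e-12:
                break
            step *= 0.5
        if f_new > history[-1]:
            break  # no acceptable step found; stop
        G, S = G_new, S_new
        history.append(f_new)
        step *= 2.0  # try a larger step next time
    return G, S, history


# Toy symmetric matrix built from planted non-negative factors (assumption).
rng = np.random.default_rng(2)
G0 = rng.random((20, 3))
S0 = rng.random((3, 3))
S0 = (S0 + S0.T) / 2.0
A = G0 @ S0 @ G0.T
A = (A + A.T) / 2.0
G, S, history = snmtf_projected_gradient(A, k=3)
print(f"objective: {history[0]:.3f} -> {history[-1]:.3f}")
```

The backtracking rule trades speed for robustness: it avoids tuning a fixed learning rate, at the cost of extra objective evaluations per iteration.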
Furthermore, we presented another novel method, called Multi-project and Multi-profile joint Non-negative Matrix Factorization, capable of integrating data from different sources, such as experimental and observational multi-omic data. We identified groups of patients and cell lines similar to each other. We predicted the drug profiles for patients and identified genetic signatures for resistant and sensitive tumors to a specific drug. We published the results as:
iv. D. A. Salazar, N. Pržulj, and C. F. Valencia, Multi-project and Multi-profile joint Non-negative Matrix Factorization for cancer omic datasets, Bioinformatics 37 (24), 4801-4809, 2021
3. Data science, combinatorial and algebraic topology algorithms (WP3):
The PI’s group published three new graphlet-based methods enabling modelling and mining of omics biological networks in the top journals in the field: two in Bioinformatics and one in PLoS ONE, which has an impact factor of 3.24.
The first method enables network modelling and graphlet-based mining of network data with weights on edges that can represent the probability of an interaction occurring in the cell. We show that probabilistic graphlet-based methods capture biological information in these data more robustly, while simultaneously showing a higher sensitivity to identify condition-specific functions than their unweighted graphlet-based counterparts. We published these results as:
v. S. Doria-Belenguer, M. K. Youssef, R. Böttcher, N. Malod-Dognin and N Pržulj, Probabilistic Graphlets Capture Biological Function in Probabilistic Molecular Networks, Bioinformatics; 2020; Proceedings of ECCB 2020, September 2020
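The idea of counting small subgraphs on a network with edge probabilities can be illustrated on the simplest case, the triangle. The sketch below computes, for every node, the expected number of triangles it takes part in, assuming that edges occur independently with the given probabilities. This independence assumption, the function name and the toy network are choices made for this example; the published probabilistic graphlet definition may differ in its details.

```python
import numpy as np


def expected_triangles_per_node(P):
    """Expected number of triangles touching each node.

    P is a symmetric matrix of edge probabilities with a zero diagonal.
    Under independent edges, a triangle (u, v, w) exists with probability
    P[u, v] * P[v, w] * P[u, w]; diag(P^3) counts each triangle at a node
    twice (once per direction), hence the division by 2.
    """
    P = np.asarray(P, dtype=float)
    return np.diag(P @ P @ P) / 2.0


# Toy probabilistic network: an uncertain triangle 0-1-2 and a certain edge 2-3.
P = np.zeros((4, 4))
for u, v, p in [(0, 1, 0.5), (1, 2, 0.5), (0, 2, 0.5), (2, 3, 1.0)]:
    P[u, v] = P[v, u] = p

expected = expected_triangles_per_node(P)
print(expected)
```

Setting all probabilities to 1 recovers the usual unweighted triangle count, which is the sense in which the weighted statistic generalizes its unweighted counterpart.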
Furthermore, we introduce a new graphlet-based definition of eigencentrality of genes in a pathway, graphlet eigencentrality. We compute the centrality of genes in a pathway either from the local perspective of the pathway, or from the global perspective of the entire network. Our results suggest that by considering different graphlet eigencentralities, we can capture different functional roles of genes in and between pathways. We published this result as:
vi. S.F.L. Windels, N. Malod-Dognin, and N. Pržulj, Graphlet eigencentralities capture novel central roles of genes in pathways, PLoS ONE, https://doi.org/10.1371/journal.pone.0261676 Accepted: December 7, 2021
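A rough way to see how a graphlet-based eigencentrality can differ from the standard one is to replace the adjacency matrix by a matrix that counts how often two nodes co-occur in a given graphlet, and then take its principal eigenvector. The sketch below does this for the triangle only. The choice of graphlet, the function names and the toy network are assumptions for illustration and do not reproduce the published definition.

```python
import numpy as np


def triangle_adjacency(A):
    """Entry (u, v) counts the triangles that contain the edge u-v."""
    A = np.asarray(A, dtype=float)
    return (A @ A) * A  # common neighbours, masked to existing edges


def eigencentrality(M):
    """Principal eigenvector of a symmetric non-negative matrix, scaled to sum to 1."""
    _, eigenvectors = np.linalg.eigh(M)   # eigenvalues in ascending order
    v = np.abs(eigenvectors[:, -1])       # eigenvector of the largest eigenvalue
    return v / v.sum()


# Toy network: triangle 0-1-2 with a pendant node 3 attached to node 2.
A = np.zeros((4, 4))
for u, v in [(0, 1), (1, 2), (0, 2), (2, 3)]:
    A[u, v] = A[v, u] = 1.0

standard = eigencentrality(A)
triangle_based = eigencentrality(triangle_adjacency(A))
print("standard:      ", np.round(standard, 3))
print("triangle-based:", np.round(triangle_based, 3))
```

In this toy network the pendant node receives a positive standard eigencentrality but a triangle-based centrality of zero, because it takes part in no triangle. Different graphlets therefore emphasize different wiring roles of the same node.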

In addition, we present a hypergraph-based approach for modeling biological systems and formulate vertex classification, edge classification and link prediction problems on (hyper)graphs as instances of vertex classification on (extended, dual) hypergraphs. We introduce a novel kernel method on vertex- and edge-labeled (colored) hypergraphs for analysis and learning. We show its potential use to estimate the interactome sizes in various species. We published this as:

vii. J. Lugo-Martinez, D. Zeiberg, T. Gaudelet, N Malod-Dognin, N. Pržulj, and P. Radivojac, Classification in biological networks with hypergraphlet kernels, Bioinformatics, btaa768, https://doi.org/10.1093/bioinformatics/btaa768 April 2021
We plan to use the results described in WP2 and WP3 to further improve the paradigms and methods described in WP1 above.
4. Applications (WP4):
The structure of DNA packing (chromatin) impacts gene expression. Alterations in chromatin structure (CS) have been shown to coincide with the occurrence of cancer. We propose a comparative pipeline to analyze CSs and apply it to study chronic lymphocytic leukemia (CLL). We show that CSs are a rich source of new biological information about DNA that can complement other data types. We published this at the top conference in our field, Intelligent Systems for Molecular Biology (ISMB), which has an acceptance rate of around 15%, with the proceedings appearing in a special issue of the journal Bioinformatics:
viii. N. Malod-Dognin, V. Pancaldi, A. Valencia and N. Pržulj, Chromatin network markers of leukemia, Bioinformatics 36 (Supplement_1), i455-i463, July 2020

The COVID-19 pandemic has been raging. To address this challenge, we adapt an explainable ML algorithm for data fusion and utilize it on new omics data on viral–host interactions, human protein interactions, and drugs to better understand SARS-CoV-2 infection mechanisms and predict new drug–target interactions for COVID-19. We published these results in Nature’s journal Scientific Reports, with impact factor of 4.379:

ix. C. Zambrana, A. Xenos, R. Böttcher, N. Malod-Dognin, and N. Pržulj, Network neighbors of viral targets and differentially expressed genes in COVID-19 are drug target candidates, Scientific Reports 11, 18985, 2021
The Main Outputs:

The ICON-BIO project continues to produce important outcomes. In short, we devised new algorithms for mining and fusing heterogeneous omics data from publicly available databases and applied them to several tasks of precision medicine. During the reporting period, we published 9 journal papers; since the beginning of the project, we have published 14 peer-reviewed papers in top-ranked scientific journals, addressing the project’s work packages (WPs). A figure from a representative paper is presented below.
The main published results during the reporting period include the following (the numbering corresponds to the papers listed above):
i. We introduce new algorithms for network embeddings and demonstrate that genes that are embedded close in these spaces have similar biological functions, so we can extract new biomedical knowledge directly by doing linear operations on their embedding vector representations, demonstrating potential clinical relevance as biomarkers of cancer.
ii. We developed an integrative framework able to harness a wide range of diverse molecular and pan-cancer data, using three types of matrix factorizations: non-negative matrix factorization (NMF), non-negative matrix tri-factorization (NMTF), and symmetric NMTF (SNMTF).
iii. We develop four methods to solve the symmetric multi-type non-negative matrix tri-factorization (SNMTF) problem, which is of special importance in data science since it serves as a mathematical model for the fusion of different data sources in data clustering.
iv. We presented another novel method, called Multi-project and Multi-profile joint Non-negative Matrix Factorization, capable of integrating data from different sources, predicting the drug profiles for patients, and identifying genetic signatures for resistant and sensitive tumors to a specific drug.
v. We introduce probabilistic graphlets as a tool for analyzing the local wiring patterns of probabilistic networks, showing a higher sensitivity to identify condition-specific functions compared to the unweighted graphlet-based method counterparts.
vi. We introduce a new graphlet-based definition of eigencentrality of genes in a pathway, graphlet eigencentrality, to identify pathways and cancer mechanisms.
vii. We present a hypergraph-based approach for modeling biological systems and formulate vertex classification, edge classification and link prediction problems on (hyper)graphs as instances of vertex classification on (extended, dual) hypergraphs.
viii. We model the chromatin of the affected and control chronic lymphocytic leukemia (CLL) cells as networks and analyze the network topology by our state-of-the-art methods. We show the existence of structural markers of cancer related DNA elements in the chromatin.
ix. We adapt an explainable artificial intelligence algorithm for data fusion and utilize it on new omics data on viral–host interactions, human protein interactions, and drugs to better understand SARS-CoV-2 infection mechanisms and predict new drug–target interactions for COVID-19.
Additional results since the beginning of the project include:
x. We proposed new neural networks with structures inspired by the multi-scale organization of a cell. We showed that these models are able to correctly predict the diagnosis for the majority of the patients by analyzing their differential gene expression data.
xi. We generalize spectral embedding, spectral clustering and network diffusion. Applying Graphlet Laplacian based spectral embedding, we demonstrate that Graphlet Laplacians capture biological functions.
xii. We propose a novel, data-driven concept of an integrated cell, iCell. We introduce a computational prototype of an iCell, which integrates three omics, tissue-specific molecular interaction network types.
xiii. To model the multi-scale organization of complex biological systems, we utilize simplicial complexes from computational geometry.
xiv. We propose a new, multi-scale, protein interaction hyper-network model that utilizes hypergraphs to capture different scales of protein organization.

Dissemination of the Outputs:
These research outputs were disseminated at numerous scientific and industrial events and institutions. In particular, we presented the results by giving 46 invited and contributed talks since the beginning of the project on April 1st, 2018. The PI gave 42 invited/keynote/plenary talks and the lab members gave 4 contributed talks at the top conferences/institutions in our field. The details are at https://life.bsc.es/iconbi/docs/NP_CV.pdf .
We progressed beyond the state of the art in the following:

1. We provided several abstractions for fusing heterogeneous types of omics data about a cell and implemented the first prototype of an integrated cell, iCell. We are currently furthering this work towards better biological (omics) data models, data analytics algorithms and their applications.

2. We designed new machine learning methods for integration / fusion of the multi-scale omics data. We are currently working on furthering these methods for improved biological quality and computational efficiency. Also, we are working on the software package that will encompass all of our currently available methods in this realm and that will be made open source.

3. We constructed new data science, combinatorial and algebraic topology algorithms for modelling the multi-scale organization of the cellular omics data. They were based on modelling the data by graphs, hypergraphs and abstract simplicial complexes. In addition, we constructed a new graphlet-based method enabling modelling and mining of omics biological networks with weights on edges and furthered other graphlet-based methods. We plan to use the results described in points 2 and 3 here to further improve the abstractions described in point 1 above.

4. We applied our new abstractions and methods to various forms of cancer and to Covid-19. We plan to apply them further to rare genetic disease data, additional Covid-19 and cancer data, and other related precision medicine applications.
An illustration of a data fusion framework for an integrated cell, iCell. In Nat. Comms. 10:805, 2019