Periodic Reporting for period 4 - ICON-BIO (Integrated Connectedness for a New Representation of Biology)
Reporting period: 2022-01-01 to 2023-06-30
Details:
1. Developing new models for data fusion for new conceptual paradigms in biology (Work Package 1, WP1):
We introduce algorithms that represent biological macromolecules as vectors in d-dimensional space by decomposing the molecular network matrices with Nonnegative Matrix Tri-Factorization (NMTF), a machine learning (ML) technique. We identify new cancer-related genes and validate 80% of our novel cancer-related gene predictions in the literature and by patient survival curves, demonstrating that 93.3% of them have potential clinical relevance as biomarkers of cancer. We published this in one of the top journals in our field, Bioinformatics, with an impact factor of 6.937:
i. A. Xenos, N. Malod-Dognin, S. Milinković, N. Pržulj, Linear functional organization of the omic embedding space, Bioinformatics 37 (21), 3839-3847, 2021
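As an illustration of the embedding step, the following is a minimal sketch of NMTF with standard multiplicative updates, assuming a single non-negative matrix R that is approximated as G1 S G2^T under a Frobenius-norm objective. The function names, the toy random network, the rank values, and the choice of the rows of G1 S as embedding vectors are assumptions made for this example; it is not the implementation used in the published work.

```python
import numpy as np

def nmtf(R, k1, k2, n_iter=200, eps=1e-9, seed=0):
    """Approximate a non-negative matrix R (n x m) as G1 @ S @ G2.T."""
    rng = np.random.default_rng(seed)
    n, m = R.shape
    # Random strictly positive initialisation (an assumption; other schemes exist).
    G1 = rng.random((n, k1)) + eps
    S = rng.random((k1, k2)) + eps
    G2 = rng.random((m, k2)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every factor non-negative.
        G1 *= (R @ G2 @ S.T) / (G1 @ S @ G2.T @ G2 @ S.T + eps)
        G2 *= (R.T @ G1 @ S) / (G2 @ S.T @ G1.T @ G1 @ S + eps)
        S *= (G1.T @ R @ G2) / (G1.T @ G1 @ S @ G2.T @ G2 + eps)
    return G1, S, G2

def relative_error(R, G1, S, G2):
    """Frobenius reconstruction error relative to the norm of R."""
    return np.linalg.norm(R - G1 @ S @ G2.T) / np.linalg.norm(R)

# Hypothetical toy network: a random symmetric 0/1 adjacency matrix.
rng = np.random.default_rng(1)
upper = np.triu(rng.random((30, 30)) < 0.2, k=1).astype(float)
R = upper + upper.T

G1, S, G2 = nmtf(R, k1=4, k2=4)
# One possible embedding: each row is a d-dimensional vector for one node (d = 4 here).
embedding = G1 @ S
```

Each row of `embedding` places one node of the toy network in a 4-dimensional space; in practice the matrix, the ranks and the stopping rule would be chosen for the data at hand.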
2. New methods for Non-Negative Matrix Tri-factorization (NMTF) and related problems (WP2):
To uncover molecular mechanisms and drug indications for specific cancer types, we develop an integrative framework able to harness a wide range of diverse molecular and pan-cancer data. It captures the underlying biology predictive of drug response. To integrate the data, we use three types of matrix factorizations: non-negative matrix factorization (NMF), NMTF, and symmetric NMTF (SNMTF). We published this study as:
ii. T. Gaudelet, N. Malod-Dognin, and N. Pržulj, Integrative Data Analytic Framework to Enhance Cancer Precision Medicine, Network and Systems Medicine 4 (1), 60-73, 2021
In addition, we develop four methods to solve the SNMTF problem. They are based on four theoretical approaches known from the literature: the fixed point method (FPM), the block-coordinate descent with projected gradient (BCD), the gradient method with exact line search (GM-ELS) and the adaptive moment estimation method (ADAM). We offer a software implementation for each of these methods: for the first two, we use Matlab, and for the latter two, Python with the TensorFlow library. We published this work as:
iii. R Hribar, T Hrga, G Papa, G Petelin, J Povh, N Pržulj, V Vukašinović, Four algorithms to solve symmetric multi-type non-negative matrix tri-factorization problem, Journal of Global Optimization 82 (2), 283-312, 2021
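To make the problem setting concrete, the following is a minimal sketch of a simplified, single-matrix symmetric tri-factorization, R ≈ G S G^T with G and S non-negative, solved by projected gradient descent with a simple backtracking step. It is none of the four published algorithms: the joint update of G and S, the step-size rule, the non-negativity of S, and all names and toy data are assumptions made for this example.

```python
import numpy as np

def snmtf_projected_gradient(R, k, n_iter=200, seed=0):
    """Minimise ||R - G S G^T||_F^2 subject to G >= 0 and S >= 0."""
    rng = np.random.default_rng(seed)
    n = R.shape[0]
    G = rng.random((n, k))
    S = rng.random((k, k))

    def loss(G, S):
        return np.linalg.norm(R - G @ S @ G.T) ** 2

    history = [loss(G, S)]
    step = 1e-2
    for _ in range(n_iter):
        E = G @ S @ G.T - R                       # residual
        grad_G = 2.0 * (E @ G @ S.T + E.T @ G @ S)
        grad_S = 2.0 * (G.T @ E @ G)
        accepted = False
        while step > 1e-12:
            # Gradient step followed by projection onto the non-negative orthant.
            G_new = np.maximum(G - step * grad_G, 0.0)
            S_new = np.maximum(S - step * grad_S, 0.0)
            new_loss = loss(G_new, S_new)
            if new_loss < history[-1]:
                accepted = True
                break
            step *= 0.5                           # backtrack
        if not accepted:
            break                                 # no improving step found
        G, S = G_new, S_new
        history.append(new_loss)
        step *= 2.0                               # try a larger step next time
    return G, S, history

# Hypothetical toy data: a random symmetric 0/1 relation matrix.
rng = np.random.default_rng(1)
upper = np.triu(rng.random((30, 30)) < 0.2, k=1).astype(float)
R = upper + upper.T

G, S, history = snmtf_projected_gradient(R, k=3)
```

Because a step is accepted only when it lowers the objective, `history` is strictly decreasing. The multi-type problem sums such terms over several relation matrices that share the factors G.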
Furthermore, we presented another novel method, called Multi-project and Multi-profile joint Non-negative Matrix Factorization, capable of integrating data from different sources, such as experimental and observational multi-omic data. We identified groups of patients and cell lines similar to each other. We predicted the drug profiles for patients and identified genetic signatures for resistant and sensitive tumors to a specific drug. We published the results as:
iv. D. A. Salazar, N. Pržulj, and C. F. Valencia, Multi-project and Multi-profile joint Non-negative Matrix Factorization for cancer omic datasets, Bioinformatics 37 (24), 4801-4809, 2021
3. Data science, combinatorial and algebraic topology algorithms (WP3):
The PI’s group published three new graphlet-based methods enabling the modelling and mining of omics biological networks in the top journals in the field, two in Bioinformatics and one in PLoS ONE, with an impact factor of 3.24.
The first method enables network modelling and graphlet-based mining of the network data with weights on edges that can represent the probability of an interaction occurring in the cell. We show that probabilistic graphlet-based methods more robustly capture biological information in these data while simultaneously showing a higher sensitivity to identifying condition-specific functions than their unweighted graphlet-based method counterparts. We published these results as:
v. S. Doria-Belenguer, M. K. Youssef, R. Böttcher, N. Malod-Dognin, and N. Pržulj, Probabilistic Graphlets Capture Biological Function in Probabilistic Molecular Networks, Bioinformatics (Proceedings of ECCB 2020), September 2020
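The idea of counting graphlets in a network with edge probabilities can be illustrated with the simplest case, the triangle. The following is a minimal sketch, assuming that edges occur independently, so that the expected number of triangles touching a node is the sum over neighbour pairs of the product of the three edge probabilities. The function name and the toy matrix are assumptions made for this example, and the published definition of probabilistic graphlets may differ in detail.

```python
import numpy as np

def expected_triangle_counts(P):
    """Expected number of triangles at each node of a probabilistic network.

    P is a symmetric matrix of edge probabilities; edges are assumed independent.
    """
    P = np.asarray(P, dtype=float).copy()
    np.fill_diagonal(P, 0.0)                 # no self-loops
    # (P^3)[u, u] sums p_uv * p_vw * p_wu over ordered pairs (v, w),
    # so each triangle at u is counted twice.
    return np.diagonal(P @ P @ P) / 2.0

# Hypothetical toy network: a confident triangle 0-1-2 and a weak edge 2-3.
P = np.array([
    [0.0, 0.9, 0.9, 0.0],
    [0.9, 0.0, 0.9, 0.0],
    [0.9, 0.9, 0.0, 0.2],
    [0.0, 0.0, 0.2, 0.0],
])
counts = expected_triangle_counts(P)
```

For a 0/1 adjacency matrix the same function returns the ordinary triangle counts, so the unweighted case is recovered as a special case.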
Furthermore, we introduce a new graphlet-based definition of eigencentrality of genes in a pathway, graphlet eigencentrality. We compute the centrality of genes in a pathway either from the local perspective of the pathway or from the global perspective of the entire network. Our results suggest that by considering different graphlet eigencentralities, we can capture different functional roles of genes in and between pathways. We published this result as:
vi. S.F.L. Windels, N. Malod-Dognin, and N. Pržulj, Graphlet eigencentralities capture novel central roles of genes in pathways, PLoS ONE, https://doi.org/10.1371/journal.pone.0261676, accepted December 7, 2021
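A minimal sketch of the idea for one graphlet, the triangle, follows. It assumes that the graphlet adjacency between two nodes is the number of triangles in which they co-occur, and that the centrality is the leading eigenvector of that matrix. The function names and the toy graph are assumptions made for this example; other graphlets need their own counting rules.

```python
import numpy as np

def triangle_graphlet_adjacency(A):
    """Entry (u, v) is the number of triangles that contain both u and v."""
    A = np.asarray(A, dtype=float)
    # (A @ A)[u, v] counts common neighbours; multiplying by A keeps only adjacent pairs.
    return A * (A @ A)

def eigencentrality(M):
    """Leading eigenvector of a symmetric non-negative matrix, scaled to a maximum of 1."""
    eigenvalues, eigenvectors = np.linalg.eigh(M)   # eigenvalues in ascending order
    leading = np.abs(eigenvectors[:, -1])
    return leading / leading.max()

# Hypothetical toy graph: a triangle 0-1-2 with a pendant node 3 attached to node 2.
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
])
triangle_centrality = eigencentrality(triangle_graphlet_adjacency(A))
edge_centrality = eigencentrality(A.astype(float))
```

In this toy graph the pendant node has a non-zero ordinary eigencentrality but a triangle-based eigencentrality of zero, which shows how different graphlet adjacencies can rank the same nodes differently.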
In addition, we present a hypergraph-based approach for modeling biological systems and formulate vertex classification, edge classification and link prediction problems on (hyper)graphs as instances of vertex classification on (extended, dual) hypergraphs. We introduce a novel kernel method on vertex- and edge-labeled (colored) hypergraphs for analysis and learning. We show its potential use to estimate the interactome sizes in various species. We published this as:
vii. J. Lugo-Martinez, D. Zeiberg, T. Gaudelet, N. Malod-Dognin, N. Pržulj, and P. Radivojac, Classification in biological networks with hypergraphlet kernels, Bioinformatics, btaa768, https://doi.org/10.1093/bioinformatics/btaa768, April 2021
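The reduction of edge-level problems to vertex-level problems rests on the dual hypergraph, in which the roles of vertices and hyperedges are swapped. The following is a minimal sketch of that construction only, assuming a hypergraph stored as a mapping from hyperedge names to vertex sets. The names and the toy complexes are hypothetical, and the extended hypergraph and the kernel itself are not shown.

```python
def dual_hypergraph(hyperedges):
    """Swap the roles of vertices and hyperedges.

    `hyperedges` maps each hyperedge to the set of vertices it contains.
    The result maps each original vertex (now a hyperedge of the dual)
    to the set of original hyperedges (now vertices) that contain it.
    """
    dual = {}
    for edge, vertices in hyperedges.items():
        for vertex in vertices:
            dual.setdefault(vertex, set()).add(edge)
    return dual

# Hypothetical toy hypergraph: two protein complexes sharing protein "B".
complexes = {
    "complex1": {"A", "B", "C"},
    "complex2": {"B", "D"},
}
dual = dual_hypergraph(complexes)
```

Classifying the hyperedges `complex1` and `complex2` in the original hypergraph then amounts to classifying the vertices with the same names in `dual`.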
We plan to use the results described in WP2 and WP3 to improve further the paradigms and methods described in WP1 above.
4. Applications (WP4):
The structure of DNA packing (chromatin) impacts gene expression. Alterations in chromatin structure (CS) have been shown to coincide with the occurrence of cancer. We propose a comparative pipeline to analyze CSs and apply it to study chronic lymphocytic leukemia (CLL). We show that CSs are a rich source of new biological information about DNA that can complement other data types. We published this at the top conference in our field, Intelligent Systems for Molecular Biology (ISMB), with an acceptance rate of around 15%, and with the proceedings in a special issue of the journal Bioinformatics:
viii. N. Malod-Dognin, V. Pancaldi, A. Valencia and N. Pržulj, Chromatin network markers of leukemia, Bioinformatics 36 (Supplement_1), i455-i463, July 2020
To address the challenge of the COVID-19 pandemic, we adapt an explainable ML algorithm for data fusion and utilize it on new omics data on viral–host interactions, human protein interactions, and drugs to understand SARS-CoV-2 infection mechanisms better and predict new drug–target interactions for COVID-19. We published these results in Nature’s journal Scientific Reports, with an impact factor of 4.379:
ix. C. Zambrana, A. Xenos, R. Böttcher, N. Malod-Dognin, and N. Pržulj, Network neighbors of viral targets and differentially expressed genes in COVID-19 are drug target candidates, Scientific Reports 11, 18985, 2021
The ICON-BIO project continues to produce important outcomes. In short, we devised new algorithms for mining and fusing heterogeneous omics data from publicly available databases and from biomedical collaborators and applied them to several tasks of precision medicine. During the reporting period, we published 8 refereed journal papers (and submitted two more to preprint archives and journals), while since the beginning of the project, we have published 22 peer-reviewed journal papers in top-ranked scientific journals, addressing the project’s work packages (WPs). The figure from a representative paper is presented below.
The main published results during the reporting period include the following:
1. We introduce a new machine learning (ML) methodology to explore the functional organization of different tissue-specific and species-specific embedding spaces generated by a Non-negative Matrix Tri-Factorization (NMTF). We use it to compare the most prevalent cancers in humans to their corresponding control tissues and find that cancer alters certain biological functions. We exploit this to predict new cancer-related functions and genes that the currently available methods cannot identify.
2. We applied our iCell ML integration algorithm to bulk and single-cell RNA-seq data and identified biological processes perturbed during senescence (biological aging, the gradual deterioration of functional characteristics in living organisms) and predicted 90 new genes involved in its escape. Our study unravels novel genes and pathways associated with senescence escape after targeted therapy in NRAS mutant melanoma.
3. We develop four methods to solve the symmetric multi-type non-negative matrix tri-factorization (SNMTF) problem, which is of special importance in data science since it serves as a mathematical model for the fusion of different data sources and for data co-clustering.
4. We introduced a new graphlet-based definition of the eigencentrality of genes in a pathway, graphlet eigencentrality, to identify pathways and cancer mechanisms described by a given graphlet adjacency. We showed that different graphlet eigencentralities describe cancer driver genes that play central roles in pathways or the crosstalk between them.
5. We performed a comparative network inferential analysis of the patterns of variables and factors associated with Zika virus infections in Brazil during 2015–2016, coinciding with a microcephaly epidemic, and identified multiple contributing determinants. This advances our understanding of the cumulative interactive effects of exposures to chemical and non-chemical stressors in the built, natural, physical, and social environments on adverse pregnancy and health outcomes in vulnerable populations.
6. To understand the molecular basis of COVID-19 and design therapeutic strategies, we built upon the recently proposed concept of an integrated cell, iCell, fusing three omics, tissue-specific human molecular interaction networks. We applied this methodology to uncover new infection-related genes and strategies for repurposing known drugs to treat this disease.
7. Antithrombin resistance is a rare subtype of hereditary thrombophilia caused by prothrombin gene variants, leading to thrombotic disorders. We proposed an integrative framework to address the lack of genomic samples and support the genomic signal from the full genome sequences of five subjects by integrating it with subjects’ phenotypes and the genes’ molecular interactions. We revealed new gene clusters involved with this rare disease.
Additional results since the beginning of the project include:
8. We introduce new algorithms for network embeddings and demonstrate that genes that are embedded close in these spaces have similar biological functions, so we can extract new biomedical knowledge directly by doing linear operations on their embedding vector representations, demonstrating potential clinical relevance as biomarkers of cancer.
9. We developed an integrative framework able to harness a wide range of diverse molecular and pan-cancer data, using three types of matrix factorizations: non-negative matrix factorization (NMF), non-negative matrix tri-factorization (NMTF), and symmetric NMTF (SNMTF).
10. We presented another novel method, called Multi-project and Multi-profile joint Non-negative Matrix Factorization, capable of integrating data from different sources, predicting the drug profiles for patients, and identifying genetic signatures for resistant and sensitive tumors to a specific drug.
11. We introduce probabilistic graphlets as a tool for analyzing the local wiring patterns of probabilistic networks, showing a higher sensitivity to identify condition-specific functions compared to the unweighted graphlet-based method counterparts.
12. We introduce a new graphlet-based definition of eigencentrality of genes in a pathway, graphlet eigencentrality, to identify pathways and cancer mechanisms.
13. We present a hypergraph-based approach for modeling biological systems and formulate vertex classification, edge classification and link prediction problems on (hyper)graphs as instances of vertex classification on (extended, dual) hypergraphs.
14. We model the chromatin of the affected and control chronic lymphocytic leukemia (CLL) cells as networks and analyze the network topology by our state-of-the-art methods. We show the existence of structural markers of cancer related DNA elements in the chromatin.
15. We adapt an explainable artificial intelligence algorithm for data fusion and utilize it on new omics data on viral–host interactions, human protein interactions, and drugs to better understand SARS-CoV-2 infection mechanisms and predict new drug–target interactions for COVID-19.
16. We proposed new neural networks with structures inspired by the multi-scale organization of a cell. We showed that these models are able to correctly predict the diagnosis for the majority of the patients by analyzing their differential gene expression data.
17. We generalize spectral embedding, spectral clustering and network diffusion. Applying Graphlet Laplacian based spectral embedding, we demonstrate that Graphlet Laplacians capture biological functions.
18. We propose a novel, data-driven concept of an integrated cell, iCell. We introduce a computational prototype of an iCell, which integrates three omics, tissue-specific molecular interaction network types.
19. To model the multi-scale organization of complex biological systems, we utilize simplicial complexes from computational geometry.
20. We propose a new, multi-scale, protein interaction hyper-network model that utilizes hypergraphs to capture different scales of protein organization.
1. We provided several abstractions for fusing heterogeneous types of omics data about a cell and implemented the first prototype of an integrated cell, iCell. We are currently furthering this work towards better biological (omics) data models, data analytics algorithms and their applications.
2. We designed new machine learning methods for the integration/fusion of the multi-scale omics data. We are currently working on furthering these methods for improved biological quality and computational efficiency. Also, we are working on the software package that will encompass all of our currently available methods in this realm and that will be made open source.
3. We constructed new data science, combinatorial and algebraic topology algorithms for modelling the multi-scale organization of the cellular omics data. They were based on modelling the data by graphs, hypergraphs and abstract simplicial complexes. In addition, we constructed a new graphlet-based method enabling modelling and mining of omics biological networks with weights on edges and furthered other graphlet-based methods. We plan to use the results described in points 2 and 3 here to further improve the abstractions described in point 1 above.
4. We applied our new abstractions and methods to various forms of cancer and to COVID-19. We plan to apply them further to rare genetic disease data, additional COVID-19 and cancer data, and other related precision medicine applications.
Dissemination of the Outputs:
These research outputs were disseminated at numerous scientific events and institutions. In particular, we presented the results by giving many invited and contributed talks since the beginning of the project on April 1st, 2018. The PI gave 52 invited/keynote/plenary talks and the lab members gave a number of contributed talks at the top conferences/institutions in our field. The details are at https://life.bsc.es/iconbi/docs/NP_CV.pdf and on the web pages of the lab members.