Artificial intelligence, branching processes and coalescent –
Searching the Information from a genetic Cornucopia

Final Report Summary - ARSINFORMATICA (Artificial intelligence, branching processes and coalescent – Searching the Information from a genetic Cornucopia)

A summary description of the project objectives
The main objective of the multidisciplinary ArSInformatiCa project was to reinforce the international dimension of the research career of a European computer scientist by training him in complementary skills at a world-class research centre, the George R. Brown School of Engineering at William M. Rice University in Houston, USA. Transferring the knowledge gained by the researcher at the outgoing host to the European return host, the Silesian University of Technology in Gliwice, Poland, was also an important objective of the project. These main goals defined three research objectives: (1) modelling human evolution using stochastic methods to understand the gene flow between archaic human populations responsible for the observed variation patterns; (2) developing advanced machine learning (ML) based techniques and computer simulation models to understand the role of natural selection at the molecular level in the development of cancer; and (3) studying the common disease – rare variant (CDRV) paradigm. In addition, the project focused on collaboration between Rice University in Houston, in particular the Statistics Department established in the George R. Brown School of Engineering, and research institutions in the European Union, in particular the Silesian University of Technology and other institutions located in Poland, but also in other parts of Europe.

A description of the work performed in the project and main results achieved
The work within the ArSInformatiCa project was divided into three research work packages, complemented by a networking training work package. As envisaged in the work flow chart, all work packages were started during the outgoing phase of the project, which gave the opportunity to discuss the scope of the whole research with the scientific staff of Rice University. In accordance with the planned distribution of the research over time, all work packages were completed during the return phase, implemented at the Silesian University of Technology. The first work package focused on modelling human evolution, the second concerned machine learning and coalescent methods for an evolutionary understanding of the genetics of cancer, and the third addressed two models of parallel implementation of a whole-genome simulator. The networking training work package was performed by the fellow outside the premises of the hosts. The main results achieved in the ArSInformatiCa project include journal papers and presentations at international scientific conferences in Europe, the USA, and Africa.
Two international conference papers were published as a result of the research in work package WP1 (they were presented at conferences in Paris, France, and Cape Town, South Africa). Training in advanced methods of population genetics and biostatistics, together with the study of various models of human population evolution, formed the basis for developing advanced versions of scientific software for calculating the time to coalescence in branching processes. This software was used for forward-in-time simulation of the evolution of a human population modelled by a slightly-critical branching process. In addition, a Bayesian model of genetic drift in branching processes was formulated. It was applied to explain the elimination of the hypothetical admixture of Neandertal mtDNA from the gene pool of Upper Palaeolithic anatomically modern humans. The model was used in conjunction with recent data from the Neandertal Genome Project, and was supplemented with prior distributions, to estimate the most probable amount of Neandertal mtDNA admixture in the mtDNA gene pool of the H. sapiens population at that time. The Multi-Null-Hypotheses (MNH) method was also developed further, and its predictions were verified against Wall’s neutrality tests.
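The idea of forward-in-time simulation of genetic drift in a branching process can be illustrated with a minimal sketch. It is not the project's software and does not implement its Bayesian model. The function names, the binary-splitting offspring distribution (0 or 2 offspring), and all parameter values are assumptions made only for illustration. Two mtDNA lineage types, labelled here "modern" and "archaic", evolve independently, and the sketch counts how often a small archaic admixture disappears by drift alone.

```python
import random


def next_generation(count, p_two, rng):
    """Each individual leaves 2 offspring with probability p_two, else 0."""
    return 2 * sum(1 for _ in range(count) if rng.random() < p_two)


def simulate_lineages(n_modern, n_archaic, mean_offspring, generations, rng):
    """Evolve two mtDNA lineage counts forward in time, independently."""
    # Assumed offspring law: binary splitting with mean 2 * p_two.
    p_two = mean_offspring / 2.0
    for _ in range(generations):
        n_modern = next_generation(n_modern, p_two, rng)
        n_archaic = next_generation(n_archaic, p_two, rng)
    return n_modern, n_archaic


def archaic_loss_fraction(n_modern, n_archaic, mean_offspring,
                          generations, replicates, seed=1):
    """Among surviving populations, the fraction that lost the archaic type."""
    rng = random.Random(seed)
    survived = lost = 0
    for _ in range(replicates):
        modern, archaic = simulate_lineages(
            n_modern, n_archaic, mean_offspring, generations, rng)
        if modern + archaic > 0:
            survived += 1
            if archaic == 0:
                lost += 1
    return lost / survived if survived else float("nan")


if __name__ == "__main__":
    # Hypothetical scenario: 5% initial archaic admixture, mean offspring 1.01.
    print(archaic_loss_fraction(190, 10, 1.01, 100, 200))
```

With a mean offspring number close to 1, most founding lineages die out within a modest number of generations, which is why a small initial admixture can vanish without any selection against it.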
Three presentations (one given at an international conference in Las Vegas, USA, one in Ustron, Poland, and one at a Scientific Seminar in Wisla, Poland) and one application for a European patent were the results of the work performed in work package WP2. The first presentation concerned the verification of the MNH method as an expert knowledge generator in a search for natural selection in genes implicated in human familial cancers. The results confirmed the suitability of the MNH method in that role. The method is computationally demanding, because intensive computer simulations are needed to obtain critical values of the neutrality tests applied against modified null hypotheses, and it is therefore not applicable to a wide search across many genes. It does, however, generate accurate knowledge that machine learning methods can generalize and use for fast machine-learning-based inference. This latter topic was presented at an international conference in Poland, with proceedings published by Springer in the series “Advances in Intelligent Systems and Computing”. Finally, machine learning methods (descriptive and predictive) applied to the search for recessive cancer genes (RCG) using data from the Cancer Genome Project were reported during the Scientific Seminar in Wisla, Poland. Last but not least, as a result of work package WP2, the fellow, together with the return host, filed an application for a European patent to protect his invention of a device for natural selection testing with increased accuracy of detection.
As a result of work package WP3, two papers were published in a scientific journal. These papers describe how parallelization was introduced into GENOME, an originally sequential, coalescent-based whole-genome simulator. Two methods were considered: multithreading (distributed processing with shared memory, the tightly coupled model) and the message passing interface (MPI) method (distributed processing with local memories, the loosely coupled model). GENOME was chosen as the basis for parallelization because it had previously been used to generate synthetic associations, which are reported to be responsible for problems with the interpretation of genome-wide association studies (GWAS). Synthetic associations were also studied in work package WP3, and the phenomenon was confirmed to be particularly strong if the gene under consideration has evolved under the pressure of balancing selection.
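The shared-memory (multithreaded) decomposition described above can be sketched on a much simpler problem than GENOME. The sketch does not reproduce how GENOME was parallelized. It assumes a standard Kingman coalescent for a sample of n lineages, with time measured in units where each pair of lineages coalesces at rate 1, and splits independent replicates across a thread pool. The names `tmrca` and `run_replicates` are hypothetical. Each replicate gets its own seeded generator, so the result does not depend on the number of workers.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def tmrca(sample_size, seed):
    """Time to the most recent common ancestor of one coalescent replicate."""
    rng = random.Random(seed)  # private generator: no shared mutable state
    total = 0.0
    for k in range(sample_size, 1, -1):
        rate = k * (k - 1) / 2.0  # k lineages form k*(k-1)/2 pairs
        total += rng.expovariate(rate)
    return total


def run_replicates(sample_size, n_replicates, n_workers, base_seed=0):
    """Distribute independent replicates over a pool of threads."""
    seeds = [base_seed + i for i in range(n_replicates)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map returns results in the order of the seeds.
        return list(pool.map(lambda s: tmrca(sample_size, s), seeds))


if __name__ == "__main__":
    times = run_replicates(sample_size=10, n_replicates=2000, n_workers=4)
    # Under these assumptions the expected value is 2 * (1 - 1/n).
    print(sum(times) / len(times))
```

In CPython the global interpreter lock limits the speed-up of CPU-bound threads, so the sketch shows only the decomposition of the work. In a message-passing variant, each process would simulate its own block of seeds in local memory and send the results to one collecting process.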
In addition to the scientific results, the networking work package WP4 strengthened scientific collaboration between Rice University in the USA and four Polish research institutions: the Silesian University of Technology, the Centre of Oncology in Gliwice, the University of Wroclaw, and the Medical University of Lodz. This activity resulted in two joint grant proposals on multidisciplinary research in the evolution of cancer. The networking work package also contributed to collaboration between the Silesian University of Technology and two other European institutions: University d’Evry in France and the University of Vienna in Austria.

The expected final results and their potential impact and use (including the socio-economic impact and the wider societal implications of the project)
The research results will have a scientific impact in two fields of information sciences: artificial intelligence (in particular machine learning) and distributed computer simulations applied to retrieve meaningful information from genetic data. Retrieving such information is an important challenge for information sciences, and it motivated the goal of the ArSInformatiCa project. The progress achieved in machine learning should not, however, be limited to target groups interested in genetic applications. The methods developed were verified by application to human and cancer genetics, but they have the potential to benefit the wider, general context of data mining. An example is the further development of the applicant’s rule-based method known as the quasi-dominant rough set approach (QDRSA). While this method has been tested in a search for signatures of natural selection at the molecular level in genes involved in human familial cancers, it is expected to become a general machine learning approach with an impact on the development of rough sets research. In a wider, socio-economic perspective, by promoting long-term collaboration between Rice University and European research institutions, the ArSInformatiCa project has also contributed to the excellence of the European Research Area. Research inspired by the project results is envisaged to continue after the project's completion. Moreover, the scale of this research is expected to grow by including other researchers from the Silesian University of Technology and Rice University.
The project web page: www.arsinformatica.eu