Periodic Reporting for period 1 - AMX DATA (Identification of risk factors for the development of colorectal cancer and / or premalignant lesions using Big Data tools and Artificial Intelligence for early cancer detection)
Reporting period: 2020-10-01 to 2021-12-31
The principal objective of AMX Data is to identify the relationship between a number of indicators present in patients' medical histories and a high risk of developing colorectal cancer (CRC) and/or premalignant lesions.
As secondary objectives, the AMX Data project will serve to:
Establish a predictive model that improves an individual's risk identification for CRC.
Establish a predictive model that improves the identification of an individual's risk of premalignant lesions (specifically advanced adenomas).
Enable AMADIX to anticipate the disease and treat in time the group of healthy individuals who currently do not have access to screening programs and have a higher potential to develop CRC, so that they do not develop the disease.
The main results achieved so far relate mostly to:
- Data structuring and loading: The data already held by AMADIX came from different sources and documents with different structures, so structuring the data required special attention and was crucial for an accurate analysis. The data were cleaned and structured in a number of steps: data consolidation, deletion of duplicate data sets, handling of missing data, creation of a primary key for patient identification, and database design and creation were all performed by the IA.
- Identification of new CRC risk factors: The IA tested various machine learning algorithms, created new features through feature engineering, and analyzed molecular features and their combinations, then tested them with machine learning algorithms. For the electronic health records (EHRs), the main tasks were creating new features, splitting the data into training and testing datasets, and testing machine learning procedures using the clinical features present in the EHRs.
- Data analysis to establish correlations between the different risk factors and the appearance of the illness, including statistical verification of the sample data (CRC, AA, control and random), correction of possible sample bias, principal component analysis (PCA) to evaluate the different variables, graph analysis, itinerary analysis, and statistical analysis of the variables. The risk factors were examined in several univariate and multivariate ways. Correlations between the features were examined, as was the variance inflation factor, to check for multicollinearity between the features. Healthy individuals and patients affected by CRC and AA were compared across the features using hypothesis tests. Exploratory data analysis was used to look for patterns and to summarize the dataset's main characteristics beyond what modeling and hypothesis testing reveal.
- Clinical analysis of the variables: The validity of all results was checked against recent published research on the subject.
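As an illustration of the multicollinearity check mentioned in the list above, the following is a minimal sketch of a variance inflation factor (VIF) computation. It is not the project's actual code: the function names, the feature names (age, bmi, hemoglobin) and the values are hypothetical and serve only to show the calculation. The sketch relies on the standard identity that the VIF of feature j, defined as 1 / (1 - R_j^2) with R_j^2 obtained by regressing feature j on the other features, equals the j-th diagonal entry of the inverse of the feature correlation matrix.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity: [A | I] is reduced to [I | A^-1].
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular correlation matrix (perfect collinearity)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def variance_inflation_factors(columns):
    """Return {feature name: VIF} for a dict mapping names to value lists."""
    names = list(columns)
    corr = [[pearson(columns[a], columns[b]) for b in names] for a in names]
    inverse = invert(corr)
    # The diagonal of the inverse correlation matrix holds the VIFs.
    return {name: inverse[i][i] for i, name in enumerate(names)}


if __name__ == "__main__":
    # Made-up values for illustration only; not project data.
    features = {
        "age": [50, 60, 55, 70, 65, 45],
        "bmi": [24, 30, 27, 31, 29, 22],
        "hemoglobin": [13.5, 12.1, 14.0, 11.8, 13.0, 14.2],
    }
    for name, vif in variance_inflation_factors(features).items():
        print(f"{name}: VIF = {vif:.2f}")
```

A VIF close to 1 indicates that a feature is nearly uncorrelated with the others, while large values indicate that it is largely explained by them. The threshold at which a feature is dropped or combined is a modeling choice, and the report does not state which one was used.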
The identified risk factors still need to be validated, as part of an algorithm, in an independent cohort.
Moreover, AMX Data will have an impact on decreasing CRC mortality by enabling patients to be identified and treated at earlier stages of the disease.