First, we gathered data from public databases to build a dataset large enough to be representative of the overall patient population. We collected about 12,500 tumor characteristics for about 4,000 patients with breast cancer, together with their related follow-up information.
Then, we tested mathematical tools, and their tuning, to analyze the dataset. Among existing machine learning (ML) algorithms, we used predictive algorithms. These perform supervised analysis, which consists of building a mathematical model from known variables in order to predict an output. Here, the variables are the gene expression levels that reflect the components (characteristics) of the tumor, and the output is the response to treatment. We built hundreds of prediction models from the tumor characteristics of the 4,000 patients by testing several ML algorithms and several tuning parameters for each.
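As an illustration of how testing several algorithms with several tuning parameters each quickly yields many candidate models, here is a minimal sketch in Python. The algorithm names, parameter names and values are hypothetical placeholders chosen for the example; they are not the ones used in the project.

```python
from itertools import product

# Hypothetical search space: algorithm names and parameter values are
# illustrative assumptions, not the project's actual settings.
search_space = {
    "random_forest": {"n_trees": [100, 500], "max_depth": [5, 10, None]},
    "svm": {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]},
    "elastic_net": {"alpha": [0.01, 0.1, 1.0], "l1_ratio": [0.2, 0.5, 0.8]},
}

def enumerate_configurations(space):
    """List every (algorithm, parameter combination) pair in the search space."""
    configs = []
    for algorithm, grid in space.items():
        names = list(grid)
        # Cartesian product of the value lists gives all parameter combinations.
        for values in product(*(grid[name] for name in names)):
            configs.append({"algorithm": algorithm, "params": dict(zip(names, values))})
    return configs

configurations = enumerate_configurations(search_space)
print(len(configurations), "candidate models to train and compare")
```

Each configuration would then be trained and evaluated separately, so a modest grid per algorithm is enough to reach hundreds of models.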
To test whether the models performed well, we split our dataset into two parts. The first part was used to train the algorithms to predict the outcome. Once training was complete, we tested the prediction performance of each model on the second part of the dataset, for which the response to treatment is also known. In this way, we can compare the response predicted by the mathematical models with the known treatment response and determine whether the models perform well.
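The train-and-test procedure described above can be sketched as follows. This is a minimal, self-contained illustration on synthetic data: the nearest-centroid classifier, the 70/30 split, the function names and the simulated values are all assumptions made for the example and stand in for the actual ML algorithms and patient data.

```python
import random

def split_dataset(samples, labels, test_fraction=0.3, seed=0):
    """Shuffle the samples and split them into a training part and a test part."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = [samples[i] for i in train_idx], [labels[i] for i in train_idx]
    test = [samples[i] for i in test_idx], [labels[i] for i in test_idx]
    return train, test

def fit_centroids(samples, labels):
    """Toy training step: compute the mean profile of each response class."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(centroids, x):
    """Predict the class whose mean profile is closest to the sample."""
    def squared_distance(center):
        return sum((a - b) ** 2 for a, b in zip(x, center))
    return min(centroids, key=lambda y: squared_distance(centroids[y]))

def accuracy(centroids, samples, labels):
    """Fraction of samples whose predicted response matches the known one."""
    hits = sum(predict(centroids, x) == y for x, y in zip(samples, labels))
    return hits / len(labels)

# Synthetic stand-in data: 200 "patients", 3 "expression levels" each.
# Only the first variable carries information about the response (0 or 1).
rng = random.Random(42)
samples, labels = [], []
for i in range(200):
    y = i % 2
    samples.append([rng.gauss(4.0 * y, 1.0), rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)])
    labels.append(y)

(train_x, train_y), (test_x, test_y) = split_dataset(samples, labels)
model = fit_centroids(train_x, train_y)          # learn on the first part only
test_accuracy = accuracy(model, test_x, test_y)  # evaluate on the held-out part
print(f"held-out accuracy: {test_accuracy:.2f}")
```

The key point is that the test samples are never seen during training, so the measured accuracy reflects how the model behaves on new patients rather than how well it memorized the training set.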
Then, we extracted candidate biomarkers from the best prediction models. When a model is built, not all variables contribute equally to predicting the outcome. For each model, the variables are ranked according to their importance in the prediction, and the best-ranked variables are the ones that can be tested as potential biomarkers. Using these importance rankings, we selected about 350 characteristics, out of the 12,500 measured, that the ML analysis identified as important. By analyzing these characteristics, we identified actors of neural development as tumor components linked to a poor response to hormone therapy in breast cancer; they had never been highlighted as such before. We confirmed the importance of these components in the response to treatment in several independent clinical datasets. This indicates that our models are representative of the wider patient population, and not only of the specific group that we studied.
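One possible way to turn per-model importance rankings into a list of candidate biomarkers is sketched below. The selection rule (keep variables that fall in the top k of at least a minimum number of models), the function names, the gene names and the importance scores are all assumptions made for illustration; the project's actual selection criteria are not detailed here.

```python
from collections import Counter

def top_features(importances, k):
    """Return the names of the k variables with the highest importance in one model."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    return ranked[:k]

def recurrent_candidates(models, k, min_models):
    """Keep variables that rank in the top k of at least min_models models."""
    counts = Counter()
    for importances in models:
        counts.update(top_features(importances, k))
    selected = [name for name, n in counts.items() if n >= min_models]
    # Most frequently selected first; ties broken alphabetically.
    return sorted(selected, key=lambda name: (-counts[name], name))

# Hypothetical importance scores from three models (placeholder gene names).
models = [
    {"GENE_A": 0.40, "GENE_B": 0.30, "GENE_C": 0.20, "GENE_D": 0.10},
    {"GENE_A": 0.35, "GENE_C": 0.33, "GENE_D": 0.30, "GENE_B": 0.02},
    {"GENE_C": 0.50, "GENE_A": 0.25, "GENE_D": 0.15, "GENE_B": 0.10},
]

candidates = recurrent_candidates(models, k=2, min_models=2)
print(candidates)  # ['GENE_A', 'GENE_C']
```

Requiring a variable to recur across several models is one way to reduce the risk of selecting a characteristic that matters only because of the quirks of a single algorithm.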
We have also implemented a deep learning algorithm to create virtual patients in order to enlarge patient cohorts. For this, we used a type of algorithm called a GAN (generative adversarial network). GANs were initially used to create images that look like photographs of human faces, even though these faces do not belong to any real person (see examples at https://thispersondoesnotexist.com/). We adapted the algorithm to create synthetic tumor datasets mimicking the 12,500 measured tumor characteristics, and were able to create virtual tumor samples that could not be distinguished from real tumor samples.
The PredAlgoBC project provides clinicians with a ready-to-use biomarker signature that helps distinguish patients who should receive hormone therapy from those who should not. It also led to the identification of neuronal components as potential new targets for breast cancer therapy.
The results are currently under review for publication in a scientific journal.
The algorithm pipelines, which allow others to run the same analysis on other datasets, are publicly available.
The GAN algorithm was developed by a master's student in the lab, who obtained a grant to start a PhD on further developing the use of GANs in the cancer field.
The identification of the nervous system as an important tumor component in the response to hormone therapy led us to start a collaboration to investigate this involvement in more depth through bench work.