Project description
Finding the hidden proteins in mass spectrometry
Proteomics, the study of proteins using mass spectrometry, offers valuable insights into cell functions. However, many protein forms remain undetected due to their complexity, limiting our understanding of diseases and potential treatments. Existing machine learning models for proteomics often fail to fully analyse mass spectra and are not easily interpretable. This lack of transparency hinders their application in clinical settings. With this in mind, the ERC-funded explAInProt project aims to develop explainable, end-to-end machine learning models to analyse complex proteomic data. By combining new sequencing techniques and nanopore devices, the project will uncover hidden proteins and structural variants, advancing research in disease detection and microbial discovery.
Objective
Mass spectrometry-driven proteomics allows deep insights into the workings of cells. Still, the vast majority of proteoforms, which represent the full heterogeneity of molecular forms of protein products in a sample, currently remain undetected in proteomics experiments. This lack of information strongly restricts our knowledge of disease progression, possible biomarkers, and therapeutic targets across a large number of diseases. Several machine learning approaches have been developed for proteomics data, but because they are not trained end-to-end, they cannot capture the full wealth of proteomic mass spectra and commonly remain unexplained black boxes. Within explAInProt, my team and I will develop representations of spectra that allow explainable, end-to-end machine learning models to be deployed on the wealth of proteomic data available, covering both bottom-up and top-down spectra, to identify novel protein variants. Explanations will make it possible to identify the origin of predictions, reduce bias, and build up the trustworthiness of AI systems required for clinical applications. To verify results, we will pioneer orthogonal real-time strategies based on selective sequencing approaches and amino acid calling, which we will introduce for nanopore sequencing devices as a complementary acquisition method. Combined, this will allow us to drastically increase our knowledge about the current dark matter of mass spectrometry-driven proteomics: those proteins and peptides that remain undetected in current analyses because they are non-canonically modified, are non-tryptic, potentially carry multiple amino acid substitutions, have no close match in databases, or result from structural variants such as fusion proteins. We will highlight applicability in two areas of particular concern in current approaches: the detection of structural variants in proteomic mass spectra and the characterization of novel microbial organisms without sufficient database information.
Fields of science (EuroSciVoc)
CORDIS classifies projects with EuroSciVoc, a multilingual taxonomy of fields of science, through a semi-automatic process based on NLP techniques. See: https://op.europa.eu/en/web/eu-vocabularies/euroscivoc.
- natural sciences > biological sciences > biochemistry > biomolecules > proteins > proteomics
- natural sciences > computer and information sciences > databases
Keywords
Programme(s)
- HORIZON.1.1 - European Research Council (ERC) Main Programme
Funding Scheme
HORIZON-ERC - HORIZON ERC Grants
Host institution
14482 Potsdam
Germany