CORDIS - EU research results

Learning Isoform Fingerprints to Discover the Molecular Diversity of Life

Periodic Reporting for period 1 - ORIGIN (Learning Isoform Fingerprints to Discover the Molecular Diversity of Life)

Reporting period: 2023-06-01 to 2025-11-30

The ORIGIN project—Learning Isoform Fingerprints to Discover the Molecular Diversity of Life—is motivated by a fundamental challenge in biology and biomedical research: the accurate identification and quantification of protein isoforms. These isoforms, arising from alternative splicing, post-translational modifications, and other molecular processes, represent a key layer of biological complexity with significant implications for understanding health, disease, and therapeutic targets. Despite advances in mass spectrometry-based proteomics, current methodologies fall short in capturing this diversity due to a heavy reliance on MS2 spectra and limited utilization of the rich MS1 data.
ORIGIN aims to overcome these limitations by developing deep learning models that can learn isoform-specific peptide intensity patterns—referred to as "isoform fingerprints"—from MS1 signals. The project unites state-of-the-art artificial intelligence (AI), large-scale data curation, novel acquisition strategies, and software engineering into a cohesive framework. The goal is to enable the direct identification and quantification of protein isoforms at scale and with unprecedented precision, transforming both basic biological research and clinical proteomics.
At its core, ORIGIN addresses several pressing needs: (a) enhanced sensitivity and throughput in proteomics workflows, particularly for large-cohort or single-cell studies, (b) robust, generalizable models and benchmarks for peptide property prediction, and (c) tools that leverage relative peptide abundance patterns for isoform-level analysis.
During the reporting period, the ORIGIN project made substantial technical and scientific progress toward enabling isoform-resolved proteomics through the development of predictive models, data infrastructure, and novel mass spectrometry workflows. Central to this effort was the systematic curation and harmonization of large-scale datasets capturing key peptide properties, including intensity patterns, retention times, and charge state distributions. These datasets supported the development of multiple deep learning models, such as pfly for peptide detectability prediction and PepSi-Print for the prediction of relative intensity patterns of peptides originating from the same protein. The latter introduced a novel approach based on pairwise peptide intensity ratios using a Siamese neural network, addressing the lack of clean training data and enabling robust isoform fingerprinting directly from MS1 signals.
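To make the idea of pairwise peptide intensity ratios concrete, the following minimal sketch computes log2 intensity ratios for every pair of peptides assigned to one protein. It is an illustration only and not PepSi-Print's implementation: the function name `pairwise_log_ratios`, the peptide labels, and the intensity values are hypothetical, and the Siamese network that would predict such ratios from peptide sequences is not shown.

```python
import math
from itertools import combinations


def pairwise_log_ratios(intensities):
    """Return log2(intensity_a / intensity_b) for each unordered peptide pair.

    `intensities` maps a peptide label to its MS1 intensity for one protein.
    Pairs are formed in alphabetical order of the peptide labels, and any
    pair containing a non-positive intensity is skipped.
    """
    ratios = {}
    for (pep_a, int_a), (pep_b, int_b) in combinations(sorted(intensities.items()), 2):
        if int_a <= 0 or int_b <= 0:
            continue  # a log ratio is undefined for zero or negative intensity
        ratios[(pep_a, pep_b)] = math.log2(int_a / int_b)
    return ratios


# Hypothetical intensities for three peptides of the same protein.
example = {"PEPTIDEA": 8.0e6, "PEPTIDEB": 2.0e6, "PEPTIDEC": 1.0e6}
ratios = pairwise_log_ratios(example)

for pair, value in ratios.items():
    print(pair, round(value, 3))
```

Working with ratios rather than absolute intensities removes the dependence on overall protein abundance, which is one plausible reason such a representation suits training data of mixed quality.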
In parallel, the project advanced MS1-centric analysis strategies to reduce reliance on MS2 identifications. This included the creation of SWAPS, a framework for propagating peptide identities across experiments, and Jigsaw, a novel acquisition strategy. Jigsaw enhances peptide identification diversity under high-throughput conditions, particularly for short-gradient runs. These computational and experimental advances are supported by open-source software tools such as Koina for online model inference, dlomix for model training, and Oktoberfest for spectral library generation and rescoring.
Together, these efforts form a cohesive infrastructure that supports reproducible, scalable, and accurate isoform identification. The results achieved so far lay the technical foundation for transforming proteomics workflows and advancing our understanding of the molecular diversity encoded in protein isoforms.
The results of the ORIGIN project have the potential to substantially impact both fundamental research and applied biomedical sciences by enabling high-resolution, isoform-specific proteomic analysis at scale. By shifting the analytical focus from MS2-dependent workflows to MS1-centric strategies enhanced by deep learning, ORIGIN introduces a paradigm potentially capable of improving sensitivity, reproducibility, and throughput in complex proteomic studies, including those involving large cohorts or single cells. The models, datasets, and software developed throughout the project—particularly PepSi-Print, SWAPS, and Koina—form a robust technological foundation for isoform fingerprinting, with demonstrated utility across diverse experimental conditions.
To ensure further uptake and long-term success, continued validation of these tools across biological and clinical settings is essential. Broader adoption may benefit from dedicated demonstration studies in translational applications, especially in disease-specific proteomics where isoform resolution can reveal clinically actionable insights. While the open-source nature of the infrastructure lowers technical barriers, integration into commercial software pipelines and compatibility with evolving mass spectrometry hardware will help accelerate mainstream adoption. Additional needs include sustained access to high-quality data for model refinement, support from international collaborations to test interoperability across laboratories, and frameworks to manage data standardization and reproducibility.
As the project progresses toward completion, the cumulative results form a coherent and impactful system of models, datasets, and workflows for isoform-resolved proteomics. These outcomes may offer a scalable path toward deeper molecular characterization in both research and clinical proteomics, with the potential to redefine how protein diversity is measured and understood.