Periodic Reporting for period 1 - ORIGIN (Learning Isoform Fingerprints to Discover the Molecular Diversity of Life)
Reporting period: 2023-06-01 to 2025-11-30
ORIGIN aims to overcome these limitations by developing deep learning models that can learn isoform-specific peptide intensity patterns—referred to as "isoform fingerprints"—from MS1 signals. The project unites state-of-the-art artificial intelligence (AI), large-scale data curation, novel acquisition strategies, and software engineering into a cohesive framework. The goal is to enable the direct identification and quantification of protein isoforms at scale and with unprecedented precision, transforming both basic biological research and clinical proteomics.
At its core, ORIGIN addresses several pressing needs: (a) enhanced sensitivity and throughput in proteomics workflows, particularly for large-cohort or single-cell studies, (b) robust, generalizable models and benchmarks for peptide property prediction, and (c) tools that leverage relative peptide abundance patterns for isoform-level analysis.
In parallel, the project advanced MS1-centric analysis strategies to reduce reliance on MS2 identifications. This included the creation of SWAPS, a framework for propagating peptide identities across experiments, and Jigsaw, a novel acquisition strategy. Jigsaw enhances peptide identification diversity under high-throughput conditions, particularly for short-gradient runs. These computational and experimental advances are supported by open-source software tools such as Koina for online model inference, dlomix for model training, and Oktoberfest for spectral library generation and rescoring.
Together, these efforts form a cohesive infrastructure that supports reproducible, scalable, and accurate isoform identification. The results achieved so far lay the technical foundation for transforming proteomics workflows and advancing our understanding of the molecular diversity encoded in protein isoforms.
To ensure further uptake and long-term success, continued validation of these tools across biological and clinical settings is essential. Broader adoption may benefit from dedicated demonstration studies in translational applications, especially in disease-specific proteomics where isoform resolution can reveal clinically actionable insights. While the open-source nature of the infrastructure lowers technical barriers, integration into commercial software pipelines and compatibility with evolving mass spectrometry hardware will help accelerate mainstream adoption. Additional needs include sustained access to high-quality data for model refinement, international collaborations to test interoperability across laboratories, and frameworks to manage data standardization and reproducibility.
As the project progresses toward completion, the cumulative results form a coherent system of models, datasets, and workflows for isoform-resolved proteomics. These outcomes may offer a scalable path toward deeper molecular characterization in both research and clinical proteomics, with the potential to redefine how protein diversity is measured and understood.