CORDIS - EU research results

Sustainable Training of Code Language Models through Data Refinement

Project description

A solution to sustainably train code language models

As large language models (LLMs) transform software engineering, their energy consumption becomes a pressing issue. These models, trained on vast datasets from platforms like GitHub, offer invaluable assistance, but at a significant environmental cost. The sheer volume of data required results in substantial CO2 emissions, challenging the sustainability of LLMs. Supported by the Marie Skłodowska-Curie Actions (MSCA) programme, the condenSE project proposes an approach to reducing the data used for training code language models. Specifically, the reduction aims to decrease energy usage without compromising effectiveness. Its innovative approach aligns with EU Green Deal goals and UN Sustainable Development Goals. The project’s solution is a step towards a greener tech future.

Objective

Large language models (LLMs) have gained widespread attention and user adoption. These models, when trained on source code from platforms like GitHub, acquire a deep understanding of both the semantic and syntactic structures of code (i.e. code language models or CLMs). This understanding has paved the way for significant advancements in software engineering, offering developers valuable assistance in labor-intensive tasks like bug fixing and code writing. However, the massive data requirements of CLMs result in substantial energy consumption and CO2 emissions.

This proposal challenges the conventional wisdom that "more data is better" and instead advocates for a refined approach to data in the training of CLMs. We propose that by intentionally decreasing training data volume while simultaneously enhancing data quality through data refinement techniques, we can reduce energy consumption while maintaining or even improving performance on software engineering tasks. The condenSE project represents a pioneering effort to advance sustainable training practices for CLMs. Unlike existing methods, which are often non-systematic or limited to natural languages, condenSE promises a comprehensive approach to achieve sustainability via data refinement for CLMs.

This initiative is well aligned with the EU Green Deal and the UN Sustainable Development Goals, and the growing attention to LLMs and CLMs means that now is the right time to address their sustainability. The proposal's potential for success is further strengthened by the host institution's international standing, which provides a wide range of collaborative opportunities, as well as by the complementary expertise of the applicant and supervisor, spanning the fields of software engineering, machine learning, dataset creation, and language model application.

Fields of science (EuroSciVoc)

CORDIS classifies projects with EuroSciVoc, a multilingual taxonomy of fields of science, through a semi-automatic process based on NLP techniques. See: The European Science Vocabulary.

Keywords

Project’s keywords as indicated by the project coordinator. Not to be confused with the EuroSciVoc taxonomy (Fields of science)

Programme(s)

Multi-annual funding programmes that define the EU’s priorities for research and innovation.

Topic(s)

Calls for proposals are divided into topics. A topic defines a specific subject or area for which applicants can submit proposals. The description of a topic comprises its specific scope and the expected impact of the funded project.

Funding Scheme

Funding scheme (or “Type of Action”) inside a programme with common features. It specifies: the scope of what is funded; the reimbursement rate; specific evaluation criteria to qualify for funding; and the use of simplified forms of costs like lump sums.

HORIZON-TMA-MSCA-PF-EF - HORIZON TMA MSCA Postdoctoral Fellowships - European Fellowships

Call for proposal

Procedure for inviting applicants to submit project proposals, with the aim of receiving EU funding.

HORIZON-MSCA-2023-PF-01

Coordinator

SIMULA RESEARCH LABORATORY AS
Net EU contribution

Net EU financial contribution: the amount the participant receives, minus the EU contribution passed on to its linked third parties. It accounts for how the EU financial contribution is distributed between the direct beneficiaries of the project and other types of participants, such as third-party participants.

€ 210 911,04
Address
KRISTIAN AUGUST GATE 23
0164 OSLO
Norway

Region
Norge, Oslo og Viken, Oslo
Activity type
Research Organisations
Total cost

The total costs incurred by this organisation to participate in the project, including direct and indirect costs. This amount is a subset of the overall project budget.

No data