Periodic Reporting for period 1 - TrustLLM (Democratize Trustworthy and Efficient Large Language Model Technology for Europe)
Reporting period: 2023-11-01 to 2025-04-30
The consortium combines unique expertise and practical experience in building LLMs with leading NLP researchers and with organizations that transfer the technology to companies and end-users.
TrustLLM built a robust infrastructure for Germanic language models, creating multilingual datasets (Danish, Dutch, Faroese, German, Icelandic, Norwegian, Swedish) with strong privacy safeguards. The custom “TrustLLM Trove” framework extended open-source tools with quality filters and language identification for Germanic languages. Efficient text filtering and an HTML extraction pipeline compliant with text and data mining (TDM) rules ensured that data extraction was both legal and high quality.
Model Development & Training
A 7.8B-parameter multilingual model was trained on 2.3T tokens across 17 Germanic language variants, outperforming models such as Llama-2 7B. LoRA-based adapters enabled efficient fine-tuning for under-resourced languages such as Icelandic and Faroese.
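To illustrate why LoRA-based adapters make tuning efficient, the following is a minimal sketch of the general LoRA idea, not the project's actual training code. It assumes a single frozen weight matrix W and two small trainable matrices A and B of a chosen rank; the function names, layer sizes, rank, and scaling value are hypothetical and picked only for illustration.

```python
# Minimal LoRA sketch in pure Python (illustrative only; all names and
# sizes here are assumptions, not taken from the TrustLLM code base).

def lora_param_counts(d_in, d_out, rank):
    """Return (full, adapter) trainable-parameter counts for one layer.

    full    : parameters updated when fine-tuning the whole d_out x d_in matrix
    adapter : parameters in the low-rank pair A (rank x d_in) and B (d_out x rank)
    """
    full = d_in * d_out
    adapter = rank * (d_in + d_out)
    return full, adapter


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """Compute y = W x + (alpha / rank) * B (A x).

    W stays frozen; only A and B would be trained for the new language.
    """
    base = matvec(W, x)               # frozen pretrained path
    low_rank = matvec(B, matvec(A, x))  # trainable low-rank update
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, low_rank)]


if __name__ == "__main__":
    # Hypothetical 4096 x 4096 projection with a rank-8 adapter.
    full, adapter = lora_param_counts(4096, 4096, 8)
    print(f"full fine-tuning: {full} parameters, adapter: {adapter} parameters")

    # Tiny worked example with a 2 x 2 identity as the frozen weight.
    W = [[1, 0], [0, 1]]
    A = [[1, 1]]      # rank 1, shape 1 x 2
    B = [[1], [2]]    # shape 2 x 1
    print(lora_forward(W, A, B, [1, 2], alpha=2, rank=1))
```

Under these assumed sizes the adapter trains well under one percent of the parameters of the full matrix, which is the property that makes separate adapters for smaller languages practical.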
Evaluation & Applications
The EuroEval platform standardized evaluation across 7 tasks and 8 languages. Bias datasets for Danish and Dutch were introduced. Real-world tools include BijsluiterBot (medical information), BookBot (youth reading), Svarkur (Icelandic Q&A), and applications for the automotive and accessibility domains.
Technical Innovation & Trust
Innovations include RAG-Ex for explainability, insights into retrieval-enhanced transformers, tool-augmented reasoning, and dynamic tokenization for morphologically rich languages. These advances support transparent, reliable AI aligned with European values.