CORDIS - EU research results

Voice driven interaction in XR spaces

Periodic Reporting for period 1 - VOXReality (Voice driven interaction in XR spaces)

Reporting period: 2022-10-01 to 2024-03-31

VOXReality is an ambitious project whose goal is to facilitate and exploit the convergence of natural language processing (NLP) and computer vision (CV). Both technologies are experiencing a huge performance increase due to the emergence of data-driven methods (ML and AI). On the one hand, CV/ML is driving the extended reality (XR) revolution beyond what was previously possible; on the other, speech-based interfaces and text-based content understanding are revolutionising human-machine and human-human interaction. VOXReality takes an economical approach to combining the two: it integrates language- and vision-based AI models with either unidirectional or bidirectional exchanges between the two modalities. Vision systems drive both AR and VR, while language understanding adds a natural way for humans to interact with the back-ends of XR systems or to create multimodal XR experiences combining vision and sound. The results of the project are: 1) a set of pretrained next-generation XR models combining language and vision AI, enabling richer, more natural immersive experiences to boost XR adoption, and 2) a set of applications using these models to demonstrate innovations in various sectors. These technologies are validated through three use cases: 1) Personal Assistants, 2) Virtual Conferences, and 3) Augmented Reality Theatres.
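As an illustration of the unidirectional vision-to-language exchange described above, the following minimal Python sketch produces a textual description of an image. The project's own pretrained models are not named in this summary, so a publicly available BLIP captioning checkpoint from the Hugging Face transformers library stands in, and the image path is a placeholder.

    # Illustrative sketch only: a public captioning model stands in for the
    # project's vision-language models; "scene.jpg" is a placeholder image.
    from transformers import pipeline

    # Vision -> language: produce a textual description of visual content.
    captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
    result = captioner("scene.jpg")  # local path or URL to any image
    print(result[0]["generated_text"])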
The VOXReality project has achieved a series of objectives defined in the DoA, in line with the foreseen schedule and to the foreseen degree. First and foremost, it contributed to the improvement of human-to-machine and human-to-human XR experiences through the definition of the use cases, the generic pilot scenarios, and the pilot planning and validation. For this purpose, it developed the first version of the advanced AI models, the first version of the model deployment analysis, and the infrastructure that will be used for deploying the models. Second, it broadened multilingual translation and adapted it to different contexts through the NLP models (Automatic Speech Recognition, Neural Machine Translation, Visual Language, Conversation Agents). Third, it automated the generation of virtual agents using multimodal information by developing two Conversation Agents, one for the VR conference and one for the Training Assistant. Fourth, VOXReality extended and improved the visual grounding of language models by developing a series of spatially aware vision (RGB)-language models that are used in two use cases. Fifth, it provided accessible pretrained XR models optimized for deployment under the defined use cases; in detail, it (i) identified and began implementing the necessary adaptation of the "once-for-all" training technique for model optimization and efficient deployment on various platforms, (ii) developed a two-stage optimization pipeline for the models, and (iii) provided four different deployment options. Sixth, it initiated the demonstration of integration paths for the pretrained models through the publication of the Open Calls invitation and the subsequent collection of submissions.
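For example, the speech-to-text-to-translation chain behind the multilingual translation objective can be sketched as below. Public Whisper (ASR) and Helsinki-NLP MarianMT (NMT) checkpoints stand in for the project's own models, and "speech.wav" is a placeholder audio file.

    # Hedged sketch of the ASR -> NMT chain; models and file are stand-ins,
    # not the project's actual checkpoints.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
    nmt = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")  # English -> German

    transcript = asr("speech.wav")["text"]                 # speech -> text
    translation = nmt(transcript)[0]["translation_text"]   # text -> translated text
    print(transcript, "->", translation)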
The VOXReality project is oriented towards the delivery of AI models and of XR applications exploiting those models. First and foremost, it develops pretrained AI models for (a) automatic speech recognition (ASR), producing textual output from user speech, (b) neural machine translation (NMT), translating textual input into other languages, (c) visual language models (VLM), producing textual output from visual content, and (d) generative language models in the form of conversation agents (CA) to assist with instructional training and indoor navigation. These models are properly trained and made openly available under permissive licensing schemes to address the needs of interested stakeholders who intend to use them in various settings and domains. Second, it provides, through the Open Calls, funds for (a) integrating the models into new XR applications, (b) extending the models and tools, and (c) performing both activities. Third, the VOXReality consortium has identified four deployment options for the components developed as a result of the project activities:
1. Source code: inference code released openly, allowing users to modify and deploy the components with their desired configuration
2. Containerization: Docker containers that enable deployment of the components as API services with minimal intervention required from users
3. Kubernetes: means to deploy the components in application pipelines that use the Kubernetes framework
4. ONNX: models in the ONNX format to ensure easy integration into Unity applications (see the sketch after this list).
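As a sketch of deployment option 4, the snippet below runs an exported ONNX model with onnxruntime in Python; the model file name, input name, and tensor shape are placeholders. Inside a Unity application, the same .onnx asset would instead be loaded through a C# inference backend.

    # Minimal ONNX Runtime sketch; "model.onnx" and the input shape are
    # placeholders for whatever model a user has exported.
    import numpy as np
    import onnxruntime as ort

    session = ort.InferenceSession("model.onnx")
    input_name = session.get_inputs()[0].name              # discover the graph input
    dummy = np.zeros((1, 3, 224, 224), dtype=np.float32)   # placeholder input tensor
    outputs = session.run(None, {input_name: dummy})       # run inference
    print([o.shape for o in outputs])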
Fourth, the VOXReality models are compliant with the AI Act, while the datasets collected follow the provisions of the GDPR. The project's results can be exploited to change the way business and entertainment are conducted. The VR Conference can facilitate communication even when users do not share a common language, with conversation agents assisting them in a remote venue. Using the Training Assistant, engineers may continuously update their skills without having to travel, supported by conversation agents. In AR theatres, foreign visitors may attend theatrical plays performed in the local language, experience the play in their own language, and enjoy visual effects driven by the actors' speech.

Furthermore, the project creates strategic impact by addressing the increasing need of the EU market for viable alternatives and innovative services that enable European citizens to continue their activities in the post-pandemic landscape. The language models and applications demonstrate European perspectives, values and requirements in terms of privacy and security, and also constitute open-science outcomes, since they are delivered publicly through widely known repositories. Moreover, the project helps the European landscape maintain its presence in the XR applications market through the provision of new interactive services based on the VOXReality AI language models. In parallel, it supports European SMEs, through its Open Calls and the openly available models, in gaining competitiveness in the global AI and XR applications market, following the EU guidelines provided through the AI Act. Indirectly, it contributes to the EU objective of having 20 million ICT professionals in Europe by 2030, by employing ICT researchers with advanced digital skills in the project team and by supporting three PhD candidates. The scientific impact of VOXReality is addressed through numerous publications as well as the open delivery of the project results. Additionally, it contributes to the objectives of the Green Deal, as the AI models and XR applications it delivers support active remote participation in conferences and the effective remote development of engineering professionals' skills. Last, it creates cultural impact, as the AR theatre offers audiences new immersive multilingual experiences that they could not enjoy before.