VODIS aims to develop a speech-based, hands-free interface that lets car drivers operate sophisticated in-car navigation, communication and entertainment facilities. By replacing manual operation with spoken commands, the device improves safety because it helps the driver to concentrate fully on the traffic ahead. The system is intended to enhance a range of other in-car devices such as route guidance, radio, other audio/video entertainment sources, and dialogue units which offer the driver spoken information and verify unclear instructions. The need for good human-computer interaction in often noisy traffic will receive special consideration.
The first year of the VODIS project involved two major tasks: development of functional and system specifications of the Voice Control Unit (VCU), and provision of language resources (speech and natural language databases).
The system is based on a state-of-the-art driver information system, the BERLIN RCM303A from Bosch, which integrates a route guidance (navigation) computer, traffic messaging (RDS/TMC), mobile telephone (GSM) and more conventional equipment such as radio and audio/video entertainment. The implementation of the different functional modules of the VCU has begun, allowing the consortium to demonstrate a partially voice-controlled BERLIN inside a vehicle. The preliminary demonstrator enables voice control of audio components and a GSM telephone.
User Requirements and Specification
To identify the market potential of voice interfaces, a market-research paper was produced reflecting the present views of both Bosch as a system provider and the automobile manufacturers involved in the project. Additional insights are expected as demonstrators become available and end users are able to evaluate the VCU while driving.
Technical development started with the definition of a system architecture in terms of functional modules. Task analyses have led to the specification of the basic command-and-control vocabularies to be considered within the initial 'prompted-approach'. These served as a backbone for the functional specification of the prompted-approach VCU and the design of the Human-Machine Interface (HMI) including a dialogue management module.
The functional specification represents a compromise between feasibility, affordability and ergonomics that will lead to an integrated demonstrator within a short period of time. The purpose of this intermediate demonstrator is twofold:
- to serve as an integration platform for the complementary technologies of the different partners, i.e. speech enhancement, recognition, dialogue control, text-to-speech synthesis, visualisation and remote control,
- to enable the automobile manufacturers to validate commonly accepted assumptions with respect to the user's attitude towards voice operation and to formulate more realistic specifications.
The prompted-approach VCU attempts to extend an existing HMI concept towards voice operation, thus retaining the menu structure and some of the drawbacks imposed by the tactile interface. The mixed-initiative VCU, however, is part of an HMI concept developed under the assumption of spontaneous speech input and the corresponding error-recovery dialogues.
Provision of language resources
At the beginning of the speech enhancement work the partners RWTH-Aachen and Bosch distributed a 200-speaker isolated-digit database adapted from a corpus originally collected by Deutsche Telekom. A common understanding of the hostile acoustical environment was established by use of an 11-speaker database recorded by Bosch in a middle-class diesel-engined automobile from VW. The experience acquired during this exercise was used to formulate the specifications of the large-scale training database. The corresponding collection campaigns have started in Karlsruhe, Wolfsburg and Nancy.
To collect data in simulated environments, UKA developed the software 'tkvodis', which elicits task-domain utterances such as navigation queries to the BERLIN. This has been installed at Bosch and Renault, where collection campaigns with employees as naive test speakers have been carried out.
Speech and language technology development
Speech enhancement by noise reduction has two potential target applications in a voice-controlled BERLIN, namely robust recognition and hands-free telephony. As the GSM telephone connected to the BERLIN is an independent component, working with its own input microphone, effort has been focused on the recognition task. Two schemes operating in the frequency domain have been considered so far:
1) single-microphone spectral subtraction in several variants. Different spectral analysis schemes have been combined with a noise-power estimation method.
2) dual-microphone delay-and-sum beamformer with coherence-based weighting of each channel.
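The first scheme in the list above can be illustrated with a minimal sketch of power spectral subtraction. It is not the project's implementation: the function names, the over-subtraction factor `alpha`, the spectral floor `beta`, and the assumption that the first frames contain noise only are all hypothetical choices made for illustration. The input is taken to be a list of per-frame power spectra, so the spectral analysis stage itself is left out.

```python
def estimate_noise_power(frames, n_noise_frames):
    """Average the power spectra of the leading frames.

    Assumption: the first n_noise_frames frames contain no speech.
    """
    head = frames[:n_noise_frames]
    n_bins = len(frames[0])
    return [sum(f[k] for f in head) / len(head) for k in range(n_bins)]


def spectral_subtract(frames, noise_power, alpha=2.0, beta=0.01):
    """Subtract alpha * noise power in each bin.

    Each bin is floored at beta * noisy power so that the result
    never becomes negative.
    """
    cleaned = []
    for frame in frames:
        cleaned.append([
            max(p - alpha * n, beta * p)
            for p, n in zip(frame, noise_power)
        ])
    return cleaned


# Toy power spectra: two noise-only frames, then one frame with
# speech energy in the middle bin.
frames = [
    [1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0],
    [1.0, 9.0, 1.0],
]
noise = estimate_noise_power(frames, n_noise_frames=2)
enhanced = spectral_subtract(frames, noise, alpha=2.0, beta=0.01)
```

In the toy data the speech-dominated bin keeps most of its energy, while the noise-only bins are pushed down to the spectral floor. A real system would track the noise estimate over time rather than fixing it from the first frames.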
The effectiveness of both schemes has been assessed in the context of both isolated and connected digit recognition tasks, but using phonetic models that were trained with clean speech. For this reason, the single-microphone scheme has recently been incorporated into the feature-extraction module of the recogniser, which allows training with noise-reduced feature vectors and also eliminates the need for a synthesis filterbank. Some key parameters of the noise-suppression algorithms are presently fixed to the 'optimal' values determined. Further effort will be devoted to a signal-to-noise-ratio-dependent parametrisation based on the car engine speed as a measure of the noise intensity.
A number of comparative tests of different recognition processing schemes (from L&H and Bosch) have been conducted. A series of subsequent revisions of these schemes has shown significant improvement over time of the preliminary single-channel system for connected digit recognition.
At the same time RWTH-Aachen have been working on acoustic echo control, making impulse-response measurements in different vehicles in order to estimate the required length of the echo canceller and to develop models that allow for a realistic simulation of the loudspeaker-microphone path. A robust adaptation algorithm for the echo canceller has been optimised with respect to the recognition scores on an isolated-digit task. A real-time implementation on a floating-point DSP board (96002 from MOTOROLA) is already available. Further effort will be devoted to the cascade of echo controller and noise-suppression system.
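The text does not say which adaptation algorithm is used for the echo canceller. As an illustration only, the sketch below uses the normalised LMS (NLMS) algorithm, a common choice for adaptive echo cancellation, which is swapped in here as an assumption. The function name, step size `mu`, regularisation `eps`, and the three-tap simulated loudspeaker-microphone path are all hypothetical.

```python
import random


def nlms_echo_canceller(far_end, mic, n_taps, mu=0.5, eps=1e-8):
    """Run an NLMS adaptive filter.

    Returns (echo-reduced signal, final filter taps).
    """
    w = [0.0] * n_taps        # adaptive estimate of the echo path
    x_buf = [0.0] * n_taps    # recent loudspeaker samples, newest first
    errors = []
    for x, d in zip(far_end, mic):
        x_buf = [x] + x_buf[:-1]
        y = sum(wi * xi for wi, xi in zip(w, x_buf))   # estimated echo
        e = d - y                                      # echo-reduced output
        norm = sum(xi * xi for xi in x_buf) + eps      # input power
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, x_buf)]
        errors.append(e)
    return errors, w


# Simulated loudspeaker-microphone path (hypothetical impulse response).
rng = random.Random(0)
echo_path = [0.5, -0.3, 0.1]
far_end = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
mic = [
    sum(h * far_end[n - k] for k, h in enumerate(echo_path) if n - k >= 0)
    for n in range(len(far_end))
]
residual, taps = nlms_echo_canceller(far_end, mic, n_taps=len(echo_path))
```

The number of taps corresponds to the echo canceller length that the impulse-response measurements are meant to determine. In a real vehicle it would be far larger than the three taps used here, and the microphone signal would also contain speech and noise.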
The feasibility and limitations of grapheme-to-phoneme (G2P) based recognition have been investigated for a number of tasks related to the BERLIN operation. All 65 command-and-control words and other lists containing around 50 items (e.g. all German airports) are robustly recognised even in a noisy office environment. However, significantly larger vocabularies, such as the city-name list in the German navigation database (33000 items), cannot be treated in the same manner and require much more elaborate input schemes, including the partial spelling of names.
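One way to picture partial spelling is as a filter that shrinks the large name list until it is small enough for list-based recognition. The sketch below is an assumption about how such an input scheme could work, not a description of the VODIS design. The function names, the tiny city list and the `max_items` threshold are hypothetical.

```python
def narrow_by_spelling(names, spelled_prefix):
    """Keep only the names that start with the letters spelled so far."""
    prefix = spelled_prefix.casefold()
    return [name for name in names if name.casefold().startswith(prefix)]


def active_vocabulary(names, spelled_prefix, max_items=50):
    """Return a shortlist for recognition, or None if more letters are needed.

    Assumption: lists of up to max_items entries are small enough to be
    recognised reliably as whole words.
    """
    shortlist = narrow_by_spelling(names, spelled_prefix)
    if len(shortlist) > max_items:
        return None
    return shortlist


# Tiny stand-in for the 33000-item city list.
cities = ["Karlsruhe", "Kassel", "Kaiserslautern", "Köln", "Hildesheim", "Wolfsburg"]

too_many = active_vocabulary(cities, "ka", max_items=2)
shortlist = active_vocabulary(cities, "kar", max_items=2)
```

With the toy threshold of two items, the prefix "ka" still matches three cities, so the dialogue would have to ask for another letter. The prefix "kar" leaves a single candidate.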
Special attention has been paid to the recognition of RDS station names and GSM phonebook items, since their pronunciation is likely to deviate from standard rules. For a large set of RDS station names provided by the test sites, exception dictionaries for the G2P converter have therefore been generated. Company-name acronyms (e.g. KLM, CNRF), however, will be treated by exception rules.
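The interplay of exception dictionary, exception rule and regular conversion can be sketched as a simple lookup order. Everything in the example is a hypothetical stand-in. The station name "DEMO FM" is invented, the phoneme strings are toy symbols rather than verified transcriptions, and `toy_rule_g2p` merely emits one symbol per letter in place of a real rule-based converter.

```python
# Hypothetical exception dictionary: item -> toy phoneme string.
EXCEPTIONS = {
    "DEMO FM": "d e: m o E f E m",
}

# Toy letter names used by the acronym rule (not a verified transcription).
LETTER_NAMES = {"K": "k a:", "L": "E l", "M": "E m"}


def toy_rule_g2p(word):
    """Placeholder for a rule-based G2P converter: one symbol per letter."""
    return " ".join(ch for ch in word.lower() if ch.isalnum())


def pronounce(item):
    """Look up exceptions first, then the acronym rule, then regular rules."""
    key = item.strip()
    if key in EXCEPTIONS:
        return EXCEPTIONS[key]
    if key.isalpha() and key.isupper() and len(key) <= 4:
        # Exception rule: short all-capital items are spelled letter by letter.
        return " ".join(LETTER_NAMES.get(ch, ch.lower()) for ch in key)
    return toy_rule_g2p(key)


station = pronounce("DEMO FM")
acronym = pronounce("KLM")
regular = pronounce("Berlin")
```

The dictionary lookup comes first so that a listed item always overrides the rules, including the acronym rule.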
The semantic grammars for the natural language component are also under development based on a small amount of speech-data collected from native speakers and small navigation-databases containing streets and points of interest from around Karlsruhe and Paris. The grammars written for the PHOENIX parser are able to cope with navigation queries such as:
'how can I get to Forbes Avenue ?'
'what's the shortest path to St. Mary's Hospital'
'show me the way to the Airport'
'I would like to go to the closest fast-food restaurant'
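The idea of a semantic grammar that maps such queries to a destination slot can be sketched with regular expressions. This is not the PHOENIX grammar format and does not use the PHOENIX parser. The carrier phrases, the frame layout and the `parse_query` function are hypothetical and cover only the four example queries.

```python
import re

# Carrier phrases taken from the example queries; a real grammar would
# cover many more variants.
ROUTE_PHRASES = [
    r"how can i get to",
    r"what's the shortest path to",
    r"show me the way to",
    r"i would like to go to",
]

ROUTE_RE = re.compile(
    r"^(?:%s)\s+(?P<destination>.+?)\s*\??$" % "|".join(ROUTE_PHRASES),
    re.IGNORECASE,
)


def parse_query(utterance):
    """Return a semantic frame for a navigation query, or None if no match."""
    match = ROUTE_RE.match(utterance.strip())
    if match is None:
        return None
    frame = {"frame": "route_request", "destination": match.group("destination")}
    if "shortest" in utterance.lower():
        frame["criterion"] = "shortest"
    return frame


first = parse_query("how can I get to Forbes Avenue ?")
second = parse_query("what's the shortest path to St. Mary's Hospital")
unknown = parse_query("turn up the volume")
```

An utterance that matches no carrier phrase yields `None`, which is the point at which a dialogue system could fall back to a clarification prompt.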
Further development will require a considerably larger amount of spoken data. To collect written French and German navigation queries a web-page form has been set up at the VODIS web site.
Peugeot-Citroen and Bosch have prepared and distributed a questionnaire to German users of the BERLIN to investigate the difficulties that people experience when using a multimedia driver information system with a tactile interface.
The Way Ahead
The work in 1997 will be mainly devoted to the implementation and integration of the modules. The final prompted-approach demonstrator will cover the whole functionality of the BERLIN (including the navigation system) and will have full dialogue capabilities.
A systematic field test will be carried out by Volkswagen, Peugeot-Citroen and Renault once the prompted-approach demonstrator becomes available in late 1997. Furthermore, potential end users will be given the opportunity to evaluate the prototype voice interface at popular trade shows such as CeBIT.
Other organisations representing the general driver-community (ADAC), other car manufacturers, car-rental companies (Hertz, Avis), courier service providers (UPS, FedEx), producers of digital maps (TeleAtlas) and related navigation and travel information products (Gräfe und Unzer Verlag) will also be invited to user group demonstrations.
The project will focus on voice operation of a commercially available state-of-the-art driver information system, namely the BERLIN RCM303A from Bosch. This system presently integrates a route guidance (navigation) computer with traffic messaging (RDS/TMC), mobile telephone (GSM) and more conventional equipment, such as car radio and audio/video entertainment sources. It therefore demands a high amount of driver interaction, e.g. when entering the name of the city and the particular location (street, public building, etc.) to which a driver needs to be guided.
The base technologies are phonetically based recognisers and natural language understanding systems, to be demonstrated in a car environment for the first time. Special attention will be devoted to the open-vocabulary problem, i.e. recognising a finite set of a-priori unknown words which become available as a textual list at runtime, and to the speech enhancement problem, by taking into account all available a-priori information about the noise sources (e.g. engine rotation).
Progress and results
In an initial stage, to be completed after roughly two years, a menu-driven voice operation of a limited number of BERLIN-functions will be pursued, e.g. source selection, telephone operation and navigation within limited regions of selected German and French cities. Within this highly prompted and dialogue-aided scenario, the system keeps control of the interaction process. The ultimate stage, to be addressed from the beginning but completed after three years, allows for spontaneously spoken database queries for the navigation task. This natural-language-understanding system uses the previously developed prompted interface in a fallback-strategy, e.g. when clarification or verification are required.
The voice control unit will be based on commercially available hardware, e.g. several multi-purpose DSP boards hosted on one or two PCs in the boot of the test vehicles and linked to the target system via a special I2C-type bus. Miniaturisation, integration and cost optimisation aspects have been given secondary priority. Two evaluation stages for the demonstrators have been defined, the first one within a highly prompted, menu-and-dialogue driven scenario, the second one allowing more spontaneous speech input. The evaluation will be carried out at user level in the lab and at driver level by means of field tests with a representative set of drivers. This will reflect the acceptance, the impact and the cognitive load imposed by the novel HMI. Bosch considers the work to be pre-competitive research and expects to use both the results and the relationships established as a basis for the development of voice HMIs across the entire driver-information-systems product line.
Funding Scheme: CSC - Cost-sharing contracts