
Multi-channel biometrics combining acoustic and machine vision analysis of speech, lip movement and face

Periodic Reporting for period 2 - SpeechXRays (Multi-channel biometrics combining acoustic and machine vision analysis of speech, lip movement and face)

Reporting period: 2016-11-01 to 2019-04-30

SpeechXRays addresses access control to online data and services, and possibly to physical locations. Although access control has been required for decades, it still leaves room for improvement.
Most available solutions rely on passwords, which are weak (easily stolen or forgotten, expensive to manage) and burdensome for end users when complex passwords are mandatory.
One-time passwords are an interesting alternative but are inconvenient for end users, who must make sure that their phone and credit cards cannot be stolen together.
Recent solutions rely on secure elements or on access to hardware security modules, but this is not enough for strong authentication, which is now required by the new PSD2 directive, where 2 factors are mandatory.

SpeechXRays proposes a multi-biometric approach based on innovative face and voice recognition techniques. The benefit of these biometrics is that they do not need specific acquisition devices: a camera for face and a microphone for voice are available on any smartphone, tablet or computer. Users can use multiple devices and switch between them very easily. With new, high-accuracy, convenient, revocable biometrics, we address all the major concerns of current solutions.
SpeechXRays includes 11 work packages and 11 objectives.

Work packages progressed as planned except for voice biometrics, as the first voice biometrics partner never met its objectives. We had to replace it with VoiceTrust. With additional work, we were able to develop robust voice + face biometric verification.

Objective 1: Develop and test a cost effective, convenient, privacy preserving multimodal biometrics solution based on voice biometrics and machine vision analysis of speech, lip movement and face
1.1 Voice biometrics: VoiceTrust contributed its state-of-the-art active speaker recognition technology and made it available for Romanian and Greek.
1.2 Audio-visual (multichannel) biometrics: all technical partners, including the integrator, cooperated to deliver an integrated face + speaker recognition system. SpeechXRays uses the combination of speech and face motion to counter spoofing.

Objective 2: Implement the novel biometrics solution in a broadband network, giving access to smart services running over networks with state-of-the-art security, avoiding single points of failure
2.1 Corporate use case: IFIN-HH tested SpeechXRays on more than 600 users, both employees and visitors of IFIN-HH, in conditions consistent with a real-life deployment of the system on the IFIN-HH premises.
2.2 eHealth use case: FORTH fully implemented the eHealth pilot over a 3-month period, and tested it with more than 400 end users in different conditions representative of the actual eHealth environment.
2.3 Consumer use case: Forthnet demonstrated the use of SpeechXRays in a simulated consumer environment, with more than 1000 end users representing a variety of ages, genders and levels of familiarity with technology.
All use cases were GDPR compliant, following the recommendations of an ethics advisor. The target KPIs were met.

Objective 3: Guarantee interoperability and portability between systems and services
3.1 Compare text-independent and text-dependent solutions: VoiceTrust presented the benefits of an active, text-dependent solution for convenience and resistance to spoofing.
3.2 Device and network independence: we tested SpeechXRays on multiple hardware and software environments. SpeechXRays includes an API relying on REST.
3.3 Standard compatibility: we comply with standard NIST SP500-288, and worked on 2 ISO standards: ISO/IEC 24714 and ISO/IEC 24745:2011.

Objective 4: Develop a vibrant application and service ecosystem
4.1 User community: We presented the SpeechXRays project and platform to many people. However, because the project was disrupted by the replacement of one partner, we had no early adopters.
4.2 Developer community: 10 developers made 5 different developments on top of the SpeechXRays API.
4.3 Hacking contest: the University of Bucharest hosted a spoofing contest, aimed at presentation attacks. Students were not able to spoof the system in the time given, despite their creativity.
We present SpeechXRays innovations below:
Speaker recognition
VoiceTrust offered its state-of-the-art active voice biometrics solution and trained custom acoustic models for Romanian and Greek, two languages for which no similar speaker recognition solutions are currently available elsewhere. We studied 2 approaches for rapid language development:
• Collection of training corpus with 300 individuals (Romanian)
• Usage of existing speech recognition corpus (Greek)
The first approach was more favorable than the second one.

Face recognition
Nowadays, almost all face recognition systems are variations of deep CNNs. They are distinguished from each other by the size and quality of the databases used for training. For the SpeechXRays project, TSP used publicly available research databases for training. Because labelling is not always accurate, TSP used clustering algorithms to reduce the mislabeling of the data. TSP was able to obtain competitive results with models trained on public databases.
TSP and RealEYES collaborated to train the system on the RealEYES database of 1.5M subjects. The resulting models can be used commercially.

Emotion detection
RealEYES explored new approaches based on deep learning to measure emotion. It grew its emotion training database significantly, to the levels required to apply deep neural network methods. As a result, RealEYES achieved improvements of up to 77% in MCC scores for some emotions, enabled tracking of new emotions such as Contempt and Attention, and improved the accuracy of its face detection and facial landmark tracking components.
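For readers unfamiliar with the metric, the sketch below assumes that MCC refers to the Matthews correlation coefficient, a common score for binary classifiers (here, "emotion present" versus "emotion absent"). The function name and the counts are illustrative and do not come from the project.

```python
import math

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Ranges from -1 (total disagreement) through 0 (chance level)
    to +1 (perfect prediction).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom

# Hypothetical counts, for illustration only.
perfect = mcc(tp=50, tn=50, fp=0, fn=0)
chance = mcc(tp=25, tn=25, fp=25, fn=25)
```

Unlike plain accuracy, this score stays informative when one class is much rarer than the other, which is typical for infrequent emotions.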

Anti-spoofing and liveness detection
Anti-spoofing for voice relies on a text-prompted design. In SpeechXRays, the vocabulary is small, but it was enough for the spoofing contest, as only one student was able to spoof speaker recognition. Another measure can easily be added: use a fixed pass phrase and detect replay attempts.
Liveness detection relies on the motion of the mouth and eyebrows, and on computing luminosity and colors in different parts of an image. It considers the frontal face ratio, and normalizes the image features so that affine transformations do not create “fake liveness”.
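The idea behind a text-prompted design can be sketched as follows: the system asks for a freshly drawn sequence of words, so a recording of an earlier session is unlikely to contain the right words in the right order. This is a minimal illustration only; the vocabulary, the prompt length and the function names are assumptions, and the actual SpeechXRays prompt set is not described in this report.

```python
import secrets

# Hypothetical small vocabulary (digits); not the project's real prompt set.
VOCABULARY = ["zero", "one", "two", "three", "four",
              "five", "six", "seven", "eight", "nine"]

def make_prompt(n_words: int = 4) -> list:
    """Draw an unpredictable word sequence for the user to speak."""
    return [secrets.choice(VOCABULARY) for _ in range(n_words)]

def transcript_matches(prompt: list, transcript: list) -> bool:
    """Check that the recognized words equal the prompt, ignoring case."""
    return [w.lower() for w in transcript] == [w.lower() for w in prompt]

def prompt_space(n_words: int = 4) -> int:
    """Number of distinct prompts; a replayed recording must hit one of them."""
    return len(VOCABULARY) ** n_words

prompt = make_prompt()
```

Even with a vocabulary of ten words, four-word prompts give 10,000 combinations, which suggests why a small vocabulary can still make replay attacks impractical within a limited time.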

Cancellable biometric
TSP provides a cancellable face solution “as a service”, using an architecture of REST APIs. The method relies on a shuffling key obtained from a password or a secure element. Users keep their privacy, as templates are randomized and it is computationally infeasible to reverse them.
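The general principle of a key-driven shuffle can be illustrated with a short sketch. It is not the TSP method, whose details are not given here: the key derivation, the use of Python's `random` module (which is not cryptographically secure) and all names are assumptions chosen for illustration. A plain permutation alone would also not provide the irreversibility claimed above.

```python
import hashlib
import random

def derive_key(password: str, salt: bytes) -> bytes:
    """Derive a shuffling key from a password (hypothetical parameters)."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)

def shuffle_template(features: list, key: bytes) -> list:
    """Reorder a feature vector using a permutation determined by the key."""
    order = list(range(len(features)))
    # Illustration only: random.Random is deterministic for a given seed
    # but is not a cryptographic generator.
    random.Random(key).shuffle(order)
    return [features[i] for i in order]

# Hypothetical 32-dimensional feature vector.
features = [i / 32 for i in range(32)]
key_a = derive_key("first password", b"demo-salt")
key_b = derive_key("replacement password", b"demo-salt")

template_a = shuffle_template(features, key_a)
template_b = shuffle_template(features, key_b)  # issued after revocation
```

Revocation follows from the design: if a stored template leaks, issuing a new key yields a different template from the same face, whereas a raw biometric cannot be changed.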

2 biometric factors for strong authentication
Unlike other strong authentication solutions, SpeechXRays implements strong authentication with 2 biometrics, and the results from all use cases demonstrate that fusion brings a lot of value.
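One common way to combine two biometric matchers is score-level fusion, sketched below as a weighted sum followed by a threshold. The report does not state which fusion rule SpeechXRays uses, so the rule, the weight, the threshold and the function names are all assumptions, and scores are assumed to be normalized to the range 0 to 1.

```python
def fuse_scores(face_score: float, voice_score: float,
                face_weight: float = 0.5) -> float:
    """Weighted-sum fusion of two match scores in [0, 1] (hypothetical rule)."""
    if not 0.0 <= face_weight <= 1.0:
        raise ValueError("face_weight must be between 0 and 1")
    return face_weight * face_score + (1.0 - face_weight) * voice_score

def accept(face_score: float, voice_score: float,
           threshold: float = 0.7, face_weight: float = 0.5) -> bool:
    """Accept the identity claim when the fused score reaches the threshold."""
    return fuse_scores(face_score, voice_score, face_weight) >= threshold

# Hypothetical scores: a good face match cannot compensate for a poor voice match.
both_strong = accept(0.9, 0.9)
voice_fails = accept(0.9, 0.1)
```

With equal weights, an impostor who defeats only one modality still ends up with a low fused score, which is the intuition behind the added value of fusion.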