
The basic objective of VIDAS was to approach the problem of facial animation from a combined audio-video point of view, for both analysis and synthesis. The main goal was the consolidation of advanced technologies for synthetic/natural representation and coding of facial sequences. In particular, generic face-to-face communication has been modelled as a multimodal information source characterised in audio by a human speaker's voice and in video by the same speaker's face. In the vast majority of cases, in fact, interpersonal communication consists of exactly these two strongly correlated items: a speaking face together with its synchronous speech. From the technical point of view, a major goal of the project was to implement a software prototype for synthetic facial animation, compliant with the new MPEG-4 standard, capable of reproducing the speaker's face through a 3D deformable model that can be calibrated through standard FDP (Face Definition Parameters) and animated through standard FAP (Face Animation Parameters).
The mandate of VIDAS included a deep and responsible commitment to the MPEG-4 standardisation process, with special reference to SNHC (Synthetic/Natural Hybrid Coding) and FBA (Face and Body Animation). VIDAS was also responsible for disseminating this kind of information to the European scientific community through a variety of initiatives, such as the periodic ACTS Concertation Meetings, the ECMAST and IST conferences, and "ad hoc" workshops like those organised by VIDAS itself in Rhodes (IWSNHC3DI'97) and in Santorini (IWSNHC3DI'99).
The final goals of VIDAS have been the evaluation of the subjective impact produced by synthetic facial animation on normal-hearing and hearing-impaired subjects, and a feasibility study for porting this MPEG-4 technology to real-time platforms.
The major technical issues addressed by the project are described in the following. The sub-image in which the presence of the head is detected is further processed to segment the facial region through a merging process that determines which regions of the sub-image form the face. The algorithm works on a Region Adjacency Graph (RAG) whose nodes each represent a connected component of the image (a region or flat zone) and whose links connect pairs of neighbouring nodes. An iterative merging algorithm is then applied to this graph: it removes links and merges the corresponding nodes until the final segmentation is reached.
Multi-resolution search (3 levels) of the speaker's face on the first frame of the image sequence.
The developed technique also exploits the colour information to obtain more accurate contours. The model used for each region is the median of each (y,u,v) component, computed recursively from the medians of the two merged regions. The merging order is given by the relative squared error between region models, and the merging criterion (here a termination criterion) is the final number of regions.
The face partition is not used directly for tracking, since its regions do not satisfy any fixed motion or spatial homogeneity criterion. Instead, a second partition level is defined by re-segmenting the face partition. The re-segmentation yields a second partition whose objective is to guarantee the colour homogeneity of each region (texture partition) while preserving the contours present in the face partition.
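The merging procedure described above can be sketched as follows. This is a minimal illustration rather than the VIDAS implementation: the function name merge_regions, the use of a plain squared distance between (y,u,v) models as the merging cost, and the size-weighted combination used to update the model of a merged region are all assumptions made for the example.

```python
def merge_regions(models, sizes, edges, target_count):
    """Greedy merging on a region adjacency graph (illustrative sketch).

    models: {region_id: (y, u, v)} representative colour of each region
    sizes:  {region_id: number of pixels}
    edges:  set of frozenset({a, b}) links between neighbouring regions
    Returns (labels, models): the final region of every input region,
    and the colour models of the regions that remain.
    """
    models, sizes, edges = dict(models), dict(sizes), set(edges)
    labels = {r: r for r in models}

    def cost(edge):
        a, b = tuple(edge)
        # Assumption: squared distance between the two region models.
        return sum((p - q) ** 2 for p, q in zip(models[a], models[b]))

    # Termination criterion: stop when the target number of regions is reached.
    while len(models) > target_count and edges:
        # Cheapest link first; ties are broken by region ids for determinism.
        a, b = sorted(min(edges, key=lambda e: (cost(e), sorted(e))))
        total = sizes[a] + sizes[b]
        # Assumption: the merged model is a size-weighted combination
        # of the two previous models.
        models[a] = tuple((sizes[a] * p + sizes[b] * q) / total
                          for p, q in zip(models[a], models[b]))
        sizes[a] = total
        del models[b], sizes[b]
        # Re-route the links of b to a and drop the a-b link itself.
        rerouted = set()
        for e in edges:
            e2 = frozenset(a if r == b else r for r in e)
            if len(e2) == 2:
                rerouted.add(e2)
        edges = rerouted
        for r in labels:
            if labels[r] == b:
                labels[r] = a
    return labels, models


# Four regions in a chain: two bright ones followed by two dark ones.
example_models = {0: (200, 128, 128), 1: (190, 128, 128),
                  2: (50, 128, 128), 3: (60, 128, 128)}
example_sizes = {0: 10, 1: 10, 2: 10, 3: 10}
example_edges = {frozenset({0, 1}), frozenset({1, 2}), frozenset({2, 3})}
labels, merged = merge_regions(example_models, example_sizes,
                               example_edges, target_count=2)
print(labels)
```

In this toy case the two bright regions and the two dark regions are merged first, and the procedure stops as soon as two regions remain.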
Original images, bright face components, bright and dark face components and the final segmented faces.
The texture partition of the previous image is projected into the current frame to obtain the texture partition at the current image. The projection of the texture partition accommodates the previous partition to the information in the current image. An estimation of the region position in the current image is obtained by motion compensation of the previous texture partition. Compensated markers are fit into a finer partition to validate them. In a first step, compensated markers are reduced to the set of fine regions that are totally covered by them. Finer regions that are partially covered by more than one compensated marker are assigned to the uncertainty area. This step is purely geometrical and it yields the main connected components of each projected marker. Once the main components of every compensated region have been computed, neighbouring regions from the fine partition can be added to them. This second step takes into account geometrical as well as colour information and yields the core components of the face region. The final face partition is created by applying the refinement step based on the distance to the face space to these core components.
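The first, purely geometrical step of the projection can be sketched as below. The sketch assumes that both the fine partition and the motion-compensated markers are given as per-pixel label lists of equal length, with None where no marker lands; the name fit_markers and the choice to leave regions that are only partly covered by a single marker unassigned (so that the second step can decide on them) are assumptions of this example, not details taken from the project.

```python
UNCERTAIN = "uncertain"


def fit_markers(fine_partition, compensated):
    """Assign each fine region to a compensated marker (illustrative sketch).

    fine_partition: per-pixel labels of the finer partition
    compensated:    per-pixel labels of the motion-compensated markers,
                    None where no marker covers the pixel
    Returns {fine_region: marker, UNCERTAIN, or None}.
    """
    covering = {}
    for fine, marker in zip(fine_partition, compensated):
        covering.setdefault(fine, set()).add(marker)

    result = {}
    for fine, markers in covering.items():
        real = markers - {None}
        if len(markers) == 1 and real:
            # Totally covered by exactly one marker: keep it.
            result[fine] = next(iter(real))
        elif len(real) > 1:
            # Partially covered by more than one marker: uncertainty area.
            result[fine] = UNCERTAIN
        else:
            # Uncovered, or only partly covered by one marker (assumption:
            # left unassigned for the later region-growing step).
            result[fine] = None
    return result


fine = ["a", "a", "b", "b", "c", "c", "d", "d"]
comp = [1, 1, 1, 2, 2, None, None, None]
assignment = fit_markers(fine, comp)
print(assignment)
```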
The face segmentation and tracking technique developed within this project performs successfully on a large set of sequences. It can therefore be used as a generic technique for applications that require the extraction of faces from sequences with human presence. Some examples of the achieved results are shown in the figures above and below.
Examples of the achievable results in facial region segmentation.
A set of parameters describing how to adapt a 3D face model to the face image is computed from the extracted feature data, with specific reference to the CANDIDE model developed at the University of Linköping. The method consists of estimating the 3D rigid and non-rigid parameters that give the best match with the speaker's 2D face in the input image. The rigid parameters are limited to 3 global translations and 3 global rotations, whereas the non-rigid parameters are weights that control the principal eigen-shapes (we recall that the eigen-shapes describe non-rigid deformations of the 3D CANDIDE model). Some results of the projection of the 3D CANDIDE model onto the image plane are shown in the figure below.
The face-mask Candide adapted to the feature data. The allowed deformations were rotation around the z-axis (left), scaling (center) and translation (right). The distance between the eyes and the distances eyes-nose and nose-mouth were also allowed to change.
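To make the roles of the rigid and non-rigid parameters concrete, the sketch below deforms a toy model with weighted eigen-shapes, applies three rotations and three translations, projects the result onto the image plane and measures the squared distance to 2D feature positions. The rotation order (Rz·Ry·Rx), the orthographic projection and all function names are assumptions chosen for illustration; the source does not specify them, and the search over parameters that minimises the error is not shown.

```python
import math


def rotation(rx, ry, rz):
    """3x3 rotation matrix, assumed here to be composed as Rz * Ry * Rx."""
    cx, sx = math.cos(rx), math.sin(rx)
    cy, sy = math.cos(ry), math.sin(ry)
    cz, sz = math.cos(rz), math.sin(rz)
    Rx = [[1, 0, 0], [0, cx, -sx], [0, sx, cx]]
    Ry = [[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]]
    Rz = [[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]]

    def matmul(a, b):
        return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
                for i in range(3)]

    return matmul(Rz, matmul(Ry, Rx))


def project_model(mean_shape, eigen_shapes, weights, rot, trans):
    """Deform, rotate, translate and project the model vertices to 2D."""
    R = rotation(*rot)
    points_2d = []
    for i, vertex in enumerate(mean_shape):
        # Non-rigid part: mean shape plus weighted eigen-shapes.
        deformed = [vertex[c] + sum(w * shape[i][c]
                                    for w, shape in zip(weights, eigen_shapes))
                    for c in range(3)]
        # Rigid part: global rotation followed by global translation.
        moved = [sum(R[r][c] * deformed[c] for c in range(3)) + trans[r]
                 for r in range(3)]
        # Assumption: orthographic projection (the z co-ordinate is dropped).
        points_2d.append((moved[0], moved[1]))
    return points_2d


def matching_error(projected, features):
    """Sum of squared distances between projected vertices and 2D features."""
    return sum((px - fx) ** 2 + (py - fy) ** 2
               for (px, py), (fx, fy) in zip(projected, features))


# Toy model with two vertices and a single eigen-shape.
mean_shape = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
eigen_shapes = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
projected = project_model(mean_shape, eigen_shapes, weights=[0.5],
                          rot=(0.0, 0.0, math.pi / 2), trans=(1.0, 0.0, 0.0))
error = matching_error(projected, [(1.0, 0.0), (1.0, 1.5)])
print(projected, error)
```

An optimiser would vary the six rigid parameters and the eigen-shape weights to drive this error towards its minimum.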
After adapting the CANDIDE model to the face region, the estimated 3D co-ordinates of a subset of MPEG-4 feature points are computed and converted into FDP and FAP format for calibration and animation purposes, respectively.
Example of Neutral Face, obtained on the "Oscar" model developed at DIST, University of Genoa.
Feature points, through which MPEG-4 defines a set of relevant somatic points on the face, are a key concept. Feature points are subdivided into groups, mainly according to the region of the face they belong to. Each of them is labelled with a number identifying the group it belongs to and a progressive index identifying it within the group.
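The two-part labelling can be illustrated with a small data structure. The sketch below assumes labels written as "group.index" strings and uses made-up co-ordinates; the helper names parse_label and group_feature_points are hypothetical.

```python
from collections import defaultdict


def parse_label(label):
    """Split a 'group.index' label into its two integer parts."""
    group, index = label.split(".")
    return int(group), int(index)


def group_feature_points(points):
    """Arrange {label: (x, y, z)} as {group: {index: (x, y, z)}}, sorted."""
    groups = defaultdict(dict)
    for label, xyz in points.items():
        group, index = parse_label(label)
        groups[group][index] = xyz
    return {g: dict(sorted(members.items()))
            for g, members in sorted(groups.items())}


# Illustrative labels and co-ordinates only.
points = {"8.2": (0.0, -0.3, 0.1), "3.1": (-0.2, 0.4, 0.0), "8.1": (0.0, -0.2, 0.1)}
grouped = group_feature_points(points)
print(grouped)
```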
Every proprietary model available at any decoder must be in a neutral position and all the FDP used for calibration are referred to a neutral face corresponding to the posture represented in the figure above.
MPEG-4 makes some basic assumptions about the neutral position: the co-ordinate system is right-handed and the head axes are parallel to the world axes; the gaze is directed along the Z axis; all face muscles are relaxed; the eyelids are tangent to the iris; the pupil diameter is 1/3 of the iris diameter; the line of the lips is horizontal and at the same height as the lip corners; the mouth is closed, with the upper teeth touching the lower ones; and the tongue is flat and horizontal, with its tip touching the boundary between the upper and lower teeth.
(Left) Frontal and side views of the texture calibration target "Claude"; (right) model "Oscar" reshaped with the feature points of "Claude", with and without texture mapping.
These calibration points are very few (around 80) compared with the total number of vertices in the wire-frame which, depending on its complexity, can number 1000 or more. The VIDAS tools have been employed to animate different models, such as "Mike" and "Oscar", developed at DIST-University of Genoa, and "Miraface", developed at Miralab, University of Geneva. Once the model geometry has been calibrated, the texture information and the texture co-ordinates for each feature point can also be mapped onto the model surface, as shown in the figure above.
Model animation is achieved by supplying corresponding FAP information derived from the analysis of natural sequences by means of the tools mentioned before. In the following figure an example of the achieved results is reported.
Example of facial animation driven by FAP, extracted from natural video analysis, obtained on the "MiraFace" model.
Since the set of calibration guide points, defined in MPEG-4 as feature points, is very limited relative to the complexity of the human head geometry, the tools developed by VIDAS apply suitable interpolation to ensure smooth and realistic surface rendering. Particular care has also been taken with texture adaptation, to avoid annoying artefacts around deformable face features, which become even more noticeable when the model is animated. The methodology for model calibration developed in VIDAS is based on Radial Basis Functions and exploits a priori knowledge of the geometry of human heads.
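The idea of propagating a few feature-point displacements to all the other vertices with Radial Basis Functions can be sketched as follows. This is a generic RBF interpolation under stated assumptions, not the VIDAS method: the Gaussian kernel, its width sigma and the function names are chosen for the example, and the a priori knowledge of head geometry mentioned above is not modelled.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def kernel(p, q, sigma):
    """Gaussian radial basis function (an assumed choice of kernel)."""
    return math.exp(-math.dist(p, q) ** 2 / sigma ** 2)


def rbf_calibrate(vertices, source_points, target_points, sigma=1.0):
    """Move all vertices so that source feature points land on their targets."""
    phi = [[kernel(p, q, sigma) for q in source_points] for p in source_points]
    coeffs = []
    for axis in range(3):
        # Displacement of every feature point along this axis.
        d = [t[axis] - s[axis] for s, t in zip(source_points, target_points)]
        coeffs.append(solve(phi, d))
    moved = []
    for v in vertices:
        k = [kernel(v, p, sigma) for p in source_points]
        moved.append(tuple(v[a] + sum(c * kj for c, kj in zip(coeffs[a], k))
                           for a in range(3)))
    return moved


# Three feature points of a generic model and their calibrated positions.
source = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
target = [(0.0, 0.0, 0.2), (1.1, 0.0, 0.0), (0.0, 1.0, 0.0)]
vertices = source + [(0.5, 0.5, 0.0)]  # one extra vertex that is not a feature point
calibrated = rbf_calibrate(vertices, source, target)
print(calibrated)
```

The feature points reach their targets exactly (up to rounding), while the remaining vertices receive a smoothly interpolated displacement.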
As far as facial animation is concerned, the VIDAS tools are based on a set of pre-defined movements and their possible composition, expressed as a function of each specific MPEG-4 animation parameter. High-level animation parameters, such as those encoding facial expressions (emotions) and acoustically coherent postures of the mouth (visemes), have also been represented in terms of suitable configurations of low-level animation parameters.
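One way to picture the mapping from high-level to low-level parameters is a lookup table that is scaled by intensity and then composed. In the sketch below, the table entries, the parameter names used as keys, the amplitudes and the additive composition rule are all illustrative placeholders; they are not values taken from the standard or from the VIDAS tools.

```python
def expand(table, name, intensity):
    """Turn a high-level parameter into scaled low-level displacements."""
    return {fap: amplitude * intensity for fap, amplitude in table[name].items()}


def compose(*fap_sets):
    """Combine several low-level configurations (assumption: by summation)."""
    total = {}
    for faps in fap_sets:
        for fap, value in faps.items():
            total[fap] = total.get(fap, 0.0) + value
    return total


# Hypothetical configurations for one expression and one viseme.
table = {
    "joy": {"stretch_l_cornerlip": 100, "stretch_r_cornerlip": 100},
    "viseme_o": {"open_jaw": 200, "stretch_l_cornerlip": -50},
}
frame = compose(expand(table, "joy", 0.5), expand(table, "viseme_o", 1.0))
print(frame)
```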
Finally, VIDAS has also adapted and integrated a text-to-speech converter, developed by Elan Informatique SA, into the facial animation decoder thus allowing the synthesis of coherent lip movements in synchronisation with speech.
The main achievement of VIDAS was the creation of a software analysis/synthesis system for reproducing the characteristics, movements and expressions of a human face by means of a synthetic 3D model compliant with MPEG-4 specifications. After decades of research and tool development in the area of virtual character animation, MPEG-4 standardisation has forced convergence among the multitude of proposals and proprietary technologies. The VIDAS tools represent an MPEG-4 compliant solution capable of creating a database of animated face sequences, which can then be integrated within any extended MPEG-4 material. As a result of its standard compliance, VIDAS technology allows the exchange of data readable by any MPEG-4 decoder.
The VIDAS decoder is based on 3D head models of variable complexity, suited to hw/sw platforms of different CPU/graphic power. In addition to the synthesis aspect, the VIDAS tools also perform the analysis of natural audio and video, to provide the calibration and animation parameters necessary for encoding facial characteristics and movements, as specified in MPEG-4.
To achieve this goal, VIDAS participated actively in the MPEG-4 work from 1995 to 1999, with specific reference to the working group SNHC (Synthetic/Natural Hybrid Coding) and to the "ad hoc" group FBA (Face and Body Animation). VIDAS involvement in MPEG-4 has included meeting attendance, email discussion, participation in Core Experiments and testing. On behalf of VIDAS, DIST-University of Genoa hosts the MPEG-4 FBA test data set, Miralab-University of Geneva has donated its "MiraFace" head model to the MPEG-4 Implementation Group and DIST-University of Genoa has provided its "FAE" (Facial Animation Engine) to the MPEG-4 Implementation Studies Group for testing.
A live demo of the project achievements was recently given during the IST'98 Conference in Vienna, November 30 - December 2, 1998, and at the Conference for the launch of the 5th European Research FP in Essen, February 25-26, 1999. After the successful organisation of the 1st Int. Workshop on "SNHC and 3D Imaging" in Rhodes in 1997, VIDAS organised the 2nd edition of the workshop in Santorini (Greece) on September 15-17, 1999.
The system can be used to generate synthetic video with animated faces for inclusion in any MPEG-4 content, such as an animated mail reader, an interactive CD-ROM for teaching foreign languages, a web-based virtual salesman, a virtual guide to museums and exhibitions, or a human-animated interface used to access a service centre via wired or wireless terminals. During the public demos given by VIDAS, visitors were invited to "lend" their face to our synthetic facial model, which was then animated through pre-recorded FAP files. The feedback we got from this experience was very positive and really surprising, since a number of possible applications were suggested, solicited or even recommended.
As an example, this technology could be used to implement a "virtual friend" available at any time for a confidential talk, a "virtual travel agent" giving you information about ticket and accommodation arrangements, a "virtual museum guide" introducing you to historical and cultural heritage, a "virtual salesman" describing a catalogue of products to you or, finally, a "virtual professor" teaching you a foreign language. Regardless of the specific context of the application, what is innovative and of great impact is the possibility of accessing a multitude of web services through an interactive animated face. Moreover, since the facial animation bitstream is MPEG-4 compliant, it could be embedded into a more complex MPEG-4 scene description with added sound, video, text and graphics, which could in turn become part of an even more complex multimedia content distributed through the web, on web-CDs or CD-ROMs, or broadcast.
A number of commercial applications are currently under development that use the VIDAS tools to build interactive interfaces for accessing databases and service centres. For real-time videophone applications, the bitrate required to transmit the synthetic video information has been evaluated at 3-4 kbit/s.
Assessment with hearing-impaired users, through the participation of NAD - Irish National Association for the Deaf, has demonstrated the appreciable positive impact of this technology on deaf users, thanks to the possibility of emphasising the visual articulation of speech.
In the recent past many significant examples have been given of the dramatic impact that Internet technologies are having on our habits and lifestyle: tele-work, tele-training, tele-presence, web-gaming, web-chatting, and so on. There is increasing pressure, coming both from the market and from our own new cultural attitude, to transfer our usual activities and relations, partially or totally, from the real world to the virtual world. In this scenario, with opportunities to share virtual environments where avatars reproduce human behaviour and interact with a multitude of autonomous agents, issues such as media synchronisation and user interactivity are critical. The algorithms developed by VIDAS offer a concrete contribution to anyone interested in implementing MPEG-4 animated faces for developing this kind of user-friendly interface, from both the calibration and the animation points of view. The business potential of the VIDAS technology is clearly impressive. All multimedia applications in which an animated, talking human face can be used as an interface with consumers could benefit immediately from the VIDAS tools, including applications such as introducing Internet customers to services and products through a virtual salesman, and educational multimedia material for children. This technology can be used within interactive CD-ROMs for teaching foreign languages or for speech training for the deaf. Other applications for VIDAS include commercial and industrial motion picture production.
The combination of robust and realistic calibration/animation techniques makes it possible to implement a variety of applications based on virtual animated faces and to transfer these multimedia technologies into actual products and services.
It is our intention to help speed up the process leading from "empty words on paper" to "meaningful instruments" for people, in the area of using MPEG-4 technologies for friendly access to shared environments in applications of electronic commerce, personalised access to cultural content or advanced forms of entertainment. But significant work still has to be done.
An efficient and realistic technology for MPEG-4 facial synthesis like the one developed by VIDAS cannot become a complete product until similar technology is achieved for the automatic capture of human facial parameters from the analysis of natural video and audio. International research in image analysis and facial feature extraction has led to hundreds of good papers in recent years, but no concrete evidence has been produced so far of their effectiveness, generality and robustness. Now that there is a standard, not all facial features deserve equal interest: some are of lower importance than others, and some are useless, since they are not included in the standard at all.
If it is possible to encode and synthesise high-level information, like that associated with visemes and facial expressions, then there should be tools capable of extracting it from natural audio/video sequences. In view of this situation, the VIDAS partners have recently planned to start specific research on the development of multimodal analysis tools for friendly interaction with the facial model.
In conclusion, the opportunities for exploiting this technology are many and very attractive, provided that similar robustness and effectiveness are achieved in the analysis tools necessary for managing multimodal interaction with users. This, in the opinion of the VIDAS partners, is the system component still needed before MPEG-4 facial animation becomes a fully exploitable reality.