
Semantic (Speech/Linguistics) Driven

Facial expression, head motion, and eye motion can be driven automatically from spoken input, which provides a high-level programming interface for 3D facial animation. In this mode of operation, a particular spoken utterance, with its associated intonation and emotion, can be computed independently of the facial model. Once the computation is complete, a facial model can be articulated through the Action Units (AUs) of the Facial Action Coding System (FACS).

The process is as follows:

  1. Phonemes are characterized by their degree of deformability. For each deformable segment, the algorithm looks for the nearby segment whose associated lip shapes influence it, using the look-ahead model of coarticulation [70]. The properties of muscle contraction are taken into account in two ways: (1) spatially, by adjusting the sequence of contracting muscles when antagonistic movements (i.e., movements with very different lip positions, such as a pucker versus a lip extension) succeed each other, and (2) temporally, by checking whether a muscle has enough time to contract before the surrounding lip shape or to relax after it. Both constraints act on the final computation of the lip shapes [111].

  2. Starting from a functional group (lip shapes, conversational signal, punctuator, regulator, or manipulator), algorithms can incorporate synchrony and create coarticulation effects, emotional signals, and eye and head movements [113]. Rules automatically generate the facial actions corresponding to an input utterance. A conversational signal (a movement occurring on an accent, such as raising an eyebrow) starts and ends with the accented word, while punctuator signals (such as smiling) coincide with pauses. Blinking is synchronized at the phoneme level. Head nods and shakes appear on accents and pauses. To signal a change of turn, the speaker's head turns away from the listener at the beginning of a speaking turn and toward the listener at the end.

  3. Facial interaction between agents, and the synchronization of each agent's head and eye movements with the dialogue, are accomplished using Parallel Transition Networks (PaT-Nets), which allow facial coordination rules to be encoded as simultaneously executing finite-state automata [24]. PaT-Nets can call for action in the simulation and make state transitions either conditionally or probabilistically. All face and eye movement behavior for an individual is encoded in a single PaT-Net. Each node of the PaT-Net corresponds to one gaze function. A PaT-Net instance is created, with appropriate parameters, to control each agent. As the agents' PaT-Nets synchronize the agents with the dialogue and interact with the unfolding simulation, they schedule activity that produces the complex interaction behavior observed. Before a PaT-Net executes, its probabilities are set according to the agent's current role as listener or speaker. At each turn change, the probabilities affect actions accordingly.
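
The timing rules of step 2 can be illustrated with a minimal Python sketch. It is not the implementation of [113]: the `Segment` and `FacialAction` types, the action names, and the assignment of head turns to the regulator group are all assumptions made for illustration. The input is assumed to be an utterance already split into timed words and pauses, with accented words marked. Blinking and lip shapes are left out because they need phoneme-level timing.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One timed piece of an utterance (hypothetical input format)."""
    kind: str            # "word" or "pause"
    start: float         # seconds
    end: float           # seconds
    accented: bool = False


@dataclass
class FacialAction:
    """One scheduled facial action (hypothetical output format)."""
    group: str           # functional group that produced the action
    action: str          # illustrative action name
    start: float
    end: float


def schedule_actions(segments):
    """Apply the step-2 timing rules to one speaking turn."""
    actions = []
    if not segments:
        return actions

    first, last = segments[0], segments[-1]

    # Beginning of the turn: the speaker's head turns away from the listener.
    # Placing this in the "regulator" group is an assumption.
    actions.append(
        FacialAction("regulator", "head_turn_away", first.start, first.end))

    for seg in segments:
        if seg.kind == "word" and seg.accented:
            # A conversational signal starts and ends with the accented word.
            actions.append(FacialAction(
                "conversational_signal", "eyebrow_raise", seg.start, seg.end))
            # Head nods appear on accents.
            actions.append(FacialAction(
                "conversational_signal", "head_nod", seg.start, seg.end))
        elif seg.kind == "pause":
            # Punctuator signals coincide with pauses.
            actions.append(
                FacialAction("punctuator", "smile", seg.start, seg.end))
            # Head nods also appear on pauses.
            actions.append(
                FacialAction("punctuator", "head_nod", seg.start, seg.end))

    # End of the turn: the head turns back toward the listener.
    actions.append(
        FacialAction("regulator", "head_turn_toward", last.start, last.end))
    return actions


if __name__ == "__main__":
    turn = [
        Segment("word", 0.0, 0.4),
        Segment("word", 0.4, 0.9, accented=True),
        Segment("pause", 0.9, 1.2),
        Segment("word", 1.2, 1.6),
    ]
    for act in schedule_actions(turn):
        print(act)
```

Each rule produces an interval rather than an instant, so a later stage could translate the scheduled actions into AU activations on the facial model.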

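The PaT-Net mechanism of step 3 can be approximated by a small probabilistic state machine. The sketch below is only an illustration under stated assumptions, not the PaT-Net system of [24]: the class `GazeNet`, the node names, the `turn_change` event, and every probability value are hypothetical. It shows one net per agent, nodes standing for gaze functions, conditional transitions that fire on an event, and probabilistic transitions whose weights depend on the agent's current role.

```python
import random


class GazeNet:
    """Toy stand-in for one agent's PaT-Net (hypothetical, for illustration)."""

    def __init__(self, probabilistic, conditional, start, rng=None):
        # probabilistic: {role: {node: [(next_node, probability), ...]}}
        # conditional:   {(node, event): next_node}
        self.probabilistic = probabilistic
        self.conditional = conditional
        self.node = start
        self.rng = rng if rng is not None else random.Random()

    def step(self, role, event=None):
        """Advance one tick and return the gaze function now active."""
        # A conditional transition fires when its triggering event occurs.
        key = (self.node, event)
        if key in self.conditional:
            self.node = self.conditional[key]
            return self.node
        # Otherwise sample the next node using the weights for this role.
        choices = self.probabilistic[role][self.node]
        nodes = [name for name, _ in choices]
        weights = [prob for _, prob in choices]
        self.node = self.rng.choices(nodes, weights=weights, k=1)[0]
        return self.node


# Placeholder probabilities: these numbers are invented, not taken from [24].
PROBABILISTIC = {
    "speaker": {
        "gaze_at_partner": [("gaze_at_partner", 0.4), ("gaze_away", 0.6)],
        "gaze_away": [("gaze_at_partner", 0.5), ("gaze_away", 0.5)],
    },
    "listener": {
        "gaze_at_partner": [("gaze_at_partner", 0.8), ("gaze_away", 0.2)],
        "gaze_away": [("gaze_at_partner", 0.7), ("gaze_away", 0.3)],
    },
}
CONDITIONAL = {("gaze_away", "turn_change"): "gaze_at_partner"}


if __name__ == "__main__":
    # One net instance per agent, stepped side by side on every tick.
    nets = {
        "A": GazeNet(PROBABILISTIC, CONDITIONAL, "gaze_at_partner"),
        "B": GazeNet(PROBABILISTIC, CONDITIONAL, "gaze_at_partner"),
    }
    roles = {"A": "speaker", "B": "listener"}
    for tick in range(6):
        event = "turn_change" if tick == 3 else None
        if event:
            # At a turn change the agents swap roles.
            roles = {"A": roles["B"], "B": roles["A"]}
        state = {name: net.step(roles[name], event)
                 for name, net in nets.items()}
        print(tick, roles, state)
```

Keeping the role-dependent weights in a table outside the net mirrors the description above, where probabilities are set for the current role before the net executes.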




pkitchin@graphics
Thu Nov 17 10:12:34 EST 1994