This synthesis approach is a descendant of Parke's [105][104][103] parametrically controlled polygon topology synthesis technique, incorporating code developed by Pearce, Wyvill, Wyvill, and Hill [109] and Cohen and Massaro [28][27], and is principally focused on the lower face.
The facial model includes a polygonal representation of a tongue, controlled by four parameters: tongue length, angle, width, and thickness. While the model is a considerable simplification compared to a real tongue, it does contribute a great deal of information to visual speech perception.
In addition to the tongue control parameters, several other parameters that are new relative to the earlier Parke models are used in speech control, including parameters to raise the lower lip, roll the lower lip, and translate the jaw forward and backward. Some parameters have been modified to have more global effects on the synthetic talker's face than in the original Parke model. For example, as the lips are protruded the cheeks pull inward somewhat. Similarly, raising the upper lip also raises part of the face above it.
An important improvement in the visual speech synthesis software has been the development of a new algorithm for articulator control that takes into account the phenomenon of coarticulation [28]. Coarticulation refers to changes in the articulation of a speech segment depending on preceding (backward coarticulation) and upcoming segments (forward coarticulation). An example of backward coarticulation is the difference in articulation of a final consonant in a word depending on the preceding vowel, e.g. ``boot'' vs. ``beet''. An example of forward coarticulation is the anticipatory lip rounding at the beginning of the word ``stew''. The substantial improvement of more recent auditory speech synthesizers, such as MITalk [1] and DECtalk, over the previous generation of synthesizers such as VOTRAX [140], is partly due to the inclusion of rules specifying the coarticulation among neighboring phonemes.
Our approach to the synthesis of coarticulated speech is based on the articulatory gesture model of Lofqvist [84]. A speech segment has dominance over the vocal articulators, which increases and then decreases over time during articulation. Adjacent segments have overlapping dominance functions, which lead to a blending over time of the articulatory commands related to these segments. Given that articulation of a segment is implemented by several articulators, there is a dominance function for each articulator. The different articulatory dominance functions can differ in time offset, duration, and magnitude. Different time offsets between lip and glottal gestures, for example, could capture differences in voicing. The magnitude of each function can capture the relative importance of a characteristic for a segment. For example, a consonant could have a low dominance on lip rounding, which would allow the intrusion of values of that characteristic from adjacent vowels. The variable and varying degree of dominance in this approach naturally captures the continuous nature of articulator positioning. This model, as implemented, provides the total guidance of the facial articulators for speech rather than simply modulating some other algorithm to correct for coarticulation. To instantiate this model it is necessary to select particular dominance and blending functions [28]. For example, when synthesizing the word ``stew'', the consonants /s/ and /t/ have very low dominance, whereas the vowel /u/ has a strong and temporally wide dominance function. Because of the strong dominance of the vowel, its protrusion value spreads through the preceding /s/ and /t/.
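The blending described above can be illustrated with a minimal sketch for a single control parameter (lip protrusion). The text does not specify the dominance or blending functions here, so everything in the sketch is an assumption made for illustration: the negative-exponential dominance shape, the dominance-weighted average used for blending, the names `Segment`, `dominance`, and `blend`, and all numeric values for ``stew''.

```python
import math
from dataclasses import dataclass


@dataclass
class Segment:
    """One speech segment's settings for a single control parameter.

    All fields are hypothetical illustration values, not taken from the system.
    """
    name: str
    center: float     # time of peak dominance, in seconds
    target: float     # target value of the parameter (here: lip protrusion, 0..1)
    magnitude: float  # peak dominance; low values let neighbors intrude
    rate: float       # decay rate; smaller values give a temporally wider function


def dominance(seg: Segment, t: float) -> float:
    """Assumed dominance shape: rises to a peak at seg.center, then falls off."""
    return seg.magnitude * math.exp(-seg.rate * abs(t - seg.center))


def blend(segments: list[Segment], t: float) -> float:
    """Assumed blending rule: dominance-weighted average of segment targets."""
    weights = [dominance(s, t) for s in segments]
    total = sum(weights)
    return sum(w * s.target for w, s in zip(weights, segments)) / total


# ``stew'': weak, narrow dominance for /s/ and /t/ (no protrusion target),
# strong, wide dominance for /u/ (full protrusion target).
stew = [
    Segment("s", center=0.05, target=0.0, magnitude=0.1, rate=20.0),
    Segment("t", center=0.15, target=0.0, magnitude=0.1, rate=20.0),
    Segment("u", center=0.30, target=1.0, magnitude=1.0, rate=5.0),
]

if __name__ == "__main__":
    for t in (0.05, 0.15, 0.30):
        print(f"t={t:.2f}s  protrusion={blend(stew, t):.3f}")
```

With these assumed values, the blended protrusion is already well above one half at the peak of /s/, which mirrors the anticipatory rounding described for ``stew''. A fuller sketch would keep one such set of functions per articulator, each with its own time offset, duration, and magnitude.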
For simultaneous visual-auditory speech synthesis from English text, this system uses common higher-level software to translate the text into the segment, stress, and duration information required to drive both the visual and auditory synthesis modules. To carry out this higher-level analysis, the MITalk [1] software has been integrated with the facial synthesis software. In addition to providing phonemes, MITalk supplies syntactic and lexical boundary information that is used to control other facial behavior, such as eyebrow raising, blinking, and eye and head movements. Currently running on an SGI Crimson-VGX, the system can produce high-quality real-time simultaneous visual-auditory speech up to about a minute in length after a short pause.