Most face models used for face animation do not include the components needed to articulate the oral cavity properly. This is due in part to the limited supply of oral cavity parameter sources. A face model that a human lipreader will use for speech understanding has more stringent requirements than one used only to produce natural-looking speech. For example, proper modeling of the tongue is very important for lipreading but only moderately important for other applications. Determining which visual parameters matter most for speech understanding when combined with acoustic speech of high to low quality is an active research area. In addition, the dynamics, or temporal derivatives, of the visual speech parameters should be represented in the model [56].
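To make the notion of temporal derivatives concrete, the following minimal sketch estimates the rate of change of a single visual speech parameter (for example, mouth opening height) sampled once per video frame. The function name `delta_features`, the central-difference scheme, and the sample values are assumptions chosen for illustration; they are not taken from [56].

```python
def delta_features(track, frame_rate_hz):
    """Estimate the time derivative of a per-frame parameter track.

    Uses a central difference in the interior and a one-sided
    difference at the two ends. Units are parameter units per second.
    """
    n = len(track)
    if n < 2:
        return [0.0] * n  # no neighbor to difference against
    deltas = []
    for i in range(n):
        lo = max(i - 1, 0)
        hi = min(i + 1, n - 1)
        # (hi - lo) frames separate the two samples being compared
        deltas.append((track[hi] - track[lo]) * frame_rate_hz / (hi - lo))
    return deltas


# Hypothetical mouth-opening heights (arbitrary units) at 10 frames per second
opening = [0.0, 2.0, 0.0]
velocity = delta_features(opening, 10.0)
```

A model that stores both the parameter track and such a derivative track can distinguish a mouth that is opening from one that is closing at the same instantaneous height.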
The most common sources of visual speech parameters are text-driven models of speech articulation. The front end of these models is very similar to that of traditional text-to-speech models, except that visemes are generated in addition to phonemes. The output of an acoustic speech synthesizer must be synchronized with the visual speech, which forces accurate simultaneous modeling of acoustic and visual speech articulation. This is also an active research area. Systems of this type are described in [95][94][27].
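The synchronization requirement can be illustrated with a small sketch in which the phoneme durations chosen by the front end drive both streams from a single clock. Everything here is hypothetical: the ARPAbet-style symbols, the four viseme classes, the `neutral` fallback, and the function `viseme_track` are illustrative assumptions and do not reproduce the systems in [95][94][27].

```python
# Hypothetical, deliberately tiny phoneme-to-viseme table.
# The grouping is illustrative only, not an established mapping.
PHONEME_TO_VISEME = {
    "P": "bilabial", "B": "bilabial", "M": "bilabial",
    "F": "labiodental", "V": "labiodental",
    "AA": "open", "AE": "open",
    "UW": "rounded", "OW": "rounded",
}


def viseme_track(timed_phonemes):
    """Turn (phoneme, duration_ms) pairs into (viseme, start_ms, end_ms).

    Start and end times are accumulated from the same durations the
    acoustic synthesizer would use, so the two streams share one clock.
    Consecutive phonemes in the same viseme class are merged.
    """
    segments = []
    t = 0
    for phoneme, duration_ms in timed_phonemes:
        # Unknown phonemes fall back to a neutral mouth shape (assumption)
        viseme = PHONEME_TO_VISEME.get(phoneme, "neutral")
        end = t + duration_ms
        if segments and segments[-1][0] == viseme:
            segments[-1] = (viseme, segments[-1][1], end)
        else:
            segments.append((viseme, t, end))
        t = end
    return segments


track = viseme_track([("M", 80), ("AA", 120), ("P", 60), ("B", 40)])
```

Merging adjacent phonemes that share a viseme class is one possible design choice; it reflects the fact that several phonemes can look alike on the lips, but a model that represents coarticulation would likely need to keep them separate.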
A few researchers have attempted to derive phonemes from acoustic speech and then map them to visemes as an aid to understanding telephonic speech for the hearing impaired. Unfortunately, phoneme classification errors are compounded in the mapping to visemes, which is not one-to-one and has not yet been completely established. However, an advantage of this approach is the theoretical availability of emotion, prosody, and non-phoneme sounds, which can imply changes in facial expression. Natural language understanding of text is also being used to generate facial expressions [112].
Cartoon animators have been mapping acoustic speech to lip movements for over 70 years. In some cases, this was done directly from filmed speech using the rotoscope. Several systems developed within the last decade automatically extract visual speech parameters from talking faces using a video camera and frame buffering. The simplest approach to extracting lip movements visually is to place dots on the lips and track them from frame to frame for lipreading [131][46][47][16]. The disadvantages of this approach are the low resolution of the lip parameters and its invasiveness. Another approach, which provides higher accuracy but is still invasive, is to paint the lips to aid grayscale thresholding [12][91][8][57][134]. Both the inner and outer lip contours are obtained with this method. The system described in [56][51][116] obtained a combination of inner lip contour and teeth/tongue reflection using a head-mounted camera, fixed lighting, and nostril tracking to avoid face paint. The same system was used to obtain a pure inner lip contour by blacking out the teeth, as described in [15]. The system described in [118][117] was a precursor to the [116][15] system but did not use a head-mounted camera and relied entirely on nostril tracking to locate the oral cavity in a sequence of facial images. Optical flow was used in [90] to measure mouth motion dynamics after manual definition of four rectangular mouth regions. Viseme recognition from manually windowed mouth images was used in [12].
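As a rough illustration of the grayscale thresholding idea, the sketch below assumes that the painted lips appear darker than the surrounding skin, so that a single fixed threshold separates them. The function `lip_extent`, the threshold value, and the toy frame are hypothetical and do not reproduce any of the cited systems. The sketch recovers only the bounding box of the lip region, which is a much coarser measurement than the inner and outer contours those systems obtain.

```python
def lip_extent(image, threshold):
    """Return (width, height) in pixels of the thresholded lip region.

    `image` is a list of rows of grayscale values (0 = black).
    Pixels darker than `threshold` are treated as painted lip.
    Returns None when no pixel falls below the threshold.
    """
    rows = [r for r, row in enumerate(image) if any(p < threshold for p in row)]
    cols = [c for row in image for c, p in enumerate(row) if p < threshold]
    if not rows:
        return None
    width = max(cols) - min(cols) + 1
    height = max(rows) - min(rows) + 1
    return (width, height)


# Toy 5x6 frame: bright skin (200) with a dark painted lip ring (40)
frame = [
    [200, 200, 200, 200, 200, 200],
    [200,  40,  40,  40,  40, 200],
    [200,  40, 200, 200,  40, 200],
    [200,  40,  40,  40,  40, 200],
    [200, 200, 200, 200, 200, 200],
]
extent = lip_extent(frame, 100)
```

Applying the same measurement to every frame of a sequence would yield width and height tracks that could serve as simple visual speech parameters, provided the lighting is stable enough for one threshold to hold throughout.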
The measurement of tongue position is very difficult. Only a coarse indication of forward tongue presence was available in [56][51][116][15][118][117]. Detailed tongue information was obtained in [157] using electromagnetic receivers glued to the tongue and teeth.