Perceptual Paradigms

Perception-based validation raises a number of issues. The first, discussed earlier, concerns the size and type of the test stimuli. Different types of stimuli may engage different perceptual processes. For example, short nonsense words would presumably tap only low-level perceptual processes, whereas real words also involve lexical processes. The lexical information available for real words could aid recognition by constraining the response alternatives. Similarly, when meaningful sentences are used, subjects can use syntactic and semantic constraints to improve performance. Two related issues in stimulus selection concern the complexity of the test units and how the low-level segments are specified. As an example of the first, in order to test segment coarticulation rules, it would be desirable to include words incorporating segment clusters (e.g., consonant clusters and diphthongs) rather than the usual CVC words. As an example of the second, one might simply specify segment identities and durations analyzed from an actual human talker. However, since many systems will incorporate text-to-speech translation, it may be more appropriate to use the segment identities and durations derived from the translation module. To help pinpoint problems, we might actually want to test the systems while either including or bypassing various modules.

A second issue concerns the method of collecting responses from the observers. In some early tests (the Diagnostic Rhyme Test (DRT) [139] and the Modified Rhyme Test (MRT) [61][45]), observers were presented with short words and a closed set of alternative responses (2 for DRT, 6 for MRT) which differed in the initial consonant. For example, if the test word was bat, the response alternatives might be bat, cat, rat, sat, mat, and fat. Although the closed-set format may be informative about the overall level of performance, it does not yield much information about the confusions made by the perceiver.

Although more difficult to score, a better approach is the open response method in which the observer simply reports the word heard. These response words can then be broken into constituent segments and compared with the segments actually presented, to form confusion matrices. For example, we can look at the responses in terms of actual initial consonant presented and perceived initial consonant. For sentence-length test material, this sort of analysis would have to be preceded by algorithms for alignment of the stimulus and response strings. Some examples of these algorithms are the NIST String Alignment and Scoring Program [63], and the sequence comparator of Bernstein, Demorest, and Eberhardt [11].
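
As a rough illustration of the scoring steps just described, the following Python sketch aligns a stimulus segment string with a response segment string using a plain unit-cost edit distance, then tallies (presented, perceived) pairs into confusion counts. It is not the NIST program [63] or the sequence comparator of [11]; the names align, confusion_counts, and GAP are hypothetical, and a real scorer would probably weight substitutions by phonetic similarity.

```python
from collections import Counter

GAP = "-"  # hypothetical marker for a segment with no counterpart


def align(stimulus, response):
    """Align two segment sequences with unit-cost edit distance."""
    n, m = len(stimulus), len(response)
    # cost[i][j] = minimum edits turning stimulus[:i] into response[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (stimulus[i - 1] != response[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Trace back from the end, preferring match/substitution steps.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                cost[i][j] == cost[i - 1][j - 1]
                + (stimulus[i - 1] != response[j - 1])):
            pairs.append((stimulus[i - 1], response[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((stimulus[i - 1], GAP))  # segment not reported
            i -= 1
        else:
            pairs.append((GAP, response[j - 1]))  # extra reported segment
            j -= 1
    pairs.reverse()
    return pairs


def confusion_counts(trials):
    """Tally (presented, perceived) segment pairs over all trials."""
    counts = Counter()
    for stimulus, response in trials:
        counts.update(align(stimulus, response))
    return counts
```

Here each trial is a pair of segment lists, for example (["b", "a", "t"], ["p", "a", "t"]). The resulting counts can be read as a sparse confusion matrix, with GAP rows and columns recording insertions and deletions.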

Given the confusion data for natural and synthetic faces, we can assess overall agreement and particular problem areas in our facial synthesis strategies. In addition to the direct comparison of confusion data, these strategies can be further analyzed using MDS techniques and the resulting multidimensional spatial representations can be compared. By using techniques such as INDSCAL, multidimensional representations with common axes (though different dimensional weights) can be obtained. One may be able to then characterize different facial synthesis systems on the basis of these dimensional weights.
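
One simple way to quantify the overall agreement mentioned above is sketched below in Python, under our own assumptions: each confusion matrix is a square list of rows of raw counts (row = presented segment, column = perceived segment), agreement is taken as the Pearson correlation between corresponding response proportions, and dissimilarities for a later MDS step are derived by symmetrizing off-diagonal confusion proportions. The function names and the 1 - similarity transform are illustrative choices, not a method prescribed by the text; the MDS or INDSCAL fit itself is left to a dedicated package.

```python
import math


def row_proportions(counts):
    """Convert each row of raw counts to response proportions."""
    result = []
    for row in counts:
        total = sum(row)
        result.append([c / total if total else 0.0 for c in row])
    return result


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def matrix_agreement(natural, synthetic):
    """Correlate corresponding cells of two same-shaped confusion matrices."""
    a = [p for row in row_proportions(natural) for p in row]
    b = [p for row in row_proportions(synthetic) for p in row]
    return pearson(a, b)


def dissimilarities(counts):
    """Symmetric dissimilarity matrix: segments that are rarely confused
    with each other in either direction are far apart."""
    p = row_proportions(counts)
    k = len(p)
    return [[0.0 if i == j else 1.0 - (p[i][j] + p[j][i]) / 2
             for j in range(k)] for i in range(k)]
```

Low agreement in particular rows would point to the segments where the synthetic face departs most from the natural one.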

We should note that it is important to examine the confusions of different segment types and positions. For example, systems may do well on consonants but relatively poorly on vowels. Similarly, the transmission of consonants may vary depending on whether they occur in initial position, final position, or clusters.
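
A breakdown of this kind can be computed directly from scored trials. The Python sketch below assumes a hypothetical trial record of (segment type, position, correct) and reports the proportion correct for each combination; the field layout and the function name are our own.

```python
from collections import defaultdict


def accuracy_by_condition(trials):
    """Proportion correct per (segment type, position) combination."""
    totals = defaultdict(lambda: [0, 0])  # key -> [correct, presented]
    for seg_type, position, correct in trials:
        cell = totals[(seg_type, position)]
        cell[0] += int(correct)
        cell[1] += 1
    return {key: c / n for key, (c, n) in totals.items()}
```

For instance, a trial list containing ("consonant", "initial", True) and ("consonant", "initial", False) yields 0.5 for initial consonants.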

In addition to analysis of accuracy and confusion data, a couple of other types of data may be of value. The first type is the response latency of the human observer. For example, for the auditory modality, even with the same level of accuracy, one may observe longer latencies for synthetic than natural speech, and different latencies for different synthesizers. One reason for this might be the relative lack of redundant information available in the synthetic forms [35]. For sentence-length materials we can also see differences in latencies for verification of sentence truth for natural and synthetic speech.

The second type of data is obtained by collecting quality ratings on natural and synthetic faces. For example, we can ask whether the speech is too fast or too slow, how easy it is to lipread, and how pleasing and realistic the face is. These ratings can then be compared.

Because synthetic visual speech may often be accompanied by auditory speech, it will be useful to test combined materials. An efficient evaluation technique is to measure how much intelligibility the visual speech adds to the auditory intelligibility, at various levels of acoustic degradation, using a reference natural face and the model(s) we want to test. From that perspective, Le Goff, Guiard-Marigny, Cohen, and Benoit [54] compared the audiovisual intelligibility of the same corpus (18 phonetically constrained VCVCV sequences) with the same acoustic (naturally uttered) material. They used five different conditions of added noise, across four conditions of visual display: no visual display (audio alone), the natural face of the speaker, the 3D model of the whole face (Parke modified by Cohen), and a 3D model of the lips. The results showed that the whole natural face restores two-thirds of the missing auditory intelligibility when the acoustic transmission is degraded or missing; the facial model (tongue movements excluded) restores half of it; and the lip model restores a third.
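
The phrase "restores a fraction of the missing auditory intelligibility" can be made concrete with a small calculation. The Python sketch below assumes, as our reading rather than a formula quoted from [54], that the restored fraction is the audiovisual gain divided by the room left for improvement over audio alone, with both scores expressed as proportions correct. The function name is hypothetical.

```python
def visual_benefit(audio_only, audiovisual):
    """Fraction of the missing auditory intelligibility restored by vision.

    Both arguments are proportions correct in [0, 1] measured at the
    same noise level.
    """
    missing = 1.0 - audio_only
    if missing <= 0:
        raise ValueError("audio-only score leaves nothing to restore")
    return (audiovisual - audio_only) / missing
```

With made-up scores (not data from the study), an audio-only score of 0.4 and an audiovisual score of 0.8 give a restored fraction of 0.4 / 0.6, or about two-thirds. Computing this value separately at each noise level allows the natural face and each model to be compared on a common scale.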

It should be noted that the evaluation techniques presented here in the context of linguistic transmission can also be used for paralinguistic information (for example, emotion). In the auditory modality, for example, Cahn [22] analyzed confusions in the perception of the emotional content of sentences. In her study, five sentences were synthesized, each presented in six emotions in a variety of random orders. The observers then identified which of the six emotions was intended. Similar experiments can be carried out with synthetic faces at the sentence and smaller unit levels, and compared also with natural face transmission.

An issue not yet discussed concerns the selection of our perceptual observers. Who should they be? It may be that a number of different types are valuable. For example, in some early stages of development we may want to use experts such as trained FACS coders for evaluation of paralinguistic information transmission or expert lipreaders for evaluation of speech. These observers, in addition to analysis of their recognition levels and confusions, might be able to offer valuable qualitative insights about the synthesis systems. However, we would also want to test our systems with more naive observers who would be the typical end-users of our systems. The latter approach has been used in the majority of prior studies.

Another issue concerns internationalization. To allow the widest range of researchers to compare their work, multilingual intelligibility tests are valuable. Some recent advances in this area are given in [7][126][10].

A final issue of concern when evaluating speech intelligibility with human observers is that of learning. This issue has received considerable attention in the context of training lipreading [142][52], electrotactile [1], and unfamiliar speech distinctions [85]. The important point from these studies is that performance may change considerably with experience, so it is important to examine how well observers do with our synthetic faces both at first sight and later when they are more familiar with them.





pkitchin@graphics
Thu Nov 17 10:12:34 EST 1994