Given the methods outlined in Section for the analysis of facial information, we are faced with the problem of comparing natural and synthetic information. To begin with, for any given measurement technique one might obtain measurements from the synthetic face in two ways. First, we could simply feed the synthetic images through the same analysis tools used on the natural face. Note, however, that 3D measurements based on multiple views would require synthesis of those views. An alternative approach would be to make the measurements directly from the surfaces of the synthetic face.
Once we have obtained two sets of measurements, how should they be compared? It is often of interest simply to examine the pattern of differences one feature at a time. For example, we might want to know whether the synthetic jaw rotates to the same degree as the natural one.
It is also of interest to evaluate the global agreement of natural and synthetic information. In a sense, the features locate test segments (e.g., phonemes or visemes) in a multidimensional space, and our task is to compare the spatial arrangements of two sets of points. One way to do this would be to correlate the two sets of measures. Another would be to use an error function, such as the absolute or squared difference between measures. For any of these methods, one would also want to examine both instantaneous agreement (frame by frame) and dynamic agreement (how the measures change over time).
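As a concrete illustration, the following Python sketch computes correlation and absolute/squared error statistics for one feature trajectory, both on the instantaneous values and on their frame-to-frame changes. The array names and the jaw-rotation example data are hypothetical, and the sketch assumes the natural and synthetic trajectories are already time-aligned and equally sampled.

```python
# A minimal sketch of agreement statistics between natural and synthetic
# feature trajectories. Names and data are hypothetical; both trajectories
# are assumed to be time-aligned and sampled at the same rate.
import numpy as np

def agreement_stats(natural, synthetic):
    """Correlation and error statistics for one feature trajectory."""
    natural = np.asarray(natural, float)
    synthetic = np.asarray(synthetic, float)
    diff = synthetic - natural
    stats = {
        "pearson_r": np.corrcoef(natural, synthetic)[0, 1],  # correlation of the measures
        "mean_abs_error": np.mean(np.abs(diff)),              # absolute-difference error
        "rms_error": np.sqrt(np.mean(diff ** 2)),             # squared-difference error
    }
    # Dynamic agreement: compare frame-to-frame changes rather than raw values.
    d_nat, d_syn = np.diff(natural), np.diff(synthetic)
    stats["dynamic_pearson_r"] = np.corrcoef(d_nat, d_syn)[0, 1]
    return stats

# Hypothetical jaw-rotation trajectories (degrees per video frame).
t = np.linspace(0, 1, 50)
natural_jaw = 10 * np.sin(2 * np.pi * t)
synthetic_jaw = 9 * np.sin(2 * np.pi * t + 0.1)
print(agreement_stats(natural_jaw, synthetic_jaw))
```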
A further complication is that the measurements in a set may not all cover the same range or be of equivalent importance. For example, lip width might be much more important than upper lip raising. Thus, we might want to weight the goodness metric differently for different measurement components. How would these weights be determined? One way would be to evaluate how important each measure is for optimal recognition. For example, Finn  used an algorithm to obtain the best weighting for recognition. Goldschen  performed a similar examination of various measures in selecting which measurements to use for recognition.
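The sketch below shows one way such a weighted goodness metric might look. The feature names and weight values are hypothetical; in practice the weights might be obtained by optimizing recognition performance, as in the work cited above.

```python
# A minimal sketch of a weighted goodness metric over several measurement
# components. Feature names and weights are hypothetical.
import numpy as np

def weighted_rms_error(natural, synthetic, weights):
    """Weighted RMS error; rows are frames, columns are features."""
    natural = np.asarray(natural, float)
    synthetic = np.asarray(synthetic, float)
    w = np.asarray(weights, float)
    w = w / w.sum()                       # normalize so the weights sum to one
    sq_err = (synthetic - natural) ** 2   # per-frame, per-feature squared error
    return np.sqrt(np.mean(sq_err @ w))   # weight the features, then average over frames

# Hypothetical example: lip width weighted more heavily than upper-lip raising.
features = ["lip_width", "upper_lip_raise", "jaw_rotation"]
weights = [0.6, 0.1, 0.3]
natural = np.random.rand(100, 3)
synthetic = natural + 0.05 * np.random.randn(100, 3)
print(weighted_rms_error(natural, synthetic, weights))
```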
Other, higher-level analyses of the measured features might also be of value in comparing facial articulations. For example, treating the measurements multidimensionally across multiple measurement sets, such as those from different facial synthesizers, we might examine which weightings of the measurement dimensions bring the sets into closest agreement. As an example of another possible approach, Benoit, Lallouache, Mohamadi, and Abry  used dynamic clustering and correspondence analysis to classify the visemes used in a language. The same kind of analysis could be applied to natural and synthetic measurements to assess their similarity.
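As an illustration only (not the method of Benoit et al.), the following sketch clusters per-segment feature vectors from each measurement set with k-means and scores how well the two groupings agree using the adjusted Rand index. The data, the number of clusters, and the use of scikit-learn are assumptions, and the two sets are assumed to contain the same test segments in the same order.

```python
# A minimal sketch of comparing viseme groupings derived from natural and
# synthetic measurements, using k-means and the adjusted Rand index as a
# simpler stand-in for dynamic clustering and correspondence analysis.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def viseme_agreement(natural_vectors, synthetic_vectors, n_visemes=8, seed=0):
    """Cluster each measurement set and score how well the groupings agree."""
    nat_labels = KMeans(n_clusters=n_visemes, random_state=seed, n_init=10).fit_predict(natural_vectors)
    syn_labels = KMeans(n_clusters=n_visemes, random_state=seed, n_init=10).fit_predict(synthetic_vectors)
    # 1.0 means identical groupings of the test segments; ~0 means chance agreement.
    return adjusted_rand_score(nat_labels, syn_labels)

# Hypothetical example: one feature vector per test segment (e.g., per viseme token),
# with the same segments in the same order in both sets.
natural = np.random.rand(200, 6)
synthetic = natural + 0.02 * np.random.randn(200, 6)
print(viseme_agreement(natural, synthetic))
```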
Using these techniques we can arrive at metrics of agreement between the behavior of natural and synthetic faces. However, while these metrics are of some value, they do not necessarily answer all of our validation concerns. First, different types of facial synthesis systems have been constructed for different purposes, and it may not be appropriate to judge them all in the same way. For example, a system which simply uses analyzed features to drive synthesis may show less error than one which takes its input from text, but it is also far less flexible. A second consideration is that the synthetic face may look nothing like a natural one; for example, we may be mapping human measurements to a dog's face . This would of course result in large errors. Finally, how do we know that the selected measurements truly reflect the linguistic information conveyed facially? A partial answer was given above: we can weight the measurements by how important they are for machine recognition. But this may be misleading, because the features used by machines may differ from those used by human observers. Since the ultimate consumers of synthetic faces are humans, it is also essential to validate our work with human perceptual tests.