We assume that the input to GEN consists of strings of vowels V and consonants C. GEN allows each segment to play a role in the syllable or to remain ``unparsed''. A syllable contains at least a nucleus and possibly an onset and a coda.
Let us assume that GEN marks these roles by inserting labeled brackets around each input element. An input consonant such as b will have three outputs: O[b] (onset), D[b] (coda), and X[b] (unparsed). Each vowel such as a will have two outputs, N[a] (nucleus) and X[a] (unparsed). In addition, GEN ``overparses'', that is, it freely inserts empty onset O[ ], nucleus N[ ], and coda D[ ] brackets.
For the sake of concreteness, we give here an explicit definition of GEN using the notation of the Xerox regular expression calculus (Karttunen et al ). We define GEN as the composition of four simple components, Input, Parse, OverParse, and SyllableStructure. The definitions of the first three components are shown in Figure 4.
define Input      [C | V]* ;
define Parse      C -> ["O[" | "D[" | "X["] ... "]" .o. V -> ["N[" | "X["] ... "]" ;
define OverParse  [. .] (->) ["O[" | "N[" | "D["] "]" ;
A replace expression of the type A -> B ... C in the Xerox calculus denotes a relation that wraps the prefix strings in B and the suffix strings in C around every string in A. Thus Parse is a transducer that inserts appropriate bracket pairs around input segments. Consonants can be onsets or codas, or be ignored. Vowels can be nuclei or be ignored. OverParse optionally inserts unfilled onsets, nuclei, and codas. The dotted brackets [. .] specify that only a single instance of a given bracket pair is inserted at any position.
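The behavior of Parse and OverParse can be illustrated with a small brute-force sketch in Python. This is only an emulation for intuition, not the finite-state construction itself: the function names `parse` and `overparse`, the segment inventory, and the decision to treat each bracketed segment as one unit (so that empty brackets are inserted only between units and at the edges) are all assumptions made for the example.

```python
from itertools import product

# Assumed segment inventory, just enough for the examples in the text.
CONSONANTS = set("bcdr")
VOWELS = set("a")

def parse(segments):
    """Emulate Parse: wrap each consonant in O[ ], D[ ] or X[ ],
    and each vowel in N[ ] or X[ ]. Returns one list of units per output."""
    choices = []
    for seg in segments:
        if seg in CONSONANTS:
            labels = ["O", "D", "X"]
        elif seg in VOWELS:
            labels = ["N", "X"]
        else:
            raise ValueError(f"not a C or V segment: {seg!r}")
        choices.append([f"{label}[{seg}]" for label in labels])
    # One output for every combination of per-segment choices.
    return [list(combo) for combo in product(*choices)]

def overparse(units):
    """Emulate OverParse: at every position (before, between, and after
    the units) insert either nothing or a single empty bracket pair."""
    fillers = ["", "O[]", "N[]", "D[]"]
    results = []
    for gaps in product(fillers, repeat=len(units) + 1):
        pieces = [gaps[0]]
        for unit, filler in zip(units, gaps[1:]):
            pieces += [unit, filler]
        results.append("".join(pieces))
    return results
```

Under these assumptions, `parse("ba")` has 3 × 2 = 6 outputs, and `overparse(["N[a]"])` has 4 × 4 = 16 outputs, most of which are not well-formed syllabifications and still have to be filtered.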
The role of the fourth GEN component, SyllableStructure, is to constrain the output of Parse and OverParse. A syllable needs a nucleus; onsets and codas are optional; they must be in the right order; unparsed elements may occur freely. For the sake of clarity, we define SyllableStructure with the help of four auxiliary terms (Figure 5).
define Onset              "O[" (C) "]" ;
define Nucleus            "N[" (V) "]" ;
define Coda               "D[" (C) "]" ;
define Unparsed           "X[" [C|V] "]" ;
define SyllableStructure  [[(Onset) Nucleus (Coda)]/Unparsed]* ;
Round parentheses in the Xerox regular expression notation indicate optionality. Thus (C) in the definition of Onset indicates that onsets may be empty or filled with a consonant. Similarly, (Onset) in the definition of SyllableStructure means that a syllable may or may not have an onset. The effect of the / operator is to allow unparsed consonants and vowels to occur freely within a syllable. The disjunction [C|V] in the definition of Unparsed allows consonants and vowels to remain unparsed.
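The SyllableStructure constraint can be approximated with an ordinary Python regular expression. The sketch below is an assumption-laden emulation: the names `SYLLABLE_STRUCTURE` and `is_well_formed` and the consonant and vowel classes are hypothetical, and the / (ignore) operator is modeled only at the granularity of whole bracket pairs, that is, unparsed elements may appear before, between, and after the onset, nucleus, and coda of a syllable, but not inside a bracket pair.

```python
import re

C = "[bcdr]"  # assumed consonant class
V = "[a]"     # assumed vowel class

ONSET = rf"O\[{C}?\]"              # "O[" (C) "]"
NUCLEUS = rf"N\[{V}?\]"            # "N[" (V) "]"
CODA = rf"D\[{C}?\]"               # "D[" (C) "]"
UNPARSED = rf"X\[(?:{C}|{V})\]"    # "X[" [C|V] "]"

# Any number of unparsed elements, standing in for the / operator.
U = rf"(?:{UNPARSED})*"

# [(Onset) Nucleus (Coda)]/Unparsed, with unparsed material between the parts.
SYLLABLE = rf"{U}(?:{ONSET}{U})?{NUCLEUS}{U}(?:{CODA}{U})?"
SYLLABLE_STRUCTURE = re.compile(rf"(?:{SYLLABLE})*")

def is_well_formed(candidate):
    """Return True if the whole bracketed string is a sequence of syllables."""
    return SYLLABLE_STRUCTURE.fullmatch(candidate) is not None
```

For instance, `is_well_formed("O[b]N[a]D[r]")` and `is_well_formed("X[a]N[]")` hold, whereas `"D[]N[a]"` (a coda before its nucleus) and `"X[a]"` (unparsed material with no syllable to attach to) are rejected.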
With these preliminaries we can now define GEN as a simple composition of the four components (Figure 6).
With the appropriate definitions for C (consonants) and V (vowels), the expression in Figure 6 yields a transducer with 22 states and 229 arcs.
It is not necessary to include Input in the definition of GEN, but doing so has a technically beneficial effect. The constraints have less work to do when it is made explicit that the auxiliary bracket alphabet is not included in the input.
Because GEN over- and underparses with wild abandon, it produces a large number of output candidates even for very short inputs. For example, applying GEN to the string a yields a relation with 14 strings on the output side (Figure 7).
N[a]  N[a]N[]  N[a]D[]  N[]N[a]  N[]N[a]N[]  N[]N[a]D[]  N[]X[a]  N[]X[a]N[]  N[]X[a]D[]  O[]N[a]  O[]N[a]N[]  O[]N[a]D[]  O[]X[a]N[]  X[a]N[]
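The count of 14 candidates for the input a can be checked with a brute-force enumeration in Python. This is a sketch under stated assumptions rather than the transducer composition used in the paper: `gen` is a hypothetical helper, the segment inventory is assumed, empty brackets are written O[], N[], and D[], and the approach is practical only for very short inputs because it enumerates every candidate explicitly.

```python
import re
from itertools import product

C, V = "bcdr", "a"  # assumed segment inventory

# Well-formedness filter: a sequence of syllables with an optional onset,
# an obligatory nucleus, an optional coda, and freely interspersed
# unparsed elements.
U = rf"(?:X\[[{C}{V}]\])*"
SYLLABLE = rf"{U}(?:O\[[{C}]?\]{U})?N\[[{V}]?\]{U}(?:D\[[{C}]?\]{U})?"
WELL_FORMED = re.compile(rf"(?:{SYLLABLE})*")

def gen(word):
    """Enumerate all bracketed candidates for a short input string."""
    if any(seg not in C + V for seg in word):
        raise ValueError("input must consist of C and V segments only")
    # Parse: each consonant is O, D or X; each vowel is N or X.
    per_segment = [
        [f"{role}[{seg}]" for role in ("ODX" if seg in C else "NX")]
        for seg in word
    ]
    # OverParse: at most one empty bracket pair at each position.
    fillers = ["", "O[]", "N[]", "D[]"]
    candidates = set()
    for units in product(*per_segment):
        for gaps in product(fillers, repeat=len(word) + 1):
            s = gaps[0] + "".join(u + g for u, g in zip(units, gaps[1:]))
            if WELL_FORMED.fullmatch(s):  # SyllableStructure filter
                candidates.add(s)
    return sorted(candidates)

print(len(gen("a")))  # 14
```

The enumeration grows exponentially with input length, which is why a string such as abracadabra is better handled by the compact network representation than by listing candidates.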
The number of output candidates for abracadabra is nearly 1.7 million, although the network representing the mapping has only 193 states. It is evident that working with finite-state tools has a significant advantage over manual tableau methods.