Xerox Research Centre Europe
Copyright © 1996-97 Xerox Corporation. All rights reserved.

Examples of Networks and Regular Expressions

This page contains examples to illustrate the compilation of finite-state networks from regular expressions. You can copy and paste the formulas directly into the main window of the Finite-State Compiler page. If you can suggest interesting or amusing examples to add to this page, we would like hear from you!

  1. Lexical Transducer
  2. Elision and Contraction
  3. Noun Phrase Parser
  4. Syllabification
  5. Plus/Minus One
  6. Soft-drink Machine
  7. Better Soft-drink Machine
  8. Soundex

1. Lexical Transducer

A lexical transducer is a lemmatizer that relates inflected surface forms to the corresponding lexical forms. The lexical form consists of the canonical citation form of the word and some number of morphological tags that describe the inflection. The expression in Figure 1.1 compiles into a transducer that encodes a small subset of a lexical transducer for English. The size of the network is 16 states, 22 arcs.

[ [l e a v e %+VBZ .x. l e a v e s] |
[l e a v e %+VB .x. l e a v e] |
[l e a v e %+VBG .x. l e a v i n g] |
[l e a v e %+VBD .x. l e f t] |
[l e a v e %+NN .x. l e a v e] |
[l e a v e %+NNS .x. l e a v e s] |
[l e a f %+NNS .x. l e a v e s ] |
[l e f t %+JJ .x. l e f t] ]
Figure 1.1: A lexical transducer

Lexical transducers are typically derived by composing a network of lexical forms with rule transducers that encode the regular morphological alternations of the language (Karttunen 1994). For the sake of illustrating the concept, the expression in Figure 1.1 simply maps each lexical form to the corresponding surface form with the help of the crossproduct (.x.) operator and constructs the union of the pairs.

The lexical forms constitute the upper language of the relation, the surface forms are on the lower side. Consequently, applying the transducer 'downwards' to a lexical form generates the corresponding surface form; 'upwards' application to a surface form produces an analysis. Note that some surface forms correspond to more than one lexical form. Figure 1.2 illustrates the application of the transducer in both directions.

APPLY DOWN> leave+VBD
left
APPLY UP> leaves
leave+NNS
leave+VBZ
leaf+NNS
Figure 1.2: Generation and Analysis

The tags used in this example come from the well-known Brown corpus: +VBD means the past tense of a verb, +JJ is adjective, etc. Note that +VBD and +JJ are single symbols.

2. Elision and Contraction in French

A tourist arriving in Paris is often bewildered by the apparent inconsistency in the naming of the numerous local railroad stations. Why is it "Gare du Nord" but "Gare de Lyon"? Why "Gare de l'Est" but "Gare d'Austerlitz"? ("Gare" means 'station'.)

The Parisians explain that this variation reflects two productive morphological alternations in French:

  1. Elision of the word-final vowel in articles ("le", "la"), prepositions ("de") and pronouns ("ce", "se") in front of a word beginning with a vowel or "h" (sometimes).
  2. Contraction of "de le" to "du".

Knowing this does not help unless you also know that 1 takes precedence over 2 in cases where both are applicable. That is the reason why there is no "du" in "Gare de l'Est". One also must know that there is no "le" in "Gare de Lyon" and "Gare d'Austerlitz" since "Lyon" and "Austerlitz" are proper names.

Both types of alternation can be modeled easily as conditional replacements (Karttunen 1995). We need to formulate the replacements in such a way that they describe the relation correctly both in isolation ("de le") and in the context of a longer string "Gare de le Est". Consequently, we must be careful about the beginning and end of strings in stating the conditions.

With suitable definitions for 'Vowel' and 'H', the elision rule could be formalized simply as in Figure 2.1:

[[ e % ] -> ' || [% |.#.] [c | d | l | s] _ [Vowel|H]]
Figure 2.1: Elision of "le " to "l'", "de " to "d'", etc. (rule 1)

Here the percent sign causes the following space to be to interpreted as a symbol rather than as whitespace. This expression relates "e " to "'" (a string consisting of a single quote) when it is preceded by a word-initial "c", "d", "l", or "s" and followed by a vowel or "h". The expression [%  |.#.] denotes the beginning of a word; that is, a preceding blank space or a string boundary. Consequently, "de ", "le ", etc. do not alternate when they are part of some longer word. The actual statement in Figure 2.3 is a little more complicated because the vowels are spelled out explicitly as small and capital letters.

The contraction rule can be expressed as the replacement in Figure 2.2.

[[ d e %   l e] -> d u || [%  |.#.] _ [%  |.#.]]
Figure 2.2: Contraction of "de le" to "du" (rule 2)

As before, the context [% |.#.] indicates that "de" and "le" must be whole words, preceded and followed by a space or a string boundary.

Both rule 1 and rule 2 are applicable in cases such as "Gare de le Est". Because the correct result is "Gare de l'Est" and not "Gare du Est", rule 1 has precedence. In other words, the transducer for rule 1 is applied first and the transducer for rule 2 is applied to the output from rule 1.

 Input: Gare de le Est et Gare de le Nord
Rule 1: Gare de l'Est et Gare de le Nord 
Rule 2: Gare de l'Est et Gare du Nord    
Figure 2.3: Sequential rule application

Another way to get the right effect is to derive a single transducer by composition, as shown in Figure 2.4. By placing rule 1 above rule 2 in the composition, we ensure that the application of the resulting single transducer has exactly the same effect as the sequential application of the original transducers in Figure 2.3.

[[ e % ] -> ' || [% | .#.] [c|d|l|s] _ [A|E|I|O|U|H|a|e|i|o|u|h]]
.o.
[ [d e %   l e] -> d u || [%  |.#.] _ [%  |.#.]]
Figure 2.4: Composition of Elision and Contraction
(order matters)

The size of the network is 19 states, 192 arcs. Adding the missing accented vowels would increase the number arcs but not the number of states.

The transducer maps the lexicalized forms "Gare de le Est" and "Gare de le Nord" unambiguously to "Gare de l'Est" and "Gare du Nord", But this does not completely solve the problem for the foreign tourist. See Figure 2.5.

APPLY DOWN> Gare de le Est et Gare de le Nord
Gare de l'Est et Gare du Nord
APPLY UP> Gare de l'Est et Gare du Nord
Gare de l'Est et Gare du Nord
Gare de l'Est et Gare de le Nord
Gare de le Est et Gare du Nord
Gare de le Est et Gare de le Nord
Figure 2.5: Unambiguous generation vs. ambiguous analysis

The relation is not unambiguous in the other direction. The upward application of the transducer in Figure 2.5 to "Gare de l'Est" gives two results: "Gare de le Est" and "Gare de l'Est". That is, rule 1 in itself does not guarantee that "l'Est" can only arise in French as the result of elision. Similarly, "Gare du Nord" yields both "Gare de le Nord" and "Gare du Nord" because rule 2 does not say that every instance of "du" is a contraction of "de le". In fact there is another "du" in French. And yet another puzzle for the foreigner: why the "de" in "Gare d'Austerlitz" (cf. "Gare de Lyon" vs. "Gare Montparnasse") when there no trains from Gare d'Austerlitz to Austerlitz?

The problem of giving Elision priority over Contraction could also be solved in another way. With a slight reformulation of the contraction rule into Contraction2, we can get the correct result regardless of the order of application. We just need to refine the context of the "de le" to "du" rule in such a way that it has no effect in the environment where the "le " elides to "l'". That is, we restrict Contraction2 so that "de le" does not contract to "du" in front of a word that begins with a vowel or h.

Figure 2.6 shows the alternative solution to the problem. Just to demonstrate that the reformulation of the contraction rule makes the ordering irrelevant, we now compose the transducers in the opposite order.

[d e %  l e] -> d u || [%  |.#.] _ [%   ~[[A|E|I|O|U|H|a|e|i|o|u|h] ?*] [%  |.#.]]
.o.
[ e % ] -> ' || [% | .#.] [c|d|l|s] _ [A|E|I|O|U|H|a|e|i|o|u|h]
Figure 2.6: Composition of Contraction2 and Elision
(order is irrelevant)

The difference between two formulations of the contraction rule is that in Figure 2.6, the right context, by virtue of the expression [%  ~[[A|E|I|O|U|H|a|e|i|o|u|h] ?*] .#.], excludes any sequence where the space after "le" is followed by a string beginning with a vowel or "h". The reader may be interested in verifying that the expressions in Figure 2.4 and 2.6 compile into exactly the same 19 state, 192 arc transducer.

Because the order of application of Elision and Contraction2 makes no difference, the two transductions could also be done parallel. To try that alternative, replace the composition operator .o. in Figure 2.6 by a double comma. That is, the parallel replacement should have the form A -> B || X _ Y ,, C -> D || Z _ W .

This example illustrates that fact that the interaction of two phonological or orthographical alternations can in general be resolved with or without explicit ordering, although one or the other of the alternative solutions may seem more natural or simpler in a particular case. As phonologists know, much ink has been spilled over this issue. Karttunen 1991 discusses similar problems from the perspective of two-level rules (Koskenniemi 1983).

3. Noun Phrase Parser

Finite-state networks have been used since the late 1950s for lightweight syntactic analysis. To illustrate the basic idea, we represent text as strings of part-of-speech tags, using "d" for determiner, "a" for adjective, "n" for noun, and "v" verb, etc. In other words, in this example the regular expression symbols represent whole words rather than single letters in a text.

Let us assume that a noun phrase consists of an optional determiner, any number of adjectives, and one or more nouns. This simple definition can be expressed as a regular expression: [(d) a* n+]. This expression denotes an infinite set of strings, such as "n" (cats), "aan" (discriminating aristocratic cats), "nn" (cat food), "dn" (many cats), "dann" (that expensive cat food) etc.

A simple noun phrase parser can be thought of as a transducer that inserts markers, say, a pair of braces { }, around noun phrases in a text. The task is not as trivial as it seems at first glance. Consider the expression

[(d) a* n+ -> %{ ... %}]
Figure 3.1: A simple NP parser

The resulting 4-state transducer, shown in Figure 3.2 inserts curly brackets around any expression matching the [(d) a* n+] pattern. Note that ? here matches symbols, such as "v", that are not in the alphabet (sigma) of the network.

Figure 3.2: [(d) a* n+ -> %{ ... %}]

Applied to the input "danvn" (many small cats like milk) this transducer yields three alternative bracketings:

APPLY DOWN> danvn
da{n}v{n}
d{an}v{n}
{dan}v{n}
Figure 3.3: Three analyses of "danvn"

The reason for the ambiguity is that the transducer may loop in the start state over the initial "d" and "da" strings before inserting the first left bracket. This correct, of course, because "n", "an", and "dan" are all instances of the [(d) a* n+] pattern.

For certain applications it may be desirable to produce a unique parse, marking the maximal expansion of each NP: "{dan}v{n}". Using the left-to-right, longest-match replace operator @-> instead of the simple replace operator -> in Figure 3.1 yields the desired result.

[(d) a* n+ @-> %{ ... %}]
Figure 3.4: Left-to-right, longest-match NP parser

The corresponding network consists of 8 states and 28 arcs. The diagram is shown in Figure 3.5.

Figure 3.5: [(d) a* n+ @-> %{ ... %}]

As illustrated in Figure 3.6, the string "danvn" is now unambiguously transduced to "{dan}v{n}", ignoring the embedded smaller noun phrases. The corresponding path in the transducer is <0 0:{ 7 d 3 a 3 n 4 0:} 5 ? 0 0:{ 7 n 4 0:} 5>.

APPLY DOWN> danvn
{dan}v{n}
Figure 3.6: Left-to-right, longest-match markup

With longer input strings, the difference between the two parsers becomes even more striking. Whereas the left-to-right, longest match parser in Figure 3.5 maps the input "dannvaan" unambiguously to "{dann}v{aan}, the simple NP parser in Figure 3.2 produces 18 alternative analyses for the same input.

The brackets inserted by the noun phrase parse can in turn be used to define larger syntactic units, such as prepositional phrases and verb phrases. A composition of two or more such transducers creates a transducer that produces nested bracketings.

4. Syllabification

A syllable is a unit of speech that consists of a vocalic center, called nucleus, surrounded by a consonantal onset and a consonantal coda, one or both of which may be empty. In languages such as Finnish, where the orthography is closely related to the pronounciation, the general structure of a syllable in written texts can be described by the regular expression [C* V+ C*]. In other words, the onset and the coda consist of zero or more consonants, C, the nucleus of one or more vowels, V.

We ignore here the specific constraints on the three components. For instance, if the onset or the coda consist of several consonants, the sonority level tends to diminish in proportion to the distance from the nucleus: "tri" and "irt" are possible syllables but "rti" and "itr" are less acceptable. The nucleus may consist of a single vowel, "a", a double long vowel, "aa", or a diphthong, such as "ai". On the other hand, in quotations of spoken Finnish, one may see words that do not conform to these principles. For instance, the emphatic lengthening of a syllable may be represented by a burst of letters: "jooo" ('yeees'), "perrrkele" ('shiiit'). The general template [C* V+ C*] allows for such exceptions.

In the following, we will use a single hyphen to mark the syllable boundaries. Without exception, a single consonant between two vowels is grouped with the following vowel as the onset: V - C V. Any additional intervocalic consonants belong to the coda of the preceding syllable: V C C - C V, as in "myrs-ky" ('storm'), "kilt-ti" ('nice'). Word-internal syllables of native Finnish words have a single initial consonant; the first syllable of the word may begin with a cluster of consonants. For example, the word "strukturalismi" syllabifies into "struk-tu-ra-lis-mi".

A simple syllable parser based on these principles can be derived with the expression in Figure 4.1.

[C* V+ C*] @-> ... "-" || _ [C V]
Figure 4.1: Schematic syllable parser

This formula denotes a relation that inserts a hyphen in front of a consonant vowel sequence after a maximal expansion of the [C* V+ C*] pattern. This expression contains the same left-to-right, longest-match operator as the simple NP parser in the preceding example but here we need an additional constraint on what must follow the inserted marker. The corresponding transducer is shown in Figure 4.2.

Figure 4.2: [[C* V+ C*] @-> ... "-" || _ [C V]]

Figure 4.3 illustrates the application of the parser on a skeletal CV structure.

APPLY DOWN> VCVCCCVCV
V-CVCC-CV-CV
Figure 4.3: Application of the schematic syllable parser.

To use the syllable parser on real text, we of course need to replace C by the set of consonant symbols and V by vowels. The most convenient way to accomplish that is to use a substitution expression, that is, a regular expression of the form `[[RE], SYMBOL, LIST-OF-SYMBOLS]. This denotes the language or a relation obtained by replacing the every instance of SYMBOL in RE by the union of the symbols on the list. Figure 4.4 shows the expression from which the final transducer can be compiled.

`[[`[[[C* V+ C*] @-> ... "-" || _ C V],
V, a e i o u y ä ö]],
C, b c d f g h j k l m n p q r s t v w x z]

Figure 4.4: Syllable parser

The transducer derived from the expression in 4.4 has the same five states as the network in Figure 4.3 but many more arcs. Figure 4.5 illustrates the application of the transducer to a test word.

APPLY DOWN> strukturalismi
struk-tu-ra-lis-mi
Figure 4.5: Application of the syllable parser

The hyphenation of words in Finnish texts generally follows the natural syllabification pattern. However, hyphenation rules also take into account other information, in particular, the compound boundaries. For example, in words like "autonajaja" ('car driver'), the first hyphenation point is at the compound boundary, "auton-ajaja", regardless of the actual pronounciation. In many cases correct hyphenation is difficult to achieve automatically because of ambiguities such as "auto-nosto" ('car lift') vs. "auton-osto" ('car purchase'), which cannot be resolved without real understanding of the text.

5. Plus/Minus One

In this example we construct a simple transducer that performs addition and subtraction by 1 on binary numbers. A downward application of the transducer corresponds to subtraction, upward application is addition. For example, "1100" (twelve) should get mapped to "1101" (thirteen) upwards, to "1011" (eleven) downwards. As the first step towards constructing a description of the relation, let us consider the regular expression in Figure 5.1. It denotes a parallel replacement of 1 by 0 and 0 by 1 when the string to the right is empty or consists of nothing but zeros up to the end, that is, to the end of the string, .#., or to something that is neither zero nor one, [\[%0|1]].

%0 -> 1, 1 -> %0 || _ %0* [\[%0|1] | .#.]
Figure 5.1: Binary adder/subtracter for 1 (first try).

This is almost correct. For example, it maps "1100" to "1011" downwards and to "1101" upwards. However, the transducer does not map correctly at the left edge of the number where a new 1 must sometimes be added and a leading 0 can be deleted. It does not include pairs like "10" and "1", "100" and "11", etc., as it should.

The problem can be corrected by composing the expression in Figure 5.1 with two auxiliary relations. To make the expressions more perspicuous, let us use END to abbreviate the complex expression [\[%0|1] | .#.]. The first relation, [1 <- [. .] || END _ [%0+ END]], adds an initial 1 to a string of one or more zeros in the upwards direction. (Using [. .] instead of [ ] to represent the empty string has the effect here that only a single 1 is inserted.) The second auxiliary relation, [%0 -> [] || END _ [1+ END]], eliminates the leading zero in strings such as "01" in the downwards direction. The two new relations are composed around the original expression. The result is shown in Figure 5.2 in complete detail.

1 <- [. .] || [\[%0|1]|.#.] _ [%0+ [\[%0|1]|.#.]]
.o.
%0 -> 1, 1 -> %0 || _ [ %0* [\[%0|1]|.#.]]
.o.
%0 -> [] || [\[%0|1]|.#.] _ [1+ [\[%0|1]|.#.]]
Figure 5.2: Binary adder/subtracter for 1 (final).

Although the expression looks complicated, the size of the network is only 5 states, 12 arcs, Figure 5.3. To clearly distinguish between the digit zero, and the special use of zero as the epsilon symbol, we exceptionally represent the single 1-to-epsilon pair in this diagram as 1:. All instances of 0 in Figure 5.3 represent zero as a digit.

Figure 5.3: Plus/minus 1 transducer.

The reader is invited to find a simpler regular expression that compiles into the same transducer. (It is harder than it looks.) As Figure 5.4 demonstrates, the transducer applies correctly to strings that contain binary numbers in the middle of other text. The input equation asserts that 12-7 equals 8-3.

APPLY DOWN> (1100 - 111) = (1000 - 11) in binary
(1011 - 110) = (111 - 10) in binary
APPLY UP> (1100 - 111) = (1000 - 11) in binary
(1101 - 1000) = (1001 - 100) in binary
Figure 5.4: Addition and subtraction on binary numbers

An interesting feature of this transducer is that we can derive from it an adder/subtracter for any natural number by composition. The composition of the transducer in Figure 5.3 with itself yields a transducer that adds and subtracts 2. The composition of the plus/minus 2 transducer with itself adds and subtracts 4, and so on.

Although the transducer in Figure 5.3 is unambiguous in either direction, it is not efficient on large numbers. For example, consider subtraction from "10000000001" (1025 in decimal). The initial "1" on the left must be realized in one of three ways depending on what follows on the right: as "0", as the empty string, or as "1". The first alternative, mapping to "0", is correct only if the initial "1" is also the last digit of the number. This is not the case here. The remaining choice depends on whether there is another "1" somewhere further to the right. Both alternatives have to be explored. In the case at hand, there is another "1" at the very end, so the first "1" stays, yielding "10000000000". The alternative path that eliminates the initial "1" and maps all the immediately following "0"s to "1"s dies out at the last moment.

Since the ambiguity in general cannot be resolved within a fixed amount of lookahead, the transducer in Figure 5.3 is yet another example (see Finite-State Networks) of a transducer that is unambiguous in a given direction, but neither sequential or sequentiable.

Because the mapping on the left depends on what follows on the right, and not vice versa, it would be more efficient to proceed backwards, starting from the right end of the string. With the help of the reverse operator, .r, we can easily turn the transducer in Figure 5.3 around for right-to-left application. The regular expression for the reverse transducer is of the form [--- the expression in figure 5.2 ---].r . Figure 5.5 shows the resulting right-to-left binary adder/subtracter. (To clearly distinguish between zero and epsilon, we again represent the single 1-to-epsilon label as 1:. All "0"s in Figure 5.5 are digits.)

Figure 5.5: Right-to-left plus/minus 1 transducer.

The right-to-left version of the transducer is nearly sequential in both directions as it has only one local ambiguity in state 1. We can now run our original example in Figure 5.4 backwards:

APPLY DOWN> yranib ni (11 - 0001) = (111 - 0011)
yranib ni (01 - 111) = (011 - 1101)
APPLY UP> yranib ni (11 - 0001) = (111 - 0011)
yranib ni (001 - 1001) = (0001 - 1011)
Figure 5.6: Right-to-left addition and subtraction

6. Soft-drink Machine

Let us imagine a machine that dispenses soft drinks for the price of 65 cents per can in US currency. The machine accepts nickels, dimes, and quarters. If you put in the right amount of money in any combination of coins, a can of soft drink drops into a bin; otherwise nothing happens.

We can think of the sequences of coins that add up to 65 cents as a language consisting of the letters "n" (nickel, worth 5 cents), "d" (dime, worth 10 cents) and "q" (quarter, worth 25 cents). For example, "qqdn", "ddddddn", and "nnnnnnnnnnnnn" are in this language because the total value of the string is exactly 65 cents. Any string whose value is less or more than 65 cents is excluded, for example, "qdq" (too little) and "qqq" (too much).

The task in this example is to construct a regular expression that compiles into a finite-state automaton that implements the behavior of the soft drink machine. In order to avoid unnecessary technical complexities, let us simulate the dropping of the can by just an appropriate sound, say "PLONK". Thus we can think of the soft drink machine as a transducer that converts an appropriate sequence of coins into a PLONK. (Unlike the vending machines we are familiar with, our transducer could also be applied in the opposite direction as a buying machine, converting PLONKs to money.)

We can solve this problem in a few simple steps. The first task is to associate the coin symbols, n, d, and q, with appropriate values. We do this by constructing for each coin a transducer that maps it to a string of "c"s (cents) whose length represents the value. The simplest way to do this is to use the construction A^n, which denotes the concatenation of A with itself n times. For example, c^5 denotes the language consisting of the string "ccccc". Thus the expression [n .x. c^5] expresses the fact that a nickel is worth 5 cents. The regular expression in Figure 6.1 denotes a mapping from all possible sequences of the three symbols to the corresponding value.

[[n .x. c^5] | [d .x. c^10] | [q .x. c^25]]*
Figure 6.1: A relation between coin sequences and their values

This relation contains infinitely many pairs. For example, the pair consisting of "nd" and "ccccccccccccccc" expresses the fact that a nickle and a dime are worth 15 cents.

The next task is to relate the dropping of the soft-drink can to the appropriate amount of money. The expression [c^65 .x. PLONK] does just that. What remains is the problem of allowing the price to be paid by any sequence of coins that has the appropriate value. The solution is given in Figure 6.2.

[[n .x. c^5] | [d .x. c^10] | [q .x. c^25]]*
.o.
[c^65 .x. PLONK]
Figure 6.2: The 65 cent soft drink machine

The result of the composition in Figure 6.2 is a transducer whose upper language consists of all coin sequences whose value is exactly 65 cents. This effect comes about because the upper language of the relation denoted by [c^65 .x. PLONK], acts like a filter in the composition. It consists of a single string of 65 "c"s. Consequently, all coin sequences that map into some other value are eliminated. The lower language of the relation denoted by the expression in Figure 6.2 consists of "PLONK". (Note that PLONK is a single symbol.) Because the composition yields a relation between the outermost languages, the result does not contain any "c"s.

The size of the network compiled from the regular expression in Figure 6.2 is 14 states, 34 arcs. The upper language consists of 634 strings. The number is larger than one might think because it includes, for each combination of coins, all the possible permutations. That is, "qqdn", "qqnd", "dqqn", etc. are counted separately.

If we are interested in knowing just the number of combinations of nickels, dimes, and quarters that add up to 65 cents, ignoring the order of insertion, we can compute that easily by forcing the coins into a canonical order. The solution is shown in Figure 6.3. The effect of the topmost expression [q* d* n*] is to force the coins to be inserted in a strict order: first the quarters, if any; then the dimes, if any; and only then the nickels.

[q* d* n*]
.o.
[[n .x. c^5] | [d .x. c^10] | [q .x. c^25]]*
.o.
[c^65 .x. PLONK]
Figure 6.3: A strict soft drink machine

We can enumerate the allowed ways of paying by doing a lookup on PLONK in the resulting transducer. See Figure 6.4.

APPLY UP> PLONK
ddddddn
dddddnnn
ddddnnnnn
dddnnnnnnn
ddnnnnnnnnn
dnnnnnnnnnnn
nnnnnnnnnnnnn
qdddd
qdddnn
qddnnnn
qdnnnnnn
qnnnnnnnn
qqdn
qqnnn
Figure 6.4: Ordered coin combinations

This strict machine allows only fourteen sequences of coins.

7. Better Soft-drink Machine

The clients of the soft drink machine just described cannot be very satisfied. The machine behaves in an outrageous manner. If the client fails to put in exactly the right combination of coins for a soft drink, the machine just keeps the money without delivering anything. Good business practice dictates that the machine should reimburse the client if the sum is too small and return the extra change along with the soft drink if the client has overpaid. We will show how these shortcomings can be corrected. To make things simpler, we will also fix a less important defect. Instead of selling just a single soft drink can, the improved transducer will convert any amount money into the appropriate number of PLONKs and return the extra change if any. In the case where the inserted money is less than the price of one can, the entire sum will be returned.

In other words, we wish to modify the lower language of Figure 6.2 to make it a subset of [PLONK* q* d* n*], that is, any number of PLONKs followed by the extra change. To ensure that the extra change is paid out only once, we need to make sure that quarters get paid before dimes and dimes before nickels. Because all of the components need to be associated with the appropriate value in cents, the last line of Figure 6.2 is replaced by [[c^65 .x. PLONK]* [c^25 .x. q]* [c^10 .x. d]* [c^5 .x. n]*].

The last refinement is to ensure that as much money as possible is converted into soft drinks and to remove any ambiguity in how the extra change is to be reimbursed. For example, if the client is entitled to get back 60 cents, we give him two quarters and a dime instead of six dimes or some other combination of coins that could be paid. If there is any ambiguity about how reimbursements are made, the transducer would produce multiple outputs, with the danger that the client would get paid multiple times the same amount in different combinations of coins.

To achieve these goals, we add yet another layer of constraints on the bottom of the composition. The expression [~$[q q q | [q q | d] d [d | n] | n n]] has the effect of disallowing the reimbursement of any sum that could be converted to a PLONK, such as "qqq" and "qqdn". It also prohibits sequences of dimes and nickels that add up to a quarter or more, such as "ddd" and "ddn". Finally, it prohibits a sequence of two nickels thus ensuring that ten cents is returned only as a dime.

The new solution to the soft drink machine problem is given in Figure 7.1.

[[n .x. c^5] | [d .x. c^10] | [q .x. c^25]]*
.o.
[[c^65.x.PLONK]* [c^25.x.q]* [c^10.x.d]* [c^5.x.n]*]
.o.
[~$[q q q | [q q | d] d [d | n] | n n]]
Figure 7.1: A better 65 cent soft drink machine

The transducer compiled from the expression in Figure 7.1 contains 35 states and 102 arcs. Because there is no limit on how many soft drinks can be bought at the same time, the number of pairs of strings in the relation encoded by the transducer is infinite. For example, it includes the pair consisting of "nnnnnqqqqddddnnn" and "PLONKPLONKqq". Here the sum of the coins is 180 cents, thus the client gets two soft drinks, worth $1.30, and 50 cents back paid as two quarters.

Note that the improved soft drink machine can also be used to reduce the amount of small change. If a sequence of coins is not enough to buy a drink, the entire amount is returned using as few coins as possible. For example, in exchange for 60 cents in whatever combination, the client will get back two quarters and a dime.

To make the machine completely foolproof, we need one final improvement. Some clients may insert unwanted items into the machine (subway tokens, foreign coins, etc.). These objects should not be accepted; they should passed right back to the client. This goal can be achieved easily by wrapping the entire expression in Figure 7.1 inside an ignore operator, /. To highlight the addition, we omit the common part: [ [ --- the expression in Figure 7.1 --- ]/[\[q | d | n]]]. Here the second operand, [\[q | d | n]], denotes the set that contains every single-symbol string other that "q", "d", or "n".

The effect of the ignore operation is to create a transducer that is in other respects equivalent to the original one except that it allows the other symbols to occur freely, without affecting the relation specified in Figure 7.1. Any unknown symbol is simply mapped to itself, as shown in Figure 7.2, and ignored.

APPLY DOWN> quadriphonique
PLONKuariphoiue
Figure 7.2: Ignoring foreign objects

Here "quadriphonique" is transduced to "PLONKuariphoiue". The two quarters, the dime and the nickel in the input string are transduced to a PLONK; everything else is passed through unchanged.

The size of the final transducer is 35 states, 172 arcs. The number of states is the same as in the network compiled from Figure 7.1. The increase in the number of arcs comes from having two new looping arcs at each state. One is for the unknown symbol; the other is for PLONK, that is, a can of something dropped into the machine by the client himself.

8. Soundex

Related names such as "Johnson", "Johnsen", "Johansson", "Johanson", "Johansen", "Jensen", etc. are often confused with one another because they are close in spelling and pronounciation. In 1918 Margaret K. Odell and Robert C. Russel developed a system of mapping such sets of similar surnames into a unique common representation. They called it the "Soundex" method. Odell and Russel obtained two US patents (long since expired) for the idea.

Donald Knuth describes the Soundex method in his book on the Art of Computer Programming (Vol 3. pp. 391-392) as follows:

  1. Retain the first letter of the name, and drop all occurrences of a, e, h, i, o, u, w, y in other positions.
  2. Assign the following numbers to the remaining letters after the first:
  3. If two or more letters with the same code were adjacent in the original name (before step 1), omit all but the first.
  4. Convert to the form "letter, digit, digit, digit" by adding trailing zeros (if there are less than three digits), or by dropping rightmost digits (if there are more than three).

Knuth's description of the Soundex mapping could be translated almost literally into a cascade of replacement expressions. As Figure 8.1 shows, we can also achieve the same result in a somewhat more compact way.

[ [ [b | f | p | v]+ @-> 1, [c | g | j | k | q | s | x | z]+ @-> 2,
[d | t]+ @-> 3, l+ @-> 4, [m | n]+ @-> 5, r+ @-> 6,
[a | e | h | i | o | u | w | y]+ @-> [] ]
.o.
[?^4 [?:0]* | ?^3 0:%0 | ?^2 [0:%0]^2 | ? [0:%0]^3] ]
Figure 8.1: Soundex mapping

The expression in Figure 8.1 denotes the composition of two relations. The first one, a set of parallel longest-match replacements, maps sequences of adjacent consonants into the corresponding code and deletes the vowels. Thus it does the work of steps 1, 2 and 3 in the original Soundex algorithm. Because the first letter of the name is capitalized, it remains unaffected by these replacements. The second relation in Figure 8.1 corresponds to the step 4 in the original method. It consists of a union of four relations. The first one preserves the first four symbols of a string and eliminates the rest. The remaining three add zeros to the end of short strings to increase the length to four.

The Soundex transducer compiled from the expression in Figure 8.1 consists of 26 states and 805 arcs. Figure 8.2 illustrates its effect.

APPLY DOWN> Johansson
J525
APPLY DOWN> Johansen
J525
APPLY DOWN> Jensen
J525
Figure 8.2: Same code, different names

Because there is no limit on the length of a name or the number of ignored vowels and consonants, every Soundex code represents an infinite set of strings. Consequently, applying the Soundex transducer in the upward direction produces mostly gibberish. However, if we have a lexicon of names, composing the lexicon with the Soundex relation creates a restricted name-to-code transducer that yields useful results in either direction. Consider the expression in Figure 8.3.

[ [ J o h n s o n | J o h n s e n | J o h a n s s o n |
J o h a n s o n | J o h a n s e n | J e n s e n |
O d e l l | R u s s e l | K n u t h ]
.o.
------ the expression in Figure 8.1 -------- ]
Figure 8.3: Name-to-code relation

The transducer compiled from the expression in Figure 8.3 maps the strings in the name lexicon into exactly the same Soundex codes as the general Soundex transducer. It is more limited of course because the upper language of the relation is a finite list. On the other hand, for that very reason, applying it in the upward direction yields useful results, as shown in Figure 8.4.

APPLY DOWN> Johnson
J525
APPLY UP> J525
Jensen
Johansson
Johanson
Johansen
Johnsen
Johnson
Figure 8.4: From name to code, from code to name

In order to know what names "Johnson" could be confused with, one can first apply the restricted name-to-code transducer downwards to get the corresponding Soundex code J525. By applying the transducer upwards to J525 we obtain all the strings in the name lexicon that are associated with that particular code.

There is a way to get that information in just one step. By composing the relation in Figure 8.3 with its own inverse, we can create a transducer that maps every name in the lexicon directly to all the other names that have the same Soundex code. The actual Soundex codes are eliminated in the composition. The resulting transducer simply maps every name to a set of names, possibly just to itself, without any indication of what property it is that the names have in common.


For more detailed information please look at our pointers to finite-state literature. We welcome your comments and suggestions.
Copyright © The Document Company - Xerox 1997 -97. All rights reserved.
Written by Lauri Karttunen. Last modified on Jan. 12, 1998 .