### Finite-State Networks

We use the term finite-state network to cover both simple automata and transducers. A simple automaton encodes a regular language, a transducer represents a regular relation. If you are not familiar with these terms, start with Regular Expressions.

A finite-state network consists of states and arcs. Small networks are most easily viewed in graphical form as transition diagrams. For the sake of clarity, we use the graphical representation to explain the structure of networks. However, our regular expression compiler can display networks only as transition tables. The correspondence between the two formats is explained below.

#### Introduction

A network has a single start state and any number of final states. In transition diagrams, we represent states by numbered circles. By convention, state 0 is the start state of the network. Final states are distinguished by a double circle.

The simplest kind of network consists of just one state, the initial state, without any outgoing transitions. If that state is final, the network represents the empty-string language; if it is non-final, the network represents the null language. The empty-string language (see Regular Expressions), denoted by the expression [], consists of the empty string; the null language, denoted by [\?], contains no strings at all.

Figure 1.

For reasons that will be explained shortly, the specification of a network always includes a sigma, the symbol alphabet of the network. Because the networks in Figure 1 have no arcs, their symbol alphabets are empty.

A state is associated with zero or more arcs that represent transitions from that state to other states in the network. Each arc has a label and a destination state. In a network that encodes a language (or the identity relation on that language), each label is a single symbol whose print-name may consist of one or more characters, such as a or +Noun. Figure 2 presents a simple network for the language a b* c. It encodes an infinite set of strings: "ac", "abc", "abbc", "abbbc", etc.

Figure 2: The language [a b* c]

For every string in the language of the network, there is a path, a sequence of zero or more arcs, from the initial state to a final state such that the arc labels along the path constitute a match for the string. For example, the string "abbc" is encoded in this network by the path 0-a-1-b-1-b-1-c-2, where the state-symbol-state sequence identifies the arcs that make up the "abbc" path. In this example, the letters in "abbc" and the symbols along the path match perfectly. In general the correspondence need not be so exact because the path may also contain epsilon symbols that represent the empty string. Consequently, the number of arcs in a path may be larger than the length of the string it matches. Another special network symbol, the unknown symbol, is explained below.

#### Tabular representation

Figure 3 shows a nongraphical representation for the network in Figure 2.

 s0: a -> s1.  s1: b -> s1, c -> fs2. fs2: (no arcs) Sigma: a, b, c
Figure 3: The language [a b* c]

The tabular representation lists the states, s0, s1, and fs2, and the arcs that originate at each state. Nonfinal states have an "s" prefixed to the state number, final states are distinguished by the "fs" prefix. The arcs, e.g. a -> s1, show the arc label and the destination state. (Click here for more discussion and another example of a network in tabular form.)

#### Unknown symbol and the Sigma alphabet

In network diagrams, the symbol ? stands for an unknown symbol, that is, any symbol that does not belong to the symbol alphabet of the network. Figure 4 shows a simple example of a network containing an arc with ? as a label. It represents the universal language because there are no other symbols in the sigma.

Figure 4: The universal language ?*

Since the start state is final and the single looping arc points back to it, the network encodes all strings of any length. For example, the string "abbc" is represented by the path 0-?-0-?-0-?-0-?-0 where ?, the unknown symbol, matches "a" or "b" as needed since neither one is included in the sigma.

In contrast, the language in Figure 5 is different although the two networks in 4 and 5 are identical as far as the states and arcs are concerned.

Figure 5: The language [\a]*

Recall that \ (see Regular Expressions) is the term complement operator. Thus [\a] denotes the set of single symbol strings other than "a". This example illustrates a slightly confusing aspect of our notation: in a regular expression, the symbol ? stands for any symbol that occurs in the regular expression as well as any unknown symbols; in network representations ? stands only for the unknown symbols.

Another example illustrating this point is shown in Figure 6.

Figure 6: The languages [a \a] and [a ?]

Note that the language [a ?] includes the string "aa" but the language [a \a] does not.

This example can explain the apparent inconsistency in the interpretation of ?. The regular expression [a ?] denotes the concatenation of two languages [a] and [?]. The latter one, [?], contains all single symbol strings, including "a". The minimal network for [?] contains a single arc labeled with the unknown symbol. In constructing the network for [a ?], that represents the concatenation of [a] with [?], we need to make explicit the fact that the language [?] also includes "a". Because we interpret the regular expression [a ?] as representing the concatenation of [a] and [?], it is evident that the symbol ? in the context of [a ?] must cover "a" as well as any unknown symbol. It is not an arbitrary convention that ? in a regular expression means any symbol.

For the sake of clarity, in the network for [a ?] in Figure 6, we show the ? and a transitions from state 1 to state 2 as two different arcs. However, in our diagrams we generally represent multiple transitions to the same state by just a single arc labeled with multiple symbols. With that convention, the network for [a ?] would look just like the diagram for [a \a] except that the arc from state 1 to state 2 would have two labels: ? and a.

#### Transducers

If some arc label in the network is a symbol pair, the network is a transducer. A transducer encodes a regular relation, a set of pairs of strings. For example, the transducer in Figure 7 encodes the relation denoted by the regular expression [a b -> x], the replacement of any instances of "ab" by "x".

Figure 7: The relation [a b -> x]

A tabular representation of the same relation is given in Figure 8.

 fs0: ? -> fs0, a -> fs1, b -> fs0, x -> fs0, a:x -> s2. fs1: ? -> fs0, a -> fs1, x -> fs0, a:x -> s2.  s2: b:0 -> fs0. Sigma: ?, a, b, x
Figure 8: The language [a b -> x]

For every pair of strings in the relation [a b -> x], such as <"abacab", "xacx">, there is a path in the transducer such that the upper and lower symbols of the labels along the path form the two strings in question. For example. the pair <"abacab", "xacx"> is encoded by the path

 0-a:x-2-b:0-0-a-1-?-0-a:x-2-b:0-0
Figure 9: The path relating "abacab" to "xacx" in Figure 7

We follow here the convention of writing a pair of identical symbols as just a single symbol, that is, a here represents the pair a:a.

Another point worth noting is that a transition over ? along the path is a transition involving some unknown symbol that is paired with itself. For example, the 1-?-0 transition in Figure 9 relates the "c" in the upper string with the "c" in the lower string. Note that "c" is not in the sigma of the network.

The relation encoded by a finite-state transducer is always a relation between two regular languages, which we call the upper and the lower languages. In the case of [a b -> x], the upper language of the relation is the universal language, ?*. That is, the transducer maps any string of any length to another string just like it except that any instances of "ab" are replaced by "x". The lower language of the relation is ~\$[a b], that is, the infinite language of strings that do not contain instances of "ab".

The relation encoded by a transducer is generally not one-to-one. In the case of [a b -> x] every string of the upper language is unambiguously mapped to a unique string of the lower language, but the same does not hold in the other direction. For example, the string "xacx" in the lower language is paired with four strings in the upper language, "abacab", "abacx", "xacab", and "xacx", because each instance of "x" in "xacx" could correspond either to "ab" or to an "x" in the upper string.

In addition to the path shown in Figure 9, there are three other paths that have "xacx" on the lower side.

 0-a:x-2-b:0-0-a-1-?-0-x-0 0-x-0-a-1-?-0-a:x-2-b:0-0 0-x-0-a-1-?-0-x-0
Figure 10: Other paths in Figure 7 with "xacx" on the lower side

This example underscores the fact that finite-state transducers, as we represent them, are bidirectional. In some cases, the mapping encoded by a transducer is useful only in one direction. A case in point is a tokenizing transducer that introduces token boundaries and eliminates unwanted whitespace. In other cases, both sides of the mapping are of equal interest. For example, a lexical transducer (Karttunen et al. 1992, Karttunen 1994) that relates inflected forms to their lemmas may be used either to analyze or to generate inflected forms. A lemma consists of the canonical citation form of the word, e.g. "leave", followed by one or more morphological tags, e.g. "+VBZ" (verb 3rd person singular), describing the inflection. An example of a lexical transducer is shown in Figure 11.

Figure 11: Lexical transducer for the verb "leave"

The transducer in Figure 11 associates "leave+VB" with "leave", "leave+VBZ" with "leaves" and "leave+VBD" with "left". For each pair of forms, there is a path that contains the inflected form on the lower side and its lemma on the upper side. This path can be located by using either the upper string or the lower string as the key. The document on Application of Finite-State Networks explains in more detail how transducers are used to map strings to strings.

#### Properties of networks

Our regular expression compiler produces networks that have the properties listed in Figure 12.
• epsilonfree
There are no arcs labeled with the epsilon symbol.
• deterministic
No state has more than one outgoing arc with the same label.
• minimal
There is no other network with exactly the same paths that has fewer states.

Figure 12: Properties of simple networks

These are important properties for networks that encode a simple regular language (i.e. not a relation) because for any regular language there is a unique network (ignoring the ordering of the arcs) that has these properties.

These properties do not have the same relevance for transducers. A transducer represents one of many possible encodings of a regular relation. Even if a transducer is minimal in the sense of Figure 12 there may be some equivalent transducer with fewer states and arcs for the same relation. A transducer that is epsilonfree in the sense of Figure 12 may have 'one-sided' epsilon arcs, that is, arcs labeled with a symbol pair that contains an epsilon as the upper or the lower symbol. One-sided epsilon arcs may be a source of nondeterminism when the transducer is applied to a string even if the network is 'deterministic' in the sense of Figure 12.

For transducers, the intuitive notion of determinism makes sense only with respect to a given direction of application. For example, the transducer in Figure 11 is 'deterministic' in the upward direction because, for all of the strings in the lower side language, "leave", "leaves", and "left",

1. there is a unique path containing that string on the relevant side in the transducer; and
2. in the process of searching for the path, there is never more than one arc to consider that matches the current input symbol.
The term sequential describes transducers that exhibit this kind of one-directional determinism. Note that the transducer in Figure 11 is not sequential in the downward direction. Condition 1. is fulfilled because each of the strings in the upper language, "leave+VB", "leave+VBZ", "leave+VBD" corresponds to a unique string in the lower language.

However, the stronger condition 2 is not met. In state 2 there are two arcs with "a" on the upper side of the label. For example, if we are looking for a path with "leave+VBD" on the upper side, the two continuations from state 2 in Figure 13 are both valid up to the very last transition.

 2- a -3- v -4- e -5-          (Fail! No match for +VBD) 2-a:0-7-v:f-8-e:0-9-+VBD:t-6  (Success!)
Figure 13: Local ambiguity

Although the relation encoded by the transducer in Figure 11 is in fact unambiguous in both directions, the two paths have both to be explored. The local ambiguity is resolved a few transitions later when it turns out that only one of them can be successfully completed.

If the relation itself is unambiguous in the relevant direction and if all the ambiguities in the transducer resolve themselves within some fixed number of steps, the transducer is called sequentiable. That is, we can construct an equivalent sequential transducer in the same direction (Roche and Schabes 1995, Mohri 1996).

In the case at hand, this involves combining the two continuations in Figure 13 into a single path that produces only empty strings as output until the point where the ambiguity is resolved at state 5. The resulting transducer in Figure 14 is sequential when applied downwards, but only in that direction because there are two arcs with "a" as the lower side label in state 5.

Figure 14: Sequential (downward) transducer for "leave"

Note that the three paths from state 5 to state 9 in Figure 14, 'discharge' the delayed output for "ave" in ways that correspond to the two branches starting at state 2 in Figure 11. Each of these three paths could be compacted into a single arc by using a multicharacter symbol on the lower side of the label: +VBZ:aves, +VB:ave, +VBD:ft.

Subsequential and p-subsequential transducers are 'nearly' sequential in that they allow alternate outputs but only in a final state at the end of an input string. (See Mohri 1996).

In the light of the discussion above, it is instructive to look at again at the transducer for the relation [a b -> x] in Figure 7. The relation is not unambiguous in the upward direction because any instance of "x" in the lower language can be mapped either to and "a" or to an "ab" in the upper language. Consequently, the transducer is not sequential or sequentiable in that direction. It is not sequential in the downward direction either because states 0 and 1 both have two arcs on with "a" on the upper side of the arc label. On the other hand, the ambiguity is very local: the a:x arcs presuppose "b" as the next symbol in the input string. Thus an equivalent sequential transducer can be constructed for the downward mapping.

Even if a transducer is unambiguous, it may well be unsequentiable if one of the languages in the relation is infinite. For example, the simple transducer for a+ b @-> x in Figure 15 cannot be sequentialized.

Figure 15: Unsequentiable transducer for [a+ b @-> x]

This transducer replaces any string of "a"s by "x" or copies it to the output unchanged depending on whether the string eventually terminates at "b". It is obviously impossible for any finite-state device to accumulate an unbounded amount of delayed output.

For more detailed information please look at our pointers to finite-state literature. We welcome your comments and suggestions.