| a | ||||
We need to distinguish two kinds of sets. A set of strings is called a language. The expression above, a, denotes the language consisting of just the string "a". As in set theory, a set consisting of pairs of strings is called a relation. The terms regular language and regular relation refer to sets that can be described by a regular expression.
The simplest kind of regular expression that denotes a relation is a symbol pair, such as
| a:b | ||||
In order to distinguish the two languages that are involved in a regular relation, we call the first one the upper and the second one the lower language of the relation. Correspondingly, in the pair a:b, the first symbol, a, is called the upper symbol and the second one, b, the lower symbol. The two components of a symbol pair are separated by a colon, :, without any whitespace before or after.
| a b | ||||
The union operator, |, constructs a regular language that includes all the strings of the component languages. For example,
| a | b | ||||
A regular expression can be enclosed in square brackets without changing its interpretation. For example, [a | b] and [a] | [b] describe the same language as a | b. Brackets are sometimes necessary to ensure that the expression has the intended meaning. Note that
| a b [a | b] | ||||
| a b a | b | ||||
The minus operator, -, subtracts all the strings of one language from another language. Thus
| [a | b] - b | ||||
Enclosing a regular expression in round parentheses (as opposed to square brackets) represents a union with the empty-string language. Consequently, (a | b) is the same language as [a | b | []]. Less formally, a regular expression like (b) represents an optional occurrence of "b", i.e. either "b" or the empty string.
Subtracting the empty-string language (or any other language) from itself, [[] - []], yields a language that contains no strings at all. We call this the null language. Because the null language has no strings, subtracting or adding it to another language obviously has no effect. Concatenation is different. If one of the component expressions in concatenation is the null language, so is the result. Another expression for the null language is mentioned in the discussion of the term complement operator.
The opposite of the null language is the universal language. The universal language contains all strings of any length including the empty string. To see examples of expressions that denote the universal language, see the discussion of the complement and Kleene star operators.
We also do not always carefully distinguish between a regular expression and the language it denotes. Occasionally we speak sloppily of "the language [a | b]" in the sense of "the language denoted by the regular expression [a | b]."
| 0 | epsilon. A symbol that represents the empty string "". | ||||||||
| ? | any. A symbol that represents any symbol that occurs in the same regular expression and any unknown symbol. | ||||||||
In order to understand the special interpretation of the any symbol in regular expressions, it is useful to take note of the fact that the expression [?] represents the same language as [? | a]. The reason is simple. Because ? stands for any symbol whatsoever, the language denoted by [?] contains all single symbol strings, including "a". Consequently, the union [? | a] does not result in the addition of any new strings. Similarly, [a ?], the concatenation of [a] and [?], is equivalent to [a [? | a]].
In the network corresponding to a complex regular expression, an instance of ? in the expression generally does not correspond to a single arc. The compiler interprets ? by constructing one arc labeled with ? and a duplicate arc for every other symbol in the expression. Consequently, in the resulting network the ? arcs only have the role of representing unknown symbols.
The difference between ? as a regular expression symbol (any) and ? as an arc label in a network (unknown) is an unfortunate source of confusion for novice users. See the discussion of ? as the unknown symbol in the article on finite-state networks.
In a later section we introduce one more special symbol, the boundary marker .#., to refer to the beginning or the end of a string in certain expressions.
The escape character, %, eliminates any special meaning of the following character. Thus |, 0, ?, etc. can be used as ordinary symbols by prefixing them with %. For example, %0 represents the string "0" rather than the epsilon symbol; %| is the vertical bar itself, as opposed to the the union operator |. The ordinary percent sign may be expressed as %%.
The print name of a symbol can contain more than one character. For example, %+V represents the string "+V". Beware of multicharacter symbols that coincide with a string concatenated from shorter symbols. For example, the expression [a a | aa] denotes a language that contains the string "aa" twice. One is a single symbol and the other is the concatenation of two "a"s. If such ambiguities arise in practice, our system resolves them left-to-right in favor of the longest symbol.
Because we intentionally ignore the distinction between a language and its identity relation, a pair of identical symbols may be collapsed into a single symbol. Instead of a:a we may simply write a.
| \A | term complement. The set of all single-symbol strings that are not in the language A. \A is equivalent to [? - A]. Thus \a contains "b" but not "a" or the empty string, and no concatenated strings such as "bb". Note that \? denotes the null language because ? covers all single-symbol strings. | |||||
| ~A | complement. The set of all strings that are not in the language A. For example, ~a contains "", "aa", and "aaa" but not "a". The component expression, A, cannot denote a relation. The complement of the null language, ~\?, is the universal language, and vice versa. | |||||
| A+ | iteration (Kleene plus). The language or relation A concatenated with itself any number of times, including zero times. A+ includes [A], [A A], [A A A], and so on ad infinitum. ?+ is the language of all nonempty strings. | |||||
| A* | Kleene star. The union of A+ with the empty-string language. A* is equivalent to (A+). ?* denotes the universal language. | |||||
| $A | containment. The set of strings or string pairs containing at least one instance of A as a subpart. For example, "bac" is a string in the language denoted by $a, speaking informally, "bac" is in $a. Here A can denote a relation. $A is equivalent to [?* A ?*]. $[] denotes the universal language. | |||||
A B | concatenation of the languages or relations A and B. See the discussion above. | |||||
A^n | n-ary concatenation of the language or relation A with itself. The set of strings or pairs of strings obtained by concatenating A with itself n times. For example A^3 is equivalent to [A A A]. | |||||
A | B | union of the languages or relations A and B. The set of strings or pairs of strings that are in A or in B or in both. See the discussion above. | |||||
A & B | intersection of the languages A and B. The set of strings that are both in A and in B. A and B must be simple languages, not relations. For example [a | b] & [b | c] consists of the string "b". | |||||
A - B | A minus B. The set of strings that are in the language A but not in the language B. A and B must be simple languages, not relations. [A - B] is equivalent to [A & ~B]. Note also that \a could be defined as to [? - a]. | |||||
A/B | ignore. The language or relation obtained from A by allowing strings of B to occur freely in all positions. For example, [[a b]/x] is equivalent to [x* a x* b x*]. In other words, the B strings are viewed as a kind of "noise" that is to be ignored in what is basically an A string. | |||||
A .x. B | crossproduct of the languages A and B. It denotes a regular relation that maps every string in A (upper side) to all strings in B (lower side), and vice versa. Thus [a .x. [b | c]] is equivalent to [a:b | a:c]. A and B must be simple languages, not relations. | |||||
A .o. B | composition of the relations A and B. The intersection of the lower language of A and the upper language of B works like a filter here. For example, [a:b | b:c] .o. [a:x | b:y] is equivalent to [a:y]. The b:c and a:x pairs do not contibute anything to the result because neither "c" nor "a" is in the intersection of [b | c] (the lower language of A) and [a | b] (the upper language of B). | |||||
A -> B | replacement of the language A by the language B. This denotes a relation that consists of pairs of strings that are identical except that every instance of A in the upper-side string corresponds to an instance of B in the lower-side string. For example, [a -> b] pairs "b" with "b" (no change) and "aba" with "bbb" (replacing both "a"s by "b"s). Replacement may be constrained by a specific context. For example, [A -> B || L _ R] makes the A -> B replacement only if the A string is preceded by a string from L and followed by a string from R. See Other expressions below for more discussion. | |||||
A @-> B | left-to-right, longest match replacement of the language A by the language B. Like [A -> B] except that the instances of A in the upper-side string are replaced selectively, starting from the left, choosing the longest candidate string at each point. For example, [[a | a b] @-> x] maps "ab" unambiguously to "x", whereas [[a | a b] -> x] relates it to both "xb" ("x" replacing "a") and "x" ("x" replacing "ab"). The @-> replacement may be further constrained by requiring a specific context: [A @-> B || L _ R]. | |||||
A => L _ R | restriction of the language A to the context L _ R. This denotes the language of strings in which any instance of A is preceded by an instance of L and followed by an instance of R. A, L, and R must denote simple languages, not relations. For example, [a => b _ c] contains "bac", "cbacb", etc. but not "a", "ba", or "cab". | |||||
A -> B || L _ R A -> B // L _ R |
conditional replacement of the language A by
the language B. This is like the relation [A -> B]
except that any instance of A in the upper string
corresponds to an instance of B in the lower string
only when it is preceded by an instance of L and followed
by an instance of R. Other instances of A in the
upper string remain unchanged. A, B,
L, and R must all denote simple languages,
not relations.
The slant of the double bars indicates whether the precede/follow constraints refer to the instance of A in the upper string or to its image in the lower string. In the || version, both contexts refer to the upper string. The // variant requires the left context on the lower side of the replacement, the right context on the upper side. For other versions of conditional replacement, see More about Regular Expressions. For example, [a -> b || b _ a] maps "baaaa" to "bbaaa". Here only the first "a" is replaced by "b" because it is the only one in the upper string that has "b" on the left and "a" on the right. In contrast, [a -> b // b _ a] maps all but the last "a" of "baaaa" to "b", giving "bbbba". Here the first three "a"s are all replaced because each of them is followed by an "a" in the upper string and the corresponding "b" in the lower string is preceded by "b". In other words, in the // version, the result of one replacement may serve as the left context of another replacement further to the right. | |||||
A @-> L ... R | left-to-right, longest-match markup of the language A by the prefix language L and the suffix language R. This is like the relation [A @-> B] except that any instance of A in the upper-side string corresponds to a substring that contains the original instance between a string of L and a string of R. The instances of A in upper-side strings are selected for markup under the left-to-right, longest match regimen. For example, [[a | a a] @-> %{ ... %}] relates "baaac" to "b{aa}{a}c". A, L, and R must all denote simple languages, not relations. As with other replacement relations, the markup may be further constrained by a context. For instance, [[ a | a a] @-> %{ ... %} || a _ a] maps "aaaa" into "a{aa}a" and "aaa" into "a{a}a" but the string "aa" remains unmarked because the context condition is not fulfilled. | |||||
A -> L ... R | simple markup of the language A by the prefix language L and the suffix language R. This is like [A @-> L ... R] except that all ways of factoring the input string are marked. For instance, [a | a a] -> %{ ... %} maps "aaa" into "{aa}{a}", "{a}{aa}", and "{a}{a}{a}", whereas the left-to-right, longest-match version, [a | a a] @-> %{ ... %}, gives only the output "{aa}{a}". Simple markup may be constrained by context. | |||||
A -> B, C -> D |
parallel replacement of the language A by the language B and the language C by the language D. This is like the relation [A -> B] except that there are two correspondences to consider simultaneously. For example, [a -> b, b -> a] relates "abba" to "baab". The left-to-right, longest match operator, @->, can also be used in parallel replacement expressions, but -> and @-> cannot be mixed in the same expression. | |||||
`[[A], S, L] |
substitution of symbol S in A by the symbols in list L. Here A may denote a simple language or a relation. For example, the substitution `[[k -> g || Vowel _ Vowel], Vowel, a e i o u] modifies the relation designated by the original expression so that all pairs of strings containing an instance of Vowel are replaced by pairs that instead contain one of the symbols in the list. In this case, the result is equivalent to [k -> g || [a | e | i | o | u] _ [a | e | i | o | u]]. If the list is empty, all the strings or string pairs containing S are eliminated. If L contains S, the original S strings remain and new modified strings are added for the other symbols on the list. | |||||
The boundary marker is not allowed as a constituent in a symbol pair; otherwise it can be used in a regular expression like any other symbol. For example, in the replacement expression [a -> b || [.#. | a] _ ] the union in the left context indicates that "a" is to be replaced by "b" at the beginning of a string and after another "a".
| 1. | Unary operators: ~, \, $, +, * | ||||
| 2. | Concatenation | ||||
| 3. | Binary 'Boolean' operators: |, &, - | ||||
| 4. | Other operators: .x., .o., ->, @->, => | ||||