Xerox Research Centre Europe
Copyright © 1996-1997 Xerox Corporation. All rights reserved.

Syntax and Semantics of Regular Expressions

If you already know what regular expressions are or you only want to find out some details about our notation or about certain specific operators, you should skip the introduction and go directly to the relevant section. If you are a novice reader who is not completely familiar with the concept of a regular expression, start with the introduction. This document describes a subset of the available regular expression operators. See More about Regular Expressions for a complete reference

Introduction

The language of regular expressions is a formal language, similar to formulas of Boolean logic. It has a simple syntax but the expressions can be arbitrarily complex. Like formulas of Boolean logic, regular expressions denote sets. We assume that the reader is familiar with the basic concepts of set theory (membership, union, intersection, ordered pair, etc.).
Simple expressions
The simplest kind of regular expression consists of a single symbol. For example, expression
a
denotes the set that contains the string "a" and nothing else. To distinguish between regular expressions and the sets they denote, we adopt a typographical convention: the expressions appear in fixed typewriter font; strings are enclosed in double quotes. A special kind of string, the empty string, is expressed with an empty pair of double quotes: "".

We need to distinguish two kinds of sets. A set of strings is called a language. The expression above, a, denotes the language consisting of just the string "a". As in set theory, a set consisting of pairs of strings is called a relation. The terms regular language and regular relation refer to sets that can be described by a regular expression.

The simplest kind of regular expression that denotes a relation is a symbol pair, such as
a:b
It denotes the set consisting of the ordered pair <"a", "b">. A regular relation may always be viewed as a mapping between two regular languages. In this case, the relation is simply the crossproduct of the languages denoted by the expressions a and b.

In order to distinguish the two languages that are involved in a regular relation, we call the first one the upper and the second one the lower language of the relation. Correspondingly, in the pair a:b, the first symbol, a, is called the upper symbol and the second one, b, the lower symbol. The two components of a symbol pair are separated by a colon, :, without any whitespace before or after.

Complex expressions
Complex regular expressions can be built up from simpler ones by means of regular expression operators. Whitespace is used as the concatenation operator. For example,
a b
the concatenation of a and b, denotes the language that contains just the string "ab". In general, the concatenation of two regular languages consists of strings that extend each string of the first language with all the strings of the second language.

The union operator, |, constructs a regular language that includes all the strings of the component languages. For example,
a | b
denotes the language that contains the strings "a" and "b", but not "ab".

A regular expression can be enclosed in square brackets without changing its interpretation. For example, [a | b] and [a] | [b] describe the same language as a | b. Brackets are sometimes necessary to ensure that the expression has the intended meaning. Note that
a b [a | b]
represents the language of "aba" and "abb", whereas the language described by
a b a | b
contains the strings "aba" and "b". By convention, concatenation takes precedence over union if the ambiguity is not resolved by bracketing. (More about that topic in the section on Operator precedence.)

The minus operator, -, subtracts all the strings of one language from another language. Thus
[a | b] - b
contains just the string "a". This complex expression is equivalent to the simple expression [a], that is, they denote the same language.

Empty string, Null, and Universal languages
An empty pair of brackets, [], is interpreted as denoting the empty-string language. The only string in this language is "", the empty string. Concatenating something with the empty-string language does not give rise to any new strings. Thus [a []] denotes the same language as [a]. On the other hand, the empty string makes a difference for the union operator. The language [a | []] contains two strings, "a" and "".

Enclosing a regular expression in round parentheses (as opposed to square brackets) represents a union with the empty-string language. Consequently, (a | b) is the same language as [a | b | []]. Less formally, a regular expression like (b) represents an optional occurrence of "b", i.e. either "b" or the empty string.

Subtracting the empty-string language (or any other language) from itself, [[] - []], yields a language that contains no strings at all. We call this the null language. Because the null language has no strings, subtracting or adding it to another language obviously has no effect. Concatenation is different. If one of the component expressions in concatenation is the null language, so is the result. Another expression for the null language is mentioned in the discussion of the term complement operator.

The opposite of the null language is the universal language. The universal language contains all strings of any length including the empty string. To see examples of expressions that denote the universal language, see the discussion of the complement and Kleene star operators.

Notational conventions
To make the notation less cumbersome, we systematically ignore the distinction between a language and the identity relation that maps every string of the language to itself. For example, the expression [a | b] can be used to denote either the language consisting of "a" and "b" or the relation that maps "a" to "a" and "b" to "b". For the same reason, symbol pairs with identical upper and lower symbols may be collapsed into a simple symbol. Thus we can write a:a simply as a.

We also do not always carefully distinguish between a regular expression and the language it denotes. Occasionally we speak sloppily of "the language [a | b]" in the sense of "the language denoted by the regular expression [a | b]."

Compilation of regular expressions
A regular expression can be converted into a finite-state network that encodes the language or the relation denoted by the expression. In general, each symbol or symbol pair in the expression is represented in the network by an arc with the corresponding label. For example, the regular expression [a] compiles into a network that contains a single arc labeled a. On the other hand, the relationship between complex regular expressions and the corresponding networks is typically much less transparent. For instance, there is no arc labeled with a in the network that encodes the language [\a], the union of all single symbol strings other than "a". For more information on this topic, please read the article on finite-state networks.

More about Symbols

Epsilon and Any
Two regular expression symbols have a special interpretation:
0 epsilon. A symbol that represents the empty string "".
? any. A symbol that represents any symbol that occurs in the same regular expression and any unknown symbol.
A regular expression consisting of just the epsilon symbol, 0, denotes the same empty-string language as []. The epsilon symbol is also used as the upper or the lower symbol of a symbol pair. The expression 0:a maps "" to "a", and vice versa. The inverse relation is denoted by a:0.

In order to understand the special interpretation of the any symbol in regular expressions, it is useful to take note of the fact that the expression [?] represents the same language as [? | a]. The reason is simple. Because ? stands for any symbol whatsoever, the language denoted by [?] contains all single symbol strings, including "a". Consequently, the union [? | a] does not result in the addition of any new strings. Similarly, [a ?], the concatenation of [a] and [?], is equivalent to [a [? | a]].

In the network corresponding to a complex regular expression, an instance of ? in the expression generally does not correspond to a single arc. The compiler interprets ? by constructing one arc labeled with ? and a duplicate arc for every other symbol in the expression. Consequently, in the resulting network the ? arcs only have the role of representing unknown symbols.

The difference between ? as a regular expression symbol (any) and ? as an arc label in a network (unknown) is an unfortunate source of confusion for novice users. See the discussion of ? as the unknown symbol in the article on finite-state networks.

In a later section we introduce one more special symbol, the boundary marker .#., to refer to the beginning or the end of a string in certain expressions.

Escape character

The escape character, %, eliminates any special meaning of the following character. Thus |, 0, ?, etc. can be used as ordinary symbols by prefixing them with %. For example, %0 represents the string "0" rather than the epsilon symbol; %| is the vertical bar itself, as opposed to the the union operator |. The ordinary percent sign may be expressed as %%.

Double quotes
The special meaning of any character may also be turned off by enclosing the symbol in double quotes: "0" and "?" are interpreted as ordinary letters. On the other hand, certain other strings receive a special interpretation within a doubly quoted string. For example, "\n" is interpreted as the newline character, "\t" as a tab, etc. following C conventions. The backslash is also used as a prefix in octal and hexadecimal numbers: "\101" and "\x41" are alternate ways to specify the symbol A.
Multicharacter symbols

The print name of a symbol can contain more than one character. For example, %+V represents the string "+V". Beware of multicharacter symbols that coincide with a string concatenated from shorter symbols. For example, the expression [a a | aa] denotes a language that contains the string "aa" twice. One is a single symbol and the other is the concatenation of two "a"s. If such ambiguities arise in practice, our system resolves them left-to-right in favor of the longest symbol.

Because we intentionally ignore the distinction between a language and its identity relation, a pair of identical symbols may be collapsed into a single symbol. Instead of a:a we may simply write a.

More about Operators

In addition to concatenation, union, and minus, mentioned in the Introduction, there are many other regular expression operators. The list given here is representative but not exhaustive (see More about Regular Expressions). Some of the operators below can be combined with all expressions regardless of whether they denote languages or relations. Concatenation and union are of this kind. Other operators presuppose that the component expression is of a particular type. For example, complementation, intersection, and minus can be used only with expressions that denote a simple language because regular relations are not closed with respect to these operations. Composition presupposes that the component expressions denote relations. For each of the operators described below, we specify what kind of expressions it combines with.
Unary operators
\A term complement. The set of all single-symbol strings that are not in the language A. \A is equivalent to [? - A]. Thus \a contains "b" but not "a" or the empty string, and no concatenated strings such as "bb". Note that \? denotes the null language because ? covers all single-symbol strings.
~A complement. The set of all strings that are not in the language A. For example, ~a contains "", "aa", and "aaa" but not "a". The component expression, A, cannot denote a relation. The complement of the null language, ~\?, is the universal language, and vice versa.
A+ iteration (Kleene plus). The language or relation A concatenated with itself any number of times, including zero times. A+ includes [A], [A A], [A A A], and so on ad infinitum. ?+ is the language of all nonempty strings.
A* Kleene star. The union of A+ with the empty-string language. A* is equivalent to (A+). ?* denotes the universal language.
$A containment. The set of strings or string pairs containing at least one instance of A as a subpart. For example, "bac" is a string in the language denoted by $a, speaking informally, "bac" is in $a. Here A can denote a relation. $A is equivalent to [?* A ?*]. $[] denotes the universal language.
Binary operators
A B
concatenation of the languages or relations A and B. See the discussion above.
A^n
n-ary concatenation of the language or relation A with itself. The set of strings or pairs of strings obtained by concatenating A with itself n times. For example A^3 is equivalent to [A A A].
A | B
union of the languages or relations A and B. The set of strings or pairs of strings that are in A or in B or in both. See the discussion above.
A & B
intersection of the languages A and B. The set of strings that are both in A and in B. A and B must be simple languages, not relations. For example [a | b] & [b | c] consists of the string "b".
A - B
A minus B. The set of strings that are in the language A but not in the language B. A and B must be simple languages, not relations. [A - B] is equivalent to [A & ~B]. Note also that \a could be defined as to [? - a].
A/B
ignore. The language or relation obtained from A by allowing strings of B to occur freely in all positions. For example, [[a b]/x] is equivalent to [x* a x* b x*]. In other words, the B strings are viewed as a kind of "noise" that is to be ignored in what is basically an A string.
A .x. B
crossproduct of the languages A and B. It denotes a regular relation that maps every string in A (upper side) to all strings in B (lower side), and vice versa. Thus [a .x. [b | c]] is equivalent to [a:b | a:c]. A and B must be simple languages, not relations.
A .o. B
composition of the relations A and B. The intersection of the lower language of A and the upper language of B works like a filter here. For example, [a:b | b:c] .o. [a:x | b:y] is equivalent to [a:y]. The b:c and a:x pairs do not contibute anything to the result because neither "c" nor "a" is in the intersection of [b | c] (the lower language of A) and [a | b] (the upper language of B).
A -> B
replacement of the language A by the language B. This denotes a relation that consists of pairs of strings that are identical except that every instance of A in the upper-side string corresponds to an instance of B in the lower-side string. For example, [a -> b] pairs "b" with "b" (no change) and "aba" with "bbb" (replacing both "a"s by "b"s). Replacement may be constrained by a specific context. For example, [A -> B || L _ R] makes the A -> B replacement only if the A string is preceded by a string from L and followed by a string from R. See Other expressions below for more discussion.
A @-> B
left-to-right, longest match replacement of the language A by the language B. Like [A -> B] except that the instances of A in the upper-side string are replaced selectively, starting from the left, choosing the longest candidate string at each point. For example, [[a | a b] @-> x] maps "ab" unambiguously to "x", whereas [[a | a b] -> x] relates it to both "xb" ("x" replacing "a") and "x" ("x" replacing "ab"). The @-> replacement may be further constrained by requiring a specific context: [A @-> B || L _ R].
Other Expressions
We describe here some more complex regular expressions including additional variants of the replacement expressions discussed above.In all of these cases, the L and R components are optional. The absence of the L or R part in these expressions is interpreted as denoting empty-string language.
A => L _ R
restriction of the language A to the context L _ R. This denotes the language of strings in which any instance of A is preceded by an instance of L and followed by an instance of R. A, L, and R must denote simple languages, not relations. For example, [a => b _ c] contains "bac", "cbacb", etc. but not "a", "ba", or "cab".
A -> B || L _ R

A -> B // L _ R
conditional replacement of the language A by the language B. This is like the relation [A -> B] except that any instance of A in the upper string corresponds to an instance of B in the lower string only when it is preceded by an instance of L and followed by an instance of R. Other instances of A in the upper string remain unchanged. A, B, L, and R must all denote simple languages, not relations.

The slant of the double bars indicates whether the precede/follow constraints refer to the instance of A in the upper string or to its image in the lower string. In the || version, both contexts refer to the upper string. The // variant requires the left context on the lower side of the replacement, the right context on the upper side. For other versions of conditional replacement, see More about Regular Expressions.

For example, [a -> b || b _ a] maps "baaaa" to "bbaaa". Here only the first "a" is replaced by "b" because it is the only one in the upper string that has "b" on the left and "a" on the right. In contrast, [a -> b // b _ a] maps all but the last "a" of "baaaa" to "b", giving "bbbba". Here the first three "a"s are all replaced because each of them is followed by an "a" in the upper string and the corresponding "b" in the lower string is preceded by "b". In other words, in the // version, the result of one replacement may serve as the left context of another replacement further to the right.

A @-> L ... R
left-to-right, longest-match markup of the language A by the prefix language L and the suffix language R. This is like the relation [A @-> B] except that any instance of A in the upper-side string corresponds to a substring that contains the original instance between a string of L and a string of R. The instances of A in upper-side strings are selected for markup under the left-to-right, longest match regimen. For example, [[a | a a] @-> %{ ... %}] relates "baaac" to "b{aa}{a}c". A, L, and R must all denote simple languages, not relations. As with other replacement relations, the markup may be further constrained by a context. For instance, [[ a | a a] @-> %{ ... %} || a _ a] maps "aaaa" into "a{aa}a" and "aaa" into "a{a}a" but the string "aa" remains unmarked because the context condition is not fulfilled.
A -> L ... R
simple markup of the language A by the prefix language L and the suffix language R. This is like [A @-> L ... R] except that all ways of factoring the input string are marked. For instance, [a | a a] -> %{ ... %} maps "aaa" into "{aa}{a}", "{a}{aa}", and "{a}{a}{a}", whereas the left-to-right, longest-match version, [a | a a] @-> %{ ... %}, gives only the output "{aa}{a}". Simple markup may be constrained by context.
A -> B, C -> D
parallel replacement of the language A by the language B and the language C by the language D. This is like the relation [A -> B] except that there are two correspondences to consider simultaneously. For example, [a -> b, b -> a] relates "abba" to "baab". The left-to-right, longest match operator, @->, can also be used in parallel replacement expressions, but -> and @-> cannot be mixed in the same expression.
`[[A], S, L]
substitution of symbol S in A by the symbols in list L. Here A may denote a simple language or a relation. For example, the substitution `[[k -> g || Vowel _ Vowel], Vowel, a e i o u] modifies the relation designated by the original expression so that all pairs of strings containing an instance of Vowel are replaced by pairs that instead contain one of the symbols in the list. In this case, the result is equivalent to [k -> g || [a | e | i | o | u] _ [a | e | i | o | u]]. If the list is empty, all the strings or string pairs containing S are eliminated. If L contains S, the original S strings remain and new modified strings are added for the other symbols on the list.
Boundary marker
In the restriction, =>, and conditional replacement, ->, @->, expressions we can use a special boundary marker, .#., to refer to the beginning or to the end of a string. In the left context, the boundary marker signals the beginning of the string; in the right context it means the end of the string. For example, the restriction expression [a => _ .#.] denotes the language where "a" appears only at the end of a string.

The boundary marker is not allowed as a constituent in a symbol pair; otherwise it can be used in a regular expression like any other symbol. For example, in the replacement expression [a -> b || [.#. | a] _ ] the union in the left context indicates that "a" is to be replaced by "b" at the beginning of a string and after another "a".

Operator precedence
Complex regular expressions can often be simplified by taking advantage of a convention that ranks the operators in a precedence hierarchy. For example, [\a] [b]* may be expressed more concisely as \a b* because the unary operators, term complement and Kleene star, by convention 'bind more tightly' to their component expressions than concatenation. Similarly, the brackets in a | [b c] are unnecessary because concatenation takes precedence over union; a | b c denotes the same language. In turn, union has precedence over replacement, allowing us to simplify a -> [b | c] to a -> b | c. The ranking of all the operators discussed in the previous sections is summarized in the table below. (High to Low.)
1.Unary operators: ~, \, $, +, *
2.Concatenation
3.Binary 'Boolean' operators: |, &, -
4.Other operators: .x., .o., ->, @->, =>
Unbracketed expressions with operators that have the same rank are disambiguated from left to right. Consequently, a | b & b* is interpreted as [a | b] & b* and not as a | [b & b*]. For the sake of clarity, we advise using brackets in such cases even if the ambiguity would be resolved in the intended way by the left-to-right ordering.


For more detailed information please look at our pointers to finite-state literature. We welcome your comments and suggestions.
Copyright © The Document Company - Xerox 1997 -1997. All rights reserved.
Written by Lauri Karttunen. Last modified on Jan. 12, 1998 .