Lauri Karttunen
Xerox Palo Alto Research Center
Abstract: The Finite-State Lexicon Compiler is an authoring tool for creating lexicons and lexical transducers. It is designed to be used in conjunction with transducers produced with the Xerox Two-level Rule Compiler.
1. What is it?
The Finite-State Lexicon Compiler ("lexc") is an authoring tool for
creating lexicons and lexical transducers. A lexical transducer is a
specialized finite-state automaton that maps lexical forms and
morphological specifications to corresponding inflected forms, and
vice versa. For example, a lexical transducer for English might relate
the word dining to dine+PresPart and swam to
swim+Past.
The input to the lexicon compiler is a text file in a format similar to
lexicons accepted by Kimmo Koskenniemi's two-l program [1] and Evan
Antworth's PC-KIMMO [2] but with several novel features. The most
notable innovation is that the entries in the source lexicon may specify a
relation between two forms. Idiosyncratic mappings, such as the i~a
alternation in swam may be encoded directly in the lexicon whereas
regular alternation, such as the loss of the stem-final e in
dining may be handled by general rules.
The output from the compiler is a simple finite-state automaton or a
transducer. The result of the compilation may be composed with a rule
transducer or transducers that effect morphological alternations or in
other ways modify the initial result. The lexicon compiler is designed
to be used with transducers produced with the Xerox two-level rule
compiler [3].
The compiler was written in C at Xerox PARC in 1992-93 by Lauri
Karttunen and Todd Yampol. It is based on Karttunen's 1990 Common Lisp
implementation of a similar utility. The compiler currently works on
the Macintosh and UNIX operating systems. The interface to the
compiler is the same in both versions except that the Macintosh
version allows the user to interact with the file system using mouse
clicks and dialogue boxes.
Section 2 of this document (Making simple automata) gives a
brief tutorial on how the compiler can be used to make finite-state
lexicons. Section 3 (Making lexical transducers) introduces
more sophisticated techniques. Section 4 (Checking the result)
illustrates ways to explore the compiled lexicon. Section 5
(Scripting) is about setting up files that make it easier to
update the lexicon. Section 6 (Regular expressions) lists the
regular expression operators that can be used to construct lexical
entries. On-line help messages are listed in APPENDIX 1. APPENDIX 2
is a detailed example of cascaded composition.
2. Making simple automata
When you start the compiler (by double-clicking the icon on a
Macintosh or by typing 'lexc' in UNIX), it prints a
welcoming banner as shown in (1) and waits for the user to type a
command terminated with a carriage return on the
lexc> prompt line. The commands are grouped under
three general headings Input/Output,
Operations, and Display. Each group is
further divided into subsections of related commands. The user may
review the list at any time by answering the prompt with a question
mark. The 'help' command, followed by a name of a
particular command, gives advice on what the commands do, 'help
all' prints out all help messages.
(1)
2.1 A few words
To start off with a very simple example, we construct a lexicon that
contains the forms of two lexemes, dine and line. Let us
assume that line is both a noun and a verb, dine only a
verb. The simplest way to define the lexicon is just to list all the
forms as shown in (2) under the heading LEXICON
Root. In general, there may be any number of sublexicons, but
one of them must have the name Root. The other
sublexicons, if any, are named according to the needs and tastes of
the lexicon writer.
Entries are separated by semicolons. Each entry consists of a
form, possibly null, and a continuation class. The
continuation class must be the name of another sublexicon or
#, a terminator symbol. The word END
at the end of the file is optional. The exclamation point,
!, marks a comment; the rest of the line is ignored.
Compilation of this lexicon creates a finite-state machine (fsm) that
accepts all only the listed word forms.
(2)
(3)
(4)
(5)
The command 'save-source' saves the compiled source
net as a binary file. In the case at hand, the source net is the
simple six-state automaton shown in Figure 1.
Figure 1
2.2 A lexicon of dates
To move on to a more challenging example, let us construct a lexicon
that contains all valid dates of the form "<month> <date>,
<year>", such as March 14, 1993. To keep things simple,
we adopt the restriction that dates must conform to this format
exactly; that is, a space between the month and the day, a comma and a
space between the day and the year. Let us also stipulate that
<year> must be a number between 0 and 9999. Every date in the
lexicon should of course be valid; March 31, 1993 is included
but April 31, 1993 is not because there are only 30 days in
April.
Because whitespace characters (blank, tab, carriage return, linefeed)
are generally ignored by the compiler, we must use a percent sign,
%, as an escape character in front of a blank that
should not count as whitespace. For reasons that will soon become
evident, the numeral 0 is also special. If it is used
as an ordinary digit in a word, we must prefix it with
%. To construct the date June 10, 2000, for
example, we need to specify it as June% 1%0,%
2%0%0%0. Zero is special only in the word, not in the
continuation class of an entry.
(6)
Example (7) illustrates the structure of the lexicon by showing how
the date March 14, 1993 is assembled by linking the components
in thirteen sublexicons.
(7)
(8)
(9)
The command 'lookup' enables the user to query whether
a particular word (a date in this case) is in the lexicon. If the word
is found, it is echoed; a line of stars indicates that the form was
not found in the lexicon.
(10)
2.3 Using regular expressions
Although the lexicon in (6) is correct (except for leap days), it is
more cumbersome than it needs to be. For example, the last sublexicon
(0To9Years) says, for each of the ten digits, that the string
may terminate with that digit.
(11)
(12) Inside a
regular expression, all symbols must be surrounded by whitespace or
delimiters. (13) illustrates an alternative form of the
Root lexicon in which 31 and 30 day months are
enumerated by regular expressions.
(13)
2.4 Composition with a rule transducer
Let us now consider the problem of how to eliminate from our lexicon
the dates for nonexistent leap days, such as February 29,
1993. February has 29 days only in years that are divisible by
four, except that centuries are leap years only if the number is still
divisible by four when the last two zeros are left out. For example,
1900 was not a leap year but in the year 2000 there will be 29 days in
February.
We could solve this problem by constructing a special chain of
sublexicons for February alone but it would be complicated. An
equivalent but notationally easier solution is to eliminate the
unwanted leap days by composing the source lexicon with a finite-state
transducer (fst) that expresses the rule stated above. This provides
us with a simple example of how the lexicon compiler interacts with
transducers produced with the two-level rule compiler [3].
A two-level grammar for prohibiting the occurrence of February
29 with an invalid year is presented in (14).
(14)
The rows in (15) represent numbered states, the columns are labeled
with a symbol that represents an "equivalence class," a set of symbols
that are equivalent within the fst. The equivalence classes are listed
at the bottom of the table; the first symbol in each class is used in
the table to represent the class. This transducer is trivial in the
sense that it maps every valid date to itself without changing
anything. The implicit boundary symbol, #, at the end
and at the beginning of a date is mapped to the epsilon symbol
(printed as 0). Invalid dates that contain February
29 in the wrong type of year disappear in the transduction.
The top half of the table tracks the string February 29, to
state 14, the bottom half tracks the years up to the implicit final
word boundary #:0 (here zero represents the epsilon
symbol). The gaps in the last column indicate states where February
29 is followed by the wrong kind of year or no year at
all. Because these states do not allow the date to end there, the
unwanted leap dates are eliminated.
(15)
(16)
An argument to the 'read-rules' command should be a
name of a Xerox format [6] binary file containing one or more
transducers. Any file of this type is acceptable regardless of whether
it was produced with the two-level rule compiler or by other means.
The command 'compose-result' creates the result net by
applying the rule transducer (or transducers) to the source net.
(17)
(18)
(19)
(20)
(21)
3.1 A simple lexical transducer
A lexical transducer, first discussed in Karttunen, Kaplan, and Zaenen
[7], is a lemmatizer that maps inflected surface forms to their
canonical dictionary representations and vice versa. In simple cases,
such transducers can be compiled directly from a source lexicon. For
example, we can modify the minilexicon in (3) to make it a
morphological analyzer/generator for the forms of dine and
line. The new lexicon is shown in (22).
(22)
The declaration Multichar_Symbols at the beginning of
the file tells the compiler that +Pl and the other
morphological tags are to be parsed as a single symbol rather than as
a sequence of symbols. Using multicharacter symbols conveniently makes
the morphological alphabet disjoint from ordinary letters. In pairs
such as +N:0, +Sg:0, the numeral
0 stands for epsilon, the empty string, whereas the
real zero is designated by %0.
(23) shows the composition of the surface form lines in (22),
given the interpretation of zero as the empty string. The single form
entries, e.g. line, are viewed as equivalent to
identity mappings (line:line).
(23)
(24)
(25)
(26)
(27)
The commands 'lookup swam' and 'lookdown
swim+V+Past' both locate this same path in the transducer by
matching the argument against one or the other side of the transducer
and printing out the symbols of the opposite side, or the same symbols
if there is no difference. Because of epsilons and because one input
symbol may correspond to more than one symbol on the opposite side,
these operations in a transducer are in general not deterministic. On
the other hand, lexical transducers have the advantage that they are
bidirectional: analysis and generation make use of the same data
structure and the same search algorithm.
The simple left-to-right conversion of the lexical and surface forms
to a sequence of symbol pairs can lead to nonoptimal results. If the
past tense of fight is encoded as fight:fought,
the symbols line up in the manner shown in (28).
(28)
3.2 Composing with rule transducers
Let us consider augmenting our minilexicon (22) with the present
participle forms dining, lining, and swimming. We
need a new sublexicon for present participle and a way of linking it
to the entries for dine, line, and swim. The
deletion of the stem final e in dining and lining
and the gemination of m in swimming could be handled in
the same way as the stem alternation in swim and
swum:
(30)
Our solution is to write general e-deletion and gemination
rules that apply to the canonical lexical form in the environment of a
vowel-initial suffix. When the source lexicon is composed with the
rule transducers, the result transducer will correctly represent these
alternations. The two rules are shown in (31). See [3] for details of
the syntax for specifying two-level rules.
(31)
(32)
(33)
(34) illustrates the effect of the composition.
(34)
(35)
(36) traces the construction of the final result with the compiler. We
assume here that the rules in (31) have already been compiled and
stored in the binary file ex5-rules.fst.
(36)
To verify that we have the expected result, we can again use
'lookup' and 'lookdown', first on the
result net:
(37)
(38)
3.3 Cascade of compositions
For languages with a more complex morphology than English, it may be
impractical to encode all the morphological alternations in a single
rule system. One such case is the plural formation of compound nouns
in French, discussed in Karttunen, Kaplan and Zaenen [7]. It can best
be described by two separate rule systems. The lexicon is composed
with the first rule system and the resulting transducer is composed
with the second set of rules. Appendix 2 discusses the case of French
nouns in more detail.
In a cascaded composition, the result net of the first composition
becomes the source net for the second step. The command
'result-to-source' makes that switch. When a new set
of rules has been read in, the composition can be continued. This
process is similar to the traditional derivation with an ordered set
of rewrite rules. The difference of course is that the rules are
applied to the lexicon as a whole rather than to individual words. At
every point in the cascade, the result net maps the lexical forms to
the output of the last rule or set of rules in the composition.
4. Checking the result
We have already illustrated three commands that display words in the
source and the result net: 'random',
'lookup', and 'lookdown'. The
'check-all' command systematically compares the source
net to the result net and displays the effect of the compose
operation. It looks at every word in the source net and tries to find
all words in the result net that have been derived from it by
composition. The results of the search can be written to a file or
displayed on the screen as in (39).
(39)
The output from 'check-all' is controlled by three
variables that can be toggled on and off. If the variable
'singles' (default OFF) is ON, all lexical forms that
have only one surface realization are printed. If
'duplicates' (default ON) is ON, lexical forms with
more than one realization are printed. If 'failures'
(default ON) is ON, entries in the source net that have no surface
realization are printed with a row of asterisks in place of the
missing output. In a typical case, most words in a source lexicon have
one and only one surface realization. It may be unnecessary and time
consuming to print them all if the lexicon is large. By turning the
'singles' flag off, the lexicon writer sees only the
entries in the source lexicon that have several surface realizations
or no realizations at all. This is useful information when the rule
set is still in development. Badly constrained alternations may result
in invalid multiple outputs, and conflicting rules may completely
block some valid forms. If all the flags are turned off,
'check-all' produces no output other than the
statistics at the end.
A common source of error is a mismatch between the lexicon and the
rule alphabets. If some words in the source net contain symbols that
are not in the rule alphabet, they disappear in the composition. The
'compose-result' command checks the alphabets and
prints a warning if some symbol in the lower side of the source net
does not appear on the upper side of the rule transducers. The command
'labels' allows the user to display the pair alphabet
on the screen:
(40)
The lexicon compiler converts strings to symbols from left to right
always preferring the longest match. It is important to keep this in
mind in designing the alphabet. It might be safe to declare
qu as a single symbol in English because it is nearly
always pronounced [k]. On the other hand, declaring
th a multicharacter symbol would clearly be a bad
decision because words like hothouse would be misparsed as
h-o-th-o-u-s-e. If the lexicon consists of ordinary alphabetic
symbols and multicharacter tags as our examples do, it is best to use
some nonalphabetic character such as + as the first
letter of all multicharacter symbols. In this way the regular alphabet
and the special alphabet are disjoint and every string is guaranteed
to parse correctly.
5. Scripting
5.1 Simple script
The commands 'begin-script',
'end-script', and 'run-script' speed
up the development of a lexicon by allowing the user to record a
sequence of commands to a file and execute them all at once from
inside the lexicon compiler. 'begin-script' starts the
recording, 'end-script' closes the script file:
(41)
(42)
The simple script facility is useful on the Macintosh. In UNIX, it is
possible to automate the development even better with the help of the
'make' command which is invoked from the UNIX command
line. Let us illustrate this by constructing a Makefile for the
result net of the current example.
The "target" here is a situation in which the file containing the
result net is up-to-date. Let us assume that the net was saved as
ex5-result.fst, the source net as ex5-source.fst, and
the rules as ex5-rules.fst. The source net was compiled from
ex5-lex.txt, the rules from ex5-rules.txt. If the rule
file gets modified, ex5-rules.fst needs to be recomputed; if
ex5-lex.txt changes, ex5-source.fst should be
recompiled; if either ex5-source.fst or ex5-rules.fst
has changed, ex5-result.fst has to be recomposed.
To get 'make' to keep track of the situation and
update the files as needed, the dependencies and the update actions
have to be specified in a file called Makefile. (44) is an
example of a Makefile that keeps ex5-rules.fst
up-to-date by using 'source' to run a specific update
script when one of the dependent files has changed.
(44)
(45)
The update-source script in (46) is simpler.
(46)
(47)
In Section 2.3 we saw that entries in a source lexicon may consist of
a regular expression and a continuation class. In this section we
discuss regular expressions in more detail. The lexicon compiler
follows the same syntactic conventions as the two-level rule compiler
(see Section 3.2 in [3]) but it has a few additional operators. The
common operators are listed in (48) in a descending order of
precedence. A few comments are added to explain the meaning of some
nonstandard operators.
(48)
In addition to (48), the lexicon compiler recognizes the following
regular expression operators:
(49)
The three main commands are 'compile-source',
'read-rules', and
'compose-result'. The term source net is used
for the automaton (either a simple lexicon or a transducer) compiled
from the input lexicon. The command 'compile-source'
creates the source net. The term rule nets refers to a set of
rule transducers created by the two-level rule compiler (or by other
means) and saved in binary form. The command
'read-rules' reads the rule nets from a file. The
command 'compose-result' composes the source net with
the rule net or nets and produces the result net. We will start
with these commands and explain the rest as we go along. ******************************************************
* Two-Level Lexicon Compiler 1.2 *
* created by *
* Lauri Karttunen and Todd Yampol *
* Copyright (c) 1992 by the Xerox Corporation. *
* All Rights Reserved. *
******************************************************
Input/Output
--------------------------------------------------------------
Source: compile-source, read-source, result-to-source, save-source.
Rules: read-rules.
Result: read-result, save-result, source-to-result.
Operations
----------------------------------------------------------------
Composition: compose-result, extract-surface-forms.
Checking: check-all, lookdown, lookup, random.
Switches: duplicates, failures, singles, time.
Scripts: begin-script, end-script, run-script.
Display
-------------------------------------------------------------------
Misc: banner, labels, status, storage.
Help: help, ?.
Type 'quit' to exit.
lexc>
To get a more interesting example, let us define another version of
the lexicon that distinguishes stems from suffixes. The new version is
listed in (3). Some entries have no content other than the
continuation class, some have a sublexicon name as the continuation
class.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
! ex0-lex.txt
LEXICON Root
dine #; dines #; dined #;
line #; lines #; lined #;
END
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Example (4) shows what happens when this file is processed by the
compiler. Note that the name of the text file containing the lexicon
specification is ex1-lex.txt.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
! ex1-lex.txt
LEXICON Root
Noun;
Verb;
LEXICON Noun
line NounSuffix;
LEXICON Verb
dine VerbSuffix;
line VerbSuffix;
LEXICON NounSuffix
s #;
#;
LEXICON VerbSuffix
s #;
d #;
#;
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
The resulting lexicon contains the same six words we saw in (2). The
form lines actually gets constructed twice, once as a verb,
once as a noun. After minimization, only one of them remains. The
compiler first processes each sublexicon separately, keeping track of
continuation pointers, and then joins the structures to a single
network which is determinized and minimized. The form dines,
for example, is created by joining its components in the manner shown
in (5). The first component is a null string.lexc> compile-source ex1-lex.txt
Opening 'ex1-lex.txt'...
Root...2, Noun...1, Verb...2, NounSuffix...2, VerbSuffix...3
Building lexicon...Minimizing...Done!
SOURCE: 6 states, 7 arcs, 6 words
Sublexicon: Root
Verb VerbSuffix
Substring:
dine s

The complete listing for the date lexicon is given in (6). It is a
little more complicated than one might expect because it enforces the
constraint that the day numeral must not be larger than the number of
days in the month. (February is a special case; we will come to that
shortly.) Another reason for the complexity is that we decided to
avoid leading zeros. For example, January 1, 100 is included
but January 1, 0100 is not considered a valid date. A more
compact encoding for the same lexicon is presented in (52), Section
6.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
! ex2-lex.txt
LEXICON Root
January 31days; February 29days; March 31days; April 30days;
May 31days; June 30days; July 31days; August 31days;
September 30days; October 31days; November 30days; December 31days;
LEXICON 31days
% 31 Years; 30days;
LEXICON 30days
% 3%0 Years; 29days;
LEXICON 29days
% 1To9Days; % 1 0To9Days; % 2 0To9Days;
LEXICON 0To9Days
%0 Years; 1To9Days;
LEXICON 1To9Days
1 Years; 2 Years; 3 Years; 4 Years; 5 Years;
6 Years; 7 Years; 8 Years; 9 Years;
LEXICON Years
,% ThousandsOfYears; ,% HundredsOfYears; ,% TensOfYears;
,% 0To9Years;
LEXICON ThousandsOfYears
1 0To999Years; 2 0To999Years; 3 0To999Years; 4 0To999Years;
5 0To999Years; 6 0To999Years; 7 0To999Years; 8 0To999Years;
9 0To999Years;
LEXICON 0To999Years
%0 0To99Years; HundredsOfYears;
LEXICON HundredsOfYears
1 0To99Years; 2 0To99Years; 3 0To99Years; 4 0To99Years;
5 0To99Years; 6 0To99Years; 7 0To99Years; 8 0To99Years;
9 0To99Years;
LEXICON 0To99Years
%0 0To9Years; TensOfYears;
LEXICON TensOfYears
1 0To9Years; 2 0To9Years; 3 0To9Years; 4 0To9Years; 5 0To9Years;
6 0To9Years; 7 0To9Years; 8 0To9Years; 9 0To9Years;
LEXICON 0To9Years
%0 #; 1 #; 2 #; 3 #; 4 #; 5 #; 6 #; 7 #; 8 #; 9 #;
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
The listing in (8) traces the compilation of the date lexicon. Sublexicon Substring
Root March
31days
30days
29days 1
0To9Days
1To9Days 4
Years ,
ThousandsOfYears 1
0To999Years
HundredsOfYears 9
0To99Years
TensOfYears 9
0To9Years 3
As it reads the source file, the compiler prints the name of each
sublexicon and the number of entries in it. When the compilation is
finished, the command 'random' is a convenient way to
examine the contents of the lexicon. It randomly prints 15 words from
the lexicon. Any gross errors are likely to show up quickly. Note that
there are altogether 3.66 million dates in our 64-state lexicon. A few
too many for ten thousand years--we have yet to eliminate unwanted
leap days.lexc> compile-source ex2-lex.txt
Opening 'ex2-lex.txt'...
Root...12, 31days...2, 30days...2, 29days...3, 0To9Days...2,
1To9Days...9, Years...4, ThousandsOfYears...9,
0To999Years...2,
HundredsOfYears...9, 0To99Years...2, TensOfYears...9,
0To9Years...10
Building lexicon...Minimizing...Done!
SOURCE: 64 states, 147 arcs, 3660000 words.
lexc> save-source ex2-lex.fsm
Opening 'ex2-lex.fsm'...
Done.
The note Using SOURCE means that the examples are
generated from the source net. In the case at hand, there is no result
net so far (recall that a "result net" is the result of composing a
source net with a rule net or nets). The randomization algorithm is
useful but not optimal; short words are much more likely to show up
than long ones.lexc> random
NOTE: Using SOURCE.
November 18, 2
September 10, 3
June 4, 2985
October 4, 8
December 9, 5253
March 8, 6
February 4, 7
October 9, 7103
November 9, 4805
October 31, 0
August 6, 28
September 4, 63
September 9, 1
September 8, 8767
December 3, 160
If the argument to a command is not included on the same line with the
command, the program prompts for it, as in the last example.lexc> lookup March 14, 1993
NOTE: Using SOURCE.
March 14, 1993
lexc> lookup February 30, 1993
NOTE: Using SOURCE.
****************
lexc> lookup
NOTE: Using SOURCE.
Word: February 29, 1993
February 29, 1993
It would be simpler to say--just once--that any of the ten digits
counts as an acceptable termination. The lexicon compiler also accepts
the latter type of specification. The form part of an entry may be not
just a simple string but a regular expression describing a set
(possibly infinite) of forms that share a common continuation
class. In the case at hand, we may combine the ten entries in (11) to
the single entry in (12).LEXICON 0To9Years
%0 #; 1 #; 2 #; 3 #; 4 #; 5 #; 6 #; 7 #; 8 #; 9 #;
The angle brackets indicate that the expression contained within them
is to be interpreted as a regular expression and not as an ordinary
string; the vertical bar is a symbol for disjunction.LEXICON 0To9Years
<%0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9> # ;
More about regular expressions in Section 6.LEXICON Root
<J a n u a r y | M a r c h | M a y | J u l y |
A u g u s t | [O c t o | D e c e m] b e r > 31days;
<A p r i l | J u n e | [S e p t | N o v ] e m b e r> 30days;
February 29days;
Please consult [3] for a detailed explanation of the rule formalism
used here. Ritchie et al. [4] and Sproat [5] are good general
introductions to two-level morphology. For the sake of this example,
it is sufficient to know that the relatively straightforward encoding
of the leap day constraint can easily be compiled to the transducer
shown in (15) in the form of a transition table.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
! ex3-rule.txt : Restricting "February 29" to leap years
Alphabet
A D F J M N O S a b c e g h i l m n o p r s t u v y % %,
%0 1 2 3 4 5 6 7 8 9 #:0;
Sets
Digit = %0 1 2 3 4 5 6 7 8 9 ;
Even = %0 2 4 6 8 ;
Odd = 1 3 5 7 9 ;
Definitions
BadOne = \[%0 | 4 | 8] ; ! Any symbol other than 0, 4, or 8.
BadTwo = Even BadOne | Odd \[2|6] ;
NotDivisibleByFour = BadOne | Digit* BadTwo ;
Rules
"No February 29" ! The string "February 29, " must not be followed
! by a year not divisible by four or
! by a century that is not divisible by four
F e b r u a r y % 2 9 %, % /<= _ NotDivisibleByFour (%0 %0) #: ;
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Let us now see how the transducer is applied to the lexicon. We assume
that the "No February 29" rule has already been compiled by the
two-level rule compiler and is stored in the binary file
ex3-rule.fst. In (16), we first compile the source lexicon as
before, then load the rule transducer with the command
'read-rules'. "No February 29"
A F a b e r u y , 0 1 2 4 6 9 #:0
1: 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
2: 1 2 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1
3: 1 2 1 4 1 1 1 1 1 1 1 1 1 1 1 1 1
4: 1 2 1 1 1 5 1 1 1 1 1 1 1 1 1 1 1
5: 1 2 1 1 1 1 6 1 1 1 1 1 1 1 1 1 1
6: 1 2 7 1 1 1 1 1 1 1 1 1 1 1 1 1 1
7: 1 2 1 1 1 8 1 1 1 1 1 1 1 1 1 1 1
8: 1 2 1 1 1 1 1 9 1 1 1 1 1 1 1 1 1
9: 1 2 1 1 1 1 1 1 10 1 1 1 1 1 1 1 1
10: 1 2 1 1 1 1 1 1 1 1 1 1 11 1 1 1 1
11: 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 12 1
12: 1 2 1 1 1 1 1 1 1 13 1 1 1 1 1 1 1
13: 1 2 1 1 1 1 1 1 14 1 1 1 1 1 1 1 1
14: 17 20 17 17 17 17 17 17 17 17 14 15 16 14 16 15 17
15: 17 20 17 17 17 17 17 17 17 17 23 15 14 16 14 15
16: 17 20 17 17 17 17 17 17 17 17 21 15 16 14 16 15
17: 1 2 1 1 1 1 1 1 1 1 18 1 1 1 1 1
18: 1 2 1 1 1 1 1 1 1 1 19 1 1 1 1 1 1
19: 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
20: 1 2 1 1 3 1 1 1 1 1 18 1 1 1 1 1
21: 17 20 17 17 17 17 17 17 17 17 22 15 16 14 16 15 17
22: 17 20 17 17 17 17 17 17 17 17 14 15 16 14 16 15
23: 17 20 17 17 17 17 17 17 17 17 24 15 16 14 16 15
24: 17 20 17 17 17 17 17 17 17 17 22 15 16 14 16 15
Equivalence classes:
((A D J M N O S c g h i l m n o p s t v) (F) (a) (b) (e) (r)
(u) (y) ( ) (,) (0) (1 3 5 7) (2) (4 8) (6) (9) (#:0))
Alternatively, we could start by reading the rules. The order does not
matter in this case, although it may be relevant if the alphabet of
the lexicon contains multicharacter symbols. We will come back to this
later.lexc> compile-source ex2-lex.txt
Opening 'ex2-lex.txt'...
Root...12, 31days...2, 30days...2, 29days...3, 0To9Days...2,
1To9Days...9, Years...4, ThousandsOfYears...9,
0To999Years...2,
HundredsOfYears...9, 0To99Years...2, TensOfYears...9,
0To9Years...10
Building lexicon...Minimizing...Done!
SOURCE: 64 states, 147 arcs, 3660000 words.
lexc> read-rules ex3-rule.fst
Opening 'ex3-rule.fst'...
1 net read.
The result of the composition looks promising: the result net has 17
states more than the source net, but the word count has decreased by
7575 dates. The 'status' command in (18) gives a quick
overview of the situation.lexc> compose-result
No epenthesis.
Initial and final word boundaries added.
..............Done.
90 states, 320 arcs, 3652425 words.
Minimizing...Done.
81 states, 262 arcs, 3652425 words.
We can now use 'lookup' to verify that the composition
had the intended effect. Because we now have a source and a result
net, we can choose which one to use. If a result net has been
computed, it is the default target for lookup. As we see by the string
of asterisks returned, February 29, 1993 has disappeared from
the lexicon.lexc> status
SOURCE: ex2-lex.txt = 64 states, 147 arcs, 3660000 words.
RULES: ex3-rule.fst = 1 transducer.
RESULT: ex2-lex.txt+ex3-rule.fst = 81 states, 262 arcs, 3652425 words.
Switches: singles=OFF, duplicates=ON, failures=ON, time=OFF, verbose=OFF
As intended, the year 2000 is correctly included as a leap year whereas the
year 1900 has been eliminated as it should be.lexc> lookup February 29, 1993
Use (s)ource or (r)esult? [r]:
NOTE: Using RESULT.
****************
Because everything appears to be in order, we save the result with the
'save-result' command. The command
'save-source' would save the source net. More about
saving in Section 7.lexc> lookup February 29, 1900
Use (s)ource or (r)esult? [r]:
NOTE: Using RESULT.
****************
lexc> lookup February 29, 2000
Use (s)ource or (r)esult? [r]:
NOTE: Using RESULT.
February 29, 2000
3. Making lexical transducerslexc> save-result ex3-lex.fsm
Opening 'ex3-lex.fsm'...
Done.
In addition to the citation form of the word itself, the entries in
(22) contain morphological information: part of speech
(+N, +V) and type of inflection
(+Pl, +Sg, +Base,
+Sg3, +Past,
+PastPart). The main technical difference between the
old and the new version is that some of the entries now contain a pair
of forms: a lexical form and a surface form, separated
by a colon. The compiler interprets such a pair as a regular relation,
a sequence of pairwise correspondences between the symbols. For
example, the entry +Pl:s pairs the lexical form
+Pl with the surface form s. In other
words, in forms constructed with this entry, plural is realized as
s.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
! ex4-lex.txt
Multichar_Symbols
+N +Sg +Pl +V +Base +Past +PastPart +Sg3
LEXICON Root
Nouns; Verbs;
LEXICON Nouns
line N;
LEXICON Verbs
dine V; line V;
LEXICON N
+N:0 NounInfl;
LEXICON NounInfl
+Sg:0 #; +Pl:s #;
LEXICON V
+V:0 RegVerbConj;
LEXICON RegVerbConj
VCommon; RegPast; RegPastPart;
LEXICON VCommon
+Base:0 #; +Sg3:s #;
LEXICON RegPast
+Past:d #;
LEXICON RegPastPart
+PastPart:d #;
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
The command 'compile-source' converts the lexicon in
(22) to a transducer, which can be used bidirectionally for analysis
('lookup') as well as generation
('lookdown'):Sublexicon: Root Nouns N NounInfl
Lexical form: line +N +Sg
Surface form: line s
The 'lookup' command matches its argument against the
lower side of the lexical transducer (ignoring epsilons) and produces
the corresponding lexical string or strings if the match succeeds;
'lookdown' does the same operation in the opposite
direction.lexc> lookup lines
NOTE: Using SOURCE.
line+N+Pl
lexc> lookup dined
NOTE: Using SOURCE.
dine+V+PastPart
dine+V+Past
Because entries in the lexicon specification may consist of a pair of
forms, it is simple to deal with words that have irregular surface
forms in their paradigm. For example, strong verbs like swim
have irregular past and past participle forms; in other respects they
are regular. To add swim to our minilexicon, we need three new
entries in LEXICON Verbs for the verb itself and three
new sublexicons. The new sublexicons, shared by irregular forms of
other strong verbs, encode the fact that in these verbs the past tense
and past participle categories are expressed as a stem alternation and
not by a suffix. These additions are listed in (26).lexc> lookdown dine+V+Sg3
NOTE: Using SOURCE.
dines
lexc> lookdown line+N+Pl
NOTE: Using SOURCE.
lines
When the lexicon is compiled into a transducer, the pairs
swim:swam and swim:swum are combined
into a sequence of symbol pairs in which every character of the
lexical form is paired with the corresponding symbol of the surface
form. If the symbols are identical, we represent the result as a
single symbol (s instead of s:s),
otherwise as a pair (i:a and i:u). If
one of the forms is shorter than the other, epsilons are automatically
added to the end to make the lengths equal. The past tense of
swim is represented in the transducer as the path of symbols
and symbol pairs shown in (27). swim VIrreg;
swim:swam VIrregPast;
swim:swum VIrregPastPart;
LEXICON Virreg
+V:0 VCommon; ! to get "swims"
LEXICON VIrregPast
+V+Past:0 #; ! or <+V:0 +Past:0>
LEXICON VIrregPastPart
+V+PastPart:0 #;
![]()
Because by default the compiler adds an epsilon to the end of the
shorter string, the resulting path, <f i:o g:u h:g t:h
0:t>, is very irregular. We can avoid such ill-matched
sequences by specifying in the lexical entry directly where epsilons
should go. The optimal solution in this case is
fi0ght:fought, which yields the path <f i:o
0:u g h t>. Here, as elsewhere, the 0 that
represents an epsilon is not preceded by a percent sign.Lexical string: f i g h t
Surface string: f o u g h t
But this would be unsatisfactory because the alternations in (30) are not
idiosyncratic to these particular words; they apply productively to all words
of the same type.dine:din
line:lin
swim:swimm
(32) is our minilexicon in its final form.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
! ex5-rules.txt
Alphabet
a b c d e f g h i j k l m n o p q r s t u v w x y z %^:0
#:0;
Sets
Cons = b c d f g h j k l m n p q r s t v w x y z ;
Vowel = a e i o u ;
Rules
"E-deletion" ! A stem final e is deleted between a
! consonant and a suffix initial vowel
! Note: ^ marks the suffix boundary.
e:0 <=> Cons _ %^: Vowel;
"Gemination" ! The stem final consonant in a CVC
structure
! is doubled before a suffix initial vowel
0:Cx <=> Cons Vowel Cx %^: _ Vowel ; where Cx in (b d g m n p r
s t z);
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
After compilation, the present participle form of swim has the
structure shown in (33).!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
! ex5-lex.txt
Multichar_Symbols
+N +Sg +Pl +V +Base +Past +PastPart +Sg3 +PresPart
LEXICON Root
Nouns; Verbs;
LEXICON Nouns
line N;
LEXICON Verbs
dine V; line V; stop V;
swim VIrreg; swim:swam VIrregPast; swim:swum
VIrregPastPart;
LEXICON N
+N:0 NounInfl;
LEXICON NounInfl
+Sg:0 #; +Pl:s #;
LEXICON V
+V:0 RegVerbConj;
LEXICON RegVerbConj
VCommon; RegPast; RegPastPart;
LEXICON VCommon
+Base:0 #; +Sg3:s #; 0+PresPart:^ing #;
LEXICON RegPast
0+Past:^ed #;
LEXICON RegPastPart
0+PastPart:^ed #;
LEXICON VIrreg
+V:0 VCommon;
LEXICON VIrregPast
+V+Past:0 #;
LEXICON VIrregPastPart
+V+PastPart:0 #;
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
The lower side of the source transducer contains "intermediate"
forms. The surface form or forms will be determined by the application
of the two-level rules in (31). In order to control the environment in
which these rules apply, we explicitly mark the boundary between the
stem and the present participle ending in the lexicon. The boundary
marker, ^, corresponds to an epsilon in the
lexical form (see the last entry in LEXICON
VCommon). Because the only surface realization of
^ allowed in the rule alphabet is 0
(epsilon), the boundary marker disappears from the result when the
source net is composed with the rule transducers. Lexical Form: s w i m +V +PresPart
Intermediate Form: s w i m ^ i n g
In the composition the intermediate form is eliminated. The result net maps
swim+V+PresPart directly to swimming:Lexical Form: s w i m +V +PresPart
Intermediate Form: s w i m ^ i n g
Surface Form: s w i m m i n g
Because we need the e-deletion rule anyway for the present
participle forms of dine and line, we changed the past
and past participle suffix from d to
ed. This enables us to handle all forms of regular
verbs ending in a consonant, such as stop.Lexical Form: s w i m +V +PresPart
Surface Form: s w i m m i n g
Let us get an overview of the situation with the 'status'
command:lexc> read-rules ex5-rules.fst
Opening 'ex5-rules.fst'...
2 nets read.
lexc> compile-source ex5-lex.txt
Opening 'ex5-lex.txt'...
Root...2, Nouns...1, Verbs...6, N...1, NounInfl...2, V...1,
RegVerbConj...3, VCommon...3, RegPast...1, RegPastPart...1,
VIrreg...1, VIrregPast...1, VIrregPastPart...1
Building lexicon...Minimizing...Done!
SOURCE: 30 states, 42 arcs, 22 words.
lexc> compose-result
Epenthetical forms are possible.
Initial and final word boundaries added.
.....Done.
45 states, 59 arcs, 22 words.
Minimizing...Done.
33 states, 48 arcs, 22 words.
As we see in (36) there are two transducers in the rule file, compiled
from the two rules in (31) with the two-level rule compiler [3]. In
the previous composition example in Section 2.4, we only had a single
rule transducer. The result here is what we get by composing the
source net with the intersection of the two rule transducers.lexc> status
SOURCE: ex5-lex.txt = 30 states, 42 arcs, 22 words.
RULES: ex5-rules.fst = 2 transducers.
RESULT: ex5-lex.txt+ex5-rules.fst = 33 states, 48 arcs, 22 words.
Switches: singles=OFF, duplicates=ON, failures=ON, time=OFF, verbose=OFF
If we choose the source net as the target, we can still see the
intermediate forms. The source net is left unchanged by the compose
operation.lexc> lookup swimming
Use (s)ource or (r)esult? [r]:
NOTE: Using RESULT.
swim+V+PresPart
lexc> lookup dining
Use (s)ource or (r)esult? [r]:
NOTE: Using RESULT.
dine+V+PresPart
lexc> lookdown line+V+PresPart
Use (s)ource or (r)esult? [r]:
NOTE: Using RESULT.
lining
Section 4 discusses other ways of examining the result of the composition.lexc> lookdown dine+V+PresPart
Use (s)ource or (r)esult? [r]: s
NOTE: Using SOURCE.
dine^ing
The left column shows the lexical (upper) side of all the words in the
source net, the right column displays the corresponding surface
(lower) forms in the result net. For example, we see that the
composition has produced swimming from
swim+V+PresPart.lexc> check-all
FLAGS: singles=ON, duplicates=ON, failures=ON.
Output file (- = stdout, ; = dialog box) [cancel]: -
==================================================================
Mon Mar 22 17:11:52 1993
SOURCE: ex5-lex.txt
RESULT: ex5-lex.txt+ex5-rules.fst
FLAGS: singles=ON, duplicates=ON, failures=ON.
==================================================================
dine+V+Base dine
dine+V+Sg3 dines
dine+V+PastPart dined
dine+V+Past dined
dine+V+PresPart dining
stop+V+Base stop
stop+V+Sg3 stops
stop+V+PastPart stopped
stop+V+Past stopped
stop+V+PresPart stopping
swim+V+Base swim
swim+V+Sg3 swims
swim+V+PresPart swimming
swim+V+Past swam
swim+V+PastPart swum
line+V+Base line
line+V+Sg3 lines
line+V+PastPart lined
line+V+Past lined
line+V+PresPart lining
line+N+Sg line
line+N+Pl lines
Source word count: 22
Single: 22, Multiple: 0, Failure: 0
Output word count: 22
Any symbol in the source lexicon that consists of more than one letter
must be declared in a Multichar_Symbols declaration
(cf. (22) and (32)), otherwise it will be misparsed as a symbol
sequence.lexc> labels
1 = SOURCE, 2 = RULES, 3 = RESULT.
Choose 1, 2, or 3 [Cancel]: 1
d e i l m n o p s t w i:a i:u 0:d 0:g 0:n 0:^ +Base:0 +N:0 +Past:e
+Past:0 +PastPart:e +PastPart:0 +Pl:s +PresPart:i +Sg:0 +Sg3:s +V:0
Size: 28
lexc> labels
1 = SOURCE, 2 = RULES, 3 = RESULT.
Choose 1, 2, or 3 [Cancel]: 2
a b c d e f g h i j k l m n o p q r s t u v w x y z e:0 #:0 0:b 0:d
0:g 0:m 0:n 0:p 0:r 0:s 0:t 0:z ^:0
Size: 39
lexc> labels 3
d e i l m n o p s t w e:0 i:a i:u 0:d 0:g 0:m 0:n 0:p +Base:0
+N:0 +Past:e +Past:0 +PastPart:e +PastPart:0 +Pl:s +PresPart:i
+Sg:0 +Sg3:s +V:0
Size: 30
A script file, such as (42) may also be produced with a text
editor and executed with a single command as in (43).lexc> begin-script update
Recording commands in 'update'. Type 'end-script' to stop.
lexc> compile-source ex5-lex.txt
....
lexc> end-script
End of script. Closing 'update'
(43)compile-source ex5-lex.txt
compose-result
check-all -
save-result ex5-result.fst
5.2 Using makelexc> run-script update
The UNIX command 'make ex5' checks the dependencies
and carries out any required update actions in the appropriate
order. The scripts for the updates are in three separate files,
update-result.txt (45), update-source.txt (46), and
update-rules.txt (47).##########################################################
# Makefile for ex5
ex5: ex5-result.fst
ex5-result.fst: ex5-source.fsm ex5-rules.fst
csh<<!source update-result.txt
!
ex5-source.fsm: ex5-lex.txt
csh<<!source update-source.txt
!
ex5-rules.fst: ex5-rules.txt
csh<<!source update-rules.txt
!
##########################################################
The update-result script in (45) invokes the lexicon compiler, reads
in the source and the rule nets, composes them, turns on the singles
flag, produces the check-all file ex5-check.log, saves the
result net as ex5-result.fst and quits.lexc<<!
read-source ex5-source.fst
read-rules ex5-rules.fst
compose-result
singles
check-all ex5-check.log
save-result ex5-result.fst
quit
!
Updating the rules requires the two-level rule compiler. The
update-rules script in (47) invokes the rule compiler (twolc) and runs
the appropriate commands. For details, see [3].lexc<<!
compile-source ex5-lex.txt
save-source ex5-source.fst
quit
!
6. Regular expressionstwolc<<!
read-grammar ex5-rules.txt
compile
save-binary ex5-rules.fst
quit
!
The lexicon compiler requires that regular expressions be enclosed in
a pair of angle brackets, as shown in (12) and (13), to distinguish
them from simple lexical forms and form pairs of type
string1:string2. All operators, special characters, grouping
elements, angle brackets, punctuation and whitespace characters must
be prefixed with the escape character (%) if they are
used as ordinary symbols.Unary operators
~
complement
\
term complement
\a is any symbol other than a.
$
containment
$ais any string that contains at least one
a.
*
Kleene star
+
Kleene plus
()
option
(A) B is equivalent to A B | B.
Binary operators
/
ignore
[a b c]/x is equivalent to
x* a x* b x* c x*.
concatenation
|
union
&
intersection
-
minus
A - B is equivalent to
A & ~B.
Special symbols
0
epsilon
?
other
%
escape
"
multiple escape
"a b c" is equivalent to
a% b% c.
!
begin comment
Rest of the line is ignored.
Grouping elements
[ ]
[A] is equivalent to A.
{ }
can only be used around a union: {A | B}.
| A^n | is equivalent to exactly n concatenations of A with itself. |
| A^n,m | is the language consisting of at least n and at most m concatenations of A. Here m must be larger than n. |
| A^<n | is less than n concatenations of A with itself, including the empty string. |
| A^>n | more than n concatenations of A with itself. |
The iteration operator has the same precedence as * and +.
(50)
| A .x. B | is the regular relation consisting of the Cartesian product of A and B. For example, if A is [a b | c] and B is [y z], [A .x. B] is the relation [a:y b:z | c:y 0:z]. The crossproduct [s w i m .x. s w u m] is equivalent to the expression [s w i:u m]. The crossproduct operator has lower precedence than any of the operators in (48). |
(51)
| @"file" | is the regular relation saved in file in binary format by a 'save-source' or 'save-result' command, or by other means. For example, if the file noun.fsm contains a lexicon of nouns, the regular expression [@"noun.fsm" [# @"noun.fsm"]*] constructs an infinite lexicon of nouns and noun compounds: noun1#noun2, noun1#noun2#noun3, and so on. The insert operator has the same precedence as ~, \, and $. |
Using the insert and iteration constructs we can abbreviate our date lexicon--cf. (6), (12), (13) in Section 2--to the form shown in (52). We assume here that the files 0to9.fsm and 1to9.fsm have the contents suggested by the name, a net with the digits 0 to 9 and 1 to 9, respectively.
(52)
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
! ex6-lex.txt
LEXICON Root
<J a n u a r y | M a r c h | M a y | J u l y |
A u g u s t | [O c t o | D e c e m] b e r > 31days;
<A p r i l | J u n e | {S e p t | N o v} e m b e r> 30days;
February 29days;
LEXICON 31days
% 31 Years; 30days;
LEXICON 30days
% 3%0 Years; 29days;
LEXICON 29days
<% {@"1to9.fsm" | {1|2} @"0to9.fsm"}> Years;
LEXICON Years
<%,% {@"0to9.fsm" | @"1to9.fsm" @"0to9.fsm"^1,3}> #;
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
7. Miscellaneous
7.1 Saving
The two output commands ('save-source' and
'save-result') save the net in the binary format
described in [6]. This same format is used by the rule compiler to
store transducers. If the user is only interested in the surface forms
of the result net, the surface language of the transducer can be
extracted with the command
'extract-surface-forms'. This operation converts the
result net to a simple automaton.
7.2 Storage
The command 'storage' gives statistics of memory
allocation:
(53)
The command 'time' toggles an ON/OFF switch (default OFF). If
the switch is ON, the execution time is printed for all commands.
Acknowledgments
Thanks to Beth Bryson, Ronald M. Kaplan, Hyuk-Chul Kwon, Christian
Rohrer and Annie Zaenen for comments and suggestions. Special thanks
to Kenneth R. Beesley for editorial assistance.
References
[1] Koskenniemi, Kimmo. Two-level Morphology: A General
Computational Model for Word-Form Recognition and
Production. Department of General Linguistics. University of
Helsinki. 1983.
[2] Antworth, Evan L. PC-KIMMO: a two-level processor for
morphological analysis. Occasional Publications in Academic
Computing No. 16, Summer Institute of Linguistics, Dallas,
Texas. 1990.
[3] Karttunen, Lauri and Kenneth R Beesley. Two-Level Rule
Compiler Technical Report. ISTL-92-2. Xerox Palo Alto Research
Center. Palo Alto, California. 1992.
[4] Ritchie, Graeme D., Graham J. Russell, Alan W. Black, and Stephen
G. Pulman. Computational Morphology: Practical Mechanisms for the
English Lexicon. MIT Press. 1992.
[5] Sproat, Richard. Morphology and Computation. MIT
Press. Cambridge, Massachusetts. 1992.
[6] Karttunen, Lauri. Binary Encoding Format for Finite-State
Networks. Technical Report SSL-90-17. Xerox Palo Alto Research
Center. Palo Alto. March 1990.
[7] Karttunen, Lauri, Ronald M. Kaplan and Annie Zaenen. "Two-Level
Morphology with Composition". In the Proceedings of the Fifteenth
International Conference on Computational
Linguistics. COLING-92. Nantes, 23-28 August 1992. Vol. I
141-48. ICCL. 1992.
7.3 Timelexc> storage
States: 64 in use, 20 free, 2340 allocated (32k)
Arcs: 147 in use, 27 free, 2048 allocated (32k)
Nets: 1 in use, 11 free, 327 allocated (16k)
Alphs: 2 in use, 22 free, 1638 allocated (16k)
These heaps: 96k
Other heaps: 64k