Comparison with IBM

The evaluation in this section was done with the earlier 1995 release of the grammar. This section describes an experiment to measure the crossing bracket accuracy of the XTAG-parsed IBM-manual sentences. In this experiment, XTAG parses of 1100 IBM-manual sentences were ranked using certain heuristics. The ranked parses were compared^31.3 against the bracketing given in the Lancaster Treebank of IBM-manual sentences^31.4. Table G.5 shows the results obtained by XTAG in this experiment. It also shows the results of the latest IBM statistical grammar (Jelinek et al., 1994) on the same genre of sentences. Only the highest-ranked parse from each system was used for this evaluation. Crossing Bracket Accuracy is the percentage of sentences with no pairs of brackets crossing the Treebank bracketing (e.g. ( ( a b ) c ) has a crossing bracket measure of one if compared to ( a ( b c ) ) ). Recall is the ratio of the number of correct constituents in the XTAG parse to the number of constituents in the corresponding Treebank sentence. Precision is the ratio of the number of correct constituents to the total number of constituents in the XTAG parse.
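The three measures defined above can be sketched in a few lines of Python. This is only an illustrative sketch, not the scoring program used in the experiment: it assumes each parse is reduced to a set of unlabeled (start, end) word spans, that a constituent counts as correct when the same span appears in the Treebank bracketing, and that the crossing measure counts parse brackets that cross at least one Treebank bracket. The function names are hypothetical.

```python
from typing import Iterable, Set, Tuple

Span = Tuple[int, int]  # (start, end) word offsets, end exclusive


def crosses(a: Span, b: Span) -> bool:
    """True if the spans overlap without one containing the other."""
    return a[0] < b[0] < a[1] < b[1] or b[0] < a[0] < b[1] < a[1]


def crossing_count(parse: Set[Span], gold: Set[Span]) -> int:
    """Parse brackets that cross at least one Treebank bracket (assumed definition)."""
    return sum(1 for p in parse if any(crosses(p, g) for g in gold))


def precision_recall(parse: Set[Span], gold: Set[Span]) -> Tuple[float, float]:
    """Precision and recall over unlabeled spans (assumed matching criterion)."""
    correct = len(parse & gold)
    precision = correct / len(parse) if parse else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


def crossing_bracket_accuracy(pairs: Iterable[Tuple[Set[Span], Set[Span]]]) -> float:
    """Percentage of sentences whose parse has no crossing brackets."""
    pairs = list(pairs)
    clean = sum(1 for parse, gold in pairs if crossing_count(parse, gold) == 0)
    return 100.0 * clean / len(pairs)


# The example from the text: ( ( a b ) c ) compared to ( a ( b c ) ).
parse = {(0, 2), (0, 3)}  # ( ( a b ) c )
gold = {(1, 3), (0, 3)}   # ( a ( b c ) )
print(crossing_count(parse, gold))    # 1
print(precision_recall(parse, gold))  # (0.5, 0.5)
```

Here the bracket around "a b" crosses the Treebank bracket around "b c", giving a crossing measure of one, and only the full-sentence bracket is shared by both bracketings.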

System	# of sentences	Crossing Bracket Accuracy	Recall	Precision
XTAG	1100	81.29%	82.34%	55.37%
IBM Statistical grammar	1100	86.20%	86.00%	85.00%

Table G.5: Performance of XTAG on IBM-manual sentences

As can be seen from Table G.5, the precision figure for the XTAG system is considerably lower than that for IBM. For the purposes of comparative evaluation against other systems, we had to use the same crossing-brackets metric, though we believe that the crossing-brackets measure is inadequate for evaluating a grammar like XTAG, for two reasons. First, the parse generated by XTAG is much richer in its representation of the internal structure of certain phrases than the parses in manually created treebanks (e.g. IBM: [_N your personal computer], XTAG: [_NP [_G your] [_N [_N personal] [_N computer]]]). This is reflected in the number of constituents per sentence, shown in the last column of Table G.6.^31.5

System	Sent. Length	# of sent	Av. # of words/sent	Av. # of Constituents/sent
XTAG	1-10	654	7.45	22.03
XTAG	1-15	978	9.13	30.56
IBM Stat. Grammar	1-10	447	7.50	4.60
IBM Stat. Grammar	1-15	883	10.30	6.40

Table G.6: Constituents in XTAG parse and IBM parse
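The effect of the richer bracketing on precision can be made concrete with the "your personal computer" example. The sketch below is an illustration only: it assumes a simplified notation in which every opening bracket is followed by a category label (written without the subscript underscore), it counts every bracket pair as a constituent, including those over single words, and it matches constituents by unlabeled span. Whether the actual evaluation counted word-level brackets is not stated in the text. The function name is hypothetical.

```python
from typing import List, Tuple


def bracket_spans(bracketed: str) -> List[Tuple[int, int]]:
    """Return one (start, end) word span per bracket pair.

    Assumes every '[' is immediately followed by a category label.
    """
    tokens = bracketed.replace("[", " [ ").replace("]", " ] ").split()
    stack: List[int] = []
    spans: List[Tuple[int, int]] = []
    position = 0          # number of words read so far
    expect_label = False  # the token after '[' is a label, not a word
    for token in tokens:
        if token == "[":
            stack.append(position)
            expect_label = True
        elif token == "]":
            spans.append((stack.pop(), position))
        elif expect_label:
            expect_label = False
        else:
            position += 1
    return spans


ibm = bracket_spans("[N your personal computer]")
xtag = bracket_spans("[NP [G your] [N [N personal] [N computer]]]")

correct = len(set(xtag) & set(ibm))
print(len(ibm), len(xtag))       # 1 5
print(correct / len(set(xtag)))  # 0.2  (precision)
print(correct / len(set(ibm)))   # 1.0  (recall)
```

Under these assumptions the XTAG bracketing contains the single IBM constituent and crosses nothing, yet four of its five constituents have no counterpart in the flatter bracketing, so precision drops while recall is unaffected.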

A second reason for considering the crossing bracket measure inadequate for evaluating XTAG is that the primary structure in XTAG is the derivation tree, from which the bracketed tree is derived. Two identical bracketings for a sentence can have completely different derivation trees (e.g. "kick the bucket" as an idiom vs. its compositional use). A more direct measure of the performance of XTAG would evaluate the derivation structure, which captures the dependencies between words.

XTAG Project
http://www.cis.upenn.edu/~xtag