Parsing Corpora

In the XTAG project, we have used corpus analysis in two main ways: (1) to measure the performance of the English grammar on a given genre and (2) to identify gaps in the grammar. The second type of evaluation involves performing detailed error analysis on the sentences rejected by the parser, and we have done this several times on WSJ and Brown data. Based on the results of such analysis, we prioritize upcoming grammar development efforts. The results of a recent error analysis are shown in Table G.1. The table omits parsing errors caused by mistakes of the POS tagger, which were the largest single source of errors (32). At this point, we have added a treatment of punctuation to handle #1, an analysis of time NPs (#2), a large number of multi-word prepositions (part of #4), gapless relative clauses (#7), and bare infinitives (#14), and have added the missing subcategorization (#3) and missing lexical entry (#12). We are in the process of extending the parser to handle VP coordination (#9) (see Section 22 on recent work to handle VP and other predicative coordination). We find this method of error analysis very useful for focusing grammar development in a productive direction.

Rank  No. of errors  Category of error
#1    11             Parentheticals and appositives
#2    8              Time NP
#3    8              Missing subcat
#4    7              Multi-word construction
#5    6              Ellipsis
#6    6              Not sentences
#7    3              Relative clause with no gap
#8    2              Funny coordination
#9    2              VP coordination
#10   2              Inverted predication
#11   2              Who knows
#12   1              Missing entry
#13   1              Comparative?
#14   1              Bare infinitive

Table G.1: Results of Corpus-Based Error Analysis
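
The prioritization step described above can be illustrated with a short Python sketch that ranks error categories by frequency and reports how much of the tabulated error mass the top categories cover. The counts are copied from Table G.1; the function names `rank_errors` and `cumulative_coverage` are hypothetical and are not part of the XTAG tools. As in the table, POS tagger errors are left out.

```python
# Counts copied from Table G.1 (POS tagger errors are excluded, as in the table).
ERROR_COUNTS = {
    "Parentheticals and appositives": 11,
    "Time NP": 8,
    "Missing subcat": 8,
    "Multi-word construction": 7,
    "Ellipsis": 6,
    "Not sentences": 6,
    "Relative clause with no gap": 3,
    "Funny coordination": 2,
    "VP coordination": 2,
    "Inverted predication": 2,
    "Who knows": 2,
    "Missing entry": 1,
    "Comparative?": 1,
    "Bare infinitive": 1,
}


def rank_errors(counts):
    """Return (rank, category, count, share) tuples, largest count first.

    Hypothetical helper. Python's sort is stable, so categories with
    equal counts keep the order in which they appear in `counts`.
    """
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    return [(i, cat, n, n / total) for i, (cat, n) in enumerate(ranked, start=1)]


def cumulative_coverage(counts, top_k):
    """Fraction of all tabulated errors accounted for by the top_k categories."""
    total = sum(counts.values())
    top = rank_errors(counts)[:top_k]
    return sum(n for _, _, n, _ in top) / total


if __name__ == "__main__":
    for rank, category, n, share in rank_errors(ERROR_COUNTS):
        print(f"#{rank:<3} {n:>3}  {share:6.1%}  {category}")
    print(f"Top 4 categories cover {cumulative_coverage(ERROR_COUNTS, 4):.1%}")
```

A cumulative figure of this kind is one possible way to decide how far down the ranking a development cycle should reach; the report itself does not prescribe a cutoff.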


To ensure that we are not losing coverage of certain phenomena as we extend the grammar, we have a benchmark set of grammatical and ungrammatical sentences from this technical report. We parse these sentences periodically to ensure that in adding new features and constructions to the grammar, we are not blocking previous analyses. There are approximately 590 example sentences in this set.
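
A regression check of this kind might be sketched as follows in Python. This is an illustration under stated assumptions, not the procedure actually used in the project: the `parse` callable stands in for the real parser and is assumed to return a (possibly empty) list of derivations, and `BenchmarkItem`, `find_regressions`, `toy_parse`, and the three toy sentences are hypothetical. A grammatical sentence is expected to receive at least one derivation, and an ungrammatical one none.

```python
from typing import Callable, Iterable, List, NamedTuple


class BenchmarkItem(NamedTuple):
    sentence: str
    grammatical: bool  # False for starred (ungrammatical) examples


def find_regressions(
    items: Iterable[BenchmarkItem],
    parse: Callable[[str], list],
) -> List[BenchmarkItem]:
    """Return the items whose parse outcome disagrees with their label.

    `parse` is an assumed interface: it returns a list of derivations,
    empty when the sentence is rejected.
    """
    failures = []
    for item in items:
        accepted = len(parse(item.sentence)) > 0
        if accepted != item.grammatical:
            failures.append(item)
    return failures


def toy_parse(sentence: str) -> list:
    """Stand-in parser for demonstration: accepts one known sentence."""
    known = {"the dog barked"}
    return ["derivation"] if sentence in known else []


# Invented toy benchmark; the real set has roughly 590 sentences.
benchmark = [
    BenchmarkItem("the dog barked", True),
    BenchmarkItem("dog the barked", False),
    BenchmarkItem("the cat slept", True),  # the toy parser lacks this one
]

failures = find_regressions(benchmark, toy_parse)
for item in failures:
    label = "grammatical" if item.grammatical else "ungrammatical"
    print(f"REGRESSION ({label}): {item.sentence}")
```

Checking the ungrammatical sentences matters as much as checking the grammatical ones: a grammar extension that starts accepting a starred example has overgenerated, which a coverage-only test would miss.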
