DANIEL ZEMAN will talk at the XTAG meeting, July 29

	Learning Verb Subcategorization from Corpora

	Daniel Zeman (joint work with Anoop Sarkar)

	Thursday, July 29th, 10:30 (Large Conference Room)


The subcategorization of verbs is one of the most essential issues in
parsing, helping us to attach the right arguments to the verb and also
to understand their function and meaning. Several techniques and
results have been reported on learning subcat frames (SFs) from text
corpora. All of this work deals with English. We were able to
automatically extract SFs for Czech, which is a free-word-order
language, where verb complements are marked by inflection. Unlike
existing work, we do not assume that the set of SFs is known to us in
advance. Also in contrast to that work, we use syntactically annotated
data (a dependency treebank) in which the subcategorization information
is NOT given; although such data is less noisy than raw text, we have
discovered interesting problems that a user of a raw or tagged corpus
is unlikely to face.

The talk will give a brief description of those properties of Czech
that have to be taken into account when searching for SFs. Then I will
show the differences from other research efforts: we are able to
find all the complements of a verb, but since many of them are to be
treated as adjuncts, we have to filter them out. I will describe a novel
technique that uses intersections of observed frames to distinguish
arguments from adjuncts.

Using our techniques, we are able to achieve nearly 90% accuracy in
distinguishing arguments from adjuncts in new parsed text. We will also
mention some future work which tries to extract the same SF information
using only morphologically tagged text, and also the use of SFs in
parsing, improving annotations in treebanks, and treebank-grammar