Machine Learning and Natural Language

Fall 2005

Experimental Assignment I. Lexical Paraphrasing: A Statistical Study (Due 10/21/05)

General

Lexical Paraphrasing

The goal of this problem set is to write a program that identifies verb pairs with the following property: one of the two verbs can replace the other within some sentences such that the meaning of the resulting sentence entails the meaning of the original one.
The assignment is based on the paper
 
O. Glickman and I. Dagan
Acquiring lexical paraphrases from a single corpus,
In Recent Advances in Natural Language Processing III, Nicolov, Nicolas, Kalina Bontcheva, Galia Angelova and Ruslan Mitkov (eds.)
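
To make the task concrete, here is a minimal sketch of how candidate verb pairs might be collected from parsed text. It is not the method of the paper; it only illustrates the simple heuristic that two verbs seen with the same subject and object are candidates for being paraphrases. The input format, a list of (verb, subject, object) tuples, the function name candidate_verb_pairs, and the min_shared threshold are all assumptions made for illustration and do not describe the actual data files.

```python
from collections import defaultdict
from itertools import combinations


def candidate_verb_pairs(instances, min_shared=2):
    """Return unordered verb pairs that share at least `min_shared`
    distinct (subject, object) argument pairs.

    `instances` is an iterable of (verb, subject, object) tuples.
    This tuple format is hypothetical, not the format of the course data.
    """
    # Group the verbs observed with each (subject, object) pair.
    verbs_by_args = defaultdict(set)
    for verb, subj, obj in instances:
        verbs_by_args[(subj, obj)].add(verb)

    # Count, for every pair of verbs, how many argument pairs they share.
    shared = defaultdict(int)
    for verbs in verbs_by_args.values():
        for v1, v2 in combinations(sorted(verbs), 2):
            shared[(v1, v2)] += 1

    return {pair: n for pair, n in shared.items() if n >= min_shared}


# Toy data, invented for this example.
instances = [
    ("buy", "company", "firm"),
    ("acquire", "company", "firm"),
    ("buy", "investor", "stock"),
    ("acquire", "investor", "stock"),
    ("sell", "investor", "stock"),
]
pairs = candidate_verb_pairs(instances)
print(pairs)
```

On the toy data only the pair (acquire, buy) passes the threshold. The pair (buy, sell) shares one argument pair, which shows why shared arguments alone are weak evidence: antonyms often occur in the same contexts.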

The assignment has 5 parts. The Preparation and Evaluation (P&E) Team will be responsible for the first and last stages. However, I will provide you with already preprocessed text. The data is taken from the AQUAINT corpus and has already been processed using Dekang Lin's dependency parser, minipar.
A preliminary snapshot of the data is available here.
Some details on the files are here.
This is a very large corpus; you may decide not to use all of it. Please note that the data is not freely available and is given to you only for the purpose of this project. Please do not place it anywhere that is freely accessible. The Preparation and Evaluation Team will need to decide what other preprocessing needs to be done, and they will present these details in class on Oct. 11. They will also make the details available on a page linked from the course web page. At this stage, the task definition as well as the form of the output need to be clear to all other teams.
At the end, on Oct. 26, they will also submit a final document (and give a presentation) which summarizes the results of all teams, compares the results of all teams on the common tasks, and highlights other interesting experiments teams chose to perform.

In between, all other teams will do the other stages. These reports will be due on Oct. 21. Do not wait until the detailed presentation of data and task on Oct. 11 to start your work. After you read this document and the relevant papers, you will have enough information to start your work.

The Assignment

Please note that the goal of stages (2) and (3) is to identify verb pairs with the following property: one of the two verbs can replace the other within some sentences such that the meaning of the resulting sentence entails the meaning of the original one.
That is, the task is not context sensitive. In each of these stages you will generate a list of verb pairs and provide a way to evaluate how well you do. For stage (4) you may design your own extension -- change the definition of the task in some way (e.g., make it context sensitive), extend the coverage of the algorithm, propose a different algorithm, etc. Consequently, the output might be different, and you will have to choose its form and figure out a way to evaluate it.
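
As one possible way to evaluate a list of verb pairs, the sketch below computes precision and recall against a set of manually judged pairs. The actual evaluation protocol is defined by the P&E team; the function name precision_recall, the gold list, and the choice to treat pairs as unordered are assumptions made for illustration only.

```python
def normalize(pair):
    """Treat a verb pair as unordered: (a, b) and (b, a) are the same pair."""
    return tuple(sorted(pair))


def precision_recall(proposed, gold):
    """Compare proposed verb pairs against manually judged (gold) pairs.

    Both arguments are iterables of 2-tuples of verbs.
    Returns (precision, recall); each is 0.0 when its denominator is empty.
    """
    proposed_set = {normalize(p) for p in proposed}
    gold_set = {normalize(g) for g in gold}
    correct = proposed_set & gold_set
    precision = len(correct) / len(proposed_set) if proposed_set else 0.0
    recall = len(correct) / len(gold_set) if gold_set else 0.0
    return precision, recall


# Toy lists, invented for this example.
proposed = [("buy", "acquire"), ("sell", "buy"), ("say", "state")]
gold = [("acquire", "buy"), ("state", "say"),
        ("begin", "start"), ("purchase", "buy")]

precision, recall = precision_recall(proposed, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Recall computed this way is only meaningful relative to the gold list at hand. When no complete gold list exists, a common alternative is to judge a random sample of the proposed pairs by hand and report precision alone.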

Report

  1. Describe what you did, the specifics of your models, any additional decisions you had to make that are not detailed in the paper, and the rationale behind your decisions.
  2. Do the same for your proposed extension.
  3. Report on the comparison between the two approaches.
  4. Provide the code for your preprocessing and for your model estimation and evaluation.
  5. Present the output of your program on the training corpus and the test corpus.
  6. Package the code so that one can run it (details: P&E team).


The Assignment

We are interested in understanding the task (Is it well defined? Do things have to be modified for one reason or another?) and in a comparative study of the two approaches. When designing the experiments (P&E team), care should be taken that the comparison is fair. If you decide to do other experiments, make sure you have something to compare with.

What to submit (Updated)

The P&E team will add here a detailed list of what is expected and how to submit it.

Here are the details, courtesy of the P&E Team. Also, look here for additional files.

Grading

Your grade depends on:
  1. The quality of your report
  2. The quality of your results.
  3. Your originality in going beyond the minimal requirements.

Due date

Thursday, Oct. 21 (with some other due dates relevant to the P&E team).
Dan Roth