
 
 CIS Research Seminar Series, 2008 

 

Tuesday, November 4th, 2008


Ani Nenkova

Computer Science Department

University of Pennsylvania


"Modeling text quality in newspaper text and machine translation"

Abstract

 

What are the characteristics of well-written text? People have strong intuitions about this but can rarely give a precise answer. General computational models of text quality do not exist either, even though they are a critical component of a range of text-producing applications such as summarization, machine translation, and text generation.

 

The goal of our work is to develop a model of text quality for use in language applications. For newspaper text, we combine lexical, syntactic, and discourse features to produce a highly predictive model of human readers' judgments of text readability. This is the first study to take into account such a variety of linguistic factors and the first to empirically demonstrate that discourse relations are strongly associated with the perceived quality of text. We show that various surface metrics generally expected to be related to readability are not very good predictors of readability judgments in our Wall Street Journal corpus. We also establish that readability predictors behave differently depending on the task: predicting the readability of a single text or comparing the readability of two texts. Our experiments indicate that discourse relations are the one class of features that is robust across these two tasks.
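The difference between the two tasks named above can be made concrete with a small sketch. This is only an illustration under stated assumptions, not the method used in the work: the feature values and ratings are hypothetical toy data, and the function names `pearson` and `pairwise_accuracy` are invented for this example. The first function measures how well a single predictor tracks human ratings; the second measures how often the predictor picks the better text in a pair.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def pairwise_accuracy(scores, ratings):
    """Fraction of text pairs with different ratings that the scores order correctly."""
    correct = 0
    total = 0
    for i, j in combinations(range(len(scores)), 2):
        if ratings[i] == ratings[j]:
            continue  # neither text is preferred, so the pair is skipped
        total += 1
        # Same sign means the predictor and the readers agree on the order.
        if (scores[i] - scores[j]) * (ratings[i] - ratings[j]) > 0:
            correct += 1
    return correct / total


# Hypothetical data: one feature value and one human rating per text.
feature_scores = [0.2, 0.5, 0.4, 0.9]
human_ratings = [1.0, 3.0, 2.0, 5.0]

print(round(pearson(feature_scores, human_ratings), 3))
print(pairwise_accuracy(feature_scores, human_ratings))
```

On this toy data the predictor orders every pair correctly even though its correlation with the ratings is below 1, which is one simple way a feature can look different under the two tasks.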

In the context of machine translation, we study sentence fluency, an important component of overall text readability. We report the results of an initial study of how well surface syntactic statistics and language model features predict fluency judgments originally collected for evaluating machine translation. We find that these features are weakly but significantly correlated with readability. Machine and human translations can be distinguished with accuracy over 80%, and performance on pairwise comparison of fluency is also very high, over 90%.
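As a rough illustration of what a language model feature can look like, the sketch below scores a sentence by its average per-word log-probability under a unigram model with add-one smoothing. The abstract does not say which language models were used, so the model choice, the toy corpus, and the names `train_unigram` and `avg_log_prob` are all assumptions made for this example. A unigram model ignores word order, so a realistic fluency feature would more likely come from a higher-order n-gram model.

```python
from collections import Counter
from math import log


def train_unigram(corpus_sentences):
    """Count word occurrences in a tokenized training corpus."""
    counts = Counter(word for sentence in corpus_sentences for word in sentence)
    return counts, sum(counts.values())


def avg_log_prob(sentence, counts, total):
    """Average add-one-smoothed unigram log-probability per word."""
    vocab_size = len(counts) + 1  # one extra slot reserved for unseen words
    log_probs = [
        log((counts[word] + 1) / (total + vocab_size)) for word in sentence
    ]
    return sum(log_probs) / len(log_probs)


# Hypothetical toy corpus, already tokenized.
corpus = [["the", "market", "rose"], ["the", "market", "fell"]]
counts, total = train_unigram(corpus)

seen = avg_log_prob(["the", "market", "rose"], counts, total)
unseen = avg_log_prob(["the", "markt", "rose"], counts, total)
print(seen > unseen)
```

A sentence containing a word the model has never seen receives a lower score, which is the kind of signal such a feature contributes to a fluency predictor.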

 

Tuesday, November 4, 2008
3:00 - 4:15
Wu & Chen
101 Levine Hall

