Optimizing Statistical Machine Translation for Text Simplification.
Wei Xu, Courtney Napoles, Ellie Pavlick, Jim Chen, and Chris Callison-Burch.
TACL-2016.
Abstract
Most recent sentence simplification systems use basic machine translation models to learn lexical and syntactic paraphrases from a manually simplified parallel corpus. These methods are limited by the quality and quantity of manually simplified corpora, which are expensive to build. In this paper, we conduct an in-depth adaptation of statistical machine translation to perform text simplification, taking advantage of large-scale paraphrases learned from bilingual texts and a small amount of manual simplifications with multiple references. Our work is the first to design automatic metrics that are effective for tuning and evaluating simplification systems, which will facilitate iterative development for this task.
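The metric introduced in this paper (SARI) scores a simplification against both the input and multiple human references, crediting words that are correctly added, kept, and deleted. Below is a toy, unigram-only sketch of that idea for illustration; it is not the paper's exact formula.

# Toy, unigram-only illustration of a simplification metric in the spirit of
# SARI: reward words correctly added, kept, and deleted relative to the input,
# judged against multiple references. A sketch, not the paper's exact formula.
def toy_simplification_score(source, output, references):
    src, out = set(source.split()), set(output.split())
    ref_union = set().union(*(set(r.split()) for r in references))

    def ratio(numer, denom):
        return len(numer) / len(denom) if denom else 0.0

    # Addition: output words not in the source that some reference also added.
    add_score = ratio((out - src) & ref_union, out - src)
    # Keeping: source words retained by the output and by the references.
    keep_score = ratio((out & src) & ref_union, out & src)
    # Deletion: source words dropped by the output that the references also drop.
    del_score = ratio((src - out) - ref_union, src - out)
    return (add_score + keep_score + del_score) / 3.0

print(toy_simplification_score(
    "the cat perched upon the mat",
    "the cat sat on the mat",
    ["the cat sat on the mat", "a cat sat on the mat"]))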
BibTex
@article{Xu-EtAl:2016:TACL,
author = {Wei Xu and Courtney Napoles and Ellie Pavlick and Quanze Chen and Chris Callison-Burch},
title = {Optimizing Statistical Machine Translation for Text Simplification},
journal = {Transactions of the Association for Computational Linguistics},
volume = {4},
year = {2016},
url = {http://www.cis.upenn.edu/~ccb/publications/optimizing-machine-translation-for-text-simplifciation.pdf},
pages = {401--415}
}
|
Clustering Paraphrases by Word Sense.
Anne Cocos and Chris Callison-Burch.
NAACL-2016.
Abstract
Automatically generated databases of English paraphrases have the drawback that they return a single list of paraphrases for an input word or phrase. This means that all senses of polysemous words are grouped together, unlike WordNet which partitions different senses into separate synsets. We present a new method for clustering paraphrases by word sense, and apply it to the Paraphrase Database (PPDB). We investigate the performance of hierarchical and spectral clustering algorithms, and systematically explore different ways of defining the similarity matrix that they use as input. Our method produces sense clusters that are qualitatively and quantitatively good, and that represent a substantial improvement to the PPDB resource.
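A minimal sketch of the clustering setup described above: given the PPDB paraphrases of an input word, build a pairwise similarity matrix and partition it with spectral clustering. The similarity values below are invented for illustration; the paper systematically compares several ways of defining them.

import numpy as np
from sklearn.cluster import SpectralClustering

paraphrases = ["insect", "beetle", "glitch", "defect", "microbe"]  # paraphrases of "bug"
similarity = np.array([
    [1.0, 0.9, 0.1, 0.1, 0.4],
    [0.9, 1.0, 0.1, 0.1, 0.3],
    [0.1, 0.1, 1.0, 0.8, 0.1],
    [0.1, 0.1, 0.8, 1.0, 0.1],
    [0.4, 0.3, 0.1, 0.1, 1.0],
])

# Treat the matrix as a precomputed affinity and cut it into sense clusters.
labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(similarity)
for cluster in sorted(set(labels)):
    print(cluster, [p for p, l in zip(paraphrases, labels) if l == cluster])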
BibTex
@inproceedings{Cocos-Callison-Burch:2016:NAACL,
author = {Anne Cocos and Chris Callison-Burch},
title = {Clustering Paraphrases by Word Sense},
booktitle = {The 2016 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2016)},
month = {June},
year = {2016},
address = {San Diego, California},
url = {http://www.cis.upenn.edu/~ccb/publications/clustering-paraphrases-by-word-sense.pdf}
}
|
End-to-End Statistical Machine Translation with Zero or Small Parallel Texts.
Ann Irvine and Chris Callison-Burch.
Journal of Natural Language Engineering-2016.
Abstract
We use bilingual lexicon induction techniques, which learn translations from monolingual texts in two languages, to build an end-to-end statistical machine translation (SMT) system without the use of any bilingual sentence-aligned parallel corpora. We present detailed analysis of the accuracy of bilingual lexicon induction, and show how a discriminative model can be used to combine various signals of translation equivalence (like contextual similarity, temporal similarity, orthographic similarity and topic similarity). Our discriminative model produces higher accuracy translations than previous bilingual lexicon induction techniques. We reuse these signals of translation equivalence as features on a phrase-based SMT system. These monolingually-estimated features enhance low resource SMT systems in addition to allowing end-to-end machine translation without parallel corpora.
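The discriminative combination described above can be sketched as a binary classifier over per-candidate monolingual signals, trained on a small seed lexicon. The feature values, seed pairs, and candidate words below are toy stand-ins, not the paper's data or implementation.

from sklearn.linear_model import LogisticRegression

# features: [contextual_sim, temporal_sim, orthographic_sim]
seed_features = [
    [0.8, 0.7, 0.9],   # correct translation pairs from the seed dictionary
    [0.7, 0.6, 0.2],
    [0.2, 0.3, 0.1],   # random (incorrect) pairs used as negatives
    [0.1, 0.2, 0.3],
]
seed_labels = [1, 1, 0, 0]
model = LogisticRegression().fit(seed_features, seed_labels)

# Rank candidate translations of a source word by the classifier's probability.
candidates = {"gato": [("cat", [0.9, 0.8, 0.1]), ("dog", [0.3, 0.4, 0.1])]}
for source, cands in candidates.items():
    ranked = sorted(cands, key=lambda c: -model.predict_proba([c[1]])[0, 1])
    print(source, "->", [word for word, _ in ranked])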
BibTex
@article{Irvine-Callison-Burch:2015:JNLE,
author = {Ann Irvine and Chris Callison-Burch},
title = {End-to-End Statistical Machine Translation with Zero or Small Parallel Texts},
journal = {Journal of Natural Language Engineering},
volume = {22},
issue = {4},
year = {2016},
url = {http://www.cis.upenn.edu/~ccb/publications/end-to-end-smt-with-zero-or-small-bitexts.pdf},
pages = {517-548}
}
|
A Comprehensive Analysis of Bilingual Lexicon Induction.
Ann Irvine and Chris Callison-Burch.
Computational Linguistics-2016.
Abstract
Bilingual lexicon induction is the task of inducing word translations from monolingual corpora in two languages. In this paper we present the most comprehensive analysis of bilingual lexicon induction to date. We present experiments on a wide range of languages and data sizes. We examine translation into English from 25 foreign languages -- Albanian, Azeri, Bengali, Bosnian, Bulgarian, Cebuano, Gujarati, Hindi, Hungarian, Indonesian, Latvian, Nepali, Romanian, Serbian, Slovak, Somali, Spanish, Swedish, Tamil, Telugu, Turkish, Ukrainian, Uzbek, Vietnamese and Welsh. We analyze the behavior of bilingual lexicon induction on low frequency words, rather than testing solely on high frequency words, as previous research has done. Low frequency words are more relevant to statistical machine translation, where systems typically lack translations of rare words that fall outside of their training data. We systematically explore a wide range of features and phenomena that affect the quality of the translations discovered by bilingual lexicon induction. We give illustrative examples of the highest ranking translations for orthogonal signals of translation equivalence like contextual similarity and temporal similarity. We analyze the effects of frequency and burstiness, and the sizes of the seed bilingual dictionaries and the monolingual training corpora. Additionally, we introduce a novel discriminative approach to bilingual lexicon induction. Our discriminative model is capable of combining a wide variety of features, which individually provide only weak indications of translation equivalence. When feature weights are discriminatively set, these signals produce dramatically higher translation quality than previous approaches that combined signals in an unsupervised fashion (e.g. using mean reciprocal rank). We also directly compare our model's performance against a sophisticated generative approach, the matching canonical correlation analysis (MCCA) algorithm used by Haghighi et al. (2008). Our algorithm achieves an accuracy of 42% versus MCCA's 15%.
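Two of the monolingual signals analysed above are easy to make concrete: temporal similarity (translations tend to have correlated frequency patterns over time in the two corpora) and orthographic similarity (normalized edit distance catches cognates). The frequency series and word pair below are invented for illustration.

import numpy as np

def temporal_similarity(freq_a, freq_b):
    a, b = np.asarray(freq_a, float), np.asarray(freq_b, float)
    a, b = a / a.sum(), b / b.sum()        # normalize to distributions over time
    return float(np.corrcoef(a, b)[0, 1])  # Pearson correlation of the time series

def orthographic_similarity(w1, w2):
    # 1 - edit_distance / max_len, via a small dynamic program
    d = np.zeros((len(w1) + 1, len(w2) + 1), int)
    d[:, 0] = np.arange(len(w1) + 1)
    d[0, :] = np.arange(len(w2) + 1)
    for i in range(1, len(w1) + 1):
        for j in range(1, len(w2) + 1):
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1,
                          d[i - 1, j - 1] + (w1[i - 1] != w2[j - 1]))
    return 1.0 - d[-1, -1] / max(len(w1), len(w2))

print(temporal_similarity([5, 40, 3, 2], [7, 35, 4, 1]))   # spiky at the same time
print(orthographic_similarity("parliament", "parlament"))  # cognate pair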
|
The Gun Violence Database: A new task and data set for NLP.
Ellie Pavlick, Heng Ji, Xiaoman Pan and Chris Callison-Burch.
EMNLP-2016.
Abstract
We argue that NLP researchers are especially well-positioned to contribute to the national discussion about gun violence. Reasoning about the causes and outcomes of gun violence is typically dominated by politics and emotion, and data-driven research on the topic is stymied by a shortage of data and a lack of federal funding. However, data abounds in the form of unstructured text from news articles across the country. This is an ideal application of NLP technologies, such as relation extraction, coreference resolution, and event detection. We introduce a new and growing dataset, the Gun Violence Database, in order to facilitate the adaptation of current NLP technologies to the domain of gun violence, thus enabling better social science research on this important and under-resourced problem.
BibTex
@inproceedings{Pavlick-EtAl:2016:EMNLP,
author = {Ellie Pavlick and Heng Ji and Xiaoman Pan and Chris Callison-Burch},
title = {The Gun Violence Database: A new task and data set for {NLP}},
booktitle = {Proceedings of The 2016 Conference on Empirical Methods on Natural Language Processing (EMNLP)},
month = {November},
year = {2016},
address = {Austin, TX},
url = {http://www.cis.upenn.edu/~ccb/publications/gun-violence-database.pdf}
}
|
PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification.
Ellie Pavlick, Pushpendre Rastogi, Juri Ganitkevitch, Ben Van Durme, and Chris Callison-Burch.
ACL-2015.
Abstract
We present a new release of the Paraphrase Database. PPDB 2.0 includes a discriminatively re-ranked set of paraphrases that achieve a higher correlation with human judgments than PPDB 1.0’s heuristic rankings. Each paraphrase pair in the database now also includes fine-grained entailment relations, word embedding similarities, and style annotations.
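A sketch of filtering a PPDB 2.0 release by the new re-ranked score and the entailment label. It assumes the plain-text layout of the distributed files (fields separated by " ||| ": syntactic label, phrase, paraphrase, feature=value pairs, alignment, entailment relation, with a PPDB2.0Score feature); consult the release's README for the authoritative format.

def read_ppdb(path, min_score=3.5, relation="Equivalence"):
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = [x.strip() for x in line.split("|||")]
            phrase, paraphrase = fields[1], fields[2]
            feats = dict(kv.split("=", 1) for kv in fields[3].split())
            entailment = fields[-1]
            score = float(feats.get("PPDB2.0Score", 0.0))
            if score >= min_score and entailment == relation:
                yield phrase, paraphrase, score

# Hypothetical usage (file name is a placeholder for a downloaded PPDB package):
# for phrase, para, score in read_ppdb("ppdb-2.0-s-lexical"):
#     print(phrase, "->", para, score)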
BibTex
@InProceedings{PavlickEtAl-2015:ACL:Semantics,
author = {Ellie Pavlick and Pushpendre Rastogi and Juri Ganitkevitch and Ben Van Durme and Chris Callison-Burch},
title = {{PPDB} 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification},
booktitle = {Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL 2015)},
month = {July},
year = {2015},
address = {Beijing, China},
publisher = {Association for Computational Linguistics},
}
|
Adding Semantics to Data-Driven Paraphrasing.
Ellie Pavlick, Johan Bos, Malvina Nissim, Charley Beller, Benjamin Van Durme, and Chris Callison-Burch.
ACL-2015.
Abstract
We add an interpretable semantics to the paraphrase database (PPDB). To date, the relationship between the phrase pairs in the database has been weakly defined as approximately equivalent. We show that in fact these pairs represent a variety of relations, including directed entailment (little girl/girl) and exclusion (nobody/someone). We automatically assign semantic entailment relations to entries in PPDB using features derived from past work on discovering inference rules from text and semantic taxonomy induction. We demonstrate that our model assigns these entailment relations with high accuracy. In a downstream RTE task, our labels rival relations from WordNet and improve the coverage of a proof-based RTE system by 17%.
Figures
BibTex
@inproceedings{Pavlick-EtAl:2015:ACL,
author = {Ellie Pavlick and Johan Bos and Malvina Nissim and Charley Beller and Benjamin Van Durme and Chris Callison-Burch},
title = {Adding Semantics to Data-Driven Paraphrasing},
booktitle = {The 53rd Annual Meeting of the Association for Computational
Linguistics (ACL 2015)},
month = {July},
year = {2015},
address = {Beijing, China},
url = {http://www.cis.upenn.edu/~ccb/publications/adding-semantics-to-data-driven-paraphrasing.pdf}
}
|
Domain-Specific Paraphrase Extraction.
Ellie Pavlick, Juri Ganitkevitch, Tsz Ping Chan, Xuchen Yao, Ben Van Durme, and Chris Callison-Burch.
ACL-2015.
Abstract
The validity of applying paraphrase rules depends on the domain of the text that they are being applied to. We develop a novel method for extracting domain-specific paraphrases. We adapt the bilingual pivoting paraphrase method to bias the training data to be more like our target domain of biology. Our best model results in higher precision while retaining complete recall, giving a 10% relative improvement in AUC.
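The domain biasing above can be sketched with a standard data-selection recipe: keep bitext sentence pairs whose English side looks more like the target domain (biology) than like general text, here scored with a simple unigram cross-entropy difference in the spirit of Moore-Lewis selection. This is an illustration of the idea, not necessarily the paper's exact procedure, and the corpora passed in are placeholders.

import math
from collections import Counter

def unigram_logprob(sentence, counts, total, vocab_size):
    # add-one-smoothed unigram log probability of the sentence
    return sum(math.log((counts[w] + 1) / (total + vocab_size))
               for w in sentence.split())

def select_in_domain(bitext, domain_corpus, general_corpus, keep_ratio=0.2):
    dom = Counter(w for s in domain_corpus for w in s.split())
    gen = Counter(w for s in general_corpus for w in s.split())
    vocab = len(set(dom) | set(gen))
    scored = []
    for src, eng in bitext:  # score the English side of each sentence pair
        score = (unigram_logprob(eng, dom, sum(dom.values()), vocab)
                 - unigram_logprob(eng, gen, sum(gen.values()), vocab))
        scored.append((score, src, eng))
    scored.sort(reverse=True)
    return [(s, e) for _, s, e in scored[:int(len(scored) * keep_ratio)]]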
BibTex
@InProceedings{PavlickEtAl-2015:ACL:Domain,
author = {Ellie Pavlick and Juri Ganitkevitch and Tsz Ping Chan and Xuchen Yao and Ben Van Durme and Chris Callison-Burch},
title = {Domain-Specific Paraphrase Extraction},
booktitle = {Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL 2015)},
month = {July},
year = {2015},
address = {Beijing, China},
publisher = {Association for Computational Linguistics},
}
|
Problems in Current Text Simplification Research: New Data Can Help.
Wei Xu, Chris Callison-Burch, and Courtney Napoles.
TACL-2015.
Abstract
Simple Wikipedia has dominated simplification research in the past 5 years. In this opinion paper, we argue that focusing on Wikipedia limits simplification research. We back up our arguments with corpus analysis and by highlighting statements that other researchers have made in the simplification literature. We introduce a new simplification dataset that is a significant improvement over Simple Wikipedia, and present a novel quantitative-comparative approach to study the quality of simplification data resources.
Figures
Table 1: Example sentence pairs (NORM-SIMP) aligned between English Wikipedia and Simple English Wikipedia. The breakdown in percentages is obtained through manual examination of 200 randomly sampled sentence pairs in the Parallel Wikipedia Simplification (PWKP) corpus.
Table 2: The vocabulary size of the Parallel Wikipedia Simplification (PWKP) corpus and the vocabulary difference between its normal and simple sides (as a 2×2 matrix). Only words consisting of the 26 English letters are counted.
Table 3: Example of sentences written at multiple levels of text complexity from the Newsela data set. The Lexile readability score and grade level apply to the whole article rather than individual sentences, so the same sentences may receive different scores, e.g. the above sentences for the 6th and 7th grades. The bold font highlights the parts of sentence that are different from the adjacent version(s).
Figure 1: Manual classification of aligned sentence pairs from the Newsela corpus. We categorize 50 randomly sampled sentence pairs drawn from the Original-Simp2 alignment and 50 from the Original-Simp4 alignment.
Table 4: Basic statistics of the Newsela Simplification corpus vs. the Parallel Wikipedia Simplification (PWKP) corpus. The Newsela corpus consists of 1130 articles, each with an original and 4 simplified versions. Simp-1 is the least simplified level, while Simp-4 is the most simplified. The numbers marked by * are slightly different from previously reported, because of the use of different tokenizers.
Table 5: This table shows the vocabulary changes between different levels of simplification in the Newsela corpus (as a 5×5 matrix). Each cell shows the number of unique word types that appear in the corpus listed in the column but do not appear in the corpus listed in the row. We also list the average frequency of those vocabulary items. For example, in the cell marked *, the Simp-4 version contains 583 unique words that do not appear in the Original version. By comparing the cells marked **, we see about half of the words (19,197 out of 39,046) in the Original version are not in the Simp-4 version. Most of the vocabulary that is removed consists of low-frequency words (with an average frequency of 2.6 in the Original).
Table 6: Top 50 tokens associated with the complex text, computed using the Monroe et al. (2008) method. Bold words are shared by the complex version of Newsela and the complex version of Wikipedia.
Table 7: Top 50 tokens associated with the simplified text.
Table 8: Frequency of example words from Table 6. These complex words are reduced at a much greater rate in the simplified Newsela than they are in the Simple English Wikipedia. A smaller odds ratio indicates greater reduction.
Table 9: Top 30 syntax patterns associated with the complex text (left) and simplified text (right). Bold patterns are the top patterns shared by Newsela and Wikipedia.
Figure 2: Distribution of document-level compression ratio, displayed as a histogram smoothed by kernel density estimation. The Newsela corpus is more normally distributed, suggesting more consistent quality.
Figure 3: A radar chart that visualizes the odds ratio (radius axis) of discourse connectives in simple side vs. complex side. An odds ratio larger than 1 indicates the word is more likely to occur in the simplified text than in the complex text, and vice versa. Simple cue words (in the shaded region), except “hence”, are more likely to be added during Newsela’s simplification process than in Wikipedia’s. Complex conjunction connectives (in the unshaded region) are more likely to be retained in Wikipedia’s simplifications than in Newsela’s.
BibTex
@article{Xu-EtAl:2015:TACL,
author = {Wei Xu and Chris Callison-Burch and Courtney Napoles},
title = {Problems in Current Text Simplification Research: New Data Can
Help},
journal = {Transactions of the Association for Computational Linguistics},
volume = {3},
year = {2015},
url = {http://www.cis.upenn.edu/~ccb/publications/new-data-for-text-simplification.pdf},
pages = {283--297}
}
|
Translations of the CALLHOME Egyptian Arabic corpus for conversational speech translation.
Gaurav Kumar, Yuan Cao, Ryan Cotterell, Chris Callison-Burch, Daniel Povey, and Sanjeev Khudanpur.
IWSLT-2014.
Abstract
Translation of the output of automatic speech recognition (ASR) systems, also known as speech translation, has received a lot of research interest recently. This is especially true for programs such as DARPA BOLT which focus on improving spontaneous human-human conversation across languages. However, this research is hindered by the dearth of datasets developed for this explicit purpose. For Egyptian Arabic-English, in particular, no parallel speech-transcription-translation dataset exists in the same domain. In order to support research in speech translation, we introduce the Callhome Egyptian Arabic-English Speech Translation Corpus. This supplements the existing LDC corpus with four reference translations for each utterance in the transcripts. The result is a three-way parallel dataset of Egyptian Arabic Speech, transcriptions and English translations.
Figures
BibTex
@InProceedings{kumar-EtAl:2014:IWSLT,
author = {Gaurav Kumar and Yuan Cao and Ryan Cotterell and Chris Callison-Burch and Daniel Povey and Sanjeev Khudanpur},
title = {Translations of the {CALLHOME} {Egyptian} {Arabic} corpus for conversational speech translation},
booktitle = {Proceedings of the International Workshop on Spoken Language Translation (IWSLT)},
month = {December},
year = {2014},
address = {Lake Tahoe, USA},
publisher = {Association for Computational Linguistics},
url = {http://cis.upenn.edu/~ccb/publications/callhome-egyptian-arabic-speech-translations.pdf}
}
|
The Language Demographics of Amazon Mechanical Turk.
Ellie Pavlick, Matt Post, Ann Irvine, Dmitry Kachaev, and Chris Callison-Burch.
TACL-2014.
Abstract
We present a large scale study of the languages spoken by bilingual workers on Mechanical Turk (MTurk). We establish a methodology for determining the language skills of anonymous crowd workers that is more robust than simple surveying. We validate workers' self-reported language skill claims by measuring their ability to correctly translate words, and by geolocating workers to see if they reside in countries where the languages are likely to be spoken. Rather than posting a one-off survey, we posted paid tasks consisting of 1,000 assignments to translate a total of 10,000 words in each of 100 languages. Our study ran for several months, and was highly visible on the MTurk crowdsourcing platform, increasing the chances that bilingual workers would complete it. Our study was useful both to create bilingual dictionaries and to act as a census of the bilingual speakers on MTurk. We use this data to recommend languages with the largest speaker populations as good candidates for other researchers who want to develop crowdsourced, multilingual technologies. To further demonstrate the value of creating data via crowdsourcing, we hire workers to create bilingual parallel corpora in six Indian languages, and use them to train statistical machine translation systems.
Figures
BibTex
@article{Pavlick-EtAl-2014:TACL,
author = {Ellie Pavlick and Matt Post and Ann Irvine and Dmitry Kachaev and Chris Callison-Burch},
title = {The Language Demographics of {Amazon Mechanical Turk}},
journal = {Transactions of the Association for Computational Linguistics},
volume = {2},
number = {Feb},
year = {2014},
pages = {79--92},
publisher = {Association for Computational Linguistics},
url = {http://cis.upenn.edu/~ccb/publications/language-demographics-of-mechanical-turk.pdf}
}
|
Crowd-Workers: Aggregating Information Across Turkers To Help Them Find Higher Paying Work.
Chris Callison-Burch.
HCOMP Poster-2014.
Abstract
The Mechanical Turk crowdsourcing platform currently fails to provide the most basic piece of information to enable workers to make informed decisions about which tasks to undertake: what is the expected hourly pay? Mechanical Turk advertises a reward amount per assignment, but does not give any indication of how long each assignment will take. We have developed a browser plugin that tracks the length of time it takes to complete a task, and a web service that aggregates the information across many workers. Crowd-Workers.com allows workers to discover higher-paying work by sorting tasks by estimated hourly rate.
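A back-of-the-envelope version of the plugin's core computation: aggregate the completion times reported by many workers for a task and convert the posted reward into an estimated hourly rate. The numbers below are invented.

from statistics import median

def estimated_hourly_rate(reward_usd, completion_times_seconds):
    typical_seconds = median(completion_times_seconds)
    return reward_usd * 3600.0 / typical_seconds

# e.g. a $0.05 task that typically takes about 40 seconds:
print(estimated_hourly_rate(0.05, [35, 40, 52, 38, 41]))   # ~4.50 USD/hour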
Figures
BibTex
@InProceedings{Chen-et-al:HCOMP:2014,
author = {Chris Callison-Burch},
title = {Crowd-Workers: Aggregating Information Across Turkers To Help Them Find Higher Paying Work},
booktitle = {The Second AAAI Conference on Human Computation and Crowdsourcing (HCOMP-2014)},
month = {November},
year = {2014},
url = {http://cis.upenn.edu/~ccb/publications/crowd-workers.pdf}
}
|
Joshua 5.0: Sparser, better, faster, server.
Matt Post, Juri Ganitkevitch, Luke Orland, Jonathan Weese, Yuan Cao, and Chris Callison-Burch.
WMT-2013.
Abstract
We describe improvements made over the past year to Joshua, an open-source translation system for parsing-based machine translation. The main contributions this past year are significant improvements in both speed and usability of the grammar extraction and decoding steps. We have also rewritten the decoder to use a sparse feature representation, enabling training of large numbers of features with discriminative training methods.
Figures
BibTex
@InProceedings{post-EtAl:2013:WMT,
author = {Post, Matt and Ganitkevitch, Juri and Orland, Luke and Weese, Jonathan and Cao, Yuan and Callison-Burch, Chris},
title = {Joshua 5.0: Sparser, Better, Faster, Server},
booktitle = {Proceedings of the Eighth Workshop on Statistical Machine Translation},
month = {August},
year = {2013},
address = {Sofia, Bulgaria},
publisher = {Association for Computational Linguistics},
pages = {206--212},
url = {http://www.aclweb.org/anthology/W13-2226}
}
|
PPDB: The Paraphrase Database.
Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch.
NAACL-2013.
Abstract
We present the 1.0 release of our paraphrase database, PPDB. Its English portion, PPDB:Eng, contains over 220 million paraphrase pairs, consisting of 73 million phrasal and 8 million lexical paraphrases, as well as 140 million paraphrase patterns, which capture many meaning-preserving syntactic transformations. The paraphrases are extracted from bilingual parallel corpora totaling over 100 million sentence pairs and over 2 billion English words. We also release PPDB:Spa, a collection of 196 million Spanish paraphrases. Each paraphrase pair in PPDB contains a set of associated scores, including paraphrase probabilities derived from the bitext data and a variety of monolingual distributional similarity scores computed from the Google n-grams and the Annotated Gigaword corpus. Our release includes pruning tools that allow users to determine their own precision/recall tradeoff.
Figures
BibTex
@InProceedings{ganitkevitch-EtAl:2013:NAACL,
author = {Juri Ganitkevitch and Benjamin Van Durme and Chris Callison-Burch},
title = {{PPDB}: The Paraphrase Database},
booktitle = {Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2013)},
month = {June},
year = {2013},
address = {Atlanta, Georgia},
publisher = {Association for Computational Linguistics},
url = {http://cis.upenn.edu/~ccb/publications/ppdb.pdf}
}
|
Supervised Bilingual Lexicon Induction with Multiple Monolingual Signals.
Ann Irvine and Chris Callison-Burch.
NAACL-2013.
Abstract
Prior research into learning translations from source and target language monolingual texts has treated the task as an unsupervised learning problem. Although many techniques take advantage of a seed bilingual lexicon, this work is the first to use that data for supervised learning to combine a diverse set of signals derived from a pair of monolingual corpora into a single discriminative model. Even in a low resource machine translation setting, where induced translations have the potential to improve performance substantially, it is reasonable to assume access to some amount of data to perform this kind of optimization. Our work shows that only a few hundred translation pairs are needed to achieve strong performance on the bilingual lexicon induction task, and our approach yields an average relative gain in accuracy of nearly 50% over an unsupervised baseline. Large gains in accuracy hold for all 22 languages (low and high resource) that we investigate.
Figures
BibTex
@InProceedings{irvine-callisonburch:2013:NAACL,
author = {Ann Irvine and Chris Callison-Burch},
title = {Supervised Bilingual Lexicon Induction with Multiple Monolingual Signals},
booktitle = {Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2013)},
month = {June},
year = {2013},
address = {Atlanta, Georgia},
publisher = {Association for Computational Linguistics},
url = {http://cis.upenn.edu/~ccb/publications/supervised-bilingual-lexicon-induction.pdf}
}
|
Improved Speech-to-Text Translation with the Fisher and Callhome Spanish–English Speech Translation Corpus.
Matt Post, Gaurav Kumar, Adam Lopez, Damianos Karakos, Chris Callison-Burch and Sanjeev Khudanpur.
IWSLT-2013.
Abstract
Research into the translation of the output of automatic speech recognition (ASR) systems is hindered by the dearth of datasets developed for that explicit purpose. For Spanish-English translation, in particular, most parallel data available exists only in vastly different domains and registers. In order to support research on cross-lingual speech applications, we introduce the Fisher and Callhome Spanish-English Speech Translation Corpus, supplementing existing LDC audio and transcripts with (a) ASR 1-best, lattice, and oracle output produced by the Kaldi recognition system and (b) English translations obtained on Amazon’s Mechanical Turk. The result is a four-way parallel dataset of Spanish audio, transcriptions, ASR lattices, and English translations of approximately 38 hours of speech, with defined training, development, and held-out test sets. We conduct baseline machine translation experiments using models trained on the provided training data, and validate the dataset by corroborating a number of known results in the field, including the utility of in-domain (informal, conversational) training data, increased performance translating lattices (instead of recognizer 1-best output), and the relationship between word error rate and BLEU score.
Figures
BibTex
@InProceedings{post-EtAl:2013:IWSLT,
author = {Matt Post and Gaurav Kumar and Adam Lopez and Damianos Karakos and Chris Callison-Burch and Sanjeev Khudanpur},
title = {Improved Speech-to-Text Translation with the Fisher and Callhome Spanish–English Speech Translation Corpus},
booktitle = {Proceedings of the International Workshop on Spoken Language Translation (IWSLT)},
month = {December},
year = {2013},
address = {Heidelberg, Germany},
publisher = {Association for Computational Linguistics},
url = {http://cis.upenn.edu/~ccb/publications/improved-speech-to-speech-translation.pdf}
}
|
Joshua 4.0: Packing, PRO, and Paraphrases.
Juri Ganitkevitch, Yuan Cao, Jonathan Weese, Matt Post, and Chris Callison-Burch.
WMT-2012.
Abstract
We present Joshua 4.0, the newest version of our open-source decoder for parsing-based statistical machine translation. The main contributions in this release are the introduction of a compact grammar representation based on packed tries, and the integration of our implementation of pairwise ranking optimization, J-PRO. We further present the extension of the Thrax SCFG grammar extractor to pivot-based extraction of syntactically informed sentential paraphrases.
Figures
BibTex
@InProceedings{ganitkevitch-EtAl:2012:WMT,
author = {Ganitkevitch, Juri and Cao, Yuan and Weese, Jonathan and Post, Matt and Callison-Burch, Chris},
title = {Joshua 4.0: Packing, PRO, and Paraphrases},
booktitle = {Proceedings of the Seventh Workshop on Statistical Machine Translation},
month = {June},
year = {2012},
address = {Montr{\'e}al, Canada},
publisher = {Association for Computational Linguistics},
pages = {283--291},
url = {http://cis.upenn.edu/~ccb/publications/joshua-4.0.pdf}
}
|
Toward Statistical Machine Translation without Parallel Corpora.
Alex Klementiev, Ann Irvine, Chris Callison-Burch, and David Yarowsky.
EACL-2012.
Abstract
We estimate the parameters of a phrase-based statistical machine translation system from monolingual corpora instead of a bilingual parallel corpus. We extend existing research on bilingual lexicon induction to estimate both lexical and phrasal translation probabilities for MT-scale phrase-tables. We propose a novel algorithm to estimate re-ordering probabilities from monolingual data. We report translation results for an end-to-end translation system using these monolingual features alone. Our method only requires monolingual corpora in source and target languages, a small bilingual dictionary, and a small bitext for tuning feature weights. In this paper, we examine an idealization where a phrase-table is given. We examine the degradation in translation performance when bilingually estimated translation probabilities are removed, and show that 82%+ of the loss can be recovered with monolingually estimated features alone. We further show that our monolingual features add 1.5 BLEU points when combined with standard bilingually estimated phrase table features.
Figures
BibTex
@InProceedings{klementiev-etal:2012:EACL,
author = {Alex Klementiev and Ann Irvine and Chris Callison-Burch and David Yarowsky},
title = {Toward Statistical Machine Translation without Parallel Corpora},
booktitle = {Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics},
month = {April},
year = {2012},
address = {Avignon, France},
publisher = {Association for Computational Linguistics},
}
|
Machine Translation of Arabic Dialects.
Rabih Zbib, Erika Malchiodi, Jacob Devlin, David Stallard, Spyros Matsoukas, Richard Schwartz, John Makhoul, Omar F. Zaidan and Chris Callison-Burch.
NAACL-2012.
Abstract
Arabic dialects present many challenges for machine translation, not least of which is the lack of data resources. We use crowdsourcing to cheaply and quickly build Levantine-English and Egyptian-English parallel corpora, consisting of 1.1M words and 380k words, respectively. The dialect sentences are selected from a large corpus of Arabic web text, and translated using Mechanical Turk. We use this data to build Dialect Arabic MT systems. Small amounts of dialect data have a dramatic impact on the quality of translation. When translating Egyptian and Levantine test sets, our Dialect Arabic MT system performs 5.8 and 6.8 BLEU points higher than a Modern Standard Arabic MT system trained on a 150 million word Arabic-English parallel corpus -- over 100 times the amount of data as our dialect corpora.
Figures
BibTex
@InProceedings{Zbib-etal:2012:NAACL,
author = {Rabih Zbib and Erika Malchiodi and Jacob Devlin and David Stallard and Spyros Matsoukas and Richard Schwartz and John Makhoul and Omar F. Zaidan and Chris Callison-Burch},
title = {Machine Translation of Arabic Dialects},
booktitle = {The 2012 Conference of the North American Chapter of the Association for Computational Linguistics},
month = {June},
year = {2012},
address = {Montreal},
publisher = {Association for Computational Linguistics},
url = {http://cis.upenn.edu/~ccb/publications/machine-translation-of-arabic-dialects.pdf}
}
|
Constructing Parallel Corpora for Six Indian Languages via Crowdsourcing.
Matt Post, Chris Callison-Burch, and Miles Osborne.
WMT-2012.
Abstract
Recent work has established the efficacy of Amazon's Mechanical Turk for constructing parallel corpora for machine translation research. We apply this to building a collection of parallel corpora between English and six languages from the Indian subcontinent: Bengali, Hindi, Malayalam, Tamil, Telugu, and Urdu. These languages are low-resource, under-studied, and exhibit linguistic phenomena that are difficult for machine translation. We conduct a variety of baseline experiments and analysis, and release the data to the community.
Figures
BibTex
@InProceedings{post-callisonburch-osborne:2012:WMT,
author = {Post, Matt and Callison-Burch, Chris and Osborne, Miles},
title = {Constructing Parallel Corpora for Six Indian Languages via Crowdsourcing},
booktitle = {Proceedings of the Seventh Workshop on Statistical Machine Translation},
month = {June},
year = {2012},
address = {Montr{\'e}al, Canada},
publisher = {Association for Computational Linguistics},
pages = {401--409},
url = {http://www.aclweb.org/anthology/W12-3152}
}
|
Joshua 3.0: Syntax-based Machine Translation with the Thrax Grammar Extractor.
Jonathan Weese, Juri Ganitkevitch, Chris Callison-Burch, Matt Post and Adam Lopez.
WMT-2011.
Abstract
We present progress on Joshua, an open source decoder for hierarchical and syntax-based machine translation. The main focus is describing Thrax, a flexible, open source synchronous context-free grammar extractor. Thrax extracts both hierarchical (Chiang, 2007) and syntax-augmented machine translation (Zollmann and Venugopal, 2006) grammars. It is built on Apache Hadoop for efficient distributed performance, and can easily be extended with support for new grammars, feature functions, and output formats.
Figures
BibTex
@InProceedings{weese-EtAl:2011:WMT,
author = {Weese, Jonathan and Ganitkevitch, Juri and Callison-Burch, Chris and Post, Matt and Lopez, Adam},
title = {Joshua 3.0: Syntax-based Machine Translation with the Thrax Grammar Extractor},
booktitle = {Proceedings of the Sixth Workshop on Statistical Machine Translation},
month = {July},
year = {2011},
address = {Edinburgh, Scotland},
publisher = {Association for Computational Linguistics},
pages = {478--484},
url = {http://www.aclweb.org/anthology/W11-2160}
}
|
Learning Sentential Paraphrases from Bilingual Parallel Corpora for Text-to-Text Generation.
Juri Ganitkevitch, Chris Callison-Burch, Courtney Napoles, and Benjamin Van Durme.
EMNLP-2011.
Abstract
Previous work has shown that high quality phrasal paraphrases can be extracted from bilingual parallel corpora. However, it is not clear whether bitexts are an appropriate resource for extracting more sophisticated sentential paraphrases, which are more obviously learnable from monolingual parallel corpora. We extend bilingual paraphrase extraction to syntactic paraphrases and demonstrate its ability to learn a variety of general paraphrastic transformations, including passivization, dative shift, and topicalization. We discuss how our model can be adapted to many text generation tasks by augmenting its feature set, development data, and parameter estimation routine. We illustrate this adaptation by using our paraphrase model for the task of sentence compression and achieve results competitive with state-of-the-art compression systems.
Figures
BibTex
@InProceedings{ganitkevitch-EtAl:2011:EMNLP,
author = {Ganitkevitch, Juri and Callison-Burch, Chris and Napoles, Courtney and {Van Durme}, Benjamin},
title = {Learning Sentential Paraphrases from Bilingual Parallel Corpora for Text-to-Text Generation},
booktitle = {Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing},
month = {July},
year = {2011},
address = {Edinburgh, Scotland, UK.},
publisher = {Association for Computational Linguistics},
pages = {1168--1179},
url = {http://www.aclweb.org/anthology/D11-1108}
}
|
Crowdsourcing Translation: Professional Quality from Non-Professionals.
Omar Zaidan and Chris Callison-Burch.
ACL-2011.
Abstract
Naively collecting translations by crowdsourcing the task to non-professional translators yields disfluent, low-quality results if no quality control is exercised. We demonstrate a variety of mechanisms that increase the translation quality to near professional levels. Specifically, we solicit redundant translations and edits to them, and automatically select the best output among them. We propose a set of features that model both the translations and the translators, such as country of residence, LM perplexity of the translation, edit rate from the other translations, and (optionally) calibration against professional translators. Using these features to score the collected translations, we are able to discriminate between acceptable and unacceptable translations. We recreate the NIST 2009 Urdu-to-English evaluation set with Mechanical Turk, and quantitatively show that our models are able to select translations within the range of quality that we expect from professional translators. The total cost is more than an order of magnitude lower than professional translation.
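A minimal sketch of the selection step described above: several Turkers translate the same sentence, each candidate is scored with a handful of features (the paper uses sentence- and worker-level features such as LM perplexity, edit rate against the other candidates, and worker calibration), and the highest-scoring candidate is kept. The feature names, values, and weights below are toy stand-ins; the paper learns the weights from data.

WEIGHTS = {"neg_lm_perplexity": 0.5, "agreement_with_others": 0.3,
           "worker_calibration": 0.2}

def score(features):
    return sum(WEIGHTS[name] * value for name, value in features.items())

def pick_best(candidates):
    # candidates: list of (translation, feature dict) pairs for one source sentence
    return max(candidates, key=lambda c: score(c[1]))[0]

candidates = [
    ("the weather is nice today",  {"neg_lm_perplexity": -2.1,
                                    "agreement_with_others": 0.8,
                                    "worker_calibration": 0.9}),
    ("today weather very nice is", {"neg_lm_perplexity": -5.4,
                                    "agreement_with_others": 0.2,
                                    "worker_calibration": 0.4}),
]
print(pick_best(candidates))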
Figures
BibTex
@InProceedings{zaidan-callisonburch:2011:ACL-HLT2011,
author = {Zaidan, Omar F. and Callison-Burch, Chris},
title = {Crowdsourcing Translation: Professional Quality from Non-Professionals},
booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
month = {June},
year = {2011},
address = {Portland, Oregon, USA},
publisher = {Association for Computational Linguistics},
pages = {1220--1229},
url = {http://www.aclweb.org/anthology/P11-1122}
}
|
The Arabic Online Commentary Dataset: An Annotated Dataset of Informal Arabic with High Dialectal Content.
Omar Zaidan and Chris Callison-Burch.
ACL-2011.
Abstract
The written form of Arabic, Modern Standard Arabic (MSA), differs quite a bit from the spoken dialects of Arabic, which are the true “native” languages of Arabic speakers used in daily life. However, due to MSA’s prevalence in written form, almost all Arabic datasets have predominantly MSA content. We present the Arabic Online Commentary Dataset, a 52M-word monolingual dataset rich in dialectal content, and we describe our long-term annotation effort to identify the dialect level (and dialect itself) in each sentence of the dataset. So far, we have labeled 108K sentences, 41% of which have dialectal content. We also present experimental results on the task of automatic dialect identification, using the collected labels for training and evaluation.
Figures
BibTex
@InProceedings{zaidan-callisonburch:2011:ACL-HLT2011-AOC,
author = {Zaidan, Omar F. and Callison-Burch, Chris},
title = {The Arabic Online Commentary Dataset: an Annotated Dataset of Informal Arabic with High Dialectal Content},
booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
month = {June},
year = {2011},
address = {Portland, Oregon, USA},
publisher = {Association for Computational Linguistics},
pages = {37--41},
url = {http://www.aclweb.org/anthology/P11-2007}
}
|
Joshua 2.0: A Toolkit for Parsing-Based Machine Translation with Syntax, Semirings, Discriminative Training and Other Goodies.
Zhifei Li, Chris Callison-Burch, Chris Dyer, Juri Ganitkevitch, Ann Irvine, Sanjeev Khudanpur, Lane Schwartz, Wren N. G. Thornton, Ziyuan Wang, Jonathan Weese and Omar F. Zaidan.
WMT-2010.
Abstract
We describe the progress we have made in the past year on Joshua (Li et al., 2009a), an open source toolkit for parsing-based machine translation. The new functionality includes: support for translation grammars with a rich set of syntactic nonterminals, the ability for external modules to posit constraints on how spans in the input sentence should be translated, lattice parsing for dealing with input uncertainty, a semiring framework that provides a unified way of doing various dynamic programming calculations, variational decoding for approximating the intractable MAP decoding, hypergraph-based discriminative training for better feature engineering, a parallelized MERT module, document-level and tail-based MERT, visualization of the derivation trees, and a cleaner pipeline for MT experiments.
BibTex
@InProceedings{li-EtAl:2010:WMT,
author = {Li, Zhifei and Callison-Burch, Chris and Dyer, Chris and Ganitkevitch, Juri and Irvine, Ann and Khudanpur, Sanjeev and Schwartz, Lane and Thornton, Wren and Wang, Ziyuan and Weese, Jonathan and Zaidan, Omar},
title = {Joshua 2.0: A Toolkit for Parsing-Based Machine Translation with Syntax, Semirings, Discriminative Training and Other Goodies},
booktitle = {Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR},
month = {July},
year = {2010},
address = {Uppsala, Sweden},
publisher = {Association for Computational Linguistics},
pages = {133--137},
url = {http://www.aclweb.org/anthology/W10-1718}
}
|
Semantically-Informed Syntactic Machine Translation: A Tree-Grafting Approach.
Kathryn Baker, Michael Bloodgood, Chris Callison-Burch, Bonnie Dorr, Scott Miller, Christine Piatko, Nathaniel W. Filardo, and Lori Levin.
AMTA-2010.
Abstract
Figures
BibTex
|
Creating Speech and Language Data With Amazon’s Mechanical Turk.
Chris Callison-Burch and Mark Dredze.
NAACL Workshop on Creating Speech and Language Data With Amazon’s Mechanical Turk-2010.
Abstract
In this paper we give an introduction to using Amazon's Mechanical Turk crowdsourcing platform for the purpose of collecting data for human language technologies. We survey the papers published in the NAACL-2010 Workshop. Twenty-four researchers participated in the workshop's $100 challenge to create data for speech and language applications.
Figures
BibTex
@InProceedings{callisonburch-dredze:2010:MTURK,
author = {Callison-Burch, Chris and Dredze, Mark},
title = {Creating Speech and Language Data With {Amazon's Mechanical Turk}},
booktitle = {Proceedings of the {NAACL HLT} 2010 Workshop on Creating Speech and Language Data with {Amazon's Mechanical Turk}},
month = {June},
year = {2010},
address = {Los Angeles},
publisher = {Association for Computational Linguistics},
pages = {1--12},
url = {http://www.aclweb.org/anthology/W10-0701}
}
|
Findings of the 2010 Joint Workshop on Statistical Machine Translation and Metrics for Machine Translation.
Chris Callison-Burch, Philipp Koehn, Christof Monz, Kay Peterson, Mark Przybocki, Omar Zaidan.
WMT-2010.
Abstract
This paper presents the results of the WMT10 and MetricsMATR10 shared tasks, which included a translation task, a system combination task, and an evaluation task. We conducted a large-scale manual evaluation of 104 machine translation systems and 41 system combination entries. We used the ranking of these systems to measure how strongly automatic metrics correlate with human judgments of translation quality for 26 metrics. This year we also investigated increasing the number of human judgments by hiring non-expert annotators through Amazon’s Mechanical Turk.
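The metric evaluation described above boils down to a system-level rank correlation: score every system with the automatic metric, rank the same systems by the human judgments, and compare the two rankings (the WMT evaluations of this era used Spearman's rank correlation for this). The scores below are invented for illustration.

from scipy.stats import spearmanr

human_scores  = [0.65, 0.61, 0.55, 0.40, 0.38]   # e.g. fraction of wins in manual ranking
metric_scores = [23.1, 24.0, 21.5, 18.2, 19.0]   # e.g. BLEU for the same systems

rho, _ = spearmanr(human_scores, metric_scores)
print(f"system-level Spearman correlation: {rho:.2f}")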
Figures
BibTex
@InProceedings{callisonburch-EtAl:2010:WMT,
author = {Callison-Burch, Chris and Koehn, Philipp and Monz, Christof and Peterson, Kay and Przybocki, Mark and Zaidan, Omar},
title = {Findings of the 2010 Joint Workshop on Statistical Machine Translation and Metrics for Machine Translation},
booktitle = {Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR},
month = {July},
year = {2010},
address = {Uppsala, Sweden},
publisher = {Association for Computational Linguistics},
pages = {17--53},
url = {http://www.aclweb.org/anthology/W10-1703}
}
|
Joshua: An Open Source Toolkit for Parsing-based Machine Translation.
Zhifei Li, Chris Callison-Burch, Chris Dyer, Juri Ganitkevitch, Sanjeev Khudanpur, Lane Schwartz, Wren Thornton, Jonathan Weese and Omar Zaidan.
WMT-2009.
Abstract
Figures
BibTex
|
Findings of the 2009 Workshop on Statistical Machine Translation.
Chris Callison-Burch, Philipp Koehn, Christof Monz and Josh Schroeder.
WMT-2009.
Abstract
This paper presents the results of the WMT09 shared tasks, which included a translation task, a system combination task, and an evaluation task. We conducted a large-scale manual evaluation of 87 machine translation systems and 22 system combination entries. We used the ranking of these systems to measure how strongly automatic metrics correlate with human judgments of translation quality, for more than 20 metrics. We present a new evaluation technique whereby system output is edited and judged for correctness.
Figures
BibTex
@InProceedings{callisonburch-EtAl:2009:WMT,
author = {Callison-Burch, Chris and Koehn, Philipp and Monz, Christof and Schroeder, Josh},
title = {Findings of the 2009 {W}orkshop on {S}tatistical {M}achine {T}ranslation},
booktitle = {Proceedings of the Fourth Workshop on Statistical Machine Translation},
month = {March},
year = {2009},
address = {Athens, Greece},
publisher = {Association for Computational Linguistics},
pages = {1--28},
url = {http://www.aclweb.org/anthology/W09-0401}
}
|
Moses: Open source toolkit for statistical machine translation.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst.
ACL-2007.
Abstract
Figures
BibTex
|
Paraphrasing and Translation.
Chris Callison-Burch.
PhD Thesis, University of Edinburgh-2007.
Abstract
Paraphrasing and translation have previously been treated as unconnected natural language processing tasks. Whereas translation represents the preservation of meaning when an idea is rendered in the words of a different language, paraphrasing represents the preservation of meaning when an idea is expressed using different words in the same language. We show that the two are intimately related. The major contributions of this thesis are as follows: We define a novel technique for automatically generating paraphrases using bilingual parallel corpora, which are more commonly used as training data for statistical models of translation. We show that paraphrases can be used to improve the quality of statistical machine translation by addressing the problem of coverage and introducing a degree of generalization into the models. We explore the topic of automatic evaluation of translation quality, and show that the current standard evaluation methodology cannot be guaranteed to correlate with human judgments of translation quality.
Whereas previous data-driven approaches to paraphrasing were dependent upon either data sources which were uncommon, such as multiple translations of the same source text, or language-specific resources such as parsers, our approach is able to harness more widely available parallel corpora and can be applied to any language which has a parallel corpus. The technique was evaluated by replacing phrases with their paraphrases, and asking judges whether the meaning of the original phrase was retained and whether the resulting sentence remained grammatical. Paraphrases extracted from a parallel corpus with manual alignments are judged to be accurate (both meaningful and grammatical) 75% of the time, retaining the meaning of the original phrase 85% of the time. Using automatic alignments, meaning can be retained at a rate of 70%.
Being a language-independent and probabilistic approach allows our method to be easily integrated into statistical machine translation. A paraphrase model derived from parallel corpora other than the one used to train the translation model can be used to increase the coverage of statistical machine translation by adding translations of previously unseen words and phrases. If the translation of a word was not learned, but a translation of a synonymous word has been learned, then the word is paraphrased and its paraphrase is translated. Phrases can be treated similarly. Results show that augmenting a state-of-the-art SMT system with paraphrases in this way leads to significantly improved coverage and translation quality. For a training corpus with 10,000 sentence pairs, we increase the coverage of unique test set unigrams from 48% to 90%, with more than half of the newly covered items accurately translated, as opposed to none in current approaches.
Figures
BibTex
@PhdThesis{callisonburch:2007:thesis,
author = {Chris Callison-Burch},
title = {Paraphrasing and Translation},
school = {University of Edinburgh},
address = {Edinburgh, Scotland},
year = {2007},
url = {http://cis.upenn.edu/~ccb/publications/callison-burch-thesis.pdf}
}
|
Paraphrasing with Bilingual Parallel Corpora.
Colin Bannard and Chris Callison-Burch.
ACL-2005.
Abstract
Previous work has used monolingual parallel corpora to extract and generate paraphrases. We show that this task can be done using bilingual parallel corpora, a much more commonly available resource. Using alignment techniques from phrase-based statistical machine translation, we show how paraphrases in one language can be identified using a phrase in another language as a pivot. We define a paraphrase probability that allows paraphrases extracted from a bilingual parallel corpus to be ranked using translation probabilities, and show how it can be refined to take contextual information into account. We evaluate our paraphrase extraction and ranking methods using a set of manual word alignments, and contrast the quality with paraphrases extracted from automatic alignments.
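The paraphrase probability defined in the paper marginalizes over foreign pivot phrases f: p(e2 | e1) = sum over f of p(f | e1) * p(e2 | f), where both conditionals come from a phrase-based translation model. A direct sketch of that computation; the phrase-table probabilities below are invented for illustration.

# p(f | e1): foreign pivot phrases for an English phrase, and
# p(e | f): English phrases for each foreign pivot.
p_f_given_e = {"under control": {"unter kontrolle": 0.7, "in den griff": 0.3}}
p_e_given_f = {"unter kontrolle": {"under control": 0.6, "in check": 0.4},
               "in den griff":    {"under control": 0.5, "in hand": 0.5}}

def paraphrase_prob(e1, e2):
    # p(e2 | e1) = sum_f p(f | e1) * p(e2 | f)
    return sum(p_pivot * p_e_given_f.get(f, {}).get(e2, 0.0)
               for f, p_pivot in p_f_given_e.get(e1, {}).items())

print(paraphrase_prob("under control", "in check"))   # 0.7 * 0.4 = 0.28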
Figures
BibTex
@InProceedings{bannard-callisonburch:2005:ACL,
author = {Bannard, Colin and Callison-Burch, Chris},
title = {Paraphrasing with Bilingual Parallel Corpora},
booktitle = {Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05)},
month = {June},
year = {2005},
address = {Ann Arbor, Michigan},
publisher = {Association for Computational Linguistics},
pages = {597--604},
url = {http://www.aclweb.org/anthology/P05-1074},
}
|