Treebank tokenization

Our tokenization is fairly simple.

Here is a simple sed script that does a decent enough job on most corpora, once the corpus has been formatted with one sentence per line.
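
For readers who want to experiment without sed, the following is a minimal Python sketch of the same kind of rule-based pass. It is not the sed script referred to above. It assumes the commonly described Treebank conventions (most punctuation split from adjoining words, double quotes rewritten as `` and '', and contractions and genitives split off), and the names RULES and treebank_tokenize are hypothetical, chosen only for this illustration. The input is assumed to be a single sentence, matching the one-sentence-per-line format.

```python
import re

# Hypothetical, simplified rule set: (pattern, replacement) pairs applied in order.
# This approximates Treebank-style conventions; it is not the original sed script.
RULES = [
    # A double quote at the start of the sentence is an opening quote.
    (re.compile(r'^"'), '`` '),
    # A double quote after a space or an opening bracket is an opening quote.
    (re.compile(r'([ (\[{<])"'), r'\1`` '),
    # Any remaining double quote is treated as a closing quote.
    (re.compile(r'"'), " '' "),
    # Split common punctuation marks from adjoining words.
    (re.compile(r'([,;:?!()\[\]{}])'), r' \1 '),
    # Split only the sentence-final period, optionally followed by a closing quote.
    (re.compile(r"\.(?=\s*(?:''\s*)?$)"), ' .'),
    # Split the negative contraction: "don't" becomes "do n't".
    (re.compile(r"(?<=[A-Za-z])(n't)\b", re.IGNORECASE), r' \1'),
    # Split the genitive and other clitics: "John's" becomes "John 's".
    (re.compile(r"(?<=[A-Za-z])'(s|m|d|ll|re|ve)\b", re.IGNORECASE), r" '\1"),
]


def treebank_tokenize(sentence: str) -> list[str]:
    """Return an approximate Treebank-style token list for one sentence."""
    text = sentence.strip()
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    # The rules only insert spaces, so whitespace splitting yields the tokens.
    return text.split()


if __name__ == "__main__":
    print(" ".join(treebank_tokenize('He said, "I don\'t know."')))
```

Because only the final period is split off, sentence-internal abbreviations such as "Mr." are left intact, but a sentence that ends in an abbreviation will have its last period separated. This is one reason the sketch expects one sentence per line.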