CIS 530 Fall 2012 - Project Resources
Instructor: Ani Nenkova
Writing your report
- There is no required format for the project report. Just make sure it includes everything the project asks for. If you would like a template for your write-up, we provide two versions below:
- If you are using Word, refer to this Word-Version for details.
- If you are using LaTeX, refer to this: Latex-Version
- If you are using LaTeX, the other files needed to compile the above .tex file are in the folder report:
sty-file   bst-file   Full-instructions files by NAACL-2013
- There are a number of LaTeX tutorials on the web. You can look up specific needs such as writing a formula or aligning tables. Here is a link to one beginner's tutorial.
You can find some of these resources already installed on eniac.
- Punkt: An NLTK utility for segmenting a text into sentences. You can see example usage here.
- Porter stemmer: A tool for stemming words.
- nltk.pos_tag: NLTK's part of speech tagger.
- Stanford Part of Speech tagger: Another widely used tagger.
- The Stanford Parser: Creates both phrase structure and dependency parse trees.
- Stanford NER: A named entity recognizer built by Stanford.
- ANC corpus annotations: A corpus of articles from different genres. These articles have been annotated with named entities, parts-of-speech, syntactic structure and some have coreference annotations as well.
- Stanford CoreNLP: Runs several Stanford tools in a single pass to produce annotations: tokenization, sentence splitting, lemmatization, NER, POS tagging, parsing and coreference.
- Cherrypicker: A coreference resolution tool.
- Sentiwordnet: WordNet synsets tagged with polarity values: positive, negative, objective.
- MPQA: A number of resources related to polarity and opinions. OpinionFinder identifies subjective sentences. There is also a lexicon with polarity values for words.
- Wordnet-Affect: WordNet synsets representing emotions are tagged. There are also identifiers for positive and negative emotions.
- General Inquirer: A resource built by psychologists who created several word categories such as positive, negative, action, understatement, etc. Check out the wordlists here: positive tagged words, negative words.
- MRC Psycholinguistic database: A combination of several resources on lexical semantics. Particularly interesting for your projects are the ratings of each word for the degree of imagery and concreteness associated with it.
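As a rough illustration of how the first three NLTK tools in the list above (Punkt, the Porter stemmer and nltk.pos_tag) fit together, here is a minimal sketch. It assumes NLTK is installed along with the Punkt and tagger models (obtainable through nltk.download; the model names vary by NLTK version). The helper names split_sentences, stem_tokens and tag_tokens are hypothetical and not part of NLTK.

```python
import nltk
from nltk.stem import PorterStemmer


def split_sentences(text):
    """Segment raw text into sentences (uses the Punkt model)."""
    return nltk.sent_tokenize(text)


def stem_tokens(tokens):
    """Reduce each token to its Porter stem; needs no extra model data."""
    stemmer = PorterStemmer()
    return [stemmer.stem(token) for token in tokens]


def tag_tokens(tokens):
    """Return (token, part-of-speech tag) pairs (uses NLTK's default tagger model)."""
    return nltk.pos_tag(tokens)
```

A typical pipeline would call split_sentences on a document, tokenize each sentence with nltk.word_tokenize, and then pass the tokens to tag_tokens or stem_tokens depending on the feature you want.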
Other text analysis resources
Paths on eniac:
- Wordnet Domains: WordNet synsets tagged as belonging to a wide variety of domains such as economy, arts, transport, etc.
- AddDiscourse: A tool for marking up a text with explicit discourse connectives such as 'because' and 'but' and tagging them with the semantic sense that they indicate, such as "Cause" or "Comparison".
- TopicS: A tool for finding the topic words of an article or a collection of
- VerbOcean: A resource that will be useful for identifying temporal relations between verbs such as "happens before". Also has information on which verbs are similar or oppositely related.
- SRILM: A language modeling toolkit developed at the SRI laboratory. A fast and easy way to build large language models.
- Charniak parser and pronoun resolution tool: Tools from Brown University's NLP group.
- tgrep and Tregex: Tools that help to search for patterns in syntactic parse trees. A short tutorial on tgrep is here.
- Roget's thesaurus: Another electronically available thesaurus with an API. Can be used to obtain synonyms, antonyms and lexical chains.
(Classification tools do not count as tools/resources for the Final-Project.)
- SVM Light: An SVM classifier that is easy to use and also provides scripts that do some tuning of the parameters.
- WEKA: An implementation of a number of classifiers.
- CLUTO: A tool to create clusters of similar documents.
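SVM Light reads training examples as plain-text lines of the form "label index:value ...", with feature indices starting at 1 and listed in increasing order. The sketch below shows one way to produce such a line from a sparse feature dictionary. The function name to_svmlight_line and the dictionary representation are assumptions made for illustration, not part of SVM Light.

```python
def to_svmlight_line(label, features):
    """Format one example for SVM Light.

    label    -- +1 or -1 for binary classification
    features -- dict mapping feature index (int >= 1) to a numeric value
    """
    if any(index < 1 for index in features):
        raise ValueError("SVM Light feature indices start at 1")
    # Zero-valued features can be omitted; indices must be in increasing order.
    pairs = sorted((index, value) for index, value in features.items() if value != 0)
    fields = ["%+d" % label] + ["%d:%g" % (index, value) for index, value in pairs]
    return " ".join(fields)
```

Writing one such line per document to a file gives an input suitable for training, provided that the mapping from your features (words, tags, and so on) to integer indices is kept consistent between training and test data.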
Searching through huge data
- Apache Lucene: Build your own text search engine. There are several tutorials on Lucene online. A simple introduction is here.
- Graphviz: An easy-to-use tool for visualizing graphs. Already available on eniac. Example usage: "dot -Tpng graphFiles/d061.gr -o graphFiles/d061.png".
- Stanford Topic Modeling Toolbox: The Stanford Topic Modeling Toolbox (TMT) brings topic modeling tools to social scientists and others who wish to perform analysis on datasets that have a substantial textual component.
- CLAIRlib: The Clair library is a suite of open-source Perl modules intended to simplify a number of generic tasks in natural language processing (NLP), information retrieval (IR), and network analysis (NA).
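The Graphviz dot command in the list above reads a graph description written in the DOT language. As a minimal sketch, a directed graph stored as a list of (source, target) pairs could be turned into DOT text like this. The function name to_dot and the edge-list representation are assumptions for illustration.

```python
def to_dot(edges, name="G"):
    """Return DOT text for a directed graph given (source, target) pairs."""

    def quote(node):
        # Quote node names so spaces and punctuation are allowed in labels.
        return '"%s"' % str(node).replace('"', '\\"')

    lines = ["digraph %s {" % name]
    for source, target in edges:
        lines.append("    %s -> %s;" % (quote(source), quote(target)))
    lines.append("}")
    return "\n".join(lines) + "\n"
```

Saving the returned string to a file and passing that file to dot with -Tpng, as in the usage line above, should produce a PNG rendering of the graph.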
Paraphrase and Compression Corpora