CIS 530 Fall 2015 - Project Resources

Instructor: Ani Nenkova


Preprocessing

Some of these resources are already installed on ENIAC:
  1. /project/cis/nlp/tools/postagger-2006-05-21
  2. /project/cis/nlp/tools/stanford-parser
  3. /project/cis/nlp/tools/stanford-ner
  4. /project/cis/nlp/tools/stanford-corenlp-2010-11-11
  5. /project/cis/nlp/tools/cherrypicker1.01

Sentiment Analysis

  • SentiWordNet: WordNet synsets tagged with polarity values (positive, negative, objective).
  • MPQA: A number of resources related to polarity and opinions.
  • WordNet-Affect: WordNet synsets that represent emotions are tagged, with additional identifiers for positive and negative emotions.
  • General Inquirer: A resource built by psychologists, who defined several word categories such as positive, negative, action, and understatement.
  • MRC Psycholinguistic database: A combination of several resources on lexical semantics.
    The MRC dictionary is at /project/cis/nlp/tools/MRC/mrc2.dct on ENIAC. A parsed version is at /project/cis/nlp/tools/MRC/MRC_parsed; it provides each word's ratings for age of acquisition (AOA), familiarity (FAM), concreteness (CONC), imageability (IMAG), and meaningfulness (MEANC). The file MRC_words includes 4,923 words with non-zero values in the familiarity dictionary. This word list can serve as the vocabulary for bag-of-words vector representations of the leads.
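
The bag-of-words use of the MRC word list might look like the Python sketch below. It is only an illustration: it assumes MRC_words holds one word per line (the file layout is not described here, so check it first), and the helper names load_vocabulary and bag_of_words are hypothetical.

```python
from collections import Counter


def load_vocabulary(lines):
    """Map each distinct word to a vector index, in first-seen order.

    Assumption: one word per line; verify against the real MRC_words file.
    """
    vocab = {}
    for line in lines:
        word = line.strip().lower()
        if word and word not in vocab:
            vocab[word] = len(vocab)
    return vocab


def bag_of_words(tokens, vocab):
    """Count vocabulary words in a tokenized lead; other tokens are ignored."""
    vector = [0] * len(vocab)
    for token, count in Counter(t.lower() for t in tokens).items():
        index = vocab.get(token)
        if index is not None:
            vector[index] = count
    return vector


# Toy stand-in for the contents of MRC_words (hypothetical sample words).
vocab = load_vocabulary(["market", "price", "storm"])
lead_vector = bag_of_words(
    "The storm hit the market as the market closed".split(), vocab
)
```

To use the real list, pass an open file handle for MRC_words to load_vocabulary in place of the toy list.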

Other text analysis resources

  • WordNet Domains: WordNet synsets tagged as belonging to a wide variety of domains such as economy, arts, and transport.
  • AddDiscourse: A tool for marking up a text with explicit discourse connectives such as 'because' and 'but' and tagging them with the semantic sense that they indicate, such as "Cause" or "Comparison".
  • VerbOcean: A resource useful for identifying temporal relations between verbs, such as "happens before". It also records which verbs are similar or opposite in meaning.
  • SRILM: A language modeling toolkit developed at SRI International. A fast and easy way to build large language models. Available at /project/cis/nlp/tools/srilm/ on ENIAC.
  • Charniak parser and pronoun resolution tool: Tools from Brown University's NLP group.
  • tgrep and Tregex: Tools that help to search for patterns in syntactic parse trees.
  • Roget's thesaurus: Another electronically available thesaurus with an API. Can be used to obtain synonyms, antonyms and lexical chains.

Word Categorization and Word Representation

  • word2vec
  • Brown clusters. Also available at /project/cis/nlp/tools/brown-cluster on ENIAC.
  • Other word representations for NLP available here
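
One common way to use Brown clusters is to turn prefixes of a word's cluster bit string into features. The Python sketch below assumes the clustering output is a tab-separated file with one "bit string, word, count" entry per line, which is the layout some Brown clustering implementations write; the files on ENIAC may differ, so check them first. The function names and the sample lines are hypothetical.

```python
def load_cluster_paths(lines):
    """Read 'bitstring<TAB>word<TAB>count' lines into a word -> bitstring map.

    Assumption: tab-separated layout; lines with fewer than two fields are skipped.
    """
    word_to_path = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 2:
            word_to_path[parts[1]] = parts[0]
    return word_to_path


def prefix_features(word, word_to_path, lengths=(2, 4)):
    """Return one feature per prefix length; unknown words get no features."""
    path = word_to_path.get(word)
    if path is None:
        return []
    return [f"bc{n}={path[:n]}" for n in lengths]


# Made-up sample lines standing in for a real clustering output file.
sample = ["0010\tapple\t12\n", "0011\tpear\t7\n", "110\trun\t30\n"]
paths = load_cluster_paths(sample)
features = prefix_features("apple", paths)
```

Shorter prefixes group words into coarser clusters, so several prefix lengths give the classifier features at more than one granularity.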

Classifiers

  • SVM Light
  • WEKA: An implementation of a number of classifiers.
  • LibSVM: Available at /project/cis/nlp/tools/libsvm-3.16/ on ENIAC.
  • Liblinear: Available at /project/cis/nlp/tools/liblinear-1.96/ on ENIAC.
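
LibSVM and Liblinear both read training data in a sparse text format, one example per line: a label followed by index:value pairs with 1-based, ascending feature indices. The Python sketch below shows one way to produce such lines from dense feature vectors; the function name to_libsvm_line and the toy examples are hypothetical.

```python
def to_libsvm_line(label, vector):
    """Format one example as '<label> <index>:<value> ...'.

    Indices are 1-based and ascending; zero-valued features are omitted.
    """
    features = [f"{i}:{v:g}" for i, v in enumerate(vector, start=1) if v != 0]
    return " ".join([str(label)] + features)


# Two toy examples: (label, dense feature vector).
examples = [(1, [2, 0, 1]), (-1, [0, 0.5, 0])]
lines = [to_libsvm_line(label, vec) for label, vec in examples]
```

Writing these lines to a file, one per line, should give input that the training programs shipped with either package can read.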

Clustering toolkit

  • CLUTO: A tool to create clusters of similar documents.

Searching through huge data

  • Apache Lucene: Build your own text search engine. There are several tutorials on Lucene online. A simple introduction is here.

Visualization

  • Graphviz: An easy-to-use tool for visualizing graphs. Already available on ENIAC. Example usage: "dot -Tpng graphFiles/d061.gr -o graphFiles/d061.png".
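
The input to dot is a plain-text graph description in the DOT language. As a rough sketch, the Python snippet below generates such a description from a list of weighted edges; the function name to_dot, the node names, and the weights are made up for illustration, and the snippet assumes the node names contain no double quotes.

```python
def to_dot(edges, name="G"):
    """Render (source, target, weight) triples as an undirected DOT graph."""
    lines = [f"graph {name} {{"]
    for source, target, weight in edges:
        # Quote node names so that names with spaces stay valid DOT.
        lines.append(f'  "{source}" -- "{target}" [label="{weight:.2f}"];')
    lines.append("}")
    return "\n".join(lines) + "\n"


# Hypothetical edges, e.g. similarity scores between sentences.
dot_text = to_dot([("s1", "s2", 0.5), ("s2", "s3", 0.25)])
```

Saving dot_text to a file and passing that file to dot with -Tpng, as in the usage line above, should render the graph as a PNG image.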

Topic Modeling

  • Stanford Topic Modeling Toolbox (TMT): Brings topic modeling tools to social scientists and others who wish to analyze datasets that have a substantial textual component.

CLAIRlib

  • CLAIRlib: The Clair library is a suite of open-source Perl modules intended to simplify a number of generic tasks in natural language processing (NLP), information retrieval (IR), and network analysis (NA).

Paraphrase and Compression Corpora


Paraphrase Resources


Opinion Summarization


Sentence Ordering