GeneTaggerCRF is a gene-entity tagger based on MALLET (MAchine Learning for LanguagE Toolkit). It uses conditional random fields to find genes in a text file. It is currently available only for UNIX.
To download GeneTaggerCRF, click the following link:
You will also need to download MALLET and the WordFreak annotation tool:
Once these files have been downloaded, simply follow the directions in the README file.
In initial experiments, GeneTaggerCRF achieved 0.84 precision and 0.74 recall on gene-entity tagging, measured by 10-fold cross-validation on only 90 data files. We hope these numbers will improve once more training data is available.
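As a reminder of what these figures measure, here is a minimal Python sketch of exact-match precision and recall over entity spans, together with one simple way to form 10 folds over 90 files. The function names, the span representation, and the round-robin fold assignment are assumptions made for illustration; they are not taken from the GeneTaggerCRF code or its evaluation scripts.

```python
def precision_recall(gold, predicted):
    """Exact-match precision and recall over two collections of entity spans.

    Each span is a hashable item such as (start, end); this representation
    is an assumption for the sketch.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def k_fold_splits(items, k=10):
    """Yield (train, test) pairs so that each item is tested exactly once.

    Items are assigned to folds round-robin by position (hypothetical choice).
    """
    for i in range(k):
        test = items[i::k]
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, test


# Toy example: 4 gold gene mentions, 4 predicted, 3 of them correct.
gold_spans = [(0, 5), (10, 15), (20, 25), (30, 35)]
predicted_spans = [(0, 5), (10, 15), (20, 25), (40, 45)]
p, r = precision_recall(gold_spans, predicted_spans)

# 90 placeholder file indices split into 10 folds of 9 test files each.
folds = list(k_fold_splits(list(range(90)), k=10))
```

In a cross-validation run, the tagger would be trained on each `train` list and scored on the matching `test` list, and the per-fold precision and recall would then be averaged.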
If you have any questions or comments about GeneTaggerCRF, contact Rishi Talreja at rtalreja@uiuc.edu.
This is the part-of-speech (POS) tagger currently used in the Mining the Bibliome project. The project uses the WordFreak annotation tool, and annotators run the POS tagger through the WordFreak GUI. The tagger is available here for those who want to tag files without WordFreak: it takes text files as input and produces output in TOKEN_POS format. Because the Mining the Bibliome software is still being used internally, several jars need to be downloaded. These, along with the licensing information and a README, are all available as a gzipped tar file here. The tagger has so far been tested only in a Linux/UNIX environment, although it should work under Windows as well.
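For readers who want to post-process the tagger's output, the following Python sketch reads TOKEN_POS text into (token, tag) pairs. It assumes that each whitespace-separated item is a token and its tag joined by the last underscore; that layout is inferred from the format's name and should be checked against the README. The function name and the tags in the sample line are hypothetical.

```python
def parse_token_pos(line):
    """Split one line of TOKEN_POS output into (token, tag) pairs.

    Assumption: items are whitespace-separated and the tag follows the
    LAST underscore, so tokens that contain underscores stay intact.
    """
    pairs = []
    for item in line.split():
        token, sep, tag = item.rpartition("_")
        if not sep:
            raise ValueError(f"no underscore in item: {item!r}")
        pairs.append((token, tag))
    return pairs


# Hypothetical sample line; the tags are illustrative only.
sample = "K-ras_NN codon_NN 12_CD mutations_NNS"
pairs = parse_token_pos(sample)
```

Splitting on the last underscore rather than the first is a deliberate choice here, since biomedical tokens may themselves contain underscores.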
If you have any questions or comments about the POS tagger, contact Seth Kulick at skulick@linc.cis.upenn.edu.
This tagger was developed using the MALLET
implementation of Conditional Random Fields.
A description of the problem from the README file:
Our task was to develop an automated algorithm that would accurately recognize each
component of an acquired genomic aberration (hereafter referred to as a variation event)
within a cancer specific text (UPenn Biomedical Information Extraction Group, 2003).
Briefly, we define a variation event as a specific, one-time alteration at the genomic
level, and described at the nucleic acid level, amino acid level or both. Each variation
event is identified by the relationship between three variation components: variation
type, variation location, and variation state (both initial and subsequent states).
As an illustration:
"All cases with K-ras codon 12 mutations were found to be G to T transversion."
(Wang et al., 2002)
In this sentence variation component tags would be assigned as follows:
transversion, variation type; codon 12, variation location; G, variation state (initial);
and T, variation state (subsequent). The relationship between these components defines
a single variation event. This entity definition is suitable for a variety of
applications (e.g. other genetic diseases) and readily modified to include
naturally occurring variations (e.g. single nucleotide polymorphisms). Furthermore,
our experience indicates that this definition is generic and capable of capturing
the details of diverse variation events (e.g. point mutation, translocation, aneuploidy,
loss of heterozygosity). Therefore, the task was to properly identify each of the
components independently.
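The entity definition in the README excerpt can be sketched as a small Python data structure, using the Wang et al. sentence as the example. The class name, field names, and label strings are hypothetical and are not the tag set used by the tagger. The dictionary lookup only illustrates what a correct labeling of this one sentence looks like; the actual tagger predicts labels with a conditional random field.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariationEvent:
    """One variation event: the relationship between its components."""
    variation_type: str
    location: str
    initial_state: str
    subsequent_state: str


def tag_tokens(tokens, event):
    """Label each token with its variation component, or "O" for none.

    Naive exact-match lookup for illustration only; label names are
    hypothetical.
    """
    labels = {}
    components = (
        ("TYPE", event.variation_type),
        ("LOCATION", event.location),
        ("STATE_INITIAL", event.initial_state),
        ("STATE_SUBSEQUENT", event.subsequent_state),
    )
    for label, text in components:
        for token in text.split():
            labels[token] = label
    return [(token, labels.get(token, "O")) for token in tokens]


sentence = ("All cases with K-ras codon 12 mutations were found "
            "to be G to T transversion.")
event = VariationEvent("transversion", "codon 12", "G", "T")

tokens = sentence.rstrip(".").split()
tagged = tag_tokens(tokens, event)
components_found = [(tok, lab) for tok, lab in tagged if lab != "O"]
```

Here `components_found` holds the five labeled tokens (`codon`, `12`, `G`, `T`, `transversion`), while every other token, including `K-ras`, is labeled `O`, reflecting that each component is identified independently of the gene mention.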
A trained tagger can be downloaded at vartag-v1.0.
The download includes all the source code as well as the data the tagger was trained on.
The training data in its original XML format is available in the file
vartag-traindata.tar.gz.
A README file is included that gives instructions for
installation and running. MALLET is included only as a jar file. The source code
for MALLET is freely available here.
If you have any questions or comments, contact Ryan McDonald at ryantm at cis dot upenn dot edu.