Unusually frequent or rare words are implicated in various facets of biological function and structure. As sequence data become massively available, tasks such as exhaustively enumerating and testing word frequencies across a whole genome become increasingly appealing, yet they pose significant computational burdens even when limited to words of bounded maximum length. In addition, the huge tables that may result from these counts pose significant problems of visualization and inference.
In this talk we show efficient and practical algorithms for detecting words that are, by some measure, over- or under-represented in the context of larger sequences. We also show that such anomaly detectors can be used successfully to discover (exact) patterns in biological sequences.
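
To make "over- or under-represented by some measure" concrete, here is a minimal Python sketch of one possible measure. It is not the algorithm presented in the talk: it is a naive scan over fixed-length words, and it assumes symbols are drawn independently with their observed frequencies, scoring each word with a z-score under a binomial approximation. The function name word_zscores and the parameter k are hypothetical, chosen only for illustration.

```python
from collections import Counter
from math import sqrt


def word_zscores(seq, k):
    """Score each length-k word occurring in seq.

    Assumption (not from the talk): symbols are i.i.d. with their
    empirical frequencies, so a word's probability is the product of
    its symbols' frequencies.
    """
    n = len(seq) - k + 1  # number of length-k windows in seq
    if n <= 0:
        return {}

    # Empirical frequency of each symbol in the sequence.
    symbol_freq = {s: c / len(seq) for s, c in Counter(seq).items()}

    # Observed count of every length-k word (overlapping windows).
    counts = Counter(seq[i:i + k] for i in range(n))

    scores = {}
    for word, observed in counts.items():
        p = 1.0
        for s in word:
            p *= symbol_freq[s]
        expected = n * p
        # Binomial variance approximation; it ignores the correlation
        # between overlapping occurrences of the same word.
        var = n * p * (1 - p)
        scores[word] = (observed - expected) / sqrt(var) if var > 0 else 0.0
    return scores


if __name__ == "__main__":
    for word, z in sorted(word_zscores("AAAAACCCCC", 2).items()):
        print(word, round(z, 2))
```

A positive score marks a word seen more often than this simple model expects, a negative score one seen less often. The sketch only scores words that actually occur, so a word that is entirely absent (the extreme case of under-representation) would need to be enumerated separately, and the cost grows quickly if every word length is scanned in turn.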
(Joint work with A. Apostolico, M. E. Bock, and F. Gong)