To put our algorithm in context, let us compare it with Latent Semantic Indexing (LSI, [3]), a commonly used approach in Natural Language Processing.
LSI works in two steps. First it computes the singular value decomposition of the document-keyword co-occurrence matrix $A$: $A = U \Sigma V^\top$. The left singular vectors $U$ define a new feature space in which documents can be clustered using the similarity defined by $A A^\top$. Similarly, the keywords can be clustered using $A^\top A$.
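The two steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the small binary matrix $A$ (rows as documents, columns as keywords) is a made-up example.

```python
import numpy as np

# Hypothetical document-keyword co-occurrence matrix A:
# rows = documents, columns = keywords.
A = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 0, 1, 1]], dtype=float)

# Step 1: singular value decomposition A = U diag(s) V^T.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Step 2: documents are represented in the space spanned by the left
# singular vectors (scaled by the singular values); their pairwise
# inner products reproduce A A^T.  Keywords analogously via V.
doc_features = U * s                      # one row per document
doc_sim = doc_features @ doc_features.T   # equals A @ A.T
kw_features = Vt.T * s                    # one row per keyword
kw_sim = kw_features @ kw_features.T      # equals A.T @ A
```

Clustering can then be run on `doc_features` (or directly on the similarity matrices), which is the sense in which the singular vectors define the new feature space.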
Note that the SVD of $A$ is equivalent to the eigendecompositions $A A^\top = U \Sigma^2 U^\top$ and $A^\top A = V \Sigma^2 V^\top$. Therefore the clustering information used is contained in $A A^\top$ and $A^\top A$.
Expanding out the terms, we see that $(A A^\top)_{ij} = \sum_k A_{ik} A_{jk}$ defines a document-document similarity measure by counting the number of keywords occurring in both documents $i$ and $j$.
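For a binary co-occurrence matrix this counting interpretation is literal, as a short check shows (the two-document matrix here is a made-up example):

```python
import numpy as np

# Hypothetical binary co-occurrence matrix: document 0 contains
# keywords 0 and 1; document 1 contains keywords 0 and 2.
A = np.array([[1, 1, 0],
              [1, 0, 1]])

# (A A^T)_{ij} = sum_k A_{ik} A_{jk} counts the keywords present
# in both document i and document j.
shared = A @ A.T
# shared[0, 1] == 1: the two documents share exactly one keyword
# (keyword 0); the diagonal gives each document's keyword count.
```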
As we saw in the email example, using the complete set of keywords as the basis for the similarity measure (figure 5(e)) can obscure the true similarity (figure 5(d)) provided by the informative keywords alone.
A remedy to this problem is to try to identify the non-informative keywords, e.g. by computing keyword occurrence frequency counts. This alone is insufficient: as figure 5(c) shows, non-informative keywords occur at both high frequency (common keywords) and low frequency (random keywords).
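The failure mode can be made concrete with a toy sketch. The counts below are invented for illustration (they are not the data behind figure 5(c)): because the non-informative keywords sit at both extremes of the frequency range, no single frequency threshold separates them from the informative keywords in between.

```python
# Hypothetical keyword occurrence counts over a document collection.
counts = {
    "the": 950,      # common keyword: non-informative, high frequency
    "zxqv": 3,       # random keyword: non-informative, low frequency
    "meeting": 120,  # informative keyword, moderate frequency
    "budget": 95,    # informative keyword, moderate frequency
}
noninformative = {"the", "zxqv"}

# The most and least frequent keywords are both non-informative, so a
# cutoff that removes one extreme necessarily keeps the other.
most_frequent = max(counts, key=counts.get)
least_frequent = min(counts, key=counts.get)
```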