Text Mining tutorials and references
Database Mining References
What is the best Data Mining Textbook?
 A statistician would say Hastie, Tibshirani & Friedman, Elements of Statistical Learning
 a database person might say:
Jiawei Han and Micheline Kamber, (2001), Data Mining: Concepts and ...
 a business person would choose something like
Data Mining Techniques by G. Linoff and M. Berry 2nd ed,
 somewhat farther afield, an EE might choose: Pattern Classification 2nd
ed. by Duda, Hart, and Stork, 2001.
All are good in their own right. The question is what level of
math you want and what your data looks like (a single table, or a
database), etc.
KDD, Data Mining overview
 Data Mining Techniques,
M. Berry and G. Linoff,
John Wiley, 1997
 a readable, if manager-oriented, overview of data mining
 or their second book: Mastering Data Mining: Art and Science of Customer Relationship Management, Wiley and Sons, 1999
 KDNuggets: the best data mining site
Data Preparation
 Data Preparation for Data Mining,
D. Pyle,
Morgan Kaufmann, 1999.
Data Warehousing
 Data Mining Solutions,
C. Westphal and T. Blaxton,
John Wiley, 1998
Data Visualization
 E. Tufte,
The Visual Display of Quantitative Information,
Envisioning Information and
his other books, (Graphics Press).
 These are wonderful books about how to present data graphically.
 Visual Revelations,
H. Wainer,
Copernicus, 1997
Machine Learning
 The Elements of Statistical Learning
Hastie, Tibshirani & Friedman,
 strongly biased towards a statistics viewpoint, but still the
best thing out there.
 Reinforcement Learning: An Introduction,
Sutton, R. and A. Barto
MIT Press, 1998

WEKA Java code library
 best free wide-coverage Java code for machine learning;
very widely used

MLC++ code library
 best free wide-coverage C++ code for machine learning;
not widely used
Clustering and Collaborative Filtering
 Recommender Systems
 Pointers to many companies and classic papers
 Cluster Analysis, 3rd Edition,
Brian S. Everitt,
Halsted Press, 1993.
 A very readable short overview of clustering methods.
 "Locally Weighted Learning",
C. G. Atkeson, S. A. Schaal and A. W. Moore,
AI Review, Volume 11, Pages 11-73 (Kluwer Publishers) 1997
 a detailed overview of k-nearest neighbor and related methods
 k-means clustering code
 with a cumbersome input format, but it runs well
 standard packages like R, Matlab, and all data mining software have many more options
Decision trees, CART and MARS
 C4.5: Programs for Machine Learning,
J.R. Quinlan,
Morgan Kaufmann, 1992
 A modern presentation of decision tree methods. Very readable and
comes with code.
 Classification and regression trees,
Leo Breiman ... et al.,
Wadsworth International Group, 1984.
 The original CART book; a bit dated, but still a classic
 CART and MARS software
Neural Networks
 Neural Networks for Pattern Recognition,
Bishop, C.M.,
Oxford University Press, 1995.
 An excellent overview of multilayer perceptron and radial basis
function neural networks from a statistician's point of view.
 Neural Networks: A Comprehensive Foundation,
Haykin, S.,
Macmillan, 1994.
 A good overview of neural nets from an electrical engineer's viewpoint;
covers a wide range of neural network types
 The Neural network FAQ
 overview of neural nets and pointers to software

More Neural net pointers [postscript]
Statistical Methods
 stepwise regression
 logistic regression
 Linear Statistical Methods,
Fox,
Wiley
 logistic regression is nicely covered on pp. 307-310.
 Statistical Models in S, Chambers and Hastie, Wadsworth, 1992
 covers a range of advanced statistical methods
Bayesian Belief Nets

Charniak, Eugene, "Bayesian Networks without tears", AI Magazine
12(4):50-63, Winter 1991.
 Intro to Bayesian networks for beginners.

Neapolitan, Richard E., "Probabilistic Reasoning in Expert Systems:
Theory and Algorithms", John Wiley and Sons, 1990.
 Practical guide to implementation.
 Finn V. Jensen, "Introduction to Bayesian Networks" 1996,
Springer Verlag; ISBN: 0387915028
available at amazon

Pearl, Judea, "Probabilistic Reasoning in Intelligent Systems:
Networks of Plausible Inference", Morgan Kaufmann, San Mateo,
California, 1988.
 Theoretical framework for Bayesian networks. The book that got the whole field going.
 Lots more references
 Bayesian networks
 What are belief nets good for and where to get code.
 other good free software: Netica
Genetic Algorithms
 "Genetic Algorithms.",
J. Holland,
Scientific American, July 1992, pp. 66-72.
 a nice overview of genetic algorithms
 Genetic Algorithms in search, optimization, and machine learning,
Goldberg, D.,
Addison-Wesley, 1989
 An introduction to Genetic Algorithms,
Mitchell, M.,
MIT Press, 1996
Hidden Markov Models and Speech
 Statistical Methods for Speech Recognition,
Jelinek, F.
MIT Press, 1998
Information Theory
 Information Theory, T.M. Cover and J. A. Thomas.
Wiley, 1991
 a solid introduction to Information theory
Sources of Data
Other
 Papers: supplemental material
 An industry-oriented overview is in the article by Two Crows.
ungar@cis.upenn.edu