Survey of Text Mining II
Michael W. Berry, BS, MS, PhD
Department of Computer Science
University of Tennessee, USA
Malu Castellanos, PhD
Hewlett-Packard Laboratories Palo Alto, California, USA
The publisher makes no representation, express or implied, with regard to the accuracy of the information contained in this book and cannot accept any legal responsibility or liability for any errors or omissions that may be made.
Printed on acid-free paper.
As we enter the third decade of the World Wide Web (WWW), the textual revolution has seen a tremendous change in the availability of online information. Finding information for just about any need has never been more automatic—just a keystroke or mouse click away. While the digitization and creation of textual materials continue at light speed, the ability to navigate, mine, or casually browse through documents too numerous to read (or print) lags far behind.
What approaches to text mining are available to efficiently organize, classify, label, and extract relevant information for today’s information-centric users? What algorithms and software should be used to detect emerging trends from both text streams and archives? These are just a few of the important questions addressed at the Text Mining Workshop held on April 28, 2007, in Minneapolis, MN. This workshop, the fifth in a series of annual workshops on text mining, was held on the final day of the Seventh SIAM International Conference on Data Mining (April 26–28, 2007).
In Part I (Clustering), Howland and Park update their work on cluster-preserving dimension reduction methods for efficient text classification. Likewise, Senellart and Blondel revisit thesaurus construction using similarity measures between vertices in graphs. Both of these chapters were part of the first edition of this book (based on a SIAM text mining workshop held in April 2002). The next three chapters are completely new contributions. Zeimpekis and Gallopoulos implement and evaluate several clustering schemes that combine partitioning and hierarchical algorithms. Kogan, Nicholas, and Wiacek look at the hybrid clustering of large, high-dimensional data. AlSumait and Domeniconi round out this topic area with an examination of local semantic kernels for the clustering of text documents.
In Part II (Document Retrieval and Representation), Kobayashi and Aono revise their first edition chapter on the importance of detecting and interpreting minor document clusters using a vector space model based on principal component analysis (PCA) rather than the popular latent semantic indexing (LSI) method. This is followed by Xia, Xing, Qi, and Li's chapter on applications of semidefinite programming in XML document classification.
The editors would like to thank Murray Browne of the University of Tennessee and Catherine Brett of Springer UK for coordinating the management of manuscripts among the authors, editors, and the publisher.
Michael W. Berry and Malu Castellanos
Knoxville, TN and Palo Alto, CA
August 2007
Contents

2 Automatic Discovery of Similar Words
Pierre Senellart, Vincent D. Blondel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

6 Vector Space Models for Search and Cluster Mining
Mei Kobayashi, Masaki Aono . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

7 Applications of Semidefinite Programming in XML Document Classification
Zhonghang Xia, Guangming Xing, Houduo Qi, Qi Li . . . . . . . . . . . . . . . . . . . . . 129

9 Spam Filtering Based on Latent Semantic Indexing
Wilfried N. Gansterer, Andreas G.K. Janecek, Robert Neumayer . . . . . . . . . . . . 165

Part IV Anomaly Detection

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
Contributors
Michael W. Berry
Department of Electrical Engineering and Computer Science University of Tennessee
203 Claxton Complex
Knoxville, TN 37996-3450
Email: berry@eecs.utk.edu
Homepage: http://www.cs.utk.edu/˜berry
Santanu Das
Intelligent Systems Division
NASA Ames Research Center
Moffett Field, CA 94035
Email: sdas@email.arc.nasa.gov
Carlotta Domeniconi
Department of Computer Science
George Mason University
4400 University Drive MSN 4A4
Fairfax, VA 22030
Email: carlotta@ise.gmu.edu
Homepage: http://www.ise.gmu.edu/˜carlotta
Michael R. Horvath
Department of Computer Science
Wake Forest University
P.O. Box 7311
Winston-Salem, NC 27109
Email: horvmr5@wfu.edu
Peg Howland
Department of Mathematics and Statistics
Utah State University
3900 Old Main Hill
Logan, UT 84322-3900
Email: peg.howland@usu.edu
Homepage: http://www.math.usu.edu/˜howland
Jacob Kogan
Department of Mathematics and Statistics
University of Maryland, Baltimore County
1000 Hilltop Circle
Baltimore, MD 21250
Email: kogan@math.umbc.edu
Homepage: http://www.math.umbc.edu/˜kogan
Christopher V. Kopek
Department of Computer Science
Wake Forest University
P.O. Box 7311
Winston-Salem, NC 27109
Email: kopecv5@wfu.edu
Charles Nicholas
Department of Computer Science and Electrical Engineering
University of Maryland, Baltimore County
1000 Hilltop Circle
Baltimore, MD 21250
Email: nicholas@cs.umbc.edu
Homepage: http://www.cs.umbc.edu/˜nicholas
Farhad Oroumchian
College of Information Technology
University of Wollongong in Dubai
P.O. Box 20183, Dubai
U.A.E.
Houduo Qi
Department of Mathematics
University of Southampton, Highfield
Southampton SO17 1BJ, UK
Email: hdqi@soton.ac.uk
Homepage: http://www.personal.soton.ac.uk/hdqi

Narjes Sharif Razavian
Department of Electrical and Computer Engineering
University of Tehran
P.O. Box 14395-515, Tehran
Iran
Email: n.razavian@ece.ut.ac.ir
Hassan Seyed Razi
Department of Electrical and Computer Engineering
University of Tehran
P.O. Box 14395-515, Tehran
Iran
Email: seyedraz@ece.ut.ac.ir
Mike Wiacek
Google, Inc.
1600 Amphitheatre Parkway
Mountain View, CA 94043
Email: mjwiacek@google.com
Cluster-Preserving Dimension Reduction Methods for Document Classification
Peg Howland and Haesun Park
Dimension reduction is commonly based on rank reduction by the truncated singular value decomposition (SVD). For any matrix $A \in \mathbb{R}^{m \times n}$, its SVD can be defined as

$$A = U \Sigma V^T, \qquad (1.1)$$

where $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are orthogonal, $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_p) \in \mathbb{R}^{m \times n}$ with $p = \min(m, n)$, and the singular values are ordered as $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_p \ge 0$ [GV96, Bjö96]. Denoting the columns of $U$, or left singular vectors, by $u_i$, the columns of $V$, or right singular vectors, by $v_i$, and the rank of $A$ by $q$, we write
$$A = \sum_{i=1}^{q} \sigma_i u_i v_i^T. \qquad (1.2)$$

Truncating this sum after $l < q$ terms,

$$A \approx \sum_{i=1}^{l} \sigma_i u_i v_i^T,$$

provides the rank-$l$ approximation that is closest to the data matrix in the $L_2$ norm or Frobenius norm [GV96]. This is the main tool in principal component analysis (PCA) [DHS01], as well as in latent semantic indexing (LSI) [DDF+90, BDO95] of documents.
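As a concrete illustration, here is a minimal NumPy sketch (the small term-document matrix and the choice of $l = 2$ are hypothetical, chosen only for illustration) that forms the truncated SVD and the resulting rank-$l$ approximation:

    import numpy as np

    # Hypothetical term-document matrix: rows are terms, columns are documents.
    A = np.array([[2., 0., 1., 0.],
                  [1., 1., 0., 0.],
                  [0., 2., 0., 1.],
                  [0., 0., 1., 2.]])

    # Thin SVD: A = U @ diag(s) @ Vt, singular values returned in descending order.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    l = 2  # target rank of the reduced representation
    A_l = U[:, :l] @ np.diag(s[:l]) @ Vt[:l, :]

    # A_l is the closest rank-l matrix to A in the L2 and Frobenius norms.
    print(np.linalg.norm(A - A_l, 'fro'))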
If the data form clusters in the full dimension, the goal may change from finding the best lower dimensional representation to finding the lower dimensional representation that best preserves this cluster structure. That is, even after dimension reduction, a new document vector should be classified with the appropriate cluster. Assuming that the columns of A are grouped into clusters, rather than treating each column equally regardless of its membership in a specific cluster, as is done with the SVD, the dimension reduction methods we will discuss attempt to preserve this information. This is important in information retrieval, since the reduced representation itself will be used extensively in further processing of data.
Given a term-document matrix

$$A = [a_1 \mid a_2 \mid \cdots \mid a_n] \in \mathbb{R}^{m \times n},$$

we seek a lower rank approximation

$$A \approx BY,$$

where both $B \in \mathbb{R}^{m \times l}$ with $\mathrm{rank}(B) = l$ and $Y \in \mathbb{R}^{l \times n}$ with $\mathrm{rank}(Y) = l$ are to be found. This lower rank approximation is not unique, since for any nonsingular matrix $Z \in \mathbb{R}^{l \times l}$,

$$A \approx BY = (BZ)(Z^{-1}Y),$$

where $\mathrm{rank}(BZ) = l$ and $\mathrm{rank}(Z^{-1}Y) = l$.
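This non-uniqueness is easy to verify numerically; in the following sketch (arbitrary random factors B and Y and a nonsingular Z, used purely for illustration), (BZ)(Z^{-1}Y) reproduces BY exactly:

    import numpy as np

    m, n, l = 6, 5, 2
    B = np.random.rand(m, l)                   # rank-l factor B
    Y = np.random.rand(l, n)                   # rank-l factor Y
    Z = np.random.rand(l, l) + l * np.eye(l)   # diagonally dominant, hence nonsingular

    # The factorization BY is not unique: (BZ)(Z^{-1}Y) gives the same product.
    print(np.allclose(B @ Y, (B @ Z) @ (np.linalg.inv(Z) @ Y)))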
In general, starting with $A_1 = A$, and choosing $x_k$ and $y_k$ such that $\omega_k = y_k^T A_k x_k \ne 0$, the Wedderburn formula generates the sequence

$$A_{k+1} = A_k - \omega_k^{-1} (A_k x_k)(y_k^T A_k).$$

Adding up all the rank-one updates, factoring into matrix outer product form, and truncating gives an approximation $A \approx BY$. The question becomes: what are good choices for $x_k$ and $y_k$?
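The following NumPy sketch runs the Wedderburn sequence under one simple, hypothetical choice: $x_k$ and $y_k$ are taken to be coordinate vectors selecting the entry of largest magnitude in $A_k$. Accumulating the scaled rank-one terms yields the factors B and Y:

    import numpy as np

    def wedderburn(A, steps, choose_xy):
        """Run `steps` Wedderburn rank-one reductions and return B, Y with A ~ B @ Y."""
        Ak = A.astype(float).copy()
        B_cols, Y_rows = [], []
        for _ in range(steps):
            x, y = choose_xy(Ak)                 # must give omega = y^T A_k x != 0
            omega = float(y @ Ak @ x)
            u, v = Ak @ x, y @ Ak                # rank-one pieces A_k x and y^T A_k
            B_cols.append(u / omega)
            Y_rows.append(v)
            Ak = Ak - np.outer(u, v) / omega     # Wedderburn update A_{k+1}
        return np.column_stack(B_cols), np.vstack(Y_rows)

    def largest_entry(Ak):
        """Pick coordinate vectors at the entry of largest magnitude (a simple pivot rule)."""
        i, j = np.unravel_index(np.argmax(np.abs(Ak)), Ak.shape)
        x = np.zeros(Ak.shape[1]); x[j] = 1.0
        y = np.zeros(Ak.shape[0]); y[i] = 1.0
        return x, y

    A = np.random.rand(6, 5)
    B, Y = wedderburn(A, steps=3, choose_xy=largest_entry)
    print(np.linalg.norm(A - B @ Y))             # approximation error after 3 updates

Different choices of $x_k$ and $y_k$ lead to different factorizations, which is exactly the question posed above.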
The goal of linear discriminant analysis (LDA) is to combine features of the original data in a way that most effectively discriminates between classes. With an appropriate extension, it can be applied to the goal of reducing the dimension of a term-document matrix in a way that most effectively preserves its cluster structure. That is, we want to find a linear transformation $G$ whose transpose maps each document vector $a$ in the $m$-dimensional space to a vector $y$ in the $l$-dimensional space ($l \ll m$):

$$G^T : a \in \mathbb{R}^{m \times 1} \to y \in \mathbb{R}^{l \times 1}.$$
Assuming that the given data are already clustered, we seek a transformation that optimally preserves this cluster structure in the reduced dimensional space.
Suppose the columns of $A$ are partitioned into $k$ clusters. Let $N_i$ denote the set of column indices belonging to cluster $i$, with $n_i = |N_i|$, so that the centroid of cluster $i$ and the global centroid are

$$c^{(i)} = \frac{1}{n_i} \sum_{j \in N_i} a_j \quad \text{and} \quad c = \frac{1}{n} \sum_{j=1}^{n} a_j. \qquad (1.5)$$

The within-cluster, between-cluster, and mixture scatter matrices are then defined as

$$S_w = \sum_{i=1}^{k} \sum_{j \in N_i} (a_j - c^{(i)})(a_j - c^{(i)})^T,$$

$$S_b = \sum_{i=1}^{k} \sum_{j \in N_i} (c^{(i)} - c)(c^{(i)} - c)^T,$$

and

$$S_m = \sum_{j=1}^{n} (a_j - c)(a_j - c)^T,$$

respectively. The scatter matrices have the relationship [JD88]

$$S_m = S_w + S_b. \qquad (1.6)$$
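As a sanity check, the sketch below (random data and hypothetical cluster labels, using NumPy) builds the three scatter matrices directly from their definitions and verifies the relationship (1.6):

    import numpy as np

    # Hypothetical clustered term-document matrix: columns are documents.
    A = np.random.rand(5, 8)
    labels = np.array([0, 0, 0, 1, 1, 2, 2, 2])   # cluster index of each column

    c = A.mean(axis=1, keepdims=True)              # global centroid
    S_w = np.zeros((A.shape[0], A.shape[0]))
    S_b = np.zeros_like(S_w)
    for i in np.unique(labels):
        A_i = A[:, labels == i]
        c_i = A_i.mean(axis=1, keepdims=True)      # centroid of cluster i
        D = A_i - c_i
        S_w += D @ D.T                             # within-cluster scatter
        S_b += A_i.shape[1] * (c_i - c) @ (c_i - c).T   # between-cluster scatter

    S_m = (A - c) @ (A - c).T                      # mixture scatter

    print(np.allclose(S_m, S_w + S_b))             # relationship (1.6) holds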
Here,

$$\mathrm{trace}(S_w) = \sum_{i=1}^{k} \sum_{j \in N_i} (a_j - c^{(i)})^T (a_j - c^{(i)}) = \sum_{i=1}^{k} \sum_{j \in N_i} \|a_j - c^{(i)}\|^2$$

measures the closeness of the columns within the clusters, and

$$\mathrm{trace}(S_b) = \sum_{i=1}^{k} \sum_{j \in N_i} (c^{(i)} - c)^T (c^{(i)} - c) = \sum_{i=1}^{k} \sum_{j \in N_i} \|c^{(i)} - c\|^2$$

measures the separation between clusters. A transformation $G^T$ that preserves the cluster structure should therefore simultaneously solve

$$\max_G \ \mathrm{trace}(G^T S_b G) \quad \text{and} \quad \min_G \ \mathrm{trace}(G^T S_w G). \qquad (1.7)$$
Assuming the matrix $S_w$ is nonsingular, classical LDA approximates this simultaneous trace optimization by finding a transformation $G$ that maximizes

$$\mathrm{trace}\bigl((G^T S_w G)^{-1} G^T S_b G\bigr).$$
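Continuing the scatter-matrix sketch above (and assuming, as in the text, that $S_w$ is nonsingular), one common way to compute such a $G$ is from the leading eigenvectors of the generalized eigenproblem $S_b g = \lambda S_w g$; SciPy's symmetric solver is used here for convenience:

    import numpy as np
    from scipy.linalg import eigh

    def lda_transform(S_w, S_b, l):
        """Return G whose columns are the l leading eigenvectors of the pencil (S_b, S_w),
        maximizing trace((G^T S_w G)^{-1} G^T S_b G)."""
        eigvals, eigvecs = eigh(S_b, S_w)      # generalized symmetric eigenproblem, ascending order
        return eigvecs[:, ::-1][:, :l]         # keep the l leading eigenvectors

    # Using S_w, S_b, and A from the scatter-matrix sketch above:
    G = lda_transform(S_w, S_b, l=2)
    Y = G.T @ A                                # documents in the reduced l-dimensional space
    print(Y.shape)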