Survey of Text Mining II
Michael W. Berry, BS, MS, PhD
Department of Computer Science
University of Tennessee, USA
Malu Castellanos, PhD
Hewlett-Packard Laboratories Palo Alto, California, USA
The publisher makes no representation, express or implied, with regard to the accuracy of the information contained in this book and cannot accept any legal responsibility or liability for any errors or omissions that may be made.
Printed on acid-free paper.
As we enter the third decade of the World Wide Web (WWW), the textual revolution has seen a tremendous change in the availability of online information. Finding information for just about any need has never been more automatic—just a keystroke or mouse click away. While the digitization and creation of textual materials continue at light speed, the ability to navigate, mine, or casually browse through documents too numerous to read (or print) lags far behind.
What approaches to text mining are available to efficiently organize, classify, label, and extract relevant information for today’s information-centric users? What algorithms and software should be used to detect emerging trends from both text streams and archives? These are just a few of the important questions addressed at the Text Mining Workshop held on April 28, 2007, in Minneapolis, MN. This workshop, the fifth in a series of annual workshops on text mining, was held on the final day of the Seventh SIAM International Conference on Data Mining (April 26–28, 2007).
In Part I (Clustering), Howland and Park update their work on cluster-preserving dimension reduction methods for efficient text classification. Likewise, Senellart and Blondel revisit thesaurus construction using similarity measures between vertices in graphs. Both of these chapters were part of the first edition of this book (based on a SIAM text mining workshop held in April 2002). The next three chapters are completely new contributions. Zeimpekis and Gallopoulos implement and evaluate several clustering schemes that combine partitioning and hierarchical algorithms. Kogan, Nicholas, and Wiacek look at the hybrid clustering of large, high-dimensional data. AlSumait and Domeniconi round out this topic area with an examination of local semantic kernels for the clustering of text documents.
In Part II (Document Retrieval and Representation), Kobayashi and Aono revise their first edition chapter on the importance of detecting and interpreting minor document clusters using a vector space model based on principal component analysis (PCA) rather than the popular latent semantic indexing (LSI) method. This is followed by Xia, Xing, Qi, and Li's chapter on applications of semidefinite programming in XML document classification.
The editors would like to thank Murray Browne of the University of Tennessee and Catherine Brett of Springer UK for coordinating the management of manuscripts among the authors, editors, and the publisher.
Michael W. Berry and Malu Castellanos
Knoxville, TN and Palo Alto, CA
August 2007
Contents

2 Automatic Discovery of Similar Words
Pierre Senellart, Vincent D. Blondel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

6 Vector Space Models for Search and Cluster Mining
Mei Kobayashi, Masaki Aono . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

7 Applications of Semidefinite Programming in XML Document Classification
Zhonghang Xia, Guangming Xing, Houduo Qi, Qi Li . . . . . . . . . . . . . . . . . . . . . 129

9 Spam Filtering Based on Latent Semantic Indexing
Wilfried N. Gansterer, Andreas G.K. Janecek, Robert Neumayer . . . . . . . . . . . . 165

Part IV Anomaly Detection

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
Contributors
Michael W. Berry
Department of Electrical Engineering and Computer Science University of Tennessee
203 Claxton Complex
Knoxville, TN 37996-3450
Email: berry@eecs.utk.edu
Homepage: http://www.cs.utk.edu/˜berry
Santanu Das
Intelligent Systems Division
NASA Ames Research Center
Moffett Field, CA 94035
Email: sdas@email.arc.nasa.gov
Carlotta Domeniconi
Department of Computer Science
George Mason University
4400 University Drive MSN 4A4
Fairfax, VA 22030
Email: carlotta@ise.gmu.edu
Homepage: http://www.ise.gmu.edu/˜carlotta
Michael R. Horvath
Department of Computer Science
Wake Forest University
P.O. Box 7311
Winston-Salem, NC 27109
Email: horvmr5@wfu.edu
Peg Howland
Department of Mathematics and Statistics
Utah State University
3900 Old Main Hill
Logan, UT 84322-3900
Email: peg.howland@usu.edu
Homepage: http://www.math.usu.edu/˜howland
Jacob Kogan
Department of Mathematics and Statistics
University of Maryland, Baltimore County
1000 Hilltop Circle
Baltimore, MD 21250
Email: kogan@math.umbc.edu
Homepage: http://www.math.umbc.edu/˜kogan
Christopher V. Kopek
Department of Computer Science
Wake Forest University
P.O. Box 7311
Winston-Salem, NC 27109
Email: kopecv5@wfu.edu
Charles Nicholas
Department of Computer Science and Electrical Engineering
University of Maryland, Baltimore County
1000 Hilltop Circle
Baltimore, MD 21250
Email: nicholas@cs.umbc.edu
Homepage: http://www.cs.umbc.edu/˜nicholas
Farhad Oroumchian
College of Information Technology
University of Wollongong in Dubai
P.O. Box 20183, Dubai
U.A.E.
Houduo Qi
Department of Mathematics
University of Southampton, Highfield
Southampton SO17 1BJ, UK
Email: hdqi@soton.ac.uk
Homepage: http://www.personal.soton.ac.uk/hdqi

Narjes Sharif Razavian
Department of Electrical and Computer Engineering
University of Tehran
P.O. Box 14395-515, Tehran
Iran
Email: n.razavian@ece.ut.ac.ir
Hassan Seyed Razi
Department of Electrical and Computer Engineering
University of Tehran
P.O. Box 14395-515, Tehran
Iran
Email: seyedraz@ece.ut.ac.ir
Mike Wiacek
Google, Inc.
1600 Amphitheatre Parkway
Mountain View, CA 94043
Email: mjwiacek@google.com
Cluster-Preserving Dimension Reduction Methods for Document Classification
Peg Howland and Haesun Park
Dimension reduction is commonly based on rank reduction by the truncated singular value decomposition (SVD). For any matrix $A \in \mathbb{R}^{m \times n}$, its SVD can be defined as

$$A = U \Sigma V^T, \qquad (1.1)$$

where $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are orthogonal, $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_p) \in \mathbb{R}^{m \times n}$ with $p = \min(m, n)$, and the singular values are ordered as $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_p \ge 0$ [GV96, Bjö96]. Denoting the columns of $U$, or left singular vectors, by $u_i$, the columns of $V$, or right singular vectors, by $v_i$, and the rank of $A$ by $q$, we write
$$A = \sum_{i=1}^{q} \sigma_i u_i v_i^T. \qquad (1.2)$$

Truncating this sum after $l < q$ terms,

$$A \approx \sum_{i=1}^{l} \sigma_i u_i v_i^T,$$

provides the rank-$l$ approximation that is closest to the data matrix in the $L_2$ norm or Frobenius norm [GV96]. This is the main tool in principal component analysis (PCA) [DHS01], as well as in latent semantic indexing (LSI) [DDF+90, BDO95] of documents.
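As a concrete illustration, here is a minimal NumPy sketch (the small term-document matrix and the choice of $l = 2$ are hypothetical, chosen only for illustration) that forms the truncated SVD and the resulting rank-$l$ approximation:

    import numpy as np

    # Hypothetical term-document matrix: rows are terms, columns are documents.
    A = np.array([[2., 0., 1., 0.],
                  [1., 1., 0., 0.],
                  [0., 2., 0., 1.],
                  [0., 0., 1., 2.]])

    # Thin SVD: A = U @ diag(s) @ Vt, singular values returned in descending order.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    l = 2  # target rank of the reduced representation
    A_l = U[:, :l] @ np.diag(s[:l]) @ Vt[:l, :]

    # A_l is the closest rank-l matrix to A in the L2 and Frobenius norms.
    print(np.linalg.norm(A - A_l, 'fro'))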
If the data form clusters in the full dimension, the goal may change from finding the best lower dimensional representation to finding the lower dimensional representation that best preserves this cluster structure. That is, even after dimension reduction, a new document vector should be classified with the appropriate cluster. Assuming that the columns of A are grouped into clusters, rather than treating each column equally regardless of its membership in a specific cluster, as is done with the SVD, the dimension reduction methods we will discuss attempt to preserve this information. This is important in information retrieval, since the reduced representation itself will be used extensively in further processing of data.
Given a term-document matrix

$$A = [a_1 \mid a_2 \mid \cdots \mid a_n] \in \mathbb{R}^{m \times n},$$

we seek a lower rank approximation

$$A \approx BY,$$

where both $B \in \mathbb{R}^{m \times l}$ with $\mathrm{rank}(B) = l$ and $Y \in \mathbb{R}^{l \times n}$ with $\mathrm{rank}(Y) = l$ are to be found. This lower rank approximation is not unique, since for any nonsingular matrix $Z \in \mathbb{R}^{l \times l}$,

$$A \approx BY = (BZ)(Z^{-1}Y),$$

where $\mathrm{rank}(BZ) = l$ and $\mathrm{rank}(Z^{-1}Y) = l$.
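This non-uniqueness is easy to verify numerically; in the following sketch (arbitrary random factors B and Y and a nonsingular Z, used purely for illustration), (BZ)(Z^{-1}Y) reproduces BY exactly:

    import numpy as np

    m, n, l = 6, 5, 2
    B = np.random.rand(m, l)                   # rank-l factor B
    Y = np.random.rand(l, n)                   # rank-l factor Y
    Z = np.random.rand(l, l) + l * np.eye(l)   # diagonally dominant, hence nonsingular

    # The factorization BY is not unique: (BZ)(Z^{-1}Y) gives the same product.
    print(np.allclose(B @ Y, (B @ Z) @ (np.linalg.inv(Z) @ Y)))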
In general, starting with $A_1 = A$, and choosing $x_k$ and $y_k$ such that $\omega_k = y_k^T A_k x_k \ne 0$, the Wedderburn formula generates the sequence

$$A_{k+1} = A_k - \omega_k^{-1} (A_k x_k)(y_k^T A_k).$$

Adding up all the rank-one updates, factoring into matrix outer product form, and truncating gives an approximation $A \approx BY$. The question becomes: what are good choices for $x_k$ and $y_k$?
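The following NumPy sketch runs the Wedderburn sequence under one simple, hypothetical choice: $x_k$ and $y_k$ are taken to be coordinate vectors selecting the entry of largest magnitude in $A_k$. Accumulating the scaled rank-one terms yields the factors B and Y:

    import numpy as np

    def wedderburn(A, steps, choose_xy):
        """Run `steps` Wedderburn rank-one reductions and return B, Y with A ~ B @ Y."""
        Ak = A.astype(float).copy()
        B_cols, Y_rows = [], []
        for _ in range(steps):
            x, y = choose_xy(Ak)                 # must give omega = y^T A_k x != 0
            omega = float(y @ Ak @ x)
            u, v = Ak @ x, y @ Ak                # rank-one pieces A_k x and y^T A_k
            B_cols.append(u / omega)
            Y_rows.append(v)
            Ak = Ak - np.outer(u, v) / omega     # Wedderburn update A_{k+1}
        return np.column_stack(B_cols), np.vstack(Y_rows)

    def largest_entry(Ak):
        """Pick coordinate vectors at the entry of largest magnitude (a simple pivot rule)."""
        i, j = np.unravel_index(np.argmax(np.abs(Ak)), Ak.shape)
        x = np.zeros(Ak.shape[1]); x[j] = 1.0
        y = np.zeros(Ak.shape[0]); y[i] = 1.0
        return x, y

    A = np.random.rand(6, 5)
    B, Y = wedderburn(A, steps=3, choose_xy=largest_entry)
    print(np.linalg.norm(A - B @ Y))             # approximation error after 3 updates

Different choices of $x_k$ and $y_k$ lead to different factorizations, which is exactly the question posed above.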
The goal of linear discriminant analysis (LDA) is to combine features of the original data in a way that most effectively discriminates between classes. With an appropriate extension, it can be applied to the goal of reducing the dimension of a term-document matrix in a way that most effectively preserves its cluster structure. That is, we want to find a linear transformation $G$ whose transpose maps each document vector $a$ in the $m$-dimensional space to a vector $y$ in the $l$-dimensional space ($l \ll m$):

$$G^T : a \in \mathbb{R}^{m \times 1} \to y \in \mathbb{R}^{l \times 1}.$$
Assuming that the given data are already clustered, we seek a transformation that optimally preserves this cluster structure in the reduced dimensional space.
Suppose the columns of $A$ are partitioned into $k$ clusters. Let $N_i$ denote the set of column indices belonging to cluster $i$, with $n_i = |N_i|$, so that the centroid of cluster $i$ and the global centroid are

$$c^{(i)} = \frac{1}{n_i} \sum_{j \in N_i} a_j \quad \text{and} \quad c = \frac{1}{n} \sum_{j=1}^{n} a_j. \qquad (1.5)$$

The within-cluster, between-cluster, and mixture scatter matrices are then defined as

$$S_w = \sum_{i=1}^{k} \sum_{j \in N_i} (a_j - c^{(i)})(a_j - c^{(i)})^T,$$

$$S_b = \sum_{i=1}^{k} \sum_{j \in N_i} (c^{(i)} - c)(c^{(i)} - c)^T,$$

and

$$S_m = \sum_{j=1}^{n} (a_j - c)(a_j - c)^T,$$

respectively. The scatter matrices have the relationship [JD88]

$$S_m = S_w + S_b. \qquad (1.6)$$
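As a sanity check, the sketch below (random data and hypothetical cluster labels, using NumPy) builds the three scatter matrices directly from their definitions and verifies the relationship (1.6):

    import numpy as np

    # Hypothetical clustered term-document matrix: columns are documents.
    A = np.random.rand(5, 8)
    labels = np.array([0, 0, 0, 1, 1, 2, 2, 2])   # cluster index of each column

    c = A.mean(axis=1, keepdims=True)              # global centroid
    S_w = np.zeros((A.shape[0], A.shape[0]))
    S_b = np.zeros_like(S_w)
    for i in np.unique(labels):
        A_i = A[:, labels == i]
        c_i = A_i.mean(axis=1, keepdims=True)      # centroid of cluster i
        D = A_i - c_i
        S_w += D @ D.T                             # within-cluster scatter
        S_b += A_i.shape[1] * (c_i - c) @ (c_i - c).T   # between-cluster scatter

    S_m = (A - c) @ (A - c).T                      # mixture scatter

    print(np.allclose(S_m, S_w + S_b))             # relationship (1.6) holds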
Here,

$$\mathrm{trace}(S_w) = \sum_{i=1}^{k} \sum_{j \in N_i} (a_j - c^{(i)})^T (a_j - c^{(i)}) = \sum_{i=1}^{k} \sum_{j \in N_i} \|a_j - c^{(i)}\|^2$$

measures the closeness of the columns within the clusters, and

$$\mathrm{trace}(S_b) = \sum_{i=1}^{k} \sum_{j \in N_i} (c^{(i)} - c)^T (c^{(i)} - c) = \sum_{i=1}^{k} \sum_{j \in N_i} \|c^{(i)} - c\|^2$$

measures the separation between clusters. A transformation $G^T$ that preserves the cluster structure should therefore simultaneously solve

$$\max_G \ \mathrm{trace}(G^T S_b G) \quad \text{and} \quad \min_G \ \mathrm{trace}(G^T S_w G). \qquad (1.7)$$
Assuming the matrix $S_w$ is nonsingular, classical LDA approximates this simultaneous trace optimization by finding a transformation $G$ that maximizes

$$\mathrm{trace}\bigl((G^T S_w G)^{-1} G^T S_b G\bigr).$$
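Continuing the scatter-matrix sketch above (and assuming, as in the text, that $S_w$ is nonsingular), one common way to compute such a $G$ is from the leading eigenvectors of the generalized eigenproblem $S_b g = \lambda S_w g$; SciPy's symmetric solver is used here for convenience:

    import numpy as np
    from scipy.linalg import eigh

    def lda_transform(S_w, S_b, l):
        """Return G whose columns are the l leading eigenvectors of the pencil (S_b, S_w),
        maximizing trace((G^T S_w G)^{-1} G^T S_b G)."""
        eigvals, eigvecs = eigh(S_b, S_w)      # generalized symmetric eigenproblem, ascending order
        return eigvecs[:, ::-1][:, :l]         # keep the l leading eigenvectors

    # Using S_w, S_b, and A from the scatter-matrix sketch above:
    G = lda_transform(S_w, S_b, l=2)
    Y = G.T @ A                                # documents in the reduced l-dimensional space
    print(Y.shape)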