Fast Perceptron Decision Tree Learning from Evolving Data Streams

In the data stream model, data arrive at high speed, and algorithms that process them must do so under very strict constraints of space and time. Consequently, data streams pose several challenges for data mining algorithm design. First, algorithms must make use of limited resources (time and memory). Second, they must deal with data whose nature or distribution changes over time.

An important issue in data stream mining is the cost of performing the learning and prediction process. As an example, it is possible to buy time and space usage from cloud computing providers [25], where several rental cost options exist.

2 Related Work

Standard decision tree learners such as ID3, C4.5, and CART [18, 21] assume that all training examples can be stored simultaneously in main memory, and are thus severely limited in the number of examples they can learn from. In particular, they are not applicable to data streams, where potentially there is no bound on the number of examples and these arrive sequentially.

The Hoeffding bound states that, with probability 1 − δ, the true mean of a random variable of range R does not differ from the estimated mean after n independent observations by more than

ϵ = √(R² ln(1/δ) / (2n))

A theoretically appealing feature of Hoeffding Trees not shared by other incremental decision tree learners (ID4 [22], ID5 [24]) is that they come with sound guarantees of performance. Using the Hoeffding bound one can show that their output is asymptotically nearly identical to that of a non-incremental learner using infinitely many examples. CVFDT [14] is an extension of the Hoeffding Tree to evolving data streams, but does not provide theoretical guarantees.
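The bound above is simple to evaluate; a minimal sketch (the function name and example values are illustrative, not from the paper):

```python
import math

def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """With probability 1 - delta, the true mean of a random variable
    with range `value_range` lies within this epsilon of the mean of
    n independent observations."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

# The bound shrinks as more examples arrive, which is why a split chosen
# on a large enough sample almost surely matches the batch learner's choice.
eps_small_n = hoeffding_bound(value_range=1.0, delta=1e-7, n=200)
eps_large_n = hoeffding_bound(value_range=1.0, delta=1e-7, n=20000)
assert eps_large_n < eps_small_n
```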

Outside the data stream world, there is prior work on using perceptrons or similar classifiers in decision trees. Utgoff [24] presented the Perceptron Decision Tree as a decision tree in which each leaf node uses the perceptron as a classifier of the input instances. Bennett et al. [2] showed that maximizing margins in perceptron decision trees can be useful to combat overfitting. Zhou [26] proposed Hybrid Decision Trees as a hybrid learning approach combining decision trees with neural networks.

In this section, we present the perceptron learner we use, and the Hoeffding Perceptron Tree based on it. We also consider bagging trees with change detection.

3.1 Perceptron Learning

∇J = −Σᵢ (yᵢ − h_w(xᵢ)) ∇h_w(xᵢ)

where for the sigmoid hypothesis h_w(x) = σ(wᵀx) = 1/(1 + e^(−wᵀx)) we have ∇h_w(xᵢ) = h_w(xᵢ)(1 − h_w(xᵢ)) xᵢ, so the weight update becomes

w′ = w + η Σᵢ (yᵢ − h_w(xᵢ)) h_w(xᵢ)(1 − h_w(xᵢ)) xᵢ

Perceptron Learning(Stream, η)

1 for each class
2 do Perceptron Learning(Stream, class, η)
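The gradient-descent update and the one-perceptron-per-class scheme above can be sketched as a small online learner; a minimal sketch, assuming dense NumPy feature vectors (the class name and interface are illustrative):

```python
import numpy as np

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + np.exp(-z))

class SigmoidPerceptron:
    """Online perceptron with sigmoid hypothesis h_w(x) = sigma(w . x),
    trained by stochastic gradient descent on the squared loss.
    In the multiclass setting, one such perceptron is trained per class,
    as in the pseudocode above."""

    def __init__(self, n_features: int, learning_rate: float = 1.0):
        self.w = np.zeros(n_features)
        self.eta = learning_rate

    def predict_raw(self, x: np.ndarray) -> float:
        return sigmoid(self.w @ x)

    def update(self, x: np.ndarray, y: float) -> None:
        # Squared-loss gradient step for a sigmoid hypothesis:
        # w <- w + eta * (y - h) * h * (1 - h) * x
        h = self.predict_raw(x)
        self.w += self.eta * (y - h) * h * (1.0 - h) * x
```

Processing each example exactly once in the `update` call is what keeps the learner within the single-pass, constant-time-per-item constraints of the stream setting.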

Hoeffding trees [6] are state-of-the-art in classification for data streams and they perform prediction by choosing the majority class at each leaf. Their predictive accuracy can be increased by adding naive Bayes models at the leaves of the trees. However, Holmes et al. [12] identified situations where the naive Bayes method outperforms the standard Hoeffding tree initially but is eventually overtaken. They propose a hybrid adaptive method that generally outperforms the two original prediction methods for both simple and complex concepts. We call this method Hoeffding Naive Bayes Tree (hnbt). This method works by performing a naive Bayes prediction per training instance, and comparing its prediction with the majority class. Counts are stored to measure how many times the naive Bayes prediction gets the true class correct as compared to the majority class.

When performing a prediction on a test instance, the leaf will only return a naive Bayes prediction if it has been more accurate overall than the majority class, otherwise it resorts to a majority class prediction.
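The adaptive rule of hnbt reduces to two counters per leaf; a minimal sketch, where `nb_predict` and `majority_class` are hypothetical stand-ins for the leaf's actual naive Bayes model and class counts:

```python
class AdaptiveLeaf:
    """Sketch of the hnbt prediction rule at a leaf: score naive Bayes
    and the majority class on each training instance before learning
    from it, and answer test queries with whichever predictor has been
    more accurate so far."""

    def __init__(self, nb_predict, majority_class):
        self.nb_predict = nb_predict          # callable: x -> class
        self.majority_class = majority_class  # callable: () -> class
        self.nb_correct = 0
        self.mc_correct = 0

    def learn(self, x, true_class):
        # Test both predictors on the instance before training on it.
        if self.nb_predict(x) == true_class:
            self.nb_correct += 1
        if self.majority_class() == true_class:
            self.mc_correct += 1
        # ... the naive Bayes model and class counts would be updated here ...

    def predict(self, x):
        if self.nb_correct > self.mc_correct:
            return self.nb_predict(x)
        return self.majority_class()
```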

ADWIN is parameter- and assumption-free in the sense that it automatically detects and adapts to the current rate of change. Its only parameter is a confidence bound δ, indicating how confident we want to be in the algorithm’s output, a parameter inherent to all algorithms dealing with random processes.

Also important for our purposes, ADWIN does not maintain the window explicitly, but compresses it using a variant of the exponential histogram technique. This means that it keeps a window of length W using only O(log W) memory and O(log W) processing time per item.
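The compression idea can be illustrated in a few lines; a much-simplified sketch of the exponential histogram that ADWIN builds on, which only shows the bucket merging (real ADWIN also expires old buckets and runs its change test):

```python
class ExponentialHistogram:
    """Summarize the ones in a stream of bits with buckets whose sizes
    are powers of two, keeping at most k buckets per size, so a window
    of length W needs only O(log W) buckets."""

    def __init__(self, k: int = 2):
        self.k = k
        self.buckets = []  # bucket sizes (counts of ones), oldest first

    def add(self, bit: int) -> None:
        if not bit:
            return
        self.buckets.append(1)
        # Whenever more than k buckets share a size, merge the two
        # oldest of that size into one bucket of double the size.
        size = 1
        while self.buckets.count(size) > self.k:
            i = self.buckets.index(size)   # oldest bucket of this size
            del self.buckets[i]
            self.buckets[i] = size * 2     # absorbed into the next oldest
            size *= 2

    def total(self) -> int:
        return sum(self.buckets)
```

Merging preserves the total count exactly; the approximation in the real structure comes only from expiring the oldest (largest) bucket when the window slides.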

This framework models a concept drift event as a weighted combination of two pure distributions that characterize the target concepts before and after the drift. It defines the probability that a new instance of the stream belongs to the new concept after the drift using the sigmoid function.

Definition 1. Given two data streams a, b, we define c = a ⊕^W_{t0} b as the data stream built by joining the two data streams a and b, where t0 is the point of change, W is the length of change, Pr[c(t) = b(t)] = 1/(1 + e^(−4(t−t0)/W)) and Pr[c(t) = a(t)] = 1 − Pr[c(t) = b(t)].
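The join operator of Definition 1 is straightforward to implement; a minimal sketch, where `stream_a` and `stream_b` are any iterables of instances (the function names are illustrative):

```python
import math
import random

def prob_new_concept(t: int, t0: int, width: int) -> float:
    """Probability from Definition 1 that the joined stream emits an
    instance of the new concept b at time t."""
    return 1.0 / (1.0 + math.exp(-4.0 * (t - t0) / width))

def joined_stream(stream_a, stream_b, t0: int, width: int, seed: int = 0):
    """Emit from b with the sigmoid probability above, else from a.
    Well before t0 almost everything comes from a; well after t0,
    almost everything comes from b."""
    rng = random.Random(seed)
    for t, (xa, xb) in enumerate(zip(stream_a, stream_b)):
        yield xb if rng.random() < prob_new_concept(t, t0, width) else xa
```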

(((SEA9 ⊕^W_{t0} SEA8) ⊕^W_{2t0} SEA7) ⊕^W_{3t0} SEA9.5)

Rotating Hyperplane. This data was used as a testbed for CVFDT versus VFDT in [14]. Examples for which Σ_{i=1}^{d} w_i x_i ≥ w_0 are labeled positive, and examples for which Σ_{i=1}^{d} w_i x_i < w_0 are labeled negative. Hyperplanes are useful for simulating time-changing concepts, because we can change the orientation and position of the hyperplane in a smooth manner by changing the weights.
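A minimal sketch of such a generator, assuming uniform features, Gaussian weight perturbations, and the threshold w_0 = ½ Σ w_i to keep the classes roughly balanced (these choices are assumptions of the sketch, not the exact generator of [14]):

```python
import numpy as np

def hyperplane_stream(d: int, n: int, drift: float = 0.001, seed: int = 1):
    """Yield (x, y) pairs where y = 1 iff sum_i w_i x_i >= w_0.
    Perturbing the weights a little at each step rotates the
    hyperplane smoothly, i.e. changes the concept gradually."""
    rng = np.random.default_rng(seed)
    w = rng.random(d)
    for _ in range(n):
        w0 = 0.5 * w.sum()                     # keeps classes ~balanced
        x = rng.random(d)
        y = int(w @ x >= w0)
        yield x, y
        w = w + drift * rng.standard_normal(d)  # smooth concept change
```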

4.2 Real-World Data

The UCI machine learning repository [1] contains some real-world benchmark data for evaluating machine learning techniques. We consider three of the largest: Forest Covertype, Poker-Hand, and Electricity.

These datasets are small compared to the synthetic datasets we consider. Another important fact is that we do not know when drift occurs, or indeed whether there is any drift. We may simulate concept drift by joining the three datasets, merging attributes, and supposing that each dataset corresponds to a different concept:

CovPokElec = (CoverType ⊕^{5,000}_{581,012} Poker) ⊕^{5,000}_{1,000,000} ELEC

[Figure: accuracy (%), time (sec.) and memory curves for ht, htnb, htp and htnbp]

Fig. 2. Accuracy, runtime and memory on the LED data with three concept drifts.

4.3 Results

We use the datasets explained in the previous sections for evaluation. The experiments were performed on a 3 GHz Intel 64-bit machine with 2 GB of memory. The evaluation methodology used was Interleaved Test-Then-Train on 10 runs: every example was used for testing the model before using it to train. This interleaved test followed by train procedure was carried out on 10 million examples from the hyperplane and RandomRBF datasets, and one million examples from the SEA dataset.
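The evaluation loop itself is simple; a minimal sketch, assuming a model exposing `predict(x)` and `learn(x, y)` (a hypothetical interface):

```python
def interleaved_test_then_train(model, stream):
    """Prequential (Interleaved Test-Then-Train) evaluation: each
    example first tests the model and then trains it, so accuracy is
    measured only on instances the model has not yet seen."""
    correct = 0
    total = 0
    for x, y in stream:
        if model.predict(x) == y:
            correct += 1
        total += 1
        model.learn(x, y)
    return correct / max(total, 1)
```

Because every example is consumed exactly once, this methodology needs no held-out test set and fits the single-pass constraint of stream mining.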

[Table 2: time (sec.), accuracy (%) and memory of hnbt, hpt and hnbpt on RBF(0,0), RBF(50,0.001), RBF(10,0.001), RBF(50,0.0001), RBF(10,0.0001), HYP(10,0.001), HYP(10,0.0001), SEA(50), SEA(50000), LED(50000), CovType, Poker, Electricity and CovPokElec]

Table 1 shows the accuracy, speed and memory usage of a naive Bayes learner, a perceptron with η = 1, and a classic Hoeffding Tree with majority class learning at the leaves. As naive Bayes uses a Gaussian distribution to model numeric attributes, with different variances for each class, it does not yield a linear separator as the perceptron does. In general terms, we see that the perceptron and the Hoeffding Tree are the fastest methods, but the Hoeffding Tree needs more memory. Comparing using RAM-Hours, naive Bayes needs 3.5 times more RAM-Hours than the perceptron, and the Hoeffding Tree needs 89 times more RAM-Hours than naive Bayes. Note that using η = 1 we obtain a very fast adaptive method, but for some datasets, like Poker, the results are worse than those obtained using a more conservative rate like η = 0.01. Choosing an optimal η remains an open problem for further research.
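The RAM-Hours measure used in these comparisons is simple to compute, assuming the usual definition that one GB of RAM deployed for one hour costs one RAM-Hour; a minimal sketch:

```python
def ram_hours(memory_gb: float, runtime_hours: float) -> float:
    """One GB of RAM held for one hour = one RAM-Hour, mirroring how
    cloud providers bill rented time and space."""
    return memory_gb * runtime_hours

# A model holding 0.5 GB for 2 hours costs the same as one
# holding 1 GB for 1 hour:
assert ram_hours(0.5, 2.0) == ram_hours(1.0, 1.0) == 1.0
```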

Table 2 shows, for the Hoeffding tree models, their accuracy, speed and memory. We see that hpt is much faster than hnbt, and more accurate on several streams. hnbpt is more accurate than hnbt and hpt, but it needs more time. Comparing RAM-Hours, hpt needs 1.11 times more RAM-Hours than hnbt and 2.54 times more than ht, and hnbpt needs 1.37 times more than hpt.

Concept drift is handled well by the proposed ADWIN bagging algorithms, excluding the poor performance of the hpt-based classifier on CovPokElec, which is due to the nature of the Poker dataset. Decision trees alone do not deal as well with evolving streaming data, as they have limited capability of adaptation.

A tradeoff between RAM-Hours and accuracy could be to use single perceptrons when resources are scarce, and ADWIN bagging methods when more accuracy is needed. Note that to gain an increase of 24% in accuracy, we have to increase the RAM-Hours needed by more than 10,000 times; this is the difference in magnitude between the RAM-Hours needed for a single perceptron and for a more accurate ADWIN bagging method.

1. A. Asuncion and D. Newman. UCI machine learning repository, 2007.

2. K. Bennett, N. Cristianini, J. Shawe-Taylor, and D. Wu. Enlarging the margins in perceptron decision trees. Machine Learning, 41(3):295–313, 2000.

7. E. Frank, Y. Wang, S. Inglis, G. Holmes, and I. H. Witten. Using model trees for classification. Machine Learning, 32(1):63–76, 1998.

8. J. Gama. On Combining Classification Algorithms. VDM Verlag, 2009.

13. G. Holmes, R. Kirkby, and B. Pfahringer. MOA: Massive Online Analysis. http://sourceforge.net/projects/moa-datastream. 2007.

14. G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In KDD, pages 97–106, 2001.

20. N. C. Oza and S. J. Russell. Experimental comparisons of online and batch versions of bagging and boosting. In KDD, pages 359–364, 2001.

21. S. R. Safavian and D. Landgrebe. A survey of decision tree classifier methodology. Systems, Man and Cybernetics, IEEE Transactions on, 21(3):660–674, 1991.

26. Z. Zhou and Z. Chen. Hybrid decision tree. Knowledge-Based Systems, 15(8):515–528, 2002.
