

The area under the curve is given by

$$\int (1 - F(u))\,dG(u) = 1 - \int F(u)\,dG(u) \qquad (8.10)$$

value than a randomly chosen class $\omega_1$ pattern is $\int G(u)\,f(u)\,du$. This is the same as the definition (8.11) for the area under the ROC curve.
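
The equivalence can be checked with one integration by parts. The sketch below assumes that $f$ is the density of $F$, so that $dF(u) = f(u)\,du$, and that $F$ and $G$ both rise from 0 to 1 over the range of integration; both are assumptions made for this sketch.

$$\int (1 - F(u))\,dG(u) = \Big[(1 - F(u))\,G(u)\Big] + \int G(u)\,f(u)\,du = \int G(u)\,f(u)\,du$$

The bracketed boundary term vanishes because $G = 0$ at the lower limit and $1 - F = 0$ at the upper limit.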

Calculating the area under the ROC curve

The area under the ROC curve is easily calculated by applying the classification rule

$$\sum_{i=1}^{n_1} (r_i - i) = \sum_{i=1}^{n_1} r_i - \sum_{i=1}^{n_1} i = S_0 - \frac{1}{2} n_1 (n_1 + 1)$$

where $S_0$ is the sum of the ranks of the class $\omega_1$ test patterns. Since there are $n_1 n_2$

$$\hat{A} = \frac{1}{n_1 n_2} \left\{ S_0 - \frac{1}{2} n_1 (n_1 + 1) \right\}$$

been obtained using the rankings alone and has not used threshold values to calculate it.
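
As an illustration, the rank-based estimate can be sketched in a few lines of Python. The function names and the example scores are hypothetical, and the sketch assumes there are no tied scores (ties would need mid-ranks). It computes the estimate from the rank sum $S_0$ and compares it with a direct count of the pairs in which the class $\omega_2$ score is smaller than the class $\omega_1$ score.

```python
def rank_auc(scores_w1, scores_w2):
    """Estimate the area under the ROC curve from ranks alone.

    scores_w1: classifier outputs for the class omega_1 test patterns
    scores_w2: classifier outputs for the class omega_2 test patterns
    Assumes no tied scores.
    """
    n1, n2 = len(scores_w1), len(scores_w2)
    # Pool the scores, remembering which class each one came from.
    pooled = [(s, 1) for s in scores_w1] + [(s, 2) for s in scores_w2]
    pooled.sort(key=lambda pair: pair[0])  # increasing order, rank 1 = smallest
    # S0: sum of the 1-based ranks of the class omega_1 patterns.
    s0 = sum(rank for rank, (_, label) in enumerate(pooled, start=1)
             if label == 1)
    return (s0 - 0.5 * n1 * (n1 + 1)) / (n1 * n2)


def pairwise_auc(scores_w1, scores_w2):
    """Fraction of (omega_1, omega_2) pairs with the omega_2 score smaller."""
    wins = sum(1 for p in scores_w1 for q in scores_w2 if q < p)
    return wins / (len(scores_w1) * len(scores_w2))


# Made-up scores, for illustration only.
w1 = [0.9, 0.8, 0.35, 0.6]
w2 = [0.1, 0.4, 0.7]
print(rank_auc(w1, w2), pairwise_auc(w1, w2))
```

For these scores the class $\omega_1$ ranks are 2, 4, 6 and 7, so $S_0 = 19$ and both functions give $9/12 = 0.75$. No threshold is used at any point.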

The standard deviation of the statistic $\hat{A}$ is (Hand and Till, 2001)

where

$$\hat{\theta} = \frac{S_0}{n_1 n_2}, \qquad Q_0 = \frac{1}{6}(2n_1 + 2n_2 + 1)(n_1 + n_2) - Q_1, \qquad Q_1 = \sum_{j=1}^{n_1} (r_j - 1)^2$$

An alternative approach, considered by Bradley (1997), is to construct an estimate of the ROC curve directly for specific classifiers by varying a threshold and then to use an integration rule (for example, the trapezium rule) to obtain an estimate of the area beneath the curve.
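
The threshold-based alternative can be sketched as follows. This is a minimal illustration of the general idea, not Bradley's procedure; the function names and example scores are hypothetical, and a pattern is assigned to class $\omega_1$ when its score is at least the threshold.

```python
def roc_points(scores_w1, scores_w2):
    """ROC points (false positive rate, true positive rate), one per threshold.

    A pattern is assigned to class omega_1 when its score >= threshold.
    """
    thresholds = sorted(set(scores_w1) | set(scores_w2), reverse=True)
    points = [(0.0, 0.0)]  # threshold above every score: nothing assigned
    for t in thresholds:
        tpr = sum(s >= t for s in scores_w1) / len(scores_w1)
        fpr = sum(s >= t for s in scores_w2) / len(scores_w2)
        points.append((fpr, tpr))
    return points  # the last point is (1.0, 1.0)


def trapezium_area(points):
    """Integrate a piecewise-linear curve with the trapezium rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Made-up scores, for illustration only.
w1 = [0.9, 0.8, 0.35, 0.6]
w2 = [0.1, 0.4, 0.7]
area = trapezium_area(roc_points(w1, w2))
print(area)
```

With these untied scores the estimated area is 0.75 up to floating-point rounding, which is also the fraction of ($\omega_1$, $\omega_2$) pairs in which the $\omega_2$ score is the smaller one.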

The data

There are six data sets comprising measurements on two classes:

1. Cervical cancer. Six features, 117 patterns; classes are normal and abnormal cervical cell nuclei.

6. Heart disease 2. Eleven features, 261 patterns; classes are heart disease present and heart disease absent.

Incomplete patterns (patterns for which measurements on some features are missing) were removed from the data sets.