Classification is a two-phase process:
First, a classification model is computed (trained).
Second, the model is tested.
The Test options on the Classify tab tell Weka where to get the testing data:
1 Use the whole training data set (once)
2 Use a separate data set
3 Use a number of partially overlapping subsets (folds) and report the
average of the results
4 Divide the input data using the identified percentage for training and
the rest for testing.
Option 1 is the fastest because only a single model is built and it is
tested on data it has already seen. Option 3 gives the most reliable
results but takes the longest: 10-fold cross-validation is the gold standard.
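To make option 3 concrete, here is a minimal sketch of what k-fold
cross-validation does. This is an illustration only, not Weka's actual
implementation or API: `train_fn` and `test_fn` are hypothetical placeholders
for whatever builds and evaluates your model.

```python
# Sketch of k-fold cross-validation: split the data into k folds,
# hold out each fold in turn for testing, train on the rest, and
# report the average of the k test results.
def cross_validate(data, train_fn, test_fn, k=10):
    n = len(data)
    results = []
    for i in range(k):
        # Fold i is the test set; the remaining k-1 folds are training data.
        lo, hi = i * n // k, (i + 1) * n // k
        test = data[lo:hi]
        train = data[:lo] + data[hi:]
        model = train_fn(train)
        results.append(test_fn(model, test))
    # Report the average over the k held-out folds.
    return sum(results) / k
```

Note that k models get built, which is why this option takes roughly k times
as long as testing on the training set once.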
So I can be definite, here is the output from one of my classifications:
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure Class
0.871 0.736 0.659 0.871 0.75 B0
0.264 0.129 0.556 0.264 0.358 B1
=== Confusion Matrix ===
a b <-- classified as
533 79 | a = B0
276 99 | b = B1
TP Rate is the True Positive rate: the fraction of instances of the target
class that were classified correctly (in my case 0.871 for B0, and 0.264 for
B1). These fractions correspond to the upper-left and lower-right counts in
the confusion matrix, 533 and 99, respectively.
0.871 is the result of 533 / (533 + 79) from the first row of the confusion
matrix.
FP Rate is the False Positive rate: the fraction of instances of the other
class that were incorrectly classified as the target class (in my case 0.736
for B0 and 0.129 for B1).
0.736 is the result of 276 / (276 + 99) from the second row of the confusion
matrix.
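The arithmetic above can be checked directly from the confusion matrix. This
short snippet recomputes all four rates from the counts in my output:

```python
# Rows are actual classes, columns are predicted classes,
# exactly as in the Weka confusion matrix above.
matrix = [[533, 79],   # actual B0: predicted B0, predicted B1
          [276, 99]]   # actual B1: predicted B0, predicted B1

tp_rate_b0 = matrix[0][0] / sum(matrix[0])  # 533 / (533 + 79)
fp_rate_b0 = matrix[1][0] / sum(matrix[1])  # 276 / (276 + 99)
tp_rate_b1 = matrix[1][1] / sum(matrix[1])  #  99 / (276 + 99)
fp_rate_b1 = matrix[0][1] / sum(matrix[0])  #  79 / (533 + 79)

print(round(tp_rate_b0, 3), round(fp_rate_b0, 3))  # 0.871 0.736
print(round(tp_rate_b1, 3), round(fp_rate_b1, 3))  # 0.264 0.129
```

These match the TP Rate and FP Rate columns of the Detailed Accuracy table.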
The lower row of the confusion matrix represents the actual instances of
class B1 (one value of my target variable); the upper row represents the
actual instances of class B0 (the other value of my target variable).
The right-hand column of the confusion matrix represents the instances the
model predicted as B1 in the testing data; the left-hand column the
instances it predicted as B0.
I have read interpretations of the other columns (Precision, Recall &
F-Measure) somewhere, but I haven't found them as useful as what I've
described above.
What to look for in these numbers? The ideal result is for your FP rates to
be zero and the off-diagonal cells (lower-left and upper-right) of the
confusion matrix to be zeros. But if that actually happens you've probably
made a mistake in attribute selection. If one of the confusion matrix
columns is all zeros then you might not have a high-enough density of
training data in that class: I've raised the density of the smallest class
in the training data to 30%.
For the record, I have found the best results with Neural Nets and Decision
Tables, and even better results with preprocessing of the attribute values
to show their relationship to the target variable values.
I hope this helps,
[mailto:email@example.com] On Behalf Of Shelly Wu
Sent: Friday, March 28, 2003 3:14 AM
Subject: [Wekalist] Two questions about weka
I have two questions:
1. In the Test options of the Classify tab there are four options. What is
the difference between "Using Training set" and "Cross-validation"?
2. If I choose J48 and "Using Training set", I will get the decision tree.
But I do not understand what the "Summary" section, "Confusion Matrix"
section and "Detailed Accuracy By Class" section mean. I guess from the
values of those figures we can tell whether the tree is good or not. Can
anybody give me a more detailed explanation?
Thank you all very much!
Wekalist mailing list