No, you aren't missing anything. Most classifiers are sensitive to the
order in which the data arrives. If your data is ordered by class label
and you use a percentage split with preserved order, then the last class
label will be underrepresented, or perhaps not represented at all, in
the training set. That's why one normally performs 10 runs of 10-fold
cross-validation in order to get reasonable numbers. Before each run of
cross-validation, the data is randomized (and, for nominal classes,
stratified again to get a similar class distribution in the different
folds).
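For illustration, here is a minimal sketch of one randomized (and, for a
nominal class, stratified) cross-validation run, assuming the standard
Weka Instances API; the file name "mydata.arff" is just a placeholder:

import java.util.Random;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RandomizedFolds {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("mydata.arff");
    data.setClassIndex(data.numAttributes() - 1);
    int folds = 10;
    data.randomize(new Random(1));   // break any ordering by class label
    if (data.classAttribute().isNominal())
      data.stratify(folds);          // similar class distribution per fold
    for (int i = 0; i < folds; i++) {
      Instances train = data.trainCV(folds, i);
      Instances test  = data.testCV(folds, i);
      // build the classifier on 'train' and evaluate it on 'test' here
    }
  }
}

The cross-validation in the Explorer (via weka.classifiers.Evaluation)
performs the same randomization/stratification internally on a copy of
the data; using a different seed per run gives you the repeated runs.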
This makes sense indeed, but one last question, please: if the inferior
results are also present on a **regression** problem, does that then
mean that the model is an unstable one, suggesting possibly
over-fitting, etc.?
Once again, if your data is sorted according to the class attribute
(this time a numeric value, e.g., in ascending order), then the training
data from a split with preserved order is not representative of the
entire dataset (you chopped off the highest values). Every
classification/regression scheme tries to fit its model to the training
data; if the training data is chosen poorly, it will (most likely)
perform poorly on the test data as well.
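To make the "chopped off" effect concrete, here is a short sketch of
what a preserved-order percentage split amounts to when the data is
sorted in ascending order of a numeric class (again assuming the Weka
Instances API; "mydata.arff" is just a placeholder):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PreservedOrderSplit {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("mydata.arff");
    data.setClassIndex(data.numAttributes() - 1);
    data.sort(data.classIndex());  // ascending by the numeric class
    int trainSize = (int) Math.round(data.numInstances() * 0.66);
    int testSize = data.numInstances() - trainSize;
    Instances train = new Instances(data, 0, trainSize);        // lowest 66% of target values
    Instances test  = new Instances(data, trainSize, testSize); // only the highest 34%
    // any regression scheme built on 'train' has never seen the upper
    // range of the class, so it is bound to look poor on 'test'
  }
}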
Regarding unstable/over-fitting... it really depends on the scheme: some
schemes may need to see the full range of the numeric class in order to
build a useful model, while others might be more robust.
Peter Reutemann, Dept. of Computer Science, University of Waikato, NZ
+64 (7) 838-4466 Ext. 5174