In WEKA, each instance in a dataset can have a weight. All learning algorithms that make
use of instance weights, when they are provided, implement the WeightedInstancesHandler
interface (the relevant ones in the core distribution for WEKA 3.7 can be seen in the
Javadoc at http://weka.sourceforge.net/doc.dev/weka/core/WeightedInstancesHandler.html).
For example, instead of duplicating an instance, you can simply give it a weight of two.
This *should* have the same (or approximately the same) effect if the learning algorithm
is a WeightedInstancesHandler.
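To illustrate the equivalence with a toy example (plain Python, not WEKA code, and `weighted_mean` is just a hypothetical helper): any statistic computed from weighted sums comes out the same whether an instance appears twice with weight one or once with weight two.

```python
def weighted_mean(values, weights):
    """Mean of values, with each value counted according to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Duplicating the instance 5.0 ...
duplicated = weighted_mean([5.0, 5.0, 3.0], [1.0, 1.0, 1.0])
# ... has the same effect as giving it a weight of two.
reweighted = weighted_mean([5.0, 3.0], [2.0, 1.0])
assert duplicated == reweighted  # both are 13/3
```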
ClassBalancer simply reweights the instances so that the sum of the instance weights
is the same for every class. No instances are deleted or added, so the count for
each class remains unchanged.
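The arithmetic can be sketched like this (a toy illustration of the idea, not WEKA's actual implementation; `balance_class_weights` is a hypothetical name): scale each instance's weight by (total weight / number of classes) / (total weight of its class), so every class ends up with the same share of the total weight.

```python
from collections import defaultdict

def balance_class_weights(instances):
    """Rescale weights so every class's total weight is total / num_classes.
    instances: list of (class_label, weight) pairs; counts are untouched."""
    class_totals = defaultdict(float)
    for label, w in instances:
        class_totals[label] += w
    target = sum(class_totals.values()) / len(class_totals)
    return [(label, w * target / class_totals[label]) for label, w in instances]

# 2 positives vs 6 negatives, all weight 1: total weight 8, target 4 per class.
data = [("pos", 1.0)] * 2 + [("neg", 1.0)] * 6
balanced = balance_class_weights(data)
# Positives now weigh 2.0 each, negatives 4/6 each; counts stay 2 and 6.
```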
In contrast, SpreadSubsample with "AdjustWeights=true", after it has resampled
the data to achieve the desired spread, modifies the instance weights so that the
"total weight per class is maintained" (i.e., is the same as in the original data).
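A toy sketch of that behaviour (again an illustration of the idea, not WEKA's code; `spread_subsample` is a hypothetical name): subsample each class down to the size of the smallest class, then rescale the surviving weights so each class's total weight equals what it was before subsampling.

```python
import random
from collections import defaultdict

def spread_subsample(instances, adjust_weights=True, seed=1):
    """Randomly subsample every class to the size of the smallest class;
    with adjust_weights, rescale so each class keeps its original total weight."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for label, w in instances:
        by_class[label].append(w)
    smallest = min(len(ws) for ws in by_class.values())
    out = []
    for label, ws in by_class.items():
        kept = rng.sample(ws, smallest)
        scale = sum(ws) / sum(kept) if adjust_weights else 1.0
        out.extend((label, w * scale) for w in kept)
    return out

# 2 positives vs 6 negatives: both classes end up with 2 instances,
# but the surviving negatives' weights are scaled up so they still sum to 6.
sampled = spread_subsample([("pos", 1.0)] * 2 + [("neg", 1.0)] * 6)
```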
On 7 Jan 2016, at 04:22, Jeff Pattillo wrote:
I work with healthcare data and the specific problem I am working on right now has very
unbalanced classes (roughly 10:1). I tried including 10 copies of the smaller class for
every 1 instance of the bigger class, but the classifier that resulted did not generalize
very well. The model seems to be picking up on features of the smaller class because it
was artificially enlarged in such a uniform way. I am thinking I might get better results
by reducing the size of the larger class via sampling.
Does anyone have extensive experience with this? Is this the right way to go?
I was looking at the three supervised filters ClassBalancer, Resample, and
SpreadSubsample, and it seems I can get classes of equal size using all three.
ClassBalancer does this automatically, Resample does it if you set
"biasToUniformClass=1.0", and SpreadSubsample does it if you set
"distributionSpread=1.0". With ClassBalancer and SpreadSubsample you get
slightly odd results. If you run ClassBalancer on diabetes.arff, you get Counts that
differ for the class, even though the weights are the same. If you run SpreadSubsample on
diabetes.arff, with both "AdjustWeights=True" and
"distributionSpread=1.0", you get Counts that are the same for the class, but
the weights differ.
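For comparison, the effect of "biasToUniformClass=1.0" can be sketched in plain Python (a toy sketch of the sampling idea only, not WEKA's implementation; `resample_uniform` is a hypothetical name): draw a sample with replacement in which every class contributes an equal number of instances, keeping the overall dataset size.

```python
import random
from collections import defaultdict

def resample_uniform(instances, seed=1):
    """Resample with replacement to a uniform class distribution,
    keeping (roughly) the original dataset size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for inst in instances:
        by_class[inst[0]].append(inst)
    per_class = len(instances) // len(by_class)
    return [rng.choice(members)
            for members in by_class.values()
            for _ in range(per_class)]

# 2 positives vs 6 negatives becomes 4 of each, some drawn repeatedly.
data = [("pos", 1.0)] * 2 + [("neg", 1.0)] * 6
resampled = resample_uniform(data)
```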
Which of these samplers do you recommend? What does it mean to have different counts but
equal weights for the class? What does it mean to have equal counts but different weights
for the class?
One final question: I've seen it recommended to repeatedly sample the data, build
a classifier on each sample, and ultimately combine all of the resulting classifiers
into a hybrid classifier. Is there a way to create such a classifier in WEKA?
Thanks for the help as always!
Wekalist mailing list
Send posts to: Wekalist(a)list.waikato.ac.nz
List info and subscription status: http://list.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html