On Tue, Oct 19, 2004 at 15:28:07 +0000, Eibe Frank wrote:
> There is currently no way of doing this in Weka. However, this would be
> a very useful filter to have (i.e. a filter that converts the values of
> one user-specified attribute into instance weights, and deletes that
> attribute).

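As a rough illustration of the filter described above, here is a minimal plain-Java sketch that does not use the Weka API. It treats each instance as a double[] row. The class name WeightFromAttribute, the WeightedRow holder, and the method apply are hypothetical names chosen for this example, and rejecting negative or NaN weights is an assumption rather than anything stated in the thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch, not Weka code: moves one column into a per-row weight. */
final class WeightFromAttribute {

    /** Remaining attribute values plus the weight taken from the removed column. */
    static final class WeightedRow {
        final double[] values;
        final double weight;

        WeightedRow(double[] values, double weight) {
            this.values = values;
            this.weight = weight;
        }
    }

    /** Uses column weightIndex as the instance weight and drops it from each row. */
    static List<WeightedRow> apply(double[][] rows, int weightIndex) {
        List<WeightedRow> out = new ArrayList<>();
        for (double[] row : rows) {
            if (weightIndex < 0 || weightIndex >= row.length) {
                throw new IllegalArgumentException("weight index out of range: " + weightIndex);
            }
            double w = row[weightIndex];
            if (Double.isNaN(w) || w < 0) { // assumption: weights must be non-negative numbers
                throw new IllegalArgumentException("invalid weight: " + w);
            }
            // Copy everything before and after the weight column into a shorter row.
            double[] rest = new double[row.length - 1];
            System.arraycopy(row, 0, rest, 0, weightIndex);
            System.arraycopy(row, weightIndex + 1, rest, weightIndex, row.length - weightIndex - 1);
            out.add(new WeightedRow(rest, w));
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up data: the first column holds the weight.
        double[][] data = { {1200.0, 5.1, 0.0}, {800.0, 4.7, 1.0} };
        for (WeightedRow r : apply(data, 0)) {
            System.out.println("weight=" + r.weight + " values=" + Arrays.toString(r.values));
        }
    }
}
```
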
I gave this a try and used it to preprocess input to the decision tree
algorithm J48 via FilteredClassifier. Fiddling with pruning options I
got what seemed to be a reasonable tree, but I have some misgivings.
I think there may be a problem in method addErrs in
weka.classifiers.trees.j48.Stats. Looking at the following excerpt from
cvs, I have a couple of concerns:
    // Compute upper limit of confidence interval
    double f = (e + 0.5) / N;
    double r = (f + (z * z) / (2 * N) +
                z * Math.sqrt((f / N) -
                              (f * f / N) +
                              (z * z / (4 * N * N)))) /
               (1 + (z * z) / N);
First, the definition of f as a ratio of weighted counts looks OK.
But I wonder whether the "continuity correction" should be sensitive to
the scale of the weights (which is a general matter that I think someone
else brought up on the list recently).
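
To make the scale question concrete, here is a small Java sketch that evaluates the quoted expression f = (e + 0.5) / N after multiplying every weight by a constant. The class and method names are hypothetical and the counts are made up for illustration. Without the 0.5 the ratio e / N would be unchanged by rescaling; with it, f depends on the scale.

```java
/** Hypothetical illustration of how the fixed 0.5 correction reacts to rescaled weights. */
final class ContinuityCorrectionScale {

    /** Same expression as in the quoted excerpt: e and N are weighted counts. */
    static double correctedRate(double e, double N) {
        return (e + 0.5) / N;
    }

    public static void main(String[] args) {
        double e = 2.0;   // weighted errors at a leaf (made-up value)
        double N = 10.0;  // weighted instances at the leaf (made-up value)
        for (double scale : new double[] {0.01, 1.0, 100.0}) {
            // Rescaling all weights multiplies both e and N by the same factor.
            double f = correctedRate(e * scale, N * scale);
            System.out.printf("scale=%.2f  e/N=%.4f  f=%.4f%n", scale, e / N, f);
        }
    }
}
```

With these numbers f comes out near 0.25 at scale 1, near 0.2005 at scale 100, and above 1 at scale 0.01, while e / N stays at 0.2 throughout.
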
Anyway, my real concern is with the definition of r. The use of the
weighted count N worries me a bit. I think that for purposes of
statistical inference, the sample size N should just be the simple
(unweighted) count of instances with valid data points.
In my case the weights represent the (huge) unsampled population for a
typical sample survey. I gather a more common application of weights in
Weka may be as an imputation for missing values. In either situation,
counting these unobserved entities toward the sample size, or degrees of
freedom, seems intuitively wrong and could bias the inference considerably.
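
The effect can be sketched numerically. The Java snippet below re-implements only the quoted expression for r (not the rest of addErrs) and evaluates it twice for the same leaf: once with weighted counts and once with raw instance counts. The class name, the survey numbers, and the fixed z of about 0.6745 (the 0.75 quantile of the standard normal, assumed here to correspond to a confidence factor of 0.25) are all assumptions made for illustration.

```java
/** Hypothetical comparison of the quoted upper limit under weighted and unweighted counts. */
final class UpperLimitComparison {

    /** Upper confidence limit r from the quoted excerpt, for e errors out of N with normal quantile z. */
    static double upperLimit(double e, double N, double z) {
        double f = (e + 0.5) / N;
        return (f + (z * z) / (2 * N)
                + z * Math.sqrt((f / N) - (f * f / N) + (z * z / (4 * N * N))))
                / (1 + (z * z) / N);
    }

    public static void main(String[] args) {
        double z = 0.6744897501960817; // standard normal quantile at 0.75 (assumed CF = 0.25)
        int sampled = 50;              // instances actually observed at the leaf (made up)
        int sampledErrors = 10;        // of which misclassified (made up)
        double weight = 1000.0;        // survey expansion weight per instance (made up)

        // Same leaf, two notions of sample size.
        double weighted = upperLimit(sampledErrors * weight, sampled * weight, z);
        double unweighted = upperLimit(sampledErrors, sampled, z);

        System.out.printf("weighted   N=%.0f -> r=%.4f%n", sampled * weight, weighted);
        System.out.printf("unweighted N=%d -> r=%.4f%n", sampled, unweighted);
    }
}
```

With these made-up numbers the weighted version gives an upper limit of roughly 0.201, barely above the observed rate of 0.2, while the unweighted version gives roughly 0.251. A tighter limit means a smaller pessimistic error estimate, so the weighted count would presumably make pruning less aggressive than the 50 observed instances justify.
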
There may be other situations where weighted N is the right thing to
use. And I guess if one's only concern is a very low rate of missing
values, the distinction is pretty subtle. Still, my impression is that
for some users there could be a real problem.
BTW I believe the minimum-leaf-count option -M is also a weighted count.
Reading the documentation I wasn't sure about this, but perhaps it
should have been obvious.
On Oct 19, 2004, at 10:11 AM, Yue Pan wrote:
> This question seems to be asked before but never fully answered.
> Anyway(s) to give weight to each instance in the weka file? For
> example, setting the first attribute as the weight for each example
> and let Weka utilize that in training and testing internally. (I'm NOT
> talking about instance weighting in bagging or boosting, instead,
> having different weights to reflect the relative importance of
> individual instances)