As I understood the Wekalist rules, I attached the output (including the confusion matrix) in a separate file rather than pasting it into the message.
I am still hesitant to paste it here (it would make for a long message), but here is the link to the Stack Overflow question, which has everything:
> Date: Thu, 9 Jan 2014 06:56:41 +0000
> From: "Arjannikov, Tom" <tom.arjannikov(a)uleth.ca>
> To: Weka machine learning workbench list.
> Subject: Re: [Wekalist] Difference between WEKA instance predictions
> and confusion matrix results?
> Content-Type: text/plain; charset="us-ascii"
> It would help to see the actual confusion matrix..
> - Tom
I tried posting this to Stack Overflow and got no solutions, only a recommendation to ask you. So here goes.
I have a data set of numeric vectors with a binary class (S, H). I train a NaiveBayes model (although the method really doesn't matter) using leave-one-out cross-validation.
The results and PlainText output (with score distribution) are attached in the Train file.
Basically -- no problems there and, as you can see, there are three errors in both the output and the confusion matrix.
I then re-evaluate the model on an independent data set with the same attributes and the same two classes. The results are in the attached Test file.
....And here is where my problems lie.
The output clearly shows that there are many errors. In fact, there are 44.
The confusion matrix and the result summary, on the other hand, suggest that there are 12 errors.
Now, if the predicted classes were reversed, the confusion matrix would be correct.
So I look at the distribution of scores and I see that in the cross-validation results (Train file) the value before the comma represents the H class, and the second value is the S class (so the value 1,0 means H prediction).
However, in the Test file these are reversed and the value 1,0 means S prediction.
So, if I take the score distribution as it was in the Training results, the confusion matrix is right.
If I take the actual prediction (H or S) -- the confusion matrix is wrong.
I tried setting all the class labels in the test file to H, and then all to S.
This does NOT change the output results or the confusion matrix totals: in the confusion matrix, 16 instances are always predicted as a (H) and 40 as b (S).
The distributions in the plain-text output, on the other hand, are actually 16 b (S) and 40 a (H).
Any ideas what is going wrong?
It must be a simple thing, but I am completely and totally at a loss...
Thanks in advance!
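[For illustration only: one common cause of exactly this symptom is that the class attribute in the test file declares its values in a different order than the training file, so the same predicted distribution gets read against swapped labels. The minimal sketch below is hypothetical (plain Java, no Weka), using the H/S labels from the post:]

```java
import java.util.Arrays;
import java.util.List;

// Sketch: the same predicted distribution "1,0", read against two different
// class-value orders, yields opposite labels.
public class ClassOrderDemo {
    static String labelOf(double[] dist, List<String> classOrder) {
        // index of the largest probability maps into the declared class order
        int argmax = dist[0] >= dist[1] ? 0 : 1;
        return classOrder.get(argmax);
    }

    public static void main(String[] args) {
        double[] dist = {1.0, 0.0}; // the "1,0" distribution from the output
        // training header declares {H,S}; suppose the test header declares {S,H}
        System.out.println(labelOf(dist, Arrays.asList("H", "S"))); // H
        System.out.println(labelOf(dist, Arrays.asList("S", "H"))); // S
    }
}
```

If that is the cause here, making the test file's @attribute class declaration list the values in the same order as the training file should make the output and the confusion matrix agree.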
Hello everyone, I'm Ricardo from Costa Rica =)
I have a dataset with 5580 attributes and a few hundred instances.
All of these attributes were computed numerically, using functions,
integrals, transforms, etc. Some of the resulting values are
NaN, -NaN, INF, or -INF.
I want these attributes to stay numeric.
My question is: how can I make Weka understand them as such, i.e. as a value
distinct from any other real number?
I have considered doing a "nominalization", but that could take a while,
since the variance in the values is large.
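[One pragmatic option, sketched below with made-up data: map the non-finite values to Weka's "missing" marker before building the Instances, since Weka represents a missing numeric value internally as NaN and treats it as distinct from every real number. This keeps the attributes numeric; whether "missing" is the right semantics for your features is an assumption.]

```java
// Sketch: replace NaN / +-Infinity in a numeric row with NaN, which Weka
// interprets as a missing value for a numeric attribute.
public class NonFiniteToMissing {
    static double[] toMissing(double[] row) {
        double[] out = new double[row.length];
        for (int i = 0; i < row.length; i++) {
            // Double.isFinite is false for NaN, Infinity and -Infinity
            out[i] = Double.isFinite(row[i]) ? row[i] : Double.NaN;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] row = {1.5, Double.NaN, Double.POSITIVE_INFINITY, -2.0};
        System.out.println(java.util.Arrays.toString(toMissing(row))); // [1.5, NaN, NaN, -2.0]
    }
}
```

In an ARFF file the same effect is achieved by writing `?` in place of the non-finite values.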
I have 3 million+ transactions that I would like to analyze using Apriori.
I run out of memory after a while. I am running 3-7-10-oracle-jvm on Mac
OS, and I was wondering if I should be doing something else instead.
What are the recommendations for this case?
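[Not a definitive answer, but the usual first step is to give the JVM a larger heap when launching Weka. A sketch, assuming weka.jar is on the classpath and the data is in transactions.arff (both names are placeholders; adjust -Xmx to your machine):]

```shell
# allocate 8 GB of heap to the JVM running Weka
java -Xmx8g -cp weka.jar weka.associations.Apriori -t transactions.arff
```

If memory is still the bottleneck, Weka 3.7 also ships weka.associations.FPGrowth, which may cope better than Apriori on large transaction sets.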
Thanks very much,
I am a member of the Weka mailing list, but my posts are not being accepted. I
receive other people's posts by email normally, but my own posts are not
going through and my questions are not being answered.
I contacted Mark directly, as well as the list manager, but got no response.
So, could you please fix this problem? I have some questions and would really appreciate the help.
Thanks for your help,
Dear all,

I have a question concerning the combination of feature selection and cross-validation in Weka.

For my experiment, I have a dataset of 2000 documents that I want to classify using ten-fold cross-validation. My preprocessing step consists of applying the StringToWordVector filter and then AttributeSelection with the InfoGainAttributeEval evaluator. Of course, I only want to apply the attribute selection to the training set, since AttributeSelection is a supervised filter.

Is the following code a correct way to obtain this evaluation? Will it apply StringToWordVector to all the data and the information-gain selection only to the training folds? If not, what would be the best way to obtain this behavior?
Instances inst = ...; // the complete dataset
inst.setClassIndex(inst.numAttributes() - 1);
StringToWordVector stwFilter = new StringToWordVector();
stwFilter.setInputFormat(inst);                   // STW is applied once, to all data
Instances stw = Filter.useFilter(inst, stwFilter);
AttributeSelection as = new AttributeSelection(); // the supervised filter
as.setEvaluator(new InfoGainAttributeEval());
as.setSearch(new Ranker());
FilteredClassifier fc = new FilteredClassifier(); // re-learns the selection on each training fold
fc.setFilter(as);
fc.setClassifier(new NaiveBayes());               // base learner; any classifier works here
Evaluation eval = new Evaluation(stw);
eval.crossValidateModel(fc, stw, 10, new Random(1));
Thank you in advance.