Dear Stefan, Hans and Jose,
there seem to be only four of us interested in using WEKA for text
mining purposes and developing related classes. I suggest we continue
the discussion off the WEKA list to avoid annoying others.
We might think of creating a self-contained package weka.text that
contains classes for representing and pre-processing textual data
(normalization, tokenization, discovery of multi-token terms, stop-word
removal, stemming, weighting, named entity extraction, etc.).
Who else wants to join?
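For concreteness, here is a minimal sketch of the kind of pre-processing such a package could start from. The class name and the stop-word list are invented for illustration; real tokenization, multi-token term discovery and stemming would of course be far more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of basic text pre-processing: lowercasing,
// tokenization on non-letter characters, and stop-word removal.
public class TextPreprocessor {
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "the", "of", "and", "is"));

    // Lowercase, split on non-letter runs, drop empty tokens and stop words.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }
}
```

A real weka.text package would presumably wrap steps like this as filters producing WEKA Instances, so they could be chained before any classifier.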
Is anybody developing, or considering developing, Weka data
pre-processing classes for textual data? Currently we are thinking of
doing this by generalizing our home-grown, quick-and-dirty solution.
However, we don't want to re-invent the wheel! Specifically, we are
working on a KDD approach to convert domain-specific text archives into
semantically tagged
Note: this email was sent to support, but maybe someone on the mailing
list can help me out.
My name is Erik Geleyn; I am currently an RA at Florida Atlantic
University, and I am doing some empirical investigations for software
quality modeling using Weka.
I) Problem report
First, I would like to mention a problem I encountered:
When using CostSensitiveClassifier with LogitBoost and DecisionStump,
everything works fine; however, when I try to use another classifier
(J48), I get an error message saying "Error message : Class is numeric".
I am doing classification, and the class is definitely not numeric.
Here are the command lines for the two classifiers:
CostSensitiveClassifier -S 1 -W weka.classifiers.LogitBoost -- -P 100 -I
10 -W weka.classifiers.DecisionStump --
CostSensitiveClassifier -S 1 -W weka.classifiers.LogitBoost -- -P 100 -I
10 -W weka.classifiers.j48.J48 -- -C 0.25 -M 2
II) Questions about the Experimenter
I also had a couple of questions about the experimenter:
When I am using the RandomSplitResultProducer, the results are given
for the test data set. My problem is the following: I have to vary a
cost ratio until I reach a preferred model based on the cross-validation
results on the fit data set, and then evaluate the preferred model on
the test data. So right now I have the results on the test data but no
way to evaluate the model on the fit data using cross-validation.
The CrossValidationResultProducer gives me the cross-validation results,
but on the whole data set, so I can perform my model selection but have
no data left to evaluate my model. Is there any way to use random
splitting, then run 10-fold cross-validation on the fit data for model
selection and evaluate on the test data?
Right now the only way I have found is to manually split the data
before using the Explorer, build models using 10-fold cross-validation,
and then evaluate them on the test data. The only thing is that I have
50 data splits to do, so I would like to automate the runs using the
Experimenter.
III) Questions about classifiers
I am doing research on meta-learning schemes applied to software
quality modeling, and I had a couple of questions about the algorithms.
III.1) Cost-sensitive classifier
When I use this classifier with, say, J48 (C4.5), are the costs used
by the tree algorithm, or are we just performing some sort of resampling
on the original fit data by copying instances with higher cost?
For example, for decision stumps I am pretty sure that's what happens,
since DecisionStump doesn't "handle" costs. I had a doubt about C4.5,
since some implementations handle costs.
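For cost-blind base learners, the resampling idea amounts to something like the following WEKA-independent sketch. Note that WEKA actually resamples randomly according to instance weights; the deterministic integer duplication here only illustrates the principle, and the class name and cost representation are made up.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of cost-proportional duplication: an instance whose
// misclassification cost is c appears c times in the resampled data, so a
// cost-blind learner trained on it implicitly minimises expected cost.
public class CostResampler {
    // costs[i] is the (integer) misclassification cost of instance i.
    public static <T> List<T> resample(List<T> data, int[] costs) {
        List<T> out = new ArrayList<>();
        for (int i = 0; i < data.size(); i++) {
            for (int c = 0; c < costs[i]; c++) {
                out.add(data.get(i));
            }
        }
        return out;
    }
}
```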
About the useResampling option:
If I select true, I guess we are performing a resampling from the
original training data using the weights of each instance, and the
algorithm is just run on the resampled training data.
If I select false, my guess was that the algorithm would handle the
weights directly. I can imagine that for C4.5, as explained on p. 254 of
the book by Witten and Frank: "C4.5 is an example of a learning scheme
that can accommodate weighted instances." However, I doubt DecisionStump
handles those weights directly, so I don't really understand how you can
set this option to false and use that kind of classifier.
I guess I misunderstood the meaning of that option.
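For what it's worth, even a one-level stump can honour instance weights directly, simply by using weighted counts when it scores a split. Here is a minimal sketch of such a weighted error computation; the class name and data representation are invented for illustration and are not WEKA's.

```java
// Sketch: the error of a candidate threshold rule "predict 1 if x > t"
// is the total weight of the misclassified instances, normalised by the
// total weight. A weight-aware stump picks the threshold minimising this.
public class WeightedStump {
    // x: attribute values, y: class labels in {0, 1}, w: instance weights.
    public static double weightedError(double[] x, int[] y, double[] w,
                                       double threshold) {
        double err = 0.0, total = 0.0;
        for (int i = 0; i < x.length; i++) {
            int pred = x[i] > threshold ? 1 : 0;
            if (pred != y[i]) err += w[i];
            total += w[i];
        }
        return err / total;
    }
}
```

So "handling weights directly" need not mean anything exotic; any learner whose internal statistics are counts can swap them for weighted sums.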
III.2) Cost-sensitive boosting
In my research field the cost of misclassification is dramatically
important: the cost of missing a fault-prone module (which may allow
failures to go into the field) is much higher than that of a false alarm
on a module that is not fault-prone (time wasted reviewing something we
didn't need to). Because of this we are investigating cost-sensitive
boosting methods. The two we selected are Cost-boosting and AdaCost. Do
you have any plans to implement those methods? Otherwise, I will have a
member of our research group work on them starting in January. In that
case, we may have a lot of questions about your implementation of
AdaBoost and about the GUI; could you provide us with some support at
that time?
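For reference, here is a sketch of the kind of per-round weight update a cost-sensitive boosting scheme performs: the standard multiplicative AdaBoost update is inflated by a per-instance cost factor on misclassification, so costly instances gain weight faster. This only illustrates the general idea behind AdaCost; it is not the exact formulation from the AdaCost paper or from WEKA's AdaBoostM1, and all names here are invented.

```java
// Sketch of a cost-inflated AdaBoost-style weight update.
public class CostBoostUpdate {
    // w: current weights; y, h: true labels and predictions in {-1, +1};
    // cost[i] >= 1 scales the update when instance i is misclassified;
    // alpha: the round's hypothesis weight. Returns renormalised weights.
    public static double[] update(double[] w, int[] y, int[] h,
                                  double[] cost, double alpha) {
        double[] out = new double[w.length];
        double z = 0.0;
        for (int i = 0; i < w.length; i++) {
            // Misclassified instances (y*h = -1) have their exponent
            // multiplied by the cost factor, growing their weight faster.
            double factor = (y[i] == h[i]) ? 1.0 : cost[i];
            out[i] = w[i] * Math.exp(-alpha * y[i] * h[i] * factor);
            z += out[i];
        }
        for (int i = 0; i < w.length; i++) out[i] /= z; // renormalise
        return out;
    }
}
```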
I hope my email was not too long. I would like to thank you guys for the
great book and the great tool you provide to the community.
Regards, Erik Geleyn
Florida Atlantic University
(561) 750 9258
(561) 297 2512
Here is a programming question for you to chew on:
I'd like to be able to reformat the output from the Apriori
algorithm. Is there a simpler way than extending
weka.associations.Apriori and overriding the toString() method
to do my bidding?
The reason behind my madness:
Given numeric attributes N1, N2, ..., Nn and given that I have
discretized them in some manner to form symbolic attributes
S1, S2, ..., Sm, which are subsequently given to the Apriori
algorithm, I obtain output in the following format:
S1=4 S3=8 46 ==> S7=8 35 conf:(0.76)
Now I just so happen to know that when S1=4, it means that
a<=N1<=b. So I would prefer to output a rule of the form:
a<=N1<=b c<=N3<=d 46 ==> e<=N7<=f 35 conf:(0.76)
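One way to get that without subclassing Apriori is to post-process the rule string: keep a map from each discretized symbol/value pair back to its numeric interval, and rewrite the tokens with a regular expression. This is only a sketch; the map contents would come from your own discretization step, and the token pattern assumes the `S<i>=<v>` format shown above.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Rewrites an Apriori rule string, replacing symbolic tokens such as
// "S1=4" with the numeric interval that the bin was discretized from.
public class RuleRewriter {
    private static final Pattern TOKEN = Pattern.compile("S(\\d+)=(\\d+)");

    // intervals maps e.g. "S1=4" -> "a<=N1<=b"; tokens with no mapping
    // are left unchanged. Counts and confidence pass through untouched.
    public static String rewrite(String rule, Map<String, String> intervals) {
        Matcher m = TOKEN.matcher(rule);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String repl = intervals.getOrDefault(m.group(), m.group());
            m.appendReplacement(sb, Matcher.quoteReplacement(repl));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```

The advantage over overriding toString() is that this works on any rule output you already have as text, with no dependency on Apriori's internals.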
Any help in this matter would be greatly appreciated.
Alan J. Barton Work: 1 (613) 991-5486
Rm 368, Bldg. M-50, IIT, NRC Fax: 1 (613) 952-0215
1500 Montreal Road, Ottawa, ON, CA, K1A 0R6 <mailto:Alan.Barton@nrc.ca>
If we knew what it was we were doing, it would not be called
research, would it? A.Einstein
I am using the Weka software at the University of ... and
ran into some problems using the graphical version when trying
to run more than one classifier type on the same data set.
The problem is the following:
I start with the default ZeroR classifier.
I enable "Generator properties" and select classifier
to get ZeroR in the "Generator properties" box.
I select the "Result generator" to get a
weka.gui.GenericObjectEditor window, where I select, in the
weka.experiment.ClassifierSplitEvaluator window, the
"classifier" field.
I get a list of classifiers and select some other classifier.
When I do that, the one (ZeroR) that I had previously selected
vanishes from the "Result generator" field in the main window.
Do you know what I am not doing right?