I have some attributes (variables) in my dataset that have only one, two, or three instances. I couldn't find any information about the minimum number of instances per attribute. Should attributes with so many missing instances be removed before clustering with k-means and X-means?
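One common heuristic for the situation described above is to drop attributes whose fraction of missing values exceeds a threshold before clustering. A minimal sketch in plain Python (this is an illustration of the idea, not Weka's API; the function name and threshold are made up):

```python
# Hypothetical sketch (not Weka's API): drop attributes whose fraction of
# missing values exceeds a threshold before clustering.
def drop_sparse_attributes(rows, max_missing=0.95):
    """rows: list of dicts; None marks a missing value.
    Returns the names of the attributes worth keeping."""
    n = len(rows)
    keep = []
    for attr in rows[0]:
        present = sum(1 for r in rows if r[attr] is not None)
        if present / n >= 1 - max_missing:
            keep.append(attr)
    return keep

data = [{"a": 1, "b": None}, {"a": 2, "b": None}, {"a": 3, "b": 5}]
print(drop_sparse_attributes(data, max_missing=0.5))  # ['a']
```

An attribute present in only one or two of thousands of instances carries almost no distance information for k-means, so a threshold like this usually removes it.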
I work with Weka in my doctoral studies, and I need to know about the Resample filter (filters/supervised/instance/Resample). My data has class imbalance. Does Resample do undersampling, oversampling, or a combination of the two?
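For context on the question above: the supervised Resample filter draws a random subsample with replacement and can bias the class distribution toward uniform (its biasToUniformClass option), which in effect oversamples the minority class and undersamples the majority class at the same time. A rough sketch of that idea in plain Python (illustrative only, not Weka's implementation):

```python
import random

# Illustrative sketch of resampling biased toward a uniform class
# distribution (not Weka's code): each class gets an equal share of the
# output, which oversamples the minority and undersamples the majority.
def resample_uniform(instances, labels, size, seed=1):
    random.seed(seed)
    by_class = {}
    for x, y in zip(instances, labels):
        by_class.setdefault(y, []).append(x)
    per_class = size // len(by_class)
    sample = []
    for y, xs in by_class.items():
        # sampling WITH replacement, as Resample does by default
        sample += [(random.choice(xs), y) for _ in range(per_class)]
    return sample

X = list(range(10))
y = ["neg"] * 8 + ["pos"] * 2          # imbalanced: 8 neg, 2 pos
out = resample_uniform(X, y, size=10)
print(sum(1 for _, c in out if c == "pos"))  # 5
```

With the bias set to 0 the filter just subsamples at the original class distribution; with bias 1 it targets the uniform distribution sketched here.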
I'm trying to run the Experimenter from the command line. It does not seem to permit classifiers with parameters, or I'm just typing it wrong. That would be strange, but all the examples of Experimenter runs in the manual and this archive seem to use parameter-free classifiers (e.g. weka.classifiers.trees.DecisionStump), so I have to wonder.
I've tried various syntaxes to wrap the classifier parameters within the -W attribute, but none have worked. So how should I type J48 in the Experimenter with its two parameters (extract below)?
java -classpath /home/u1/univ1/hy/harsaari/weka-3-4-8a
weka.experiment.Experiment -r -T 100-10-10.arff -T 100-10-100.arff -T
100-10-1000.arff -T 100-2-10.arff -T 100-2-100.arff -T 100-2-1000.arff -T
100-20-10.arff -T 100-20-100.arff -T 100-20-1000.arff -T 100-5-10.arff -T
100-5-100.arff -T 100-5-1000.arff -T 60-10-10.arff -T 60-10-100.arff -T
60-10-1000.arff -T 60-2-10.arff -T 60-2-100.arff -T 60-2-1000.arff -T
60-20-10.arff -T 60-20-100.arff -T 60-20-1000.arff -T 60-5-10.arff -T
60-5-100.arff -T 60-5-1000.arff -D "weka.experiment.InstancesResultListener
-O LMT-results.arff" -P weka.experiment.AveragingResultProducer -- -X 10 -W
weka.experiment.CrossValidationResultProducer "-- -X 10 -D -O
splitEvaluatorOut.zip -W weka.experiment.ClassifierSplitEvaluator
-- -W weka.classifiers.trees.J48 '-C 0.25 -M 2'
best, Harri S
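On the nested quoting in commands like the one above: Weka splits option strings much like shell word-splitting, with each nesting level (-W ... -- ...) wrapped in one more layer of quotes, and each split peeling one layer off. The layering can be illustrated with Python's shlex (an analogy only; the class names and options below are taken from the command above):

```python
import shlex

# Each nesting level wraps the inner options in one more layer of quotes;
# each word-split peels exactly one layer off.
outer = ('-W weka.experiment.CrossValidationResultProducer '
         '"-- -X 10 -W weka.classifiers.trees.J48 \'-- -C 0.25 -M 2\'"')
level1 = shlex.split(outer)
print(level1[-1])   # the inner options, still single-quoted at one level
level2 = shlex.split(level1[-1])
print(level2[-1])   # -- -C 0.25 -M 2
```

The practical upshot is that the J48 options must arrive at the classifier's level as one token, so they need as many quoting layers as there are -W nestings between the top-level command and J48.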
I ran X-means from the command line 10x10 times on the same dataset, varying the random seeds. I noticed that each time the centroids varied between the two examples below. The Cluster 0 and Cluster 1 mean values are awfully close, and graphing them produced two lines that lay almost on top of each other apart from the initial points. Why would it produce two clusters if, going by the graph, they might as well be one, and does this mean the data are not suitable for X-means clustering?
Cluster centers : 2 centers
Cluster 0 106.34965034965035 24.06993006993007 22.52412041083916 62795 30.127788001472208 25.171896240032495 30.767402322448152 34.08041958041958 79.84615384615384 97.0
Cluster 1 16.833153928955866 21.396966361426067 23.635566637513456 2596514 29.187762732989626 26.895213462446897 30.109484401695045 492 34.27610333692142 80.02368137782562 97.0
BIC-Value : 5698.218144
Cluster centers : 2 centers
Cluster 0 109.26119402985074 24.73134328358209 21.69365088619403 70013 30.63448546739984 25.75563066431369 30.814630973572466 34.06902985074627 79.83582089552239 97.0
Cluster 1 17.276119402985074 21.328125532798314 23.743540944829423 0516588 29.124396812927827 26.79528780375243 30.10905009682951 4 34.27585287846482 80.02345415778251 97.0
BIC-Value : 5708.235908
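The seed-dependence described above can be reproduced with plain k-means in a few lines (a toy sketch, not X-Means): different seeds give different initial centroids, so the final centers drift slightly even when the induced partition is essentially the same.

```python
import random

# Toy 1-D k-means (illustrative, not X-Means): the seed changes only the
# initial centers; on well-separated data both runs settle on (almost)
# the same final centers.
def kmeans_1d(xs, k, seed, iters=20):
    rng = random.Random(seed)
    centers = rng.sample(xs, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[min(range(k), key=lambda i: abs(x - centers[i]))].append(x)
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return sorted(centers)

data = [1.0, 1.1, 1.2, 0.9, 5.0, 5.1, 4.9, 5.2]
a = kmeans_1d(data, 2, seed=1)
b = kmeans_1d(data, 2, seed=7)
print(a, b)  # near-identical centers from both seeds
```

When two near-identical centers appear, as in the output pasted above, it usually means the splitting criterion (BIC for X-Means) only marginally preferred two clusters over one, so small seed-driven perturbations move the centers without changing the overall picture.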
Currently I have an ARFF file whose class attribute gives "0" as negative and "1" as positive.
In my case there are only two classes, positive ("1") and negative ("0"), and every tuple must be either positive or negative.
For my purposes, a "false positive" is a tuple in the test set that is classified as "1" but is actually "0", and a "false negative" is a tuple classified as "0" but actually "1".
First, how can I specify the class attribute in the *Experimenter*? I know how to do that in the Explorer, but since I am repeatedly drawing random samples of test data, you told me that I have to use the Experimenter.
Second, how can I tell Weka what I mean by a false positive and a false negative?
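For reference, the definitions asked about above reduce to a comparison of predicted against actual labels once a positive class is declared. A minimal sketch (plain Python, not Weka's evaluation code; the function name is made up):

```python
# Sketch: with "1" declared the positive class, false positives and false
# negatives fall out of a pairwise comparison of actual vs. predicted.
def confusion_counts(actual, predicted, positive="1"):
    fp = sum(1 for a, p in zip(actual, predicted)
             if p == positive and a != positive)   # predicted 1, really 0
    fn = sum(1 for a, p in zip(actual, predicted)
             if p != positive and a == positive)   # predicted 0, really 1
    return fp, fn

actual    = ["1", "0", "1", "0", "1"]
predicted = ["1", "1", "0", "0", "1"]
print(confusion_counts(actual, predicted))  # (1, 1)
```

Which label counts as "positive" is purely a matter of which class you designate, which is why per-class rates are reported separately for each class value.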
I am trying to use the Weka Explorer -> Classify panel.
I picked a classification method (J48) and selected "percentage split" as the means of splitting the data into training and test sets.
I ran the test 10 times, but the output is always identical. This makes me think that the classifier always takes the first 66% of the data for training and the last 34% for testing.
How can I make Weka pick 66% of the entries *randomly* as training data? I would like to repeat the classification a large number of times and obtain distributions for the correctly/incorrectly classified ratio and the false-positive and false-negative ratios.
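The randomized split being asked for amounts to shuffling with a different seed on each repetition before cutting at 66%, rather than always taking the first 66% of rows in file order. A minimal sketch (plain Python, not Weka's code; the function name is made up):

```python
import random

# Sketch of a randomized percentage split: shuffle with a per-repetition
# seed before cutting, instead of always taking the first 66% of rows.
def percentage_split(rows, train_pct=66, seed=0):
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = len(rows) * train_pct // 100
    return rows[:cut], rows[cut:]

data = list(range(100))
train1, test1 = percentage_split(data, seed=1)
train2, test2 = percentage_split(data, seed=2)
print(len(train1), len(test1))  # 66 34
print(train1[:5], train2[:5])   # different seeds give different splits
```

Repeating this with seeds 1..N and recording the error rates of each run yields exactly the kind of distribution over correct/incorrect and false-positive/false-negative ratios described above.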