I think my question is in the same direction. In the project I am currently
working on I have a database with some documents. I want to compare the
documents pairwise and, for every pair, save whether the two documents are
similar or not. Right now I don't need to know how to do it, but I would
like to know whether it is possible at all to do this with WEKA or any of
the programs that use WEKA as their basic platform?
Thanks for your help,
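Whether or not a Weka component covers this out of the box, the underlying computation is simple. Below is a minimal, library-free Java sketch: cosine similarity between term-count vectors, thresholded into a similar/not-similar decision. The 0.8 threshold and the toy vectors are illustrative assumptions, not Weka API.

```java
public class PairwiseSimilarity {
    // Cosine similarity between two term-count vectors of equal length.
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        // Toy document vectors (term counts); in practice these would
        // come from the documents in the database.
        double[][] docs = {
            {1, 2, 0, 1},
            {1, 2, 0, 0},
            {0, 0, 3, 1},
        };
        double threshold = 0.8; // illustrative cut-off for "similar"
        for (int i = 0; i < docs.length; i++)
            for (int j = i + 1; j < docs.length; j++) {
                boolean similar = cosine(docs[i], docs[j]) >= threshold;
                System.out.println("doc" + i + "/doc" + j
                        + " similar=" + similar);
            }
    }
}
```

Saving each (i, j, similar) triple back to the database would then be a plain SQL insert per pair.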
>as I have posted earlier, I am still looking for a way to get measures
>such as mean distance within a document set (or simply a set of
>vectors), as well as within clusters when using clustering algorithms.
>It would be nicest to have options like TF.IDF weighting, vector
>normalisation (I guess this is a fundamental part of most distance
>measures anyway) and such, some of which could probably be done in the
>Since clustering and classification algorithms use such measures
>internally, it should not be hard to get at them somehow, but it doesn't
>seem possible from the command line and I haven't yet looked
>sufficiently at the internals of WEKA to know how it could be done in
>Any hints on where to look would be greatly appreciated.
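The measures described in the quoted post are easy to compute outside WEKA while waiting for a better answer. Here is a self-contained Java sketch of mean pairwise distance over TF.IDF-weighted, L2-normalised vectors; the log(n/df) form of IDF is one common convention, not necessarily the one WEKA uses internally.

```java
public class MeanDistance {
    // Mean pairwise Euclidean distance over TF.IDF-weighted,
    // L2-normalised versions of the given term-count vectors.
    static double meanPairwiseDistance(double[][] counts) {
        int n = counts.length, terms = counts[0].length;

        // IDF per term: log(n / document frequency).
        double[] idf = new double[terms];
        for (int t = 0; t < terms; t++) {
            int df = 0;
            for (double[] doc : counts) if (doc[t] > 0) df++;
            idf[t] = df == 0 ? 0 : Math.log((double) n / df);
        }

        // TF.IDF weighting, then L2 normalisation of each vector.
        double[][] vecs = new double[n][terms];
        for (int d = 0; d < n; d++) {
            double norm = 0;
            for (int t = 0; t < terms; t++) {
                vecs[d][t] = counts[d][t] * idf[t];
                norm += vecs[d][t] * vecs[d][t];
            }
            norm = Math.sqrt(norm);
            if (norm > 0)
                for (int t = 0; t < terms; t++) vecs[d][t] /= norm;
        }

        // Average Euclidean distance over all unordered pairs.
        double sum = 0;
        int pairs = 0;
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++, pairs++) {
                double sq = 0;
                for (int t = 0; t < terms; t++) {
                    double diff = vecs[i][t] - vecs[j][t];
                    sq += diff * diff;
                }
                sum += Math.sqrt(sq);
            }
        return sum / pairs;
    }

    public static void main(String[] args) {
        double[][] counts = { {2, 0, 1}, {0, 3, 1}, {1, 1, 0} };
        System.out.println("mean distance = " + meanPairwiseDistance(counts));
    }
}
```

The same routine applied to the instances of one cluster at a time would give the within-cluster figure asked about above.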
I'm trying to translate C4.5-style .names and .data files into an .arff file.
In a .names file you're allowed to specify that an attribute should be ignored
simply by giving "ignore" as its type. Is there any way of specifying
this in an .arff file? A basic Google search didn't turn one up; I
only found advice to edit the attributes later inside the GUI. Any ideas on
how this can be declared in an .arff file?
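For reference, the C4.5 construct in question looks like the fragment below (attribute names invented for illustration). As far as I know, ARFF has no counterpart for the `ignore` type, so the attribute either has to be dropped while converting, or removed afterwards with an attribute-removal filter:

```
| C4.5 .names file: the learner skips the "ignore" attribute.
class1, class2.
temperature:  continuous.
record-id:    ignore.
humidity:     continuous.
```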
I just want to know whether it is possible to use the Weka clustering
algorithms to cluster news documents. I want to use the news title
and the news content to cluster the documents.
Can I accomplish this easily, or do I have to modify the code to read my
data? Has someone implemented similar functionality?
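If the titles and contents are stored as string attributes in an ARFF file, something along these lines may work without any code changes. This is a sketch only: the class and option names are from the Weka 3.4 package layout, and the file names and cluster count are made up.

```shell
# Turn the string attributes into word-count vectors, then cluster.
java weka.filters.unsupervised.attribute.StringToWordVector \
    -i news.arff -o news-vectors.arff
java weka.clusterers.SimpleKMeans -t news-vectors.arff -N 10
```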
Albert Vila Puig
[iMente, the largest headline aggregator in Spanish]
We invite you to visit our new website and try our services.
Dear Weka developers,
I'm a PhD candidate in Japan. I'm planning to set up a 'Weka Japanese
localized package project' on SourceForge.jp. In this project, we will
translate the GUI texts and the "More" documentation included in the source
code. It will be very useful for Japanese data miners, because they tend
to choose localized data-mining tools.
Is there any convention or standard for putting together a localized Weka
package? Of course, we will distribute the original source package, a
localization patch, and a binary package, in accordance with the license.
I just downloaded WEKA-3 (the one used in chapter 8). I followed
the instructions but was unable to run it.
When I used the following command
java weka.classifiers.j48.J48 -t weather.arff
I got the following message:
Warning: -t not understood.
Weka exception: No training file and no object input file given.
-t <name of training file>
Sets training file.
-T <name of test file>
Sets test file. If missing, a cross-validation
will be performed on the training data.
-c <class index>
Sets index of class attribute (default: last).
-x <number of folds>
Sets number of folds for cross-validation (default: 10).
-s <random number seed>
Sets random number seed for cross-validation (default: 1).
-m <name of file with cost matrix>
Sets file with cost matrix.
-l <name of input file>
Sets model input file.
-d <name of output file>
Sets model output file.
-v
Outputs no statistics for training data.
-o
Outputs statistics only, not the classifier.
-i
Outputs information retrieval statistics for
two-class problems.
-k
Outputs information-theoretic statistics.
-p
Only outputs predictions for test instances.
-r
Only outputs cumulative margin distribution.
-g
Only outputs the graph representation of the classifier.
Options specific to weka.classifiers.j48.J48:
-U
Use unpruned tree.
-C <pruning confidence>
Set confidence threshold for pruning.
-M <minimum number of instances>
Set minimum number of instances per leaf.
-R
Use reduced error pruning.
-N <number of folds>
Set number of folds for reduced error
pruning. One fold is used as pruning set.
-B
Use binary splits only.
-S
Don't perform subtree raising.
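One thing worth checking (a guess, not a diagnosis): the usage text above clearly lists -t, yet the parser says "-t not understood". A command copied from a book or PDF can carry a typographic dash instead of a plain ASCII hyphen, which produces exactly this warning. Retyping the command by hand, with weka.jar explicitly on the classpath, rules this out; the jar path below is a placeholder:

```shell
java -cp /path/to/weka.jar weka.classifiers.j48.J48 -t weather.arff
```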
I tried using Weka for wrapper-model-based feature selection, with J48 as
the attribute evaluator and BestFirst as the search method. My dataset
has 139 attributes and approximately 20,000 instances. I get a
StackOverflowError. Is there any way I can overcome it?
What is the maximum size of the dataset that Weka can handle?
Thanks for your help!
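A StackOverflowError from a deeply recursive search can sometimes be side-stepped simply by giving the JVM a larger per-thread stack (-Xss) and more heap (-Xmx). The values below are arbitrary starting points, not recommendations:

```shell
# Rerun the same Weka class and options, but with more stack and heap.
java -Xss16m -Xmx512m <same Weka class and options as before>
```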
Recently I needed to train an AdaBoost classifier on a large dataset. The dataset contains about 2,000 instances, and each instance has 11,980 attributes, so the file is about 160 MB. It seems that Weka can't deal with such big files. What can I do?
Any suggestion would be highly appreciated.
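Assuming the failure is an OutOfMemoryError rather than a file-format problem, the first thing to try is raising the JVM's heap limit, which is small by default. The class name below follows the Weka 3.4 package layout and the file name is made up:

```shell
java -Xmx1024m weka.classifiers.meta.AdaBoostM1 -t big.arff
```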