I invite you to the RSCTC'2010 Discovery Challenge, a special event of the
7th International Conference on Rough Sets and Current Trends in
Computing. The task concerns feature selection in the analysis of DNA
microarray data and the classification of patients for the purpose of
medical diagnosis and treatment.
Prizes worth over *3,000 USD* will be awarded to the best solutions!
Users of Weka are particularly invited, since one of the challenge
tracks accepts Java source code as a solution, and the algorithm must be
a classifier implemented in the architecture of Weka (or of two other
related systems, Debellor or Rseslib).
Challenge web page: http://tunedit.org/challenge/RSCTC-2010-A
Started: Dec 1, 2009
Ends: Feb 28, 2010
Marcin Wojnarski, Project Lead, TunedIT, http://tunedit.org
Machine Learning & Data Mining Research -
Automated Tests, Repeatable Experiments, Meaningful Results
I have a dataset with 100,000 examples and 14 features. Most of them are
nominal, with a large set of possible values. The class itself has ~750
different possible values. J48 does not seem to be able to handle that with
2 GB of heap space, and lazy.IB1 does not seem to be able to finish. Only
NaiveBayes and OneR give me any results. I have 3 questions:
(1) Which classifiers should work best with large multiclass datasets?
(2) Should I change the type of those nominal attributes, for example to
string, or from nominal to binary? Does it matter?
(3) I tried a linear regression on one of the numeric features, but I got an
out-of-memory error there too. Any hints?
I am using Weka version 3.6.1 for classification tasks, and I'm testing J48.
I have just two attributes in my dataset for now, "text" (string) and
"class" (nominal with 2 values). I generate ARFF files without any problem
for both the training and test datasets separately, and I also ensure that
the headers of the two files are the same.
Now when I call classifier.classifyInstance(testDataInstance) on the test
dataset, I get a java.lang.ArrayIndexOutOfBoundsException, but only
sometimes. This happened when the training set had 3 examples labelled as
class1 and 2 examples labelled as class2. But when there were 2 examples of
each class in the training dataset, it didn't give any error.
Here's the classifier output after training:
    J48 pruned tree
    it <= 0: IBREL (3.0)
    it > 0: GEN (2.0)
    Number of Leaves : 2
    Size of the tree : 3
And here's the output after trying to classify the test dataset:
    Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 31
        at weka.core.FastVector.elementAt(Unknown Source)
        at weka.core.Instances.attribute(Unknown Source)
        at weka.core.Instance.attribute(Unknown Source)
        at weka.classifiers.trees.j48.C45Split.whichSubset(Unknown Source)
        at weka.classifiers.trees.J48.classifyInstance(Unknown Source)
I don't understand why this is happening. I don't see any problem in the
contents of the line I added, so why should it work with 2 and 2 labelled
examples of the two classes and not with 3 and 2? In fact, for
different distributions of examples, it sometimes works and sometimes fails
as above. Is this a property of J48 trees themselves?
Thanks so much
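One possible explanation, offered as an assumption rather than a confirmed diagnosis: the split on "it" suggests that the "text" attribute was turned into word attributes before training (for example with StringToWordVector), and the stack trace shows the tree asking the test instance for attribute index 31. If the test file was converted separately, it can end up with fewer word attributes, or the same ones in a different order, so an index learned on the training data may not exist in the test data. That would also explain why the error depends on which examples are in the training set. The sketch below reproduces the effect in plain Java without Weka. The class VocabularyMismatchSketch and its methods are hypothetical names made up for illustration, and the one-split "tree" is a stand-in, not J48.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Weka.
final class VocabularyMismatchSketch {

    // Assign each distinct word a column index, in first-seen order.
    static Map<String, Integer> buildVocabulary(List<String> docs) {
        Map<String, Integer> vocab = new LinkedHashMap<>();
        for (String doc : docs) {
            for (String word : doc.toLowerCase().split("\\s+")) {
                if (!word.isEmpty() && !vocab.containsKey(word)) {
                    vocab.put(word, vocab.size());
                }
            }
        }
        return vocab;
    }

    // Turn one document into word counts using the given vocabulary.
    // Words missing from the vocabulary are ignored.
    static int[] vectorize(String doc, Map<String, Integer> vocab) {
        int[] counts = new int[vocab.size()];
        for (String word : doc.toLowerCase().split("\\s+")) {
            Integer index = vocab.get(word);
            if (index != null) {
                counts[index]++;
            }
        }
        return counts;
    }

    // Stand-in for a one-split tree: it remembers only a column index.
    static String classify(int[] vector, int splitIndex) {
        return vector[splitIndex] <= 0 ? "IBREL" : "GEN";
    }

    public static void main(String[] args) {
        List<String> train = Arrays.asList("the cat sat", "it is here now");
        String testDoc = "it rains";

        Map<String, Integer> trainVocab = buildVocabulary(train);
        int splitIndex = trainVocab.get("it"); // index learned from training data

        // Same vocabulary for training and test: the index is valid.
        System.out.println(classify(vectorize(testDoc, trainVocab), splitIndex));

        // Vocabulary rebuilt from the test data alone: fewer columns,
        // so the learned index can fall outside the array.
        Map<String, Integer> testVocab = buildVocabulary(Arrays.asList(testDoc));
        try {
            System.out.println(classify(vectorize(testDoc, testVocab), splitIndex));
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("Out of bounds: " + e.getMessage());
        }
    }
}
```

If this is indeed the cause, the usual remedy in Weka is to build the word dictionary once from the training data and reuse the same filter for the test data, for example by wrapping the filter and J48 together in weka.classifiers.meta.FilteredClassifier.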