Re: [Wekalist] Mined large data sets before with Weka?
by Hans van Rijnberk, Assort Vision, Utrecht
Is performance measured on the same independent instances (instances never
used in any training set)?
At 14:49 16-6-2004 +0100, Aidan Finn wrote:
>I have used WEKA with 50,000 attributes and 80,000 instances. It is very
>slow with this many attributes and instances, so I used the information gain
>filter to reduce the data to 5,000 attributes. This was a highly skewed
>dataset: reducing it to a few thousand attributes didn't adversely affect
>performance, but reducing the number of instances resulted in lower
>precision. We tried several ways of selecting smaller subsets of instances
>for training, but on our datasets they all resulted in lower performance. On
>less skewed datasets you can probably do just as well with a small subset of
>examples. There is a lot of work in this area; look for papers on active
>learning and selective sampling.
>I used WEKA with 1.9GB allocated to the Java interpreter, which is the
>maximum you can allocate to the Sun Java interpreter.
>Hans van Rijnberk, Assort Vision, Utrecht wrote:
>> I don't know any specifics, but is it useful to produce classifiers based
>> on so much data?
>> What will be learned from it? Does it help identify future cases better or
>> worse than selecting intelligent subsets would?
>> Aren't you overfitting?
>> At 08:43 14-6-2004 +0000, Chun Phua wrote:
>>>I want to find out if anyone has used Weka to mine any data set bigger
>>>than 65,000 examples with 100 attributes. Weka is running happily with
>>>20,000 examples with 70 numerical attributes, though. Currently I am
>>>getting the Java OutOfMemory error, even after setting -Xmx1024M for the
>>>Java engine (as stated in a Weka tutorial).
>>>Does anyone have any suggestions or experiences to share with me?
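For anyone hitting the same OutOfMemory error: the information gain filtering
described above can be run once, up front, with a larger heap, and the
classifier then trained on the reduced file. A rough command-line sketch (the
class and option names are taken from the Weka 3.4 distribution and may
differ in other versions; weka.jar, input.arff, reduced.arff and the
"-N 5000" cutoff are placeholders to adapt, and InfoGainAttributeEval assumes
a nominal class attribute):

  java -Xmx1900m -cp weka.jar \
      weka.filters.supervised.attribute.AttributeSelection \
      -E "weka.attributeSelection.InfoGainAttributeEval" \
      -S "weka.attributeSelection.Ranker -N 5000" \
      -c last -i input.arff -o reduced.arff

-Xmx1900m asks the JVM for roughly the 1.9GB Aidan mentions (about the
ceiling for a 32-bit Sun JVM); the supervised AttributeSelection filter
scores each attribute by information gain with respect to the class
(-c last), and the Ranker search keeps only the 5,000 best-ranked attributes
when it writes reduced.arff.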
Hans van Rijnberk