I don't know if I got it right. You said that the words are sorted by
frequency, but does this frequency have anything to do with the method used
to generate the weights of the words? For example, if I'm using TFIDF, would
the words with the highest weights be selected?
Or is it simply the frequency (the number of times the word appears in the
set of documents)?
The problem is that, if the words are selected according to the method and I
simply choose boolean to represent the 'bag of words', how would it select a
On 5/24/07, Peter Reutemann <fracpete(a)waikato.ac.nz> wrote:
I'm using the filter StringToWordVector to generate the "bag of words" and
then compare some classification methods. I didn't realize until now that
there is a parameter (-W) that says how many words it should keep. My
question is how the words are selected, i.e., how the more relevant words
are kept.
This is what I can tell from a quick look at the code:
The dictionary will have at most -W words that have at least the
frequency specified with -M. The list of words is sorted by number of
occurrences, i.e., less frequent words will be kicked out, in case the
minimum frequency -M would produce too many words.
With -O one can throw all words together from all the classes; by
default, the dictionaries are generated on a per-class basis.
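
To illustrate the selection described above, here is a minimal Python sketch.
It is not Weka's code: the function name select_dictionary and its parameters
max_words and min_freq are made up for this example and only stand in for the
roles of -W and -M. It also assumes simple whitespace tokenization and a
single dictionary for all classes (roughly the -O case); the per-class
default is not modeled.

```python
from collections import Counter


def select_dictionary(documents, max_words, min_freq=1):
    """Return at most max_words words that occur at least min_freq times.

    Hypothetical helper: max_words plays the role of -W and min_freq the
    role of -M as described in the message above.
    """
    counts = Counter()
    for doc in documents:
        # Assumed tokenization: lowercase, split on whitespace.
        counts.update(doc.lower().split())

    # Drop words below the minimum frequency.
    frequent = [(word, n) for word, n in counts.items() if n >= min_freq]

    # Sort by number of occurrences, most frequent first; ties are broken
    # alphabetically here only to make the result deterministic.
    frequent.sort(key=lambda pair: (-pair[1], pair[0]))

    # Keep at most max_words entries; less frequent words are cut.
    return [word for word, _ in frequent[:max_words]]


docs = ["the cat sat", "the cat ran", "the dog ran away"]
print(select_dictionary(docs, max_words=3, min_freq=2))
```

In this sketch the selection depends only on the raw occurrence counts, as in
the description above, so whether the resulting vectors are later encoded as
boolean values or TFIDF weights does not enter into it.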
Peter Reutemann, Dept. of Computer Science, University of Waikato, NZ
Ph. +64 (7) 858-5174
Wekalist mailing list