In response to the request below, and to all others on the WEKA list who
have faced similar problems, I have written a simple utility to share
with the community.
You will find attached to this message the file TextDirectoryToArff.zip.
It contains the Java source and class files for the utility. The utility
is very simple: it takes a single command-line parameter, the path to a
directory on your computer, and creates a new dataset in ARFF format in
which each instance represents one file with a '.txt' extension found in
that directory. The dataset is dumped to standard out.
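I won't paste the whole source here, but the core of such a utility is only a few lines. The sketch below is illustrative, not the attached source itself: it prints the ARFF header, then one quoted instance per '.txt' file, with a simple escaping step so quotes and line breaks don't break the format.

```java
import java.io.File;
import java.nio.file.Files;

public class TextDirectoryToArff {

    // Collapse whitespace so each instance stays on one line, then
    // escape backslashes and single quotes for ARFF string quoting.
    static String escape(String s) {
        return s.replaceAll("\\s+", " ").trim()
                .replace("\\", "\\\\")
                .replace("'", "\\'");
    }

    public static void main(String[] args) throws Exception {
        File dir = new File(args[0]);

        // ARFF header: relation name plus the two string attributes
        System.out.println("@relation " + dir.getName());
        System.out.println("@attribute filename string");
        System.out.println("@attribute contents string");
        System.out.println("@data");

        // One instance per .txt file: 'filename','contents'
        for (File f : dir.listFiles((d, name) -> name.endsWith(".txt"))) {
            String text = new String(Files.readAllBytes(f.toPath()));
            System.out.println("'" + f.getName() + "','" + escape(text) + "'");
        }
    }
}
```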
Each instance in the dataset consists of two string attributes - the
name of the file, and the contents of the file. Basically this is
designed to be a preprocessing step before using the StringToWordVector
filter. The StringToWordVector filter counts the occurrences of words in
the files and turns the strings into a set of boolean attributes
representing the presence or absence of each word in the document. It
lets you limit the dictionary to however many of the most frequent words
you desire.
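To make the two steps concrete, here is a sketch of what the data might look like; the relation name, file names, and words are made up for illustration. The utility produces a two-string-attribute ARFF file:

```
@relation mydocuments
@attribute filename string
@attribute contents string
@data
'doc1.txt','the cat sat on the mat'
'doc2.txt','the dog chased the cat'
```

After StringToWordVector, each retained dictionary word becomes its own attribute indicating its presence or absence in the document, roughly:

```
@relation mydocuments_vector
@attribute filename string
@attribute cat numeric
@attribute dog numeric
@attribute mat numeric
@data
'doc1.txt',1,0,1
'doc2.txt',1,1,0
```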
Here is an example of how it is used. Let's say I have a collection
of documents sitting in a directory called /home/richard/mydocuments. I
can create an arff file called mydocuments.arff from this directory with
the following command:
java TextDirectoryToArff /home/richard/mydocuments > mydocuments.arff
Now, to convert my dataset into an arff file called
mydocuments_vector.arff containing word vectors suitable for training
learning schemes, I run:
java weka.filters.unsupervised.attribute.StringToWordVector -i
mydocuments.arff -o mydocuments_vector.arff -R 2 -w 100
(all on a single line)
The -R 2 option tells the filter to convert the second attribute.
The -w 100 option tells the filter to include only the 100 most frequent
words in the dictionary. Adjust to your requirements.
What this process doesn't handle is assigning a class to each document.
That part I leave up to you to figure out. In the worst case you could
just edit the arff file by hand to add a new class attribute.
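If you do edit the file by hand, the change is small: declare a nominal class attribute in the header and append a label to each data row. The attribute name and the labels here are made up for illustration:

```
@attribute class {spam,ham}
...
'doc1.txt','the cat sat on the mat',spam
```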
You have the source, so feel free to adjust the behaviour of this
utility in any way to suit your requirements.
NOTE: I'm using version 3-3-4 of WEKA; in older versions you will
probably find the filter under the name
weka.filters.StringToWordVectorFilter, if it exists at all. The newest
(unreleased, but available through cvs) version of the filter can output
word counts instead of just booleans, and renames the -w option to -W.
To see the options for your version of the filter, run it with the -h
option.
I am a student at Staffordshire University doing a project in text
mining. I have a collection of texts which are saved in .txt format.
After reading your book, I understood that I have to change the text to
the ARFF format. I saw some classes like 'weka.core.Instances'; does
that mean that I have to change my text to ARFF format using these
classes? Is there a way in which WEKA can convert my documents?
I will very much appreciate it if I can receive more information on this.
Thank you in advance