Hi Wekalist,

Yes, I am using the cluster labels as class labels for a classification task. These strange results appear whether or not I specify k beforehand.
And yes, since I am using the cluster labels for classification, the classes I am trying to determine for the test set are meant to be represented by the clusters.

I also want to know:
Suppose I take the whole data set D in Weka, apply clustering to it, choose "Supplied test set" and select the test set T2 (which is 1/3 of D), and then click Start. I get cluster assignments for the T2 data.
What does this imply?
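
As far as I understand it, what the Cluster panel does with "Supplied test set" is roughly the following (only a sketch in the Weka Java API, assuming SimpleKMeans with k = 3 and placeholder file names D.arff and T2.arff rather than my actual setup):

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ClusterSuppliedTestSet {
    public static void main(String[] args) throws Exception {
        Instances d  = DataSource.read("D.arff");   // the whole data set D
        Instances t2 = DataSource.read("T2.arff");  // the supplied test set (1/3 of D)

        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(3);        // only if k is fixed beforehand
        km.buildClusterer(d);        // the clusters are learned from D only

        // Each T2 instance is then assigned to one of the clusters learned from D;
        // no new clusters are created for the test set itself.
        for (int i = 0; i < t2.numInstances(); i++) {
            System.out.println("instance " + i + " -> cluster " + km.clusterInstance(t2.instance(i)));
        }
    }
}

Is that a correct picture of what happens when I click Start with a supplied test set?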

Thanks and regards
Sujata Sinha


On Mon, Mar 21, 2016 at 6:48 PM, sujata sinha <sujatasinhacse@gmail.com> wrote:
Hi Wekalist,

Yes, I am using the cluster labels as class labels for a classification task. These strange results appear whether or not I specify k beforehand.
Yes, since I am using the cluster labels for classification, the classes I am trying to determine for the test set are meant to be represented by the clusters. I have attached my data set here...

Thanks and regards,
Sujata




On Mon, Mar 21, 2016 at 1:56 PM, Michael Hall <mik3hall@gmail.com> wrote:
> On Mar 21, 2016, at 2:29 AM, sujata sinha <sujatasinhacse@gmail.com> wrote:
>
> Hi WEKA list,
>
> I want to know if this way of classification helps.
>
> Case 1:
> Suppose I have a data set, D. I first cluster D and then divide the clustered D into training (T1) and test (T2) data sets. Now, I build a classifier model on T1 and use T2 through 'Supplied test set' in Weka. Then I get about 93% accuracy for the classifier.
>
> Case 2:
> On the other hand, if I divide D into T1 and T2 before clustering, cluster T1 and T2 independently, and build a classifier model on T1 to find the classes for T2, I get only 4% accuracy.
>
> What does each of the above cases imply?

You aren’t telling us anything about what the clusters might represent.
When you cluster, do you indicate the number of clusters you expect beforehand?
Is the class you are trying to determine supposed to be represented by one of the clusters?
The main implication would seem to be that clustering first gives you data that trains well for predicting the test set, while splitting and then clustering doesn't. This might be because clustering first gives you better 'stratification' for the split: the classes in the training set better match the classes in the test set when you cluster first. The drop in accuracy is extreme though, suggesting a really, really bad split in the second case. If the split is done randomly, even without stratification, I wouldn't expect that big a drop. Somehow it just seems like a terrible split, or one of the resulting datasets after the split is way too small to cluster accurately, something like that.
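
To make the first case concrete, a cluster-first pipeline might look roughly like this in the Weka Java API. This is only a sketch under assumptions of mine: SimpleKMeans with k = 3, a J48 classifier, a 2/3 : 1/3 random split and the placeholder file name D.arff stand in for whatever you actually used.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.AddCluster;

public class ClusterThenSplit {
    public static void main(String[] args) throws Exception {
        Instances d = DataSource.read("D.arff");

        // 1. Cluster the whole data set and append the cluster label as a new attribute.
        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(3);                        // placeholder value for k
        AddCluster add = new AddCluster();
        add.setClusterer(km);
        add.setInputFormat(d);
        Instances labeled = Filter.useFilter(d, add);
        labeled.setClassIndex(labeled.numAttributes() - 1);   // cluster label acts as the class

        // 2. Split the labeled data into T1 (training) and T2 (test).
        labeled.randomize(new Random(1));
        int trainSize = (int) Math.round(labeled.numInstances() * 2.0 / 3.0);
        Instances t1 = new Instances(labeled, 0, trainSize);
        Instances t2 = new Instances(labeled, trainSize, labeled.numInstances() - trainSize);

        // 3. Build the classifier on T1 and evaluate it on the supplied test set T2.
        J48 tree = new J48();
        tree.buildClassifier(t1);
        Evaluation eval = new Evaluation(t1);
        eval.evaluateModel(tree, t2);
        System.out.println("Accuracy on T2: " + eval.pctCorrect() + " %");
    }
}

In your second case the split in step 2 happens before the clustering in step 1, and the clustering is then run on T1 and T2 separately.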
If you use cross-validation rather than a training/test split, I think it provides some 'stratification' by default to get matching class representations. I might be wrong there.
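
If it helps, a quick way to check that yourself (again just a sketch, with a placeholder file name and J48 standing in for whatever classifier you use):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidate {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("D.arff");      // placeholder file name
        data.setClassIndex(data.numAttributes() - 1);     // assuming the class is the last attribute

        // crossValidateModel works on its own randomized copy of the data and,
        // for a nominal class, stratifies the folds so each fold keeps roughly
        // the same class proportions as the full data set.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}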
The accuracy in both cases seems strange, though. That clustering, an unsupervised process, gives you something that can be classified with 93% accuracy seems almost unrealistically good. A different split then resulting in 4% accuracy seems even more unrealistically bad.

Michael Hall



_______________________________________________
Wekalist mailing list
Send posts to: Wekalist@list.waikato.ac.nz
List info and subscription status: http://list.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html