> I'm interested in this claim, as I've heard it before but I am not sure
> I agree. I could view k-means as identifying a combination of
> (hyper)spherical regions in the feature space, but I see nothing in
> k-means that requires that those regions are populated with Gaussian
> distributions. Because of its hard item assignment, there is nothing
> implicit in the algorithm that says anything about the distribution of
> points within a cluster, just that each point is assigned to its
> "nearest" model (center). Can you explain this further?
There *is* a model-based explanation of KMeans. Following up on what Tom
pointed out, KMeans can be considered as fitting a mixture of Gaussians to
a dataset, under certain assumptions:
- Uniform prior distribution over the mixture components
- Covariance of each Gaussian is eI, where e -> 0+ and I is the identity matrix
This model is described by Michael Kearns et al. in their UAI 97 paper "An
Information-Theoretic Analysis of Hard and Soft Assignment Methods for
Clustering".
We have an equivalent model-based analysis of KMeans in our ICML 01 paper
"Semi-supervised Clustering by Seeding", where the relation between KMeans
and EM is described in more detail.
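To make the correspondence concrete, here is a minimal sketch in plain
Java (hypothetical arrays, not code from either paper) of one
hard-assignment EM iteration under this model; in the e -> 0+ limit the
E-step picks the nearest center and the M-step recomputes centroids,
which is exactly the standard KMeans update:

// One iteration of hard-assignment EM under the eI mixture model.
static void kmeansStep(double[][] x, double[][] centers, int[] assign) {
    int n = x.length, k = centers.length, d = x[0].length;
    // E-step: as e -> 0+ the posterior over components concentrates
    // on the nearest center (priors are uniform), so assignment is hard.
    for (int i = 0; i < n; i++) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int j = 0; j < k; j++) {
            double dist = 0.0;
            for (int a = 0; a < d; a++) {
                double diff = x[i][a] - centers[j][a];
                dist += diff * diff;
            }
            if (dist < bestDist) { bestDist = dist; best = j; }
        }
        assign[i] = best;
    }
    // M-step: the maximum-likelihood mean of each component given the
    // hard assignments is just the centroid of its cluster.
    for (int j = 0; j < k; j++) {
        double[] sum = new double[d];
        int count = 0;
        for (int i = 0; i < n; i++) {
            if (assign[i] != j) continue;
            count++;
            for (int a = 0; a < d; a++) sum[a] += x[i][a];
        }
        if (count > 0)
            for (int a = 0; a < d; a++) centers[j][a] = sum[a] / count;
    }
}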
> This is one of the reasons I do not consider k-means to be
> "model-based"; there is no model of the data, aside from the collection
> of cluster centers. You could argue that this is a model, but it
> certainly isn't generative. For example, the same set of cluster
> parameters (centers) could describe Gaussian data, or
> uniformly-distributed data, or data shaped in rings that fall inside
> cluster boundaries, or anything. Therefore, I find the view of k-means
> as a data compression/summary process (since you lose the full data
> distribution) more apt.
It is true that KMeans does data compression and can be viewed as a vector
quantization algorithm. But according to the model I explained above,
KMeans can also be formally viewed as a generative clustering algorithm.
This view can be helpful at times, especially for proving convergence of
the algorithm.
To follow up on the discussion of clustering evaluation, we typically find
it useful to evaluate the results of clustering algorithms that we
investigate using both unsupervised and supervised metrics:
- Unsupervised metrics include (1) Log likelihood (as mentioned by Tom)
of the data given the model. Here one could either evaluate the log
likelihood on the held-out test set, or on the whole dataset. For KMeans,
measuring the complete log likelihood on the whole data, given the model,
would be equivalent to measuring the KMeans objective function on the data
(details of this are shown in our ICML 01 paper).
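To sketch why: under the eI model above, the density of a point x
hard-assigned to mean mu is proportional to exp(-||x - mu||^2 / (2e)), so
the complete log likelihood of the data is
  -(1/(2e)) * sum_i ||x_i - mu_c(i)||^2 + constant,
a monotonically decreasing function of the KMeans sum-of-squared-distances
objective: maximizing one minimizes the other.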
- Supervised metrics include (1) Normalized mutual information (NMI),
which determines the amount of statistical information shared by the
random variables representing the cluster assignments and the user-labeled
class assignments of the data points. The NMI metric measures how closely
the clustering algorithm could reconstruct the underlying label
distribution in the data. More details of this metric can be found in the
AAAI 2000 Workshop paper by Alex Strehl et al. titled "Impact of
Similarity Measures on Web-page Clustering". Byron Dom also gives a good
explanation of the MI metric in his IBM Tech Report "An Information-
Theoretic External Cluster-Validity Measure". (2) Pairwise F-measure, which
is defined as:
Pairwise Precision = #PairsCorrectlyPredictedInSameCluster / #TotalPairsPredictedInSameCluster
Pairwise Recall = #PairsCorrectlyPredictedInSameCluster / #TotalPairsInSameClass
Pairwise F-measure = 2 * (Precision * Recall) / (Precision + Recall)
Pairwise F-measure is a generalization of the Rand-Index metric used in
recent semi-supervised clustering papers. Note that both NMI and Pairwise
F-measure are external evaluation metrics, which compare the clustering to
a known underlying class labeling of the data.
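As an illustration, here is a minimal sketch of computing the pairwise
measures from two integer label arrays (plain Java; the arrays and the
method name are hypothetical, not from any of the papers above):

// Pairwise precision/recall/F-measure over all pairs of points,
// comparing cluster assignments to the true class labels.
static double pairwiseFMeasure(int[] cluster, int[] label) {
    long sameBoth = 0;    // pairs in the same cluster and the same class
    long sameCluster = 0; // pairs predicted to be in the same cluster
    long sameClass = 0;   // pairs actually in the same class
    for (int i = 0; i < cluster.length; i++) {
        for (int j = i + 1; j < cluster.length; j++) {
            boolean c = cluster[i] == cluster[j];
            boolean l = label[i] == label[j];
            if (c) sameCluster++;
            if (l) sameClass++;
            if (c && l) sameBoth++;
        }
    }
    if (sameCluster == 0 || sameClass == 0) return 0.0; // degenerate case
    double precision = (double) sameBoth / sameCluster;
    double recall = (double) sameBoth / sameClass;
    return 2 * precision * recall / (precision + recall);
}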
When comparing the performance of clustering algorithm "foo" to clustering
algorithm "bar" on a given domain, if "foo" is (statistically
significantly) better than "bar" on both the unsupervised and supervised
metrics mentioned above, then one can reasonably conclude with high
confidence that "foo" performs better than "bar" in the domain being
considered.
Note that the supervised metrics outlined above are typically useful for
comparing the performance of a new clustering algorithm "foo" to an
existing algorithm "bar" when we have access to an evaluation dataset with
a known underlying class distribution. In actual clustering situations
where this is not available, using Tom's idea of calculating log
likelihood on the held-out test set to compare two models seems to be a
good method of evaluation.
I know Weka implements cross validation via the '-x' argument. However,
if I need to do multiple cross validations to validate my algorithm, what
argument should I use, if there is one?
For example, 3-trial 10-fold cross validation means I shuffle my instances
before each of the 3 runs of 10-fold cross validation, so that the 10
subsets differ between runs. Can I achieve this using Weka?
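Something like the following is what I have in mind, assuming the -s
random-seed option of Weka's command-line evaluation reshuffles the folds
between runs (mydata.arff is a placeholder):

java weka.classifiers.bayes.NaiveBayes -t mydata.arff -x 10 -s 1
java weka.classifiers.bayes.NaiveBayes -t mydata.arff -x 10 -s 2
java weka.classifiers.bayes.NaiveBayes -t mydata.arff -x 10 -s 3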
Love and work are essential for human beings' well-being.
Department of Computer Science
University of Vermont
Burlington, Vermont, USA 05405
Tel: 802 656 1934
Dear Weka Users,
I have some doubts regarding the KMeans algorithm; I hope you can help me.
How do I validate the clustering produced by the SimpleKMeans algorithm? I
searched in the manual and in the tool, but I am not getting an appropriate
result.
I use my training set and make a 30% split, and I would like to evaluate my
results with respect to the test set rather than the training set.
I need this information urgently; thanks in advance to anyone who can
answer me.
Awaiting your reply,
Cassiana Fagundes da Silva
Programa Interdisciplinar de Pós-Graduação - PIPCA
Master's student in Applied Computing - UNISINOS
Suppose you have train and test data; then do, for example:
java weka.classifiers.bayes.NaiveBayes -t qtr.arff -T qte.arff -m qcost.cost
qcost.cost is a text file in the following format. It is also possible to
create this "cost" file with the Weka GUI: Classify -> More options ->
Cost-sensitive evaluation.
% Rows Columns
% Matrix elements
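For example, a complete 2x2 file in this format (illustrative values only,
where -5 rewards a correctly identified premium customer, matching the
profit convention described below) could look like:

% Rows Columns
2 2
% Matrix elements
0.0   1.0
1.0  -5.0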
For me, 0 denotes the normal customers and 1 the premium customers, and in
this task the challenge is especially to find the premium customers.
Now I get an average profit of 0.48 Euro/customer for the correct
predictions on the test set - I define -5 as a profit of 5 Euro, because
Weka speaks of cost, and so negative cost is profit.
=== Evaluation Cost Matrix ===
Time taken to build model: 1.19 seconds
Time taken to test model on training data: 3.53 seconds
=== Error on training data ===
Correctly Classified Instances 7449 92.7993 %
Incorrectly Classified Instances 578 7.2007 %
Kappa statistic 0.405
Total Cost -4515
Average Cost -0.5625
Mean absolute error 0.0966
Root mean squared error 0.2373
Relative absolute error 61.5832 %
Root relative squared error 84.7826 %
Total Number of Instances 8027
=== Confusion Matrix ===
a b <-- classified as
7223 116 | a = Kunde
462 226 | b = Premiumkunde
=== Error on test data ===
Correctly Classified Instances 66125 91.525 %
Incorrectly Classified Instances 6123 8.475 %
Kappa statistic 0.3902
Total Cost -34981
Average Cost -0.4842
Mean absolute error 0.1063
Root mean squared error 0.2559
Relative absolute error 63.4592 %
Root relative squared error 85.7812 %
Total Number of Instances 72248
=== Confusion Matrix ===
a b <-- classified as
63808 1319 | a = Kunde
4804 2317 | b = Premiumkunde
Cupidil <cupidil(a)yahoo.com> wrote on 28.08.03 12:07:37:
> Can you PLEASE tell me what is the data that I should
> insert to the cost matrix file so the classifier will
> not classify/return 1 when it should be 0. (Although
> returning 0 when it should have been 1 is acceptable)
> I'm using a command-line Weka.
> I wasn't able to create a valid cost matrix file that
> Weka will accept.
> Can you please help me?
But it is common to validate the number of clusters - i.e., calculate some
function of between- and within-cluster variance, and stop when that reaches
a min or max. It's also possible to compare training set clustering to
clustering of random data with similar parameters.
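For example, one could sweep k with Weka's SimpleKMeans and watch the
within-cluster squared error for an elbow (a sketch, assuming the
setNumClusters/getSquaredError methods and a placeholder data.arff):

import java.io.BufferedReader;
import java.io.FileReader;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;

// Print k and the within-cluster squared error for k = 1..10;
// look for the "elbow" where adding clusters stops helping much.
public class ElbowSweep {
    public static void main(String[] args) throws Exception {
        Instances data = new Instances(
            new BufferedReader(new FileReader("data.arff")));
        for (int k = 1; k <= 10; k++) {
            SimpleKMeans km = new SimpleKMeans();
            km.setNumClusters(k);
            km.buildClusterer(data);
            System.out.println(k + "\t" + km.getSquaredError());
        }
    }
}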
From: Kiri Wagstaff [mailto:Kiri.Wagstaff@jhuapl.edu]
Sent: Thursday, September 11, 2003 6:17 AM
Subject: Re: [Wekalist] Kmeans Information
On Mon, 8 Sep 2003, Cassiana Fagundes Silva wrote:
> I have some doubts regarding the KMeans algorithm; I hope you can help
> me. How do I validate the clustering produced by the SimpleKMeans
> algorithm? I searched in the manual and in the tool, but I am not
> getting an appropriate result. I use my training set and make a 30%
> split, and I would like to evaluate my results with respect to the
> test set rather than the training set.
What do you mean by "validate the clustering"? Clustering is not
generally used in a training/test environment, because it does not
generate a classifier that can be "tested" on other data. The clusters
are descriptive, not generative (that is, they tell you about
characteristics of your data set, but they don't tell you how to
label/assign/classify new data, because they don't know anything about the
true data labels). You simply cluster all of your items at the same time.
You can evaluate the accuracy of this clustering by comparing the cluster
assignments to the data labels, if you have them.
For completeness, I should mention that there are variants on clustering
methods that do permit you to build a set of clusters, treat them as a
model, and then see where a new item would end up. However, this doesn't
classify the item (unless you've used data labels to assign cluster labels
based on the majority of items in each cluster, or something like that),
and it's important to remember that the addition of new items may mean
that the clusters are no longer "accurate" with respect to the data. In
general, evaluating clustering results is challenging, since a) if you
have the true labels to exactly compare to, you probably don't need a
clustering algorithm in the first place, and b) clustering is a
descriptive method, so there's no guarantee that the clusters it finds
will be the ones you hoped it would find.
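To illustrate the majority-label idea mentioned above, here is a minimal
sketch in plain Java (the arrays and the method name are hypothetical):

// Map each cluster to the majority class label among its members, then
// measure how many points that mapping labels correctly.
static double clusterAccuracy(int[] cluster, int[] label,
                              int numClusters, int numClasses) {
    int[][] counts = new int[numClusters][numClasses];
    for (int i = 0; i < cluster.length; i++)
        counts[cluster[i]][label[i]]++;
    int correct = 0;
    for (int c = 0; c < numClusters; c++) {
        int best = 0;
        for (int l = 1; l < numClasses; l++)
            if (counts[c][l] > counts[c][best]) best = l;
        correct += counts[c][best]; // members matching the majority label
    }
    return (double) correct / cluster.length;
}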
Kiri L. Wagstaff, Ph.D. (kiri.wagstaff(a)jhuapl.edu) \|/
Science Applications Group, Space Department -O-
The Johns Hopkins University Applied Physics Lab /|\
Hi weka users
does anyone know how the discretization is performed
in the M5 tree? From the book "Data Mining, Witten,
Frank" I understand how discretization is performed
based on class values (pages 238-246), but how does it
work in the case of numeric prediction?
The weka.filters.DiscretizeFilter refers to Fayyad & Irani's method, but
it seems that this method only works in the case of classification.
I am attempting to run the NaiveBayes classifier with the -p option and
the output looks unusual.
I'm running a subset of the soybean-large dataset where the training file
has instances from only two classes and the testing dataset has instances
from all 19.
The output contains predictions using a class which is not represented
in the training file.
I have attached a tar ball containing the training, testing, and output
files.
Any help would be greatly appreciated,
I have a very generic question regarding SOMs. Once you have built a
SOM using your training data, how can you use it to classify unseen data?
Hi, I think it would be better to put some links about SVM-Weka integration
(libs, wrappers...) on the Weka web pages.
___________________________Arturo Montejo Raez
ETT-EC-EX (CERN) http://cern.ch/amontejo