
One more comment on k-means (and mid-term marks)




The following contains the case 1 and case 1' results, this time with the
cluster dissimilarity measure (the sum of the absolute deviations of the
elements in each cluster from that cluster's center, i.e. its mean) noted
next to each clustering. This measure tells you, in a way, how good the
clustering is: the smaller the measure, the better the clustering.
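
For concreteness, here is one way to compute that measure -- a sketch in
Python rather than the Lisp used below, based on the observation that the
printed values match the sum of absolute deviations from each cluster's mean:

```python
def dissimilarity(clusters):
    """Sum, over all clusters, of the absolute deviations of each
    element from its own cluster's center (taken to be the mean)."""
    total = 0.0
    for cluster in clusters:
        center = sum(cluster) / len(cluster)
        total += sum(abs(x - center) for x in cluster)
    return total

# The final case-1 clustering from the transcript below:
final = [(61.5, 55), (48, 47.5, 47.5, 47.5),
         (38, 37, 35, 34.5, 32.5, 32.5, 32, 30),
         (29, 28, 27, 27, 26, 25.5, 22.5, 20.5, 19),
         (18, 17.5, 17, 13.5, 13, 11.5, 9.5, 8.5, 7, 4)]
print(round(dissimilarity(final), 5))  # 88.91667 -- the reported 88.91668, up to float rounding
```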

case 1
USER(218): (k-means mlist 5  :key #'mark-val)

>>>>((61.5) (38) (32) (26) (17.5))
>>>>((61.5 55) (48 47.5 47.5 47.5 38 37 35) (34.5 32.5 32.5 32 30 29) (28 27 27 26 25.5 22.5)
     (20.5 19 18 17.5 17 13.5 13 11.5 9.5 8.5 7 4)) --Dissimilarity Measure:113.07143
>>>>((61.5 55) (48 47.5 47.5 47.5 38) (37 35 34.5 32.5 32.5 32 30 29) (28 27 27 26 25.5 22.5 20.5)
     (19 18 17.5 17 13.5 13 11.5 9.5 8.5 7 4)) --Dissimilarity Measure:97.791214
>>>>((61.5 55) (48 47.5 47.5 47.5) (38 37 35 34.5 32.5 32.5 32 30) (29 28 27 27 26 25.5 22.5 20.5 19)
     (18 17.5 17 13.5 13 11.5 9.5 8.5 7 4)) --Dissimilarity Measure:88.91668
>>>>((61.5 55) (48 47.5 47.5 47.5) (38 37 35 34.5 32.5 32.5 32 30) (29 28 27 27 26 25.5 22.5 20.5 19)
     (18 17.5 17 13.5 13 11.5 9.5 8.5 7 4)) --Dissimilarity Measure:88.91668

case 1'
USER(219): (k-means mlist-r 5  :key #'mark-val)

>>>>((35) (32) (26) (18) (17))
>>>>((61.5 55 48 47.5 47.5 47.5 38 37 35 34.5) (32.5 32.5 32 30 29) (28 27 27 26 25.5 22.5) (20.5 19 18 17.5)
     (17 13.5 13 11.5 9.5 8.5 7 4)) --Dissimilarity Measure:117.0
>>>>((61.5 55 48 47.5 47.5 47.5) (38 37 35 34.5 32.5 32.5 32 30 29) (28 27 27 26 25.5 22.5) (20.5 19 18 17.5 17)
     (13.5 13 11.5 9.5 8.5 7 4)) --Dissimilarity Measure:82.19365
>>>>((61.5 55 48 47.5 47.5 47.5) (38 37 35 34.5 32.5 32.5 32 30) (29 28 27 27 26 25.5 22.5) (20.5 19 18 17.5 17)
     (13.5 13 11.5 9.5 8.5 7 4)) --Dissimilarity Measure:80.37619
>>>>((61.5 55 48 47.5 47.5 47.5) (38 37 35 34.5 32.5 32.5 32) (30 29 28 27 27 26 25.5 22.5) (20.5 19 18 17.5 17)
     (13.5 13 11.5 9.5 8.5 7 4)) --Dissimilarity Measure:78.55476
>>>>((61.5 55 48 47.5 47.5 47.5) (38 37 35 34.5 32.5 32.5 32) (30 29 28 27 27 26 25.5) (22.5 20.5 19 18 17.5 17)
     (13.5 13 11.5 9.5 8.5 7 4)) --Dissimilarity Measure:78.571434
>>>>((61.5 55 48 47.5 47.5 47.5) (38 37 35 34.5 32.5 32.5 32) (30 29 28 27 27 26 25.5) (22.5 20.5 19 18 17.5 17)
     (13.5 13 11.5 9.5 8.5 7 4)) --Dissimilarity Measure:78.571434


You will note
 1. that the dissimilarity measure is reduced from iteration to
    iteration in each run (well, almost: in case 1' it inches up
    slightly, 78.55476 -> 78.571434, at the final reassignment, so
    the decrease is not strictly guaranteed for this measure)

 2. that the lowest dissimilarity attained depends on the original
    cluster centers. This is a consequence of the fact that k-means is
    a greedy algorithm: it converges to a local optimum and is not
    guaranteed to find the clustering with the globally lowest
    dissimilarity.

 3. that, nicely enough, the clusters found in case 1' are better
    (according to the dissimilarity measure: 78.571434 vs. 88.91668)
    than those found in case 1 (because this means that giving more
    As is in fact a better idea according to k-means ;-)
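
Points 1 and 2 can be seen in miniature with a toy re-implementation
(a rough Python sketch of standard 1-D Lloyd iteration with made-up data,
not the Lisp k-means used above): the same data with two different initial
centers converges to two different clusterings with different final measures.

```python
def dissimilarity(clusters):
    # Sum of absolute deviations from each cluster's mean.
    return sum(sum(abs(x - sum(c) / len(c)) for x in c) for c in clusters)

def kmeans_1d(data, centers, max_iters=100):
    """Lloyd iteration in one dimension: assign each point to the
    nearest center, then move each center to its cluster's mean."""
    for _ in range(max_iters):
        clusters = [[] for _ in centers]
        for x in data:
            best = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            clusters[best].append(x)
        clusters = [c for c in clusters if c]       # drop any emptied cluster
        new_centers = [sum(c) / len(c) for c in clusters]
        if new_centers == centers:                  # converged: centers stopped moving
            return clusters
        centers = new_centers
    return clusters

data = [0, 1, 2, 3, 19, 40]
a = kmeans_1d(data, [0, 40])   # one choice of initial centers
b = kmeans_1d(data, [0, 19])   # another choice, same data
print(a, dissimilarity(a))     # [[0, 1, 2, 3, 19], [40]] 28.0
print(b, dissimilarity(b))     # [[0, 1, 2, 3], [19, 40]] 25.0
```

Both runs converge, but only the second finds the lower-dissimilarity
clustering -- the greedy, initialization-dependent behavior of point 2.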

Here is a little puzzle to ponder over:
 How would you go about making k-means find the globally best
clustering according to the dissimilarity measure?

Rao
[Mar 23, 2001]