[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

clustering the mid-term marks using k-means




The midterm has been graded. The stats are: avg: 29  standard
deviation:14.5
33 people took the exam which was for 63 points (not 65, as i seem to
have miscounted)
There are  1 above 60
one between 50-60 
4 between 40-50
8 between 40-30
8 between 30-20
7 between 20-10
4 below 10

I will return the exams at the end of the class on Monday. don't ask
me for marks before that. 

Now I am sure you all want to know what would be "A", "B" etc cut offs 
just for this exam.

I thought we can use the k-means algorithm to do clustering. After
all, the usual problem with k-means, that it requires us to
pre-specify number of clusters is not really a problem since we are
going to have a fixed number of letter grades anyway.

So, I wrote up a little (lisp) program for k-means and ran it on the
list of 33 individual scores. Here, for your edification are the
results:


Case 1: 5 clusters (A,B,C,D,E)

USER(164): (k-means mlist 5 :key #'mark-val)

>>>>((61.5) (38) (32) (26) (17.5))
>>>>((61.5 55) (48 47.5 47.5 47.5 38 37 35) (34.5 32.5 32.5 32 30 29) (28 27 27 26 25.5 22.5)
     (20.5 19 18 17.5 17 13.5 13 11.5 9.5 8.5 7 4))
>>>>((61.5 55) (48 47.5 47.5 47.5 38 35) (37 35 34.5 32.5 32.5 32 30 29) (28 27 27 26 25.5 22.5 20.5)
     (19 18 17.5 17 13.5 13 11.5 9.5 8.5 7 4))
>>>>((61.5 55) (48 47.5 47.5 47.5 35) (38 37 35 34.5 32.5 32.5 32 30) (29 28 27 27 26 25.5 22.5 20.5 19)
     (18 17.5 17 13.5 13 11.5 9.5 8.5 7 4))
>>>>((61.5 55) (48 47.5 47.5 47.5 35) (38 37 35 34.5 32.5 32.5 32 30) (29 28 27 27 26 25.5 22.5 20.5 19)
     (18 17.5 17 13.5 13 11.5 9.5 8.5 7 4))

;;Notice that the k-means starts with some default cluster centers and 
;;iterates until clusters stablize. So the last ">>>" aboe si the
;;cluster we get. 
;;(if we stopped with the first iteration people with 34.5 would have
;;gotten A ;-)

Case 2: 4 clusters (A,B,C,D) (assume I decided not to give Es)
USER(162): (k-means mlist 4 :key #'mark-val)

>>>>((61.5) (35) (27) (17.5))
>>>>((61.5 55) (48 47.5 47.5 47.5 38 37 35 34.5 32.5 32.5 32) (30 29 28 27 27 26 25.5 22.5)
     (20.5 19 18 17.5 17 13.5 13 11.5 9.5 8.5 7 4))
>>>>((61.5 55) (48 47.5 47.5 47.5 38 37 35 34.5 32) (32.5 32.5 32 30 29 28 27 27 26 25.5 22.5 20.5)
     (19 18 17.5 17 13.5 13 11.5 9.5 8.5 7 4))
>>>>((61.5 55) (48 47.5 47.5 47.5 38 37 35 34.5 32) (32.5 32.5 32 30 29 28 27 27 26 25.5 22.5 20.5)
     (19 18 17.5 17 13.5 13 11.5 9.5 8.5 7 4))


Case 3: 4 clusters (A,B,C,D) but we remove the highest (61.5) from
consideration (assume that we give that person an A+ or just physically push
that person out of the class for the sake of collective happiness). 


USER(125): (k-means mlist2 4 :key #'mark-val)

>>>>((55) (34.5) (27) (17))
>>>>((55 48 47.5 47.5 47.5) (38 37 35 34.5 32.5 32.5 32) (30 29 28 27 27 26 25.5 22.5)
     (20.5 19 18 17.5 17 13.5 13 11.5 9.5 8.5 7 4))
>>>>((55 48 47.5 47.5 47.5) (38 37 35 34.5 32.5 32.5 32) (30 29 28 27 27 26 25.5 22.5 20.5)
     (19 18 17.5 17 13.5 13 11.5 9.5 8.5 7 4))
>>>>((55 48 47.5 47.5 47.5) (38 37 35 34.5 32.5 32.5 32) (30 29 28 27 27 26 25.5 22.5 20.5)
     (19 18 17.5 17 13.5 13 11.5 9.5 8.5 7 4))

As expected, with 61.5 tossed out of the class, a lot more people get
As ;-)


Case 1': Clustering with a different set of initial centers
 we will repeat case 3, except we start with different centers

USER(172): (k-means mlist-r 5 :key #'mark-val)

>>>>((35) (32) (26) (18) (17))
>>>>((61.5 55 48 47.5 47.5 47.5 38 37 35 34.5) (32.5 32.5 32 30 29) (28 27 27 26 25.5 22.5) (20.5 19 18 17.5)
     (17 13.5 13 11.5 9.5 8.5 7 4))
>>>>((61.5 55 48 47.5 47.5 47.5) (38 37 35 34.5 32.5 32.5 32 30 29) (28 27 27 26 25.5 22.5) (20.5 19 18 17.5 17)
     (13.5 13 11.5 9.5 8.5 7 4))
>>>>((61.5 55 48 47.5 47.5 47.5) (38 37 35 34.5 32.5 32.5 32 30) (29 28 27 27 26 25.5 22.5) (20.5 19 18 17.5 17)
     (13.5 13 11.5 9.5 8.5 7 4))
>>>>((61.5 55 48 47.5 47.5 47.5) (38 37 35 34.5 32.5 32.5 32) (30 29 28 27 27 26 25.5 22.5) (20.5 19 18 17.5 17)
     (13.5 13 11.5 9.5 8.5 7 4))
>>>>((61.5 55 48 47.5 47.5 47.5) (38 37 35 34.5 32.5 32.5 32) (30 29 28 27 27 26 25.5) (22.5 20.5 19 18 17.5 17)
     (13.5 13 11.5 9.5 8.5 7 4))
>>>>((61.5 55 48 47.5 47.5 47.5) (38 37 35 34.5 32.5 32.5 32) (30 29 28 27 27 26 25.5) (22.5 20.5 19 18 17.5 17)
     (13.5 13 11.5 9.5 8.5 7 4))

Compare this to case 1--see that we now converged to an entirely
different cluster just because we started from new centers


case 2': repeat case 2 with different centers

USER(173): (k-means mlist-r 4 :key #'mark-val)

>>>>((35) (22.5) (18) (9.5))
>>>>((61.5 55 48 47.5 47.5 47.5 38 37 35 34.5 32.5 32.5 32 30 29) (28 27 27 26 25.5 22.5 20.5) (19 18 17.5 17)
     (13.5 13 11.5 9.5 8.5 7 4))
>>>>((61.5 55 48 47.5 47.5 47.5 38 37 35 34.5) (32.5 32.5 32 30 29 28 27 27 26 25.5 22.5) (20.5 19 18 17.5 17)
     (13.5 13 11.5 9.5 8.5 7 4))
>>>>((61.5 55 48 47.5 47.5 47.5 38 37) (35 34.5 32.5 32.5 32 30 29 28 27 27 26 25.5) (22.5 20.5 19 18 17.5 17)
     (13.5 13 11.5 9.5 8.5 7 4))
>>>>((61.5 55 48 47.5 47.5 47.5) (38 37 35 34.5 32.5 32.5 32 30 29 28 27 27 26 25.5) (22.5 20.5 19 18 17.5 17)
     (13.5 13 11.5 9.5 8.5 7 4))
>>>>((61.5 55 48 47.5 47.5 47.5) (38 37 35 34.5 32.5 32.5 32 30 29 28 27 27 26 25.5) (22.5 20.5 19 18 17.5 17)
     (13.5 13 11.5 9.5 8.5 7 4))

;;interesting how going from 5 to four just merges old B and C...


case 3': repeat case 3 with different centers (we just removed 61.5)

USER(181): (k-means mlist-r-2 4 :key #'mark-val)

>>>>((35) (30) (22.5) (9.5))
>>>>((55 48 47.5 47.5 47.5 38 37 35 34.5 32.5 32.5) (32 30 29 28 27 27) (26 25.5 22.5 20.5 19 18 17.5 17)
     (13.5 13 11.5 9.5 8.5 7 4))
>>>>((55 48 47.5 47.5 47.5 38 37) (35 34.5 32.5 32.5 32 30 29 28 27 27 26 25.5) (25.5 22.5 20.5 19 18 17.5 17)
     (13.5 13 11.5 9.5 8.5 7 4))
>>>>((55 48 47.5 47.5 47.5 38) (37 35 34.5 32.5 32.5 32 30 29 28 27 27 26 25.5) (25.5 22.5 20.5 19 18 17.5 17)
     (13.5 13 11.5 9.5 8.5 7 4))
>>>>((55 48 47.5 47.5 47.5) (38 37 35 34.5 32.5 32.5 32 30 29 28 27 27 26 25.5) (25.5 22.5 20.5 19 18 17.5 17)
     (13.5 13 11.5 9.5 8.5 7 4))
>>>>((55 48 47.5 47.5 47.5) (38 37 35 34.5 32.5 32.5 32 30 29 28 27 27 26 25.5) (25.5 22.5 20.5 19 18 17.5 17)
     (13.5 13 11.5 9.5 8.5 7 4))

Notice the non-local and somewhat non-intuitive effect on the
clusters.. A cutoff still remains 47.5, but B now got chopped up.. 

You would have thought more people will get better grades if the
highest guy gets thrown out... and yet... 

In summary, you wouldn't trust your grades to K-means algorithm based
clustering because:

1. its clusters are sensitive to intial centers

2. its clusters are sensitive to outliers

3. deletion of elements might have non local effects on its clusters


I know I know. You see this whole thing as a joke in bad taste....

Rao
[Mar 23, 2001]
---------- 
Subbarao Kambhampati 
Professor, Dept. of Comp Sci. and Engg. Arizona State University, 
Tempe, AZ 85287-5406  rao@asu.edu (email)
480-965-0113 (Phone)                         
480-965-2751 (FAX) 
WWW: http://rakaposhi.eas.asu.edu