Learning (from a final exam):

George Costanza wants to throw a party and he wants it to be a success. Over the years he has thrown a fair number of parties, some successful and some not. He typically calls some subset of Seinfeld, Newman, and Kramer. Here is the data about the past parties, described in terms of who turned up at each party and whether it was a success.

          Seinfeld invited   Newman invited   Kramer invited   Good Party
Party1           N                 N                 N              N
Party2           N                 N                 Y              N
Party3           N                 Y                 N              Y
Party4           N                 Y                 Y              Y
Party5           Y                 N                 N              Y
Party6           Y                 N                 Y              Y
Party7           Y                 Y                 N              N
Party8           Y                 Y                 Y              N
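
(For the computational sub-parts below, it may help to have the table in machine-readable form. Here is a minimal Python encoding, with Y mapped to 1 and N to 0; the name DATA and the row layout are my own convention, not part of the problem.)

```python
# Each row: (seinfeld, newman, kramer, good_party), with Y = 1 and N = 0.
DATA = [
    (0, 0, 0, 0),  # Party1
    (0, 0, 1, 0),  # Party2
    (0, 1, 0, 1),  # Party3
    (0, 1, 1, 1),  # Party4
    (1, 0, 0, 1),  # Party5
    (1, 0, 1, 1),  # Party6
    (1, 1, 0, 0),  # Party7
    (1, 1, 1, 0),  # Party8
]
```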

Part A. We want to help George out by using decision trees to learn the pattern behind successful parties, so that George can then use the learned decision tree to predict whether or not a new party he is trying to throw will be a success. Since this is a small enough example, YOU should be able to see the pattern in the data. Can you express the pattern as a compact decision tree? (If you are completely lost on this part, you can ask me for a hint at the expense of some points, so that you can continue with the rest of the sub-parts.)

Can you write the decision tree above as a propositional formula? (Hint: You can write it as GoodParty <=> Formula, where a <=> b is shorthand for a => b and b => a, and Formula is a disjunction of conjunctions, each conjunction corresponding to one of the GoodParty=yes branches of the decision tree.)

(From this, you can see that a decision tree is really a set of DNF formulas. Since DNF is a canonical form, any Boolean function can be written in DNF, and so any Boolean function can be written as a decision tree.)
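
(If you want to sanity-check a candidate formula, one way is to evaluate it on every row of the table. Here is a sketch using the DATA encoding above; the candidate shown is a deliberate placeholder, not the answer.)

```python
# Substitute your own DNF formula for this placeholder candidate.
candidate = lambda s, n, k: s and not n  # placeholder, NOT the answer

mismatches = [(s, n, k) for s, n, k, good in DATA
              if bool(candidate(s, n, k)) != bool(good)]
print("mismatching rows:", mismatches or "none -- formula fits the data")
```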

Part B. Suppose you want to go with the automated decision-tree construction algorithm (which uses the information value of splitting on each attribute). Show the information values of splitting on each attribute, and show which attribute will be selected at the first level. Assume that if two attributes have the same information value, you will break the tie using alphabetical ordering (this is a reasonable assumption, since decision trees don't care how you choose among equally valued attributes).
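
(For reference, the quantity the standard algorithm computes for each attribute is the expected reduction in entropy from splitting on it. If you want to check your hand calculation, here is a minimal sketch over the DATA encoding above; the helper names are my own.)

```python
from math import log2

def entropy(p, n):
    """Entropy (in bits) of a group with p positive and n negative examples."""
    if p == 0 or n == 0:
        return 0.0
    q = p / (p + n)
    return -q * log2(q) - (1 - q) * log2(1 - q)

def gain(rows, col):
    """Information gain of splitting `rows` on column `col`."""
    pos = sum(r[-1] for r in rows)
    before = entropy(pos, len(rows) - pos)
    remainder = 0.0
    for v in {r[col] for r in rows}:
        sub = [r for r in rows if r[col] == v]
        p = sum(r[-1] for r in sub)
        remainder += len(sub) / len(rows) * entropy(p, len(sub) - p)
    return before - remainder

for name, col in [("Kramer", 2), ("Newman", 1), ("Seinfeld", 0)]:
    print(f"gain({name}) = {gain(DATA, col):.4f}")
```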

Now continue and make the full decision tree for this example.
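
(If you would rather trace the recursion mechanically, here is a bare-bones sketch of the standard top-down construction, reusing gain() from above. Passing the attributes in alphabetical order implements the tie-breaking assumption, since max() keeps the first of equally valued candidates.)

```python
def build_tree(rows, attrs):
    """Top-down induction; `attrs` is a list of (name, column) pairs."""
    labels = {r[-1] for r in rows}
    if len(labels) == 1:             # all examples agree: pure leaf
        return labels.pop()
    if not attrs:                    # no attributes left: majority leaf
        return int(2 * sum(r[-1] for r in rows) >= len(rows))
    name, col = max(attrs, key=lambda a: gain(rows, a[1]))
    rest = [a for a in attrs if a != (name, col)]
    return {name: {v: build_tree([r for r in rows if r[col] == v], rest)
                   for v in {r[col] for r in rows}}}

print(build_tree(DATA, [("Kramer", 2), ("Newman", 1), ("Seinfeld", 0)]))
```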

Part C. Will the tree you got in Part B be better or worse than the tree you wrote in Part A? If it is worse, is it theoretically consistent that you could do better than the automated information-theoretic method for constructing decision trees? (You don't need to show any explicit information-value calculations here.)

Part D. Consider the super-attribute "Seinfeld-Newman", which in essence is the result of merging the Seinfeld and Newman attributes. This super-attribute takes four values: YY, YN, NY, NN. Compute the information value of this super-attribute and compare it to the attribute information values in Part B. If you had the chance to consider this super-attribute along with the other attributes, which would you have chosen to split on at the first level?
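
(The gain() sketch from Part B already handles multi-valued attributes, so one way to experiment with the super-attribute is to add it as a derived column. The pair encoding below is my own; (0, 0) plays the role of NN, (1, 1) of YY, and so on.)

```python
# Add the merged Seinfeld-Newman attribute as a derived column (index 3).
MERGED = [(s, n, k, (s, n), good) for s, n, k, good in DATA]

print(f"gain(Seinfeld-Newman) = {gain(MERGED, 3):.4f}")
```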

Part E. Does the answer to Part D suggest a way of always getting optimal decision trees? If so, what is the computational cost of this method relative to the standard method (the one you used in Part B)?

Part F. Now, George is gettin' upset at all this dilly-dallying with the decision trees, and would like to switch over to neural networks for finding the pattern in his successful parties. He found out that Kramer can neither make nor break a party, so he decided to ignore Kramer's presence. Using just Seinfeld and Newman, he decided to learn the concept of a good party using perceptron learning, with learning rate alpha = 0.1. Assume that he started with all weights set to 1. Show how his decision surface changes as the first example is processed. (Notice that there are now only 4 distinct examples: 1, 3, 5, and 7; the other 4 are equivalent since Kramer doesn't matter.)
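
(One way to trace the update, whether by hand or in code: compute the unit's output on the example, then nudge each weight by alpha times the error times the corresponding input. The exam statement does not say whether a bias/threshold weight also starts at 1; the sketch below assumes it does, with a fixed bias input of 1 and the convention that the unit fires on a weighted sum >= 0. State whichever convention you adopt in your answer.)

```python
ALPHA = 0.1

def step(z):
    return 1 if z >= 0 else 0

# Weights: [bias, seinfeld, newman]. All start at 1; treating the bias
# weight this way is an assumption -- the exam just says "all weights".
w = [1.0, 1.0, 1.0]

# First example, Kramer ignored: Party1 has S=0, N=0, target=0.
x, target = [1, 0, 0], 0          # leading 1 is the fixed bias input
out = step(sum(wi * xi for wi, xi in zip(w, x)))
w = [wi + ALPHA * (target - out) * xi for wi, xi in zip(w, x)]
print("output =", out, "-> new weights =", w)  # only the bias weight moves
```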

Part G. Do you think the perceptron learning algorithm is able to learn the concept above? If not, what is the minimum number of layers of threshold units he would need in his neural network so that it can converge and learn the pattern behind his successful parties? Explain how you reached your answer. Show an actual neural network (with weights and thresholds) that represents the concept. Assume that the inputs will be 1 (for Y) and 0 (for N). We want the output to be 1 (for a successful party) and 0 (for an unsuccessful party). I am not asking you to learn a network using the gradient-descent method; I am asking you to show one directly.
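
(Once you have a candidate network, you can check it exhaustively over the four input combinations. Here is a verification sketch; the weights and thresholds below are placeholders for you to replace, not a claimed answer.)

```python
def unit(inputs, weights, threshold):
    """Threshold unit: fires 1 iff the weighted sum reaches the threshold."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= threshold else 0

def network(s, n):
    h1 = unit((s, n), (1, 1), 1)       # placeholder hidden unit
    h2 = unit((s, n), (1, 1), 2)       # placeholder hidden unit
    return unit((h1, h2), (1, 1), 2)   # placeholder output unit

for s, n, _, good in DATA[::2]:        # the four Kramer=N rows (1, 3, 5, 7)
    print((s, n), "->", network(s, n), "target:", good)
```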

Part H. Suppose you want to learn the same concept using a naive Bayes classifier (NBC). Show the topology of the Bayes network you will learn. Show the CPT for the node "Party" and the node "Kramer". Do you think NBC will be successful on this problem? (I.e., does the NBC assumption that the features are independent of each other given the class, P(f1,f2|C) = P(f1|C)*P(f2|C), hold in this problem?)
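
(The CPT entries an NBC learns here are just conditional frequencies read off the table: the prior for the class node "Party", and P(feature | Party) for each feature node. A sketch using maximum-likelihood counts, no smoothing; the helper is my own.)

```python
prior = sum(r[-1] for r in DATA) / len(DATA)
print(f"P(Party=good) = {prior}")

def cpt(col, name):
    """Print P(feature=Y | Party=good) and P(feature=Y | Party=bad)."""
    for label, word in ((1, "good"), (0, "bad")):
        rows = [r for r in DATA if r[-1] == label]
        print(f"P({name}=Y | Party={word}) = {sum(r[col] for r in rows) / len(rows)}")

cpt(2, "Kramer")   # and similarly cpt(0, "Seinfeld"), cpt(1, "Newman")
```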

Part I. (*Required*) [Using applets] Run this full example (i.e., with all three attributes) on the decision-tree and neural-network learning applets available at http://www.aispace.org/mainTools.shtml. Include screen dumps showing the learned decision tree and perceptron. Discuss whether the results are as expected and whether you learned anything new.