Learning (from a final exam):
George Costanza wants to
throw a party and he wants it to be a success. Over the years he has thrown a
fair number of parties, and some have been successful and some have been
unsuccessful. He typically calls some subset of Seinfeld, Newman and Kramer.
Here is the data about the past parties--described in terms of who turned up at
the party and whether the party was a success.
Party   | Seinfeld invited | Newman invited | Kramer invited | Good Party
--------|------------------|----------------|----------------|-----------
Party1  | N                | N              | N              | N
Party2  | N                | N              | Y              | N
Party3  | N                | Y              | N              | Y
Party4  | N                | Y              | Y              | Y
Party5  | Y                | N              | N              | Y
Party6  | Y                | N              | Y              | Y
Party7  | Y                | Y              | N              | N
Party8  | Y                | Y              | Y              | N
Part A. We want to help
George out by using decision trees to learn the pattern behind successful parties,
so that George can then use the learned decision tree to predict whether or not
a new party he is trying to throw will be a success. Since this is a
small-enough example, YOU should be able to see the
pattern in the data. Can you express the pattern in terms of a compact decision
tree? (If you are completely lost on this part, you can ask me for a hint, at the expense of some points, so that you can continue with the rest of the sub-parts.)
Can you write the decision tree above as a propositional formula? (Hint: You can write it as GoodParty <=> Formula, where a <=> b is shorthand for a => b and b => a, and Formula is a disjunction of conjunctions, each conjunction corresponding to one of the good-party=yes branches of the decision tree.) (From this, you can see that a decision tree really corresponds to a DNF formula. Since DNF is a canonical form, any Boolean function can be written in DNF, and so any Boolean function can be represented as a decision tree.)
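For concreteness, here is a small Python sketch of how you could check a candidate GoodParty formula against the eight past parties; the formula shown is only a placeholder, not the answer, so substitute the disjunction of conjunctions you read off your own tree.

# Each tuple is (Seinfeld, Newman, Kramer, GoodParty) from the table, with Y=True, N=False.
parties = [
    (False, False, False, False),   # Party1
    (False, False, True,  False),   # Party2
    (False, True,  False, True),    # Party3
    (False, True,  True,  True),    # Party4
    (True,  False, False, True),    # Party5
    (True,  False, True,  True),    # Party6
    (True,  True,  False, False),   # Party7
    (True,  True,  True,  False),   # Party8
]

# Placeholder formula -- replace it with the disjunction of conjunctions from your tree.
candidate = lambda s, n, k: s and n

for s, n, k, good in parties:
    print((s, n, k), "predicted:", candidate(s, n, k), "actual:", good)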
Part B. Suppose
you want to go with the automated decision-tree construction algorithm (which uses
the information value of splitting on each attribute). Show the information
values of splitting on each attribute, and show which attribute will be
selected at the first level. Assume that if two attributes have the same information value, you will break the tie using alphabetical ordering (this is a reasonable assumption, since the algorithm doesn't care how you choose among equally valued attributes).
Now continue and make the
full decision tree for this example.
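For reference, a rough Python sketch of one way to organize the information-value (information-gain) computation; entropy is measured in bits, and the column indices are simply how this sketch happens to encode the table above.

import math

def entropy(pos, neg):
    # Entropy (in bits) of a set with pos positive and neg negative examples.
    total = pos + neg
    if pos == 0 or neg == 0:
        return 0.0
    p = pos / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Each row is (Seinfeld, Newman, Kramer, GoodParty), copied from the table above.
rows = [
    ('N','N','N','N'), ('N','N','Y','N'), ('N','Y','N','Y'), ('N','Y','Y','Y'),
    ('Y','N','N','Y'), ('Y','N','Y','Y'), ('Y','Y','N','N'), ('Y','Y','Y','N'),
]

def gain(attr_index):
    # Information gain = entropy of the whole set - expected entropy after the split.
    pos = sum(1 for r in rows if r[3] == 'Y')
    before = entropy(pos, len(rows) - pos)
    remainder = 0.0
    for value in ('Y', 'N'):
        subset = [r for r in rows if r[attr_index] == value]
        sub_pos = sum(1 for r in subset if r[3] == 'Y')
        remainder += len(subset) / len(rows) * entropy(sub_pos, len(subset) - sub_pos)
    return before - remainder

for name, idx in [('Kramer', 2), ('Newman', 1), ('Seinfeld', 0)]:
    print(name, round(gain(idx), 4))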
Part C. Is the tree you got in Part B better or worse than the tree that you wrote in Part A? If it is worse, is it consistent with the theory that you could do better than the automated information-theoretic method for constructing decision trees? (You don't need to show any explicit information value calculations here.)
Part D. Consider the super-attribute "Seinfeld-Newman", which in essence is the result of merging the Seinfeld and Newman attributes. This super-attribute takes four values: YY, YN, NY, and NN. Redo the first-level attribute selection with this super-attribute available. Which attribute is selected now, and what does the resulting decision tree look like?
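As a concrete illustration of the merge, one way to form the super-attribute and tally the good/bad parties under each of its four values (the rows are copied from the table; Kramer is left out only to keep them short):

# Each row is (Seinfeld, Newman, GoodParty) from the table; Kramer is omitted here
# only to keep the rows short -- it does not affect the super-attribute itself.
rows = [
    ('N','N','N'), ('N','N','N'), ('N','Y','Y'), ('N','Y','Y'),
    ('Y','N','Y'), ('Y','N','Y'), ('Y','Y','N'), ('Y','Y','N'),
]

tally = {}
for s, n, good in rows:
    tally.setdefault(s + n, []).append(good)   # key is one of 'YY', 'YN', 'NY', 'NN'
print(tally)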
Part E. Does the answer to Part D suggest a way of always getting optimal decision trees? If so, what
is the computational cost of this method relative to the standard method (that
you used in Part B)?
Part F. Now, George is gettin' upset at all this dilly-dallying with the decision trees, and would like to switch over to neural networks for finding the pattern in his successful parties. He found out that Kramer can neither make nor break a party, so he decided to ignore Kramer's presence. Using just Seinfeld and Newman, he decided to learn the concept of a good party using perceptron learning, with learning rate alpha = 0.1. Assume that he started with all weights set to 1. Show how his decision surface changes as the first example is processed. (Notice that there are now only 4 distinct examples -- 1, 3, 5, and 7 -- since the other 4 are equivalent once Kramer is ignored.)
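For reference, a minimal Python sketch of a single perceptron-learning step on that first example, under common conventions that you should adjust if your notes differ: a bias weight on a dummy input fixed at 1, all weights initialized to 1, a step activation that outputs 1 when the weighted sum is at least 0, and learning rate 0.1.

alpha = 0.1
w = [1.0, 1.0, 1.0]                  # [bias weight, w_seinfeld, w_newman], all start at 1

def predict(x):
    # Step activation: output 1 when the weighted sum is >= 0.
    s = w[0] * 1 + w[1] * x[0] + w[2] * x[1]
    return 1 if s >= 0 else 0

# First training example: Party1 -> Seinfeld = 0, Newman = 0, GoodParty = 0.
x, target = (0, 0), 0
error = target - predict(x)
w[0] += alpha * error * 1            # bias input is fixed at 1
w[1] += alpha * error * x[0]
w[2] += alpha * error * x[1]
print("weights after one step:", w)  # only the bias weight moves, since both inputs are 0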
Part G. Do you think that the perceptron learning algorithm will be able to learn the concept above? If not, what is the minimum number of layers of threshold units that he would need in his neural network so that it can converge and learn the pattern behind his successful parties? Explain how you reached your answer. Show an actual neural
network (with weights and thresholds) that represents the concept. Assume that
the inputs are going to be 1 (for Y) and 0 (for N). We want the output to be 1
(for successful party) and 0 (for unsuccessful party). I am not asking you to
learn a network using the gradient descent method. I am asking you to show one
directly.
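As an illustration of the weight-and-threshold format being asked for (the example below is an AND gate, deliberately not the party concept), a single threshold unit outputs 1 exactly when w1*x1 + w2*x2 >= theta.

def threshold_unit(x1, x2, w1, w2, theta):
    # Output 1 iff the weighted sum of the inputs reaches the threshold.
    return 1 if w1 * x1 + w2 * x2 >= theta else 0

# Example: an AND gate with w1 = w2 = 1 and theta = 1.5 (not the party concept).
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", threshold_unit(x1, x2, 1, 1, 1.5))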
Part I. (*Required*) [Using applets] Run this full example (i.e., with all three attributes) on the decision tree and neural net learning applets available at http://www.aispace.org/mainTools.shtml. Include screen dumps showing the learned decision tree and perceptron. Discuss whether the results are as expected and whether you learned anything new.