Realizing also that, within each type, it is possible to go to more expressive hypotheses (e.g. topology and hidden variables for Bayes nets; higher-order polynomials for decision surfaces, etc.).
Understanding that inductive learning involves ranking hypotheses in terms of their fit to training data.
False positives and false negatives, and how the "loss"/"cost" function for learning may assign asymmetric costs to them.
Understanding that what we really want is a hypothesis that fits the test data well, and that the training data is only a stand-in.
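As a minimal sketch of asymmetric costs: the function name, the toy labels, and the cost weights (a false negative costing five times a false positive) are all illustrative assumptions, not values from the course. The two hypotheses below make one mistake each, yet the cost function ranks them differently.

```python
def asymmetric_cost(y_true, y_pred, fp_cost=1.0, fn_cost=5.0):
    """Total cost when false positives and false negatives are priced differently.

    Labels are 1 (positive) and 0 (negative). The default weights are assumed
    values chosen only for illustration.
    """
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return fp_cost * fp + fn_cost * fn

y_true = [1, 1, 0, 0, 1, 0]
h_a = [1, 0, 0, 0, 1, 0]  # hypothesis A: one false negative
h_b = [1, 1, 1, 0, 1, 0]  # hypothesis B: one false positive

cost_a = asymmetric_cost(y_true, h_a)
cost_b = asymmetric_cost(y_true, h_b)
print("cost of A:", cost_a, "cost of B:", cost_b)
```

Under plain error counting the two hypotheses tie; under the assumed weights, B is preferred because its single mistake is the cheaper kind.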
Understanding that training data has to be drawn from the same distribution as test data (the discussion about fair tests).
Understanding the "bias-variance" tradeoff and the perils of over-fitting (if we keep picking more and more expressive hypotheses, i.e. weaker and weaker bias, e.g. going from lines to 2nd-order polynomials to 7th-order polynomials, we can reduce the training error, but will eventually start increasing the test error due to over-fitting).
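The effect can be explored with a small experiment. The sketch below is only an illustration under assumed settings: a quadratic target, Gaussian noise of standard deviation 0.2, 12 training points, and a hand-rolled least-squares fit via the normal equations (adequate for low degrees, not a recommendation for real work). Training error cannot go up as the degree grows, because each polynomial class contains the previous one; the test error is printed so the over-fitting trend can be inspected rather than assumed.

```python
import random

def fit_poly(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations."""
    n = degree + 1
    # Normal equations A c = b with A = V^T V and b = V^T y (V is the Vandermonde matrix).
    A = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        s = sum(A[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - s) / A[i][i]
    return coeffs

def mse(coeffs, xs, ys):
    """Mean squared error of the polynomial with the given coefficients."""
    preds = [sum(c * x ** k for k, c in enumerate(coeffs)) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)

def target(x):
    return 1.0 - 2.0 * x + 3.0 * x * x  # assumed "true" function

random.seed(0)
# Evenly spaced inputs on [-1, 1] stand in for uniform sampling; same noise for both sets.
train_x = [-1 + 2 * i / 11 for i in range(12)]
test_x = [-1 + 2 * i / 49 for i in range(50)]
train_y = [target(x) + random.gauss(0, 0.2) for x in train_x]
test_y = [target(x) + random.gauss(0, 0.2) for x in test_x]

train_err, test_err = {}, {}
for degree in (1, 2, 7):
    coeffs = fit_poly(train_x, train_y, degree)
    train_err[degree] = mse(coeffs, train_x, train_y)
    test_err[degree] = mse(coeffs, test_x, test_y)
    print(degree, train_err[degree], test_err[degree])
```

Changing the seed, the noise level, or the number of training points is a quick way to see how strongly the gap between training and test error depends on the amount of data.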
Desiderata for a good learning algorithm--in terms of probably approximately correct learning performance.
Understanding learner difficulty in terms of the number of samples needed to reach a certain level of test accuracy, and how different learning algorithms might compare.
Understanding learning problem difficulty in terms of the sample complexity of the problem. Realizing that the sample complexity of discrete hypothesis spaces (e.g. boolean functions) is proportional to the log of the hypothesis space size. Seeing that this means learning arbitrary boolean functions requires a number of samples exponential in the number of attributes (there are 2^(2^n) boolean functions over n attributes, so the log is 2^n), while learning conjunctive boolean functions only requires a polynomial number of samples (there are about 3^n conjunctions, so the log is linear in n).
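To make the comparison concrete, the sketch below evaluates the standard PAC bound for a consistent learner over a finite hypothesis space, m >= (1/epsilon) * (ln|H| + ln(1/delta)). The function names and the choices of epsilon and delta are illustrative assumptions. It works with ln|H| directly, so the 2^(2^n) hypothesis count never has to be materialized.

```python
import math

def pac_sample_bound(ln_h, epsilon, delta):
    """Samples sufficient for error <= epsilon with probability >= 1 - delta.

    ln_h is the natural log of the hypothesis space size |H|.
    """
    return math.ceil((ln_h + math.log(1.0 / delta)) / epsilon)

def ln_all_boolean(n):
    """ln|H| for all boolean functions over n attributes: |H| = 2^(2^n)."""
    return (2 ** n) * math.log(2)

def ln_conjunctions(n):
    """ln|H| for conjunctions: each attribute is positive, negated, or absent, so |H| = 3^n."""
    return n * math.log(3)

epsilon, delta = 0.1, 0.05  # assumed accuracy and confidence targets
for n in (5, 10, 20):
    m_bool = pac_sample_bound(ln_all_boolean(n), epsilon, delta)
    m_conj = pac_sample_bound(ln_conjunctions(n), epsilon, delta)
    print(n, m_bool, m_conj)
```

The bound for conjunctions grows linearly with n, while the bound for arbitrary boolean functions roughly doubles with every added attribute.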
Greedy decision tree induction. Figuring out which attribute to split on. Entropy heuristic, and computing expected entropy after a split.
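The attribute-selection step can be sketched as follows. The dataset, the attribute names, and the helper names are hypothetical; the computation is the usual one: entropy of the label distribution, and expected entropy after a split as the size-weighted average of the entropies of the resulting subsets. The greedy choice is the attribute with the lowest expected entropy (equivalently, the highest information gain).

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def expected_entropy(examples, attribute, target):
    """Size-weighted average entropy of the subsets produced by splitting on attribute."""
    groups = defaultdict(list)
    for ex in examples:
        groups[ex[attribute]].append(ex[target])
    return sum(len(g) / len(examples) * entropy(g) for g in groups.values())

def best_attribute(examples, attributes, target):
    """Greedy choice: the attribute whose split leaves the least expected entropy."""
    return min(attributes, key=lambda a: expected_entropy(examples, a, target))

# Hypothetical toy data: "outlook" determines the label, "windy" is uninformative.
examples = [
    {"outlook": "sunny", "windy": "no", "play": "+"},
    {"outlook": "sunny", "windy": "yes", "play": "+"},
    {"outlook": "rainy", "windy": "no", "play": "-"},
    {"outlook": "rainy", "windy": "yes", "play": "-"},
]

before = entropy([ex["play"] for ex in examples])
for attr in ("outlook", "windy"):
    after = expected_entropy(examples, attr, "play")
    print(attr, "expected entropy:", after, "information gain:", before - after)
print("split on:", best_attribute(examples, ["outlook", "windy"], "play"))
```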
Seinfeld-Neuman example and understanding why greedy splitting on single attributes can give us suboptimal (not the smallest) decision trees.
Recognizing when the data is noisy (we have no more attributes to split on, yet we still have both +ve and -ve examples).
Understanding over-fitting and refusing to split on a variable because the expected entropy after the split is not much lower than the entropy before it (which is at most 1 for a two-class problem), or equivalently because its information gain is not much higher than 0.