CS446: Machine Learning

Quiz 3

1. When learning a linear separator with a continuous and differentiable loss function Q(w), which of the following is the batch gradient descent algorithm guaranteed to find? Assume that the step size is set to an appropriate value.
1. The global minimum of Q(w)
2. One of the local minima of Q(w)
3. The global maximum of Q(w)
4. One of the local maxima of Q(w)
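
For experimenting with this question, here is a minimal sketch of gradient descent on a one-dimensional, non-convex toy function. The function Q below, the starting points, the step size, and the iteration count are all illustrative assumptions and are not part of the quiz; a real linear separator would use a loss summed over the training data.

```python
# Toy loss chosen only for illustration (an assumption, not from the quiz).
def Q(w):
    return w**4 - 3 * w**2 + w


# Derivative of the toy loss above.
def grad_Q(w):
    return 4 * w**3 - 6 * w + 1


def gradient_descent(w0, step=0.01, iters=5000):
    """Repeatedly move against the gradient with a fixed step size."""
    w = w0
    for _ in range(iters):
        w -= step * grad_Q(w)
    return w


# Two hypothetical starting points on either side of the origin.
w_left = gradient_descent(-2.0)
w_right = gradient_descent(2.0)
print(w_left, Q(w_left))
print(w_right, Q(w_right))
```

Comparing the two printed loss values shows how the result can depend on the starting point.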

2. What kinds of Boolean functions can decision trees represent?
1. Conjunctions
2. Disjunctions
3. M-of-n functions
4. All of the above

3. We run the ID3 algorithm for learning decision trees over a set of attributes, each of which can take two values. Suppose ID3 must choose one of four candidate attributes. Each attribute splits the data into two groups of 500 data points, and both groups have the same distribution over positive and negative examples, as given for that attribute in the corresponding option below. Which attribute will ID3 choose?
1. 250 positive examples, 250 negative examples
2. 300 positive examples, 200 negative examples
3. 200 positive examples, 300 negative examples
4. 450 positive examples, 50 negative examples
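
As a way to check the arithmetic behind this question, the following sketch computes the size-weighted entropy of the groups produced by a split, using base-2 logarithms. The helper names entropy and split_entropy are hypothetical, and treating each option as two identical groups of 500 points is an assumption based on the wording of the question.

```python
from math import log2


def entropy(pos, neg):
    """Entropy in bits of a group with pos positive and neg negative examples."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count > 0:  # treat 0 * log(0) as 0
            p = count / total
            h -= p * log2(p)
    return h


def split_entropy(groups):
    """Average entropy of the groups, weighted by group size."""
    n = sum(pos + neg for pos, neg in groups)
    return sum((pos + neg) / n * entropy(pos, neg) for pos, neg in groups)


# The four options, each read as (positive, negative) counts per group.
options = {1: (250, 250), 2: (300, 200), 3: (200, 300), 4: (450, 50)}
for number, group in options.items():
    print(number, round(split_entropy([group, group]), 3))
```

ID3 scores a candidate attribute by information gain, which is the entropy before the split minus this weighted entropy after it.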

4. What is the correct computation for the entropy of a dataset with 10 positive examples and 23 negative examples?

1. -(10/33) * log(23/33) - (23/33) * log(10/33)
2. -(10/33) * log(10/33) - (23/33) * log(23/33)
3. -(10/23) * log(10/23) - (23/10) * log(23/10)
4. (10/33) * log(10/33) + (23/33) * log(23/33)
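
To evaluate a candidate expression numerically, a small helper like the one below can be used. It is a sketch only: the name binary_entropy is hypothetical, and the default base of 2 is an assumption, since the question does not state the base of the logarithm.

```python
import math


def binary_entropy(n_pos, n_neg, base=2.0):
    """Entropy of a two-class dataset, given the count of each class."""
    total = n_pos + n_neg
    h = 0.0
    for count in (n_pos, n_neg):
        if count > 0:  # a class with no examples contributes nothing
            p = count / total
            h -= p * math.log(p, base)
    return h


# Counts from the question: 10 positive and 23 negative examples.
h = binary_entropy(10, 23)
print(h)
```

The result is never negative, and in base 2 it is at most 1 for a two-class dataset.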

5. Consider two different approaches to the same learning problem:
(1) Learning decision trees where the depth of the learned trees is at most 10.
(2) Learning decision trees where the depth is not limited in any way.
In which of these two scenarios are you more likely to overfit the training data?
1. Scenario (1)
2. Scenario (2)

Dan Roth