Note: answers are bolded
Stochastic gradient descent, when used with the hinge loss, leads to which update rule?
- Widrow's Adaline
In a mistake-driven algorithm, if we make a mistake on example xi with label yi, we update the weights w so that we now predict yi correctly.
Which of the following properties is true about the (original) Perceptron algorithm?
- The Perceptron always converges to the best linear separator for a given dataset.
- The convergence criteria for Perceptron depends on the initial value of the weight vector.
- If the dataset is not linearly separable, the Perceptron algorithm does not converge and keeps cycling between some sets of weights.
- If the dataset is not lineary separable, the Perceptron algorithm learns the linear separator with least misclassifications.
Let's assume that we are using the standard Averaged Perceptron algorithm for training and testing (prediction). Let's further assume that it makes k mistakes on the training data. Now, how many weight vectors do we require to predict the label for a test instance?
- Not enough information.
Winnow has a better mistake bound than Perceptron when only k of n features are relevant to the prediction and k << n.