Bias, Variance and Model Complexity
Test error (generalization error): the prediction error over an independent test sample \[ Err_\tau=E[L(Y,\hat{f}(X))\mid\tau] \] Here the training set \(\tau\) is fixed, and test error refers to the error for this specific training set.
Expected test error: \[ Err=E[L(Y,\hat{f}(X))]=E[Err_\tau] \]
This expectation averages over everything that is random, including the randomness in the training set that produced \(\hat{f}\).
Training error: the average loss over the training sample \[ \bar{err}=\frac{1}{N}\sum_{i=1}^NL(y_i,\hat{f}(x_i)) \]
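To make these quantities concrete, here is a minimal sketch that fits \(\hat{f}\) on one fixed training set \(\tau\) and estimates both \(\bar{err}\) and \(Err_\tau\). It assumes synthetic data from \(Y=f(X)+\epsilon\), squared-error loss, and a decision-tree fit; all names and sizes are illustrative, not part of the original text.

```python
# Sketch: training error vs. test error for one fixed training set,
# using synthetic data and squared-error loss (choices are illustrative).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def f(x):                                  # true regression function f(X)
    return np.sin(2 * np.pi * x)

# One fixed training set tau and a large independent test sample
x_train = rng.uniform(size=(200, 1))
y_train = f(x_train).ravel() + rng.normal(scale=0.3, size=200)
x_test = rng.uniform(size=(10_000, 1))
y_test = f(x_test).ravel() + rng.normal(scale=0.3, size=10_000)

model = DecisionTreeRegressor(max_depth=8).fit(x_train, y_train)

train_err = np.mean((y_train - model.predict(x_train)) ** 2)  # \bar{err}
test_err = np.mean((y_test - model.predict(x_test)) ** 2)     # estimate of Err_tau

print(f"training error: {train_err:.3f}, test error: {test_err:.3f}")
```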
Model selection: estimating the performance of different models in order to choose the best one.
Model assessment: having chosen a final model, estimating its prediction error (generalization error) on new data.
Randomly divide the dataset into three parts:
- a training set: used to fit the models
- a validation set: used to estimate prediction error for model selection
- a test set: used to assess the generalization error of the final chosen model
A typical split might be 50% for training, and 25% each for validation and testing.
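A minimal sketch of such a 50/25/25 split, assuming scikit-learn's train_test_split; the array names and data are illustrative.

```python
# Sketch: 50% train, 25% validation, 25% test (illustrative data).
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
y = rng.normal(size=1000)

# First carve off 50% for training, then split the remainder evenly.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Model selection: pick the model with the lowest validation error.
# Model assessment: report the chosen model's error on (X_test, y_test) once.
```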
The Bias-Variance Decomposition
General Model
If we assume that \(Y=f(X)+\epsilon\), where \(E(\epsilon)=0\) and \(Var(\epsilon)=\sigma^2_\epsilon\), we can derive an expression for the expected prediction error of a regression fit \(\hat{f}(X)\) at an input point \(X=x_0\), using squared-error loss:
\[ \begin{align} Err(x_0)&=E[(Y-\hat{f}(x_0))^2\mid X=x_0] \\ &=E[(f(x_0)+\epsilon-\hat{f}(x_0))^2] \\ &=E[\epsilon^2]+E[(f(x_0)-\hat{f}(x_0))^2]+2E[\epsilon(f(x_0)-\hat{f}(x_0))] \\ &=\sigma^2_\epsilon+E[f(x_0)^2+\hat{f}(x_0)^2-2f(x_0)\hat{f}(x_0)] \\ &=\sigma^2_\epsilon+E[\hat{f}(x_0)^2]+f(x_0)^2-2f(x_0)E[\hat{f}(x_0)] \\ &=\sigma^2_\epsilon+\left(E[\hat{f}(x_0)]\right)^2+f(x_0)^2-2f(x_0)E[\hat{f}(x_0)]+E[\hat{f}(x_0)^2]-\left(E[\hat{f}(x_0)]\right)^2 \\ &=\sigma^2_\epsilon+\left(E[\hat{f}(x_0)]-f(x_0)\right)^2+Var(\hat{f}(x_0))\\ &=\sigma^2_\epsilon+Bias^2(\hat{f}(x_0))+Var(\hat{f}(x_0))\\ &= \text{Irreducible Error}+\text{Bias}^2+\text{Variance} \end{align} \]
The cross term vanishes because \(\epsilon\) has mean zero and is independent of \(\hat{f}(x_0)\), which depends only on the training data.
- The first term is the variance of the target around its true mean \(f(x_0)\), and cannot be avoided no matter how well we estimate \(f(x_0)\), unless \(\sigma^2_\epsilon=0\).
- The second term is the squared bias, the amount by which the average of our estimate differs from the true mean.
- The last term is the variance; the expected squared deviation of \(\hat{f}(x_0)\) around its mean.
Typically the more complex we make the model \(\hat{f}\), the lower the (squared) bias but the higher the variance.
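The decomposition can be checked by simulation: redraw the training set many times, refit, and record \(\hat{f}(x_0)\). The sketch below assumes a synthetic model \(Y=\sin(2\pi X)+\epsilon\) and a decision-tree fit; both choices are illustrative assumptions, not taken from the text.

```python
# Sketch: Monte Carlo estimate of bias^2 and variance of f_hat(x0)
# over repeated training sets (data-generating model is illustrative).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
sigma_eps = 0.3
x0 = np.array([[0.5]])
f0 = np.sin(2 * np.pi * 0.5)              # true mean f(x0)

preds = []
for _ in range(500):                      # 500 independent training sets
    x = rng.uniform(size=(100, 1))
    y = np.sin(2 * np.pi * x).ravel() + rng.normal(scale=sigma_eps, size=100)
    fit = DecisionTreeRegressor(max_depth=10).fit(x, y)
    preds.append(fit.predict(x0)[0])

preds = np.array(preds)
bias_sq = (preds.mean() - f0) ** 2        # squared bias at x0
variance = preds.var()                    # variance of f_hat(x0)
print(f"Err(x0) ~ {sigma_eps**2 + bias_sq + variance:.4f} "
      f"(irreducible {sigma_eps**2:.4f} + bias^2 {bias_sq:.4f} + variance {variance:.4f})")
```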
KNN regression
For the k-nearest-neighbor regression fit, these expressions have the simple form \[ \begin{align} Err(x_0)&=E[(Y-\hat{f}_k(x_0))^2\mid X=x_0] \\ &=\sigma^2_\epsilon+\left[f(x_0)-\frac{1}{k}\sum_{\ell=1}^k f(x_{(\ell)})\right]^2+\frac{\sigma^2_\epsilon}{k} \end{align} \] Here \(x_{(\ell)}\), \(\ell=1,\dots,k\), are the \(k\) training inputs nearest to \(x_0\), and the training inputs \(x_i\) are assumed fixed so that the randomness comes only from the \(y_i\). The variance term \(\sigma^2_\epsilon/k\) shrinks as \(k\) grows, while the squared bias typically grows because more distant neighbors are averaged in.
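A small simulation sketch of this formula: hold the training inputs \(x_i\) fixed, redraw only the noise, and compare the Monte Carlo estimate of \(Err(x_0)\) with \(\sigma^2_\epsilon+Bias^2+\sigma^2_\epsilon/k\). It assumes a synthetic \(f\) and scikit-learn's KNeighborsRegressor; all specifics are illustrative.

```python
# Sketch: check the kNN decomposition with fixed training inputs
# and noise redrawn each round (illustrative choices throughout).
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
sigma_eps, k, n = 0.3, 5, 200
f = lambda x: np.sin(2 * np.pi * x)       # true regression function

x = rng.uniform(size=(n, 1))              # fixed training inputs x_i
x0 = np.array([[0.5]])
f0 = f(0.5)

# Closed-form value: sigma^2 + bias^2 + sigma^2 / k
nbrs = np.argsort(np.abs(x.ravel() - 0.5))[:k]        # k nearest x_i to x0
bias_sq = (f0 - f(x[nbrs]).mean()) ** 2
err_formula = sigma_eps**2 + bias_sq + sigma_eps**2 / k

# Monte Carlo estimate: redraw noise, refit, and score a fresh Y at x0
sq_errors = []
for _ in range(2000):
    y = f(x).ravel() + rng.normal(scale=sigma_eps, size=n)
    fit = KNeighborsRegressor(n_neighbors=k).fit(x, y)
    y0 = f0 + rng.normal(scale=sigma_eps)
    sq_errors.append((y0 - fit.predict(x0)[0]) ** 2)

print(f"formula: {err_formula:.4f}, simulation: {np.mean(sq_errors):.4f}")
```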