Bias and variance are two phenomena that explain part of the error of a predictor on the test set.
The error of a predictor may be split into:
An important observation is that a model trained on a given dataset is itself random, since it depends on the random, unpredictable values (the \(\varepsilon_i\)) present in the training set.
That’s why talking about the bias and the variance of an estimator makes sense.
Statistical bias measures how much the expected value of an estimator differs from the true underlying parameter being estimated. It hence represents an additional error of our estimated learner compared to the true parameter.
Bias comes from erroneous assumptions in the learning algorithm and is linked to underfitting.
Let \(\hat{\theta}\) be an estimator of a parameter \(\theta\) computed on \(S\); then the bias of \(\hat{\theta}\) is defined as follows:
\[Bias(\hat{\theta}) = \mathbb{E}(\hat{\theta}) - \theta\]
The error due to variance comes from sensitivity to small fluctuations in the training set.
High variance may result from an algorithm modeling the random noise in the training data and is linked to overfitting. A model with high variance will have generalization problems as it sticks too closely to the training data.
Let \(\hat{\theta}\) be an estimator of a parameter \(\theta\) computed on \(S\); then the variance of \(\hat{\theta}\) is defined as follows:
\[Var(\hat{\theta}) = \mathbb{E}_S\left[(\hat{\theta} - \mathbb{E}_S[\hat{\theta}])^2\right]\]
Where:
Hence the variance of \(\hat{\theta}\) is the variance of the estimator with respect to the training dataset \(S\).
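To make these definitions concrete, here is a minimal simulation sketch (not part of the original derivation; the distribution, sample size and estimator are arbitrary choices): it approximates the bias and the variance of the uncorrected variance estimator by recomputing it on many independent training sets \(S\).

```python
import numpy as np

# Minimal sketch: approximate Bias(theta_hat) = E[theta_hat] - theta and
# Var(theta_hat) by drawing many independent training sets S of size m.
rng = np.random.default_rng(0)

m = 20                # size of each training set S
n_datasets = 100_000  # number of simulated training sets
sigma2 = 4.0          # true parameter theta = Var(Z)

samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(n_datasets, m))
theta_hat = samples.var(axis=1)      # uncorrected variance estimate, one per dataset

bias = theta_hat.mean() - sigma2     # E_S[theta_hat] - theta (theory: -sigma2/m)
variance = theta_hat.var()           # Var_S[theta_hat]

print(f"empirical bias     = {bias:.3f}  (theory: {-sigma2 / m:.3f})")
print(f"empirical variance = {variance:.3f}")
```

The empirical bias should be close to the theoretical value \(-\sigma^2/m\); both quantities are expectations over the random training set \(S\).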
Let \(\hat{D}=\{(X^{(j)}, Y^{(j)})\}_{j=1}^m\) be a dataset. Assume \(Y=f(X)+\varepsilon\) with \(\mathbb{E}[\varepsilon]=0\). Let \(\hat{f}_m\) be an estimator trained on the dataset \(\hat{D}\).
For a new unseen sample pair \((X^{*}, Y^{*})\), the expected MSE is:
\[\begin{eqnarray} MSE(\hat{f}_m) &&= \mathrm{E}\left[\left(Y^{*}-\hat{f}_m(X^{*})\right)^2\right] \\ &&= \mathrm{E}\left[\left(\varepsilon+f(X^{*})-\hat{f}_m(X^{*})\right)^2\right] \\ &&= \mathrm{E}[\varepsilon^2] + \mathrm{E}\left[\left(f(X^{*})-\hat{f}_m(X^{*})\right)^2\right]+ \mathrm{E}\left[2\varepsilon\left(f(X^{*})-\hat{f}_m(X^{*})\right)\right] \\ &&= \mathrm{E}[\varepsilon^2] + \mathrm{E}\left[\left(f(X^{*})-\hat{f}_m(X^{*})\right)^2\right]+ \mathrm{E}\left[\varepsilon\right]\mathrm{E}\left[2\left(f(X^{*})-\hat{f}_m(X^{*})\right)\right] \;\;\;\;\; && \left(\varepsilon \perp \hat{f}_m(X^{*}),\; \mathrm{E}\left[\varepsilon\right]=0\right)\\ &&= \mathrm{E}[\varepsilon^2] + \mathrm{E}\left[\left(f(X^{*})-\hat{f}_m(X^{*})\right)^2\right] \\ &&= \mathrm{E}[\varepsilon^2] + \mathrm{E}\left[f(X^{*})-\hat{f}_m(X^{*})\right]^2 + \mathrm{V}\left[f(X^{*})-\hat{f}_m(X^{*})\right] \;\;\;\;\; && \left(\mathrm{E}[X^2]=\mathrm{E}[X]^2+\mathrm{V}[X]\right) \\ &&= \mathrm{E}[\varepsilon^2] + \mathrm{E}\left[f(X^{*})-\hat{f}_m(X^{*})\right]^2 + \mathrm{V}\left[\hat{f}_m(X^{*})\right] \;\;\;\;\; && \left(\mathrm{V}[a-X]=\mathrm{V}[X]\right) \end{eqnarray}\]
On the final line:
Such a clean decomposition into Bias and Variance terms exists only for the squared error loss.
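The decomposition can be checked numerically. The sketch below (illustrative only; the true function, noise level and polynomial degree are arbitrary choices) retrains a polynomial least-squares fit on many training sets and compares the expected squared error at a fixed test point \(X^{*}\) with \(\mathrm{E}[\varepsilon^2] + \mathrm{Bias}^2 + \mathrm{Variance}\).

```python
import numpy as np

# Minimal sketch: verify MSE ≈ noise + bias^2 + variance at a fixed test point.
rng = np.random.default_rng(0)

def f(x):                              # true (unknown) regression function
    return np.sin(2 * np.pi * x)

m, degree, noise_std = 30, 3, 0.3
x_star = 0.25                          # fixed unseen input X*
n_repeats = 20_000                     # number of simulated training sets

preds = np.empty(n_repeats)
for r in range(n_repeats):
    x = rng.uniform(0, 1, size=m)
    y = f(x) + rng.normal(0, noise_std, size=m)   # Y = f(X) + eps
    coeffs = np.polyfit(x, y, deg=degree)         # train f_hat on this training set
    preds[r] = np.polyval(coeffs, x_star)         # f_hat(X*)

noise = noise_std**2
bias2 = (preds.mean() - f(x_star))**2
variance = preds.var()

y_star = f(x_star) + rng.normal(0, noise_std, size=n_repeats)  # fresh test labels
mse = np.mean((y_star - preds)**2)

print(f"MSE ≈ {mse:.4f}   noise + bias² + variance ≈ {noise + bias2 + variance:.4f}")
```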
Overfitting relates to having a high-variance model or estimator. To fight overfitting, we need to focus on reducing the variance of the estimator, for example: increase regularization, obtain a larger dataset, decrease the number of features, use a smaller model, etc.
Underfitting relates to having a high-bias model or estimator. To fight underfitting, we need to focus on reducing the bias of the estimator, for example: decrease regularization, use more features, use a larger or more complex model, etc.
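As an illustration of the regularization lever, here is a small sketch (assuming scikit-learn is available; the synthetic data, polynomial degree and \(\alpha\) grid are placeholders): sweeping the ridge penalty \(\alpha\) trades variance (small \(\alpha\), overfitting) against bias (large \(\alpha\), underfitting), and the cross-validated error shows the compromise.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Minimal sketch: cross-validated error of a degree-10 polynomial ridge model
# for several regularization strengths alpha.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.3, size=200)

for alpha in [1e-4, 1e-2, 1.0, 100.0]:   # larger alpha -> less variance, more bias
    model = make_pipeline(PolynomialFeatures(degree=10), Ridge(alpha=alpha))
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    print(f"alpha={alpha:>7}: cross-validated MSE ≈ {-scores.mean():.3f}")
```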
This chart from Kaggle summarizes the fitting problems:
Assume we are in a binary classification setting.
Let \(S\) be a training dataset from a population distribution \(\mathcal{D}\).
Let \(\mathcal{H}\) be the family of considered models (linear models, SVMs, …).
Let \(\hat{\varepsilon}_S(h)\) represent the empirical loss on the training dataset \(S\) for a predictor \(h\):
\[\hat{\varepsilon}_S(h) = \frac{1}{m} \sum_{j=1}^{m} \mathrm{1}_{h(x^{(j)}) \neq y^{(j)}}\]
Let \(\hat{h}\) be the best predictor of the family \(\mathcal{H}\) given the training dataset \(S\) and the loss \(\hat{\varepsilon}_S\):
\[\hat{h} = \arg\min_{h \in \mathcal{H}} \hat{\varepsilon}_S(h)\]
Let \(\varepsilon(h)=P_{(x,y) \sim \mathcal{D}}(h(x) \neq y)\) be the generalization error.
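To fix ideas, the following sketch (all names and numbers are hypothetical) performs this empirical risk minimization over a small finite family \(\mathcal{H}\) of threshold classifiers \(h_t(x) = \mathrm{1}_{x > t}\).

```python
import numpy as np

# Minimal sketch: empirical risk minimization over a finite family H of
# threshold classifiers h_t(x) = 1{x > t}, using the 0-1 loss eps_hat_S.
rng = np.random.default_rng(0)

m = 200
x = rng.uniform(0, 1, size=m)
labels = (x > 0.6).astype(int)
flip = rng.uniform(size=m) < 0.1          # corrupt 10% of the labels with noise
y = np.where(flip, 1 - labels, labels)

thresholds = np.linspace(0, 1, 21)        # H, with k = |H| = 21 predictors

def empirical_error(t):                   # eps_hat_S(h_t) = mean of the 0-1 loss
    return np.mean((x > t).astype(int) != y)

errors = [empirical_error(t) for t in thresholds]
t_hat = thresholds[int(np.argmin(errors))]          # h_hat = argmin over H
print(f"h_hat: threshold {t_hat:.2f}, training error {min(errors):.3f}")
```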
Let \(h^{*}\) be the best possible predictor from \(\mathcal{H}\) on the whole population \(\mathcal{D}\), i.e. let \(h^{*}\) be the predictor that minimizes the loss \(\varepsilon\):
\[h^{*} = \arg\min_{h \in \mathcal{H}} \varepsilon(h)\]
Let \(Z_1, \ldots, Z_m\) be \(m\) independent and identically distributed (iid) random variables drawn from a Bernoulli distribution:
\(\phi\) is the true parameter of the distribution. Now let \(\hat{\phi}\) be an estimator of \(\phi\) computed from \(m\) samples randomly drawn from this population, i.e.:
\[\hat{\phi}=\frac{1}{m}\sum_{i=1}^{m}Z_i\]Then the Hoeffding inequality says:
\[P(\vert \phi - \hat{\phi} \vert \gt \gamma) \leq 2 \exp(-2 \gamma^2 m)\]
The probability that the estimator \(\hat{\phi}\) is farther from \(\phi\) than \(\gamma\) is bounded by an expression that depends on \(\gamma\) and the number \(m\) of training samples.
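The inequality is easy to probe numerically. The sketch below (the values of \(\phi\), \(m\) and \(\gamma\) are arbitrary) compares the empirical deviation probability with the Hoeffding bound \(2 \exp(-2 \gamma^2 m)\).

```python
import numpy as np

# Minimal sketch: Monte Carlo check of Hoeffding's inequality for Bernoulli(phi).
rng = np.random.default_rng(0)

phi, m, gamma = 0.3, 100, 0.1
n_trials = 200_000

z = rng.random((n_trials, m)) < phi   # n_trials datasets of m Bernoulli(phi) draws
phi_hat = z.mean(axis=1)              # one estimator phi_hat per dataset

empirical = np.mean(np.abs(phi - phi_hat) > gamma)
bound = 2 * np.exp(-2 * gamma**2 * m)
print(f"P(|phi - phi_hat| > gamma) ≈ {empirical:.4f}  <=  bound {bound:.4f}")
```

The bound is loose (it holds for any \(\phi\)), but it only depends on \(\gamma\) and \(m\), which is what makes it useful here.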
Note that \(Z = \mathrm{1}_{h(x^{(j)}) \neq y^{(j)}}\) is a Bernoulli variable. The Hoeffding inequality can therefore be applied to the estimation \(\hat{\varepsilon}_S\) of the true error \(\varepsilon\):
\[P(\vert \varepsilon(h) - \hat{\varepsilon}_S(h) \vert \gt \gamma) \leq 2 \exp(-2 \gamma^2 m)\]
The probability that the error \(\hat{\varepsilon}_S\) estimated on \(S\) is farther from the generalization error \(\varepsilon\) than \(\gamma\) is bounded by an expression that depends on \(\gamma\) and the quantity \(m\) of training data.
Assume that \(\mathcal{H}\) is finite, i.e. there exists a finite number of predictors (of possible parametrizations) in \(\mathcal{H}\). Let the number of predictors of \(\mathcal{H}\) be \(k\), i.e. \(\vert \mathcal{H} \vert = k\).
Using the union bound lemma we can generalize from one \(h\) to every \(h \in \mathcal{H}\) (see the CS229 course notes on Bias-Variance for computation details):
\[P(\exists h \in \mathcal{H} \;\; \vert \varepsilon(h) - \hat{\varepsilon}_S(h) \vert \gt \gamma) \leq 2k \exp(-2 \gamma^2 m)\]Where:
We can also write:
\[P(\forall h \in \mathcal{H} \;\; \vert \varepsilon(h) - \hat{\varepsilon}_S(h) \vert \leq \gamma) \geq 1 - 2k \exp(-2 \gamma^2 m)\]
We can also invert the inequality and express the bound in terms of the other variables.
First, let \(\delta\) be an upper bound on the probability of a large deviation:
\[P(\exists h \in \mathcal{H} \;\; \vert \varepsilon(h) - \hat{\varepsilon}_S(h) \vert \gt \gamma) \leq \delta\]
We can ask the following question: for a given \(\gamma\) and a chosen probability \(\delta\), how large must \(m\) be before we can guarantee that, with probability at least \(1 - \delta\), the training error will be within \(\gamma\) of the generalization error?
For:
\[m \geq \frac{1}{2 \gamma^2} \log \frac{2k}{\delta}\]
we have that \(\vert \varepsilon(h) - \hat{\varepsilon}_S(h) \vert \leq \gamma \;\; \forall h \in \mathcal{H}\) with probability at least \(1 - \delta\).
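As a quick numerical illustration (the values of \(\gamma\), \(\delta\) and \(k\) below are arbitrary placeholders), the sketch evaluates this sample-complexity bound, as well as its inversion into a bound on \(\gamma\) for a fixed \(m\), which is derived just below.

```python
import numpy as np

# Minimal sketch: evaluate m >= 1/(2*gamma^2) * log(2k/delta) and, for a fixed m,
# the inverted bound gamma <= sqrt(1/(2m) * log(2k/delta)) derived just below.
gamma, delta, k = 0.05, 0.05, 10_000

m_required = np.log(2 * k / delta) / (2 * gamma**2)
print(f"m required for gamma={gamma}, delta={delta}, k={k}: {int(np.ceil(m_required))}")

m = 50_000                            # suppose we have this many training samples
gamma_bound = np.sqrt(np.log(2 * k / delta) / (2 * m))
print(f"with m={m}: |eps(h) - eps_hat_S(h)| <= {gamma_bound:.4f} for all h in H, "
      f"with probability at least {1 - delta}")
```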
Recall that \(\vert \varepsilon(h) - \hat{\varepsilon}_S(h) \vert \leq \gamma\) with probability at least \(1 - \delta\). Hence a bound on \(\gamma\) gives a bound on \(\vert \varepsilon(h) - \hat{\varepsilon}_S(h) \vert\).
For a given \(m\) and a chosen probability \(\delta\), how small can the generalization gap \(\gamma\) be?
Hence with probability at least \(1 - \delta\) we have that:
\[\vert \varepsilon(h) - \hat{\varepsilon}_S(h) \vert \leq \gamma \leq \sqrt{\frac{1}{2 m} \log \frac{2k}{\delta}}\]
We can now bound the generalization error of our best estimator \(\hat{h}\) on \(S\) against the best estimator \(h^{*}\) on the true population:
Recall that \(\vert \varepsilon(h) - \hat{\varepsilon}_S(h) \vert \leq \gamma\) for all \(h\) with probability at least \(1 - \delta\). Hence \(\varepsilon(h) \leq \hat{\varepsilon}_S(h) + \gamma\) and \(\hat{\varepsilon}_S(h) \leq \varepsilon(h) + \gamma\).
\[\begin{eqnarray} \varepsilon(\hat{h}) && \leq \hat{\varepsilon}_S(\hat{h}) + \gamma \;\;\;\;\; && \left(\vert \varepsilon(\hat{h}) - \hat{\varepsilon}_S(\hat{h}) \vert \leq \gamma\right) \\ && \leq \hat{\varepsilon}_S(h^{*}) + \gamma \;\;\;\;\; && \left(\hat{h} \text{ minimizes } \hat{\varepsilon}_S\right) \\ && \leq \varepsilon(h^{*}) + 2 \gamma \;\;\;\;\; && \left(\hat{\varepsilon}_S(h^{*}) \leq \varepsilon(h^{*}) + \gamma\right) \end{eqnarray}\]
Using the bound on \(\gamma\) that we get from inverting the inequality, we finally get:
\[\varepsilon(\hat{h}) \leq \varepsilon(h^{*}) + 2 \sqrt{\frac{1}{2 m} \log \frac{2k}{\delta}}\]
Or
\[\varepsilon(\hat{h}) \leq \left(\min_{h \in \mathcal{H}} \varepsilon(h) \right) + 2 \sqrt{\frac{1}{2 m} \log \frac{2k}{\delta}}\]
We can also restate these results as complexity bounds; for example, for \(m\) we get:
\[\begin{eqnarray} m && \geq \frac{1}{2 \gamma^2} \log \frac{2k}{\delta} \\ && = \mathcal{O} \left(\frac{1}{\gamma^2} \log \frac{k}{\delta}\right) \end{eqnarray}\]
which implies \(\vert \varepsilon(h) - \hat{\varepsilon}_S(h) \vert \leq \gamma \;\; \forall h \in \mathcal{H}\) with probability at least \(1 - \delta\).
For an infinite family of models \(\mathcal{H}\), the number of models is replaced by the Vapnik-Chervonenkis (VC) dimension of the family \(\mathcal{H}\). Let \(d\) be the VC dimension of \(\mathcal{H}\).
We again express the bound as a complexity bound using \(\mathcal{O}\) notation: with probability at least \(1 - \delta\), for all \(h \in \mathcal{H}\),
\[\vert \varepsilon(h) - \hat{\varepsilon}_S(h) \vert \leq \mathcal{O} \left(\sqrt{\frac{d}{m} \log \frac{m}{d} + \frac{1}{m} \log \frac{1}{\delta}}\right)\]Or also:
\[\varepsilon(\hat{h}) \leq \varepsilon(h^{*}) + \mathcal{O} \left(\sqrt{\frac{d}{m} \log \frac{m}{d} + \frac{1}{m} \log \frac{1}{\delta}}\right)\]See: