Bias and variance are two phenomena that explain part of the error of a predictor on the test set.
The error of a predictor may be split into:
An important observation is that a model trained on a given dataset is itself random, since it depends on the random, unpredictable values (the \(\varepsilon_i\)) present in the training set.
That’s why talking about the bias and the variance of an estimator makes sense.
Statistical bias measures how much the expected value of an estimator differs from the true underlying parameter being estimated. It hence represents an additional error of our estimated learner compared to the true parameter.
Bias comes from erroneous assumptions in the learning algorithm and is linked to underfitting.
Let \(\hat{\theta}\) be an estimator of a parameter \(\theta\) computed on \(S\); then the bias of \(\hat{\theta}\) is defined as follows:
\[Bias(\hat{\theta}) = \mathbb{E}(\hat{\theta}) - \theta\]
The error due to variance comes from sensitivity to small fluctuations in the training set.
High variance may result from an algorithm modeling the random noise in the training data and is linked to overfitting. A model with high variance will have generalization problems as it sticks too closely to the training data.
Let \(\hat{\theta}\) be an estimator of a parameter \(\theta\) computed on \(S\); then the variance of \(\hat{\theta}\) is defined as follows:
\[Var(\hat{\theta}) = \mathbb{E}_S\left[(\hat{\theta} - \mathbb{E}_S[\hat{\theta}])^2\right]\]
Where:
Hence the variance of \(\hat{\theta}\) is the variance of the estimator with respect to the training dataset \(S\).
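To make these definitions concrete, here is a minimal simulation sketch (not part of the original derivation; the distribution, sample size and estimator are arbitrary choices): it approximates the bias and the variance of the uncorrected variance estimator by recomputing it on many independent training sets \(S\).

```python
import numpy as np

# Minimal sketch: approximate Bias(theta_hat) = E[theta_hat] - theta and
# Var(theta_hat) by drawing many independent training sets S of size m.
rng = np.random.default_rng(0)

m = 20                # size of each training set S
n_datasets = 100_000  # number of simulated training sets
sigma2 = 4.0          # true parameter theta = Var(Z)

samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(n_datasets, m))
theta_hat = samples.var(axis=1)      # uncorrected variance estimate, one per dataset

bias = theta_hat.mean() - sigma2     # E_S[theta_hat] - theta (theory: -sigma2/m)
variance = theta_hat.var()           # Var_S[theta_hat]

print(f"empirical bias     = {bias:.3f}  (theory: {-sigma2 / m:.3f})")
print(f"empirical variance = {variance:.3f}")
```

The empirical bias should be close to the theoretical value \(-\sigma^2/m\); both quantities are expectations over the random training set \(S\).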
Let \(\hat{D}=\{(X^{(j)}, Y^{(j)})\}_{j=1}^m\) be a dataset. Assume \(Y=f(X)+\varepsilon\) with \(\mathbb{E}[\varepsilon]=0\). Let \(\hat{f}_m\) be an estimator trained on the dataset \(\hat{D}\).
For a new unseen sample pair \((X^{*}, Y^{*})\), the expected MSE is:
\[\begin{eqnarray} MSE(\hat{f}_m) &&= \mathrm{E}\left[\left(Y^{*}-\hat{f}_m(X^{*})\right)^2\right] \\ &&= \mathrm{E}\left[\left(\varepsilon+f(X^{*})-\hat{f}_m(X^{*})\right)^2\right] \\ &&= \mathrm{E}[\varepsilon^2] + \mathrm{E}\left[\left(f(X^{*})-\hat{f}_m(X^{*})\right)^2\right]+ \mathrm{E}\left[2\varepsilon\left(f(X^{*})-\hat{f}_m(X^{*})\right)\right] \\ &&= \mathrm{E}[\varepsilon^2] + \mathrm{E}\left[\left(f(X^{*})-\hat{f}_m(X^{*})\right)^2\right]+ \mathrm{E}\left[\varepsilon\right]\mathrm{E}\left[2\left(f(X^{*})-\hat{f}_m(X^{*})\right)\right] \;\;\;\;\; && \left(\varepsilon \perp \hat{f}_m(X^{*}),\; \mathrm{E}\left[\varepsilon\right]=0\right)\\ &&= \mathrm{E}[\varepsilon^2] + \mathrm{E}\left[\left(f(X^{*})-\hat{f}_m(X^{*})\right)^2\right] \\ &&= \mathrm{E}[\varepsilon^2] + \mathrm{E}\left[f(X^{*})-\hat{f}_m(X^{*})\right]^2 + \mathrm{V}\left[f(X^{*})-\hat{f}_m(X^{*})\right] \;\;\;\;\; && \left(\mathrm{E}[X^2]=\mathrm{E}[X]^2+\mathrm{V}[X]\right) \\ &&= \mathrm{E}[\varepsilon^2] + \mathrm{E}\left[f(X^{*})-\hat{f}_m(X^{*})\right]^2 + \mathrm{V}\left[\hat{f}_m(X^{*})\right] \;\;\;\;\; && \left(\mathrm{V}[a-X]=\mathrm{V}[X]\right) \end{eqnarray}\]
On the final line:
Such a clean decomposition into Bias and Variance terms exists only for the squared error loss.
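The decomposition can be checked numerically. The sketch below (illustrative only; the true function, noise level and polynomial degree are arbitrary choices) retrains a polynomial least-squares fit on many training sets and compares the expected squared error at a fixed test point \(X^{*}\) with \(\mathrm{E}[\varepsilon^2] + \mathrm{Bias}^2 + \mathrm{Variance}\).

```python
import numpy as np

# Minimal sketch: verify MSE ≈ noise + bias^2 + variance at a fixed test point.
rng = np.random.default_rng(0)

def f(x):                              # true (unknown) regression function
    return np.sin(2 * np.pi * x)

m, degree, noise_std = 30, 3, 0.3
x_star = 0.25                          # fixed unseen input X*
n_repeats = 20_000                     # number of simulated training sets

preds = np.empty(n_repeats)
for r in range(n_repeats):
    x = rng.uniform(0, 1, size=m)
    y = f(x) + rng.normal(0, noise_std, size=m)   # Y = f(X) + eps
    coeffs = np.polyfit(x, y, deg=degree)         # train f_hat on this training set
    preds[r] = np.polyval(coeffs, x_star)         # f_hat(X*)

noise = noise_std**2
bias2 = (preds.mean() - f(x_star))**2
variance = preds.var()

y_star = f(x_star) + rng.normal(0, noise_std, size=n_repeats)  # fresh test labels
mse = np.mean((y_star - preds)**2)

print(f"MSE ≈ {mse:.4f}   noise + bias² + variance ≈ {noise + bias2 + variance:.4f}")
```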
Overfitting relates to having a high-variance model or estimator. To fight overfitting, we need to focus on reducing the variance of the estimator, for example: increase regularization, obtain a larger dataset, decrease the number of features, use a smaller model, etc.
Underfitting relates to having a high-bias model or estimator. To fight underfitting, we need to focus on reducing the bias of the estimator, for example: decrease regularization, use more features, use a larger or more complex model, etc.
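As an illustration of the regularization lever, here is a small sketch (assuming scikit-learn is available; the synthetic data, polynomial degree and \(\alpha\) grid are placeholders): sweeping the ridge penalty \(\alpha\) trades variance (small \(\alpha\), overfitting) against bias (large \(\alpha\), underfitting), and the cross-validated error shows the compromise.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Minimal sketch: cross-validated error of a degree-10 polynomial ridge model
# for several regularization strengths alpha.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.3, size=200)

for alpha in [1e-4, 1e-2, 1.0, 100.0]:   # larger alpha -> less variance, more bias
    model = make_pipeline(PolynomialFeatures(degree=10), Ridge(alpha=alpha))
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    print(f"alpha={alpha:>7}: cross-validated MSE ≈ {-scores.mean():.3f}")
```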
This chart from Kaggle summarizes the fitting problems:
Assume we are in a binary classification setting.
Let \(S\) be a training dataset from a population distribution \(\mathcal{D}\).
Let \(\mathcal{H}\) be the family of considered models (linear models, SVMs, …).
Let \(\hat{\varepsilon}_S(h)\) represent the empirical loss on the training dataset \(S\) for a predictor \(h\):
\[\hat{\varepsilon}_S(h) = \frac{1}{m} \sum_{j=1}^{m} \mathrm{1}_{h(x^{(j)}) \neq y^{(j)}}\]
Let \(\hat{h}\) be the best predictor of the family \(\mathcal{H}\) given the training dataset \(S\) and the loss \(\hat{\varepsilon}_S\):
\[\hat{h} = \arg\min_{h \in \mathcal{H}} \hat{\varepsilon}_S(h)\]
Let \(\varepsilon(h)=P_{(x,y) \sim \mathcal{D}}(h(x) \neq y)\) be the generalization error.
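To fix ideas, the following sketch (all names and numbers are hypothetical) performs this empirical risk minimization over a small finite family \(\mathcal{H}\) of threshold classifiers \(h_t(x) = \mathrm{1}_{x > t}\).

```python
import numpy as np

# Minimal sketch: empirical risk minimization over a finite family H of
# threshold classifiers h_t(x) = 1{x > t}, using the 0-1 loss eps_hat_S.
rng = np.random.default_rng(0)

m = 200
x = rng.uniform(0, 1, size=m)
labels = (x > 0.6).astype(int)
flip = rng.uniform(size=m) < 0.1          # corrupt 10% of the labels with noise
y = np.where(flip, 1 - labels, labels)

thresholds = np.linspace(0, 1, 21)        # H, with k = |H| = 21 predictors

def empirical_error(t):                   # eps_hat_S(h_t) = mean of the 0-1 loss
    return np.mean((x > t).astype(int) != y)

errors = [empirical_error(t) for t in thresholds]
t_hat = thresholds[int(np.argmin(errors))]          # h_hat = argmin over H
print(f"h_hat: threshold {t_hat:.2f}, training error {min(errors):.3f}")
```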
Let \(h^{*}\) be the best possible predictor from \(\mathcal{H}\) on the whole population \(\mathcal{D}\), i.e. let \(h^{*}\) be the predictor that minimizes the loss \(\varepsilon\):
\[h^{*} = \arg\min_{h \in \mathcal{H}} \varepsilon(h)\]
Let \(Z_1, \ldots, Z_m\) be \(m\) independent and identically distributed (iid) random variables drawn from a Bernoulli distribution:
\(\phi\) is the true parameter of the distribution. Now let \(\hat{\phi}\) be an estimator of \(\phi\) computed from \(m\) samples randomly drawn from this population, i.e.:
\[\hat{\phi}=\frac{1}{m}\sum_{i=1}^{m}Z_i\]Then the Hoeffding inequality says:
\[P(\vert \phi - \hat{\phi} \vert \gt \gamma) \leq 2 \exp(-2 \gamma^2 m)\]
The probability that the estimator \(\hat{\phi}\) is farther from \(\phi\) than \(\gamma\) is bounded by an expression that depends on \(\gamma\) and the number \(m\) of training samples.
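The inequality is easy to probe numerically. The sketch below (the values of \(\phi\), \(m\) and \(\gamma\) are arbitrary) compares the empirical deviation probability with the Hoeffding bound \(2 \exp(-2 \gamma^2 m)\).

```python
import numpy as np

# Minimal sketch: Monte Carlo check of Hoeffding's inequality for Bernoulli(phi).
rng = np.random.default_rng(0)

phi, m, gamma = 0.3, 100, 0.1
n_trials = 200_000

z = rng.random((n_trials, m)) < phi   # n_trials datasets of m Bernoulli(phi) draws
phi_hat = z.mean(axis=1)              # one estimator phi_hat per dataset

empirical = np.mean(np.abs(phi - phi_hat) > gamma)
bound = 2 * np.exp(-2 * gamma**2 * m)
print(f"P(|phi - phi_hat| > gamma) ≈ {empirical:.4f}  <=  bound {bound:.4f}")
```

The bound is loose (it holds for any \(\phi\)), but it only depends on \(\gamma\) and \(m\), which is what makes it useful here.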
Note that \(Z = \mathrm{1}_{h(x^{(j)}) \neq y^{(j)}}\) is a Bernoulli variable. The Hoeffding inequality can therefore be applied to the estimation \(\hat{\varepsilon}_S\) of the true error \(\varepsilon\):
\[P(\vert \varepsilon(h) - \hat{\varepsilon}_S(h) \vert \gt \gamma) \leq 2 \exp(-2 \gamma^2 m)\]
The probability that the error \(\hat{\varepsilon}_S\) estimated on \(S\) is farther from the generalization error \(\varepsilon\) than \(\gamma\) is bounded by an expression that depends on \(\gamma\) and the quantity \(m\) of training data.
Assume that \(\mathcal{H}\) is finite, i.e. there exists a finite number of predictors (of possible parametrizations) in \(\mathcal{H}\). Let the number of predictors of \(\mathcal{H}\) be \(k\), i.e. \(\vert \mathcal{H} \vert = k\).
Using the union bound lemma we can generalize from one \(h\) to every \(h \in \mathcal{H}\) (see the CS229 course notes on Bias-Variance for computation details):
\[P(\exists h \in \mathcal{H} \;\; \vert \varepsilon(h) - \hat{\varepsilon}_S(h) \vert \gt \gamma) \leq 2k \exp(-2 \gamma^2 m)\]Where:
We can also write:
\[P(\forall h \in \mathcal{H} \;\; \vert \varepsilon(h) - \hat{\varepsilon}_S(h) \vert \leq \gamma) \geq 1 - 2k \exp(-2 \gamma^2 m)\]
We can also invert the inequality and express the bound in terms of the other variables.
First, let \(\delta\) be an upper bound on the probability of a large deviation:
\[P(\exists h \in \mathcal{H} \;\; \vert \varepsilon(h) - \hat{\varepsilon}_S(h) \vert \gt \gamma) \leq \delta\]
We can ask the following question: for a given \(\gamma\) and a chosen probability \(\delta\), how large must \(m\) be before we can guarantee that, with probability at least \(1 - \delta\), the training error will be within \(\gamma\) of the generalization error?
For:
\[m \geq \frac{1}{2 \gamma^2} \log \frac{2k}{\delta}\]
we have that \(\vert \varepsilon(h) - \hat{\varepsilon}_S(h) \vert \leq \gamma \;\; \forall h \in \mathcal{H}\) with probability at least \(1 - \delta\).
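As a quick numerical illustration (the values of \(\gamma\), \(\delta\) and \(k\) below are arbitrary placeholders), the sketch evaluates this sample-complexity bound, as well as its inversion into a bound on \(\gamma\) for a fixed \(m\), which is derived just below.

```python
import numpy as np

# Minimal sketch: evaluate m >= 1/(2*gamma^2) * log(2k/delta) and, for a fixed m,
# the inverted bound gamma <= sqrt(1/(2m) * log(2k/delta)) derived just below.
gamma, delta, k = 0.05, 0.05, 10_000

m_required = np.log(2 * k / delta) / (2 * gamma**2)
print(f"m required for gamma={gamma}, delta={delta}, k={k}: {int(np.ceil(m_required))}")

m = 50_000                            # suppose we have this many training samples
gamma_bound = np.sqrt(np.log(2 * k / delta) / (2 * m))
print(f"with m={m}: |eps(h) - eps_hat_S(h)| <= {gamma_bound:.4f} for all h in H, "
      f"with probability at least {1 - delta}")
```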
Recall that \(\vert \varepsilon(h) - \hat{\varepsilon}_S(h) \vert \leq \gamma\) with probability at least \(1 - \delta\). Hence a bound on \(\gamma\) gives a bound on \(\vert \varepsilon(h) - \hat{\varepsilon}_S(h) \vert\).
For a given \(m\) and a chosen probability \(\delta\), how small can the generalization gap \(\gamma\) be?
Hence with probability at least \(1 - \delta\) we have that:
\[\vert \varepsilon(h) - \hat{\varepsilon}_S(h) \vert \leq \gamma \leq \sqrt{\frac{1}{2 m} \log \frac{2k}{\delta}}\]
We can now bound the generalization error of our best estimator \(\hat{h}\) on \(S\) against the best estimator \(h^{*}\) on the true population:
Recall that \(\vert \varepsilon(h) - \hat{\varepsilon}_S(h) \vert \leq \gamma\) for all \(h\) with probability at least \(1 - \delta\). Hence \(\varepsilon(h) \leq \hat{\varepsilon}_S(h) + \gamma\) and \(\hat{\varepsilon}_S(h) \leq \varepsilon(h) + \gamma\).
\[\begin{eqnarray} \varepsilon(\hat{h}) && \leq \hat{\varepsilon}_S(\hat{h}) + \gamma \;\;\;\;\; && \left(\vert \varepsilon(\hat{h}) - \hat{\varepsilon}_S(\hat{h}) \vert \leq \gamma\right) \\ && \leq \hat{\varepsilon}_S(h^{*}) + \gamma \;\;\;\;\; && \left(\hat{h} \text{ minimizes } \hat{\varepsilon}_S\right) \\ && \leq \varepsilon(h^{*}) + 2 \gamma \;\;\;\;\; && \left(\hat{\varepsilon}_S(h^{*}) \leq \varepsilon(h^{*}) + \gamma\right) \end{eqnarray}\]
Using the bound on \(\gamma\) that we get from inverting the inequality, we finally get:
\[\varepsilon(\hat{h}) \leq \varepsilon(h^{*}) + 2 \sqrt{\frac{1}{2 m} \log \frac{2k}{\delta}}\]
Or
\[\varepsilon(\hat{h}) \leq \left(\min_{h \in \mathcal{H}} \varepsilon(h) \right) + 2 \sqrt{\frac{1}{2 m} \log \frac{2k}{\delta}}\]
We can also restate these results as complexity bounds; for example, for \(m\) we get:
\[\begin{eqnarray} m && \geq \frac{1}{2 \gamma^2} \log \frac{2k}{\delta} \\ && = \mathcal{O} \left(\frac{1}{\gamma^2} \log \frac{k}{\delta}\right) \end{eqnarray}\]
which implies \(\vert \varepsilon(h) - \hat{\varepsilon}_S(h) \vert \leq \gamma \;\; \forall h \in \mathcal{H}\) with probability at least \(1 - \delta\).
For an infinite family of models \(\mathcal{H}\), the number of models is replaced by the Vapnik-Chervonenkis (VC) dimension of the family \(\mathcal{H}\). Let \(d\) be the VC dimension of \(\mathcal{H}\).
We again express the bound as a complexity bound using \(\mathcal{O}\) notation: with probability at least \(1 - \delta\), for all \(h \in \mathcal{H}\),
\[\vert \varepsilon(h) - \hat{\varepsilon}_S(h) \vert \leq \mathcal{O} \left(\sqrt{\frac{d}{m} \log \frac{m}{d} + \frac{1}{m} \log \frac{1}{\delta}}\right)\]Or also:
\[\varepsilon(\hat{h}) \leq \varepsilon(h^{*}) + \mathcal{O} \left(\sqrt{\frac{d}{m} \log \frac{m}{d} + \frac{1}{m} \log \frac{1}{\delta}}\right)\]See: