Supervised Example: Polynomial regression

$
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
\DeclareMathOperator{\dom}{dom}
\DeclareMathOperator{\sigm}{sigm}
\DeclareMathOperator{\softmax}{softmax}
\DeclareMathOperator{\sign}{sign}
$

Let us now consider a simple regression problem on the dataset of Figure 5. The problem is to find an approximation of the function $f(x)=\sin(x)+\frac{1}{2}x$ given observations of $f(x)$ perturbed by small Gaussian noise. Notice that the observations are split into two datasets: the training set, used to learn the parameters, and the test set, used to evaluate the quality of the resulting model.


Figure 5: Example of a regression dataset: a noisy version of the function $\sin(x)+\frac{1}{2}x$.
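
To make the setup concrete, here is a minimal sketch of how such a dataset could be generated. The sampling interval, sample sizes, and noise scale are illustrative assumptions, not values taken from the figure:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, for reproducibility only

def f(x):
    # Target function f(x) = sin(x) + x/2
    return np.sin(x) + 0.5 * x

# Sample inputs and perturb f(x) with small Gaussian noise
# (interval, sizes, and noise scale are illustrative assumptions).
x = rng.uniform(0.0, 10.0, size=40)
y = f(x) + rng.normal(scale=0.3, size=x.shape)

# Split the observations into a training set and a test set.
n_train = 30
x_train, y_train = x[:n_train], y[:n_train]
x_test, y_test = x[n_train:], y[n_train:]
```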

In order to solve this problem, we propose to use polynomial regression, that is, to fit a polynomial of fixed degree $K$ to the training points. The model is defined by:

$$\hat{f}(x)=\sum_{k=0}^{K}a_{k}x^{k}$$
where $(a_{0},\dots,a_{K})$ are the parameters of the model.

Learning then amounts to an optimization problem: the goal is to find the parameters $a_{k}$ that minimize the MSE over the training set $\mathcal{D}$ (there are many readily available algorithms for this kind of problem; here we use the `polyfit` function of the NumPy Python package). That is, $\hat{f}$ is the polynomial whose coefficients satisfy
$$(\hat{a}_{0},\dots,\hat{a}_{K})=\argmin_{(a_{0},\dots,a_{K})}\sum_{(x,y)\in\mathcal{D}}\left[y-\sum_{k=0}^{K}a_{k}x^{k}\right]^{2}$$
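
As a concrete sketch, this least-squares problem can be solved with NumPy's `polyfit`, as the text suggests. The degree and variable names below follow the illustrative dataset sketch above:

```python
import numpy as np

K = 3  # polynomial degree (illustrative choice)

# np.polyfit solves the least-squares problem above; it returns the
# coefficients (a_K, ..., a_1, a_0), ordered from highest degree down.
coeffs = np.polyfit(x_train, y_train, deg=K)

# np.polyval evaluates the fitted polynomial; compare MSE on both sets.
mse_train = np.mean((y_train - np.polyval(coeffs, x_train)) ** 2)
mse_test = np.mean((y_test - np.polyval(coeffs, x_test)) ** 2)
print(f"train MSE = {mse_train:.3f}, test MSE = {mse_test:.3f}")
```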


Figure 6: Best polynomial fits for several degrees on the example regression dataset. (a) gives a case of under-fitting. (b) and (c) are examples of what would be considered good fits. In (d) we see an example of over-fitting.

Depending on the degree we choose for the polynomial, Figure 6 shows that the results can be very different. When the degree is too small, we see an example of under-fitting: the model is too simple to represent the target function accurately. On the other hand, if the degree of the polynomial is too large, we run into over-fitting: the model fits the training points more closely, but points in the test set are no longer well approximated.

Choosing the best degree $K$ for the polynomial is itself a secondary optimization problem. In this setting, $K$ is a special kind of parameter called a hyper-parameter.
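
One simple way to probe this hyper-parameter is to sweep over candidate degrees and watch how the training and test errors diverge. The sketch below reuses the hypothetical variables from the earlier snippets:

```python
# Under-fitting shows up as high error on both sets (degree too small);
# over-fitting as low training error but high test error (degree too large).
for K in range(0, 16):
    coeffs = np.polyfit(x_train, y_train, deg=K)
    mse_train = np.mean((y_train - np.polyval(coeffs, x_train)) ** 2)
    mse_test = np.mean((y_test - np.polyval(coeffs, x_test)) ** 2)
    print(f"K={K:2d}  train MSE={mse_train:.3f}  test MSE={mse_test:.3f}")
```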

Next: Model selection