Supervised and unsupervised learning

$
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
\DeclareMathOperator{\dom}{dom}
\DeclareMathOperator{\sigm}{sigm}
\DeclareMathOperator{\softmax}{softmax}
\DeclareMathOperator{\sign}{sign}
\DeclareMathOperator{\ind}{ind}
$

In supervised learning, the objective is to approximate an unknown function $f^{*}$ from a number of observations $(\mathbf{x},\mathbf{y})$ with the assumption that $\mathbf{y}=f^{*}(\mathbf{x})$. These observations are called learning examples and are usually compiled into a set $\mathcal{D}=\{(\mathbf{x}_{1},\mathbf{y}_{1}),\dots,(\mathbf{x}_{N},\mathbf{y}_{N})\}$ called a dataset.

Definition 2.1 (supervised learning problem)
Consider a set $\mathcal{D}=\{(\mathbf{x}_{1},\mathbf{y}_{1}),\dots,(\mathbf{x}_{N},\mathbf{y}_{N})\}$ and a class of functions $\mathcal{H}$. The problem of finding a function $\hat{f}\in\mathcal{H}$ matching inputs $\mathbf{x}_{i}$ to their expected output $\mathbf{y}_{i}$ as in
$$\hat{f}(\mathbf{x}_{i})\approx\mathbf{y}_{i}$$
is called a supervised learning problem. The set $\mathcal{D}$ is referred to as the training dataset or simply training set. The class of functions $\mathcal{H}$ is called the hypothesis space. The expected output $\mathbf{y}$ is often referred to as the label or the target.

Supervised learning problems arise in many settings but can often be reduced to either a classification problem, when the label $\mathbf{y}$ is a natural number $\mathbf{y}\in\mathbb{N}$, or a regression problem, when $\mathbf{y}$ is a real vector $\mathbf{y}\in\mathbb{R}^{K}$. We now define these two problems formally as optimization problems.

Definition 2.2 (classification problem)
Classification concerns the case when the label $\mathbf{y}$ can be interpreted as a class variable and $\mathbf{y}\in\mathbb{N}$. The loss to minimize is usually the misclassification rate, i.e. the 0-1 loss:
$$\hat{f}=\argmin_{f\in\mathcal{H}}\sum_{(\mathbf{x},\mathbf{y})\in\mathcal{D}}\ind\{\mathbf{y}\neq f(\mathbf{x})\}$$
where $\ind$ is the indicator function equal to $1$ when the argument evaluates to true and $0$ otherwise.
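
As an illustration (a minimal sketch, not part of the text above), consider a small finite hypothesis space of threshold classifiers on a toy 1-D dataset: $\hat{f}$ is simply the hypothesis with the fewest misclassifications on $\mathcal{D}$. The data and the helper `zero_one_loss` below are invented for the example.

```python
import numpy as np

def zero_one_loss(f, X, y):
    """Total number of misclassified examples, i.e. the sum of ind{y != f(x)}."""
    return int(np.sum(f(X) != y))

# Toy 1-D dataset: the label is 1 when x exceeds some unknown threshold.
X = np.array([0.2, 1.5, 2.7, 3.1, 4.8])
y = np.array([0, 0, 1, 1, 1])

# Hypothesis space H: threshold classifiers f_t(x) = ind{x > t}.
H = [lambda X, t=t: (X > t).astype(int) for t in [1.0, 2.0, 3.0, 4.0]]

# f_hat is the hypothesis with the fewest misclassifications on D.
f_hat = min(H, key=lambda f: zero_one_loss(f, X, y))
print(zero_one_loss(f_hat, X, y))  # 0 errors, attained by the threshold t = 2.0
```
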
Definition 2.3 (regression problem)
Regression concerns the case when the target variable $\mathbf{y}$ is in $\mathbb{R}^{K}$ and the loss to minimize is usually the average squared distance between $\mathbf{y}$ and $f(\mathbf{x})$, i.e. the Mean Squared Error (MSE):
$$\hat{f}=\argmin_{f\in\mathcal{H}}\sum_{(\mathbf{x},\mathbf{y})\in\mathcal{D}}\left\Vert \mathbf{y}-f(\mathbf{x})\right\Vert ^{2}$$
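
A minimal sketch of this problem, assuming $\mathcal{H}$ is the set of affine functions $f(x)=ax+b$ on scalar inputs: the squared-error objective is then minimized in closed form by ordinary least squares. The toy data below is invented for illustration.

```python
import numpy as np

# Toy 1-D dataset, roughly y = 2x + 1 plus noise (values invented).
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# Least squares solves argmin_{a,b} sum_i (y_i - (a * x_i + b))^2 in closed form.
A = np.stack([X, np.ones_like(X)], axis=1)
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)

f_hat = lambda x: a * x + b
mse = np.mean((y - f_hat(X)) ** 2)
print(a, b, mse)
```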

In unsupervised learning we consider a dataset $\mathcal{D}=\{\mathbf{x}_{1},\mathbf{x}_{2},\dots,\mathbf{x}_{N}\}$ where there is no label. The goal can be to better understand the structure of the dataset $\mathcal{D}$, or to learn new representations that improve performance in a supervised setting.

Clustering, dimensionality reduction, and density estimation are common unsupervised learning problems and are described below. Note that we do not give a complete and formal definition of these problems, but rather illustrate the corresponding optimization problems with examples.

Clustering

The objective of a clustering algorithm is to group the observations into clusters such that points inside the same cluster are similar to each other. Formally, a clustering algorithm returns a partition of the observations into disjoint sets $\mathcal{C}_{1},\dots,\mathcal{C}_{K}$ called clusters. The model is often given by a set of points $\{\mathbf{c}_{1},\mathbf{c}_{2},\dots,\mathbf{c}_{K}\}$ called exemplars or centroids, such that each $\mathbf{c}_{i}$ is representative of the elements in the cluster $\mathcal{C}_{i}$. The objective can then be, for example, to minimize the distance from each point to the centroid of its cluster, i.e.:
$$\{\hat{\mathbf{c}}_{1},\hat{\mathbf{c}}_{2},\dots,\hat{\mathbf{c}}_{K}\}=\argmin_{\{\mathbf{c}_{1},\mathbf{c}_{2},\dots,\mathbf{c}_{K}\}}\sum_{i=1}^{K}\sum_{\mathbf{x}\in\mathcal{C}_{i}}\left\Vert \mathbf{x}-\mathbf{c}_{i}\right\Vert .$$
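
One common way to approximately minimize this kind of objective (with squared distances) is Lloyd's algorithm, better known as k-means, which alternates between assigning each point to its nearest centroid and recomputing each centroid as the mean of its cluster. The sketch below is illustrative only; the data is simulated.

```python
import numpy as np

def kmeans(X, K, n_iters=50, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iters):
        # Assignment step: index of the closest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return centroids, labels

# Two well-separated blobs in 2-D (simulated data).
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0.0, 1.0, (50, 2)), rng.normal(5.0, 1.0, (50, 2))])
centroids, labels = kmeans(X, K=2)
print(centroids)
```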

Dimensionality reduction

The purpose of dimensionality reduction is to find a lower-dimensional representation of the dataset $\mathcal{D}$. This can be done in several ways, for instance by assuming that the training samples all lie on a sub-manifold of the input space, or by trying to find a variable $\mathbf{y}$ with $\dim\mathbf{y}<\dim\mathbf{x}$ such that $\mathbf{x}$ can be reconstructed from $\mathbf{y}$, i.e. by solving the following optimization problem:
$$\{\hat{\mathbf{y}}_{1},\hat{\mathbf{y}}_{2},\dots,\hat{\mathbf{y}}_{N}\},\hat{f}=\argmin_{\{\mathbf{y}_{1},\mathbf{y}_{2},\dots,\mathbf{y}_{N}\},f\in\mathcal{H}}\sum_{i=1}^{N}\left\Vert \mathbf{x}_{i}-f(\mathbf{y}_{i})\right\Vert $$
Note that the optimization is then over both the function $f$ and the values $\mathbf{y}_{1},\mathbf{y}_{2},\dots,\mathbf{y}_{N}$, which are not provided.
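
When $f$ is restricted to affine maps and the reconstruction error is squared, this problem is solved by principal component analysis (PCA): the codes $\mathbf{y}_{i}$ are the projections of the centered data onto the top principal components, and $f$ maps them back to the input space. A minimal sketch of this special case, with simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # 5-D data (simulated)
d = 2                                                     # target dimensionality

mean = X.mean(axis=0)
Xc = X - mean
# Principal directions are the top right-singular vectors of the centered data.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
W = Vt[:d].T                       # (5, d) matrix of principal components

Y = Xc @ W                         # low-dimensional codes y_i
X_rec = Y @ W.T + mean             # reconstructions f(y_i)
print(np.mean(np.sum((X - X_rec) ** 2, axis=1)))  # mean squared reconstruction error
```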

Density estimation

Given the training dataset $\mathcal{D}=\{\mathbf{x}_{1},\mathbf{x}_{2},\dots,\mathbf{x}_{N}\}$, the goal of density estimation is to find a probability density which could have generated the dataset, often under the assumption that the training samples are i.i.d. This is usually done by maximizing the log-likelihood of $\mathcal{D}$ under a parametrized family of distributions $p_{\theta}(\mathbf{x})$, i.e. choosing $p_{\hat{\theta}}(\mathbf{x})$ such that
$$\hat{\theta}=\argmax_{\theta}\sum_{\mathbf{x}\in\mathcal{D}}\log p_{\theta}(\mathbf{x}).$$
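
As a minimal sketch, take the parametrized family to be the univariate Gaussians $p_{\theta}(\mathbf{x})=\mathcal{N}(\mathbf{x};\mu,\sigma^{2})$; the log-likelihood is then maximized in closed form by the sample mean and the (biased) sample variance. The data below is simulated for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=0.5, size=1000)  # simulated training samples

# Closed-form maximizers of the Gaussian log-likelihood.
mu_hat = X.mean()
sigma2_hat = X.var()               # MLE uses the 1/N (not 1/(N-1)) estimator

def log_likelihood(X, mu, sigma2):
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (X - mu) ** 2 / (2 * sigma2))

print(mu_hat, sigma2_hat, log_likelihood(X, mu_hat, sigma2_hat))
```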

Density estimation will be reviewed in more detail in a following section.

Next: Generalization