Least squares: Cholesky, QR and Householder

Course: Math 535 - Mathematical Methods in Data Science (MMiDS)
Author: Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison
Updated: Sep 3, 2020
Copyright: © 2020 Sebastien Roch



Motivating example: predicting sales

The following dataset is from [ISLR]. Quoting [ISLR, Section 2.1]:

Suppose that we are statistical consultants hired by a client to provide advice on how to improve sales of a particular product. The Advertising data set consists of the sales of that product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media: TV, radio, and newspaper. [...] It is not possible for our client to directly increase sales of the product. On the other hand, they can control the advertising expenditure in each of the three media. Therefore, if we determine that there is an association between advertising and sales, then we can instruct our client to adjust advertising budgets, thereby indirectly increasing sales. In other words, our goal is to develop an accurate model that can be used to predict sales on the basis of the three media budgets.

This is a regression problem. That is, we want to estimate the relationship between an outcome variable and one or more predictors (or features). Here is the data.

In [2]:
df = DataFrame(CSV.File("advertising.csv"))  # requires the CSV and DataFrames packages
first(df, 5)

5 rows × 4 columns

In [3]:
describe(df)

4 rows × 8 columns


We will focus for now on the TV budget.

In [4]:
TV, sales = df[:,:TV], df[:,:sales]
scatter(TV, sales, legend=false, alpha=0.5)

There does seem to be a relationship between the two. Roughly a higher TV budget is linked to higher sales, although the correspondence is not perfect. To express the relationship more quantitatively, we seek a function $f$ such that

$$ Y \approx f(X) $$

where $X$ denotes a sample TV budget and $Y$ is the corresponding observed sales. We might posit for instance that there exists a true $f$ and that each observation is perturbed by a noise term $\epsilon$:

$$ Y = f(X) + \epsilon. $$

A natural way to estimate such an $f$ from data is $k$-nearest-neighbors ($k$-NN) regression. Let the data be of the form $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$, where in our case $\mathbf{x}_i$ is the TV budget of the $i$-th sample and $y_i$ is the corresponding sales. For each $\mathbf{x}$ (not necessarily in the data), we do the following:

  • find the $k$ nearest $\mathbf{x}_i$'s to $\mathbf{x}$
  • take an average of the corresponding $y_i$'s.
In [5]:
function mmids_knnregression(X,y,k,xnew)
    # uses norm from LinearAlgebra and mean from Statistics
    n = size(X,1)
    closest = sortperm([norm(X[i,:].-xnew) for i=1:n])
    return mean(y[closest[1:k]])
end
mmids_knnregression (generic function with 1 method)

For $k=3$ and a grid of $1000$ points, we get the following approximation $\hat{f}$. Here the function LinRange creates an array of equally spaced points.

In [6]:
k = 3
xgrid = LinRange(minimum(TV),maximum(TV),1000)
yhat = [mmids_knnregression(TV,sales,k,xnew) for xnew in xgrid];
In [7]:
scatter(TV, sales, legend=false, alpha=0.5)
plot!(xgrid, yhat, lw=2)

A higher $k$ gives something smoother.

In [8]:
k = 10
xgrid = LinRange(minimum(TV),maximum(TV),1000)
yhat = [mmids_knnregression(TV,sales,k,xnew) for xnew in xgrid];
In [9]:
scatter(TV, sales, legend=false, alpha=0.5)
plot!(xgrid, yhat, lw=2)

One downside of $k$-NN regression is that it does not give an easily interpretable relationship: if I increase my TV budget by $\Delta \$$, how is it expected to affect the sales? Another issue arises in high dimension where the counter-intuitive phenomena we discussed in the previous lectures can have a significant impact. Recall in particular the *High-dimensional Cube Theorem*. If we have $d$ predictors -- where $d$ is large -- and our data is distributed uniformly in a bounded region, then any given $\mathbf{x}$ will be far from any of our data points. In that case, the $y_i$-values of the closest $\mathbf{x}_i$'s may not be predictive. This is referred to as the Curse of Dimensionality.
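The effect of dimension on nearest-neighbor distances is easy to see in a small simulation. The sketch below (not part of the original notebook; the function name `nn_distance_to_center` is made up for illustration) draws $n$ points uniformly in the cube $[0,1]^d$ and measures how far the center of the cube is from its nearest sample point, for increasing $d$.

```julia
using LinearAlgebra, Random, Statistics

Random.seed!(535)  # for reproducibility

# For each dimension d, draw n uniform points in [0,1]^d and record the
# distance from the center of the cube to its nearest sample point.
function nn_distance_to_center(d, n)
    X = rand(n, d)                   # n samples, d features
    c = fill(0.5, d)                 # center of the cube
    return minimum(norm(X[i, :] .- c) for i in 1:n)
end

n = 200  # same sample size as the Advertising data
for d in (1, 10, 100)
    println("d = $d: nearest sample at distance ",
            round(nn_distance_to_center(d, n), digits=3))
end
```

Even with the same number of samples, the nearest neighbor drifts far away as $d$ grows, which is why its $y_i$-value stops being informative.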

One way out is to make stronger assumptions on the function $f$. For instance, we can assume that the true relationship is affine, that is,

$$ Y = \beta_0 + \beta_1 X + \epsilon. $$

Or if we have $d$ predictors:

$$ Y = \beta_0 + \sum_{j=1}^d \beta_j X_j + \epsilon. $$

This special case is called multiple linear regression. How do we estimate appropriate intercept and coefficients? The standard approach is to minimize the sum of the squared errors

$$ \sum_{i=1}^n \left(y_i - \left\{\hat\beta_0 + \sum_{j=1}^d \hat\beta_j x_{i,j}\right\}\right)^2. $$

This is a least-squares problem. We re-write it in a more convenient matrix form and absorb the intercept $\beta_0$ into the coefficients $\beta_j$ by adding a dummy predictor to each sample. Let

$$ \mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad\quad A = \begin{pmatrix} 1 & \mathbf{x}_1^T \\ 1 & \mathbf{x}_2^T \\ \vdots & \vdots \\ 1 & \mathbf{x}_n^T \end{pmatrix} \quad\text{and}\quad \boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_d \end{pmatrix}. $$

Then the least-squares problem is

$$ \min_{\boldsymbol{\beta}} \|\mathbf{y} - A \boldsymbol{\beta}\|^2. $$

In words, we are looking for a linear combination of the columns of $A$ that is closest to $\mathbf{y}$ in $2$-norm.
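As a preview of the methods in this lecture, here is a minimal sketch of solving such a problem in Julia. The vectors `x` and `y` below are toy values standing in for a predictor and an outcome (they are not the Advertising data); the design matrix $A$ gets a column of ones for the intercept.

```julia
using LinearAlgebra

# Toy data standing in for (predictor, outcome); any equal-length vectors work.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = length(x)
A = hcat(ones(n), x)   # design matrix: dummy column of ones, then the predictor

# Solve min ||y - A*beta||^2. For a rectangular A, Julia's backslash
# operator uses a QR factorization, one of the methods in this lecture.
beta = A \ y

# Equivalently, via the normal equations A'A beta = A'y, solved by a
# Cholesky factorization (A'A is symmetric positive definite when A has
# full column rank):
beta_chol = cholesky(Symmetric(A' * A)) \ (A' * y)
```

Up to floating-point error, the two approaches return the same coefficient vector; the rest of the lecture examines when and why one is preferable to the other.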

One could solve this optimization problem through calculus (and we will come back to this important approach later in the course), but understanding the geometrical and algebraic structure of the problem turns out to provide useful insight into its solution. It will also be an opportunity to review some basic linear-algebraic concepts along the way.

We come back to the Advertising dataset in the accompanying notebook.