Linear regression

Course: Math 535 - Mathematical Methods in Data Science (MMiDS)
Author: Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison
Updated: Sep 14, 2020
Copyright: © 2020 Sebastien Roch

Recap from the lectures

The setup We have data of the form $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$ where the $\mathbf{x}_i$'s are predictors and the $y_i$'s are outcomes. We want to estimate the relationship between the outcome variable and the predictors. This is called regression analysis. In linear regression, we assume the relationship is linear

$$ y_i = \beta_0 + \sum_{j=1}^d \beta_j x_{i,j} + \epsilon_i $$

where $\epsilon_i$ is noise. We use the least-squares approach of finding coefficients that minimize

$$ \sum_{i=1}^n \left(y_i - \left\{\beta_0 + \sum_{j=1}^d \beta_j x_{i,j}\right\}\right)^2. $$

We re-write it in matrix form. Let

$$ \mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad\quad A = \begin{pmatrix} 1 & \mathbf{x}_1^T \\ 1 & \mathbf{x}_2^T \\ \vdots & \vdots \\ 1 & \mathbf{x}_n^T \end{pmatrix} \quad\text{and}\quad \boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_d \end{pmatrix}. $$

Then the problem is

$$ \min_{\boldsymbol{\beta}} \|\mathbf{y} - A \boldsymbol{\beta}\|^2. $$

The theory In words, we are looking for a linear combination of the columns of $A$ that is closest to $\mathbf{y}$ in $2$-norm.

Theorem (Gram-Schmidt): Let $\mathbf{a}_1,\ldots,\mathbf{a}_m$ be linearly independent. Then there exist an orthonormal basis $\mathbf{q}_1,\ldots,\mathbf{q}_m$ of $\mathrm{span}(\mathbf{a}_1,\ldots,\mathbf{a}_m)$.

In matrix form, let

$$ A = \begin{pmatrix} | & & | \\ \mathbf{a}_1 & \ldots & \mathbf{a}_m \\ | & & | \end{pmatrix} \quad \text{and} \quad Q = \begin{pmatrix} | & & | \\ \mathbf{q}_1 & \ldots & \mathbf{q}_m \\ | & & | \end{pmatrix}. $$

The output of the Gram-Schimdt algorithm above can then be written in the following compact form, known as a QR decomposition,

$$ A = QR $$

where column $i$ of the $m \times m$ matrix $R$ contains the coefficients of the linear combination of $\mathbf{q}_j$'s that produces $\mathbf{a}_i$. By the proof of Gram-Schmidt, $\mathbf{a}_i \in \mathrm{span}(\mathbf{q}_1,\ldots,\mathbf{q}_i)$. So column $i$ of $R$ has only zeros below the diagonal. Hence $R$ has a special structure: it is upper triangular.

Definition (Orthogonal projection): Let $U \subseteq V$ be a linear subspace with orthonormal basis $\mathbf{q}_1,\ldots,\mathbf{q}_m$. The orthogonal projection of $\mathbf{v} \in V$ on $U$ is defined as

$$ \mathcal{P}_U \mathbf{v} = \sum_{j=1}^m \langle \mathbf{v}, \mathbf{q}_j \rangle \,\mathbf{q}_j. $$


Theorem (Orthogonal Projection is Closest in Subspace): Let $U \subseteq V$ be a linear subspace with orthonormal basis $\mathbf{q}_1,\ldots,\mathbf{q}_m$ and let $\mathbf{v} \in V$. For any $\mathbf{u} \in U$

$$ \|\mathbf{v} - \mathcal{P}_U \mathbf{v}\| \leq \|\mathbf{v} - \mathbf{u}\|. $$

Furthermore, if $\mathbf{u} \in U$ and the inequality above is an equality, then $\mathbf{u} = \mathcal{P}_U \mathbf{v}$.

From this result we get:

Theorem (Least Squares via QR): Let $A \in \mathbb{R}^{n\times m}$ be an $n\times m$ matrix with linearly independent columns, let $\mathbf{b} \in \mathbb{R}^n$ be a vector, and let $A = QR$ be a QR decomposition of $A$. The solution to the least-squares problem

$$ \min_{\mathbf{x} \in \mathbb{R}^m} \|A \mathbf{x} - \mathbf{b}\|. $$


$$ R \mathbf{x}^* = Q^T \mathbf{b}. $$

The algorithm

In [2]:
function mmids_gramschmidt(A)
    n,m = size(A)
    Q = zeros(Float64,n,m)
    R = zeros(Float64,m,m)
    for j = 1:m
        v = A[:,j]
        for i = 1:j-1
            R[i,j] = dot(Q[:,i],A[:,j])
            v -= R[i,j]*Q[:,i]
        R[j,j] = norm(v)
        Q[:,j] = v/R[j,j]
    return Q,R
mmids_gramschmidt (generic function with 1 method)
In [3]:
function mmids_backsubs(U,b)
    m = length(b)
    x = zeros(Float64,m)
    for i=m:-1:1
        x[i] = (b[i] - dot(U[i,i+1:m],x[i+1:m]))/U[i,i] # assumes non-zero diagonal
    return x
mmids_backsubs (generic function with 1 method)
In [4]:
function ls_by_qr(A, b)
    Q,R = mmids_gramschmidt(A)
    return mmids_backsubs(R,Q'*b)
ls_by_qr (generic function with 1 method)

1 Predicting sales

The following dataset is from [ISLR] textbook. Quoting [ISLR, Section 2.1]:

Suppose that we are statistical consultants hired by a client to provide advice on how to improve sales of a particular product. The Advertising data set consists of the sales of that product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media: TV, radio, and newspaper. [...] It is not possible for our client to directly increase sales of the product. On the other hand, they can control the advertising expenditure in each of the three media. Therefore, if we determine that there is an association between advertising and sales, then we can instruct our client to adjust advertising budgets, thereby indirectly increasing sales. In other words, our goal is to develop an accurate model that can be used to predict sales on the basis of the three media budgets.

In [5]:
df = CSV.read("advertising.csv")

5 rows × 4 columns

In [6]:
TV, sales = df[:,:TV], df[:,:sales]
scatter(TV, sales, legend=false, alpha=0.5)

We form the matrix $A$ and use our least-squares code to solve for $\boldsymbol{\beta}$.

In [7]:
n = length(sales)
A = hcat(ones(n), TV)
coeff = ls_by_qr(A,sales)
2-element Array{Float64,1}:
In [8]:
TVgrid = LinRange(minimum(TV), maximum(TV), 100)
scatter(TV, sales, legend=false, alpha=0.5)
plot!(TVgrid, coeff[1].+coeff[2]*TVgrid, lw=2) # recall that array indices start at 1

A degree-two polynomial might be a better fit.

In [9]:
A = reduce(hcat, [ones(n), TV, TV.^2])
coeff = ls_by_qr(A, sales)
3-element Array{Float64,1}:
In [10]:
scatter(TV, sales, legend=false, alpha=0.5)
plot!(TVgrid, coeff[1] .+ coeff[2]*TVgrid .+ coeff[3]*TVgrid.^2, lw=2)

The fit looks slightly better than the linear one. This is not entirely surprising though given that the linear model is a subset of the quadratic one. But, in adding more parameters, one must worry about overfitting. To quote Wikipedia:

In statistics, overfitting is "the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably".[1] An overfitted model is a statistical model that contains more parameters than can be justified by the data.[2] The essence of overfitting is to have unknowingly extracted some of the residual variation (i.e. the noise) as if that variation represented underlying model structure.[3]

To illustrate, let's see what happens with a degree-$20$ polynomial fit.

In [11]:
deg = 20
A = reduce(hcat, [TV.^i for i=0:1:deg])
coeff = ls_by_qr(A,sales)
21-element Array{Float64,1}:
In [12]:
saleshat = sum([coeff[i+1]*TVgrid.^i for i=0:1:deg])
scatter(TV, sales, legend=false, alpha=0.5)
plot!(TVgrid, saleshat, lw=2)

We return to the linear case, but with the full set of predictors.

In [13]:
radio, newspaper = df[:,:radio], df[:,:newspaper];
In [14]:
p1 = scatter(radio, sales, legend=false, alpha=0.5, title="radio")
p2 = scatter(newspaper, sales, legend=false, alpha=0.5, title="newspaper")
plot(p1, p2, laytout=(2,1), size=(750,375))
In [15]:
A = reduce(hcat, [ones(n), TV, radio, newspaper])
coeff = ls_by_qr(A, sales)
4-element Array{Float64,1}:

Newspaper advertising (the last coefficient) seems to have a much weaker effect on sales per dollar spent. We will come back later in the course to assessing the statistical significance of such a conclusion.