TUTORIAL 1

Linear regression


Course: Math 535 - Mathematical Methods in Data Science (MMiDS)
Author: Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison
Updated: Sep 14, 2020
Copyright: © 2020 Sebastien Roch


Recap from the lectures

The setup

We have data of the form $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$ where the $\mathbf{x}_i$'s are predictors and the $y_i$'s are outcomes. We want to estimate the relationship between the outcome variable and the predictors. This is called regression analysis. In linear regression, we assume the relationship is linear:

$$ y_i = \beta_0 + \sum_{j=1}^d \beta_j x_{i,j} + \epsilon_i $$

where $\epsilon_i$ is noise. We use the least-squares approach of finding coefficients that minimize

$$ \sum_{i=1}^n \left(y_i - \left\{\beta_0 + \sum_{j=1}^d \beta_j x_{i,j}\right\}\right)^2. $$

We re-write it in matrix form. Let

$$ \mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad\quad A = \begin{pmatrix} 1 & \mathbf{x}_1^T \\ 1 & \mathbf{x}_2^T \\ \vdots & \vdots \\ 1 & \mathbf{x}_n^T \end{pmatrix} \quad\text{and}\quad \boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_d \end{pmatrix}. $$

Then the problem is

$$ \min_{\boldsymbol{\beta}} \|\mathbf{y} - A \boldsymbol{\beta}\|^2. $$
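
For concreteness, with $n = 3$ observations and a single predictor ($d = 1$), the matrix form reads

$$ \mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix}, \quad A = \begin{pmatrix} 1 & x_{1,1} \\ 1 & x_{2,1} \\ 1 & x_{3,1} \end{pmatrix}, \quad \boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}, $$

so that $\|\mathbf{y} - A \boldsymbol{\beta}\|^2 = \sum_{i=1}^3 (y_i - \beta_0 - \beta_1 x_{i,1})^2$, which matches the sum above.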

The theory

In words, we are looking for a linear combination of the columns of $A$ that is closest to $\mathbf{y}$ in $2$-norm.


Theorem (Gram-Schmidt): Let $\mathbf{a}_1,\ldots,\mathbf{a}_m$ be linearly independent. Then there exists an orthonormal basis $\mathbf{q}_1,\ldots,\mathbf{q}_m$ of $\mathrm{span}(\mathbf{a}_1,\ldots,\mathbf{a}_m)$.


In matrix form, let

$$ A = \begin{pmatrix} | & & | \\ \mathbf{a}_1 & \ldots & \mathbf{a}_m \\ | & & | \end{pmatrix} \quad \text{and} \quad Q = \begin{pmatrix} | & & | \\ \mathbf{q}_1 & \ldots & \mathbf{q}_m \\ | & & | \end{pmatrix}. $$

The output of the Gram-Schmidt algorithm can then be written in the following compact form, known as a QR decomposition,

$$ A = QR $$

where column $i$ of the $m \times m$ matrix $R$ contains the coefficients of the linear combination of $\mathbf{q}_j$'s that produces $\mathbf{a}_i$. By the proof of Gram-Schmidt, $\mathbf{a}_i \in \mathrm{span}(\mathbf{q}_1,\ldots,\mathbf{q}_i)$. So column $i$ of $R$ has only zeros below the diagonal. Hence $R$ has a special structure: it is upper triangular.
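
As a quick numerical illustration (added here, not part of the original notebook), these properties can be checked with Julia's built-in `qr` from `LinearAlgebra`; note that the built-in factorization may differ from the Gram-Schmidt output by sign flips in the columns of $Q$ and the corresponding rows of $R$.

using LinearAlgebra
A_ex = randn(5, 3)             # tall matrix; columns are (almost surely) independent
F = qr(A_ex)
Q_ex = Matrix(F.Q)             # thin 5×3 Q with orthonormal columns
R_ex = F.R                     # 3×3 upper-triangular factor
@show norm(A_ex - Q_ex*R_ex)   # ≈ 0, i.e. A_ex = QR
@show norm(Q_ex'Q_ex - I)      # ≈ 0, i.e. the columns of Q are orthonormal
@show istriu(R_ex)             # true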

Definition (Orthogonal projection): Let $U \subseteq V$ be a linear subspace with orthonormal basis $\mathbf{q}_1,\ldots,\mathbf{q}_m$. The orthogonal projection of $\mathbf{v} \in V$ on $U$ is defined as

$$ \mathcal{P}_U \mathbf{v} = \sum_{j=1}^m \langle \mathbf{v}, \mathbf{q}_j \rangle \,\mathbf{q}_j. $$

$\lhd$
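
As a minimal numerical example (added here, not in the original), take $U$ to be the $xy$-plane in $\mathbb{R}^3$ with orthonormal basis $\mathbf{q}_1 = (1,0,0)$ and $\mathbf{q}_2 = (0,1,0)$:

using LinearAlgebra
q1 = [1.0, 0.0, 0.0]
q2 = [0.0, 1.0, 0.0]                     # orthonormal basis of the xy-plane
v  = [2.0, -3.0, 5.0]
Pv = dot(v, q1)*q1 + dot(v, q2)*q2       # projection as defined above
@show Pv                                 # [2.0, -3.0, 0.0]
@show dot(v - Pv, q1), dot(v - Pv, q2)   # residual is orthogonal to U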


Theorem (Orthogonal Projection is Closest in Subspace): Let $U \subseteq V$ be a linear subspace with orthonormal basis $\mathbf{q}_1,\ldots,\mathbf{q}_m$ and let $\mathbf{v} \in V$. For any $\mathbf{u} \in U$

$$ \|\mathbf{v} - \mathcal{P}_U \mathbf{v}\| \leq \|\mathbf{v} - \mathbf{u}\|. $$

Furthermore, if $\mathbf{u} \in U$ and the inequality above is an equality, then $\mathbf{u} = \mathcal{P}_U \mathbf{v}$.


From this result we get:


Theorem (Least Squares via QR): Let $A \in \mathbb{R}^{n\times m}$ be an $n\times m$ matrix with linearly independent columns, let $\mathbf{b} \in \mathbb{R}^n$ be a vector, and let $A = QR$ be a QR decomposition of $A$. The solution to the least-squares problem

$$ \min_{\mathbf{x} \in \mathbb{R}^m} \|A \mathbf{x} - \mathbf{b}\| $$

satisfies

$$ R \mathbf{x}^* = Q^T \mathbf{b}. $$
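
To see where this comes from (a short derivation added here): the columns of $Q$ form an orthonormal basis of the column space of $A$, so by the previous theorem the minimizer satisfies $A \mathbf{x}^* = \mathcal{P}_{\mathrm{col}(A)} \mathbf{b} = Q Q^T \mathbf{b}$. Writing $A = QR$ and multiplying both sides on the left by $Q^T$ gives

$$ Q^T Q R \mathbf{x}^* = Q^T Q Q^T \mathbf{b} \quad \implies \quad R \mathbf{x}^* = Q^T \mathbf{b}, $$

using $Q^T Q = I_{m \times m}$. Since the columns of $A$ are linearly independent, the diagonal entries of $R$ are nonzero, and the triangular system can be solved by back substitution.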

The algorithm

In [1]:
# Julia version: 1.5.1
using CSV, DataFrames, Plots, LinearAlgebra, StatsKit
In [2]:
function mmids_gramschmidt(A)
    # Classical Gram-Schmidt: returns Q with orthonormal columns and
    # upper-triangular R such that A = QR (assumes independent columns)
    n,m = size(A)
    Q = zeros(Float64,n,m)
    R = zeros(Float64,m,m)
    for j = 1:m
        v = A[:,j]
        for i = 1:j-1
            R[i,j] = dot(Q[:,i],A[:,j]) # coefficient of a_j along q_i
            v -= R[i,j]*Q[:,i]          # remove the component along q_i
        end
        R[j,j] = norm(v)
        Q[:,j] = v/R[j,j]               # normalize what is left
    end
    return Q,R
end
Out[2]:
mmids_gramschmidt (generic function with 1 method)
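
As a quick sanity check (added here, not in the original notebook), we can apply `mmids_gramschmidt` to a small random matrix and confirm that it returns a valid QR factorization:

A_test = randn(6, 3)                 # random tall matrix
Q_test, R_test = mmids_gramschmidt(A_test)
@show norm(A_test - Q_test*R_test)   # ≈ 0
@show norm(Q_test'Q_test - I)        # ≈ 0: columns of Q_test are orthonormal
@show istriu(R_test)                 # true
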
In [3]:
function mmids_backsubs(U,b)
    # Solves the upper-triangular system U x = b by back substitution
    m = length(b)
    x = zeros(Float64,m)
    for i=m:-1:1
        x[i] = (b[i] - dot(U[i,i+1:m],x[i+1:m]))/U[i,i] # assumes non-zero diagonal
    end
    return x
end
Out[3]:
mmids_backsubs (generic function with 1 method)
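
For instance (an illustrative check, not in the original), solving a small upper-triangular system:

U_test = [2.0 1.0; 0.0 3.0]      # upper triangular with nonzero diagonal
b_test = [4.0, 6.0]
x_test = mmids_backsubs(U_test, b_test)
@show x_test                     # [1.0, 2.0]
@show U_test*x_test ≈ b_test     # true
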
In [4]:
function ls_by_qr(A, b)
    # Least squares via QR: solve R x = Q^T b by back substitution
    Q,R = mmids_gramschmidt(A)
    return mmids_backsubs(R,Q'*b)
end
Out[4]:
ls_by_qr (generic function with 1 method)
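
Before applying it to data, here is an optional sanity check (added here) comparing our solver to Julia's built-in least-squares solve `A \ b` on a small random problem; the two should agree up to floating-point error.

A_test = hcat(ones(10), randn(10, 2))              # intercept plus two random predictors
b_test = randn(10)
@show ls_by_qr(A_test, b_test) ≈ A_test \ b_test   # expected: true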

1 Predicting sales

The following dataset is from the [ISLR] textbook. Quoting [ISLR, Section 2.1]:

Suppose that we are statistical consultants hired by a client to provide advice on how to improve sales of a particular product. The Advertising data set consists of the sales of that product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media: TV, radio, and newspaper. [...] It is not possible for our client to directly increase sales of the product. On the other hand, they can control the advertising expenditure in each of the three media. Therefore, if we determine that there is an association between advertising and sales, then we can instruct our client to adjust advertising budgets, thereby indirectly increasing sales. In other words, our goal is to develop an accurate model that can be used to predict sales on the basis of the three media budgets.

In [5]:
df = CSV.read("advertising.csv")
first(df,5)
Out[5]:

5 rows × 4 columns

   TV       radio    newspaper  sales
   Float64  Float64  Float64    Float64
1  230.1    37.8     69.2       22.1
2  44.5     39.3     45.1       10.4
3  17.2     45.9     69.3       9.3
4  151.5    41.3     58.5       18.5
5  180.8    10.8     58.4       12.9

We will focus for now on the TV budget.

In [6]:
TV, sales = df[:,:TV], df[:,:sales]
scatter(TV, sales, legend=false, alpha=0.5)
Out[6]:

We form the matrix $A$ and use our least-squares code to solve for $\boldsymbol{\beta}$.

In [7]:
n = length(sales)
A = hcat(ones(n), TV)
coeff = ls_by_qr(A,sales)
Out[7]:
2-element Array{Float64,1}:
 7.032593549127698
 0.047536640433019715
In [8]:
TVgrid = LinRange(minimum(TV), maximum(TV), 100)
scatter(TV, sales, legend=false, alpha=0.5)
plot!(TVgrid, coeff[1].+coeff[2]*TVgrid, lw=2) # recall that array indices start at 1
Out[8]:

A degree-two polynomial might be a better fit.

In [9]:
A = reduce(hcat, [ones(n), TV, TV.^2])
coeff = ls_by_qr(A, sales)
Out[9]:
3-element Array{Float64,1}:
  6.114120128333173
  0.06726592695624341
 -6.846933732938574e-5
In [10]:
scatter(TV, sales, legend=false, alpha=0.5)
plot!(TVgrid, coeff[1] .+ coeff[2]*TVgrid .+ coeff[3]*TVgrid.^2, lw=2)
Out[10]:

The fit looks slightly better than the linear one. This is not entirely surprising, though, given that linear models are a special case of quadratic ones (take the quadratic coefficient to be zero), so the quadratic fit can only lower the training error. But in adding more parameters, one must worry about overfitting. To quote Wikipedia:

In statistics, overfitting is "the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably".[1] An overfitted model is a statistical model that contains more parameters than can be justified by the data.[2] The essence of overfitting is to have unknowingly extracted some of the residual variation (i.e. the noise) as if that variation represented underlying model structure.[3]

To illustrate, let's see what happens with a degree-$20$ polynomial fit.

In [11]:
deg = 20
A = reduce(hcat, [TV.^i for i=0:1:deg])
coeff = ls_by_qr(A,sales)
Out[11]:
21-element Array{Float64,1}:
 -3.5485182942335296
  1.9983587342791638
 -0.12293286246561463
  0.0035915780264108243
 -5.608819661563708e-5
  5.014719202946552e-7
 -2.5949492351362565e-9
  7.2410914236011946e-12
 -7.455713695819655e-15
 -9.418178511077042e-18
  2.4591311142030467e-20
 -6.316259228977537e-24
 -2.7466282663160562e-27
 -2.4036415725501296e-30
 -2.9082523810373025e-33
 -4.2946537585142373e-36
 -7.27662549565319e-39
 -1.3642164068864983e-41
 -2.7643672469104974e-44
 -5.957199891771067e-47
 -1.349448254017876e-49
In [12]:
saleshat = sum([coeff[i+1]*TVgrid.^i for i=0:1:deg])
scatter(TV, sales, legend=false, alpha=0.5)
plot!(TVgrid, saleshat, lw=2)
Out[12]:

We could use cross-validation to choose a suitable degree.
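
For example, here is one possible sketch of a validation-set approach (added here, not part of the original notebook; the 150/50 split, the random seed, and the range of degrees are arbitrary choices, and classical Gram-Schmidt can be inaccurate on ill-conditioned polynomial design matrices, so this is only a rough guide):

using Random, Statistics
Random.seed!(535)                              # arbitrary seed for reproducibility
perm = randperm(n)
train, test = perm[1:150], perm[151:end]       # hold out 50 of the 200 markets
for d in 1:5
    A_train = reduce(hcat, [TV[train].^i for i in 0:d])
    A_valid = reduce(hcat, [TV[test].^i for i in 0:d])
    c = ls_by_qr(A_train, sales[train])
    mse = mean((A_valid*c .- sales[test]).^2)  # error on held-out data
    println("degree $d: held-out MSE = $mse")
end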

We return to the linear case, but with the full set of predictors.

In [13]:
radio, newspaper = df[:,:radio], df[:,:newspaper];
In [14]:
p1 = scatter(radio, sales, legend=false, alpha=0.5, title="radio")
p2 = scatter(newspaper, sales, legend=false, alpha=0.5, title="newspaper")
plot(p1, p2, layout=(2,1), size=(750,375))
Out[14]:
In [15]:
A = reduce(hcat, [ones(n), TV, radio, newspaper])
coeff = ls_by_qr(A, sales)
Out[15]:
4-element Array{Float64,1}:
  2.9388893694594307
  0.04576464545539755
  0.18853001691820379
 -0.0010374930424762435

Newspaper advertising (the last coefficient) seems to have a much weaker effect on sales per dollar spent. We will come back later in the course to assessing the statistical significance of such a conclusion.

2 Analyzing flight delays [optional]

Suppose we are interested in gaining a better understanding of flight delays. Let's get some data from the DOT's Bureau of Transportation Statistics. The following dataset contains information about every flight out of Madison, WI, between December 1, 2018 and November 30, 2019 that was not cancelled or diverted. This information includes the date (date), day of week (day_of_week; 1 is Monday, 2 is Tuesday, and so on), carrier (carrier), flight number (fl_num), destination (dest), scheduled departure time (sch_dep; e.g. 20.5 is 8:30pm), scheduled arrival time (sch_arr), and arrival delay (arr_delay; e.g. -33.0 means the flight arrived 33 minutes early).

In [16]:
df = CSV.read("msn-flight-data-19.csv")
first(df, 5)
Out[16]:

5 rows × 8 columns

   date        day_of_week  carrier  fl_num  dest    sch_dep  sch_arr  arr_delay
   Date        Int64        String   Int64   String  Float64  Float64  Float64
1  2018-12-10  1            DL       949     DTW     13.6667  16.1     -33.0
2  2018-12-10  1            DL       1223    ATL     17.2667  20.4833  -16.0
3  2018-12-10  1            DL       1386    MSP     6.41667  7.78333  -21.0
4  2018-12-10  1            DL       1495    DTW     6.58333  9.1      -27.0
5  2018-12-10  1            DL       1547    MSP     17.3333  18.6333  72.0
In [17]:
describe(df)
Out[17]:

8 rows × 8 columns

   variable     mean     min         median   max         nunique  nmissing  eltype
   Symbol       Union…   Any         Union…   Any         Union…   Nothing   DataType
1  date                  2018-12-01           2019-11-30  365                Date
2  day_of_week  3.90154  1           4.0      7                              Int64
3  carrier               9E                   YX          11                 String
4  fl_num       3515.57  159         3651.0   6117                           Int64
5  dest                  ATL                  SLC         26                 String
6  sch_dep      12.7728  5.08333     13.3167  21.9                           Float64
7  sch_arr      14.9175  6.23333     15.7833  23.9833                        Float64
8  arr_delay    6.59901  -61.0       -8.0     1540.0                         Float64

Let's say we suspect that the average delay gets worse throughout the day. By using linear regression, we can try to quantify the relationship between scheduled departure times and arrival delays. Let's extract the relevant columns first.

In [18]:
sch_dep = df[:,:sch_dep];
arr_delay = df[:,:arr_delay];
n = nrow(df)
Out[18]:
13579

The average delay in minutes was:

In [19]:
mean(arr_delay)
Out[19]:
6.5990131821194495

Here is a histogram of arrival delays (truncated at 4 hours).

In [20]:
histogram(arr_delay, legend=false, nbins=300, xlims=(-60,240))
Out[20]:

To get a sense of the relationship between scheduled departure times and arrival delays, we use a scatter plot. (This is a rather large dataset. To avoid making the notebook too sluggish, we only plot a random sample of $1000$ flights.)

In [21]:
subsample = rand(1:n,1000);
scatter(sch_dep[subsample], arr_delay[subsample], 
    legend=false, alpha=0.2, ylims=(-60,240))
Out[21]:

Finally, we form the matrix $A$ and use our linear regression algorithm.

In [22]:
A = reduce(hcat, [ones(n), sch_dep])
beta = ls_by_qr(A, arr_delay)
Out[22]:
2-element Array{Float64,1}:
 -3.906848102051145
  0.8225205619598612

So this suggests that delays increase by roughly $0.8$ minutes per hour of scheduled departure time. Over a $17$-hour day, that adds up to about $14$ minutes by the end of the day.
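
To make the arithmetic explicit (an added check), the scheduled departure times in the data span roughly 5am to 10pm:

span = maximum(sch_dep) - minimum(sch_dep)   # ≈ 16.8 hours
@show span, beta[2]*span                     # slope × span ≈ 13.8 minutes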

In [23]:
x = LinRange(minimum(sch_dep), maximum(sch_dep), 100)
y = beta[1] .+ beta[2]*x;
scatter(sch_dep[subsample], arr_delay[subsample], 
    legend=false, alpha=0.2, ylims=(-60,240))
plot!(x, y, lw=2)
Out[23]: