*Course:* Math 535 - Mathematical Methods in Data Science (MMiDS)

*Author:* Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison

*Updated:* Oct 18, 2020

*Copyright:* © 2020 Sebastien Roch

In this notebook, we illustrate the use of gradient descent on binary classification by logistic regression.

**Classification.** Quoting Wikipedia, recall that classification is the following machine learning task:

In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. Examples are assigning a given email to the "spam" or "non-spam" class, and assigning a diagnosis to a given patient based on observed characteristics of the patient (sex, blood pressure, presence or absence of certain symptoms, etc.). Classification is an example of pattern recognition. In the terminology of machine learning, classification is considered an instance of supervised learning, i.e., learning where a training set of correctly identified observations is available.

The input data is of the form $\{(\boldsymbol{\alpha}_i, b_i) : i=1,\ldots, n\}$ where $\boldsymbol{\alpha}_i \in \mathbb{R}^d$ are the features and $b_i \in \{0,1\}$ is the label. As before we use a matrix representation: $A \in \mathbb{R}^{n \times d}$ has rows $\boldsymbol{\alpha}_i^T$, $i = 1,\ldots, n$ and $\mathbf{b} = (b_1, \ldots, b_n)^T \in \{0,1\}^n$.

**Logistic model.** We summarize the logistic regression approach. Our goal is to find a function of the features that approximates the probability of the label $1$. For this purpose, we model the log-odds (or logit function) of the probability of label $1$ as a linear function of the features

$$ \log \frac{p(\boldsymbol{\alpha}; \mathbf{x})}{1 - p(\boldsymbol{\alpha}; \mathbf{x})} = \boldsymbol{\alpha}^T \mathbf{x}, $$

where $\mathbf{x} \in \mathbb{R}^d$. Inverting this expression gives

$$ p(\boldsymbol{\alpha}; \mathbf{x}) = \sigma(\boldsymbol{\alpha}^T \mathbf{x}) $$

where the sigmoid function is

$$ \sigma(t) = \frac{1}{1 + e^{-t}} $$

for $t \in \mathbb{R}$.
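As a quick sanity check (not in the original text), the sigmoid and the log-odds are inverses of each other; the names `sigmoid` and `logit` below are ours, for illustration only:

```julia
# Sigmoid and its inverse, the log-odds (logit); illustrative helper names
sigmoid(t) = 1/(1 + exp(-t))
logit(p) = log(p/(1 - p))

p = sigmoid(1.5)    # probability corresponding to log-odds 1.5
println(logit(p))   # recovers 1.5 up to floating-point error
```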

We seek to maximize the likelihood of the data assuming the labels are independent given the features, which is given by

$$ \mathcal{L}(\mathbf{x}; A, \mathbf{b}) = \prod_{i=1}^n p(\boldsymbol{\alpha}_i; \mathbf{x})^{b_i} (1- p(\boldsymbol{\alpha}_i; \mathbf{x}))^{1-b_i} $$

Taking a logarithm, multiplying by $-1/n$ and substituting the sigmoid function, we want to minimize the cross-entropy loss

$$ \ell(\mathbf{x}; A, \mathbf{b}) = - \frac{1}{n} \sum_{i=1}^n b_i \log(\sigma(\boldsymbol{\alpha}_i^T \mathbf{x})) - \frac{1}{n} \sum_{i=1}^n (1-b_i) \log(1- \sigma(\boldsymbol{\alpha}_i^T \mathbf{x})). $$

That is, we solve

$$ \min_{\mathbf{x} \in \mathbb{R}^d} \ell(\mathbf{x}; A, \mathbf{b}). $$

**Gradient descent.** To use gradient descent, we need the gradient of $\ell$. We use the *Chain Rule* and first compute the derivative of $\sigma$, which is

$$ \sigma'(t) = \frac{e^{-t}}{(1 + e^{-t})^2} = \sigma(t)(1 - \sigma(t)). $$

The latter expression is known as the logistic differential equation. It arises in a variety of applications, including the modeling of population dynamics. Here it will be a convenient way to compute the gradient. Indeed, observe that by the *Chain Rule*

$$ \nabla_{\mathbf{x}}\, \sigma(\boldsymbol{\alpha}^T \mathbf{x}) = \sigma(\boldsymbol{\alpha}^T \mathbf{x}) (1 - \sigma(\boldsymbol{\alpha}^T \mathbf{x}))\, \boldsymbol{\alpha}, $$

where we use a subscript $\mathbf{x}$ to make it clear that the gradient is with respect to $\mathbf{x}$.
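A quick finite-difference check of the identity $\sigma'(t) = \sigma(t)(1-\sigma(t))$; this sketch is ours, not part of the original text:

```julia
# Verify σ'(t) = σ(t)(1 - σ(t)) by central finite differences
σ(t) = 1/(1 + exp(-t))
dσ(t) = σ(t)*(1 - σ(t))             # closed form derived above

t, h = 0.7, 1e-6
fd = (σ(t + h) - σ(t - h))/(2h)     # numerical derivative
println(abs(fd - dσ(t)))            # tiny: agreement up to O(h^2) error
```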

By another application of the *Chain Rule*, the gradient of the loss is

$$ \nabla_{\mathbf{x}}\, \ell(\mathbf{x}; A, \mathbf{b}) = - \frac{1}{n} \sum_{i=1}^n ( b_i - \sigma(\boldsymbol{\alpha}_i^T \mathbf{x}) )\, \boldsymbol{\alpha}_i. $$

To compute the Hessian, we note that
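The gradient formula can be checked numerically. This sketch (on made-up random data, for illustration only) compares it to a coordinate-wise finite-difference approximation of the loss:

```julia
using Random, LinearAlgebra
Random.seed!(0)

σ(t) = 1/(1 + exp(-t))
n, d = 20, 3
A = randn(n, d); b = rand(0:1, n); x = randn(d)

# cross-entropy loss and its gradient, as derived above
ℓ(x) = -sum(b .* log.(σ.(A*x)) .+ (1 .- b) .* log.(1 .- σ.(A*x)))/n
∇ℓ(x) = -A'*(b .- σ.(A*x))/n

# central finite differences, one coordinate at a time
h = 1e-6
fd = [begin
          e = zeros(d); e[j] = h
          (ℓ(x + e) - ℓ(x - e))/(2h)
      end for j = 1:d]
println(norm(fd - ∇ℓ(x)))   # small: formulas agree
```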

$$ \nabla_{\mathbf{x}} (\sigma(\boldsymbol{\alpha}^T \mathbf{x})\, \alpha_j) = \sigma(\boldsymbol{\alpha}^T \mathbf{x}) (1 - \sigma(\boldsymbol{\alpha}^T \mathbf{x}))\, \boldsymbol{\alpha}\,\alpha_j $$

so that

$$ \nabla_{\mathbf{x}} (\sigma(\boldsymbol{\alpha}^T \mathbf{x})\,\boldsymbol{\alpha}) = \sigma(\boldsymbol{\alpha}^T \mathbf{x}) (1 - \sigma(\boldsymbol{\alpha}^T \mathbf{x}))\, \boldsymbol{\alpha} \boldsymbol{\alpha}^T. $$

Thus

$$ \nabla^2_{\mathbf{x}} \,\ell(\mathbf{x}; A, \mathbf{b}) = \frac{1}{n} \sum_{i=1}^n \sigma(\boldsymbol{\alpha}_i^T \mathbf{x}) (1 - \sigma(\boldsymbol{\alpha}_i^T \mathbf{x}))\, \boldsymbol{\alpha}_i \boldsymbol{\alpha}_i^T, $$

where $\nabla^2_{\mathbf{x}}$ indicates the Hessian with respect to the $\mathbf{x}$ variables.

**Lemma (Convexity of logistic regression):** The function $\ell(\mathbf{x}; A, \mathbf{b})$ is convex as a function of $\mathbf{x} \in \mathbb{R}^d$.

*Proof:* Indeed, the Hessian is positive semidefinite: for any $\mathbf{z} \in \mathbb{R}^d$

$$ \mathbf{z}^T \,\nabla^2_{\mathbf{x}} \,\ell(\mathbf{x}; A, \mathbf{b})\, \mathbf{z} = \frac{1}{n} \sum_{i=1}^n \sigma(\boldsymbol{\alpha}_i^T \mathbf{x}) (1 - \sigma(\boldsymbol{\alpha}_i^T \mathbf{x})) (\boldsymbol{\alpha}_i^T \mathbf{z})^2 \geq 0, $$

since $\sigma(t) \in [0,1]$ for all $t$. $\square$
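The lemma can also be observed numerically: on made-up random data (for illustration only), the smallest eigenvalue of the Hessian is nonnegative up to round-off. Writing the Hessian in matrix form as $\frac{1}{n} A^T D A$ with $D$ the diagonal matrix of the $\sigma(\boldsymbol{\alpha}_i^T\mathbf{x})(1-\sigma(\boldsymbol{\alpha}_i^T\mathbf{x}))$:

```julia
using LinearAlgebra, Random
Random.seed!(0)

σ(t) = 1/(1 + exp(-t))
n, d = 30, 4
A = randn(n, d); x = randn(d)

s = σ.(A*x)
H = A' * Diagonal(s .* (1 .- s)) * A / n   # Hessian formula above
println(minimum(eigvals(Symmetric(H))))    # ≥ 0 up to round-off
```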

Convexity is one reason for working with the cross-entropy loss (rather than the mean squared error for instance).

**Update formula.** For step size $\beta$, one step of gradient descent is therefore

$$ \mathbf{x}^{k+1} = \mathbf{x}^{k} + \beta \frac{1}{n} \sum_{i=1}^n ( b_i - \sigma(\boldsymbol{\alpha}_i^T \mathbf{x}^k) ) \,\boldsymbol{\alpha}_i. $$

In stochastic gradient descent (SGD), a variant of gradient descent, we pick a sample $I$ uniformly at random in $\{1,\ldots,n\}$ and update as follows

$$ \mathbf{x}^{k+1} = \mathbf{x}^{k} +\beta \, ( b_I - \sigma(\boldsymbol{\alpha}_I^T \mathbf{x}^k) ) \, \boldsymbol{\alpha}_I. $$

For the mini-batch version of SGD, we pick a random sub-sample $\mathcal{B} \subseteq \{1,\ldots,n\}$ of size $B$ and update as follows

$$ \mathbf{x}^{k+1} = \mathbf{x}^{k} +\beta \frac{1}{B} \sum_{i\in \mathcal{B}} ( b_i - \sigma(\boldsymbol{\alpha}_i^T \mathbf{x}^k) ) \,\boldsymbol{\alpha}_i. $$

The key observation about the two stochastic updates above is that, in expectation, they perform a step of gradient descent. That turns out to be enough to make progress on the objective, and it has computational advantages. Many variants exist and we will encounter some of them in the next notebook.
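The key observation can be illustrated with a short Monte Carlo experiment (on made-up data, a sketch only): averaging many mini-batch gradients recovers the full gradient, reflecting that the stochastic update is unbiased.

```julia
using Random, Statistics, LinearAlgebra
Random.seed!(0)

σ(t) = 1/(1 + exp(-t))
n, d, B = 50, 3, 10
A = randn(n, d); b = rand(0:1, n); x = randn(d)

# gradient of the loss restricted to samples in I
grad(I) = -A[I,:]'*(b[I] .- σ.(A[I,:]*x))/length(I)

full = grad(1:n)                                   # full gradient
avg = mean([grad(rand(1:n, B)) for _ = 1:50_000])  # average over random mini-batches
println(norm(avg - full))                          # small: unbiased in expectation
```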

We start with a simple dataset from UC Berkeley's DS100 course. The file `lebron.csv` is available here. Quoting the course's textbook [DS100, Section 17.1]:

In basketball, players score by shooting a ball through a hoop. One such player, LeBron James, is widely considered one of the best basketball players ever for his incredible ability to score. LeBron plays in the National Basketball Association (NBA), the United States's premier basketball league. We've collected a dataset of all of LeBron's attempts in the 2017 NBA Playoff Games using the NBA statistics website (https://stats.nba.com/).

We first load the data and look at its summary.

In [1]:

```
# Julia version: 1.5.1
using CSV, DataFrames, Plots, LinearAlgebra, Statistics
```

In [2]:

```
df = CSV.read("lebron.csv", DataFrame) # newer CSV.jl versions require a sink argument
first(df,5)
```

Out[2]:

In [3]:

```
nrow(df)
```

Out[3]:

In [4]:

```
describe(df)
```

Out[4]:

The two columns we will be interested in are `shot_distance` (LeBron's distance from the basket when the shot was attempted (ft)) and `shot_made` (0 if the shot missed, 1 if the shot went in). As the summary table above indicates, the average distance was `10.6953` and the frequency of shots made was `0.565104`. We extract those two columns and display them on a scatter plot.

In [5]:

```
feature = df[:,:shot_distance]
label = df[:,:shot_made]
scatter(feature, label; legend=false, alpha=0.2)
```

Out[5]:

As you can see, this kind of data is hard to visualize because of the superposition of points with the same $x$ and $y$ values. One trick is to jitter the $y$'s a little bit by adding Gaussian noise. We do this next and plot again.

In [6]:

```
label_jitter = label .+ 0.05*randn(length(label))
scatter(feature, label_jitter; legend=false, alpha=0.2)
```

Out[6]:

To run gradient descent (GD), we first implement a function computing an update. It takes as input a function `grad_fn` computing the gradient itself.

In [7]:

```
function desc_update(grad_fn, A, b, curr_x, β)
    gradient = grad_fn(curr_x, A, b)
    return curr_x .- β*gradient
end
```

Out[7]:

We are ready to implement GD. Our function takes as input a function `loss_fn` computing the objective, a function `grad_fn` computing the gradient, and the dataset `A` and `b`. Optional parameters are an initial guess `init_x`, the step size, and the number of iterations. (To get $\beta$, try `\beta<tab>` where `<tab>` is the tab key.)

In [8]:

```
function mmids_gd(loss_fn, grad_fn, A, b, init_x; β=1e-3, niters=1e5)
    # current x and loss
    curr_x, curr_l = init_x, loss_fn(init_x, A, b)
    # until the maximum number of iterations
    for iter = 1:niters
        curr_x = desc_update(grad_fn, A, b, curr_x, β) # gradient step
    end
    return curr_x
end
```

Out[8]:

We apply GD to logistic regression. We first construct the data matrices $A$ and $\mathbf{b}$. To allow an affine function of the features, we add a column of $1$'s as we have done before.

In [9]:

```
A = [ones(length(label)) feature]
b = label;
```

To implement `loss_fn` and `grad_fn`, we define the sigmoid first.

In [10]:

```
sigmoid = z -> 1/(1+exp(-z))
pred_fn = (x, A) -> sigmoid.(A*x)
loss_fn = (x, A, b) -> mean(-b.*log.(pred_fn(x, A))
                            .-(1 .- b).*log.(1 .- pred_fn(x, A)))
```

Out[10]:

In [11]:

```
grad_fn = (x, A, b) -> -A'*(b.-pred_fn(x, A))/length(b)
```

Out[11]:

We run GD starting from $(0,0)$ with a step size of $0.01$.

In [12]:

```
init_x = zeros(size(A,2))
@time best_x = mmids_gd(loss_fn, grad_fn, A, b, init_x;
β=0.01, niters=10000)
```

Out[12]:

Finally we plot the results.

In [13]:

```
grid = LinRange(minimum(feature), maximum(feature), 100)
feature_grid = [ones(length(grid)) grid]
predict_grid = sigmoid.(feature_grid*best_x)
scatter(feature, label_jitter, legend=false, alpha=0.2)
plot!(grid,predict_grid;lw=2)
```

Out[13]:

We analyze a dataset from [ESL], which can be downloaded here. Quoting [ESL, Section 4.4.2]:

The data [...] are a subset of the Coronary Risk-Factor Study (CORIS) baseline survey, carried out in three rural areas of the Western Cape, South Africa (Rousseauw et al., 1983). The aim of the study was to establish the intensity of ischemic heart disease risk factors in that high-incidence region. The data represent white males between 15 and 64, and the response variable is the presence or absence of myocardial infarction (MI) at the time of the survey (the overall prevalence of MI was 5.1% in this region). There are 160 cases in our data set, and a sample of 302 controls. These data are described in more detail in Hastie and Tibshirani (1987).

We load the data, which we slightly reformatted, and look at a summary.

In [14]:

```
df = CSV.read("SAHeart.csv", DataFrame) # newer CSV.jl versions require a sink argument
first(df,5)
```

Out[14]:

In [15]:

```
nrow(df)
```

Out[15]:

In [16]:

```
describe(df)
```

Out[16]:

Our goal is to predict `chd`, which stands for coronary heart disease, based on the other variables (which are briefly described here). We use logistic regression again.

We first construct the data matrices. We only use three of the predictors; with all of them, convergence is quite slow. Try it for yourself!

In [17]:

```
feature = Matrix(df[:,[2,3,8]]);
```

In [18]:

```
label = Vector(df[:,end]);
```

In [19]:

```
A = [ones(length(label)) feature]
b = label;
```

We re-use the functions `loss_fn` and `grad_fn`, which were written for general logistic regression problems.

In [20]:

```
init_x = zeros(size(A,2))
@time best_x = mmids_gd(loss_fn, grad_fn, A, b, init_x)
```

Out[20]:

To get a sense of how accurate the result is, we compare our predictions to the true labels. By prediction, we mean that we predict label $1$ whenever $\sigma(\boldsymbol{\alpha}^T \mathbf{x}) > 1/2$. We try this on the training set. (A better approach would be to split the data into training and testing sets, but we will not do this here.)

In [21]:

```
function logis_acc(x, A, b)
    return sum((pred_fn(x, A) .> 0.5) .== b)/length(b)
end
```

Out[21]:

In [22]:

```
logis_acc(best_x, A, b)
```

Out[22]:

We also try mini-batch stochastic gradient descent (SGD). The only modification needed to the code is to pick a random mini-batch.

In [23]:

```
function mmids_sgd(loss_fn, grad_fn, A, b, init_x;
                   β=1e-3, niters=1e5, batch=40)
    # current x and loss
    curr_x, curr_l = init_x, loss_fn(init_x, A, b)
    nsamples = length(b) # number of samples
    # until the maximum number of iterations
    for iter = 1:niters
        I = rand(1:nsamples, batch) # random mini-batch
        curr_x = desc_update(grad_fn, A[I,:], b[I], curr_x, β) # gradient step
    end
    return curr_x
end
```

Out[23]:

In [24]:

```
init_x = zeros(size(A,2))
@time best_x = mmids_sgd(loss_fn, grad_fn, A, b, init_x)
```

Out[24]:

In [25]:

```
logis_acc(best_x, A, b)
```

Out[25]: