*Course:* Math 535 - Mathematical Methods in Data Science (MMiDS)

*Author:* Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison

*Updated:* August 28, 2020

*Copyright:* © 2020 Sebastien Roch

Here is the classical iris dataset, first analyzed by Fisher. We will load the data in the form of a `DataFrame` -- similar to a spreadsheet -- where the columns are different measurements (or features) and the rows are different samples. Below, we show the first $5$ lines of the dataset.

In [1]:

```
# Julia version: 1.5.1
using CSV, DataFrames, Statistics, Plots, LinearAlgebra, StatsPlots
```

In [2]:

```
df = CSV.read("iris-measurements.csv", DataFrame)
first(df, 5)
```

Out[2]:

There are $150$ samples.

In [3]:

```
nrow(df)
```

Out[3]:

Here is a summary of the data.

In [4]:

```
describe(df)
```

Out[4]:

Let's first extract the columns into vectors, combine them into a matrix, and visualize the petal data. Below, each point is a sample. This is called a scatter plot.

In [5]:

```
X = reduce(hcat,
    [df[:, :PetalLengthCm], df[:, :PetalWidthCm],
     df[:, :SepalLengthCm], df[:, :SepalWidthCm]]);
```

In [6]:

```
scatter(X[:, 1], X[:, 2],
    legend=false, xlabel="PetalLength", ylabel="PetalWidth")
```

Out[6]:

Observe a clear cluster of samples on the bottom left. This may be an indication that these samples come from a separate species. What is a cluster? Intuitively, it is a group of samples that are close to each other, but far from every other sample.
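
To make this intuition concrete, here is a small numerical check, added for illustration on synthetic two-dimensional data rather than the iris measurements (the names `blob` and `pairdist` are ours): samples drawn around two distant centers are much closer, on average, to members of their own group than to the other group.

```julia
using LinearAlgebra, Statistics, Random

Random.seed!(42)

# n samples scattered around a 2-d `center` (synthetic data, for illustration)
blob(center, n) = center' .+ 0.2 * randn(n, 2)
A, B = blob([0.0, 0.0], 50), blob([3.0, 3.0], 50)

# mean pairwise Euclidean distance between rows of P and rows of Q
pairdist(P, Q) = mean(norm(P[i, :] - Q[j, :])
                      for i in 1:size(P, 1), j in 1:size(Q, 1))

within, between = pairdist(A, A), pairdist(A, B)
# distances within a group are much smaller than distances across groups
```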

Now let's look at the full dataset. Visualizing the full $4$-dimensional data is not so straightforward. One way to do this is to consider all pairwise scatter plots.

In [7]:

```
cornerplot(X,
    label=["PetalLength", "PetalWidth", "SepalLength", "SepalWidth"],
    size=(500,500),
    markersize=2)
```

Out[7]:

We would like a method that automatically identifies clusters -- whatever the dimension of the data. We will discuss a standard way to do this: $k$-means clustering.
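
As a preview (not part of the original notebook), here is a bare-bones sketch of the standard Lloyd iteration behind $k$-means, run on synthetic blobs rather than the iris data; the function name `kmeans_sketch` is ours, not a library routine.

```julia
using LinearAlgebra, Statistics, Random

# Bare-bones Lloyd iteration for k-means: alternate between assigning each
# sample to its nearest center and recomputing each center as the mean of
# its assigned samples. X is n x d; returns assignments and centers.
function kmeans_sketch(X, k; iters=20)
    n = size(X, 1)
    centers = X[randperm(n)[1:k], :]    # initialize at k random samples
    assign = zeros(Int, n)
    for _ in 1:iters
        for i in 1:n                    # assignment step
            assign[i] = argmin([norm(X[i, :] - centers[c, :]) for c in 1:k])
        end
        for c in 1:k                    # update step (skip empty clusters)
            members = X[assign .== c, :]
            size(members, 1) > 0 && (centers[c, :] = vec(mean(members, dims=1)))
        end
    end
    assign, centers
end

Random.seed!(1)
Y = vcat(randn(50, 2), randn(50, 2) .+ 5.0)   # two well-separated blobs
assign, centers = kmeans_sketch(Y, 2)
```

A proper treatment -- including what objective this iteration minimizes and when it can fail -- is the subject of the lectures.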

This topic has two main goals:

- To review basic facts about Euclidean geometry, vector calculus, probability, and matrix algebra.
- To introduce a first data science problem and highlight some relevant, surprising phenomena arising in high-dimensional space.
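
As a quick taste of one such phenomenon (a simulation added here for illustration; see [Carp] for a fuller discussion): the Euclidean norm of a standard Gaussian vector in $\mathbb{R}^d$ concentrates around $\sqrt{d}$, with relative fluctuations that shrink as $d$ grows.

```julia
using LinearAlgebra, Statistics, Random

# Norms of standard Gaussian vectors in R^d concentrate around sqrt(d):
# the relative spread of the norm shrinks as the dimension d grows.
Random.seed!(0)
for d in (2, 100, 10_000)
    norms = [norm(randn(d)) for _ in 1:1000]
    println("d = $d: mean ≈ $(round(mean(norms), digits=2)), ",
            "relative std ≈ $(round(std(norms) / mean(norms), digits=3))")
end
```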

We will come back to the iris dataset in an accompanying tutorial notebook.

You may want to review basic linear algebra and probability. In particular, take a look at

- Chapter 1 in [Sol] and Sections 1.2.1-1.2.4 in [Bis]

where, throughout this course, we will refer to the following textbooks available online:

- [Sol] Solomon, Numerical Algorithms: Methods for Computer Vision, Machine Learning, and Graphics, CRC, 2015
- [Bis] Bishop, Pattern Recognition and Machine Learning, Springer, 2006

Material for this lecture is covered partly in:

- Sections 1.4, 9.1 and 11.1.1 of [Bis]

Parts of this topic's notebooks are based on the following references.

[Carp] B. Carpenter, Typical Sets and the Curse of Dimensionality.

[BHK] A. Blum, J. Hopcroft, R. Kannan, Foundations of Data Science, Cambridge University Press, 2020.