TOPIC 0

Introduction

First data science example: species delimitation


Course: Math 535 - Mathematical Methods in Data Science (MMiDS)
Author: Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison
Updated: August 28, 2020
Copyright: © 2020 Sebastien Roch


Imagine that you are an evolutionary biologist studying irises and that you have collected measurements on a large number of iris samples. Your goal is to identify different species within this collection.

Iris measurements (Source)

Here is a classical iris dataset first analyzed by Fisher. We will load the data in the form of a DataFrame -- similar to a spreadsheet -- where the columns are different measurements (or features) and the rows are different samples. Below, we show the first $5$ rows of the dataset.

In [1]:
# Julia version: 1.5.1
using CSV, DataFrames, Statistics, Plots, LinearAlgebra, StatsPlots
In [2]:
# Load the CSV file into a DataFrame (recent CSV.jl versions require the sink argument)
df = CSV.read("iris-measurements.csv", DataFrame)
first(df, 5)
Out[2]:

5 rows × 5 columns

     Id     PetalLengthCm  PetalWidthCm  SepalLengthCm  SepalWidthCm
     Int64  Float64        Float64       Float64        Float64
1    1      1.4            0.2           5.1            3.5
2    2      1.4            0.2           4.9            3.0
3    3      1.3            0.2           4.7            3.2
4    4      1.5            0.2           4.6            3.1
5    5      1.4            0.2           5.0            3.6

There are $150$ samples.

In [3]:
nrow(df)
Out[3]:
150

Here is a summary of the data.

In [4]:
describe(df)
Out[4]:

5 rows × 8 columns

     variable       mean     min   median   max   nunique  nmissing  eltype
     Symbol         Float64  Real  Float64  Real  Nothing  Nothing   DataType
1    Id             75.5     1     75.5     150                      Int64
2    PetalLengthCm  3.75867  1.0   4.35     6.9                      Float64
3    PetalWidthCm   1.19867  0.1   1.3      2.5                      Float64
4    SepalLengthCm  5.84333  4.3   5.8      7.9                      Float64
5    SepalWidthCm   3.054    2.0   3.0      4.4                      Float64

Let's first extract the four measurement columns into vectors and combine them into a matrix with one row per sample. We then visualize the petal data by plotting petal width against petal length. Below, each point is a sample. This is called a scatter plot.

In [5]:
X = reduce(hcat, 
        [df[:,:PetalLengthCm], df[:,:PetalWidthCm], 
        df[:,:SepalLengthCm], df[:,:SepalWidthCm]]);
In [6]:
scatter(X[:,1], X[:,2], 
    legend=false, xlabel="PetalLength", ylabel="PetalWidth")
Out[6]:

Observe a clear cluster of samples on the bottom left. This may be an indication that these samples come from a separate species. What is a cluster? Intuitively, it is a group of samples that are close to each other, but far from every other sample.
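To make the intuition concrete, here is a minimal sketch that compares distances within a group to distances between groups. It assumes Euclidean distance as the notion of "close". The points in `A` and `B` are made-up toy values (not rows of the iris dataset), and the helper names `dist`, `max_within` and `min_between` are ours, chosen for illustration only.

```julia
using LinearAlgebra

# Toy 2-D points (hypothetical values, NOT taken from the iris data).
# Each row is a sample: three points in one group, three in another.
A = [1.4 0.2; 1.5 0.2; 1.3 0.3]
B = [5.0 1.7; 5.1 1.9; 4.8 1.8]

# Euclidean distance between two vectors.
dist(u, v) = norm(u - v)

# Largest distance between two samples (rows) of the same group.
function max_within(P)
    n = size(P, 1)
    return maximum(dist(P[i, :], P[j, :]) for i in 1:n for j in 1:n)
end

# Smallest distance between a sample of P and a sample of Q.
function min_between(P, Q)
    return minimum(dist(P[i, :], Q[j, :])
                   for i in 1:size(P, 1) for j in 1:size(Q, 1))
end

within_A = max_within(A)
within_B = max_within(B)
between_AB = min_between(A, B)
```

For these toy points, both `within_A` and `within_B` are much smaller than `between_AB`, which is the situation the informal definition describes. Real data is rarely this clean, so this comparison is only an illustration of the idea, not a test for clusters.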

Now let's look at the full dataset. Visualizing the full $4$-dimensional data is not so straightforward. One way to do this is to consider all pairwise scatter plots.

In [7]:
cornerplot(X, 
    label=["PetalLength", "PetalWidth", "SepalLength", "SepalWidth"], 
    size=(500,500),
    markersize=2)
Out[7]:

We would like a method that automatically identifies clusters -- whatever the dimension of the data. We will discuss a standard way to do this: $k$-means clustering.
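As a preview, here is a rough sketch of one common heuristic for $k$-means, often called Lloyd's algorithm: alternate between assigning each sample to its nearest center and moving each center to the mean of its assigned samples. This is an assumption about the method, not necessarily the version developed later in the course. The function names (`assign_clusters`, `update_centers`, `kmeans_sketch`), the toy matrix `X_toy` and the choice of initial centers `C0` are all hypothetical, introduced only for this illustration.

```julia
using LinearAlgebra, Statistics

# Assign each row of X to its nearest center (a row of C) in Euclidean distance.
# Returns a vector of cluster labels in 1:k.
function assign_clusters(X, C)
    n, k = size(X, 1), size(C, 1)
    return [argmin([norm(X[i, :] - C[j, :]) for j in 1:k]) for i in 1:n]
end

# Move each center to the mean of the rows currently assigned to it.
# A center with no assigned rows is left where it was.
function update_centers(X, labels, C)
    Cnew = copy(C)
    for j in 1:size(C, 1)
        members = findall(==(j), labels)
        if !isempty(members)
            Cnew[j, :] = vec(mean(X[members, :], dims=1))
        end
    end
    return Cnew
end

# Alternate the two steps for a fixed number of iterations.
function kmeans_sketch(X, C0; maxiter=10)
    C = float.(C0)
    labels = assign_clusters(X, C)
    for _ in 1:maxiter
        C = update_centers(X, labels, C)
        labels = assign_clusters(X, C)
    end
    return labels, C
end

# Toy data (hypothetical values, NOT the iris dataset): one sample per row.
X_toy = [1.4 0.2; 1.5 0.2; 1.3 0.3; 5.0 1.7; 5.1 1.9; 4.8 1.8]
C0 = X_toy[[1, 4], :]          # assumed initialization: two of the samples
labels, C = kmeans_sketch(X_toy, C0)
```

Nothing in this sketch depends on the number of columns, so the same code applies to the $4$-dimensional iris matrix as well as to $2$-dimensional toy data. The result of this kind of heuristic generally depends on the initial centers, so the initialization used here should be read as one arbitrary choice.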

This topic has two main goals:

  1. To review basic facts about Euclidean geometry, vector calculus, probability, and matrix algebra.
  2. To introduce a first data science problem and highlight some relevant, surprising phenomena arising in high-dimensional space.

We will come back to the iris dataset in an accompanying tutorial notebook.

Optional reading

You may want to review basic linear algebra and probability. In particular, take a look at

  • Chapter 1 in [Sol] and Sections 1.2.1-1.2.4 in [Bis]

where, throughout this course, we will refer to the following textbooks available online:

Material for this lecture is covered partly in:

  • Sections 1.4, 9.1 and 11.1.1 of [Bis]