Individuals - objects in a data set
A variable - characteristic of an individual
Example: The students in this class would be individuals. We could study their height, weight, or favorite music. Those would be variables.
The distribution of a variable - the values it takes and how often.
Example: The prof makes a list of student grades and how often each grade occurs, i.e. a distribution.
When examining a distribution, we look for the overall pattern, which is described by
- shape
- center
- spread
and for striking deviations
- e.g. outliers
A distribution can be:
- symmetric
- skewed to the left
- skewed to the right
Look at p. 185, 186. Here data is represented via histograms.
One way of representing data is via a stemplot (see p. 187).
Example: The Major League Baseball career and single-season home run records are held by Barry Bonds of the San Francisco Giants. Here are the home run totals from 1986 (his first year) through 2007: 16, 25, 24, 19, 33, 25, 34, 46, 37, 33, 42, 40, 37, 34, 49, 73, 46, 45, 45, 5, 26, 28.
Make a stemplot of the data. Are there outliers?
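One way to check a hand-drawn stemplot is with a short script. The sketch below assumes the tens digit is the stem and the units digit is the leaf; the function name stemplot is only an illustrative choice.

```python
from collections import defaultdict

# Home run totals, 1986 through 2007, as listed in the example above.
home_runs = [16, 25, 24, 19, 33, 25, 34, 46, 37, 33, 42,
             40, 37, 34, 49, 73, 46, 45, 45, 5, 26, 28]

def stemplot(values):
    """Return stemplot rows: tens digit as the stem, units digit as the leaf."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    rows = []
    # Include every stem from smallest to largest so that gaps stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        row = f"{stem} | " + "".join(str(d) for d in leaves.get(stem, []))
        rows.append(row.rstrip())
    return rows

for row in stemplot(home_runs):
    print(row)
```

In the printed plot the stems 5 and 6 are empty, so the 73 sits well apart from the other seasons, which is what one looks for when asking about outliers.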
The mean m of a distribution x1, x2, ..., xn is
m = (x1 + x2 + ... + xn)/n
The median M of a distribution is its midpoint, i.e. the number such that half of the observations are smaller and the other half bigger than it.
- sort the observations in increasing order
- if n is odd, M is the ((n + 1)/2)th observation from the bottom of the list.
- if n is even, M is the average of the middle two observations.
Example: Bill Gates goes into a bar...
In the Barry Bonds home run example:
Find his career mean and median number of home runs. How do these numbers change when you drop 73? What general fact about the mean and median does your result illustrate?
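A minimal sketch of this computation, following the rules for the mean and the median given in these notes. The helper names mean, median and without_73 are illustrative assumptions, not part of the course material.

```python
# Home run totals, 1986 through 2007, as listed in the example.
home_runs = [16, 25, 24, 19, 33, 25, 34, 46, 37, 33, 42,
             40, 37, 34, 49, 73, 46, 45, 45, 5, 26, 28]

def mean(xs):
    """Sum of the observations divided by their number."""
    return sum(xs) / len(xs)

def median(xs):
    """Midpoint of the sorted observations."""
    ordered = sorted(xs)
    n = len(ordered)
    if n % 2 == 1:
        # n odd: the ((n + 1)/2)th observation, counted from 1
        return ordered[(n + 1) // 2 - 1]
    # n even: average of the two middle observations
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

# The same data with the 73 home run season dropped.
without_73 = [x for x in home_runs if x != 73]

print(mean(home_runs), median(home_runs))
print(mean(without_73), median(without_73))
```

Dropping 73 pulls the mean down from about 34.6 to about 32.8, while the median stays at 34: a single extreme observation moves the mean but not the median.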
Describing Spread:
Example: two neighborhoods with median house price $193,000 can be very different. One has mansions and modest homes and the other one has little variation among the homes.
Compute quartiles:
- sort the observations in increasing order
- the first quartile Q1 is the median of the values smaller than the median M.
- the third quartile Q3 is the median of the values larger than the median M.
Data can be represented by the Five Number Summary:
Minimum Q1 M Q3 Maximum
A boxplot is a graph of the five number summary:
- central box spans the quartiles Q1 and Q3
- line in the box marks median M
- lines extend from the box out to smallest and largest observations
See p. 194.
In the Barry Bonds home run example:
Find the first and third quartiles and the five number summary.
Draw a box plot for Barry Bonds's home run data.
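The five number summary can be sketched in code as follows. This assumes the quartile rule from these notes, with "smaller" and "larger than the median" read by position in the sorted list (the median itself is left out when n is odd). Statistical software often uses other quartile conventions and may give slightly different values. The function names are illustrative, and the sketch does not draw the box plot itself.

```python
# Home run totals, 1986 through 2007, as listed in the example.
home_runs = [16, 25, 24, 19, 33, 25, 34, 46, 37, 33, 42,
             40, 37, 34, 49, 73, 46, 45, 45, 5, 26, 28]

def median_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    if n % 2 == 1:
        return ordered[n // 2]
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

def five_number_summary(xs):
    """Return (minimum, Q1, M, Q3, maximum)."""
    ordered = sorted(xs)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # positions below the median
    upper = ordered[half + n % 2:]    # positions above the median
    return (ordered[0], median_sorted(lower), median_sorted(ordered),
            median_sorted(upper), ordered[-1])

print(five_number_summary(home_runs))
```

The five numbers returned are exactly what a box plot displays: the box runs from Q1 to Q3, the line inside it is at M, and the outer lines reach to the minimum and the maximum.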
The standard deviation s and the variance s^2 measure spread by looking at deviations of observations from the mean.
s^2 = [(m - x1)^2 + (m - x2)^2 + ... + (m - xn)^2] / (n - 1)
The standard deviation s is the square root of the variance.
In the Barry Bonds home run example:
Find the variance and the standard deviation.
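A short sketch of the variance formula above, dividing by n - 1 as in these notes. The function name variance is an illustrative choice.

```python
import math

# Home run totals, 1986 through 2007, as listed in the example.
home_runs = [16, 25, 24, 19, 33, 25, 34, 46, 37, 33, 42,
             40, 37, 34, 49, 73, 46, 45, 45, 5, 26, 28]

def variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    m = sum(xs) / len(xs)
    return sum((m - x) ** 2 for x in xs) / (len(xs) - 1)

s2 = variance(home_runs)   # variance s^2
s = math.sqrt(s2)          # standard deviation s

print(round(s2, 1), round(s, 2))
```

For this data the variance comes out to roughly 197.1 and the standard deviation to roughly 14.04 home runs.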
Choosing a summary:
- symmetric distribution: mean m and standard deviation s
- skewed distribution: five-number summary.
A normal distribution:
- curve is symmetric and bell shaped.
- total area under the curve is 1
- the area under the curve above an interval (a,b) = probability that an outcome lies in (a,b).
- center of the distribution = mean
- spread = standard deviation = distance from the mean to the place on the curve where there is a change of curvature
(see page 169)
Quartiles of Normal Distributions: (p. 170)
- Q1 is .67 standard deviations below the mean
- Q3 is .67 standard deviations above the mean
The 68-95-99.7 Rule: for any normal distribution
- 68% of the observations fall within 1 standard deviation of the mean
- 95% of the observations fall within 2 standard deviations of the mean
- 99.7% of the observations fall within 3 standard deviations of the mean
(see pages 172, 173)
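The quartile positions and the 68-95-99.7 percentages can be checked numerically. The sketch below uses NormalDist from Python's standard statistics module (Python 3.8 or later) on the standard normal distribution; the helper name within is an illustrative assumption.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
z = NormalDist()

# Quartiles, measured in standard deviations from the mean.
q1 = z.inv_cdf(0.25)
q3 = z.inv_cdf(0.75)

def within(k):
    """Area under the curve within k standard deviations of the mean."""
    return z.cdf(k) - z.cdf(-k)

print(round(q1, 2), round(q3, 2))
for k in (1, 2, 3):
    print(k, round(100 * within(k), 1))
```

The rule's percentages are rounded values: the computed areas are close to, but not exactly, 68%, 95% and 99.7%.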
Example: Scores on the three-section SAT Reasoning college entrance test for the class of 2007 were roughly normal with mean 1511 and standard deviation 194.
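The example does not state a question. One natural use of these numbers, assumed here, is to apply the 68-95-99.7 rule and find the score ranges it predicts. The function name rule_intervals is hypothetical.

```python
# Mean and standard deviation of the SAT scores from the example.
mean_score = 1511
sd_score = 194

def rule_intervals(mean, sd):
    """Intervals of 1, 2 and 3 standard deviations around the mean,
    keyed by the percentage the 68-95-99.7 rule assigns to each."""
    return {pct: (mean - k * sd, mean + k * sd)
            for pct, k in ((68, 1), (95, 2), (99.7, 3))}

intervals = rule_intervals(mean_score, sd_score)
for pct, (low, high) in intervals.items():
    print(f"about {pct}% of scores between {low} and {high}")
```

Under the rule, about 68% of scores fall between 1317 and 1705, and about 95% between 1123 and 1899.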
Example: The concentration of active ingredient in capsules of a prescription pain killer varies according to a normal distribution with mean 10% and standard deviation 0.2%.