Descriptive Statistics and Sampling Distributions

Data Descriptions

In a sense, the underlying reason for statistical analysis is to reach an understanding of the data. Studies and experiments give rise to statistical units. These units are typically described with variables (and measurements). Variables are either qualitative (categorical) or quantitative (numerical). Categorical variables take values (levels) from a finite set of categories (or classes). Numerical variables take values from a (potentially infinite) set of quantities.

Numerical Summaries

As a first pass, a variable can be described along two dimensions: centrality and spread (skew and kurtosis are also sometimes used).

  • Centrality measures: (sample) median, (sample) mean, (mode, less frequent).
  • Spread (or dispersion) measures: standard deviation (sd), quartiles, inter-quartile range (IQR), range (less frequent).

The median, range and the quartiles are easily calculated from an ordered list of the data.

Sample Median

The median $\mathrm{med}(x_1, \ldots, x_n)$ of a sample of size $n$ is a numerical value which splits the ordered data into 2 equal subsets: half the observations are below the median, and half above it.

  • If $n$ is odd, then the position of the median in the ordered data is $(n+1)/2$.
  • If $n$ is even, then the median is the average of the ordered observations at positions $n/2$ and $n/2 + 1$.
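The odd/even rule above can be sketched in a few lines of Python. The function name `sample_median` is an illustrative choice, not something taken from the notes.

```python
def sample_median(xs):
    """Median of a non-empty sample, following the odd/even rule."""
    ys = sorted(xs)               # ordered observations y_1 <= ... <= y_n
    n = len(ys)
    if n == 0:
        raise ValueError("sample must be non-empty")
    if n % 2 == 1:
        # odd n: observation at 1-based position (n + 1) / 2
        return ys[(n + 1) // 2 - 1]
    # even n: average of the observations at 1-based positions n/2 and n/2 + 1
    return (ys[n // 2 - 1] + ys[n // 2]) / 2


print(sample_median([7, 1, 5]))      # 5
print(sample_median([7, 1, 5, 3]))   # 4.0
```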

Sample Mean

The mean of a sample is simply the arithmetic average of its observations.
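Written out as a formula, with $\overline{x}$ as an assumed symbol for the sample mean of the observations $x_1, \ldots, x_n$:

```latex
\overline{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \cdots + x_n}{n}
```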

Mean or Median?

Which measure of centrality should be used to report on the data?

  1. The mean is theoretically supported (see Central Limit Theorem).
  2. If the data distribution is roughly symmetric then both values will be near one another.
  3. If the data distribution is skewed then the mean is pulled toward the long tail and as a result gives a distorted view of the centre. Consequently, medians are generally used for house prices, incomes etc.
  4. The median is robust against outliers and incorrect readings whereas the mean is not.
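Points 3 and 4 can be seen on a small made-up sample. The values below are hypothetical and only meant to illustrate the effect of one extreme observation.

```python
from statistics import mean, median

incomes = [30, 32, 35, 38, 40]        # hypothetical values, in thousands
with_outlier = incomes + [1000]       # one extreme observation added

# Symmetric sample: mean and median agree (both equal 35)
print(mean(incomes), median(incomes))

# With the extreme value, the mean is pulled far toward the long tail,
# while the median only moves to 36.5
print(mean(with_outlier), median(with_outlier))
```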

Quartiles

Another way to provide information about the spread of the data is with the help of centiles, deciles, or quartiles.

The lower quartile $Q_1(x_1, \ldots, x_n)$ of a sample of size $n$, or $Q_1$, is a numerical value which splits the ordered data into 2 unequal subsets: 25% of the observations are below $Q_1$, and 75% of the observations are above $Q_1$.

Similarly, the upper quartile $Q_3$ splits the ordered data into 75% of the observations below $Q_3$, and 25% of the observations above $Q_3$. The median can be interpreted as the middle quartile, $Q_2$, of the sample, the minimum as $Q_0$, and the maximum as $Q_4$.

Centiles $p_i$, $i = 0, \ldots, 100$, and deciles $d_j$, $j = 0, \ldots, 10$, run through different splitting percentages, so that $p_{25} = Q_1$, $p_{75} = Q_3$, $d_5 = Q_2$, etc.

The lower quartile $Q_1$ is computed as the average of the ordered observations with ranks $\left[\frac{n}{4}\right]$ and $\left[\frac{n}{4}\right] + 1$. Similarly, $Q_3$ is computed as the average of the ordered observations with ranks $\left[\frac{3n}{4}\right]$ and $\left[\frac{3n}{4}\right] + 1$.
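A minimal sketch of this rank rule follows. It assumes the square brackets denote the integer part (floor) and that ranks are counted from 1; the notes do not state either explicitly, and other quartile conventions give slightly different values. The name `quartiles_by_rank` is illustrative.

```python
def quartiles_by_rank(xs):
    """Q1 and Q3 as averages of two adjacent ordered observations.

    Assumption: [.] means the integer part (floor), ranks start at 1.
    """
    ys = sorted(xs)
    n = len(ys)
    if n < 4:
        raise ValueError("need at least 4 observations for this rule")

    def avg_at(rank):
        # average of the observations with 1-based ranks `rank` and `rank + 1`
        return (ys[rank - 1] + ys[rank]) / 2

    return avg_at(n // 4), avg_at((3 * n) // 4)


q1, q3 = quartiles_by_rank([2, 4, 4, 5, 7, 8, 9, 12])
print(q1, q3)   # 4.0 8.5
```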

Outliers

An outlier is an observation that lies outside the overall pattern in a distribution. Let $x$ be an observation in the sample. It is a suspected outlier if $x < Q_1 - 1.5\,\mathrm{IQR}$ or $x > Q_3 + 1.5\,\mathrm{IQR}$, where $\mathrm{IQR} = Q_3 - Q_1$ is the inter-quartile range. This definition only applies with certainty to normally distributed data, although it is often used as a first outlier analysis method.
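The criterion translates directly into a filter. In the sketch below the quartiles are passed in as arguments, so any quartile convention can be used; the function name and the sample values are hypothetical.

```python
def suspected_outliers(xs, q1, q3):
    """Observations below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in xs if x < lower or x > upper]


data = [2, 4, 4, 5, 7, 8, 9, 30]
# Q1 = 4.0 and Q3 = 8.5 are taken as given for this sample,
# so IQR = 4.5 and the cut-offs are -2.75 and 15.25
print(suspected_outliers(data, 4.0, 8.5))   # [30]
```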

Visual Summaries

Box Plot

[Figure: box plot example]

The box plot is a quick and easy way to present a graphical summary of a univariate distribution.

  1. Draw a box along the observation axis, with endpoints at the lower and upper quartiles, and with a “belt” at the median.
  2. Then, plot a line extending from $Q_1$ to the smallest observation lying within $1.5\,\mathrm{IQR}$ to the left of $Q_1$, and from $Q_3$ to the largest observation lying within $1.5\,\mathrm{IQR}$ to the right of $Q_3$.
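The endpoints of the two lines in step 2 can be computed without any plotting library. This is a sketch under the reading that each line stops at the most extreme observation still within 1.5 IQR of the box; `whisker_ends` is an illustrative name.

```python
def whisker_ends(xs, q1, q3):
    """Endpoints of the lines drawn out from the box in step 2."""
    iqr = q3 - q1
    low = min(x for x in xs if x >= q1 - 1.5 * iqr)    # left line ends here
    high = max(x for x in xs if x <= q3 + 1.5 * iqr)   # right line ends here
    return low, high


# Hypothetical sample with quartiles Q1 = 4.0 and Q3 = 8.5 taken as given;
# the value 30 lies beyond Q3 + 1.5*IQR = 15.25, so the right line stops at 9
print(whisker_ends([2, 4, 4, 5, 7, 8, 9, 30], 4.0, 8.5))   # (2, 9)
```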

If the data distribution is symmetric then the (population) median and mean are equal and the first and third (population) quartiles are equidistant from the median.

If $Q_3 - Q_2 > Q_2 - Q_1$ then the data distribution is skewed to the right.

If $Q_3 - Q_2 < Q_2 - Q_1$ then the data distribution is skewed to the left.

Measures of Dispersion

The sample range is $\mathrm{range}(x_1, \ldots, x_n) = \max\{x_i\} - \min\{x_i\} = y_n - y_1$, where $y_1 \le \ldots \le y_n$ is the ranked data.

The inter-quartile range is $\mathrm{IQR} = Q_3 - Q_1$.

The sample standard deviation $s$ and sample variance $s^2$ are estimates of the underlying distribution's $\sigma$ and $\sigma^2$.
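The notes do not give a formula for $s^2$. The sketch below assumes the usual definition with an $n - 1$ denominator, which is also what Python's `statistics.variance` and `statistics.stdev` compute. The data values are made up.

```python
from statistics import stdev, variance

data = [2, 4, 4, 5, 7, 8, 9, 12]      # hypothetical sample
n = len(data)
xbar = sum(data) / n

# sample variance, assuming the usual n - 1 denominator
s2 = sum((x - xbar) ** 2 for x in data) / (n - 1)
s = s2 ** 0.5                         # sample standard deviation

sample_range = max(data) - min(data)  # y_n - y_1
print(sample_range, s2, s)
```

The hand-computed `s2` and `s` should agree with `variance(data)` and `stdev(data)` up to floating-point rounding.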

Histogram

Histograms also provide an indication of the distribution of the sample.

Histograms should be built according to the following guidelines:

  • the range of the histogram is $r = \max\{x_i\} - \min\{x_i\}$;
  • the number of bins should approach $k = \sqrt{n}$, where $n$ is the sample size;
  • the bin width should approach $r/k$;
  • the frequency of observations in each bin should be added to the chart.
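These guidelines are enough to compute bin counts by hand. The sketch below rounds $\sqrt{n}$ to the nearest integer and places the maximum in the last bin; both are choices made here, not prescribed by the notes.

```python
import math


def histogram_counts(xs):
    """Bin counts using about sqrt(n) equal-width bins over the sample range."""
    n = len(xs)
    lo, hi = min(xs), max(xs)
    r = hi - lo                       # range of the histogram
    k = max(1, round(math.sqrt(n)))   # number of bins, close to sqrt(n)
    width = r / k                     # bin width r / k
    counts = [0] * k
    for x in xs:
        # the maximum falls on the right edge, so it is placed in the last bin
        i = k - 1 if width == 0 else min(int((x - lo) / width), k - 1)
        counts[i] += 1
    return lo, width, counts


lo, width, counts = histogram_counts([1, 2, 2, 3, 4, 5, 5, 6, 9])
print(lo, width, counts)   # 9 observations, so 3 bins with counts [4, 4, 1]
```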

[Figure: histogram example]

Random Sampling

A population is a set of similar items which is of interest in relation to some questions or experiments. In some situations, it is impossible to observe the entire set of observations that make up a population. In this case, we consider a sample (subset) of the population, and make inferences.

Linear Properties of Expectation and Variance

$\mathrm{E}[a+bX] = a + b\,\mathrm{E}[X]$,
$\mathrm{Var}[a+bX] = b^2\,\mathrm{Var}[X]$,
$\mathrm{SD}[a+bX] = |b|\,\mathrm{SD}[X]$
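The first two identities can be checked exactly on a small discrete distribution. The example below uses a fair six-sided die as a hypothetical $X$ and arbitrary constants $a = 3$, $b = -2$; exact fractions avoid rounding issues.

```python
from fractions import Fraction

# hypothetical discrete random variable X: a fair six-sided die
values = [1, 2, 3, 4, 5, 6]
probs = [Fraction(1, 6)] * 6


def expectation(vals, ps):
    return sum(v * p for v, p in zip(vals, ps))


def variance(vals, ps):
    mu = expectation(vals, ps)
    return sum((v - mu) ** 2 * p for v, p in zip(vals, ps))


a, b = 3, -2                                 # arbitrary constants
transformed = [a + b * v for v in values]    # the values taken by a + bX

# E[a + bX] = a + b E[X]
print(expectation(transformed, probs) == a + b * expectation(values, probs))   # True
# Var[a + bX] = b^2 Var[X]; the shift a has no effect on the spread
print(variance(transformed, probs) == b ** 2 * variance(values, probs))        # True
```

The third identity follows by taking square roots, which is why $|b|$ appears rather than $b$.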

Sum of Independent Random Variables

Generally, if $X_1, X_2, \ldots, X_n$ are independent random variables, then:

  • $\mathrm{E}\left[\sum_{i=1}^n X_i\right] = \sum_{i=1}^n \mathrm{E}[X_i]$
  • $\mathrm{Var}\left[\sum_{i=1}^n X_i\right] = \sum_{i=1}^n \mathrm{Var}[X_i]$

The IID Case

Independent and identically distributed (iid) random variables are independent and all have exactly the same distribution. If each $X_i$ has mean $\mu$ and variance $\sigma^2$, then:

  • $\mathrm{E}\left[\sum_{i=1}^n X_i\right] = n\mu$
  • $\mathrm{Var}\left[\sum_{i=1}^n X_i\right] = n\sigma^2$
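Combining the iid case with the linear properties of expectation and variance gives the mean and variance of the sample mean $\overline{X} = \frac{1}{n}\sum_{i=1}^n X_i$. The derivation is sketched here because these are the quantities used to standardize $\overline{X}$ in the Central Limit Theorem.

```latex
\begin{align*}
\mathrm{E}[\overline{X}]
  &= \frac{1}{n}\,\mathrm{E}\Big[\sum_{i=1}^n X_i\Big]
   = \frac{1}{n}\cdot n\mu = \mu, \\
\mathrm{Var}[\overline{X}]
  &= \frac{1}{n^2}\,\mathrm{Var}\Big[\sum_{i=1}^n X_i\Big]
   = \frac{1}{n^2}\cdot n\sigma^2 = \frac{\sigma^2}{n}, \\
\mathrm{SD}[\overline{X}]
  &= \frac{\sigma}{\sqrt{n}}.
\end{align*}
```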

Central Limit Theorem

If $\overline{X}$ is the mean of a random sample of size $n$ taken from an unknown population with mean $\mu$ and finite variance $\sigma^2$, then the distribution of $Z = \frac{\overline{X}-\mu}{\sigma/\sqrt{n}}$ converges to the standard normal distribution $N(0,1)$ as $n \rightarrow \infty$.

This is a limiting result: for a finite $n$, $Z$ is only approximately $N(0,1)$, and the approximation improves as $n$ grows.
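A small simulation can make the statement concrete. The sketch below draws repeated samples from a Uniform(0, 1) population, which is an arbitrary non-normal choice with $\mu = 1/2$ and $\sigma^2 = 1/12$. The sample size, number of repetitions and seed are hypothetical settings.

```python
import math
import random
from statistics import mean, pstdev

random.seed(0)                 # fixed seed so the run is repeatable

n, reps = 30, 5000             # sample size and number of simulated samples
mu = 0.5                       # mean of Uniform(0, 1)
sigma = math.sqrt(1 / 12)      # standard deviation of Uniform(0, 1)

z_values = []
for _ in range(reps):
    sample = [random.random() for _ in range(n)]
    xbar = sum(sample) / n
    # standardize the sample mean as in the theorem
    z_values.append((xbar - mu) / (sigma / math.sqrt(n)))

# If Z is close to N(0, 1), its mean is near 0, its standard deviation
# is near 1, and about 95% of the values fall within +/- 1.96
inside = sum(abs(z) <= 1.96 for z in z_values) / reps
print(round(mean(z_values), 3), round(pstdev(z_values), 3), round(inside, 3))
```

Rerunning with a smaller `n` or a more skewed population is one way to see how the quality of the normal approximation depends on the sample size.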