Descriptive Statistics and Sampling Distributions

Data Descriptions

In a sense, the underlying reason for statistical analysis is to reach an understanding of the data. Studies and experiments give rise to statistical units. These units are typically described with variables (and measurements). Variables are either qualitative (categorical) or quantitative (numerical). Categorical variables take values (levels) from a finite set of categories (or classes). Numerical variables take values from a (potentially infinite) set of quantities.

Numerical Summaries

As a first pass, a variable can be described along two dimensions: centrality and spread (skewness and kurtosis are also sometimes used).

Centrality measures: (sample) median, (sample) mean, (mode, less frequent).
Spread (or dispersion) measures: standard deviation (sd), quartiles, inter-quartile range (IQR), range (less frequent).

The median, range and the quartiles are easily calculated from an ordered list of the data.

Sample Median

The median $\mathrm{med}(x_1, \ldots, x_n)$ of a sample of size $n$ is a numerical value which splits the ordered data into 2 equal subsets: half the observations are below the median, and half are above it.

If $n$ is odd, then the median is the ordered observation at position $(n + 1)/2$.
If $n$ is even, then the median is the average of the ordered observations at positions $n/2$ and $n/2 + 1$.
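
The position rule can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the function name `sample_median` is an assumption introduced here. For even $n$ it averages the two middle ordered observations, at positions $n/2$ and $n/2 + 1$.

```python
def sample_median(data):
    """Median of a non-empty sample, using the odd/even position rule."""
    ordered = sorted(data)
    n = len(ordered)
    if n == 0:
        raise ValueError("sample must be non-empty")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: 1-based position (n + 1)/2 is 0-based index n // 2.
        return ordered[mid]
    # Even n: 1-based positions n/2 and n/2 + 1 are 0-based indices mid - 1 and mid.
    return (ordered[mid - 1] + ordered[mid]) / 2
```

For example, `sample_median([3, 1, 2])` picks the middle ordered value, while `sample_median([4, 1, 3, 2])` averages the two middle ordered values.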

Sample Mean

The mean of a sample is simply the arithmetic average of its observations.

Mean or Median?

Which measure of centrality should be used to report on the data?

The mean is theoretically supported (see Central Limit Theorem).
If the data distribution is roughly symmetric, then both values will be near one another.
If the data distribution is skewed, then the mean is pulled toward the long tail and, as a result, gives a distorted view of the centre. Consequently, medians are generally used for house prices, incomes, etc.
The median is robust against outliers and incorrect readings whereas the mean is not.
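
The robustness claim can be seen on a small made-up sample. The values below are hypothetical (they are not data from the text), and the sketch uses Python's standard `statistics` module.

```python
from statistics import mean, median

# Hypothetical incomes, in thousands; invented purely for illustration.
incomes = [30, 32, 35, 38, 40]
# The same sample with one extreme (or mistyped) reading appended.
with_outlier = incomes + [1000]

print("mean:  ", mean(incomes), "->", mean(with_outlier))
print("median:", median(incomes), "->", median(with_outlier))
```

Here the single extreme reading moves the median only from 35 to 36.5, whereas the mean jumps from 35 to roughly 195.8.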

Quartiles

Another way to provide information about the spread of the data is with the help of centiles, deciles, or quartiles.

The lower quartile $Q_1(x_1, \ldots, x_n)$ of a sample of size $n$, or $Q_1$, is a numerical value which splits the ordered data into 2 unequal subsets: 25% of the observations are below $Q_1$, and 75% of the observations are above $Q_1$.

Similarly, the upper quartile $Q_3$ splits the ordered data into 75% of the observations below $Q_3$, and 25% of the observations above $Q_3$. The median can be interpreted as the middle quartile, $Q_2$, of the sample, the minimum as $Q_0$, and the maximum as $Q_4$.

Centiles $p_i$, $i = 0, \ldots, 100$, and deciles $d_j$, $j = 0, \ldots, 10$, run through different splitting percentages, so that $p_{25} = Q_1$, $p_{75} = Q_3$, $d_5 = Q_2$, etc.

The lower quartile $Q_1$ is computed as the average of the ordered observations with ranks $[\frac{n}{4}]$ and $[\frac{n}{4}+1]$. Similarly, $Q_3$ is computed as the average of the ordered observations with ranks $[\frac{3n}{4}]$ and $[\frac{3n}{4}+1]$.

Outliers

An outlier is an observation that lies outside the overall pattern in a distribution. Let $x$ be an observation in the sample. It is a suspected outlier if $x < Q_1 - 1.5\,\mathrm{IQR}$ or $x > Q_3 + 1.5\,\mathrm{IQR}$, where $\mathrm{IQR} = Q_3 - Q_1$ is the inter-quartile range. This definition only applies with certainty to normally distributed data, although it is often used as a first outlier analysis method.
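
A minimal sketch of the 1.5 IQR rule follows. It assumes quartiles from Python's `statistics.quantiles`, whose default interpolation method may give slightly different values of $Q_1$ and $Q_3$ than a rank-averaging rule; the function name `suspected_outliers` and the sample are hypothetical.

```python
from statistics import quantiles

def suspected_outliers(data):
    """Return observations outside [Q1 - 1.5 IQR, Q3 + 1.5 IQR]."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2, Q3
    # (default "exclusive" method; an assumption of this sketch).
    q1, _, q3 = quantiles(data, n=4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < lower or x > upper]

# Hypothetical sample with one value far from the rest.
sample = [10, 12, 11, 13, 12, 11, 14, 12, 100]
print(suspected_outliers(sample))
```

On this sample only the value 100 is flagged; a flagged value is a candidate for closer inspection, not automatically an error.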

Visual Summaries

Box Plot

Box Plot Example

The box plot is a quick and easy way to present a graphical summary of a univariate distribution.

Draw a box along the observation axis, with endpoints at the lower and upper quartiles, and with a “belt” at the median.
Then, plot a line extending from $Q_1$ to the smallest observation lying within $1.5\,\mathrm{IQR}$ to the left of $Q_1$, and from $Q_3$ to the largest observation lying within $1.5\,\mathrm{IQR}$ to the right of $Q_3$.

If the data distribution is symmetric then the (population) median and mean are equal and the first and third (population) quartiles are equidistant from the median.

If $Q_3 - Q_2 > Q_2 - Q_1$ then the data distribution is skewed to the right.

If $Q_3 - Q_2 < Q_2 - Q_1$ then the data distribution is skewed to the left.

Measures of dispersion

The sample range is $\mathrm{range}(x_1, \ldots, x_n) = \max\{x_i\} - \min\{x_i\} = y_n - y_1$, where $y_1 \le \cdots \le y_n$ is the ranked data.

The inter-quartile range is $\mathrm{IQR} = Q_3 - Q_1$.

The sample standard deviation $s$ and sample variance $s^2$ are estimates of the underlying distribution’s $\sigma$ and $\sigma^2$ .
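
The text does not spell out a formula for $s^2$. The sketch below assumes the usual definition with divisor $n - 1$, $s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \overline{x})^2$, which is also what `statistics.variance` and `statistics.stdev` compute. The function name and data are hypothetical.

```python
from math import sqrt
from statistics import stdev, variance

def sample_variance(data):
    """Sample variance with divisor n - 1 (assumed definition)."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    xbar = sum(data) / n
    return sum((x - xbar) ** 2 for x in data) / (n - 1)

# Hypothetical sample: mean 5, squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]
s2 = sample_variance(data)   # 32 / 7
s = sqrt(s2)

# Cross-check against the standard library.
print(s2, variance(data))
print(s, stdev(data))
```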

Histogram

Histograms also provide an indication of the distribution of the sample.

Histograms should contain the following information:

the range of the histogram is $r = \max\{x_i\} - \min\{x_i\}$;
the number of bins should approach $k = \sqrt{n}$, where $n$ is the sample size;
the bin width should approach $r/k$;
the frequency of observations in each bin should be added to the chart.
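
The binning guidelines above can be sketched as follows. This is one possible reading, with assumptions labelled: $k$ is rounded to the nearest integer, bins are equal-width and start at the minimum, and the maximum is placed in the last bin. The function name `histogram_counts` is hypothetical.

```python
from math import sqrt

def histogram_counts(data):
    """Return (left edge, right edge, frequency) for roughly sqrt(n) bins."""
    n = len(data)
    lo, hi = min(data), max(data)
    r = hi - lo                      # range of the histogram
    k = max(1, round(sqrt(n)))       # number of bins, rounded (assumption)
    if r == 0:
        return [(lo, hi, n)]         # all observations identical: one bin
    width = r / k                    # bin width
    counts = [0] * k
    for x in data:
        # The maximum would land on index k, so clamp it into the last bin.
        idx = min(int((x - lo) / width), k - 1)
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, c)
            for i, c in enumerate(counts)]

# Hypothetical sample of size 9, giving k = 3 bins of width 4/3.
bins = histogram_counts([1, 2, 2, 3, 3, 3, 4, 4, 5])
for left, right, freq in bins:
    print(f"[{left:.2f}, {right:.2f}): {freq}")
```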

Histogram Example

Random Sampling

A population is a set of similar items which is of interest in relation to some questions or experiments. In some situations, it is impossible to observe the entire set of observations that make up a population. In this case, we consider a sample (subset) of the population, and make inferences.

Linear Properties of Expectation and Variance

$\mathrm{E}[a+bX] = a+b\,\mathrm{E}[X]$,
$\mathrm{Var}[a+bX] = b^2\,\mathrm{Var}[X]$,
$\mathrm{SD}[a+bX] = |b|\,\mathrm{SD}[X]$.
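
These identities can be checked numerically. The sketch treats $X$ as uniformly distributed over a small hypothetical list of values, so that its expectation and variance are the population mean and variance of that list; the values of `a` and `b` are arbitrary.

```python
from math import isclose
from statistics import fmean, pstdev, pvariance

# X takes each of these (hypothetical) values with equal probability.
x = [1.0, 2.0, 4.0, 7.0]
a, b = 3.0, -2.0
y = [a + b * xi for xi in x]         # the transformed variable a + bX

print(isclose(fmean(y), a + b * fmean(x)))          # E[a+bX] = a + b E[X]
print(isclose(pvariance(y), b ** 2 * pvariance(x))) # Var[a+bX] = b^2 Var[X]
print(isclose(pstdev(y), abs(b) * pstdev(x)))       # SD[a+bX] = |b| SD[X]
```

Note that the shift `a` has no effect on the variance, and a negative `b` still gives a positive standard deviation because of the absolute value.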

Sum of Independent Random Variables

In general, if $X_1, X_2, \ldots, X_n$ are independent random variables, then:

$\mathrm{E}\left[\sum_{i=1}^n X_i\right] = \sum_{i=1}^n \mathrm{E}[X_i]$
$\mathrm{Var}\left[\sum_{i=1}^n X_i\right] = \sum_{i=1}^n \mathrm{Var}[X_i]$

The IID Case

Independent and identically distributed (iid) random variables are independent and all have exactly the same distribution. If each $X_i$ has mean $\mu$ and variance $\sigma^2$, then:

$\mathrm{E}\left[\sum_{i=1}^n X_i\right] = n\mu$
$\mathrm{Var}\left[\sum_{i=1}^n X_i\right] = n\sigma^2$
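
As an illustration, the mean and variance of a sum of iid variables can be verified exactly for fair six-sided dice by enumerating every outcome. The dice example and the helper name `moments` are assumptions introduced here; exact fractions avoid rounding issues.

```python
from fractions import Fraction
from itertools import product

def moments(values):
    """Exact mean and variance of equally likely outcomes."""
    values = list(values)
    n = len(values)
    mu = Fraction(sum(values), n)
    var = sum((v - mu) ** 2 for v in values) / n
    return mu, var

faces = range(1, 7)
mu, sigma2 = moments(faces)              # one die: 7/2 and 35/12

n = 3                                    # number of independent dice
# All 6**n rolls are equally likely because the dice are independent.
sums = [sum(roll) for roll in product(faces, repeat=n)]
mu_sum, var_sum = moments(sums)

print(mu_sum == n * mu, var_sum == n * sigma2)
```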

Central Limit Theorem

If $\overline{X}$ is the mean of a random sample of size $n$ taken from an unknown population with mean $\mu$ and finite variance $\sigma^2$, then the distribution of $Z = \frac{\overline{X}-\mu}{\sigma/\sqrt{n}}$ approaches the standard normal distribution $N(0,1)$ as $n \rightarrow \infty$.

More precisely, this is a limiting result: for any finite $n$, $Z$ is only approximately $N(0,1)$, with the approximation improving as $n$ grows.
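
A small simulation can make the theorem concrete. The sketch below assumes an Exponential(1) population (mean 1, standard deviation 1), chosen here only because it is clearly non-normal; the sample size, number of repetitions, seed, and function name are all arbitrary choices for illustration.

```python
import random
from statistics import fmean, pstdev

def standardized_means(n, reps, seed=0):
    """Simulate Z = (Xbar - mu) / (sigma / sqrt(n)) for Exponential(1) samples."""
    rng = random.Random(seed)
    mu, sigma = 1.0, 1.0                 # mean and sd of Exponential(1)
    zs = []
    for _ in range(reps):
        xbar = fmean([rng.expovariate(1.0) for _ in range(n)])
        zs.append((xbar - mu) / (sigma / n ** 0.5))
    return zs

zs = standardized_means(n=50, reps=5000)
# For a standard normal, about 95% of values lie within 1.96 of zero.
share_within_1_96 = sum(abs(z) <= 1.96 for z in zs) / len(zs)

print(fmean(zs), pstdev(zs), share_within_1_96)
```

If the approximation is reasonable, the simulated values of $Z$ should have mean near 0, standard deviation near 1, and roughly 95% of them within 1.96 of zero, even though the underlying population is skewed.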

Continuous Distributions

Point and Interval Estimation