Descriptive Statistics and Sampling Distributions
Data Descriptions
In a sense, the underlying reason for statistical analysis is to reach an understanding of the data. Studies and experiments give rise to statistical units. These units are typically described with variables (and measurements). Variables are either qualitative (categorical) or quantitative (numerical). Categorical variables take values (levels) from a finite set of categories (or classes). Numerical variables take values from a (potentially infinite) set of quantities.
Numerical Summaries
As a first pass, a variable can be described along 2 dimensions: centrality and spread (skew and kurtosis are also sometimes used).
- Centrality measures: (sample) median, (sample) mean, mode (less frequent).
- Spread (or dispersion) measures: standard deviation (sd), quartiles, inter-quartile range (IQR), range (less frequent).
The median, range and the quartiles are easily calculated from an ordered list of the data.
Sample Median
The median $\text{med}(x_1, \ldots, x_n)$ of a sample of size $n$ is a numerical value which splits the ordered data into 2 equal subsets: half the observations are below the median, and half are above it.
- If $n$ is odd, then the median is the ordered observation in position $(n+1)/2$.
- If $n$ is even, then the median is the average of the $\frac{n}{2}$-th and the $\left(\frac{n}{2}+1\right)$-th ordered observations.
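The odd and even cases can be sketched in a few lines of Python. This is only a minimal illustration; the function name `sample_median` is an assumption made for the example, not something defined in these notes.

```python
def sample_median(data):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(data)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty sample is undefined")
    mid = n // 2
    if n % 2 == 1:
        # n odd: position (n + 1) / 2 in 1-based ranks is index n // 2 in 0-based indexing
        return ordered[mid]
    # n even: average the (n/2)-th and (n/2 + 1)-th ordered observations
    return (ordered[mid - 1] + ordered[mid]) / 2


print(sample_median([7, 1, 3]))      # 3
print(sample_median([7, 1, 3, 10]))  # 5.0
```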
Sample Mean
The mean of a sample is simply the arithmetic average of its observations.
Mean or Median?
Which measure of centrality should be used to report on the data?
- The mean is theoretically supported (see Central Limit Theorem).
- If the data distribution is roughly symmetric then both values will be near one another.
- If the data distribution is skewed then the mean is pulled toward the long tail and as a result gives a distorted view of the centre. Consequently, medians are generally used for house prices, incomes etc.
- The median is robust against outliers and incorrect readings whereas the mean is not.
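The robustness claim is easy to check numerically. The sketch below uses made-up income values (in thousands) purely for illustration: adding a single extreme observation moves the mean a great deal and the median only slightly.

```python
from statistics import mean, median

# Hypothetical incomes, in thousands; not real data
incomes = [30, 32, 35, 38, 40]
with_outlier = incomes + [400]  # one extreme observation

# Symmetric-looking sample: mean and median agree
mean_before, median_before = mean(incomes), median(incomes)

# With the extreme value: the mean is pulled toward the long tail,
# while the median barely moves
mean_after, median_after = mean(with_outlier), median(with_outlier)

print(mean_before, median_before)
print(mean_after, median_after)
```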
Quartiles
Another way to provide information about the spread of the data is with the help of centiles, deciles, or quartiles.
The lower quartile $Q_1(x_1, \ldots, x_n)$ of a sample of size $n$, or simply $Q_1$, is a numerical value which splits the ordered data into 2 unequal subsets: 25% of the observations are below $Q_1$, and 75% of the observations are above $Q_1$.
Similarly, the upper quartile $Q_3$ splits the ordered data into 75% of the observations below $Q_3$, and 25% of the observations above $Q_3$. The median can be interpreted as the middle quartile, $Q_2$, of the sample, the minimum as $Q_0$, and the maximum as $Q_4$.
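Centiles and deciles run through different splitting percentages: for instance, the 25th centile coincides with the lower quartile, the 75th centile with the upper quartile, and the 5th decile with the median.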
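Conventions for computing quartiles differ. Under one simple convention, the lower quartile $Q_1$ is computed as the average of the ordered observations with ranks $\lfloor (n+3)/4 \rfloor$ and $\lceil (n+3)/4 \rceil$. Similarly, $Q_3$ is computed as the average of the ordered observations with ranks $\lfloor (3n+1)/4 \rfloor$ and $\lceil (3n+1)/4 \rceil$.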
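As a rough sketch of a rank-averaging rule, the code below assumes the ranks are the floor and ceiling of $(n+3)/4$ for the lower quartile and of $(3n+1)/4$ for the upper quartile. That choice of ranks, and the function name `quartiles_by_rank`, are assumptions made for this example; other software may use a different interpolation convention and return slightly different values.

```python
import math


def quartiles_by_rank(data):
    """Return (Q1, Q3) by averaging the two ordered observations around a rank."""
    y = sorted(data)
    n = len(y)
    if n == 0:
        raise ValueError("quartiles of an empty sample are undefined")

    def average_at(position):
        # position is a 1-based (possibly fractional) rank
        low, high = math.floor(position), math.ceil(position)
        return (y[low - 1] + y[high - 1]) / 2

    q1 = average_at((n + 3) / 4)      # assumed rank for the lower quartile
    q3 = average_at((3 * n + 1) / 4)  # assumed rank for the upper quartile
    return q1, q3


print(quartiles_by_rank(range(1, 10)))  # (3.0, 7.0)
print(quartiles_by_rank(range(1, 9)))   # (2.5, 6.5)
```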
Outliers
An outlier is an observation that lies outside the overall pattern in a distribution. Let $x$ be an observation in the sample. It is a suspected outlier if $x < Q_1 - 1.5\,\text{IQR}$ or $x > Q_3 + 1.5\,\text{IQR}$, where $\text{IQR} = Q_3 - Q_1$ is the inter-quartile range. This definition only applies with certainty to normally distributed data, although it is often used as a first outlier analysis method.
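The 1.5 IQR rule can be sketched as follows. The example relies on `statistics.quantiles` from the Python standard library with `method="inclusive"`, which is one interpolation convention among several, so the quartiles may differ slightly from hand computations. The function name and the sample values are illustrative assumptions.

```python
from statistics import quantiles


def suspected_outliers(data):
    """Return the observations lying more than 1.5 IQR outside the quartiles."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [x for x in data if x < lower_fence or x > upper_fence]


# Made-up sample with one extreme value
print(suspected_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 50]))  # [50]
```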
Visual Summaries
Box Plot
Box Plot Example
The box plot is a quick and easy way to present a graphical summary of a univariate distribution.
- Draw a box along the observation axis, with endpoints at the lower and upper quartiles, and with a “belt” at the median.
- Then, draw a line (whisker) from $Q_1$ to the smallest observation lying within $1.5\,\text{IQR}$ to the left of $Q_1$, and from $Q_3$ to the largest observation lying within $1.5\,\text{IQR}$ to the right of $Q_3$.
If the data distribution is symmetric then the (population) median and mean are equal and the first and third (population) quartiles are equidistant from the median.
If $Q_3 - Q_2 > Q_2 - Q_1$ then the data distribution is skewed to the right.
If $Q_3 - Q_2 < Q_2 - Q_1$ then the data distribution is skewed to the left.
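The ingredients of a box plot (box endpoints, belt, whisker endpoints) and the quartile-based skew check can be computed without drawing anything. This is a minimal sketch: `box_plot_summary` is a hypothetical helper, the quartiles come from `statistics.quantiles` with `method="inclusive"` (one convention among several), and the skew labels compare the distances from the median to the upper and lower quartiles.

```python
from statistics import quantiles


def box_plot_summary(data):
    """Return the numbers needed to draw a box plot, plus a skew label."""
    y = sorted(data)
    q1, q2, q3 = quantiles(y, n=4, method="inclusive")
    iqr = q3 - q1
    # Whiskers stop at the most extreme observations within 1.5 IQR of the box
    whisker_low = min(v for v in y if v >= q1 - 1.5 * iqr)
    whisker_high = max(v for v in y if v <= q3 + 1.5 * iqr)
    if q3 - q2 > q2 - q1:
        skew = "right"
    elif q3 - q2 < q2 - q1:
        skew = "left"
    else:
        skew = "none"
    return {"q1": q1, "median": q2, "q3": q3,
            "whisker_low": whisker_low, "whisker_high": whisker_high,
            "skew": skew}


# Made-up sample with a long right tail
summary = box_plot_summary([1, 2, 3, 4, 10, 20, 50])
print(summary)
```

In this sample the value 50 lies beyond the upper whisker, so it would be drawn as an individual point rather than covered by the whisker.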
Measures of dispersion
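The sample range is range = max - min = $y_n - y_1$, where $y_1 \leq \cdots \leq y_n$ is the ranked data.
The inter-quartile range is IQR = $Q_3 - Q_1$.
The sample standard deviation $s$ and sample variance $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \overline{x})^2$ are estimates of the underlying distribution’s $\sigma$ and $\sigma^2$.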
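The three dispersion measures can be computed together, as in the sketch below. The sample variance is written out with the usual $n-1$ denominator; the IQR uses `statistics.quantiles` with `method="inclusive"`, which is an assumed convention and may not match every textbook rule. The helper name and data are illustrative only.

```python
import math
from statistics import quantiles


def dispersion_summary(data):
    """Return the range, IQR, sample variance and sample standard deviation."""
    y = sorted(data)
    n = len(y)
    if n < 2:
        raise ValueError("at least two observations are needed")
    mean = sum(y) / n
    # Sample variance: squared deviations from the mean, divided by n - 1
    s2 = sum((v - mean) ** 2 for v in y) / (n - 1)
    q1, _, q3 = quantiles(y, n=4, method="inclusive")
    return {"range": y[-1] - y[0], "iqr": q3 - q1,
            "variance": s2, "sd": math.sqrt(s2)}


result = dispersion_summary([2, 4, 4, 4, 5, 5, 7, 9])
print(result)
```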
Histogram
Histograms also provide an indication of the distribution of the sample.
Histograms should contain the following information:
- the range of the histogram is $r$ = max - min;
- the number of bins should approach $k = \sqrt{n}$ (a common rule of thumb), where $n$ is the sample size;
- the bin width should approach $r/k$;
- the frequency of observations in each bin should be added to the chart.
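The binning guidelines can be turned into a small frequency table. The sketch assumes the square-root rule for the number of bins (roughly $\sqrt{n}$ bins of equal width covering the range); `histogram_counts` is a hypothetical helper, and plotting libraries typically offer other binning rules as well.

```python
import math


def histogram_counts(data):
    """Return (left edge, bin width, list of bin frequencies)."""
    n = len(data)
    if n == 0:
        raise ValueError("cannot bin an empty sample")
    low, high = min(data), max(data)
    r = high - low                       # range of the histogram
    k = max(1, round(math.sqrt(n)))      # assumed rule: about sqrt(n) bins
    width = r / k if r > 0 else 1.0      # bin width r / k (guard against r = 0)
    counts = [0] * k
    for x in data:
        # The maximum falls exactly on the right edge; put it in the last bin
        index = min(int((x - low) / width), k - 1)
        counts[index] += 1
    return low, width, counts


low, width, counts = histogram_counts(list(range(16)))
print(low, width, counts)  # 0 3.75 [4, 4, 4, 4]
```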
Histogram Example
Random Sampling
A population is a set of similar items which is of interest in relation to some questions or experiments. In some situations, it is impossible to observe the entire set of observations that make up a population. In this case, we consider a sample (subset) of the population, and make inferences.
Linear Properties of Expectation and Variance
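For a random variable $X$ and constants $a, b \in \mathbb{R}$:
$$E[a + bX] = a + bE[X],$$
$$\text{Var}[a + bX] = b^2\,\text{Var}[X],$$
$$\text{SD}[a + bX] = |b|\,\text{SD}[X].$$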
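The linear rules for expectation, variance and standard deviation can be verified exactly on a small discrete distribution. The sketch below uses a fair six-sided die as a hypothetical example and checks the standard identities for $a + bX$ with exact fractions; the helper names are assumptions made for the illustration.

```python
import math
from fractions import Fraction


def expectation(pmf):
    """E[X] for a discrete distribution given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())


def variance(pmf):
    """Var[X] = E[(X - E[X])^2]."""
    mu = expectation(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())


def affine(pmf, a, b):
    """Distribution of a + b*X (probabilities are merged if values coincide)."""
    out = {}
    for x, p in pmf.items():
        out[a + b * x] = out.get(a + b * x, 0) + p
    return out


die = {x: Fraction(1, 6) for x in range(1, 7)}  # fair die, illustrative only
a, b = 3, -2
transformed = affine(die, a, b)

# E[a + bX] = a + b E[X]  and  Var[a + bX] = b^2 Var[X]
print(expectation(transformed), a + b * expectation(die))
print(variance(transformed), b ** 2 * variance(die))
# SD[a + bX] = |b| SD[X]
sd_transformed = math.sqrt(variance(transformed))
sd_expected = abs(b) * math.sqrt(variance(die))
print(sd_transformed, sd_expected)
```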
Sum of Independent Random Variables
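Generally, if $X_1, \ldots, X_n$ are independent random variables, then:
$$E[X_1 + \cdots + X_n] = E[X_1] + \cdots + E[X_n],$$
$$\text{Var}[X_1 + \cdots + X_n] = \text{Var}[X_1] + \cdots + \text{Var}[X_n].$$
The first identity holds even without independence; the second relies on it.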
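For independent variables, the distribution of a sum can be built by multiplying probabilities, which makes it possible to check exactly that means and variances add. The sketch below uses a fair die and a fair coin as hypothetical examples; the helper names are assumptions for the illustration.

```python
from fractions import Fraction
from itertools import product


def expectation(pmf):
    return sum(x * p for x, p in pmf.items())


def variance(pmf):
    mu = expectation(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())


def sum_of_independent(p, q):
    """Distribution of X + Y when X ~ p and Y ~ q are independent."""
    out = {}
    for (x, px), (y, py) in product(p.items(), q.items()):
        # Independence: the joint probability is the product of the marginals
        out[x + y] = out.get(x + y, 0) + px * py
    return out


die = {x: Fraction(1, 6) for x in range(1, 7)}   # fair die
coin = {0: Fraction(1, 2), 1: Fraction(1, 2)}    # fair coin coded as 0/1
total = sum_of_independent(die, coin)

print(expectation(total), expectation(die) + expectation(coin))  # 4 4
print(variance(total), variance(die) + variance(coin))           # 19/6 19/6
```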
The IID Case
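Independent and identically distributed (iid) random variables are mutually independent and have exactly the same distribution. In particular, if $X_1, \ldots, X_n$ are iid with mean $\mu$ and variance $\sigma^2$, then $E[X_1 + \cdots + X_n] = n\mu$ and $\text{Var}[X_1 + \cdots + X_n] = n\sigma^2$, so that the sample mean $\overline{X}$ has mean $\mu$ and variance $\sigma^2/n$.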
Central Limit Theorem
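If $\overline{X}$ is the mean of a random sample of size $n$ taken from an unknown population with mean $\mu$ and finite variance $\sigma^2$, then $Z_n = \dfrac{\overline{X} - \mu}{\sigma/\sqrt{n}}$ approaches the standard normal distribution as $n \to \infty$.
More precisely, this is a limiting result: for every $z$, $\lim_{n \to \infty} P(Z_n \leq z) = \Phi(z)$, where $\Phi$ is the cumulative distribution function of the standard normal distribution.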
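A small simulation can make the limiting statement concrete. The sketch below draws repeated samples from an exponential distribution with rate 1 (mean 1, variance 1; a skewed population chosen only for illustration), standardizes each sample mean as $(\overline{X} - \mu)/(\sigma/\sqrt{n})$, and compares the empirical proportion of standardized means at or below $z = 1$ with $\Phi(1)$. The sample size, number of repetitions and seed are arbitrary assumptions.

```python
import math
import random
from statistics import NormalDist, fmean


def standardized_means(n, reps, seed=0):
    """Simulate `reps` standardized sample means of size-n Exp(1) samples."""
    rng = random.Random(seed)   # fixed seed so the run is reproducible
    mu, sigma = 1.0, 1.0        # mean and standard deviation of Exp(1)
    z_values = []
    for _ in range(reps):
        xbar = fmean(rng.expovariate(1.0) for _ in range(n))
        z_values.append((xbar - mu) / (sigma / math.sqrt(n)))
    return z_values


z_values = standardized_means(n=50, reps=5000)
empirical = sum(z <= 1.0 for z in z_values) / len(z_values)
theoretical = NormalDist().cdf(1.0)   # Phi(1), roughly 0.841
print(empirical, theoretical)
```

The two proportions should be close but not identical: the simulation has sampling noise, and for finite $n$ the normal distribution is only an approximation.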