Linear Regression

Coefficient of Correlation

Population Coefficient of Correlation

For paired variables (X, Y), the population correlation coefficient of X and Y is

\rho_{XY} = \frac{\mathrm{E}[(X - \mu_X)(Y - \mu_Y)]}{\sqrt{\mathrm{E}(X - \mu_X)^2\,\mathrm{E}(Y - \mu_Y)^2}} = \frac{\mathrm{E}[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X\sigma_Y}

  • \rho_{XY} is unaffected by changes of scale or origin: adding constants to X or Y, or multiplying them by positive constants, will not change \rho_{XY}
  • \rho_{XY} is symmetric in X and Y (i.e. \rho_{XY} = \rho_{YX})
  • -1 \leq \rho_{XY} \leq 1
  • if \rho_{XY} = \pm 1, then X and Y are perfectly linearly related, with a positive (negative) slope
  • the sign of \rho_{XY} reflects the trend of the points

Sample Coefficient of Correlation

For paired observations (x_i, y_i), the sample correlation coefficient of x and y is

r_{XY} = \frac{\sum(x_i - \overline{x})(y_i - \overline{y})}{\sqrt{\sum(x_i - \overline{x})^2\sum(y_i - \overline{y})^2}} = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}

  • r_{XY} is unaffected by changes of scale or origin: adding constants to x or y, or multiplying them by positive constants, will not change r_{XY}
  • r_{XY} is symmetric in x and y (i.e. r_{XY} = r_{YX})
  • -1 \leq r_{XY} \leq 1
  • if r_{XY} = \pm 1, then the observations (x_i, y_i) all lie on a straight line with a positive (negative) slope
  • the sign of r_{XY} reflects the trend of the points
  • a high value of |r_{XY}| does not necessarily imply a causal relationship between the two variables. A causal relationship between two variables exists if the occurrence of the first causes the other.
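As a quick numerical sketch of the formula and these properties (the data values below are made up purely for illustration; numpy is assumed):

```python
import numpy as np

# Hypothetical paired observations, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# r_xy = S_xy / sqrt(S_xx * S_yy), exactly as in the formula above
Sxy = np.sum((x - x.mean()) * (y - y.mean()))
Sxx = np.sum((x - x.mean()) ** 2)
Syy = np.sum((y - y.mean()) ** 2)
r = Sxy / np.sqrt(Sxx * Syy)

print(r)                                  # agrees with the built-in:
print(np.corrcoef(x, y)[0, 1])

# Invariance under changes of origin and (positive) scale:
print(np.corrcoef(3 * x + 7, 0.5 * y - 2)[0, 1])   # same value of r
```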

Simple Linear Regression

Regression analysis can be used to describe the relationship between a predictor variable (or regressor) X and a response variable Y. Assume that they are related through the model:

Y = \beta_0 + \beta_1 X + \epsilon

where \epsilon is a random error and \beta_0, \beta_1 are the regression coefficients.

It is assumed that \mathrm{E}[\epsilon] = 0, and that the error's variance \sigma^2_\epsilon = \sigma^2 is constant. Then the model can be re-written as

\mathrm{E}[Y|X] = \beta_0 + \beta_1X.

Suppose that we have observations (x_i, y_i), \ i = 1, ..., n, so that

y_i = \beta_0 + \beta_1x_i + \epsilon_i, \ \ i = 1, ..., n

The aim is to find estimators b_0, b_1 of the unknown parameters \beta_0, \beta_1 in order to obtain the estimated (fitted) regression line

\hat{y}_i = b_0 + b_1x_i

The residual or error in predicting y_i using \hat{y}_i is thus

e_i = y_i - \hat{y}_i = y_i - b_0 - b_1x_i, \ i = 1, ..., n

Sum of Squared Errors (SSE)

\mathrm{SSE} = \sum^n_{i=1} e^2_i = \sum^n_{i=1}(y_i - b_0 - b_1x_i)^2

The optimal values of b_0 and b_1 are those that minimize the SSE, so we set the partial derivatives to zero and solve:

0 = \frac{\partial\,\mathrm{SSE}}{\partial b_0} = -2\sum (y_i - b_0 - b_1x_i) = -2n(\overline{y} - b_0 - b_1\overline{x})
\therefore b_0 = \overline{y} - b_1\overline{x}

0 = \frac{\partial\,\mathrm{SSE}}{\partial b_1} = -2\sum (y_i - b_0 - b_1x_i)x_i = -2\left(\sum x_iy_i - nb_0\overline{x} - b_1\sum x^2_i\right)
\therefore b_1 = \frac{S_{xy}}{S_{xx}}
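These two closed-form solutions translate directly into code. A minimal sketch, again with made-up data and numpy assumed:

```python
import numpy as np

# Hypothetical data, as in the correlation sketch above.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

Sxy = np.sum((x - x.mean()) * (y - y.mean()))
Sxx = np.sum((x - x.mean()) ** 2)

b1 = Sxy / Sxx                  # slope: b1 = Sxy / Sxx
b0 = y.mean() - b1 * x.mean()   # intercept: b0 = ybar - b1 * xbar

print(b0, b1)
print(np.polyfit(x, y, 1))      # cross-check: returns [b1, b0]
```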

Variance Decomposition in Regression

S_{yy} = \sum_{i=1}^n (y_i - \overline{y})^2 = b_1S_{xy} + \mathrm{SSE} = \mathrm{SSR} + \mathrm{SSE}

where \mathrm{SSR} = b_1S_{xy} is the regression sum of squares; equivalently,

\mathrm{SSE} = S_{yy} - b_1S_{xy}

Estimating Variance

To estimate \sigma^2 we use:

\mathrm{SSE} = \sum_{i=1}^{n} e^2_i = \sum^n_{i=1} (y_i - \hat{y}_i)^2

For the regression error, the unbiased estimator of \sigma^2 is in fact

\hat{\sigma}^2 = \mathrm{MSE} = \frac{\mathrm{SSE}}{n-2} = \frac{S_{yy} - b_1S_{xy}}{n-2}

where the SSE has n-2 degrees of freedom, because 2 parameters had to be estimated in order to obtain \hat{y}_i: b_0 and b_1.
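A short sketch of the two equivalent ways to compute the SSE, and of the MSE (made-up data; numpy assumed):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

Sxy = np.sum((x - x.mean()) * (y - y.mean()))
Sxx = np.sum((x - x.mean()) ** 2)
Syy = np.sum((y - y.mean()) ** 2)
b1 = Sxy / Sxx
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x
SSE = np.sum((y - y_hat) ** 2)      # direct definition
print(SSE, Syy - b1 * Sxy)          # shortcut SSE = Syy - b1*Sxy agrees

MSE = SSE / (n - 2)                 # unbiased estimator of sigma^2
print(MSE)
```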

Properties of the Least Square Estimators

  • \sum^n_{i=1} e_i = 0
  • \sum^n_{i=1} x_ie_i = 0
  • The regression line \hat{y} = b_0 + b_1x always passes through the sample means of x and y, i.e., \overline{y} = b_0 + b_1\overline{x}
  • \mathrm{E}[Y|X] = \beta_0 + \beta_1X, \ \mathrm{Var}[Y|X] = \sigma^2
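The first three properties are easy to verify numerically; a minimal sketch (same made-up data as above):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
e = y - (b0 + b1 * x)                      # residuals

print(np.sum(e))                           # ~0, up to floating-point error
print(np.sum(x * e))                       # ~0 as well
print(b0 + b1 * x.mean(), y.mean())        # line passes through (xbar, ybar)
```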

Standard Errors

The estimated standard errors are:

\mathrm{se}(b_0) = \sqrt{\hat{\sigma}^2[1/n + \overline{x}^2/S_{xx}]} and
\mathrm{se}(b_1) = \sqrt{\hat{\sigma}^2/S_{xx}}
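In code (made-up data; numpy assumed):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
MSE = np.sum((y - b0 - b1 * x) ** 2) / (n - 2)   # estimate of sigma^2

se_b0 = np.sqrt(MSE * (1 / n + x.mean() ** 2 / Sxx))
se_b1 = np.sqrt(MSE / Sxx)
print(se_b0, se_b1)
```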

Hypothesis Testing for Linear Regression

With standard errors, we can test hypotheses on the regression parameters. We try to determine whether the true parameters \beta_0, \beta_1 take on specific values, and whether the line of best fit describes a bivariate dataset well.

  1. set up a null hypothesis H_0 and an alternative hypothesis H_1
  2. compute a test statistic (often by some form of standardizing)
  3. find a critical region/p-value for the test statistic under H_0
  4. reject or fail to reject H_0 based on the critical region/p-value

Hypothesis Test for the Slope

H_0 : \beta_1 = \beta_{1,0} against H_1 : \beta_1 \neq \beta_{1,0}

Since b_1 is a linear function of the observed responses y_i, it has a normal distribution with mean \beta_1 and variance \sigma^2/S_{xx}. Therefore, under H_0,

Z_0 = \frac{b_1 - \beta_{1,0}}{\sqrt{\sigma^2/S_{xx}}} \sim N(0,1)

But \sigma^2 is not known, so with \hat{\sigma}^2 = \mathrm{MSE}, the test statistic

T_0 = \frac{b_1 - \beta_{1,0}}{\sqrt{\hat{\sigma}^2/S_{xx}}} \sim t(n-2)

follows a Student t-distribution with n2n - 2 degrees of freedom.

| Alternative Hypothesis | Critical Region |
| --- | --- |
| H_1 : \beta_1 > \beta_{1,0} | t_0 > t_\alpha(n-2) |
| H_1 : \beta_1 < \beta_{1,0} | t_0 < -t_\alpha(n-2) |
| H_1 : \beta_1 \neq \beta_{1,0} | \lvert t_0\rvert > t_{\alpha/2}(n-2) |

Reject H_0 if t_0 falls in the critical region.
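A sketch of the two-sided test (\beta_{1,0} = 0 is a hypothetical choice here; made-up data, numpy and scipy assumed):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
MSE = np.sum((y - b0 - b1 * x) ** 2) / (n - 2)

beta10 = 0.0                               # hypothesized slope under H0
t0 = (b1 - beta10) / np.sqrt(MSE / Sxx)    # ~ t(n-2) under H0

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)    # two-sided critical value
p_value = 2 * stats.t.sf(abs(t0), df=n - 2)
print(t0, t_crit, p_value)   # reject H0 if |t0| > t_crit, i.e. p_value < alpha
```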

Significance of Regression

Given a regression line, we may want to test whether it is significant. The test for significance of the regression is

H_0 : \beta_1 = 0 against H_1 : \beta_1 \neq 0

Confidence and Prediction Intervals for Linear Regression

We can also build confidence intervals (C.I.) for the regression parameters and prediction intervals for the predicted values.

  1. find a point estimate W for the parameter \beta or the prediction Y
  2. find the appropriate standard error \mathrm{se}(W)
  3. select a significance level \alpha and find the appropriate critical value k_{\alpha/2}
  4. build the 100(1-\alpha)% interval W \pm k_{\alpha/2}\,\mathrm{se}(W)

Confidence Intervals for the Intercept and Slope

The 100(1-\alpha)% C.I. for \beta_0 and \beta_1 are:

  • \beta_0 : b_0 \pm t_{\alpha/2}(n-2)\sqrt{\hat{\sigma}^2\frac{\sum x_i^2}{nS_{xx}}}
  • \beta_1 : b_1 \pm t_{\alpha/2}(n-2)\sqrt{\frac{\hat{\sigma}^2}{S_{xx}}}
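A sketch of both intervals at the 95% level (made-up data; numpy and scipy assumed):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
MSE = np.sum((y - b0 - b1 * x) ** 2) / (n - 2)

alpha = 0.05
t = stats.t.ppf(1 - alpha / 2, df=n - 2)

half0 = t * np.sqrt(MSE * np.sum(x ** 2) / (n * Sxx))
half1 = t * np.sqrt(MSE / Sxx)
print((b0 - half0, b0 + half0))    # 95% C.I. for beta_0
print((b1 - half1, b1 + half1))    # 95% C.I. for beta_1
```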

Confidence Intervals for the Mean Response

With the usual t_{\alpha/2}(n-2), the 100(1-\alpha)% C.I. for the mean response \mu_{Y|x_0} (or for the line of regression) is

\hat{\mu}_{Y|x_0} \pm t_{\alpha/2}(n-2)\sqrt{\hat{\sigma}^2\left[1/n + \frac{(x_0 - \overline{x})^2}{S_{xx}}\right]}

where \hat{\mu}_{Y|x_0} = b_0 + b_1x_0.
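A sketch at a hypothetical point x_0 = 2.5 (made-up data; numpy and scipy assumed):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
MSE = np.sum((y - b0 - b1 * x) ** 2) / (n - 2)

x0 = 2.5                                   # hypothetical point of interest
mu_hat = b0 + b1 * x0                      # estimated mean response at x0
t = stats.t.ppf(0.975, df=n - 2)           # alpha = 0.05, two-sided
half = t * np.sqrt(MSE * (1 / n + (x0 - x.mean()) ** 2 / Sxx))
print((mu_hat - half, mu_hat + half))      # 95% C.I. for E[Y | x0]
```

Note from the formula that the interval is narrowest at x_0 = \overline{x} and widens as x_0 moves away from the sample mean.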

Analysis of Variance

The test for significance of regression,

H_0 : \beta_1 = 0 against H_1 : \beta_1 \neq 0

can be restated in terms of the analysis of variance (ANOVA), given the following table:

| Source of Variation | Sum of Squares | df | Mean Square | F^* | p-Value |
| --- | --- | --- | --- | --- | --- |
| Regression | SSR | 1 | MSR | \frac{\mathrm{MSR}}{\mathrm{MSE}} | P(F > F^*) |
| Error | SSE | n-2 | MSE | - | - |
| Total | SST | n-1 | - | - | - |

The rejection region for the null hypothesis H_0 : \beta_1 = 0 is still given by:

\left|\frac{b_1}{\sqrt{\hat{\sigma}^2/S_{xx}}}\right| > t_{\alpha/2}(n-2)

since \beta_{1,0} = 0 here; equivalently, reject H_0 when F^* = \mathrm{MSR}/\mathrm{MSE} = T_0^2 exceeds F_\alpha(1, n-2).
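A sketch of the ANOVA computation and its equivalence with the t-test (made-up data; numpy and scipy assumed):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

Sxy = np.sum((x - x.mean()) * (y - y.mean()))
Sxx = np.sum((x - x.mean()) ** 2)
Syy = np.sum((y - y.mean()) ** 2)
b1 = Sxy / Sxx

SSR = b1 * Sxy                 # regression sum of squares
SSE = Syy - b1 * Sxy           # error sum of squares
MSR = SSR / 1                  # 1 df for regression
MSE = SSE / (n - 2)            # n-2 df for error

F_star = MSR / MSE
p_value = stats.f.sf(F_star, 1, n - 2)     # P(F > F*)
print(F_star, p_value)

t0 = b1 / np.sqrt(MSE / Sxx)   # t statistic for H0: beta_1 = 0
print(t0 ** 2)                 # equals F_star
```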

Coefficient of Determination

For observations (x_i, y_i), \ i = 1, ..., n, we define the coefficient of determination as

R^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}

where SSE and SST are as in the ANOVA.

The coefficient of determination is the proportion of the variability in the response that is explained by the fitted model. Note that R^2 always lies between 0 and 1; when R^2 \approx 1, the fit is considered to be very good.
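As a final sketch tying the pieces together (made-up data; numpy assumed):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

Sxy = np.sum((x - x.mean()) * (y - y.mean()))
Sxx = np.sum((x - x.mean()) ** 2)
SST = np.sum((y - y.mean()) ** 2)          # total sum of squares (Syy)
b1 = Sxy / Sxx
SSE = SST - b1 * Sxy

R2 = 1 - SSE / SST
print(R2)
# In simple linear regression, R^2 equals the squared sample correlation:
print((Sxy / np.sqrt(Sxx * SST)) ** 2)
```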