$r_{XY}$ is unaffected by changes of origin or scale: adding constants to $X$ or $Y$, or multiplying them by positive constants, does not change $r_{XY}$.
$r_{XY}$ is symmetric in $X$ and $Y$ (i.e., $r_{XY} = r_{YX}$)
$-1 \le r_{XY} \le 1$
if $r_{XY} = \pm 1$, then the observations $(x_i, y_i)$ all lie on a straight line with a positive (respectively, negative) slope
the sign of $r_{XY}$ reflects the trend of the points
a high value of $|r_{XY}|$ does not necessarily imply a causal relationship between the two variables. A causal relationship between two variables exists if the occurrence of the first causes the other.
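A minimal numpy sketch illustrating the invariance and symmetry properties above; the synthetic arrays and all constants are illustrative, not from the text:

```python
import numpy as np

# Illustrative synthetic data
rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(size=100)

# Sample correlation coefficient r_xy
r_xy = np.corrcoef(x, y)[0, 1]

# Invariance: shift by constants, scale by positive constants
r_transformed = np.corrcoef(3.0 * x + 7.0, 0.5 * y - 2.0)[0, 1]

# Symmetry: r_xy = r_yx
r_yx = np.corrcoef(y, x)[0, 1]

print(np.isclose(r_xy, r_transformed), np.isclose(r_xy, r_yx))  # True True
```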
Simple Linear Regression
Regression analysis can be used to describe the relationship between a predictor variable (or regressor) $X$ and a response variable $Y$. Assume that they are related through the model:
$$Y = \beta_0 + \beta_1 X + \varepsilon$$
where $\varepsilon$ is a random error and $\beta_0, \beta_1$ are the regression coefficients.
It is assumed that $E[\varepsilon] = 0$ and that the error’s variance $\sigma_\varepsilon^2 = \sigma^2$ is constant. Then the model can be re-written as
$$E[Y \mid X] = \beta_0 + \beta_1 X.$$
Suppose that we have observations $(x_i, y_i)$, $i = 1, \ldots, n$, so that
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, \ldots, n.$$
The aim is to find estimators $b_0, b_1$ of the unknown parameters $\beta_0, \beta_1$ in order to obtain the estimated (fitted) regression line
$$\hat{y}_i = b_0 + b_1 x_i.$$
The residual or error in predicting $y_i$ using $\hat{y}_i$ is thus
$$e_i = y_i - \hat{y}_i = y_i - b_0 - b_1 x_i, \quad i = 1, \ldots, n.$$
Sum of Squared Errors (SSE)
$$SSE = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - b_0 - b_1 x_i)^2$$
The optimal values of $b_0$ and $b_1$ are those that minimize the SSE. Setting the partial derivatives of the SSE with respect to $b_0$ and $b_1$ to zero and solving the resulting equations yields
$$b_1 = \frac{S_{xy}}{S_{xx}}, \qquad b_0 = \bar{y} - b_1 \bar{x},$$
where $S_{xx} = \sum_{i=1}^n (x_i - \bar{x})^2$, $S_{xy} = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})$, and $S_{yy} = \sum_{i=1}^n (y_i - \bar{y})^2$.
For the regression error, the unbiased estimator of $\sigma^2$ is in fact
$$\hat{\sigma}^2 = MSE = \frac{SSE}{n-2} = \frac{S_{yy} - b_1 S_{xy}}{n-2},$$
where the SSE has $n-2$ degrees of freedom, because 2 parameters had to be estimated in order to obtain $\hat{y}_i$: $b_0$ and $b_1$.
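A minimal sketch of the closed-form fit and the resulting MSE, assuming the data are paired numpy arrays (the synthetic data and all names below are illustrative):

```python
import numpy as np

# Illustrative synthetic data; any paired arrays x, y would do.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 30)
y = 1.5 + 0.8 * x + rng.normal(scale=2.0, size=30)

n = len(x)
s_xx = np.sum((x - x.mean()) ** 2)
s_xy = np.sum((x - x.mean()) * (y - y.mean()))
s_yy = np.sum((y - y.mean()) ** 2)

b1 = s_xy / s_xx               # estimated slope
b0 = y.mean() - b1 * x.mean()  # estimated intercept
sse = s_yy - b1 * s_xy         # sum of squared errors
mse = sse / (n - 2)            # unbiased estimate of sigma^2
print(b0, b1, mse)
```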
Properties of the Least Squares Estimators
$$\sum_{i=1}^n e_i = 0$$
$$\sum_{i=1}^n x_i e_i = 0$$
The regression line $\hat{y} = b_0 + b_1 x$ always passes through the sample means of $x$ and $y$, i.e., $\bar{y} = b_0 + b_1 \bar{x}$.
$$E[Y \mid X] = \beta_0 + \beta_1 X, \qquad \operatorname{Var}[Y \mid X] = \sigma^2$$
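These properties can be checked numerically; the sketch below refits the same kind of synthetic example and verifies each one:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 30)
y = 1.5 + 0.8 * x + rng.normal(scale=2.0, size=30)

s_xx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / s_xx
b0 = y.mean() - b1 * x.mean()

e = y - (b0 + b1 * x)                            # residuals
print(np.isclose(e.sum(), 0.0))                  # sum of e_i is 0
print(np.isclose((x * e).sum(), 0.0))            # sum of x_i * e_i is 0
print(np.isclose(y.mean(), b0 + b1 * x.mean()))  # line passes through the means
```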
Standard Errors
The estimated standard errors are:
$$\operatorname{se}(b_0) = \sqrt{\hat{\sigma}^2\left[\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right]} \quad \text{and} \quad \operatorname{se}(b_1) = \sqrt{\frac{\hat{\sigma}^2}{S_{xx}}}$$
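A sketch computing both estimated standard errors from the quantities defined above (synthetic data, illustrative names):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 30)
y = 1.5 + 0.8 * x + rng.normal(scale=2.0, size=30)

n = len(x)
s_xx = np.sum((x - x.mean()) ** 2)
s_xy = np.sum((x - x.mean()) * (y - y.mean()))
s_yy = np.sum((y - y.mean()) ** 2)
b1 = s_xy / s_xx
mse = (s_yy - b1 * s_xy) / (n - 2)  # estimate of sigma^2

se_b0 = np.sqrt(mse * (1.0 / n + x.mean() ** 2 / s_xx))
se_b1 = np.sqrt(mse / s_xx)
print(se_b0, se_b1)
```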
Hypothesis Testing for Linear Regression
With standard errors, we can test hypotheses on the regression parameters. We try to determine whether the true parameters $\beta_0, \beta_1$ take on specific values, and whether the line of best fit describes a bivariate dataset well. The usual procedure is to:
set up a null hypothesis $H_0$ and an alternative hypothesis $H_1$
compute a test statistic (often by some form of standardizing)
find a critical region/p-value for the test statistic under $H_0$
reject or fail to reject $H_0$ based on the critical region/p-value
Hypothesis Test for the Slope
$$H_0: \beta_1 = \beta_{1,0} \quad \text{against} \quad H_1: \beta_1 \neq \beta_{1,0}$$
Since $b_1$ is a linear function of the observed responses $y_i$, it has a normal distribution with mean $\beta_1$ and variance $\sigma^2/S_{xx}$. Therefore, under $H_0$,
$$Z_0 = \frac{b_1 - \beta_{1,0}}{\sqrt{\sigma^2/S_{xx}}} \sim N(0, 1).$$
But $\sigma^2$ is not known, so the test statistic with $\hat{\sigma}^2 = MSE$,
$$T_0 = \frac{b_1 - \beta_{1,0}}{\sqrt{\hat{\sigma}^2/S_{xx}}} \sim t(n-2),$$
follows a Student $t$-distribution with $n-2$ degrees of freedom.
| Alternative Hypothesis | Critical Region |
| --- | --- |
| $H_1: \beta_1 > \beta_{1,0}$ | $t_0 > t_\alpha(n-2)$ |
| $H_1: \beta_1 < \beta_{1,0}$ | $t_0 < -t_\alpha(n-2)$ |
| $H_1: \beta_1 \neq \beta_{1,0}$ | $\lvert t_0 \rvert > t_{\alpha/2}(n-2)$ |
Reject $H_0$ if $t_0$ falls in the critical region.
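Here is a sketch of the two-sided slope test using scipy.stats; the data, the hypothesized value $\beta_{1,0} = 0$, and $\alpha = 0.05$ are all illustrative choices:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 30)
y = 1.5 + 0.8 * x + rng.normal(scale=2.0, size=30)

n = len(x)
s_xx = np.sum((x - x.mean()) ** 2)
s_xy = np.sum((x - x.mean()) * (y - y.mean()))
s_yy = np.sum((y - y.mean()) ** 2)
b1 = s_xy / s_xx
mse = (s_yy - b1 * s_xy) / (n - 2)

beta_10 = 0.0                                 # hypothesized slope value
t0 = (b1 - beta_10) / np.sqrt(mse / s_xx)     # observed test statistic
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)  # two-sided critical value
p_value = 2 * stats.t.sf(abs(t0), df=n - 2)   # two-sided p-value
print(abs(t0) > t_crit, p_value)              # reject H0 if True
```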
Significance of Regression
Given a regression line, we may want to test whether it is significant. The test for significance of the regression is
$$H_0: \beta_1 = 0 \quad \text{against} \quad H_1: \beta_1 \neq 0;$$
this is the slope test above with $\beta_{1,0} = 0$. Rejecting $H_0$ suggests that $X$ is of value in explaining the variability in $Y$.
Confidence and Prediction Intervals for Linear Regression
We can also build confidence intervals (C.I.) for the regression parameters and prediction intervals for the predicted values.
find a point estimate $W$ for the parameter $\beta$ or the prediction $Y$
find the appropriate standard error $\operatorname{se}(W)$
select a significance level $\alpha$ and find the appropriate critical value $k_{\alpha/2}$
build the $100(1-\alpha)\%$ interval $W \pm k_{\alpha/2}\operatorname{se}(W)$
Confidence Intervals for the Intercept and Slope
The $100(1-\alpha)\%$ C.I.s for $\beta_0$ and $\beta_1$ are:
$$\beta_0: \quad b_0 \pm t_{\alpha/2}(n-2)\sqrt{\frac{\hat{\sigma}^2 \sum_{i=1}^n x_i^2}{n\,S_{xx}}}$$
$$\beta_1: \quad b_1 \pm t_{\alpha/2}(n-2)\sqrt{\frac{\hat{\sigma}^2}{S_{xx}}}$$
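A sketch of both intervals at the 95% level, i.e. $\alpha = 0.05$ (synthetic data, illustrative names):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 30)
y = 1.5 + 0.8 * x + rng.normal(scale=2.0, size=30)

n = len(x)
s_xx = np.sum((x - x.mean()) ** 2)
s_xy = np.sum((x - x.mean()) * (y - y.mean()))
s_yy = np.sum((y - y.mean()) ** 2)
b1 = s_xy / s_xx
b0 = y.mean() - b1 * x.mean()
mse = (s_yy - b1 * s_xy) / (n - 2)

t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)
half_b0 = t_crit * np.sqrt(mse * np.sum(x**2) / (n * s_xx))
half_b1 = t_crit * np.sqrt(mse / s_xx)
print("beta0:", (b0 - half_b0, b0 + half_b0))
print("beta1:", (b1 - half_b1, b1 + half_b1))
```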
Confidence Intervals for the Mean Response
With the usual $t_{\alpha/2}(n-2)$, the $100(1-\alpha)\%$ C.I. for the mean response $\mu_{Y \mid x_0}$ (or for the line of regression) is
$$\hat{\mu}_{Y \mid x_0} \pm t_{\alpha/2}(n-2)\sqrt{\hat{\sigma}^2\left[\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right]},$$
where $\hat{\mu}_{Y \mid x_0} = b_0 + b_1 x_0$.
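A sketch of this interval at an illustrative point $x_0 = 5.0$ (data and names are assumptions for the example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 30)
y = 1.5 + 0.8 * x + rng.normal(scale=2.0, size=30)

n = len(x)
s_xx = np.sum((x - x.mean()) ** 2)
s_xy = np.sum((x - x.mean()) * (y - y.mean()))
s_yy = np.sum((y - y.mean()) ** 2)
b1 = s_xy / s_xx
b0 = y.mean() - b1 * x.mean()
mse = (s_yy - b1 * s_xy) / (n - 2)

x0 = 5.0                                      # illustrative new point
mu_hat = b0 + b1 * x0                         # estimated mean response
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)
half = t_crit * np.sqrt(mse * (1.0 / n + (x0 - x.mean()) ** 2 / s_xx))
print((mu_hat - half, mu_hat + half))
```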
Analysis of Variance
The test for significance of regression,
$$H_0: \beta_1 = 0 \quad \text{against} \quad H_1: \beta_1 \neq 0,$$
can be restated in terms of the analysis of variance (ANOVA), given the following table:
| Source of Variation | Sum of Squares | df | Mean Square | $F^*$ | p-Value |
| --- | --- | --- | --- | --- | --- |
| Regression | SSR | $1$ | MSR | $MSR/MSE$ | $P(F > F^*)$ |
| Error | SSE | $n-2$ | MSE | - | - |
| Total | SST | $n-1$ | - | - | - |
The rejection region for the null hypothesis $H_0: \beta_1 = 0$ is still given by
$$|t_0| = \left|\frac{b_1 - \beta_{1,0}}{\sqrt{\hat{\sigma}^2/S_{xx}}}\right| > t_{\alpha/2}(n-2), \qquad \beta_{1,0} = 0;$$
equivalently, $H_0$ is rejected when $F^* = MSR/MSE > F_\alpha(1, n-2)$, since $F^* = t_0^2$ for this test.
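A sketch of the ANOVA decomposition and the F-test on synthetic data; SSR is computed as $b_1 S_{xy}$, consistent with $SSE = S_{yy} - b_1 S_{xy}$ above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 30)
y = 1.5 + 0.8 * x + rng.normal(scale=2.0, size=30)

n = len(x)
s_xx = np.sum((x - x.mean()) ** 2)
s_xy = np.sum((x - x.mean()) * (y - y.mean()))
b1 = s_xy / s_xx

sst = np.sum((y - y.mean()) ** 2)  # total sum of squares (S_yy)
ssr = b1 * s_xy                    # regression sum of squares
sse = sst - ssr                    # error sum of squares

msr = ssr / 1                      # mean square for regression (df = 1)
mse = sse / (n - 2)                # mean square error (df = n - 2)
f_star = msr / mse                 # F* ~ F(1, n-2) under H0
p_value = stats.f.sf(f_star, 1, n - 2)
print(f_star, p_value)
```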
Coefficient of Determination
For observations $(x_i, y_i)$, $i = 1, \ldots, n$, we define the coefficient of determination as
$$R^2 = 1 - \frac{SSE}{SST},$$
where SSE and SST are as in the ANOVA table.
The coefficient of determination is the proportion of the variability in the response that is explained by the fitted model. Note that $R^2$ always lies between 0 and 1; when $R^2 \approx 1$, the fit is considered to be very good.
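A closing sketch computing $R^2$ for the same synthetic example; in simple linear regression, $R^2$ coincides with the squared sample correlation $r_{XY}^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 30)
y = 1.5 + 0.8 * x + rng.normal(scale=2.0, size=30)

s_xx = np.sum((x - x.mean()) ** 2)
s_xy = np.sum((x - x.mean()) * (y - y.mean()))
b1 = s_xy / s_xx

sst = np.sum((y - y.mean()) ** 2)
sse = sst - b1 * s_xy
r2 = 1 - sse / sst
print(r2, np.corrcoef(x, y)[0, 1] ** 2)  # the two values agree
```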