3.1 Simple Linear Regression
Simple linear regression assumes that there is approximately a linear relationship between a predictor variable \(X\) and a quantitative response \(Y\), given by
\[
Y \approx \beta_0 + \beta_1 X.
\]
Here, \(\beta_0\) and \(\beta_1\) are the coefficients representing the intercept and slope terms in the linear model.
3.1.1 Estimating the Coefficients
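Let
\[
(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)
\]
represent \(n\) observation pairs. Our goal is to obtain coefficient estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\) such that \(y_i \approx \hat{\beta}_0 + \hat{\beta}_1 x_i\) for \(i = 1, \ldots, n\). Let \(\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\) be the prediction for \(Y\) based on the \(i\)th value of \(X\). Then \(e_i = y_i - \hat{y}_i\) represents the \(i\)th residual, and we define the residual sum of squares (RSS) as
\[
\mathrm{RSS} = e_1^2 + e_2^2 + \cdots + e_n^2 = \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right)^2.
\]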
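The least squares approach chooses \(\hat{\beta}_0\) and \(\hat{\beta}_1\) to minimize the RSS. Setting the partial derivatives of the RSS with respect to \(\hat{\beta}_0\) and \(\hat{\beta}_1\) to zero shows that the minimizers are
\[
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, \tag{3.4}
\]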
where \(\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i\) and \(\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i\) are the sample means.
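As a rough illustration, the least squares estimates can be computed directly from the sample means and the sums of deviations. The sketch below uses only the Python standard library; the function names `least_squares_fit` and `rss` and the five data points are made up for this example and do not come from the text.

```python
from statistics import fmean


def least_squares_fit(xs, ys):
    """Return (beta0_hat, beta1_hat) for simple linear regression."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    # Numerator of the slope: sum of products of deviations from the means.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator of the slope: sum of squared deviations of x.
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta1_hat = s_xy / s_xx
    # The intercept makes the fitted line pass through (x_bar, y_bar).
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat


def rss(xs, ys, beta0, beta1):
    """Residual sum of squares for the line beta0 + beta1 * x."""
    return sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, chosen only to illustrate the computation.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

beta0_hat, beta1_hat = least_squares_fit(xs, ys)
best_rss = rss(xs, ys, beta0_hat, beta1_hat)

print(f"intercept estimate: {beta0_hat:.4f}")
print(f"slope estimate:     {beta1_hat:.4f}")
print(f"RSS at estimates:   {best_rss:.4f}")
# Nudging either coefficient away from its estimate should not lower the RSS.
print(rss(xs, ys, beta0_hat + 0.1, beta1_hat) >= best_rss)
print(rss(xs, ys, beta0_hat, beta1_hat - 0.05) >= best_rss)
```

The final two checks are a quick sanity test that the estimates sit at a minimum of the RSS, not a proof.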
3.1.2 Assessing the Accuracy of the Coefficient Estimates
Recall that we assume the true relationship between \(X\) and \(Y\) takes the form \(Y = f(X) + \varepsilon\) for some unknown function \(f\), where \(\varepsilon\) is a mean-zero random error term. If \(f\) is to be approximated by a linear function, then we can write this relationship as
\[
Y = \beta_0 + \beta_1 X + \varepsilon. \tag{3.5}
\]
The error term absorbs everything the linear model misses: departures of the true relationship from linearity, omitted variables, and measurement error. We typically assume that the error term is independent of \(X\). The linear model in Equation (3.5) defines the population regression line, and the least squares estimates in Equation (3.4) characterize the least squares line.
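The difference between the population regression line and the least squares line can be seen in a small simulation. The sketch below assumes a population line with intercept 2 and slope 3, Gaussian errors with standard deviation 2 drawn independently of \(X\), and 100 observations per data set; all of these values, and the helper names, are illustrative choices and not part of the text. Each simulated data set yields a different least squares line, while the population line stays fixed.

```python
import random
from statistics import fmean


def least_squares_fit(xs, ys):
    """Return (beta0_hat, beta1_hat) for simple linear regression."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta1_hat = s_xy / s_xx
    return y_bar - beta1_hat * x_bar, beta1_hat


def simulate(n, beta0, beta1, sigma, rng):
    """Draw n pairs from Y = beta0 + beta1 * X + eps, with eps independent of X."""
    xs = [rng.uniform(-2.0, 2.0) for _ in range(n)]
    ys = [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]
    return xs, ys


# Assumed population coefficients and noise level (illustrative only).
BETA0, BETA1, SIGMA = 2.0, 3.0, 2.0
rng = random.Random(0)  # fixed seed so the run is repeatable

# Fit a least squares line to each of 1000 independent simulated data sets.
estimates = [
    least_squares_fit(*simulate(100, BETA0, BETA1, SIGMA, rng))
    for _ in range(1000)
]

first_beta0_hat, first_beta1_hat = estimates[0]
mean_beta0_hat = fmean(b0 for b0, _ in estimates)
mean_beta1_hat = fmean(b1 for _, b1 in estimates)

print(f"population line:        {BETA0:.3f} + {BETA1:.3f} x")
print(f"first least squares fit: {first_beta0_hat:.3f} + {first_beta1_hat:.3f} x")
print(f"average over all fits:   {mean_beta0_hat:.3f} + {mean_beta1_hat:.3f} x")
```

In a run like this, an individual fit generally misses the population coefficients, while the average over many fits tends to land close to them. In practice only one data set is observed, so only one least squares line is available and the population line remains unknown.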