This note introduces simple linear regression from first principles and focuses on how slope and intercept are estimated from data. The problem it solves is modeling a linear relationship between one predictor and one response. It is intended as the base layer for the later regression notes in this series.
In secondary school, we learn that the equation of a straight line is given by $y = mx + c$, where $m$ is the slope and $c$ is the y-intercept. In statistics and machine learning, we use a similar but more general form to model the relationship between a dependent variable $y$ and an independent variable $x$. This is known as simple linear regression.
The Model
In simple linear regression, we express the relationship as:
$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i,$$
where $\hat{y}_i$ is the predicted value of the dependent variable for the $i$-th observation, $\hat{\beta}_0$ is the estimated y-intercept, and $\hat{\beta}_1$ is the estimated slope coefficient. The "hat" notation indicates that these are estimates derived from data, not the true (unknown) population parameters $\beta_0$ and $\beta_1$.
The true model is assumed to be:
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,$$
where $\varepsilon_i$ represents the random error term for the $i$-th observation.
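To make the distinction between the true model and its estimates concrete, here is a minimal sketch that simulates data from the true model. The parameter values, error standard deviation, and sample size are arbitrary choices for illustration, not quantities from this note.

```python
import numpy as np

# Arbitrary "true" population parameters (illustrative assumptions only)
beta_0_true, beta_1_true = 2.0, 0.5
sigma = 1.0   # standard deviation of the error term
n = 100       # sample size

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=n)                 # independent variable
epsilon = rng.normal(0, sigma, size=n)         # random error term
y = beta_0_true + beta_1_true * x + epsilon    # y_i = beta_0 + beta_1 x_i + eps_i
```

In practice we only observe the pairs $(x_i, y_i)$; the goal of estimation is to recover $\beta_0$ and $\beta_1$ from them as well as possible.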
Ordinary Least Squares (OLS)
The question is: how do we find the best estimates $\hat{\beta}_0$ and $\hat{\beta}_1$? We need a systematic method that determines the line of best fit. The most common approach is Ordinary Least Squares (OLS).
OLS minimizes the sum of the squared differences between the observed values $y_i$ and the predicted values $\hat{y}_i$. These differences are called residuals, defined as $e_i = y_i - \hat{y}_i$. By squaring the residuals, we treat positive and negative errors equally and penalize larger deviations more heavily.
The objective function is the Sum of Squared Errors (SSE):
$$SSE = \sum^n_{i=1}(y_i - \hat{y}_i)^2 = \sum^n_{i=1}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2.$$
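As a small sketch, the objective can be written as a function of candidate coefficients. The data values below are arbitrary and only serve to show that a better-fitting candidate line yields a smaller SSE.

```python
import numpy as np

def sse(x, y, b0, b1):
    """Sum of squared errors for candidate intercept b0 and slope b1."""
    residuals = y - (b0 + b1 * x)
    return np.sum(residuals ** 2)

# Illustrative data (not estimates from any fit)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 2.9, 4.2, 4.8])

print(sse(x, y, 1.0, 1.0))   # candidate line close to the data: small SSE
print(sse(x, y, 0.0, 0.0))   # flat line at zero: much larger SSE
```

OLS chooses the pair $(\hat{\beta}_0, \hat{\beta}_1)$ that makes this quantity as small as possible.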
Deriving the Normal Equations
To minimize the SSE, we take partial derivatives with respect to $\hat{\beta}_0$ and $\hat{\beta}_1$ and set them equal to zero.
Partial derivative with respect to $\hat{\beta}_0$
$$\frac{\partial SSE}{\partial \hat{\beta}_0} = -2\sum^n_{i=1}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0.$$
Dividing both sides by $-2$ and expanding the sum:
$$\sum^n_{i=1} y_i - n\hat{\beta}_0 - \hat{\beta}_1 \sum^n_{i=1} x_i = 0.$$
Solving for $\hat{\beta}_0$:
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x},$$
where $\bar{x} = \frac{1}{n}\sum^n_{i=1}x_i$ and $\bar{y} = \frac{1}{n}\sum^n_{i=1}y_i$ are the sample means.
Partial derivative with respect to $\hat{\beta}_1$
$$\frac{\partial SSE}{\partial \hat{\beta}_1} = -2\sum^n_{i=1}x_i(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0.$$
Dividing by $-2$ and expanding:
$$\sum^n_{i=1} x_i y_i - \hat{\beta}_0 \sum^n_{i=1} x_i - \hat{\beta}_1 \sum^n_{i=1} x_i^2 = 0.$$
Substituting $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$:
$$\sum^n_{i=1} x_i y_i - (\bar{y} - \hat{\beta}_1 \bar{x})\sum^n_{i=1} x_i - \hat{\beta}_1 \sum^n_{i=1} x_i^2 = 0.$$
Using $\sum^n_{i=1} x_i = n\bar{x}$ and collecting the $\hat{\beta}_1$ terms gives $\hat{\beta}_1\left(\sum^n_{i=1}x_i^2 - n\bar{x}^2\right) = \sum^n_{i=1}x_i y_i - n\bar{x}\bar{y}$, which is equivalent to:
$$\hat{\beta}_1 = \frac{\sum^n_{i=1}(x_i - \bar{x})(y_i - \bar{y})}{\sum^n_{i=1}(x_i - \bar{x})^2}.$$
Introducing $S_{xx}$ and $S_{xy}$ Notation
To write the estimators more concisely, we define the following summary statistics:
$$S_{xx} = \sum^n_{i=1}(x_i - \bar{x})^2 = \sum^n_{i=1}x_i^2 - n\bar{x}^2,$$
$$S_{xy} = \sum^n_{i=1}(x_i - \bar{x})(y_i - \bar{y}) = \sum^n_{i=1}x_i y_i - n\bar{x}\bar{y}.$$
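The centered form and the computational shortcut on the right-hand side of each definition are algebraically identical, which is easy to confirm numerically. The data below are arbitrary illustrative values.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 2.8, 4.5, 4.1, 6.0])
n = len(x)
x_bar, y_bar = x.mean(), y.mean()

# S_xx: centered form vs computational shortcut
s_xx_centered = np.sum((x - x_bar) ** 2)
s_xx_short = np.sum(x ** 2) - n * x_bar ** 2

# S_xy: centered form vs computational shortcut
s_xy_centered = np.sum((x - x_bar) * (y - y_bar))
s_xy_short = np.sum(x * y) - n * x_bar * y_bar

print(np.isclose(s_xx_centered, s_xx_short))  # True
print(np.isclose(s_xy_centered, s_xy_short))  # True
```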
Using this notation, the OLS estimators become:
$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}},$$
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.$$
These are elegant expressions that reveal the structure of the estimates. The slope $\hat{\beta}_1$ is the ratio of the joint variability of $x$ and $y$ (captured by $S_{xy}$) to the variability of $x$ alone (captured by $S_{xx}$). The intercept $\hat{\beta}_0$ ensures the regression line passes through the point $(\bar{x}, \bar{y})$.
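Putting the pieces together, the full OLS fit is only a few lines of code. The sketch below implements the closed-form estimators on arbitrary illustrative data; the comparison against `np.polyfit` is just a cross-check, not part of the derivation.

```python
import numpy as np

def ols_fit(x, y):
    """Return (beta0_hat, beta1_hat) from the closed-form OLS formulas."""
    x_bar, y_bar = x.mean(), y.mean()
    s_xx = np.sum((x - x_bar) ** 2)              # S_xx
    s_xy = np.sum((x - x_bar) * (y - y_bar))     # S_xy
    beta1_hat = s_xy / s_xx
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 2.8, 4.5, 4.1, 6.0])

b0, b1 = ols_fit(x, y)
print(b0, b1)

# Cross-check: np.polyfit with degree 1 returns (slope, intercept)
slope, intercept = np.polyfit(x, y, deg=1)
print(intercept, slope)  # should match b0, b1
```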
Residuals and Fitted Values
Once we have $\hat{\beta}_0$ and $\hat{\beta}_1$, we can compute:
- Fitted values: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ for each observation.
- Residuals: $e_i = y_i - \hat{y}_i$ for each observation.
Two important properties of OLS residuals are worth noting:
- The residuals sum to zero: $\sum^n_{i=1} e_i = 0$.
- The residuals are orthogonal to the predictor and to the fitted values: $\sum^n_{i=1} x_i e_i = 0$ and $\sum^n_{i=1} e_i \hat{y}_i = 0$. Combined with the first property, this means the residuals are uncorrelated with the fitted values.
These properties follow directly from the normal equations.
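Both properties are easy to check numerically. The sketch below fits the line with the closed-form estimators on arbitrary illustrative data and verifies them.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 2.8, 4.5, 4.1, 6.0])

# OLS estimates from the closed-form expressions
x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

y_hat = beta0_hat + beta1_hat * x    # fitted values
e = y - y_hat                        # residuals

print(np.isclose(np.sum(e), 0.0))          # residuals sum to zero
print(np.isclose(np.sum(e * y_hat), 0.0))  # residuals orthogonal to fitted values
```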
Key Assumptions
For OLS to produce reliable estimates, the following assumptions are typically required:
- Linearity: The relationship between $x$ and $y$ is linear in the parameters.
- Independence: The observations are independent of one another.
- Homoscedasticity: The variance of the error terms $\varepsilon_i$ is constant across all values of $x$.
- Normality: The error terms are normally distributed, that is, $\varepsilon_i \sim N(0, \sigma^2)$.
When the linearity, independence, and homoscedasticity assumptions hold (together with errors that have mean zero), the Gauss-Markov theorem states that OLS produces the Best Linear Unbiased Estimators (BLUE). Normality is not required for that result; it is what justifies exact inference, such as confidence intervals and hypothesis tests on the coefficients.
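These assumptions are usually assessed informally from the residuals. The sketch below (matplotlib on synthetic data; the data-generating values are arbitrary) plots residuals against fitted values, where an even scatter around zero is consistent with linearity and homoscedasticity, alongside a histogram of the residuals as a rough check of normality.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data and OLS fit (illustrative values only)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, size=200)

x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar
y_hat = beta0_hat + beta1_hat * x
e = y - y_hat

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(y_hat, e, s=10)
ax1.axhline(0, color="grey", linewidth=1)
ax1.set_xlabel("Fitted values")
ax1.set_ylabel("Residuals")
ax1.set_title("Residuals vs fitted (linearity, homoscedasticity)")

ax2.hist(e, bins=20)
ax2.set_xlabel("Residual")
ax2.set_title("Residual distribution (normality)")
plt.tight_layout()
plt.show()
```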
Summary
In this post, we covered the fundamentals of simple linear regression:
- The model $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ describes the estimated linear relationship between $x$ and $y$.
- OLS minimizes the sum of squared errors to find the best-fitting line.
- The estimators $\hat{\beta}_1 = S_{xy}/S_{xx}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$ are derived from the normal equations.
- The $S_{xx}$ and $S_{xy}$ notation provides a compact way to express these results.
In the next post, we will extend this framework to handle multiple independent variables through multiple linear regression.