Multiple Linear Regression
Regression Analysis Series
- Simple Linear Regression
- Multiple Linear Regression (You are here)
- Polynomial Regression
- Correlation
- R-squared: SST, SSE, SSR and the Relationship with Correlation
- Standard Error in Regression
- Confidence Intervals for Regression Coefficients
- Statistical Testing in Regression
- ANOVA in Regression: SST, SSR, SSE and the F-Test
In the previous post, we explored simple linear regression with a single predictor. In practice, the response variable $y$ often depends on more than one predictor. Multiple linear regression extends the framework to accommodate $p - 1$ independent variables.
The Model
The multiple linear regression model for the $i$-th observation is:
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_{p-1} x_{i,p-1} + \varepsilon_i,$$
where $x_{ij}$ is the value of the $j$-th predictor for the $i$-th observation, $\beta_j$ are the regression coefficients, and $\varepsilon_i$ is the error term. The total number of parameters is $p$ (including the intercept $\beta_0$).
Matrix Notation
Writing out the model for each observation individually becomes cumbersome as the number of predictors grows. Matrix notation provides a compact and powerful alternative.
Define the following:
$$\mathbf{Y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad \mathbf{X} = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1,p-1} \\ 1 & x_{21} & x_{22} & \cdots & x_{2,p-1} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{n,p-1} \end{pmatrix},$$
$$\boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{pmatrix}, \quad \boldsymbol{\varepsilon} = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}.$$
The model can now be written as:
$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}.$$
Here, $\mathbf{Y}$ is an $n \times 1$ vector of responses, $\mathbf{X}$ is an $n \times p$ design matrix (the first column of ones accounts for the intercept), $\boldsymbol{\beta}$ is a $p \times 1$ vector of coefficients, and $\boldsymbol{\varepsilon}$ is an $n \times 1$ vector of errors.
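To make the notation concrete, here is a minimal NumPy sketch that builds a design matrix with an intercept column and simulates responses from the model. The sample size, number of predictors, coefficient values, and noise level are all assumptions chosen for illustration, not values from this post.

```python
import numpy as np

rng = np.random.default_rng(42)

n, p = 100, 3                                 # n observations, p parameters (intercept + 2 predictors)
predictors = rng.normal(size=(n, p - 1))      # two hypothetical predictors
X = np.column_stack([np.ones(n), predictors]) # design matrix: leading column of ones for the intercept

beta_true = np.array([1.0, 2.0, -0.5])        # assumed values of beta_0, beta_1, beta_2
eps = rng.normal(scale=1.0, size=n)           # error vector

Y = X @ beta_true + eps                       # Y = X beta + eps
print(X.shape, Y.shape)                       # (100, 3) (100,)
```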
OLS in Matrix Form
The sum of squared errors can be written in matrix form as:
$$SSE = (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}).$$
Expanding this expression (the two cross terms $\mathbf{Y}^T\mathbf{X}\boldsymbol{\beta}$ and $\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{Y}$ are equal, since each is a scalar and one is the transpose of the other, so they combine into a single term):
$$SSE = \mathbf{Y}^T\mathbf{Y} - 2\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{Y} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}.$$
To minimize, we take the derivative with respect to $\boldsymbol{\beta}$ and set it to zero:
$$\frac{\partial SSE}{\partial \boldsymbol{\beta}} = -2\mathbf{X}^T\mathbf{Y} + 2\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = \mathbf{0}.$$
This gives us the normal equations in matrix form:
$$\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^T\mathbf{Y}.$$
Provided that $\mathbf{X}^T\mathbf{X}$ is invertible (that is, the columns of $\mathbf{X}$ are linearly independent), we can solve for the OLS estimator:
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}.$$
This single formula generalizes the simple linear regression result. When $p = 2$ (one predictor plus the intercept), this reduces to $\hat{\beta}_1 = S_{xy}/S_{xx}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$ as derived in the previous post.
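Continuing the simulated example above, the sketch below computes $\hat{\boldsymbol{\beta}}$. Rather than forming $(\mathbf{X}^T\mathbf{X})^{-1}$ explicitly, it solves the normal equations directly, which is numerically preferable; `np.linalg.lstsq` is an equivalent and even more robust alternative.

```python
# OLS estimate beta_hat = (X^T X)^{-1} X^T Y, continuing the simulation above.
# Solving the normal equations directly avoids forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Equivalent, and more stable for ill-conditioned X: least squares via QR/SVD
beta_hat_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(beta_hat)                               # should be close to beta_true = [1.0, 2.0, -0.5]
print(np.allclose(beta_hat, beta_hat_lstsq))  # True
```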
Interpreting the Coefficients
Each coefficient $\hat{\beta}_j$ (for $j = 1, 2, \ldots, p - 1$) represents the estimated change in $y$ for a one-unit increase in $x_j$, while holding all other predictors constant. This "holding other variables constant" interpretation is what distinguishes multiple regression from running separate simple regressions.
The intercept $\hat{\beta}_0$ represents the estimated value of $y$ when all predictors are equal to zero.
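A quick check of the "holding other predictors constant" reading, again continuing the sketch above: increasing one predictor by a single unit while leaving the others unchanged moves the fitted value by exactly that predictor's estimated coefficient.

```python
# Bump the first predictor for one (hypothetical) observation and compare fitted values.
x_a = X[0].copy()
x_b = X[0].copy()
x_b[1] += 1.0                                 # increase x_1 by one unit; intercept and x_2 unchanged

change = x_b @ beta_hat - x_a @ beta_hat
print(np.isclose(change, beta_hat[1]))        # True: the fitted value moves by exactly beta_hat_1
```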
The Hat Matrix
The vector of fitted values is:
$$\hat{\mathbf{Y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} = \mathbf{H}\mathbf{Y},$$
where $\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ is called the hat matrix. It "puts a hat on" $\mathbf{Y}$, transforming observed values into fitted values. The hat matrix is symmetric ($\mathbf{H}^T = \mathbf{H}$) and idempotent ($\mathbf{H}^2 = \mathbf{H}$).
The residual vector is:
$$\mathbf{e} = \mathbf{Y} - \hat{\mathbf{Y}} = (\mathbf{I} - \mathbf{H})\mathbf{Y}.$$
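The sketch below (still using the simulated data from earlier) builds $\mathbf{H}$ explicitly and verifies its symmetry and idempotency, along with the orthogonality of the residuals to the columns of $\mathbf{X}$. Note that $\mathbf{H}$ is $n \times n$, so constructing it directly is only sensible for small datasets.

```python
# Hat matrix H = X (X^T X)^{-1} X^T, continuing the sketch above.
H = X @ np.linalg.inv(X.T @ X) @ X.T

Y_hat = H @ Y                                 # fitted values, H Y
e = Y - Y_hat                                 # residuals, (I - H) Y

print(np.allclose(H, H.T))                    # symmetric
print(np.allclose(H @ H, H))                  # idempotent
print(np.allclose(X.T @ e, 0))                # residuals are orthogonal to the columns of X
```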
Assumptions
The assumptions for multiple linear regression extend those of simple linear regression:
- Linearity: $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, the model is linear in the parameters.
- Full rank: The design matrix $\mathbf{X}$ has full column rank, meaning $\text{rank}(\mathbf{X}) = p$. This ensures $\mathbf{X}^T\mathbf{X}$ is invertible.
- Exogeneity: $E(\boldsymbol{\varepsilon} | \mathbf{X}) = \mathbf{0}$, the errors have zero conditional mean.
- Homoscedasticity: $\text{Var}(\boldsymbol{\varepsilon} | \mathbf{X}) = \sigma^2\mathbf{I}_n$, the errors have constant variance and are uncorrelated.
- Normality (for inference): $\boldsymbol{\varepsilon} \sim N(\mathbf{0}, \sigma^2\mathbf{I}_n)$.
When the first four assumptions (linearity, full rank, exogeneity, and homoscedasticity) hold, the Gauss-Markov theorem guarantees that $\hat{\boldsymbol{\beta}}$ is the Best Linear Unbiased Estimator (BLUE).
Connection to Simple Linear Regression
In the special case where $p = 2$, we have a single predictor $x$ and the design matrix becomes:
$$\mathbf{X} = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}.$$
Computing $(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$ in this case yields the familiar results $\hat{\beta}_1 = S_{xy}/S_{xx}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$, confirming that the matrix formulation is a true generalization.
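This equivalence is easy to confirm numerically. The self-contained sketch below simulates a single-predictor dataset (the coefficients and noise level are assumptions for illustration) and checks that the matrix formula reproduces the textbook simple-regression estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 3.0 + 1.5 * x + rng.normal(scale=0.5, size=50)    # assumed single-predictor data

X1 = np.column_stack([np.ones_like(x), x])            # design matrix with intercept column
b = np.linalg.solve(X1.T @ X1, X1.T @ y)              # matrix-form OLS

S_xy = np.sum((x - x.mean()) * (y - y.mean()))
S_xx = np.sum((x - x.mean()) ** 2)
beta1 = S_xy / S_xx                                   # simple-regression slope
beta0 = y.mean() - beta1 * x.mean()                   # simple-regression intercept

print(np.allclose(b, [beta0, beta1]))                 # True
```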
Summary
In this post, we extended the regression framework to handle multiple predictors:
- The model $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ uses matrix notation to express the relationship compactly.
- The OLS estimator $\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$ generalizes the simple linear regression solution.
- Each $\hat{\beta}_j$ measures the effect of one predictor while holding the others constant.
- The hat matrix $\mathbf{H}$ projects observed values onto fitted values.
In the next post, we will explore polynomial regression, which is a special case of multiple linear regression where the predictors are powers of a single variable.