1. What Is Linear Regression?

In a regression problem we have a dataset \( (x_1, y_1), (x_2, y_2), \cdots \) and we wish to predict a quantitative (numeric) target \( y \) using \( x \). One of the simplest models is the linear model. Simple linear regression is a straightforward approach for predicting the target \( y \) on the basis of a single predictor variable \( x \),

\begin{equation*} \widehat{y} = \theta_0 + \theta_1 x, \end{equation*}
such that the actual target is given by
\begin{equation*} y = \theta_0 + \theta_1 x + \epsilon, \end{equation*}
where \( \epsilon \) is a mean-zero random error term (usually assumed to follow a Gaussian distribution). The simple linear regression model is determined by \( (\theta_0, \theta_1) \).

Simple linear regression is a useful approach for predicting a response on the basis of a single predictor variable. In practice, however, we often have more than one predictor, such as \( (X_1, y_1), (X_2, y_2), \cdots \), where \( X_i = (x^i_1, x^i_2, \cdots, x^i_p) \). We can then build a multiple linear regression with \( p \) predictors

\begin{equation*} \widehat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 +\cdots + \theta_p x_p \end{equation*}
The multiple linear regression model is determined by \( (\theta_0, \theta_1,\cdots, \theta_p) \). Next we ask: how can we determine the values of the \( \theta \)'s?

2. Cost (Loss) Function for Linear Regression

Before discussing how to determine the values of the \( \theta \)'s, we need to know how to evaluate our models. Given a regression model, we can evaluate the error \( (y_i-\widehat{y}_i) \) (called the residual) for the \( i \)-th data point. Of course, we wish our predictions to have errors as small as possible (without overfitting). Linear regression therefore uses a very straightforward cost (loss) function

\begin{equation*} C = \frac{1}{n}\sum^n_{i=1} (y_i-\widehat{y}_i)^2 \end{equation*}
and our goal is to MINIMIZE \( C \); this is the least-squares criterion. Here we assume there are \( n \) samples in the data. The larger \( C \) is, the worse the predictions and the model. In the following sections, we introduce several approaches to minimize \( C \). One common metric for regression problems, the root-mean-square error, is simply \( \sqrt{C} \).
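To make this concrete, here is a minimal NumPy sketch of computing the cost \( C \) and the corresponding root-mean-square error for a set of predictions (the numbers are made up for illustration):

import numpy as np

# toy data: true targets and model predictions (illustrative values)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.4])

# mean squared error cost C and its square root (RMSE)
C = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(C)
print(C, rmse)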

2.1 Coefficients for simple linear regression

For one predictor, \( \widehat{y}=\theta_0+\theta_1 x \), the cost function becomes

\begin{equation*} C = \frac{1}{n}\sum^n_{i=1} \Big(y_i-(\theta_0 + \theta_1 x_i) \Big)^2 \end{equation*}
To minimize \( C \), we take the derivative of \( C \) with respect to \( \theta_0 \) and \( \theta_1 \) respectively and set each derivative equal to zero. For \( \theta_0 \),
\begin{equation*} \frac{\partial C}{\partial \theta_0}=-\frac{2}{n}\sum^n_{i=1} ( y_i- \theta_0 - \theta_1 x_i ) = 0, \end{equation*}
we can easily obtain the optimal \( \theta_0 \)
\begin{equation*} \theta^*_0 = \frac{1}{n}\sum^n_{i=1} ( y_i - \theta_1 x_i ) = \bar{y} - \theta_1 \bar{x}. \end{equation*}
On the other hand, for \( \theta_1 \), we have
\begin{equation*} \frac{\partial C}{\partial \theta_1}=-\frac{2}{n}\sum^n_{i=1} ( y_i- \theta_0 - \theta_1 x_i )x_i = 0, \end{equation*}
Substituting \( \theta^*_0 = \bar{y} - \theta_1 \bar{x} \) gives
\begin{equation*} \sum^n_{i=1} \Big( y_i- (\bar{y} - \theta_1 \bar{x}) - \theta_1 x_i \Big) x_i = \sum^n_{i=1} \Big[ \Big( ( y_i- \bar{y}) -\theta_1(x_i-\bar{x}) \Big) (x_i -\bar{x}) + \Big( ( y_i- \bar{y}) -\theta_1(x_i-\bar{x}) \Big)\bar{x} \Big] = 0. \end{equation*}
Note that the term \( \sum^n_{i=1 }( \cdots )\bar{x} \) is actually equal to zero, since \( \sum^n_{i=1}(y_i-\bar{y}) = 0 \) and \( \sum^n_{i=1}(x_i-\bar{x}) = 0 \). Therefore
\begin{equation*} \theta^*_1 = \frac{ \sum^n_{i=1} ( y_i- \bar{y})(x_i -\bar{x})} {\sum^n_{i=1}(x_i -\bar{x})^2}, \end{equation*}
the optimal \( \theta_1 \) is the sample covariance of \( x \) and \( y \) divided by the sample variance of \( x \) (equivalently, Pearson's correlation coefficient between \( x \) and \( y \) multiplied by \( s_y/s_x \)). However, this closed-form approach only works for simple linear regression. For multiple regression we have \( p \) predictors and a correspondingly larger system of equations to solve by hand, which is impractical for large \( p \). So we next ask: what other approaches work when there is more than one predictor?
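As a quick sanity check, the closed-form coefficients can be computed directly; the following is a minimal NumPy sketch with illustrative data:

import numpy as np

# toy one-predictor dataset (illustrative values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_bar, y_bar = x.mean(), y.mean()

# optimal slope: covariance of x and y divided by variance of x
theta_1 = np.sum((y - y_bar) * (x - x_bar)) / np.sum((x - x_bar) ** 2)
# optimal intercept: theta_0 = y_bar - theta_1 * x_bar
theta_0 = y_bar - theta_1 * x_bar
print(theta_0, theta_1)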

2.2 Gradient descent

For multiple linear regression, the cost function is

\begin{equation*} C = \frac{1}{n}\sum^n_{i=1} \Big(y_i-(\theta_0 + \theta_1 x^i_1 + \theta_2 x^i_2+ \cdots + \theta_p x^i_p) \Big)^2 \end{equation*}
Instead of solving for \( \theta \) directly, gradient descent iteratively updates \( \theta \) to minimize \( C \) using the gradient of \( C \), i.e. \( \partial C/\partial \theta \). Each update can be written as
\begin{equation*} \theta_k := \theta_k - \alpha \frac{\partial C}{\partial \theta_k} \end{equation*}
where \( \alpha \) is the learning rate. Choosing the value of \( \alpha \) is something of an art: if \( \alpha \) is too large, gradient descent may not converge, and if it is too small, convergence can be painfully slow.

The update rule can be easily understood as follows. Reducing to a one-dimensional problem, \( C=f(\theta) \), the derivative \( \partial C/\partial \theta \) is just the slope of the curve \( C \) at \( \theta \). A positive slope means the minimum lies to the left, so we should decrease the value of \( \theta \); a negative slope means the minimum lies to the right, so we should increase it. In both cases, subtracting \( \alpha \) times the slope moves \( \theta \) in the right direction. Thus, for the multiple regression model, the parameters \( \theta_k \) are updated as

\begin{equation*} \theta_k := \theta_k + \alpha \Big( \frac{2}{n}\sum^n_{i=1} (y_i - \widehat{y}_i)x^i_k\Big). \end{equation*}
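Below is a minimal NumPy sketch of this update loop; the data, learning rate, and iteration count are illustrative choices, and a column of ones is prepended to the predictor matrix so that the same update also covers the intercept \( \theta_0 \):

import numpy as np

def gradient_descent(X, y, alpha=0.01, n_iter=1000):
    """Batch gradient descent for linear regression.

    X is an (n, p) predictor matrix, y an (n,) target vector.
    A column of ones is prepended so theta[0] is the intercept.
    """
    n = X.shape[0]
    Xb = np.hstack([np.ones((n, 1)), X])        # add bias column
    theta = np.zeros(Xb.shape[1])               # initial guess
    for _ in range(n_iter):
        y_hat = Xb @ theta
        grad = -(2.0 / n) * Xb.T @ (y - y_hat)  # dC/dtheta
        theta = theta - alpha * grad            # the update rule above
    return theta

# toy example (illustrative values)
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([6.0, 5.0, 12.0, 11.0])
print(gradient_descent(X, y, alpha=0.05, n_iter=5000))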

2.3 Normal equation

The third approach is to solve for the model parameters exactly. Assume we have a dataset with \( n \) records and \( p \) predictors; for the \( i \)-th record we have

\begin{equation*} y_i = \theta_0 + \theta_1 x^i_1 + \theta_2 x^i_2 + \cdots + \theta_p x^i_p \end{equation*}
We can rewrite the relation in a more elegant way
\begin{equation*} {\bf y} = {\bf X} \theta, \end{equation*}
where \( {\bf y} \) and \( {\bf \theta} \) are \( n \) and \( (p+1) \)-dimensional vectors, respectively, and \( {\bf X} \) is an \( n\times (p+1) \) matrix, such that
\begin{equation*} {\bf y}^T = (y_1, y_2, \cdots, y_n), \\ {\bf \theta}^T = (\theta_0, \theta_1, \cdots, \theta_p), \\ {\bf X} = \begin{bmatrix} 1 & x^1_{1} & x^1_{2} & \dots & x^1_{p} \\ 1 & x^2_{1} & x^2_{2} & \dots & x^2_{p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x^n_{1} & x^n_{2} & \dots & x^n_{p} \end{bmatrix}. \end{equation*}
Minimizing the cost with respect to \( {\bf \theta} \) and setting the gradient to zero gives the normal equation \( {\bf X}^T {\bf X}\, {\bf \theta} = {\bf X}^T {\bf y} \), so that
\begin{equation*} {\bf \theta} = \Big( {\bf X}^T {\bf X} \Big)^{-1} {\bf X}^T {\bf y} \end{equation*}
With the normal equation, we do not have to worry about choosing \( \alpha \) or the number of gradient descent iterations. The disadvantage, however, is the limit on the number of predictors: \( {\bf X}^T{\bf X} \) is a \( (p+1)\times (p+1) \) matrix, and inverting it becomes increasingly expensive as \( p \) grows. Gradient descent, on the other hand, places no such constraint on the number of predictors. The normal equation is nevertheless commonly used in collaborative filtering, since there \( p \) (the number of latent factors) is at most a few hundred.
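A minimal NumPy sketch of the normal equation with illustrative data; np.linalg.solve is used rather than forming the inverse explicitly, which is numerically preferable:

import numpy as np

# toy data (illustrative values)
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([6.0, 5.0, 12.0, 11.0])

n = X.shape[0]
Xb = np.hstack([np.ones((n, 1)), X])   # prepend the column of ones

# solve (X^T X) theta = X^T y
theta = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
print(theta)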

3. Metrics for Evaluating Regression

3.1 RMSE

The first common metric for regression is RMSE, the root-mean-square error:

\begin{equation*} \textrm{RMSE} = \sqrt{\frac{1}{n}\sum^n_{i=1}(y_i-\widehat{y}_i)^2}. \end{equation*}
A smaller RMSE indicates a better regression fit.

3.2 \( R^2 \)

Before describing what \( R^2 \) is, let's introduce

\begin{equation*} \textrm{TSS} = \sum^n_{i=1} (y_i-\bar{y})^2, \\ \textrm{RSS} = \sum^n_{i=1} (\widehat{y}_i-y_i)^2, \\ \textrm{RegSS} = \sum^n_{i=1} (\widehat{y}_i-\bar{y})^2. \end{equation*}
\( \textrm{TSS} \) is the total sum of squares, measuring the total variance in the response \( y \). \( \textrm{RSS} \), the residual sum of squares, measures the amount of variability that is NOT explained after performing the regression, whereas \( \textrm{RegSS} \), the regression sum of squares, measures the variance that IS explained by the regression. \( R^2 \) measures the proportion of variability in \( y \) that can be explained using the predictors \( {\bf X} \):
\begin{equation*} R^2 = \frac{\textrm{RegSS}}{\textrm{TSS}} = \frac{\textrm{TSS}-\textrm{RSS}}{\textrm{TSS}} = 1-\frac{\textrm{RSS}}{\textrm{TSS}}. \end{equation*}
An \( R^2 \) statistic close to 1 indicates that a large proportion of the variability in the response has been explained by the regression, whereas a value near 0 indicates that the regression did not explain much of that variability.
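A minimal NumPy sketch of computing \( R^2 \) from the sums of squares (the numbers are illustrative):

import numpy as np

# toy targets and predictions (illustrative values)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.4])

tss = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
rss = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
r2 = 1.0 - rss / tss
print(r2)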

3.3 Adjusted \( R^2 \)

There is a big issue with relying on \( R^2 \) alone. Every time you add a predictor to a model, the R-squared increases, even if only by chance; it never decreases. Consequently, a model with more terms may appear to have a better fit simply because it has more terms.

The adjusted R-squared is a modified version of R-squared that accounts for the number of predictors in the model. The adjusted R-squared increases only if a new term improves the model more than would be expected by chance, and it decreases when a predictor improves the model by less than expected by chance. The adjusted R-squared can be negative, but it usually is not, and it is always lower than the R-squared. For details, see the blog post Multiple Regression Analysis: Use Adjusted R-Squared and Predicted R-Squared to Include the Correct Number of Variables.
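For reference, the usual formula, with \( n \) observations and \( p \) predictors, is
\begin{equation*} R^2_{adj} = 1 - (1-R^2)\frac{n-1}{n-p-1}. \end{equation*}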

4. Hypothesis Test for Regression Model Coefficients

The estimated coefficients are subject to sampling variability: a different sample would yield different estimates. In simple regression, the standardized estimate of \( \theta_1 \) follows a t-distribution, and we can state the null hypothesis:

\begin{equation*} H_0: \ \theta_1 =0, \end{equation*}
meaning that there is no linear relationship between the predictor and the target. The alternative hypothesis is
\begin{equation*} H_a: \ \theta_1 \ne 0. \end{equation*}
We can compute the t-statistic
\begin{equation*} t_{stat} = \frac{\theta_1-0}{\textrm{SE}(\theta_1)}, \qquad \textrm{SE}(\theta_1) = \sqrt{\frac{\sigma^2}{\sum^n_{i=1}(x_i-\bar{x})^2}}, \end{equation*}
where \( \sigma^2 \) is the variance of the error term (estimated from the residuals), and the corresponding p-value to determine whether \( \theta_1 \) deviates significantly from \( 0 \). For a small p-value we reject the null hypothesis and conclude that there is a relationship between the predictor and the target.
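As an illustration, here is a minimal sketch of this test using NumPy and SciPy, with \( \sigma^2 \) estimated from the residuals; all data values are made up:

import numpy as np
from scipy import stats

# toy one-predictor dataset (illustrative values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])
n = len(x)

# fit the simple regression and estimate the residual variance
theta_1 = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
theta_0 = y.mean() - theta_1 * x.mean()
residuals = y - (theta_0 + theta_1 * x)
sigma2 = np.sum(residuals ** 2) / (n - 2)          # unbiased estimate of sigma^2

# standard error of theta_1, t-statistic, and two-sided p-value
se_theta_1 = np.sqrt(sigma2 / np.sum((x - x.mean()) ** 2))
t_stat = theta_1 / se_theta_1
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
print(t_stat, p_value)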

For multiple linear regression, the null hypothesis is

\begin{equation*} H_0: \theta_1=\theta_2=\cdots=\theta_p =0. \end{equation*}
and the alternative hypothesis is
\begin{equation*} H_a: \textrm{At least one of } \theta_k \ne 0. \end{equation*}
This hypothesis test is performed by computing the F-statistic
\begin{equation*} F_{stat} = \frac{\textrm{RegSS}/p}{\textrm{RSS}/(n-p-1)} = \Big( \frac{n-p-1}{p}\Big) \frac{R^2}{1-R^2}, \end{equation*}
the ratio of the variance explained by the regression to the variance left unexplained, each normalized by its degrees of freedom. Under the null hypothesis it follows an F-distribution with degrees of freedom \( \textrm{df}_1=p \) and \( \textrm{df}_2=n-p-1 \). From the above equation we can see that a higher \( R^2 \) and more observations (larger \( n \)) both lead to a larger \( F_{stat} \).

We then look up the critical value of \( F_{stat} \) corresponding to the significance level \( \alpha=0.05 \) and determine whether the result is statistically significant (i.e. whether the predictors are significantly different from 0). For example, with 10 observations in a simple regression, \( \textrm{df}_1=1 \) and \( \textrm{df}_2=8 \), and the critical value for statistical significance is about 5.3. For \( \textrm{df}_1=3 \) and \( \textrm{df}_2=36 \), the critical value is about 2.9. For more detail, head to Understanding Analysis of Variance (ANOVA) and the F-test.
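A minimal SciPy sketch of the F-test given \( n \), \( p \), and \( R^2 \) (the numbers are illustrative):

import numpy as np
from scipy import stats

# suppose a multiple regression with p predictors on n observations
# produced the following R^2 (illustrative values)
n, p, r2 = 40, 3, 0.35

f_stat = ((n - p - 1) / p) * r2 / (1 - r2)
f_crit = stats.f.ppf(0.95, dfn=p, dfd=n - p - 1)   # critical value at alpha = 0.05
p_value = stats.f.sf(f_stat, dfn=p, dfd=n - p - 1)
print(f_stat, f_crit, p_value)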

5. Feature Selection

5.1 Stepwise feature selection

Best subset selection considers all \( 2^p \) possible combinations of predictors. Forward stepwise selection starts from the null model and at each step adds the single predictor that yields the lowest RMSE (or highest \( R^2 \)). Backward stepwise selection starts from the full linear regression model with all \( p \) predictors and removes one predictor at each step; it requires \( n > p \), so for very large \( p \) only forward stepwise selection is feasible. Neither stepwise method is guaranteed to yield the best model containing a subset of the \( p \) predictors.
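As an illustration, here is a minimal NumPy sketch of forward stepwise selection; fit_rss and forward_stepwise are hypothetical helper names, the data are made up, and in practice the nested models returned would then be compared with adjusted \( R^2 \), cross-validation, or a similar criterion:

import numpy as np

def fit_rss(X, y, cols):
    """Fit OLS on the selected columns and return the residual sum of squares."""
    Xb = np.hstack([np.ones((len(y), 1)), X[:, cols]])
    theta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    residuals = y - Xb @ theta
    return np.sum(residuals ** 2)

def forward_stepwise(X, y):
    """Greedily add the predictor that lowers RSS the most at each step."""
    remaining = list(range(X.shape[1]))
    selected = []
    path = []                      # best subset of each size
    while remaining:
        best = min(remaining, key=lambda j: fit_rss(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
        path.append(list(selected))
    return path

# toy data (illustrative values): y depends mainly on the first two columns
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=50)
print(forward_stepwise(X, y))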

5.2 Regularization

Another way to control model complexity is regularization, which adds to the cost function a penalty on the size of the coefficients. L2 regularization (ridge regression) penalizes the sum of squared coefficients and shrinks them toward zero; L1 regularization (lasso) penalizes the sum of their absolute values and tends to drive some coefficients exactly to zero, so it also performs a form of feature selection.
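With a regularization strength \( \lambda \ge 0 \) (the intercept \( \theta_0 \) is conventionally left unpenalized), the two penalized cost functions can be written, up to the exact scaling convention for \( \lambda \), as
\begin{equation*} C_{L2} = \frac{1}{n}\sum^n_{i=1} (y_i-\widehat{y}_i)^2 + \lambda \sum^p_{k=1} \theta_k^2, \qquad C_{L1} = \frac{1}{n}\sum^n_{i=1} (y_i-\widehat{y}_i)^2 + \lambda \sum^p_{k=1} |\theta_k|. \end{equation*}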

6. Alternative Aspect of LR: Maximum Likelihood Estimation