where \(\beta_{0}\) and \(\beta_{1}\) are two unknown constants that represent the intercept and slope, also known as coefficients or parameters, and \(\epsilon\) is the error term.
The least squares approach chooses \(\hat{\beta_{0}}\) and \(\hat{\beta_{1}}\) to minimize the RSS.
• Let \(\hat{y}_{i} = \hat{\beta}_{0} + \hat{\beta}_{1}x_{i}\) be the prediction for \(Y\) based on the \(i^{th}\) value of \(X\). Then \(e_{i} = y_{i} - \hat{y}_{i}\) represents the \(i^{th}\) residual.
• We define the residual sum of squares (RSS) as:
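\[
\mathrm{RSS} = e_{1}^{2} + e_{2}^{2} + \cdots + e_{n}^{2} = \sum_{i=1}^{n}\left(y_{i} - \hat{\beta}_{0} - \hat{\beta}_{1}x_{i}\right)^{2}
\]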
The least squares approach chooses \(\hat{\beta}_{0}\) and \(\hat{\beta}_{1}\) to minimize the RSS. The minimizing values can be shown to be:
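\[
\hat{\beta}_{1} = \frac{\sum_{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}, \qquad \hat{\beta}_{0} = \bar{y} - \hat{\beta}_{1}\bar{x},
\]
where \(\bar{x}\) and \(\bar{y}\) are the sample means of \(X\) and \(Y\).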
Estimation of the parameters by least squares
Advertising example
The least squares fit for the regression of sales onto TV. In this case a linear fit captures the essence of the relationship.
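A minimal sketch of this fit in Python with statsmodels, assuming the usual Advertising data in a file named Advertising.csv with columns TV, radio, newspaper, and sales (the file name and column names are assumptions):

```python
import pandas as pd
import statsmodels.formula.api as smf

adv = pd.read_csv("Advertising.csv")         # assumed columns: TV, radio, newspaper, sales
fit = smf.ols("sales ~ TV", data=adv).fit()  # least squares estimates of beta0 and beta1
print(fit.params)                            # intercept and slope
print(fit.summary())                         # coefficients, standard errors, t-statistics
```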
Assessment of regression model
Standard errors
Using the standard errors and the estimated coefficients, a \(t\)-statistic can be computed to assess statistical significance:
\(t = \hat{\beta}_{j}/\mathrm{SE}(\hat{\beta}_{j})\), which tests the null hypothesis \(H_{0}: \beta_{j} = 0\).
Assessment of regression model
Standard errors can be used to compute confidence intervals. A 95% confidence interval is defined as a range of values such that with 95% probability, the range will contain the true unknown value of the parameter. It has the form:
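\[
\hat{\beta}_{1} \pm 2\cdot \mathrm{SE}(\hat{\beta}_{1}),
\]
i.e., approximately two standard errors on either side of the estimate (the exact multiplier is a quantile of a \(t\) distribution).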
Confidence intervals are important for both frequentists and Bayesians in determining whether a predictor is "significant" or important in predicting \(y\).
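A small self-contained sketch (synthetic data, illustrative parameter values) showing how the standard errors, t-statistics, and 95% confidence intervals are read off a fitted OLS model in statsmodels:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=200)  # true beta0 = 2, beta1 = 0.5

X = sm.add_constant(x)             # add the intercept column
fit = sm.OLS(y, X).fit()

print(fit.bse)                     # standard errors SE(beta_hat)
print(fit.tvalues, fit.pvalues)    # t = beta_hat / SE(beta_hat) and two-sided p-values
print(fit.conf_int(alpha=0.05))    # 95% confidence intervals for each coefficient
```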
Assessment of regression model
Model fit:
Residual standard error (RSE)
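For simple linear regression, \(\mathrm{RSE} = \sqrt{\mathrm{RSS}/(n-2)}\); it estimates the standard deviation of the error term \(\epsilon\) (with \(p\) predictors, the denominator becomes \(n-p-1\)).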
Multiple Linear regression
Multiple regression extends the model to multiple predictors \(X_{1}, X_{2}, \ldots, X_{p}\), where \(p\) is the number of predictors:
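\[
Y = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + \cdots + \beta_{p}X_{p} + \epsilon
\]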
We interpret \(\beta_{j}\) as the average effect on Y of a one unit increase in \(X_{j}\), holding all other predictors fixed. In the advertising example, the model becomes:
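\[
\mathrm{sales} = \beta_{0} + \beta_{1}\times\mathrm{TV} + \beta_{2}\times\mathrm{radio} + \beta_{3}\times\mathrm{newspaper} + \epsilon
\]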
Each coefficient is estimated and tested separately, provided that there are no correlations among the \(X\)'s.
Interpretations such as "a unit change in \(X_{j}\) is associated with a \(\beta_{j}\) change in \(Y\), while all the other variables stay fixed" are possible.
Correlations amongst predictors cause problems:
The variance of all coefficients tends to increase, sometimes dramatically
Interpretations become hazardous: when \(X_{j}\) changes, everything else tends to change with it.
Multiple Linear regression
Problem of Multicollinearity
Claims of causality should be avoided for observational data.
Multiple Linear regression
Hyperplane: multiple (linear) regression
Multiple (linear) regression
Advertising example
With current computing power, we can test all subsets or combinations of the predictors in the regression models.
This is costly, since there are \(2^{p}\) of them; even a moderate number of predictors yields millions or billions of candidate models!
New software packages have been developed to perform such exhaustive, replicative modeling, for example:
A Stata package mrobust by Young & Kroeger (2013)
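A minimal best-subset sketch with itertools and statsmodels; it is feasible only for small \(p\), since the number of subsets grows as \(2^{p}\). The DataFrame df, the response name "y", and the use of BIC to compare subsets of different sizes are illustrative assumptions:

```python
from itertools import combinations
import statsmodels.api as sm

def best_subset(df, response="y"):
    predictors = [c for c in df.columns if c != response]
    best_fit, best_vars = None, None
    for k in range(1, len(predictors) + 1):
        for subset in combinations(predictors, k):         # 2^p - 1 candidate models
            X = sm.add_constant(df[list(subset)])
            fit = sm.OLS(df[response], X).fit()
            if best_fit is None or fit.bic < best_fit.bic:  # penalized criterion across sizes
                best_fit, best_vars = fit, subset
    return best_vars, best_fit
```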
Which predictors are important?
Begin with the null model — a model that contains an intercept but no predictors.
Fit \(p\) simple linear regressions and add to the null model the variable that results in the lowest RSS.
Add to that model the variable that results in the lowest RSS amongst all two-variable models.
Continue until some stopping rule is satisfied, for example when all remaining variables have a p-value above some threshold (a minimal code sketch follows below).
Selection methods: Forward selection
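A minimal forward-selection sketch of the steps above, under these assumptions: a pandas DataFrame df whose columns other than the response "y" are candidate predictors, and a p-value stopping rule (the names and the 0.05 threshold are illustrative):

```python
import statsmodels.api as sm

def forward_select(df, response="y", p_threshold=0.05):
    remaining = [c for c in df.columns if c != response]
    selected = []
    while remaining:
        # Fit one candidate model per remaining variable; keep the one with the lowest RSS.
        rss = {v: sm.OLS(df[response], sm.add_constant(df[selected + [v]])).fit().ssr
               for v in remaining}
        best = min(rss, key=rss.get)
        fit = sm.OLS(df[response], sm.add_constant(df[selected + [best]])).fit()
        if fit.pvalues[best] > p_threshold:   # stopping rule: new variable not significant
            break
        selected.append(best)
        remaining.remove(best)
    return selected
```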
Start with all variables in the model.
Remove the variable with the largest p-value, that is, the variable that is the least statistically significant.
The new (\(p\) − 1)-variable model is fit, and the variable with the largest p-value is removed.
Continue until a stopping rule is reached. For instance, we may stop when all remaining variables have a p-value below some significance threshold (a minimal code sketch follows below).
Selection methods: Backward selection
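The mirror-image sketch for backward elimination, under the same illustrative assumptions (DataFrame df, response "y", 0.05 threshold):

```python
import statsmodels.api as sm

def backward_select(df, response="y", p_threshold=0.05):
    selected = [c for c in df.columns if c != response]   # start with all variables
    while selected:
        fit = sm.OLS(df[response], sm.add_constant(df[selected])).fit()
        pvals = fit.pvalues.drop("const")                 # ignore the intercept
        worst = pvals.idxmax()                            # least significant variable
        if pvals[worst] <= p_threshold:                   # all remaining are significant: stop
            break
        selected.remove(worst)
    return selected
```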
Mallow’s \(C_{p}\)
Akaike information criterion (AIC)
Bayesian information criterion (BIC)
Cross-validation (CV).
New selection methods
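In one common formulation, for a model with \(d\) fitted parameters, maximized likelihood \(\hat{L}\), and error-variance estimate \(\hat{\sigma}^{2}\):
\[
\mathrm{AIC} = 2d - 2\log\hat{L}, \qquad \mathrm{BIC} = d\log n - 2\log\hat{L}, \qquad C_{p} = \frac{1}{n}\left(\mathrm{RSS} + 2d\hat{\sigma}^{2}\right).
\]
Cross-validation instead estimates the test error directly by refitting the model on held-out folds.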
Categorical variables
Dichotomous variables: coded (1, 0)
Multi-class variables: coded e.g. (1, 2, 3)
Leave one category as the base category for comparison:
e.g. race, party
Qualitative variables
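A minimal sketch of dummy coding with pandas and statsmodels; the small DataFrame, the column names, and the party labels are purely illustrative:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "party": ["Dem", "Rep", "Ind", "Dem", "Rep", "Ind", "Dem", "Rep"],
    "y":     [3.1,   2.4,   2.8,   3.3,   2.2,   2.9,   3.0,   2.5],
})

# Explicit 0/1 dummies, dropping one level to serve as the base category.
dummies = pd.get_dummies(df["party"], drop_first=True)

# Or let the formula interface do it: C(party) treats the variable as
# categorical and automatically omits a base level.
fit = smf.ols("y ~ C(party)", data=df).fit()
print(fit.params)   # each coefficient is a contrast against the base category
```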
Treatment of two highly correlated variables
Multiplicative term → interaction term
Useful for the multicollinearity problem
Hard to interpret and could lead to other problems
Hierarchy principle:
If we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant.
Interaction term/variables
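A minimal sketch of an interaction term with the formula interface; following the hierarchy principle, TV * radio expands to the main effects plus their product (the Advertising.csv file and column names are assumptions, as above):

```python
import pandas as pd
import statsmodels.formula.api as smf

adv = pd.read_csv("Advertising.csv")
fit = smf.ols("sales ~ TV * radio", data=adv).fit()  # TV + radio + TV:radio
print(fit.params)                                    # Intercept, TV, radio, TV:radio
```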
Nonlinear effects
Linear regression is a good starting point for predictive analytics