In practice, models typically use multiple explanatory variables so that the unique contribution of each can be isolated. A k-variate regression model enables the coefficients to measure the distinct contribution of each explanatory variable to the variation in the dependent variable.
Multiple regression is regression analysis with more than one independent variable. The general form of a multiple regression can be written as
Y_i = α + β_1 X_{1i} + β_2 X_{2i} + ... + β_k X_{ki} + ϵ_i
where:
Y_i = ith observation of the dependent variable Y
X_{ki} = ith observation of the kth independent variable X
α = intercept term
β_k = slope coefficient of the kth independent variable
ϵ_i = error term of the ith observation
n = number of observations
k = total number of independent variables
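The OLS mechanics behind this model can be sketched in a few lines of code. The snippet below is a minimal sketch using simulated data and NumPy only; the variable names, sample size, and true parameter values are illustrative assumptions, not part of the text above.

```python
# A minimal sketch of estimating Y_i = alpha + beta_1*X_1i + ... + beta_k*X_ki + eps_i by OLS.
# The data are simulated purely for illustration.
import numpy as np

rng = np.random.default_rng(42)
n, k = 200, 3                                  # n observations, k explanatory variables
X = rng.normal(size=(n, k))                    # simulated explanatory variables
eps = rng.normal(scale=0.5, size=n)            # error term
alpha_true, beta_true = 1.0, np.array([0.8, -0.3, 0.5])
Y = alpha_true + X @ beta_true + eps           # simulated dependent variable

X_design = np.column_stack([np.ones(n), X])    # add a column of ones for the intercept alpha
coef, *_ = np.linalg.lstsq(X_design, Y, rcond=None)
print("alpha_hat:", coef[0])
print("beta_hat :", coef[1:])                  # each beta_j is a partial slope coefficient
```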
Additional Assumptions of Multiple Regression
Extending the model to multiple regressors requires one additional assumption, along with some modifications to the six assumptions of linear regression with a single regressor.
The additional assumption is:
Multiple linear regression assumes that the explanatory variables are not perfectly linearly dependent (i.e., each explanatory variable must have some variation that cannot be perfectly explained by the other variables in the model).
If this assumption is violated, then the variables are perfectly collinear.
The remaining assumptions require simple modifications to account for k explanatory variables. These become:
All explanatory variables must have positive variance, so that σ²_{X_j} > 0 for each j = 1, ..., k.
The error is assumed to have mean zero conditional on the explanatory variables, i.e., E[ϵ_i | X_{1i}, ..., X_{ki}] = 0.
The random variables (X_{1i}, ..., X_{ki}, Y_i) are independent and identically distributed (i.i.d.) across observations.
The probability of large outliers in each explanatory variable should be small, so that Y_i and each X_{ji} have finite fourth moments (i.e., finite kurtosis).
The constant variance (homoskedasticity) assumption is similarly extended to condition on all explanatory variables: Var[ϵ_i | X_{1i}, ..., X_{ki}] = σ².
The error terms should be uncorrelated across all observations, i.e., Corr(ϵ_i, ϵ_j) = 0 for all i ≠ j.
Interpretation of Coefficients
Slope coefficient (β_k): the change in the dependent variable resulting from a one-unit change in the corresponding independent variable X_{ki}, keeping all other independent variables constant.
When all explanatory variables are distinct (i.e., no variable is an exact function of the others), the coefficients are interpreted as holding all other values fixed. For example, β_1 is the effect of a small increase in X_1 holding all other variables constant.
In observed data, when the value of one independent variable changes by one unit, the resulting change in the dependent variable is generally not equal to the slope coefficient, because the other independent variables are correlated with it and tend to move as well. For this reason, the slope coefficients are called partial slope coefficients.
Intercept coefficient (α): the intercept (or constant) is the expected value of the dependent variable Y when all the independent variables X_k are equal to 0.
Interpretation of Coefficients: Indistinct Variables
If some explanatory variables are functions of the same random variable (e.g., if X_2 = X_1², so that the model is Y_i = α + β_1 X_{1i} + β_2 X_{1i}² + ϵ_i), then it is not possible to change X_1 while holding the other variables constant.
The interpretation of the coefficients in models with this structure depends on the value of X_1, because a small change of ΔX_1 in X_1 changes Y by approximately (β_1 + 2β_2 X_1) ΔX_1.
This effect captures the direct, linear effect of a change in X_1 through β_1 and its nonlinear effect through β_2.
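As a small worked illustration (the values are assumed, not from the text): if β_1 = 2, β_2 = 0.5, and the current value is X_1 = 3, then a change of ΔX_1 = 0.1 changes Y by approximately (2 + 2 × 0.5 × 3) × 0.1 = 5 × 0.1 = 0.5.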
OLS Estimators for Multiple Regression Parameters
Estimating the multiple regression parameters can be quite demanding since it involves many calculations. A basic understanding can be developed using the multiple regression model with two independent variables, which can then be extended to more than two independent variables.
Suppose the two-variable model is
Y_i = α + β_1 X_{1i} + β_2 X_{2i} + ϵ_i
The OLS estimator for β_1 can be computed using three single-variable regressions.
The first regresses X_1 on X_2 and retains the residuals from this regression.
The second regression does the same for Y.
The final step regresses the residual of Y on the residual of X_1.
The first two regressions have a single purpose: to remove the direct effect of X_2 from Y and X_1. They do this by decomposing each variable into two components: one that is perfectly correlated with X_2 (i.e., the fitted value) and one that is uncorrelated with X_2 (i.e., the residual). As a result, the two residuals are uncorrelated with X_2 by construction.
The final regression estimates the linear relationship (i.e., β_1) between the components of Y and X_1 that are uncorrelated with (and so cannot be explained by) X_2.
Finally, the OLS estimate of β_2 can be computed in the same manner by reversing the roles of X_1 and X_2 (i.e., so that β_2 measures the effect of the component of X_2 that is uncorrelated with X_1).
This stepwise estimation can be used to estimate models with any number of regressors. In the k-variable model, the OLS estimate of β_1 is computed by first regressing each of X_1 and Y on a constant and the remaining k-1 explanatory variables. The residuals from these two regressions are mean zero and uncorrelated with the remaining k-1 explanatory variables. The OLS estimator of β_1 is then obtained by regressing the residuals of Y on the residuals of X_1.
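This residual-on-residual procedure can be checked numerically. The sketch below uses simulated data (all names and parameter values are illustrative assumptions) and shows that the stepwise estimate of β_1 matches the estimate from the full two-variable regression.

```python
# A minimal sketch of the stepwise (residual-on-residual) estimation described above,
# for the two-variable model Y_i = alpha + beta_1*X_1i + beta_2*X_2i + eps_i. Data are simulated.
import numpy as np

rng = np.random.default_rng(0)
n = 500
X2 = rng.normal(size=n)
X1 = 0.6 * X2 + rng.normal(size=n)             # X1 is correlated with X2
Y = 1.0 + 2.0 * X1 - 1.5 * X2 + rng.normal(size=n)

def residuals(y, x):
    """Residuals from regressing y on a constant and x (single-variable OLS)."""
    Z = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return y - Z @ coef

e_X1 = residuals(X1, X2)                       # step 1: remove the effect of X2 from X1
e_Y  = residuals(Y, X2)                        # step 2: remove the effect of X2 from Y
beta1_stepwise = (e_X1 @ e_Y) / (e_X1 @ e_X1)  # step 3: regress residual of Y on residual of X1

# Compare with beta_1 from the full multiple regression
Z_full = np.column_stack([np.ones(n), X1, X2])
beta1_full = np.linalg.lstsq(Z_full, Y, rcond=None)[0][1]
print(beta1_stepwise, beta1_full)              # the two estimates agree (up to rounding)
```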
Measuring Model Fit
The total variation in the dependent variable is called the total sum of squares (TSS), which is defined as the sum of the squared deviations of Y_i around the sample mean Ȳ:
TSS = Σ_{i=1}^{n} (Y_i - Ȳ)²
Each observation of the dependent variable is decomposed into two components: the fitted value (Ŷ_i) and the estimated residual (ϵ̂_i), so that:
Y_i = Ŷ_i + ϵ̂_i
Minimizing the squared residuals decomposes the total variation of the dependent variable into two distinct components:
RSS: one that captures the unexplained variation (due to the error in the model), and
ESS: another that measures the explained variation (which depends on both the estimated parameters and the variation in the explanatory variables).
The residual sum of squares (RSS, also written SSR) is the sum of squared deviations of the actual (or observed) values Y_i from the predicted values Ŷ_i. Hence the RSS is simply the sum of the squared residuals, i.e.
RSS = Σ (Y_i - Ŷ_i)² = Σ ϵ̂_i²
The explained sum of squares (ESS) is the sum of squared deviations of the predicted values Ŷ_i from the sample mean of the Y_i's (i.e., Ȳ):
ESS = Σ (Ŷ_i - Ȳ)²
It is important to note that
TSS = ESS + RSS
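This identity can be verified numerically. The sketch below uses simulated data (illustrative only) and an OLS fit that includes an intercept.

```python
# A minimal sketch verifying TSS = ESS + RSS for an OLS fit with an intercept. Data are simulated.
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = rng.normal(size=(n, 2))
Y = 0.5 + X @ np.array([1.0, -2.0]) + rng.normal(size=n)

Z = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)
Y_hat = Z @ coef                               # fitted values
resid = Y - Y_hat                              # estimated residuals

TSS = np.sum((Y - Y.mean()) ** 2)              # total variation around the sample mean
ESS = np.sum((Y_hat - Y.mean()) ** 2)          # explained variation
RSS = np.sum(resid ** 2)                       # unexplained variation
print(TSS, ESS + RSS)                          # equal up to floating-point error
```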
Standard Error of Regression
The SER measures the degree of variability of the actual Y-values Y_i with respect to the fitted values Ŷ_i. It is a measure of the spread (or standard deviation) of the observations around the regression line.
The SER conveys the "fit" of the regression line, and the fit is better if the SER is smaller.
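A minimal sketch of computing the SER is shown below. The text above does not state the exact formula, so the divisor n - k - 1 (the usual degrees-of-freedom adjustment) is an assumption, and the numeric inputs are purely illustrative.

```python
# A minimal sketch of the standard error of the regression (SER).
# Assumes the degrees-of-freedom-adjusted definition SER = sqrt(RSS / (n - k - 1)).
import numpy as np

def standard_error_of_regression(rss: float, n: int, k: int) -> float:
    """SER = sqrt(RSS / (n - k - 1)), where k counts regressors excluding the intercept."""
    return np.sqrt(rss / (n - k - 1))

print(standard_error_of_regression(rss=42.7, n=100, k=2))  # illustrative numbers
```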
Coefficient of Determination, R²
R² is the proportion of the variance in the dependent variable that is explained by (the variation in) the independent variables. It is calculated as the ratio of the explained sum of squares to the total sum of squares:
R² = ESS / TSS = 1 - RSS / TSS
Because OLS estimates the parameters by finding the values that minimize the RSS, the OLS estimator also maximizes R².
In a linear regression with a single regressor, R² is the squared correlation between the dependent variable and the explanatory variable.
When a model has multiple explanatory variables, R² is a complicated function of the correlations among the explanatory variables and those between the explanatory variables and the dependent variable.
However, R² in a model with multiple regressors is the squared correlation between Y_i and the fitted value Ŷ_i.
This interpretation of R² provides another interpretation of the OLS estimator: the regression coefficients β̂_1, ..., β̂_k are chosen to produce the linear combination of X_1, ..., X_k that maximizes the correlation with Y.
A model that is completely incapable of explaining the observed data has an R² of 0 (because all variation is in the residuals). A model that perfectly explains the data (so that all residuals are 0) has an R² of 1. All other models produce values that fall between these two bounds, so R² is never negative and never exceeds 1.
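The two characterizations of R² can be confirmed numerically. The sketch below uses simulated data (illustrative only) and computes R² both as 1 - RSS/TSS and as the squared correlation between Y and the fitted values.

```python
# A minimal sketch showing R^2 = 1 - RSS/TSS and, equivalently, the squared correlation
# between Y and the fitted values Y_hat. Data are simulated for illustration.
import numpy as np

rng = np.random.default_rng(2)
n = 250
X = rng.normal(size=(n, 3))
Y = 0.2 + X @ np.array([0.7, 0.0, -1.1]) + rng.normal(size=n)

Z = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)
Y_hat = Z @ coef

TSS = np.sum((Y - Y.mean()) ** 2)
RSS = np.sum((Y - Y_hat) ** 2)
r2_from_sums = 1 - RSS / TSS
r2_from_corr = np.corrcoef(Y, Y_hat)[0, 1] ** 2
print(r2_from_sums, r2_from_corr)              # the two values agree
```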
Limitations of R²
While R² is a useful measure of model fit, it has three important limitations.
Adding a new variable to the model always increases (or at least never decreases) R², even if the new variable has an insignificant effect on the dependent variable. For example, if a regression model with one explanatory variable is modified to have two explanatory variables, the new R² is greater than or equal to that of the original single-variable model. That is, if the original model is
Y_i = α + β_1 X_{1i} + ϵ_i
and the expanded model is
Y_i = α + β_1 X_{1i} + β_2 X_{2i} + ϵ_i
then the R² of the expanded model must be greater than or equal to the R² of the original model. This is because the expanded model has the same TSS and nearly always has a smaller RSS, resulting in a higher R². The only situation where adding a variable does not increase R² is when the estimated coefficient β̂_2 equals 0; in that case, the RSS remains the same (as does R²).
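A small numerical sketch of this point follows (simulated data; the added regressor is pure noise by construction and all values are illustrative): the expanded model's R² is never lower than the original model's.

```python
# A minimal sketch: adding an explanatory variable (even pure noise) never lowers R^2.
import numpy as np

rng = np.random.default_rng(3)
n = 150
X1 = rng.normal(size=n)
Y = 1.0 + 0.5 * X1 + rng.normal(size=n)
X2 = rng.normal(size=n)                        # unrelated to Y by construction

def r_squared(Y, X_cols):
    Z = np.column_stack([np.ones(len(Y))] + X_cols)
    coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    resid = Y - Z @ coef
    return 1 - np.sum(resid ** 2) / np.sum((Y - Y.mean()) ** 2)

print(r_squared(Y, [X1]))                      # original model
print(r_squared(Y, [X1, X2]))                  # expanded model: R^2 is >= the original
```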
The coefficient of determination R² cannot be compared across models with different dependent variables. For example, when Y_i is always positive, it is not possible to compare the R² of a model in levels (Y_i) and one in logs (ln Y_i).
It is also not possible to compare the R² of two models that are logically equivalent (in the sense that both the fit of the model as measured by RSS and the predictions from the models are identical). This can occur when the dependent variable is transformed by adding or subtracting one or more of the explanatory variables.
There is no general value that can be considered a "good" value for R². Whether a model provides a good description of the data depends on the nature of the data. For example:
An R² of 5% would be implausibly high for a model that predicts the one-day-ahead return on a liquid equity index futures contract using the current value of explanatory variables.
On the other hand, an R² of less than 70% would be quite low for a model predicting the returns on a well-diversified large-cap portfolio using the contemporaneous return on the equity market (i.e., CAPM).
Adjusted R²
As discussed, R² generally increases as independent variables are added, even if those variables contribute little to explaining the variation in the dependent variable. Hence, a high R² might falsely indicate a high collective explanatory power of the independent variables when, in reality, it merely reflects a large set of independent variables.
This limitation is addressed (in a limited way) through another measure known as the adjusted R² (written as R̄² or R²_a), which adjusts R² for the degrees of freedom (i.e., the number of independent variables). It is defined as
R̄² = 1 - (RSS / (n - k - 1)) / (TSS / (n - 1))
where
n is the number of observations in the sample, and
k is the number of explanatory variables included in the model (not including the constant term α).
The adjusted R² can also be expressed as:
R̄² = 1 - [(n - 1) / (n - k - 1)] × (1 - R²)
where the adjustment factor is (n - 1)/(n - k - 1).
Note that the adjustment factor must be greater than 1 because its denominator, n - k - 1, is less than its numerator, n - 1.
Including additional explanatory variables (i.e., increasing k) always increases the adjustment factor. The adjusted R² captures the tradeoff between the increasing adjustment factor and the decreasing RSS as models become larger. If a model with additional explanatory variables produces only a negligible decrease in the RSS compared to a base model, then the loss of a degree of freedom produces a smaller R̄².
The adjustment to R² may produce negative values if a model provides an exceptionally poor fit. In most financial data applications, n is relatively large, and so the loss of a degree of freedom has little effect on R̄². In large samples, the adjustment is very small (the adjustment factor is close to 1), and so R̄² tends to increase even when an additional variable has little explanatory power.
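A minimal sketch of the adjustment in action follows, using assumed (illustrative) R² values and sample size: a tiny rise in R² from an extra regressor can still lower the adjusted R².

```python
# A minimal sketch of the adjusted R^2: R_bar^2 = 1 - (n - 1)/(n - k - 1) * (1 - R^2).
# With an uninformative extra regressor, R^2 rises slightly but R_bar^2 can fall.
def adjusted_r2(r2: float, n: int, k: int) -> float:
    return 1 - (n - 1) / (n - k - 1) * (1 - r2)

n = 150
r2_original, r2_expanded = 0.210, 0.212        # illustrative values (k = 1 vs. k = 2)
print(adjusted_r2(r2_original, n, k=1))        # ~0.2047
print(adjusted_r2(r2_expanded, n, k=2))        # ~0.2013, lower despite the higher R^2
```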
Testing Parameters in Regression Models
Testing a hypothesis about a single coefficient in a model with multiple regressors is identical to testing in a model with a single explanatory variable. Tests of the null hypothesis
H_0: β_j = β_{j,0}
are implemented using a t-test with sample test statistic
T = (β̂_j - β_{j,0}) / ŝ.e.(β̂_j)
where
β_{j,0} is the value of β_j hypothesized under the null, and
ŝ.e.(β̂_j) is the estimated standard error of β̂_j.
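A minimal sketch of this t-test follows, computed with classic OLS standard errors under the homoskedasticity assumption (simulated data; all names and values are illustrative).

```python
# A minimal sketch of t-tests for H0: beta_j = 0, using classic (homoskedastic) OLS
# standard errors. Data are simulated for illustration.
import numpy as np

rng = np.random.default_rng(4)
n = 300
X = rng.normal(size=(n, 2))
Y = 0.3 + X @ np.array([0.5, 0.0]) + rng.normal(size=n)    # second slope is truly zero

Z = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)
resid = Y - Z @ coef
k = X.shape[1]
sigma2_hat = resid @ resid / (n - k - 1)                   # estimated error variance
cov_hat = sigma2_hat * np.linalg.inv(Z.T @ Z)              # classic OLS covariance matrix
se_hat = np.sqrt(np.diag(cov_hat))                         # estimated standard errors

t_stats = coef / se_hat                                    # T = (beta_hat_j - 0) / s.e.(beta_hat_j)
print(t_stats)                                             # compare with critical values, e.g. +/-1.96
```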
However, the t-test is not directly applicable when testing complex hypotheses that involve more than one parameter, because the parameter estimators can be correlated. This correlation complicates extending the univariate t-test to tests of multiple parameters.
Instead, the more common choice is an alternative called the F-test. This type of test compares the fit of the model (measured using the RSS) when the null hypothesis is true relative to the fit of the model without the restriction on the parameters assumed by the null.
The F-Test: Joint Hypothesis Testing
Implementing an F-test requires estimating two models. The first is the full model that is to be tested. This model is called the unrestricted model and has an RSS denoted by RSS_U. The second model, called the restricted model, imposes the null hypothesis on the unrestricted model, and its RSS is denoted RSS_R. The F-test compares the fit of these two models:
F = [(RSS_R - RSS_U) / q] / [RSS_U / (n - k_U - 1)]
where
q is the number of restrictions imposed on the unrestricted model to produce the restricted model,
k_U is the number of explanatory variables in the unrestricted model, and
the test statistic has an F_{q, n-k_U-1} distribution.
F-tests can be equivalently expressed in terms of the R² of the restricted and unrestricted models. Using this alternative parameterization:
F = [(R²_U - R²_R) / q] / [(1 - R²_U) / (n - k_U - 1)]
If the restriction imposed by the null hypothesis does not meaningfully alter the fit of the model, then the two RSS measures are similar and the test statistic is small.
On the other hand, if the unrestricted model fits the data significantly better than the restricted model, then the RSS of the two models should differ by a large amount, so that the value of the F-test statistic is large.
A large test statistic indicates that the unrestricted model provides a superior fit and so the null hypothesis is rejected.
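A minimal sketch follows, showing that the RSS and R² parameterizations of the F-statistic give the same value; the RSS, TSS, n, k_U, and q figures below are assumed purely for illustration.

```python
# A minimal sketch of the F-statistic computed from the restricted and unrestricted fits,
# using both the RSS and the R^2 parameterizations. The numbers are purely illustrative.
n, k_U, q = 120, 3, 2                          # observations, regressors in unrestricted model, restrictions
RSS_U, RSS_R = 54.0, 60.0                      # assumed residual sums of squares
TSS = 100.0                                    # the same in both models

F_from_rss = ((RSS_R - RSS_U) / q) / (RSS_U / (n - k_U - 1))

R2_U, R2_R = 1 - RSS_U / TSS, 1 - RSS_R / TSS
F_from_r2 = ((R2_U - R2_R) / q) / ((1 - R2_U) / (n - k_U - 1))

print(F_from_rss, F_from_r2)                   # identical: (6/2) / (54/116) = 6.444...
```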
Implementing an F-test requires imposing the null hypothesis on the model and then estimating the restricted model using OLS. For example, consider a test of whether the CAPM, which only includes the market return as a factor, provides as good a fit as a multi-factor model that additionally includes the size and value factors.
The unrestricted model includes all three explanatory variables, so that:
Y_i = α + β_m R_{m,i} + β_s R_{s,i} + β_v R_{v,i} + ϵ_i
where Y_i is the portfolio return being explained,
m indicates the market (i.e., so that R_{m,i} is the return to the market factor above the risk-free rate),
s indicates size, and
v indicates value.
The null hypothesis is then:
H_0: β_s = 0 and β_v = 0
The alternative hypothesis is that at least one of the parameters is not equal to zero:
H_1: β_s ≠ 0 or β_v ≠ 0
so that the null should be rejected if at least one of the coefficients is different from zero.
In this hypothesis test, two coefficients are restricted to specific values, and so q = 2. The F-test is then computed by estimating both regressions, storing the two RSS values, and then computing:
F = [(RSS_R - RSS_U) / 2] / [RSS_U / (n - 4)]
Finally, if the test statistic F is larger than the critical value of an F_{2, n-4} distribution using a size of α (e.g., 5%), then the null is rejected. If the test statistic is smaller than the critical value, then the null hypothesis is not rejected, and it is concluded that the CAPM appears to be adequate in explaining the returns to the portfolio.
Imposing the null hypothesis requires replacing the parameters with their assumed values when the null is true. Imposing the null hypothesis on the unrestricted model produces the restricted model:
Y_i = α + β_m R_{m,i} + ϵ_i
which is the CAPM.
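A minimal sketch of this test follows. The factor and portfolio returns are simulated (all numbers are illustrative assumptions) so that the CAPM holds by construction; both models are fitted and the F-statistic is compared with the F_{2, n-4} critical value from scipy.

```python
# A minimal sketch of the CAPM vs. three-factor F-test: simulate data in which CAPM is
# adequate, fit both models, and compare F with the F(2, n-4) critical value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 240                                        # e.g., 20 years of monthly returns
Rm = rng.normal(0.005, 0.04, n)                # market factor
Rs = rng.normal(0.001, 0.02, n)                # size factor
Rv = rng.normal(0.001, 0.02, n)                # value factor
Y  = 0.001 + 1.1 * Rm + rng.normal(0, 0.01, n) # portfolio return generated by CAPM alone

def rss(Y, X_cols):
    Z = np.column_stack([np.ones(len(Y))] + X_cols)
    coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    resid = Y - Z @ coef
    return resid @ resid

RSS_U = rss(Y, [Rm, Rs, Rv])                   # unrestricted: three factors
RSS_R = rss(Y, [Rm])                           # restricted: CAPM only (beta_s = beta_v = 0)

q = 2
F = ((RSS_R - RSS_U) / q) / (RSS_U / (n - 4))  # k_U = 3, so n - k_U - 1 = n - 4
crit = stats.f.ppf(0.95, q, n - 4)             # 5% critical value of F(2, n-4)
print(F, crit, F > crit)                       # here the null (CAPM adequate) is typically not rejected
```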
Multivariate Confidence Intervals
The method for constructing a confidence interval for single coefficients in the multiple regression model is also the same as in the single-regressor model.
The confidence interval for β_j can be constructed as
β̂_j ± c_α × ŝ.e.(β̂_j)
where c_α is the critical value for the chosen confidence level (e.g., 1.96 for a 95% confidence interval based on the normal distribution).
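A minimal sketch of such an interval follows; the coefficient estimate, standard error, and sample size are assumed for illustration, and the critical value is taken from the t distribution with n - k - 1 degrees of freedom (which is close to 1.96 in large samples).

```python
# A minimal sketch of a 95% confidence interval for a single coefficient:
# beta_hat_j +/- critical value * s.e.(beta_hat_j). The inputs are illustrative.
from scipy import stats

beta_hat_j, se_hat_j = 0.85, 0.20              # assumed estimate and standard error
n, k = 120, 3
crit = stats.t.ppf(0.975, n - k - 1)           # two-sided 95% critical value (close to 1.96)

ci = (beta_hat_j - crit * se_hat_j, beta_hat_j + crit * se_hat_j)
print(ci)
```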