Multiple linear regression works the same as simple linear regression, except for the introduction of more independent variables and their corresponding coefficients.
Using more independent variables reflects the assumption that multiple factors affect the outcome of the dependent variable. A predictive regression model can be more accurate when multiple relevant independent variables are known.
Following are some examples:
If the dependent variable is profit, then independent variables could be R&D spending and marketing expenditures.
If the dependent variable is grade, then independent variables could be hours of study and hours of sleep.
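The profit example above can be sketched as a minimal ordinary least squares fit. This is a hedged illustration with made-up numbers: the data values, and the choice of NumPy's least-squares solver, are assumptions for demonstration only.

```python
import numpy as np

# Hypothetical data: each row is a company; the columns are R&D spend
# and marketing spend (in thousands); y is profit (in thousands).
X = np.array([
    [165.0, 471.8],
    [162.6, 443.9],
    [153.4, 407.9],
    [144.4, 383.2],
    [142.1, 366.2],
])
y = np.array([192.3, 191.8, 191.1, 182.9, 166.2])

# Add an intercept column of ones so the model is
#   profit = b0 + b1 * rnd_spend + b2 * marketing_spend
X_design = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares via NumPy's least-squares solver.
coefficients, *_ = np.linalg.lstsq(X_design, y, rcond=None)
b0, b1, b2 = coefficients
print(b0, b1, b2)
```

With real data you would typically use a library such as scikit-learn or statsmodels, but the underlying computation is the same least-squares fit.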
When to Make Dummy Variables for Multiple Linear Regression
Consider the example where profit is the dependent variable. The challenge is to find how the independent variables correlate with profit. In the image above, the independent variables have a blue background in the header. The first three independent variables are expenditures, so it is easy to associate each expenditure with a numeric variable. The state column is not that simple: it is a categorical variable. The approach to use in this case is to create dummy variables for the categorical variable.
As shown in the image above, you create a separate column for each category. You populate each row of a dummy variable column with a 1 if that row in the state column matches the heading of the dummy variable column; otherwise, you populate that row with a 0. You only need to include one of the dummy variables in your equation. In the image above, if D1 is 1, the company is in New York; if it is 0, the company is in California. We do not lose any information by including only one dummy variable in the equation. This approach may seem biased, because the b4 coefficient drops out when the state is California, but that is not the case: the regression model accounts for California through the b0 coefficient.
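The encoding described above can be written in a few lines. This is a sketch with a hypothetical state column; with two categories, a single dummy column suffices, and the omitted category (California here) becomes the baseline absorbed into the b0 coefficient.

```python
import numpy as np

# Hypothetical state column with two categories, as in the example above.
states = ["New York", "California", "California", "New York", "New York"]

# One dummy column is enough for two categories:
# D1 = 1 for New York, 0 for California (the baseline).
d1 = np.array([1 if s == "New York" else 0 for s in states])
print(d1)  # [1 0 0 1 1]
```

Libraries such as pandas (`get_dummies` with `drop_first=True`) automate this for columns with many categories.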
Avoid The Dummy Variable Trap
The problem with including both dummy variables is that it amounts to duplicating a variable. This phenomenon, where one (or more) variables predict another, is called multicollinearity. As a result, the regression model cannot distinguish the effect of one dummy variable from the effect of the other, so it will not work as it should. This problem is referred to as the dummy variable trap. The key takeaway: when building a multiple linear regression model with dummy variables, always omit one dummy variable from the equation. This rule applies irrespective of the number of dummy variables.
Finally, there are different ways to build a multiple linear regression model. Common methods include backward elimination, forward selection, and stepwise regression.
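Backward elimination can be sketched as follows. This is a simplified, assumed variant: instead of the usual p-value criterion (which needs a t-distribution, e.g. from statsmodels), it drops the predictor whose removal inflates the fit's squared error the least, stopping when every removal hurts too much. The function name, threshold, and synthetic data are all illustrative assumptions.

```python
import numpy as np

def sse(X, y):
    """Sum of squared errors of an OLS fit with an intercept."""
    Xd = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    return float(resid @ resid)

def backward_elimination(X, y, max_increase=1.5):
    """Repeatedly drop the predictor whose removal inflates the SSE
    the least, while the inflation ratio stays below max_increase."""
    cols = list(range(X.shape[1]))
    while len(cols) > 1:
        base = sse(X[:, cols], y)
        # Try removing each remaining predictor in turn.
        trials = [(sse(X[:, [c for c in cols if c != drop]], y), drop)
                  for drop in cols]
        best_sse, best_drop = min(trials)
        if base == 0 or best_sse / base > max_increase:
            break  # every removal hurts too much; stop
        cols.remove(best_drop)
    return cols

# Synthetic data: y depends on column 0 only; column 1 is pure noise,
# so backward elimination should keep only column 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=50)
print(backward_elimination(X, y))
```

A production workflow would instead use significance tests or an information criterion (AIC/BIC) as the stopping rule, but the loop structure is the same.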