## What is Multiple Linear Regression?

Multiple linear regression extends the methodology of simple linear regression. Simple linear regression studies two variables: one independent variable (X) and one dependent variable (Y). In other words, it predicts the change in the dependent variable from a change in the independent variable.

## When to Use Multiple Linear Regression

Use multiple linear regression to study more than two variables. The basic difference between simple and multiple regression lies in the explanatory variables: unlike simple linear regression, multiple regression has more than one independent variable (X), and these are used together to predict a single dependent variable (Y). The goal is to predict the change in the dependent variable (Y) from changes in the independent variables.

Example: The price of a house (dependent variable Y) depends on several independent variables (X), such as locality, number of bedrooms, number of bathrooms, age of the house, and square footage.

## Notes about Multiple Linear Regression

The predicted Y is a linear combination of the X variables, with coefficients chosen so that the sum of squared deviations between the observed and predicted Y is minimized. In other words, the sum of squared errors is minimized.

A residual, also called an error, is the difference between the actual observed value of the dependent variable Y and the predicted value obtained from the linear combination of the X variables.
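The least-squares idea and the residuals described above can be sketched in a few lines of Python. This is a minimal illustration, not the only way to fit the model: the data values are made up, and `np.linalg.lstsq` is used as one convenient solver.

```python
import numpy as np

# Hypothetical data: 6 observations of 2 predictors (values invented for illustration).
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 6.0],
              [6.0, 5.0]])
y = np.array([6.1, 6.9, 13.2, 13.8, 20.1, 20.9])

# Prepend a column of ones so the first coefficient acts as the intercept.
A = np.column_stack([np.ones(len(X)), X])

# Least-squares solution: the coefficients that minimize the sum of squared residuals.
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

y_hat = A @ coef          # predicted values
residuals = y - y_hat     # observed minus predicted
sse = float(np.sum(residuals ** 2))  # sum of squared errors

print("coefficients:", coef)
print("SSE:", sse)
```

Any other choice of coefficients gives a sum of squared errors at least as large as `sse`, which is what "minimized" means here.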

The coefficient of determination is $R^2$. It is the explained variation divided by the total variation, that is, the proportion of the variation in Y that the model explains. As predictors are added to the model, $R^2$ never decreases, even when the added predictors have no relationship with the output variable.

Like $r^2$ (the simple coefficient of determination), $R^2$ (the multiple coefficient of determination) takes values in the interval:

$$0 \le R^2 \le 1$$

If $R^2$ is 0, the outcome cannot be predicted from the independent variables (X). If $R^2$ is 1, the outcome can be predicted from them without error. Even so, an $R^2$ of 1 does not by itself mean the model is a good one.
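The behavior of $R^2$ when an unrelated predictor is added can be checked numerically. The sketch below assumes simulated data and a hypothetical helper named `r_squared`; it computes $R^2$ as one minus the ratio of the sum of squared errors to the total sum of squares, which equals the explained proportion for a least-squares fit with an intercept.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of a least-squares fit of y on X (an intercept is added here)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    sse = np.sum(resid ** 2)             # unexplained variation
    sst = np.sum((y - y.mean()) ** 2)    # total variation
    return 1.0 - sse / sst

# Simulated data (an assumption for illustration): y depends on x1 only.
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
y = 2.0 + 3.0 * x1 + rng.normal(size=n)
noise = rng.normal(size=n)  # a predictor with no relationship to y

r2_one = r_squared(x1.reshape(-1, 1), y)
r2_two = r_squared(np.column_stack([x1, noise]), y)

print("R^2 with x1 only:     ", r2_one)
print("R^2 with x1 and noise:", r2_two)
```

The second value is never smaller than the first, which is why a higher $R^2$ alone is weak evidence that an added predictor matters.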

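Computation in multiple regression is more complex because of the number of explanatory variables in the model. Because the explanatory variables are interrelated, the interpretation also changes: each slope coefficient describes the change in Y for a one-unit change in its X while the other X variables are held constant.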

## Assumptions of Multiple Linear Regression

- Residuals are independent of one another
- No multicollinearity: the independent variables are not too highly correlated with each other
- Residuals are normally distributed
- The relationship between each predictor variable and the outcome variable is linear
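As a rough screen for the multicollinearity assumption listed above, the pairwise correlations between predictors can be inspected. The sketch below uses simulated predictors, and the 0.8 cutoff is an assumed rule of thumb rather than a fixed standard; pairwise correlation can also miss collinearity that involves three or more variables.

```python
import numpy as np

# Simulated predictors (an assumption for illustration).
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.1 * rng.normal(size=n)  # nearly a copy of x1
x3 = rng.normal(size=n)                    # unrelated to x1 and x2

X = np.column_stack([x1, x2, x3])

# Correlation matrix of the predictors; columns are the variables.
corr = np.corrcoef(X, rowvar=False)

threshold = 0.8  # assumed cutoff, adjust to the problem at hand
flagged = [(i, j, corr[i, j])
           for i in range(corr.shape[0])
           for j in range(i + 1, corr.shape[1])
           if abs(corr[i, j]) > threshold]

for i, j, r in flagged:
    print(f"predictors {i} and {j} are highly correlated: r = {r:.3f}")
```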

## Formula to calculate Multiple Linear Regression

### A first order linear model

For k independent variables, the model for Y is

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon$$

Where

- $Y$ is the dependent variable
- $X_1, \dots, X_k$ are the independent variables
- $\beta_0$ is the Y intercept
- $\beta_1, \dots, \beta_k$ are the slope coefficients, one for each independent variable
- $\varepsilon$ is the residual, also called the error
- $k$ is the number of predictor variables

The $\beta$ values are computed so that the sum of squared errors is minimized.

With two independent variables, the estimated regression line is

$$\hat{y} = \hat{b}_0 + \hat{b}_1 X_1 + \hat{b}_2 X_2$$

Formula to calculate the estimate of the intercept:

$$\hat{b}_0 = \bar{Y} - \hat{b}_1 \bar{X}_1 - \hat{b}_2 \bar{X}_2$$
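For the two-predictor case, the slope estimates also have a standard closed form in terms of sums of squares and cross-products of deviations from the means, and the intercept then follows from the formula above. The sketch below is one way to write that out; the function name `fit_two_predictors` and the data are hypothetical, and the data are generated exactly from y = 1 + 2·x1 + 3·x2 so the result is easy to check.

```python
import numpy as np

def fit_two_predictors(x1, x2, y):
    """Least-squares estimates (b0, b1, b2) for y = b0 + b1*x1 + b2*x2."""
    x1, x2, y = (np.asarray(v, dtype=float) for v in (x1, x2, y))

    # Deviations from the means.
    d1, d2, dy = x1 - x1.mean(), x2 - x2.mean(), y - y.mean()

    # Sums of squares and cross-products.
    s11, s22, s12 = d1 @ d1, d2 @ d2, d1 @ d2
    s1y, s2y = d1 @ dy, d2 @ dy

    # Zero when x1 and x2 are perfectly collinear, in which case no unique fit exists.
    det = s11 * s22 - s12 ** 2

    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = y.mean() - b1 * x1.mean() - b2 * x2.mean()  # intercept formula from the text
    return b0, b1, b2

# Hypothetical data generated without noise from y = 1 + 2*x1 + 3*x2.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]

b0, b1, b2 = fit_two_predictors(x1, x2, y)
print(b0, b1, b2)
```

With more than two predictors the same estimates are usually obtained with matrix methods rather than by hand.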

### A Second-Order Linear Model (Two Predictor Variables)

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \beta_4 X_1^2 + \beta_5 X_2^2 + \varepsilon$$
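The second-order model is still a linear model because it is linear in the coefficients; the interaction and squared terms are simply extra columns. The sketch below builds those columns and fits them by least squares. The helper name `second_order_design`, the 3×3 grid of points, and the coefficient values are all assumptions chosen so the fit can be verified.

```python
import numpy as np

def second_order_design(x1, x2):
    """Columns: 1, x1, x2, x1*x2, x1^2, x2^2 (same order as beta_0 to beta_5)."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    return np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1 ** 2, x2 ** 2])

# Hypothetical 3x3 grid of predictor settings (9 observations).
g1, g2 = np.meshgrid([0.0, 1.0, 2.0], [0.0, 1.0, 2.0])
x1, x2 = g1.ravel(), g2.ravel()

# Made-up coefficients; y is generated without noise so the fit should recover them.
true_beta = np.array([1.0, 2.0, 3.0, 0.5, 0.25, -1.0])
A = second_order_design(x1, x2)
y = A @ true_beta

beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
print(beta_hat)
```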

## Example of Multiple Linear Regression in DMAIC

Multiple linear regression is used in the Analyze phase of DMAIC to study more than two variables. In a laboratory, a chemist recorded the yield (Y) of a process that is affected by two factors, p and q. The chemist wants to fit a first-order regression model.

- $\bar{Y} = 354/8 = 44.25$
- $\bar{p} = 61/8 = 7.625$
- $\bar{q} = 38/8 = 4.75$

$$\hat{b}_0 = \bar{Y} - \hat{b}_1 \bar{p} - \hat{b}_2 \bar{q} = 31.37$$

The estimated regression line would be

$$\hat{y} = 31.37 + 0.75p + 1.5q$$
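Once the line is estimated, predicting a yield is a matter of plugging in values of p and q. The sketch below wraps the fitted equation in a hypothetical function named `predict_yield` and evaluates it at the sample means from the example.

```python
def predict_yield(p, q):
    """Predicted yield from the estimated line y = 31.37 + 0.75*p + 1.5*q."""
    return 31.37 + 0.75 * p + 1.5 * q

# Evaluate at the sample means of p and q given in the example.
at_means = predict_yield(7.625, 4.75)
print(at_means)
```

A least-squares line with an intercept passes through the point of means, so this value lands close to the mean yield of 44.25; the small gap presumably reflects rounding in the reported coefficients.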

## Additional Resources

Multiple Regression presentation

## Six Sigma Green Belt Multiple Linear Regression Questions

**Question:** A ____________________ is used to create a model of the effect on an output of the variation in two or more of the inputs.

(A) Correlation Coefficient

(B) Anova

(C) Multiple Regression

(D) X-Y Diagram
