Regression Analysis is a way of estimating the relationships between different variables by examining the behavior of the system. There are many techniques for modeling and analyzing the dependent and independent variables. For example, transformations can be used to reduce the higher-order terms in the model.
Remember the equation for a line that you learned in high school? y = mx + b, where m is the slope of the line and b is the y-intercept, the point where the line crosses the y axis. Given the slope (m) and the y-intercept (b), you can plug in any value for x and get a result y. Very straightforward and very useful. That’s what we are trying to do in root cause analysis when we say “solve for y.”
Although a statistical linear model is often pictured as a classic straight line, linear models can also produce curvilinear graphs, because “linear” refers to how the coefficients enter the model rather than to the shape of the plot. Non-linear regression, by contrast, is used to explain a nonlinear relationship between a response variable and one or more predictor variables (usually a curved line).
Unfortunately (or perhaps entertainingly) real-life systems do not always boil down to a simple equation. Sometimes you just have a collection of points on your graph and you need to make sense of them. That’s where regression analysis comes into play; you are basically trying to derive an equation from the graph of your data.
“In the business world, the rear view mirror is always clearer than the windshield.”
Linear Regression Analysis
The easiest kind of regression is linear regression. Imagine that all of your data lined up in a neat row. You could draw a straight line through all of the points and write down the simple equation y = mx + b that we talked about earlier. That way you would have a model that would faithfully predict what your system would do given any input x.
But what if your data only “kinda-sorta” looks like a line?
Method of Least Squares
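The method of least squares finds the best possible straight-line approximation for a data set: it chooses the line that minimizes the sum of the squared vertical distances between the data points and the line.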
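As a minimal sketch of the idea, the slope and intercept of the least-squares line can be computed directly from the data. The function name `least_squares_fit` and the sample points are made up for illustration; they are not from any particular tool.

```python
def least_squares_fit(xs, ys):
    """Return (m, b) for the line y = m*x + b that minimizes the
    sum of squared vertical distances to the points."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Spread of x around its mean, and co-movement of x and y
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    m = sxy / sxx             # slope
    b = y_mean - m * x_mean   # intercept: the line passes through the means
    return m, b


# Hypothetical data that only "kinda-sorta" looks like a line
xs = [1, 2, 3, 4, 5]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]
m, b = least_squares_fit(xs, ys)
print(f"y = {m:.2f}x + {b:.2f}")
```

For this made-up data the fitted line comes out at roughly y = 1.96x + 0.22.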
How well the fitted line matches the data can be measured by the Standard Error of the Estimate. The larger the Standard Error of the Estimate, the greater the dispersion of the charted points around the line.
The normal rules of Standard Deviation apply here, provided the points are roughly normally distributed around the line: about 68% of the points should be within +/- 1 Standard Error of the line, and about 95.5% of the points within +/- 2 Standard Errors.
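A rough sketch of that calculation follows. It assumes the usual definition for a straight-line fit (square root of the sum of squared residuals divided by n - 2) and reuses made-up data whose least-squares line is y = 1.96x + 0.22; the function names are hypothetical.

```python
import math


def standard_error_of_estimate(xs, ys, m, b):
    """Typical distance of the points from the line y = m*x + b.
    Divides by n - 2 because two values (m and b) were estimated."""
    n = len(xs)
    if n <= 2:
        raise ValueError("need more than two points")
    sse = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (n - 2))


def share_within(xs, ys, m, b, k, s):
    """Fraction of points lying within k standard errors of the line."""
    hits = sum(1 for x, y in zip(xs, ys) if abs(y - (m * x + b)) <= k * s)
    return hits / len(xs)


# Hypothetical data and its least-squares line
xs = [1, 2, 3, 4, 5]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]
m, b = 1.96, 0.22

s = standard_error_of_estimate(xs, ys, m, b)   # about 0.12 for this data
within_1 = share_within(xs, ys, m, b, 1, s)
within_2 = share_within(xs, ys, m, b, 2, s)
print(s, within_1, within_2)
```

With only five points the shares will not match 68% and 95.5% closely; those figures describe what to expect from larger, roughly normal data sets.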
For more examples of Least Squares, see linear regression
Coefficient of Determination (R^2 aka R Squared)
The Coefficient of Determination provides the percentage of variation in Y that is explained by the regression line.
The Coefficient of Correlation is r.
- In simple linear regression, just take the square root of the Coefficient of Determination, sqrt(R Squared), and give it the same sign as the slope of the line.
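The two coefficients can be sketched in a few lines. This assumes m and b come from a least-squares fit of a single x (so that R Squared equals 1 minus unexplained variation over total variation), and that r takes the sign of the slope; the function names and data are illustrative only.

```python
import math


def r_squared(xs, ys, m, b):
    """Share of the variation in y explained by the line y = m*x + b.
    Assumes m and b are the least-squares slope and intercept."""
    y_mean = sum(ys) / len(ys)
    sse = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))  # unexplained
    sst = sum((y - y_mean) ** 2 for y in ys)                   # total
    if sst == 0:
        raise ValueError("y never varies, so R squared is undefined")
    return 1 - sse / sst


def correlation_from_r_squared(r2, m):
    """r is the square root of R squared, carrying the sign of the slope."""
    return math.copysign(math.sqrt(r2), m)


# Hypothetical data and its least-squares line
xs = [1, 2, 3, 4, 5]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]
m, b = 1.96, 0.22

r2 = r_squared(xs, ys, m, b)
r = correlation_from_r_squared(r2, m)
print(f"R squared = {r2:.4f}, r = {r:.4f}")
```

A line with a negative slope would give the same R Squared for an equally tight fit, but a negative r.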
Go here for more on the correlation coefficient.
Measuring the validity of the model
Use the F statistic to find a p value for the model. The degrees of freedom for the regression is equal to the number of Xs in the equation (in linear regression, this is 1 because there is only 1 x in the equation y=mx+b). The degrees of freedom for the residual (error) is n - 2 in this case, where n is the number of data points.
The smaller the p value, the better. But really you judge this by choosing an acceptable level of alpha risk and checking whether the p value is smaller than it. For example, if your alpha risk level is 5% and the p value is 0.14, then you fail to reject the null hypothesis that there is no linear relationship. In this case you cannot claim that the fitted line is a suitable model, because the regression is not statistically significant.
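The test can be sketched with the standard library alone. This is one possible approach, not the method any particular statistics package uses: it relies on the fact that an F statistic with 1 and n - 2 degrees of freedom is the square of a t statistic with n - 2 degrees of freedom, and integrates the t density numerically. All function names and the data are made up for illustration.

```python
import math


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))


def p_value_f_1(f_stat, df_resid, steps=10000):
    """Upper-tail p value for F with (1, df_resid) degrees of freedom.
    Integrates the t density from 0 to sqrt(F) with Simpson's rule
    (steps must be even)."""
    upper = math.sqrt(f_stat)
    h = upper / steps
    total = t_pdf(0.0, df_resid) + t_pdf(upper, df_resid)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df_resid)
    area = total * h / 3
    return max(0.0, 1.0 - 2.0 * area)


def regression_f_test(xs, ys):
    """Return (F, p) for a least-squares straight-line fit of ys on xs."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = y_mean - m * x_mean
    ssr = sum((m * x + b - y_mean) ** 2 for x in xs)           # explained
    sse = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))  # residual
    df_reg, df_resid = 1, n - 2
    if sse == 0:
        return math.inf, 0.0   # perfect fit
    f_stat = (ssr / df_reg) / (sse / df_resid)
    return f_stat, p_value_f_1(f_stat, df_resid)


# Hypothetical data
xs = [1, 2, 3, 4, 5]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]
alpha = 0.05
f_stat, p_value = regression_f_test(xs, ys)
significant = p_value < alpha      # True here: the fit is very tight
print(f_stat, p_value, significant)
```

With a p value of 0.14 and the same alpha of 0.05, the comparison `0.14 < 0.05` is false, so the regression would not be judged significant.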
Additional Helpful Resources
Residual Analysis: “Since a linear regression model is not always appropriate for the data, assess the appropriateness of the model by defining residuals and examining residual plots.”
- Step by Step regression analysis
- When should we use regression analysis
- Regression assumptions
- Regression output interpretation in Minitab
- Extrapolation beyond a regression model
ASQ Six Sigma Black Belt Exam Regression Analysis Questions
Question: In regression analysis, which of the following techniques can be used to reduce the higher-order terms in the model?
A) Large samples.
B) Dummy variables.