Linear regression is a widely used statistical technique for estimating the mathematical relationship between a dependent variable (usually denoted Y) and an independent variable (usually denoted X). In other words, it predicts the change in the dependent variable that accompanies a change in the independent variable.
- Dependent variable or Criterion variable – the variable for which we wish to make predictions
- Independent variable or Predictor variable – the variable used to explain the dependent variable
When to Use Linear Regression
In simple linear regression a single independent variable is used to predict the dependent variable, whereas in multiple linear regression two or more independent variables are used to predict a single dependent variable. In other words, the basic difference between simple and multiple regression lies in the number of explanatory variables.
For example, compare the crop yield in a season against the rainfall rate.
Notes about Linear Regression
The first step of linear regression is to test the linearity assumption. This can be done by plotting the values in a scatter plot to observe the relationship between the dependent and independent variables: if the data follow a clearly non-linear (for example, exponential) pattern, there is no point in fitting a linear regression equation.
Draw the line that best covers the points; this line is considered the “best fit” line.
The mathematical equation of the line is y=a+bx+ε
- b – Slope of the line
- a – y intercept (the value of y when x=0)
- Random error (ε-Epsilon) – The difference between an observed value of y and the mean value of y for a given value of x.
Assumptions of Linear Regression
- Linear relationship between dependent and independent variable
- All variables in the regression are multivariate normal
- There is little or no multicollinearity in the data
- The response variable is continuous, and the residuals have roughly constant variance along the regression line (homoscedasticity)
The method of Least Squares
The method of least squares is a standard approach in regression analysis for determining the best fit line for a given data set. It summarizes the relationship among the data points with a single line.
In general, the dependent variable is plotted on the y-axis and the independent variable on the x-axis. The least squares method determines the position of a straight line, also called a trend line, and the equation of that line. This straight line is also known as the best fit line.
The least squares method finds the line that minimizes the sum of the squared errors across all of the data points. For instance, the least squares equations can be used to find the values of the coefficients a and b.
The least squares estimators of a and b are computed as follows:
Compute the â and b̂ values and then substitute them into the equation of the line to obtain the least squares prediction equation, or regression line.
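In symbols, the standard least squares estimators are b̂ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and â = ȳ − b̂x̄. A minimal sketch in plain Python (the data points here are hypothetical, not the sales example below):

```python
def least_squares(x, y):
    """Return (a_hat, b_hat) for the fitted line y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: sum of cross-deviations over sum of squared x-deviations
    b_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
            / sum((xi - x_bar) ** 2 for xi in x)
    # Intercept: the least squares line passes through (x_bar, y_bar)
    a_hat = y_bar - b_hat * x_bar
    return a_hat, b_hat

# Hypothetical data lying exactly on the line y = 2 + 3x
a_hat, b_hat = least_squares([1, 2, 3, 4], [5, 8, 11, 14])
# a_hat == 2.0, b_hat == 3.0
```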
Linear Regression example in DMAIC
Example: Linear regression is typically used in the Analyze phase of DMAIC to estimate the mathematical relationship between a dependent variable and an independent variable.
A passenger vehicle manufacturer is reviewing the training records of 10 salespersons. The aim is to compare each salesperson's achieved sales target (in %) with the number of sales training modules completed out of 20 modules.
Compute the least square prediction equation or regression line
Then predict y for a given value of x by substituting into the prediction equation. For example, if a salesperson completes 15 training modules, the predicted achieved sales target would be:
ŷ = 31.09+3.5742(15)= 84.7019=84.7%
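The substitution can be verified in a couple of lines of Python, using the rounded coefficients quoted in the example:

```python
a_hat, b_hat = 31.09, 3.5742   # least squares estimates from the example
y_hat = a_hat + b_hat * 15     # predicted target % after 15 modules
print(round(y_hat, 1))         # 84.7
```

(With the rounded intercept the result is 84.703, which agrees with the quoted 84.7019 to one decimal place; the small difference comes from rounding the coefficients.)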
Estimate the variability of random errors
Referring to the mathematical equation of the line, y = a + bx + ε, the least squares line is ŷ = â + b̂x.
A random error (ε) affects the error of prediction. Hence the variability of the random errors (σε²) is the key parameter when predicting with the least squares line.
Estimate the variability of the random error, σε²
Example: From the above data, compute the variability of the random errors
From the above calculation σ̂ε is 5.38. Thus most of the points should fall within ±1.96σ̂ε, i.e. ±10.54, of the line, so approximately 95% of the values should lie in this region. The graph above confirms that all the values are within ±10.54 of the line.
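The estimate follows σ̂ε² = SSE/(n−2), where SSE = Σ(yᵢ − ŷᵢ)² is the sum of squared residuals about the fitted line. A minimal sketch in Python (the data points and fitted line here are hypothetical, not the sales example):

```python
import math

def residual_std_error(x, y, a_hat, b_hat):
    """Estimate sigma_e as sqrt(SSE / (n - 2)), where SSE is the sum
    of squared residuals about the fitted line y = a_hat + b_hat * x."""
    sse = sum((yi - (a_hat + b_hat * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(sse / (len(x) - 2))

# Hypothetical points scattered around the line y = 2 + 3x
se = residual_std_error([1, 2, 3, 4], [5.1, 7.9, 11.2, 13.8], 2.0, 3.0)
# se ≈ 0.224; about 95% of points should fall within ±1.96*se of the line
```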
Test of slope coefficient
A significant relationship between the dependent and independent variables can be tested by checking whether b equals 0. If b is not equal to 0, there is a linear relationship. The null and alternative hypotheses are:
- The null hypothesis H0 : b=0
- The alternative hypothesis H1: b≠0
Degrees of freedom = n-2
Example: From the above data determine if the slope results are significant at a 95% confidence level
Determine the critical values of t for 8 degrees of freedom at 95% confidence level
t0.025, 8 = -2.306 and 2.306
The calculated t value is 5.481, which does not fall between −2.306 and 2.306; we reject the null hypothesis whenever the t value is greater than 2.306 or less than −2.306.
In this case, therefore, we reject the null hypothesis and conclude that b ≠ 0: there is a linear relationship between the dependent and independent variables.
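The decision rule above can be expressed in a couple of lines of Python, using the example's t statistic and critical value:

```python
def reject_null(t_stat, t_crit):
    """Two-tailed t test: reject H0 (b = 0) when |t| exceeds the critical value."""
    return abs(t_stat) > t_crit

# Example's figures: t = 5.481, critical value t(0.025, 8) = 2.306
print(reject_null(5.481, 2.306))  # True — the slope is significant
```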
Confidence interval estimate for the slope b
The confidence interval estimate for the slope b is
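The interval takes the form b̂ ± t(α/2, n−2) · SE(b̂). As a rough check using the example's figures, SE(b̂) can be recovered as b̂/t = 3.5742/5.481 ≈ 0.652 (an approximation, since both numbers are rounded):

```python
def slope_confidence_interval(b_hat, se_b, t_crit):
    """Return (lower, upper) for b_hat +/- t_crit * SE(b_hat)."""
    margin = t_crit * se_b
    return b_hat - margin, b_hat + margin

# SE(b_hat) recovered from the example's t statistic: 3.5742 / 5.481 ≈ 0.652
low, high = slope_confidence_interval(3.5742, 3.5742 / 5.481, 2.306)
print(round(low, 2), round(high, 2))  # 2.07 5.08
```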
Example: from the above data, compute the confidence interval around the slope of the line
The linear correlation coefficient r measures the strength of the linear relationship between the paired x and y values in a sample.
Pearson’s Correlation Coefficient
Example: from the above data, find the correlation coefficient
Note that -1≤ r ≤ +1
- The line slopes upward to the right when r is positive
- The line slopes downward to the right when r is negative
- A value closer to +1 indicates a stronger positive linear relationship
- A value closer to −1 indicates a stronger negative linear relationship
- r = 0 implies no linear correlation
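Pearson's r can be computed directly from the deviation sums, r = Σ(x−x̄)(y−ȳ) / √(Σ(x−x̄)² Σ(y−ȳ)²). A minimal sketch in plain Python (the data points are hypothetical):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of paired samples x and y."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

print(pearson_r([1, 2, 3, 4], [5, 8, 11, 14]))  # 1.0 — perfect positive correlation
```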
The correlation coefficient is often used to compare bivariate data, e.g. job satisfaction stratified by income.
The correlation coefficient varies between -1 and +1. Values approaching -1 or +1 indicate strong correlation (negative or positive) and values close to 0 indicate little or no correlation between x and y.
Correlation does not mean causation.
A positive correlation can be either good news or bad news
A negative correlation is not necessarily bad news. It merely means that as the independent variable increases, the dependent variable decreases.
r = 0 does not by itself indicate the absence of a relationship; a curvilinear pattern may exist. Also note that r = −0.76 has the same predictive power as r = +0.76.
Coefficient of determination (R2)
The coefficient of determination is the proportion of the total variation that is explained by the regression, i.e. the explained variation divided by the total variation, when a linear regression is performed.
r² lies in the interval 0 ≤ r² ≤ 1.
Example: from the above data, compute the coefficient of determination
We can say that 79% of the variation in sales target achieved can be explained by variation in number of training modules completed.
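Since r² is simply the square of the correlation coefficient, the example's 79% corresponds to r ≈ 0.89 (the exact value depends on the unrounded r). A quick check in Python:

```python
r = 0.889                   # correlation coefficient (approximate)
r_squared = r ** 2          # coefficient of determination
print(round(r_squared, 2))  # 0.79 — matches the 79% quoted above
```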
Linear Regression Related Topics
Residual Analysis: “Because a linear regression model is not always appropriate for the data, you should assess the appropriateness of the model by defining residuals and examining residual plots.”