Data distribution is a function that specifies all possible values for a variable and also quantifies the relative frequency (probability of how often they occur). Distributions are considered any population that has a scattering of data. It’s important to determine the kind of distribution that population has so we can apply the correct statistical methods when analyzing it.
Data distributions are widely used in statistics. Suppose an engineer collects 500 data points in a shop floor it does not give any value to the management unless he/she categorize or organize the data in a useful way. Data distribution method organizes the raw data into graphical methods (like histograms, box plots, run charts etc) and provides the useful information.
The basic advantage of data distribution is to estimate the probability of any specific observation in a sample space. Probability distribution is a mathematical model that calculates the probability of occurrence of different possible outcomes in a test or experiment. Used to define different types of random variables (Typically discreet or continuous) to make the decision depends on these models. Based on random variable category one can use mean, mode, range, probability or other statistical methods.
Types of Distribution
Distributions are basically classified based on the type of data (Typically discreet or continuous)
A discrete distribution resulting from countable data that has finite number of possible values. Furthermore discrete distributions can be reported in tables and the respective values of the random variable are countable. Ex: rolling dice, choosing a number of heads etc.
Probability mass function (pmf): Probability mass function is a frequency function which gives the probability for discrete random variables, also known as discrete density function.
Simply Discrete= counted
Different types of discreet distributions are
- The binomial distribution measures the probability of the number of successes or failure outcome in an experiment in each try.
- Characteristics that are classified into two mutually exclusive and exhaustive classes, such as number of success/failures, number accepted/rejected follow binomial distribution.
- Ex: Tossing a coin: Probability of coin landing Head is ½ and probability of coin landing tail is ½
- The Poisson distribution is the discrete probability distribution that measures the likelihood of a number of events occurring in a given time period, when the events occur one after the another in time in a well-defined manner.
- Characteristics that can theoretically take large values, but actually take small values have Poisson distribution.
- Ex : Number of defects, errors , accidents, absentees etc.
- Hypergeometric distribution is a discrete distribution that measures the probability of a specified number of successes in (n) trials, without replacement, from a relatively large population (N). In other words, sampling without replacement.
- The hypergeometric distribution is similar to binomial distribution;
- The basic difference of binomial distribution is that probability of success to be the same for all trails while it is not same case for hypergeometric distribution.
- Geometric distribution is a discrete distribution that measures the likelihood of when the first success will occur.
- An extension of it may be consider as negative binomial distribution.
- Ex: Marketing representative from advertising agency randomly selects the hockey players from various universities until he finds a hockey player attended the Olympics.
A continuous distribution containing infinite (variable) data points that may be displayed on a continuous measurement scale. A continuous random variable is a random variable with a set of possible values that is infinite and uncountable. It measures something rather than just count and typically described by probability density functions (pdf).
Probability density function (pdf): The probability density function describe the behavior of a random variable. It is normally grouped frequency distribution. Hence probability density function see it as ‘shape’ of the distribution.
Simply Continuous = can take many different values
Different types of continuous distributions are
- Normal distribution is also known as Gaussian distribution. It is a symmetrical bell shape curve with higher frequency (probability density) around the central value. The frequency sharply decreasing as values are away from the central value on either side.
- In other words characteristics whose dimensions are expect on either side of the aimed at value with equal probability, follow normal distribution.
- Mean, Median and Mode are equal for normal distribution.
- A continuous random variable x follows a lognormal distribution if its natural logarithm, ln(x) follows a normal distribution.
- When you sum the random variables, as the sample size increases, the distribution of the sum becomes a normal distribution, regardless of the distribution of the individuals. Same scenario for multiplication.
- The location parameter is the mean of the data set after transformation by taking the logarithm, and also the scale parameter is the standard deviation of the data set after transformation.
- The F distribution extensively use to test for equality of variances from two normal populations
- The F distribution is an asymmetric distribution that has a minimum value 0, but no maximum value.
- Notably the curve approaches zero but never quite touches the horizontal axis.
- The chi square distribution results when independent variables with standard normal distribution are squared and summed.
- Ex: if Z is standard normal random variable then
- y =Z12+ Z22 +Z32 +Z42+…..+ Zn2
- The chi square distribution is symmetrical, bounded below by zero. And approaches the normal distribution in shape as the degrees of freedom increases.
- The exponential distribution is the probability distribution and of the widely used continuous distributions. Often used to model items with a constant failure rate.
- The exponential distribution is closely related to the Poisson distribution.
- Has a constant failure rate as it will always have the same shape parameters.
- Ex: The lifetime of a bulb, the time between fires in a city
- t distribution or student’s t distribution is a bell shape probability distribution, symmetrical about its mean.
- Commonly used for hypothesis testing and constructing confidence intervals for means.
- Used in place of the normal distribution when the standard deviation is unknown.
- Like the normal distribution, when random variables are averages, the distribution of the average tends to be normal, regardless of the distribution of the individuals.
- The basic purpose of Weibull distribution is to model time-to-failure data.
- Widely used in reliability, medical research and statistical applications.
- Assumes many shapes depending upon the shape, scale, and location parameters. Effect of Shape parameter β on Weibull distribution:
- For instance, if shape parameter β is 1, it becomes identical to exponential distribution.
- If β is 2, then Rayleigh distribution.
- and If β between 3 and 4, then Normal distribution.
Generally an assumption is that while performing a hypothesis test that the data is a sample from a certain distribution commonly normal distribution, but always that is not the case that data may not follow normal distribution. Hence nonparametric tests used when there is no assumption of a specific distribution for the population.
Particularly nonparametric test results are more robust against violation of the assumptions. Different types of nonparametric test are Sign test, Mood’s Median Test (for two samples) , Mann-Whitney Test for Independent Samples, Wilcoxon Signed-Rank Test for a Single Sample , Wilcoxon Signed-Rank Test for Paired Samples
- The continuous distribution (like normal, chi square, exponential) and discrete distribution (like binomial, geometric) are the probability distribution of one random variable
- Whereas bivariate distribution is a probability of a certain event occur in case two independent random variables exists it may be continuous or discrete distribution.
- Bivariate distribution is unique because it is the joint distribution of two variables.
- A bi-modal distribution which has two modes, in other words two outcomes that are most likely compare the outcomes of their region.
- 2 sources of data coming into a single process screen.
How to Evaluate a Data Distribution
The shape of data distribution is depicted by its number of peaks and symmetry possession, skewness, or uniformity. Skewness is a measure of the lack of symmetry. In other words, skewness is the measure of how much the probability distribution of a random variable deviates from the Normal Distribution.
The skewed distribution is either left-side (also known as a negatively skewed distribution) or right-side (known as a positively skewed distribution). Both sides of the mean do NOT match.
Use Graphical Analysis to show how the points are distributed and the data arranged throughout the set. These distributions graphically illustrate the spread (dispersion, variability, or scatter) of the data.
Generally, symmetrical distribution appears as a bell curve. The perfect normal distribution is the probability distribution that has zero skewness. However, it is always impossible to have a perfect normal distribution in the real world, so the skewness is not equal to zero; it is almost zero. Symmetrical distribution occurs when mean, median, and mode occur at the same point, and the values of variables occur at regular frequencies. Both sides of the mean match & mirror each other.
Example: The weights of high school students are reported between 80lb to 100lbs, while the majority of students weights are around 90lbs. The weights are equally distributed on both sides of 90lb, which is the center value. This type of distribution is called a Normal Distribution.
Similarly, symmetric distribution also can be reviewed with boxplot
The above graph is a boxplot of symmetric distribution. The distance between Q1 & Q2 and also Q2 & Q3 is equal.
Q3-Q2 = Q2-Q1
Though the distance between Q1 & Q2 and Q2 & Q3 is equal, that is not sufficient to conclude that the data follows a symmetric distribution. The team also has to look at the length of the whisker, if the distance of the whisker is equal, then we can conclude the distribution is symmetric.
Examples of Symmetric data distributions :Normal Distribution, Uniform
Positively Skewed Distribution
A distribution is said to be skewed to the right if it has a long tail that trails toward the right side. The skewness value of a positively skewed distribution is greater than zero.
Example: Income details of the manufacturing employees in Chicago indicates that the majority of people earn somewhere between $20K to $50K per annum. Very few earn less than $10K, and very few earn $100K. The center value is $50K. It is very clear from the graph a long tail is on the right side of the center value.
As the tail is on the positive side of the center value, the distribution is positively skewed. Unlike symmetric distribution, it is not equally distributed on both sides of the center value. From the graph, it is clearly understood that the mean value is the highest one, followed by median and mode.
Since the skewness of the distribution is towards the right, the mean is greater than the median and ultimately move towards the right. Also, the mode of the values occurs at the highest frequency, which is on the left side of the median. Hence, mode < median < mean.
From the above boxplot, Q2 is present very near to Q1. So, it is a positive skew distribution.
So, it clearly indicates that the data is skewness towards positive. But, for instance, if the graph is like below, then:
From the above picture, Q2-Q1 and Q3-Q2 are equal. But the distribution is still positive skew distribution because the length of the right whisker is much greater than the left whisker. So, we can conclude that the data is positively skewed.
In summary, always check the equality of Q2-Q1 and Q3-Q2. If it is equal, then check the length of whiskers to conclude the data distribution.
Negatively Skewed Distribution
A distribution is said to be skewed to the left if it has a long tail that trails toward the left side. The skewness value of a negatively skewed distribution is less than zero.
Example: A professor collected students’ marks in a science subject. The majority of students score between 50 and 80 while the center value is 50 marks. The long tail is on the left side of the center value because it is skewed the left-hand side of the center value. So the data is negative skew distribution.
Here mean < median < mode
From the above boxplot, Q2 is present very near to Q3. So, it is a negative skew distribution.
Similarly, like above, Q2-Q1 and Q3-Q2 are equal. But the distribution is still negatively skewed because the length of the left whisker is much greater than the right whisker. So, we can conclude that the data is negatively skewed.
It is good to transform the skewed data to normally distributed data. Data can transform using methods like Power transformation, Log transformation, Exponential transformation.
Statistical Tests Used to Identify Data distribution
There are different methods to test the normality of data, including visual or graphical method and Quantifiable or numerical methods.
Visual method: Visual inspection approach may be used to assess the data distribution normality, although this method is unpredictable and does not guarantee that the data distribution is normal. However, visual method somewhat help user to judge the data normality.
Ex: Histogram), boxplot, stem-and-leaf plot, probability-probability plot, and quantile-quantile plot.
Quantifiable method: Quantifiable methods are supplementary to the visual methods. Particularly these tests compare the scores in the sample to a normally distributed set of scores with the same mean and standard deviation.
Ex: Anderson-Darling Test, Shapiro-Wilk W Test, Kolmogorov-Smirnov Test etc.,
Other Data Distribution Notes:
Don’t like or can’t use the existing distribution? So, use a data transformation to turn the data set into something more easily analyzed.
How to Make a Process Follow a Normal Distribution by Using Transforms
Sometimes you will be analyzing a process and the data will come out in a non-normal shape. Since, normal distributions have wonderful mathematical properties that make analysis and control so much easier, try to transform the data to a normal distribution if possible.
The approach to address the non-normal distribution is to make transformation to “normalize” the data. Some typical data transformation methods are Box Cox, Log transformation, Square root or power transformation, Exponential and Reciprocal etc.,
- A Box Cox transformation is a useful power transformation technique to transform non-normal dependent variables into a normal shape.
- George Box and Sir D.R.Cox. are the authors for this method
- The applicable formula is yl =yλ (λ is the power or parameter the to be transform the data).
- For instance, λ=2, the data is squared and if λ=0.5 a square root is required.
- Z transformation is an analysis tool in signal processing
- It is a generalization of the Discrete-Time Fourier Transform (DTFT), in particular it applies to signals for which DFTF doesn’t exists thus allowing to analyze those signals
- It also helps to see the new ideas in the sense of a system with respect to stability and causality
- Z transform is the discrete time counterpart to the Lapse transform