A data distribution is a function that specifies all possible values of a variable and quantifies their relative frequency (the probability of how often each value occurs). Any population whose data are scattered has a distribution. It’s important to determine which kind of distribution a population has so that we can apply the correct statistical methods when analyzing it.
Data distributions are widely used in statistics. Suppose an engineer collects 500 data points on a shop floor. The data gives no value to management unless the engineer categorizes or organizes it in a useful way. Data distribution methods organize the raw data into graphical displays (such as histograms, box plots, and run charts) and provide useful information.
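As a rough illustration of organizing raw measurements, the sketch below bins a small set of readings into a frequency table and prints a text histogram. The readings, the bin width, and the function name are made up for illustration only.

```python
from collections import Counter

# Hypothetical shop-floor readings (made up for illustration).
readings = [98, 101, 100, 99, 102, 100, 101, 97, 100, 103, 99, 100]
BIN_WIDTH = 2  # assumed bin width, in the same units as the readings


def frequency_table(data, width):
    """Map each bin's lower edge to the number of values in [edge, edge + width)."""
    counts = Counter((x // width) * width for x in data)
    return dict(sorted(counts.items()))


table = frequency_table(readings, BIN_WIDTH)
for lower_edge, count in table.items():
    # One '#' per observation gives a quick text histogram.
    print(f"{lower_edge:>4}-{lower_edge + BIN_WIDTH - 1:<4} {'#' * count}")
```

Even this crude display shows where the readings cluster, which the raw list does not.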
The basic advantage of a data distribution is that it lets us estimate the probability of any specific observation in a sample space. A probability distribution is a mathematical model that gives the probability of occurrence of the different possible outcomes of a test or experiment. These models are used to define different types of random variables (typically discrete or continuous) and to make decisions based on them. Depending on the category of the random variable, one can use the mean, mode, range, probability, or other statistical methods.
Types of Distribution
Distributions are basically classified based on the type of data (typically discrete or continuous).
A discrete distribution results from countable data with a finite or countably infinite number of possible values. Discrete distributions can be reported in tables, and the respective values of the random variable are countable. Ex: rolling dice, counting the number of heads in a series of coin tosses, etc. Discrete distributions are defined by probability mass functions (pmf).
Probability mass function (pmf): A probability mass function is a frequency function that gives the probability of each value of a discrete random variable; it is also known as a discrete density function.
Simply: discrete = counted
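A pmf can be built by simply counting outcomes. The minimal sketch below uses the total of two fair dice as an assumed example and stores the pmf as a table of exact fractions; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate all 36 equally likely outcomes of rolling two fair dice.
outcomes = [a + b for a, b in product(range(1, 7), repeat=2)]
counts = Counter(outcomes)

# pmf: the probability of each possible total, as an exact fraction.
pmf = {total: Fraction(n, len(outcomes)) for total, n in sorted(counts.items())}

for total, prob in pmf.items():
    print(total, prob)
```

Because every possible value is listed with its probability, the probabilities in the table add up to 1.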
Different types of discrete distributions are
- The binomial distribution gives the probability of a given number of successes in a fixed number of independent trials, where each trial has the same probability of success.
- Characteristics that are classified into two mutually exclusive and exhaustive classes, such as number of successes/failures or number accepted/rejected, follow the binomial distribution.
- Ex: Tossing a coin: the probability of the coin landing heads is ½ and the probability of it landing tails is ½.
- The Poisson distribution is the discrete probability distribution that measures the likelihood of a number of events occurring in a given time period, when the events occur independently of one another at a constant average rate.
- Characteristics that can theoretically take large values but in practice take small values follow the Poisson distribution.
- Ex: Number of defects, errors, accidents, absentees, etc.
- The hypergeometric distribution is a discrete distribution that measures the probability of a specified number of successes in (n) trials drawn without replacement from a finite population (N).
- The hypergeometric distribution is similar to the binomial distribution.
- The basic difference is that the binomial distribution requires the probability of success to be the same for all trials, while in the hypergeometric distribution it changes from trial to trial because sampling is done without replacement.
- The geometric distribution is a discrete distribution that measures the likelihood of when the first success will occur.
- The negative binomial distribution may be considered an extension of it.
- Ex: A marketing representative from an advertising agency randomly selects hockey players from various universities until he finds one who has competed in the Olympics.
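The four discrete distributions listed above each have a standard textbook pmf. The sketch below writes them out with the Python standard library; the function and parameter names (such as lam for the Poisson rate and K for the number of successes in the population) are assumptions chosen for illustration.

```python
from math import comb, exp, factorial


def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials, each with success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(exactly k events in an interval) when the average number of events is lam."""
    return exp(-lam) * lam**k / factorial(k)


def hypergeometric_pmf(k, N, K, n):
    """P(k successes in n draws without replacement from N items, K of which are successes)."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def geometric_pmf(k, p):
    """P(the first success occurs on trial k), for k = 1, 2, 3, ..."""
    return (1 - p) ** (k - 1) * p


# Assumed examples: 5 heads in 10 fair coin tosses, and first head on the third toss.
print(binomial_pmf(5, 10, 0.5))
print(geometric_pmf(3, 0.5))
```

Note how the hypergeometric pmf needs the population size N, while the binomial pmf does not: that is the practical consequence of sampling without replacement.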
A continuous distribution contains an infinite number of (variable) data points that may be displayed on a continuous measurement scale. A continuous random variable is a random variable whose set of possible values is infinite and uncountable. It measures something rather than just counting, and it is typically described by a probability density function (pdf).
Probability density function (pdf): The probability density function describes the behavior of a continuous random variable. It resembles a smoothed grouped frequency distribution (histogram); hence the probability density function can be seen as the ‘shape’ of the distribution.
Simply: continuous = measured (can take any value within a range)
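One point that is easy to miss is that a pdf value is a density, not a probability: probabilities come from the area under the curve over an interval. The sketch below illustrates this with Python's statistics.NormalDist; the mean of 10.0 and standard deviation of 0.2 are assumed values for illustration.

```python
from statistics import NormalDist

# Assumed example: a dimension with mean 10.0 and standard deviation 0.2.
dist = NormalDist(mu=10.0, sigma=0.2)

# The pdf gives the height of the curve, which can exceed 1.
density_at_mean = dist.pdf(10.0)

# Probabilities are areas: P(9.8 <= X <= 10.2) = cdf(10.2) - cdf(9.8).
prob_within_one_sigma = dist.cdf(10.2) - dist.cdf(9.8)

print(f"density at the mean: {density_at_mean:.3f}")
print(f"P(within one sigma): {prob_within_one_sigma:.4f}")
```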
Different types of continuous distributions are
- The normal distribution, also known as the Gaussian distribution, is a symmetrical bell-shaped curve with higher frequency (probability density) around the central value and frequency decreasing sharply as values move away from the central value on either side.
- In other words, characteristics whose dimensions are expected to fall on either side of the target value with equal probability follow the normal distribution.
- Mean, Median and Mode are equal for normal distribution.
- A continuous random variable x follows a lognormal distribution if its natural logarithm, ln(x), follows a normal distribution.
- When many independent random variables are summed, the distribution of the sum approaches a normal distribution as the number of terms increases, regardless of the distribution of the individual variables. The same holds for multiplication: the product of many independent positive random variables approaches a lognormal distribution, because the logarithm of a product is a sum of logarithms.
- The location parameter is the mean of the data set after transformation by taking the logarithm, and the scale parameter is the standard deviation of the data set after that transformation.
- The F distribution is used extensively to test for equality of variances from two normal populations.
- The F distribution is an asymmetric distribution that has a minimum value 0, but no maximum value.
- Notably the curve approaches zero but never quite touches the horizontal axis.
- The chi square distribution results when independent variables with standard normal distribution are squared and summed.
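- Ex: if $Z_1, Z_2, \dots, Z_n$ are independent standard normal random variables, then
- $y = Z_1^2 + Z_2^2 + Z_3^2 + Z_4^2 + \dots + Z_n^2$ follows a chi square distribution with n degrees of freedom.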
- The chi square distribution is asymmetrical (skewed to the right), bounded below by zero, and approaches the normal distribution in shape as the degrees of freedom increase.
- The exponential distribution is one of the most widely used continuous distributions. It is often used to model items with a constant failure rate.
- The exponential distribution is closely related to the Poisson distribution: when events follow a Poisson process, the time between consecutive events is exponentially distributed.
- It has a constant failure rate because its shape is always the same; only its scale changes.
- Ex: The lifetime of a light bulb, the time between fires in a city
- The t distribution, or Student’s t distribution, is a bell-shaped probability distribution that is symmetrical about its mean.
- Commonly used for hypothesis testing and constructing confidence intervals for means.
- Used in place of the normal distribution when the population standard deviation is unknown and must be estimated from the sample.
- As with the normal distribution, when the random variables are averages, the distribution of the average tends to be normal, regardless of the distribution of the individuals.
- The basic purpose of the Weibull distribution is to model time-to-failure data.
- Widely used in reliability, medical research, and statistical applications.
- Assumes many shapes depending on its shape, scale, and location parameters. Effect of the shape parameter β on the Weibull distribution:
- If β is 1, it becomes identical to the exponential distribution.
- If β is 2, it becomes the Rayleigh distribution.
- If β is between 3 and 4, it approximates the normal distribution.
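Two of the relationships in the list above can be checked numerically. The sketch below simulates chi square values by squaring and summing standard normal draws, and compares a Weibull density with shape parameter 1 against an exponential density. The function names, the seed, the sample size, and the two-parameter Weibull form (location parameter assumed to be 0) are all assumptions made for illustration.

```python
import random
from math import exp


def chi_square_sample(df, rng):
    """One chi square draw: the sum of df squared standard normal values."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))


def weibull_pdf(t, shape, scale):
    """Two-parameter Weibull density for t > 0 (location parameter assumed 0)."""
    return (shape / scale) * (t / scale) ** (shape - 1) * exp(-((t / scale) ** shape))


def exponential_pdf(t, rate):
    """Exponential density with constant failure rate `rate`."""
    return rate * exp(-rate * t)


rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [chi_square_sample(4, rng) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)
print(f"chi square (4 df) sample mean: {sample_mean:.2f}  (theory: 4)")
print(f"all draws are non-negative: {min(draws) >= 0}")

# With shape = 1 the Weibull density matches the exponential with rate = 1 / scale.
print(weibull_pdf(3.0, shape=1.0, scale=2.0), exponential_pdf(3.0, rate=0.5))
```

The mean of a chi square variable equals its degrees of freedom, which is why the simulated mean should land close to 4.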
Hypothesis tests generally assume that the data are a sample from a particular distribution, most commonly the normal distribution, but that is not always the case: the data may not follow a normal distribution. Nonparametric tests are therefore used when no specific distribution is assumed for the population.
Nonparametric test results are more robust against violations of the assumptions. Types of nonparametric tests include the Sign Test, Mood’s Median Test (for two samples), the Mann-Whitney Test for Independent Samples, the Wilcoxon Signed-Rank Test for a Single Sample, and the Wilcoxon Signed-Rank Test for Paired Samples.
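To show how little a nonparametric test assumes, here is a minimal sketch of a two-sided sign test for the median of a single sample. It only counts how many values fall above and below the hypothesized median and uses the fact that, under the null hypothesis, each sign is equally likely. The function name and the cycle-time data are made up for illustration.

```python
from math import comb


def sign_test_p_value(sample, hypothesized_median):
    """Two-sided sign test p-value for the median of a single sample."""
    above = sum(1 for x in sample if x > hypothesized_median)
    below = sum(1 for x in sample if x < hypothesized_median)
    n = above + below  # values equal to the hypothesized median are dropped
    k = min(above, below)
    # Under the null hypothesis, each sign is + or - with probability 1/2.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# Made-up cycle times; is a median of 20 plausible?
cycle_times = [22, 25, 19, 27, 24, 26, 23, 28, 21, 20]
p_value = sign_test_p_value(cycle_times, 20)
print(f"p-value: {p_value:.4f}")
```

Here 8 values lie above 20 and 1 lies below, so the p-value is 2 × (1 + 9) / 2⁹ ≈ 0.039, with no assumption about the shape of the underlying distribution.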
- Continuous distributions (like normal, chi square, and exponential) and discrete distributions (like binomial and geometric) are probability distributions of one random variable.
- A bivariate distribution, in contrast, gives the probability of events that involve two random variables; it may be a continuous or a discrete distribution.
- A bivariate distribution is unique because it is the joint distribution of two variables.
- A bimodal distribution has two modes; in other words, two outcomes that are more likely than the other outcomes in their region.
- Ex: Two sources of data feeding into a single process.
How to Evaluate a Data Distribution
The shape of a distribution is described by its number of peaks and by its symmetry, skewness, or uniformity. These shapes graphically illustrate the spread (dispersion, variability, or scatter) of the data.
Evaluate the shape of the distribution: symmetric or asymmetric?
Symmetric distribution: A symmetric distribution often appears as a bell curve. In a symmetric distribution with a single peak, the mean, median, and mode occur at the same point, and values at equal distances on either side of the center occur with equal frequency. Both sides of the mean match and mirror each other.
Examples of symmetric data distributions: Normal, Uniform
Asymmetric distribution: It is the opposite of a symmetric distribution: it is skewed, in other words it has non-zero skewness. Skewness is a measure of the lack of symmetry. This distribution is either left-skewed (also known as negatively skewed) or right-skewed (also known as positively skewed). Both sides of the mean do NOT match.
Examples of asymmetric data distributions: Exponential, Gamma, Log-normal, and Weibull
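Skewness can be computed directly from the data. The sketch below uses one common moment-based definition (the third central moment divided by the cube of the standard deviation); other definitions with small-sample corrections exist, and the three data sets are made up for illustration.

```python
from statistics import fmean


def sample_skewness(data):
    """Moment-based skewness: third central moment over the cubed standard deviation."""
    n = len(data)
    mean = fmean(data)
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2**1.5


symmetric = [1, 2, 3, 4, 5, 6, 7]
right_skewed = [1, 1, 2, 2, 3, 4, 12]  # long tail toward large values
left_skewed = [-x for x in right_skewed]  # mirror image: long tail toward small values

for name, data in [("symmetric", symmetric), ("right", right_skewed), ("left", left_skewed)]:
    print(f"{name:>9}: skewness = {sample_skewness(data):+.3f}")
```

The symmetric data set gives a skewness of zero, the right-skewed set a positive value, and its mirror image the same value with a negative sign.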
Statistical Tests Used to Identify Data distribution
There are different methods to test the normality of data, including visual (graphical) methods and quantifiable (numerical) methods.
Visual method: Visual inspection may be used to assess whether data are normally distributed, although this method is subjective and does not guarantee that the distribution is normal. Still, visual methods help the user judge the normality of the data.
Ex: Histogram, boxplot, stem-and-leaf plot, probability-probability plot, and quantile-quantile plot.
Quantifiable method: Quantifiable methods are supplementary to the visual methods. These tests compare the scores in the sample to a normally distributed set of scores with the same mean and standard deviation.
Ex: Anderson-Darling Test, Shapiro-Wilk W Test, Kolmogorov-Smirnov Test, etc.
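The idea behind these numerical tests can be sketched with the Kolmogorov-Smirnov statistic: the largest gap between the sample's empirical cumulative distribution and a normal distribution with the same mean and standard deviation. The code below computes only the statistic D, not a p-value (which needs tables or a statistics package, and an adjustment when the mean and standard deviation are estimated from the same sample). The function name, seed, and simulated data are assumptions for illustration.

```python
import random
from statistics import NormalDist, fmean, stdev


def ks_statistic_normal(data):
    """Largest gap between the empirical CDF and a normal CDF fitted to the data."""
    xs = sorted(data)
    n = len(xs)
    fitted = NormalDist(fmean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


rng = random.Random(1)  # fixed seed so the run is repeatable
bell_shaped = [rng.gauss(50.0, 5.0) for _ in range(500)]
skewed = [rng.expovariate(1.0) for _ in range(500)]

d_bell = ks_statistic_normal(bell_shaped)
d_skewed = ks_statistic_normal(skewed)
print(f"D for bell-shaped data: {d_bell:.3f}")
print(f"D for skewed data:      {d_skewed:.3f}")
```

A larger D means the sample departs further from the fitted normal curve, so the skewed sample should produce the larger value.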
Other Data Distribution Notes:
Don’t like or can’t use the existing distribution? Use a data transformation to turn the data set into something more easily analyzed.
How to Make a Process Follow a Normal Distribution by Using Transforms
Sometimes you will be analyzing a process and the data will come out in a non-normal shape. Since normal distributions have wonderful mathematical properties that make analysis and control so much easier, try to transform the data to a normal distribution if possible.
One approach to a non-normal distribution is to transform the data to “normalize” it. Some typical data transformation methods are Box Cox, log, square root or power, exponential, and reciprocal transformations.
- A Box Cox transformation is a useful power transformation technique for transforming non-normal dependent variables into a normal shape.
- The method is named after its authors, George Box and Sir D. R. Cox.
- The applicable formula is $y' = y^{\lambda}$, where λ is the power (parameter) used to transform the data.
- For instance, if λ=2 the data is squared, and if λ=0.5 a square root is taken.
- The Z transformation is widely used as an analysis tool in signal processing.
- It is a generalization of the Discrete-Time Fourier Transform (DTFT); in particular, it applies to signals for which the DTFT doesn’t exist, thus making it possible to analyze those signals.
- It also gives insight into system properties such as stability and causality.
- The Z transform is the discrete-time counterpart of the Laplace transform.
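Both transformations listed above can be sketched in a few lines. The power transformation follows the simple formula given in the list, with one added assumption: λ = 0 is treated as the natural logarithm, which is the usual Box Cox convention but is not stated above. The Z transform is evaluated for a finite signal starting at n = 0 and compared with the DTFT on the unit circle. All function names and sample values are made up for illustration.

```python
import cmath
from math import log


def power_transform(values, lam):
    """Apply y' = y**lam to positive data; lam = 0 is treated as the natural log."""
    if any(v <= 0 for v in values):
        raise ValueError("this sketch assumes strictly positive data")
    if lam == 0:
        return [log(v) for v in values]
    return [v**lam for v in values]


def z_transform(signal, z):
    """Evaluate X(z) = sum of x[n] * z**(-n) for a finite signal starting at n = 0."""
    return sum(x * z ** (-n) for n, x in enumerate(signal))


def dtft(signal, omega):
    """Discrete-Time Fourier Transform of the same finite signal at frequency omega."""
    return sum(x * cmath.exp(-1j * omega * n) for n, x in enumerate(signal))


data = [1.0, 4.0, 9.0, 16.0]
print(power_transform(data, 2))    # every value squared
print(power_transform(data, 0.5))  # square root of every value

signal = [1.0, 0.5, 0.25, 0.125]  # made-up samples
omega = 0.3
on_unit_circle = cmath.exp(1j * omega)

# On the unit circle (|z| = 1) the Z transform reduces to the DTFT.
print(abs(z_transform(signal, on_unit_circle) - dtft(signal, omega)))
```

In practice λ is usually chosen by a statistics package so that the transformed data look as normal as possible, rather than being picked by hand.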