Measures of Variability using R

Measures of variability tell us how spread out the data are, that is, how far the values lie from the mean or median. Here, we’ll discuss 6 common measures of variability:

Range
Interquartile Range
Mean Absolute Deviation
Variance
Standard Deviation
Median Absolute Deviation

Let’s look at each of these in detail. The examples use the ‘iris’ dataset, which is built into R.

1. Range

The range is the largest value minus the smallest value.

#Range of Sepal.Length column

#1. Without a dedicated function
max(iris$Sepal.Length) - min(iris$Sepal.Length)   #Output: 3.6

#2. With function range() - it returns both the smallest & largest value in a vector
range(iris$Sepal.Length)                       #Output: 4.3, 7.9

Note! Range is not a robust measure of variability because it is highly influenced by outliers. A single extreme value in the dataset can change the range completely.
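A small sketch of both points above: diff() collapses the pair returned by range() into a single number, and one extreme value changes the result drastically. The value 50 is a made-up outlier for illustration, not part of iris.

```r
x <- iris$Sepal.Length

# range() returns c(min, max); diff() turns that pair into max - min
range_clean <- diff(range(x))
range_clean                        # 3.6

# Append one hypothetical extreme value and recompute
x_out <- c(x, 50)
range_outlier <- diff(range(x_out))
range_outlier                      # 45.7
```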

2. Interquartile Range

IQR is like the range, but it is the difference between the 75th & 25th percentiles. It covers the middle half (50%) of the data, i.e. one quarter of the data falls below the 25th percentile & one quarter lies above the 75th percentile, leaving the “middle half” of the data in between the two.

Quantiles are most commonly expressed as percentiles. Eg. the 10th percentile of a dataset is, roughly, the value x below which 10% of the data falls.

The median of a dataset is its 50th percentile (0.5 quantile).

#Find 50th quantile/median of Sepal.Length column
quantile(iris$Sepal.Length, probs = 0.5)       #Output: 5.8

#Median of Sepal.Length
median(iris$Sepal.Length)                      #Output: 5.8

#Find 25th & 75th percentile of Sepal.Length
quantile(iris$Sepal.Length, probs = c(0.25, 0.75)) 
#Output: 25%:5.1, 75%:6.4                    Hence, IQR= 6.4-5.1=1.3

#Find IQR using function
IQR(iris$Sepal.Length)                         #Output: 1.3
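As a quick check, the IQR can be rebuilt by hand from quantile(). The sketch below also appends a made-up outlier (50, not part of iris) to illustrate that the IQR reacts far less to a single extreme value than the range does.

```r
x <- iris$Sepal.Length

# IQR by hand: 75th percentile minus 25th percentile
q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
iqr_manual <- q[2] - q[1]
all.equal(iqr_manual, IQR(x))      # TRUE

# Append one hypothetical extreme value and compare both measures
x_out <- c(x, 50)
iqr_outlier <- IQR(x_out)          # stays close to the original IQR
range_outlier <- diff(range(x_out))  # dominated by the outlier
iqr_outlier
range_outlier
```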

3. Mean Absolute Deviation

Mean Absolute Deviation is the average of the absolute deviations between each observation and the mean.

We take absolute values because we’re only interested in how far each value is from the mean; it doesn’t matter whether the value is higher or lower than the mean.

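Formula to calculate Mean Absolute Deviation, where $\bar{x}$ is the mean of the $N$ observations:

$$\text{Mean Absolute Deviation} = \frac{1}{N}\sum_{i=1}^{N} |x_i - \bar{x}|$$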

##Find Mean absolute deviation of Sepal.Length column

#1. Without function
x <- iris$Sepal.Length
x_mean <- mean(x)              #don't name variables 'mean' or 'mad'; that masks the built-in functions
dev <- abs(x - x_mean)         #absolute deviation of each value from the mean
mean(dev)                      #Output: 0.687

#2. Using aad() function in lsr package (run install.packages("lsr") once if it is not installed)
library(lsr)
aad(iris$Sepal.Length)                        #Output: 0.687

4. Variance

Variance measures how far the data points are spread out from the mean. The formula of variance (with divisor $N$) is:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^2$$

This formula is very similar to that of Mean Absolute Deviation; here we just use squared deviations instead of absolute deviations. That is why Variance is sometimes also called ‘Mean Squared Deviation’.

One way to read the variance: if it is low, then on average every value differs little from the mean, so the values are clustered close to the mean. If the variance is high, the values are widely spread around the mean, and because the deviations are squared, extreme values contribute disproportionately.

Note! Variance is additive for independent variables, i.e. if X with Var(X) and Y with Var(Y) are independent (or at least uncorrelated), then for the sum Z = X + Y:

Var(Z) = Var(X) + Var(Y)
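A hedged check of the additive rule on two iris columns. For a sum Z = X + Y the general identity is Var(Z) = Var(X) + Var(Y) + 2 Cov(X, Y), so the simple sum is exact only when the covariance is zero (for example, for independent variables). Sepal.Length and Sepal.Width are used here purely for illustration; they are correlated, so the simple sum is slightly off.

```r
x <- iris$Sepal.Length
y <- iris$Sepal.Width
z <- x + y

var_z   <- var(z)
simple  <- var(x) + var(y)                   # ignores the covariance term
general <- var(x) + var(y) + 2 * cov(x, y)   # general identity

all.equal(var_z, general)   # TRUE
var_z - simple              # equals 2 * cov(x, y); non-zero because the columns are correlated
```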

#Variance of column Sepal.Length
var(iris$Sepal.Length)                             #Output: 0.686

In R, var() does not average the squared deviations by dividing by N; it divides by N-1, i.e. R uses the formula below for variance:

$$s^2 = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2$$

We divide by N-1 instead of N so that the sample variance (s²) is an unbiased estimate of the population variance.

Variance with a divisor of N-1 is calculated from a sample as an estimate of the variance of the population from which the sample was drawn. A variance calculated from deviations around the sample mean with a divisor of N underestimates, on average, the variance of the population. Using N-1 instead of N as the divisor corrects for that by making the result a little bit bigger.

When the sample is the whole population, we use N as the divisor because the mean is then the population mean, not a sample mean.
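The two divisors can be compared directly. This is a minimal sketch that rebuilds both versions from the sum of squared deviations; var_sample and var_population are illustrative names, not built-in functions.

```r
x <- iris$Sepal.Length
n <- length(x)
ss <- sum((x - mean(x))^2)       # sum of squared deviations from the mean

var_sample     <- ss / (n - 1)   # divisor N-1: what var() computes
var_population <- ss / n         # divisor N: slightly smaller

all.equal(var_sample, var(x))                     # TRUE
all.equal(var_population, var(x) * (n - 1) / n)   # TRUE
```

Base R has no separate function for the divisor-N version, so rescaling var() by (n - 1) / n is a common way to get it.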

Note! Variance is hard to interpret directly. One reason is that it is built from squared deviations, so its unit is not the same as the unit of the data. That is why, instead of variance, people prefer to use Standard Deviation, which we will cover now.

5. Standard Deviation

Standard deviation is the square root of the variance. It is more interpretable because it is expressed in the same units as the data (i.e., values, not squared values).

Eg. Suppose there are two grocery delivery apps, and both advertise an average delivery time of 20 minutes. Let’s say App 1 has an SD of 10 minutes and App 2 has an SD of 5 minutes. App 1, with the larger SD, has more variable delivery times and a broader distribution curve than App 2, so if we want predictable deliveries we’ll choose App 2.

A small standard deviation indicates that the data points are close to the mean, i.e. values in the dataset are consistent.
A high standard deviation means the data values are spread further from the mean: they are more dissimilar, and extreme values are more likely.

Formula of sample standard deviation is:

$$s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2}$$

#Find SD of Sepal.Length column
sd(iris$Sepal.Length)                        #Output: 0.828
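A one-line check that sd() is the square root of var(), and therefore also uses the N-1 divisor:

```r
x <- iris$Sepal.Length

sd_manual <- sqrt(var(x))       # square root of the sample variance
all.equal(sd_manual, sd(x))     # TRUE
```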

Note!

If a distribution is normal (symmetric & bell-shaped), then approximately:
68% of the data fall within 1 standard deviation of the mean,
95% of the data fall within 2 standard deviations of the mean,
and 99.7% of the data fall within 3 standard deviations of the mean.
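These percentages come straight from the normal distribution and can be reproduced with pnorm(). The second half of the sketch computes the same shares for Sepal.Length; real data are only roughly normal, so those shares need not match the rule exactly.

```r
# Probability of a normal variable falling within k standard deviations of its mean
k <- 1:3
within_k <- pnorm(k) - pnorm(-k)
round(within_k, 4)    # 0.6827 0.9545 0.9973

# Observed share of Sepal.Length values within k SDs of the mean
x <- iris$Sepal.Length
share_iris <- sapply(k, function(j) mean(abs(x - mean(x)) <= j * sd(x)))
share_iris
```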

6. Median Absolute Deviation

Median Absolute Deviation is similar to Mean Absolute Deviation, but it uses the median instead of the mean: it is the median of the absolute deviations from the median.

Eg. In the iris dataset, the median Sepal.Length is 5.8 cm, but individual values vary around it. The Median Absolute Deviation is 0.7, indicating that a typical Sepal.Length differs from the median by about 0.7 cm.

#Find Median Absolute Deviation of Sepal.Length column.
#By default constant = 1.4826, which rescales the result so that it estimates the standard deviation when the data are normally distributed.
#Set constant = 1 to get the raw median absolute deviation.
mad(iris$Sepal.Length, constant=1)              #Output: 0.7
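To see what mad() does under the hood, the same number can be computed by hand. This is a minimal sketch; mad_manual is an illustrative name.

```r
x <- iris$Sepal.Length

# Median of the absolute deviations from the median
mad_manual <- median(abs(x - median(x)))

all.equal(mad_manual, mad(x, constant = 1))   # TRUE
all.equal(1.4826 * mad_manual, mad(x))        # TRUE: the default just rescales by the constant
```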

Thank you for reading.

References:

https://stats.stackexchange.com/questions/3931/intuitive-explanation-for-dividing-by-n-1-when-calculating-standard-deviation

https://statisticsbyjim.com/basics/standard-deviation/

https://learningstatisticswithr.com/lsr-0.6.pdf
