Data Science Blog

Read / Share / Repeat

Measures of Variability using R

Measures of variability tell us how spread out the data are, i.e. how far the values lie from the mean or median. Here, we’ll discuss 6 common measures of variability:

  1. Range
  2. Interquartile Range
  3. Mean Absolute Deviation
  4. Variance
  5. Standard Deviation
  6. Median Absolute Deviation

Let’s see these in detail. We’ll use the ‘iris’ dataset for the examples, which is a built-in dataset in R.

  1. Range

We can calculate range as largest value minus the smallest value.

#Range of Sepal.Length column

#1. Without a dedicated function
max(iris$Sepal.Length) - min(iris$Sepal.Length)   #Output: 3.6

#2. With function range() - it returns both the smallest & largest value in a vector
range(iris$Sepal.Length)                          #Output: 4.3 7.9
diff(range(iris$Sepal.Length))                    #Output: 3.6

Note! Range is not a robust measure of variability because it is highly influenced by outliers. A single extreme value in the dataset can change the range completely.
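To make the effect of an outlier concrete, here is a minimal sketch that appends one made-up extreme value (50, chosen purely for illustration) to Sepal.Length and recomputes the range.

```r
# Range of Sepal.Length with and without one artificial outlier
x <- iris$Sepal.Length
x_outlier <- c(x, 50)                    # 50 is a hypothetical extreme value

range_clean   <- diff(range(x))          # largest minus smallest value
range_outlier <- diff(range(x_outlier))

range_clean                              # 3.6
range_outlier                            # 45.7
```

One added observation out of 151 is enough to make the range more than ten times larger.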

2. Interquartile Range

IQR is like the range, but it is the difference between the 25th & 75th percentiles. It covers the middle half (50%) of the data, i.e. one quarter of the data falls below the 25th percentile & one quarter lies above the 75th percentile, leaving the “middle half” of the data in between the two.

Quantiles (most commonly called percentiles): Eg. the 10th percentile of a dataset is, roughly, the value x such that 10% of the data is less than x.

The median of a dataset is the 50th quantile/percentile.

#Find 50th quantile/median of Sepal.Length column
quantile(iris$Sepal.Length, probs = 0.5)       #Output: 5.8

#Median of Sepal.Length
median(iris$Sepal.Length)                      #Output: 5.8

#Find 25th & 75th percentile of Sepal.Length
quantile(iris$Sepal.Length, probs = c(0.25, 0.75))
#Output: 25%: 5.1, 75%: 6.4                   Hence, IQR = 6.4 - 5.1 = 1.3

#Find IQR using function
IQR(iris$Sepal.Length)                         #Output: 1.3
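As a rough check of the definition, the sketch below computes the IQR by hand from the two quartiles and then repeats the calculation after adding a made-up outlier (50, for illustration only). The variable names are mine, not part of any package.

```r
x <- iris$Sepal.Length

# IQR by hand: 75th percentile minus 25th percentile
q <- quantile(x, probs = c(0.25, 0.75))
iqr_by_hand <- unname(q[2] - q[1])
iqr_by_hand                      # same value as IQR(x)

# Add one hypothetical extreme value and recompute
x_outlier <- c(x, 50)
iqr_outlier <- IQR(x_outlier)
iqr_outlier                      # stays close to the original IQR
```

Because the IQR only looks at the middle half of the data, a single extreme value barely moves it, unlike the range.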

3. Mean Absolute Deviation

Mean Absolute Deviation is the average of the absolute deviations between each observation and the mean.

We use absolute values here because we’re only interested in how ‘close’ each value is to the mean. It doesn’t matter whether the value is higher or lower than the mean.

Formula to calculate Mean absolute deviation is:

$$\text{Mean absolute deviation} = \frac{1}{N}\sum_{i=1}^{N} |x_i - \bar{x}|$$

##Find Mean absolute deviation of Sepal.Length column

#1. Without function
x <- iris$Sepal.Length
x_mean <- mean(x)                             #don't call the variable 'mean', it would mask the function name
abs_dev <- abs(x - x_mean)
mean(abs_dev)                                 #Output: 0.687

#2. Using aad() function in lsr package
library(lsr)
aad(iris$Sepal.Length)                        #Output: 0.687
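If you would rather not install a package for this, the calculation fits in a small helper. This is only a sketch: mean_abs_dev() is a name I made up, not a function from base R or lsr.

```r
# Hypothetical helper: average absolute deviation from the mean
mean_abs_dev <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]      # optionally drop missing values
  mean(abs(x - mean(x)))
}

mean_abs_dev(c(1, 2, 3, 4, 5))      # 1.2
mean_abs_dev(iris$Sepal.Length)     # about 0.69
```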

4. Variance

Variance measures how far/spread out the data points are from the mean. Formula of variance is:

$$\text{Var}(x) = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^2$$

This formula is very similar to Mean Absolute Deviation; here we just use squared deviations instead of absolute deviations. That is why Variance is sometimes also called ‘Mean Squared Deviation’.

How to read it: if the variance is low, then on average every value has a small difference from the mean, and hence we can conclude that the values are all fairly close to the mean. If the variance is high, the values are spread widely around the mean. Because the deviations are squared, extreme values contribute much more to the variance than values near the mean.

Note! Variance is additive for independent (more precisely, uncorrelated) variables, i.e. let’s say there is a variable X with Var(X) & an independent variable Y with Var(Y), then for the column Z = X + Y:

Var(Z) = Var(X) + Var(Y)
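Note that the additivity rule is exact only when the two variables are uncorrelated. In general, Var(X + Y) = Var(X) + Var(Y) + 2·Cov(X, Y). The sketch below checks this on two iris columns, which are correlated, so the simple sum does not match.

```r
x <- iris$Sepal.Length
y <- iris$Sepal.Width
z <- x + y

var_z      <- var(z)
var_simple <- var(x) + var(y)                  # ignores the covariance
var_full   <- var(x) + var(y) + 2 * cov(x, y)  # includes the covariance

var_z
var_full      # identical to var_z
var_simple    # differs, because x and y are correlated
```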

#Variance of column Sepal.Length
var(iris$Sepal.Length)                             #Output:0.685

In R, however, var() does not average the squared deviations by dividing by N. It divides by N-1 instead, i.e. R uses the formula below for variance:

$$s^2 = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2$$

We divide by N-1 instead of N so that the sample variance (s²) is an unbiased estimate of the population variance.

A variance with a divisor of N-1 is calculated from a sample as an estimate of the variance of the population from which the sample was drawn. Because the deviations are measured from the sample mean rather than the (unknown) population mean, dividing by N would underestimate the population variance on average. Using N-1 instead of N as the divisor corrects for that by making the result a little bit bigger.

When the sample is the whole population, we use N as the divisor, because then the mean is the population mean, not a sample mean.
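To see the two divisors side by side, this sketch computes the variance of Sepal.Length by hand with both N-1 and N. The variable names are illustrative only.

```r
x <- iris$Sepal.Length
n <- length(x)
sq_dev <- (x - mean(x))^2            # squared deviations from the mean

var_n_minus_1 <- sum(sq_dev) / (n - 1)   # what var() returns
var_n         <- sum(sq_dev) / n         # divisor N ("population" form)

var_n_minus_1    # equals var(x)
var_n            # slightly smaller
```

With 150 observations the two values differ by a factor of 149/150, so the correction matters most for small samples.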

Note! Variance is hard to interpret. One reason is that it is squared, so its unit is not the same as the unit of the data. That is why, instead of variance, people prefer to use Standard Deviation, which we will cover now.

5. Standard Deviation

Standard deviation is the square root of the variance. It is more interpretable because it is expressed in the same units as the data (i.e., values, not squared values).

Eg. Suppose there are two grocery delivery apps, and both advertise a 20-minute average delivery time. Let’s say App 1 has an SD of 10 minutes and App 2 has an SD of 5 minutes. App 1, with the larger SD, has more variable delivery times and a broader distribution curve than App 2, so if we want predictable deliveries we’ll choose App 2.

  • A small standard deviation indicates that the data points are close to the mean, i.e. values in the dataset are consistent.
  • A high standard deviation means the data values are spread out further from the mean: they are more dissimilar and extreme values are more likely.

Formula of sample standard deviation is:

$$s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2}$$

#Find SD of Sepal.Length column
sd(iris$Sepal.Length)                        #Output: 0.828

Note! If a distribution is approximately normal (symmetric and bell shaped), then:

  • about 68% of the data fall within 1 standard deviation of the mean
  • about 95% of the data fall within 2 standard deviations of the mean
  • and about 99.7% of the data fall within 3 standard deviations of the mean.
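These percentages come from the normal distribution itself, and they can be reproduced with pnorm(), the normal cumulative distribution function in base R. A short sketch:

```r
# Share of a normal distribution within k standard deviations of the mean
k <- 1:3
within_k_sd <- pnorm(k) - pnorm(-k)

round(100 * within_k_sd, 1)      # 68.3 95.4 99.7
```

For real data these shares only hold approximately, and only when the data are close to normal.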

6. Median Absolute Deviation

Median Absolute Deviation is similar to Mean Absolute Deviation, but it uses the median instead of the mean: it is the median of the absolute deviations of each observation from the median.

Eg. In the iris dataset, the median Sepal.Length is 5.8. However, there is some variability around it. The MAD value is 0.7, indicating that a typical Sepal.Length differs from the median by about 0.7 (in the units of the column, cm).

#Find Median Absolute Deviation of Sepal.Length column.
#The default is constant = 1.4826, which rescales the result so that it estimates
#the standard deviation when the data are normally distributed. Use constant = 1 for the raw value.
mad(iris$Sepal.Length, constant=1)              #Output: 0.7
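The same number can be obtained by hand, which also shows what the constant argument does. A minimal sketch in base R:

```r
x <- iris$Sepal.Length

# Median of the absolute deviations from the median
mad_by_hand <- median(abs(x - median(x)))
mad_by_hand                  # 0.7, same as mad(x, constant = 1)

# With the default constant the raw value is multiplied by 1.4826
mad_default <- mad(x)
mad_default                  # 1.4826 * 0.7
```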

Thank you for reading.

References:

https://stats.stackexchange.com/questions/3931/intuitive-explanation-for-dividing-by-n-1-when-calculating-standard-deviation

https://statisticsbyjim.com/basics/standard-deviation/

https://learningstatisticswithr.com/lsr-0.6.pdf

Measures of Central Tendency in R

A measure of central tendency is a summary statistic that represents the center point of a dataset. It indicates where most values in a distribution fall. In this blog, we will see the basic definitions of mean, median & mode and see how to calculate these in R.

There are broadly 3 measures of central tendency:

  • Mean (Sometimes, people also use Trimmed mean)
  • Median
  • Mode

  1. Mean

The mean of a set of observations is just an average: add all the values and then divide by the total number of values.

Let’s see how we can calculate the mean in R. We’ll use the ‘iris’ dataset as an example, which is a built-in dataset in R.

##Calculate mean of Sepal.Length column. We can either calculate mean without function or with function mean()

# 1. Without function
sum(iris$Sepal.Length)/length(iris$Sepal.Length)  #output: 5.84

# 2. With function
mean(iris$Sepal.Length)                           #output: 5.84

Trimmed Mean

The mean is highly sensitive to outliers. Therefore, if some extreme values are present in a dataset, the mean is not a good measure of central tendency. We can use either the Median or the Trimmed mean instead. First let’s see what the trimmed mean is.

To calculate the trimmed mean, we discard the extreme data points from both ends (i.e. the largest & smallest) and then take the mean of the rest of the data.

We describe the trimmed mean in terms of %. So, a 10% trimmed mean discards the largest 10% of the data & the smallest 10% of the data and then takes the mean of the remaining 80% of the data.

Let’s see how to calculate trimmed mean in R

data <- c( -15,1,6,4,7,8,4,5,9,12 )

#find 10% trimmed mean (10% trim from both sides)
mean(data, trim = 0.1)                     #output: 5.5
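To see what trim = 0.1 does, the sketch below repeats the calculation by hand on the same vector: sort, drop 10% of the observations from each end, and average the rest.

```r
data <- c(-15, 1, 6, 4, 7, 8, 4, 5, 9, 12)

sorted <- sort(data)
k <- floor(length(sorted) * 0.1)     # observations to drop per side: 1
kept <- sorted[(k + 1):(length(sorted) - k)]

trimmed_by_hand <- mean(kept)
trimmed_by_hand                      # 5.5, same as mean(data, trim = 0.1)
mean(data)                           # 4.1, pulled down by the outlier -15
```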

2. Median

The median of a set of observations is the middle value. It is the value that splits the dataset in half.

  • To find median, sort the data in ascending order
  • In case of odd no. of observations, median is the middle value.
  • In case of even no. of observations, median is the average of 2 middle values.

Outliers have only a small effect on the median, since the median doesn’t depend on the exact size of every value in the dataset.

Income is the classic example where we should use the median instead of the mean. If the incomes of a few very wealthy people are included in the dataset, the mean will overestimate where most household incomes fall.

Let’s see how to calculate median in R.

#Find median of Sepal.Length column
median(iris$Sepal.Length)                    #Output: 5.8
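The sorting rule described above (middle value for an odd count, average of the two middle values for an even count) can be written out as a small function. median_by_hand() is a hypothetical name used only for this sketch.

```r
# Hypothetical helper that follows the textbook steps for the median
median_by_hand <- function(x) {
  x <- sort(x)                        # step 1: sort ascending
  n <- length(x)
  if (n %% 2 == 1) {
    x[(n + 1) / 2]                    # odd count: the middle value
  } else {
    mean(x[c(n / 2, n / 2 + 1)])      # even count: average of the two middle values
  }
}

median_by_hand(c(7, 1, 5))            # 5
median_by_hand(c(7, 1, 5, 3))         # 4
median_by_hand(iris$Sepal.Length)     # 5.8
```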

3. Mode

The mode is the value that occurs most frequently in a dataset. If no value repeats, there is no mode in the dataset. And if multiple values occur most frequently (the same no. of times), the data is called multimodal, meaning it has multiple modes.

Let’s see how to calculate the mode in R. Base R doesn’t have a function for calculating the statistical mode (the built-in mode() function returns the storage type of an object instead), so we will use the package ‘lsr’, which has functions for it.

#Load package lsr
library(lsr)

data<-c(2,2,3,4,5,2,3,2,3,2,3,2,3,4,5,4,4,4,5,6,6,6,1,1,1,1,1)

#Find mode
modeOf(data)                                #Output: 2

#Find frequency value of the mode
maxFreq(data)                               #Output: 6
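If installing lsr is not an option, the mode can also be found with table() from base R. This is a sketch of one possible approach; it returns every value tied for the highest frequency, so it also covers multimodal data.

```r
data <- c(2,2,3,4,5,2,3,2,3,2,3,2,3,4,5,4,4,4,5,6,6,6,1,1,1,1,1)

freq <- table(data)                              # frequency of each distinct value
modes <- as.numeric(names(freq)[freq == max(freq)])
mode_freq <- max(freq)

modes        # 2
mode_freq    # 6
```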

Note! When to use mean, median or mode

  • For symmetrical, unimodal continuous data, the mean, median and mode are approximately equal. Here, the mean is preferred since it considers all the data points in a dataset.
  • If a distribution is skewed, median is preferred.
  • For ordinal data, median or mode is preferred.
  • For categorical data, use mode.

Thank you for reading 🙂


Changing column names using dplyr

Many times we are required to change the column names of a dataframe for the analysis. In this blog, we’ll see the common dplyr functions using which we can easily change the column names. We’ll be using ‘iris’ dataset here which is a built-in dataset in R.

  1. select()

We can rename columns with the select() function, which selects the columns and at the same time changes their names when we provide a new name on the left-hand side of an equals operator (=).

#Load library dplyr (provides select(), rename() and the %>% pipe)
library(dplyr)

#Rename Sepal.Width column to sepal_width and Species to species (only these two columns are kept)
iris %>% select(sepal_width = Sepal.Width, species = Species)

2. rename()

If you want to retain all the columns but with some renamed, you can use the rename() function. rename() will output all the columns with the names adjusted for mentioned columns.

#Select all the columns but rename Sepal.Width column to sepal_width and Species to species
iris %>%rename(sepal_width=Sepal.Width, species=Species)

3. Variations of rename() function

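We can use the *_at(), *_if(), and *_all() versions of the rename() function to change some or all of the columns. (In dplyr 1.0.0 and later these scoped variants are superseded by rename_with(), but they still work.)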

#Load library stringr
library(stringr)

#Rename all the columns to lowercase
iris %>%rename_all(str_to_lower)

#Rename only the numeric columns to lowercase
iris %>% rename_if(is.numeric, str_to_lower)

#Rename to lowercase all the columns whose names start with 'S'
iris %>%rename_at(vars(starts_with("S")), str_to_lower)
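For newer dplyr versions, the same three renames can be sketched with rename_with(). This assumes dplyr 1.0.0 or later is installed and uses base R's tolower() in place of str_to_lower().

```r
library(dplyr)

# All columns to lowercase
all_lower <- iris %>% rename_with(tolower)

# Only the numeric columns to lowercase
num_lower <- iris %>% rename_with(tolower, .cols = where(is.numeric))

# Only the columns whose names start with 'S' (matching ignores case by default)
s_lower <- iris %>% rename_with(tolower, .cols = starts_with("S"))

names(all_lower)
names(num_lower)
names(s_lower)
```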

4. Convert row_names to column

#Load the library tidyverse
library(tidyverse)

#Convert rownames in mtcars dataset to column 'car'
#(rownames_to_column() comes from the tibble package, which gets loaded with tidyverse)
mtcars %>% rownames_to_column("car")

Thank you for reading 🙂

References: https://itsalocke.com/files/DataManipulationinR.pdf

Sort rows & columns using dplyr

We often need to sort our rows or reorder the columns. We can do this using the functions in dplyr package. We’ll be using ‘iris’ dataset in this blog which is the built-in dataset in R. Let’s see the functions usage below.

  1. arrange()

The arrange() function sorts the rows by the given columns: rows are sorted by the first column, ties are broken by the second column, and so on.

#Sort rows based on Species in descending order & Sepal.Length in ascending order
iris%>%arrange(desc(Species), Sepal.Length)

2. arrange_all()

This function sorts the rows by all the columns, taken from left to right.

#Sort all the data from left to right in descending order
iris%>%arrange_all(desc)

3. arrange_if()

arrange_if() sorts the rows by the columns that satisfy a condition.

#Based on numeric columns, sort the data in descending order
iris%>%arrange_if(is.numeric, desc)

4. arrange_at()

The arrange_at() function sorts the rows based on selected columns.

#Sort rows based on Species & columns starting with 'P' in descending order
iris %>%arrange_at(vars(Species, starts_with("P")), desc)
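In dplyr 1.0.0 and later, the scoped arrange_*() variants are superseded, and the same sorts can be sketched with across() inside arrange(). The example assumes that version of dplyr is installed.

```r
library(dplyr)

# Sort by all numeric columns, each in descending order (like arrange_if)
by_numeric <- iris %>% arrange(across(where(is.numeric), desc))

# Sort by Species, then by columns starting with 'P', all descending (like arrange_at)
by_species_p <- iris %>% arrange(desc(Species), across(starts_with("P"), desc))

head(by_numeric, 3)
head(by_species_p, 3)
```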

select() function to reorder the columns

We can use the select() function to reorder the columns.

#Move the columns starting with 'P' to the beginning, followed by the rest
iris %>% select(starts_with("P"), everything())

#Sort the columns alphabetically
#(the current_vars() helper is deprecated in recent dplyr versions, so the names are sorted explicitly)
iris %>% select(all_of(sort(names(iris))))
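dplyr 1.0.0 and later also ships relocate(), which moves columns without having to list the rest. A short sketch, assuming that version is available:

```r
library(dplyr)

# Move the columns starting with 'P' to the front; the others keep their order
p_first <- iris %>% relocate(starts_with("P"))
names(p_first)
# "Petal.Length" "Petal.Width" "Sepal.Length" "Sepal.Width" "Species"

# Move Species so that it comes before Sepal.Length
species_first <- iris %>% relocate(Species, .before = Sepal.Length)
names(species_first)
```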

Thank you for reading 🙂

References: https://itsalocke.com/files/DataManipulationinR.pdf

Filter columns in R using dplyr

We often need to select some columns out of all the columns in the dataframe for our analyses. We can do so using the dplyr package in R. In this blog, we’ll see some common functions to filter the columns. We’ll be using ‘iris’ dataset which is the built-in dataset in R.

select() function

Using select() function, we can select the columns we want or don’t want.

#Select columns Species & Sepal.Length from iris dataset
iris%>%select(Species, Sepal.Length)

#Exclude Species column
iris %>%select(-Species)

#Provide range of columns
iris %>%select(Sepal.Length:Petal.Length)

#Exclude group of columns
iris %>%select(-(Sepal.Length:Petal.Length))

Name based Selection

We can select the columns whose names start with, end with or contain a given string.

#Return columns beginning with 'S'
iris %>%select(starts_with("S"))

#Return columns ending with 's'
iris %>%select(ends_with("s"))

#Return columns containing string 'Length'
iris %>%select(contains("Length"))

Content based Selection

We can also select the columns using some criteria or custom conditions.

#Select only numeric columns
iris %>%select_if(is.numeric)

#Select numeric columns with more than 30 unique values. (Use ~ to denote that we're writing a custom condition)
iris %>%select_if(~is.numeric(.) & n_distinct(.)>30)

If you want to reuse a condition multiple times, you can convert it into a function using as_mapper() from the purrr package.

#Load library purrr (provides as_mapper())
library(purrr)

custom_cond <- as_mapper(
  ~is.numeric(.) & n_distinct(.)>30
)

This can be used in a standalone fashion or within the select_if() function.

#Returns TRUE/FALSE
custom_cond(LETTERS)                         #Output: FALSE
custom_cond(1:50)                            #Output: TRUE

#Use in select_if() function
iris%>%select_if(custom_cond)
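In dplyr 1.0.0 and later, select_if() is superseded, and the same content-based selection can be sketched with where() inside select(). The example assumes that version is installed.

```r
library(dplyr)

# Only the numeric columns
numeric_cols <- iris %>% select(where(is.numeric))
names(numeric_cols)

# Numeric columns with more than 30 distinct values
many_values <- iris %>% select(where(~ is.numeric(.x) && n_distinct(.x) > 30))
names(many_values)
# "Sepal.Length" "Petal.Length"
```

The condition inside where() must return a single TRUE or FALSE per column, which is why && is used here rather than &.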

Thank you for reading 🙂

References: https://itsalocke.com/files/DataManipulationinR.pdf

Filter rows in R using dplyr

Filtering rows is a routine task when working with dataframes in R. In this blog, we’ll see common functions of the dplyr package that we can use to filter rows in various ways. We’ll be using the ‘iris’ dataset in our examples, which is a built-in dataset in R.

  1. slice()

The slice() function takes a vector of values that denote row positions. They can be positive to include rows or negative to exclude rows.

#Select top 5 rows
iris%>%slice(1:5)

#Exclude row 3 from top
iris%>%slice(-3)

#Remove the first third of the rows, i.e. the top 50 rows of iris. [ n() returns the total number of rows ]
iris %>%slice(-(1:floor(n()/3)))
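dplyr 1.0.0 and later also provides slice_head() and slice_tail(), which express the same selections by count rather than by position vector. A minimal sketch, assuming that version is installed:

```r
library(dplyr)

# First 5 rows (equivalent to slice(1:5))
first_five <- iris %>% slice_head(n = 5)

# Everything except the first 50 rows, expressed as "the last 100 rows"
without_first_50 <- iris %>% slice_tail(n = nrow(iris) - 50)

nrow(first_five)          # 5
nrow(without_first_50)    # 100
```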

2. filter()

The filter() function keeps the rows for which the given condition evaluates to TRUE.

#Filter data with Species = 'virginica'
iris %>% filter(Species == "virginica")

#Filter data with Species = 'virginica' and Sepal.Length >= mean of Sepal.Length (mean of the whole column)
iris %>% filter(Species == "virginica" & Sepal.Length >= mean(Sepal.Length))

3. filter_all()

filter_all() applies the filter to every column. It returns only the rows where the condition is TRUE for all columns (AND) or where the condition is TRUE for at least one column (OR).

  • If the condition must be TRUE for all the columns, wrap the condition in all_vars()
  • If the condition must be TRUE for any one of the columns, wrap the condition in any_vars()

#Return any row where at least one numeric column's value exceeds 7.5
#(the Species factor is dropped first, because comparisons like > are not meaningful for factors)
iris %>% select(-Species) %>% filter_all(any_vars(. > 7.5))

#Return each row where every numeric column's value is smaller than that column's average
iris %>% select(-Species) %>% filter_all(all_vars(. < mean(.)))

4. filter_if()

filter_if() first applies a column-level check to choose the columns and then filters the rows using those columns.

#Return each row where every numeric column's value is smaller than average
iris %>%filter_if(is.numeric, all_vars(.<mean(.)))

We can also use custom functions by using a tilde (~) and the data placeholder (.)

#Using only the numeric columns that have more than 20 distinct values, return rows where any of those columns' values is smaller than that column's average
iris %>%filter_if(~is.numeric(.) & n_distinct(.)>20,any_vars(.<mean(.)))

5. filter_at()

filter_at() applies filter to columns that match some criteria.

#Based on columns which ends with 'Length', return rows where column's value is smaller than average
iris %>%filter_at(vars(ends_with("Length")),all_vars(.<mean(.)))
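In recent dplyr releases the scoped filter_*() variants are superseded, and the same ideas can be sketched with if_any() and if_all() inside filter(). This assumes dplyr 1.0.4 or later, where both helpers are available.

```r
library(dplyr)

# Rows where ANY numeric column exceeds 7.5 (like filter_all + any_vars)
any_large <- iris %>% filter(if_any(where(is.numeric), ~ .x > 7.5))

# Rows where EVERY numeric column is below its column mean (like filter_if + all_vars)
all_small <- iris %>% filter(if_all(where(is.numeric), ~ .x < mean(.x)))

# Rows where both 'Length' columns are below their column means (like filter_at)
short_lengths <- iris %>% filter(if_all(ends_with("Length"), ~ .x < mean(.x)))

nrow(any_large)
nrow(all_small)
nrow(short_lengths)
```

Selecting the columns with where(is.numeric) keeps the Species factor out of the comparison, so no warnings about factors are raised.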

Thank you for reading 🙂

References: https://itsalocke.com/files/DataManipulationinR.pdf
