Read / Share / Repeat

Category: Data Manipulation using R

Changing column names using dplyr

Many times we are required to change the column names of a dataframe for the analysis. In this blog, we’ll see the common dplyr functions using which we can easily change the column names. We’ll be using ‘iris’ dataset here which is a built-in dataset in R.

  1. select()

We can rename the columns using select() function which will select the columns and at the ame time change their names by providing a new name on the left-hand side of an equals operator (=).

#Rename Sepal.Width column to sepal_width and Species to species
iris %>%select(sepal_width=Sepal.Width, species=Species)

2. rename()

If you want to retain all the columns but with some renamed, you can use the rename() function. rename() will output all the columns with the names adjusted for mentioned columns.

#Select all the columns but rename Sepal.Width column to sepal_width and Species to species
iris %>%rename(sepal_width=Sepal.Width, species=Species)

3. Variations of rename() function

We can use *_at(), *_if(), and *_all() versions of rename() function to change some or all the columns.

#Load library stringr
library(stringr)

#Rename all the columns to lowercase
iris %>%rename_all(str_to_lower)

#Rename only the numeric columns to lowercase
iris %>% rename_if(is.numeric, str_to_lower)

#Rename all the columns to lowercase which starts with 'S'
iris %>%rename_at(vars(starts_with("S")), str_to_lower)

4. Convert row_names to column

#Load the library tidyverse
library(tidyverse)

#Convert rownames in mtcars dataset to column 'car'
mtcars %>%rownames_to_column("car") #rownames_to_column comes from tibble package which gets loaded with tidyverse

Thank you for reading 🙂

References: https://itsalocke.com/files/DataManipulationinR.pdf

Sort rows & columns using dplyr

We often need to sort our rows or reorder the columns. We can do this using the functions in dplyr package. We’ll be using ‘iris’ dataset in this blog which is the built-in dataset in R. Let’s see the functions usage below.

  1. arrange()

arrange() function sort the rows based on columns where the first column will be the first one to be sorted then based on second column and so on.

#Sort rows based on Species in descending order & Sepal.Length in ascending order
iris%>%arrange(desc(Species), Sepal.Length)

2. arrange_all()

This functions sort all the data from left to right

#Sort all the data from left to right in descending order
iris%>%arrange_all(desc)

3. arrange_if()

arrange_if() sort the rows based on column criteria

#Based on numeric columns, sort the data in descending order
iris%>%arrange_if(is.numeric, desc)

4. arrange_at()

arrange_at() function sort the rows based on selected columns.

#Sort rows based on Species & columns starting with 'P' in descending order
iris %>%arrange_at(vars(Species, starts_with("P")), desc)

Select() function to reorder the columns

We can use select() function to reorder the columns.

#Reorder the columns starting from columns containing 'P' in the beginning and then the rest
iris %>%select(starts_with("P"), everything())
#Sort the columns alphabetically (Extract column names using current_vars() function)
iris %>%select(sort(current_vars()))

Thank you for reading 🙂

References: https://itsalocke.com/files/DataManipulationinR.pdf

Filter columns in R using dplyr

We often need to select some columns out of all the columns in the dataframe for our analyses. We can do so using the dplyr package in R. In this blog, we’ll see some common functions to filter the columns. We’ll be using ‘iris’ dataset which is the built-in dataset in R.

select() function

Using select() function, we can select the columns we want or don’t want.

#Select columns Species & Sepal.Length from iris dataset
iris%>%select(Species, Sepal.Length)

#Exclude Species column
iris %>%select(-Species)

#Provide range of columns
iris %>%select(Sepal.Length:Petal.Length)

#Exclude group of columns
iris %>%select(-(Sepal.Length:Petal.Length))

Name based Selection

We can select the columns containing the name or string.

#Return columns beginning with 'S'
iris %>%select(starts_with("S"))

#Return columns ending with 's'
iris %>%select(ends_with("s"))

#Return columns containing string 'Length'
iris %>%select(contains("Length"))

Content based Selection

We can also select the columns using some criteria or custom conditions.

#Select only numeric columns
iris %>%select_if(is.numeric)

#Select numeric columns where number of unique values in the column is more than 30. (Use ~ to denote we're writing a custom condition
iris %>%select_if(~is.numeric(.) & n_distinct(.)>30)

If you want to reuse some conditions multiple times, we can convert it into a function using as_mapper()

custom_cond <- as_mapper(
  ~is.numeric(.) & n_distinct(.)>30
)

This can be used in a standalone fashion or within select_if() functions.

#Returns TRUE/FALSE 
custom_cond(LETTERS)
custom_cond(1:50)

#Use in select_if() function
iris%>%select_if(custom_cond)

Thank you for reading 🙂

References: https://itsalocke.com/files/DataManipulationinR.pdf

Filter rows in R using dplyr

Filtering rows are always required while working with dataframes in R. In this blog, we’ll see common functions of dplyr package that we can use to filter the rows in various ways. We’ll be using ‘iris’ dataset in our examples which is built-in dataset in R.

  1. slice()

Slice() function takes vector of values that denote the positions. They can be positive for including the rows and negative for excluding the rows.

#Select top 5 rows
iris%>%slice(1:5)

#Exclude row 3 from top
iris%>%slice(-3)

#Remove top 50 rows. [ n() returns total rows ]
iris %>%slice(-(1:floor(n()/3)))

2. filter()

filter() function filters the rows based on certain conditions if the condition evaluates to True.

#Filter data with Species='Virginica'
iris%>%filter(Species=="virginica")

#Filter data with Species='Virginica' and Sepal.Length >= mean of Sepal.Length
iris%>%filter(Species=="virginica" & Sepal.Length >= mean(Sepal.Length))

3. filter_all()

filter_all() applies the filter to each column. It returns only the rows where condition is TRUE for all columns (AND) or where condition is TRUE for any single column (OR).

  • If condition is TRUE for all the columns, wrap the condition in all_vars()
  • If condition id TRUE for any one of the columns, wrap the condition in any_vars()
#Return any row where a column's value exceeds a 7
iris%>%filter_all(any_vars(.>7.5))

#Return each row where every numeric column's value is smaller than average
data %>%filter_all(all_vars(. < mean(.)))

4. filter_if()

filter_if() first applies a column level check and then filter the rows.

#Return each row where every numeric column's value is smaller than average
iris %>%filter_if(is.numeric, all_vars(.<mean(.)))

We can also use custom functions by using a tilde (~) and data place holder (.)

#For all numeric columns and if distinct count of rows >20 in dataframe, return rows where column's value is smaller than average 
iris %>%filter_if(~is.numeric(.) & n_distinct(.)>20,any_vars(.<mean(.)))

5. filter_at()

filter_at() applies filter to columns that match some criteria.

#Based on columns which ends with 'Length', return rows where column's value is smaller than average
iris %>%filter_at(vars(ends_with("Length")),all_vars(.<mean(.)))

Thank you for reading 🙂

References: https://itsalocke.com/files/DataManipulationinR.pdf

Powered by WordPress & Theme by Anders Norén