Tidyverse Essentials - dplyr

The grammar of data wrangling

tutorial
r
tidyverse
dplyr
intro
Data wrangling: solve the most common data manipulation challenges
Author

Matthew Scott

Published

July 24, 2024

A cropped version of the main image

Abstract Canvas

Introduction

dplyr helps manipulate datasets. That is, it can be used to change their shape - the number of rows, columns, as well as verbs to create new columns.

Dataset used in this tutorial

We will use the mpg dataset from ggplot2 in this tutorial and run through a lot of the common syntax used to manipulate a table.

library(ggplot2)
glimpse(mpg)
1
We can see the column names, their datatype (inside < >) and example data in a list-like format. We also get information on the number of rows and columns, the ‘shape’
Rows: 234
Columns: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
$ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
$ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
$ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
$ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
$ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
$ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
$ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
$ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
$ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
$ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…

select() & rename()

Choose which columns to include or exclude.

select([df], helper_function([col_1], [col_2]))

You can wrap the columns in helper functions:

  • contains(match, ignore.case = T) - column name contains a certain string
  • starts_with(match, ignore.case = T) - column names start with
  • ends_with(match, ignore.case = T) - column names end with
  • matches(match, ignore.case = T) - matches a regular expression
  • num_range(prefix, range, ...) - names following the pattern e.g. num_range("Wk", 1:3): “Wk1”, “Wk2”, “Wk3”
  • any_of() - will select all columns available from a list
  • all_of() - used for strict selection. Throws an error if one of the columns named in the list aren’t present in the dataframe
  • matches() - exactly matches a certain string
  • everything() - every other column not specified already
  • where(is.numeric()) - will select all numeric columns (int or dbl). Also is.character, is.factor, is.integer, is.double

rename([new_name] = [old_name])

# selecting column by index (position)

# select columns in position 1-3
select(df, 1:3)
# select columns in position 1-3
select(df, c(2, 5, 7))
# select from the second to last column to the last column
select(df, (ncol(df) -2) : ncol(df))

# selecting columns that contain 'l' (upper or lower case) 
# and renaming 'model' to 'type' etc.
df %>%  
  select(contains('l', ignore.case = T)) %>% 
  rename(type = model, cylinder = cyl)

# select and rename in one call
select(df, 
       mnfc = manufacturer,
       mod = model,
       everything())

# select all numeric columns plus some extras
mpg %>%
  select(where(is.numeric),
         mnfc = manufacturer,
         mod = model)

# another approach: create a vector and use this with any_of()
numerical_cols <- c('displ', 'year', 'cyl', 'cty', 'hwy')
numerical_df <- df %>% 
  select(any_of(numerical_cols))

# negative selections
df %>% 
  select(-any_of(c(model, displ, cyl)))
1
select and rename in one call
2
select all remaining columns that haven’t already been specified
3
this will select all column of data type int or dbl
4
add two additional non-numeric cols to your selection
5
remove any of these columns that appear in the data frame

mutate() & transmute()

Create new columns from existing ones.

mutate([new_col] = [old_col(s) + logic])

  • creates a new column based on the logic provided
  • you can create multiple columns at one time
  • adds the new col to the df

transmute([new_col] = [old_col(s) + logic])

  • creates a new column and drops other columns (this can be useful when do machine learning and you don’t want the new columns interfering with the model)

Helpful functions for dealing with strings:

Functions
as.character()
as.date()
as.integer()

Note There is also lubridate and hms packages for working with dates and times that give a lot more flexibility. stringr is another useful package commonly used with mutate to extract parts of string columns.

# create 'avg mpg' column and paste together two cols into 'car'
# this will return the full df plus these two new cols
mpg %>% 
    mutate(
         car = paste(manufacturer, model, sep = ' '),
        `avg mpg` = (( cty + hwy ) / 2 ),
         cty,
        .keep = "none") %>%
    head(5)
1
We can specify the col name with no calc and it will get included
2
Acts like transmute to keep only created columns
# A tibble: 5 × 3
    cty car     `avg mpg`
  <int> <chr>       <dbl>
1    18 audi a4      23.5
2    21 audi a4      25  
3    20 audi a4      25.5
4    21 audi a4      25.5
5    16 audi a4      21  

Use .by to group our selection in a single expression. We return the max avg mpg for each model type:

mpg %>% 
    mutate(`avg mpg` = ((cty+hwy)/2),
           `max avg mpg` = max(`avg mpg`),
           .by = model,
           .keep = "used")
1
Now we keep only columns used in our mutate verb, including those used in the calculations.
# A tibble: 234 × 5
   model        cty   hwy `avg mpg` `max avg mpg`
   <chr>      <int> <int>     <dbl>         <dbl>
 1 a4            18    29      23.5          25.5
 2 a4            21    29      25            25.5
 3 a4            20    31      25.5          25.5
 4 a4            21    30      25.5          25.5
 5 a4            16    26      21            25.5
 6 a4            18    26      22            25.5
 7 a4            18    27      22.5          25.5
 8 a4 quattro    18    26      22            24  
 9 a4 quattro    16    25      20.5          24  
10 a4 quattro    20    28      24            24  
# ℹ 224 more rows

Common usage

Replace existing columns with mutated version of themselves, either by converting the data type or cleaning the column of NAs. Create new columns with mutate.

  • case_when()
  • case_match()
  • if_else()
  • na_if()
# mutate with case_when
mpg %>% 
    mutate(
        "hwy_bins" = case_when(
            hwy < 20 ~ "low",
            between(hwy, 20, 30) ~ "medium",
            hwy > 30 ~ "high",
            TRUE ~ NA_character_
            ))

# mutate with if_else
mpg %>% 
    mutate(
        "is_offroad" = if_else(
            drv == "4", "Y", "N"
        ))

# mutate with na_if() to replace values with NA
mpg %>% 
    mutate(year = na_if(year, NaN))
1
if NaN present, replace it with NA to help clean the column

Other mutate verbs

  • mutate_all() - affects every variable
  • mutate_if() - affects variables selected with a predicate function
  • mutate_at() - affects variables selected with a character variable or vars()
# This code:
table_sep <- table %>%
  separate(col = date,
           into = c("year", "month", "dayofmonth"),
           sep = "-") %>%
  mutate(month = as.numeric(month),
         dayofmonth = as.numeric(dayofmonth)) %>%
  arrange(year, month, dayofmonth)

# is the same as this:
table_sep <- table %>%
  separate(col = date,
           into = c("year", "month", "dayofmonth"),
           sep = "-") %>%
  mutate_at(.vars = c("month", "dayofmonth"),
            .funs = as.numeric)
  arrange(year, month, dayofmonth)
1
Where we have to specify every column to change data type here…
2
…we can supply the columns in a list and only have to specify the mutate function once

You can imagine if we had lots of columns to mutate, this would be much more succinct.

Note

It is generally preferred to use across() instead of the above verbs. See Column-wise operations for a detailed explanation of this.

You can mutate any column that is.character/is.numeric…:

mpg %>% 
    mutate_if(is.character, stringr::str_to_title) %>%
    select(is.character) %>% 
    head(4)
1
convert all character columns to title case (upper case at start of each word)
# A tibble: 4 × 6
  manufacturer model trans      drv   fl    class  
  <chr>        <chr> <chr>      <chr> <chr> <chr>  
1 Audi         A4    Auto(L5)   F     P     Compact
2 Audi         A4    Manual(M5) F     P     Compact
3 Audi         A4    Manual(M6) F     P     Compact
4 Audi         A4    Auto(Av)   F     P     Compact

across()

across works very well with summarise() & mutate().

mpg %>% 
    summarise(across(where(is.character), n_distinct))
1
across() is used to apply the summary, n_distint(), across the selected columns - in this case all character columns
# A tibble: 1 × 6
  manufacturer model trans   drv    fl class
         <int> <int> <int> <int> <int> <int>
1           15    38    10     3     5     7

filter() & slice()

filter()

Used to extract or create a subset of rows based on logical operators - it will produce a new table

Logical operators to use with filter():

Operator Meaning
== equal to
!= not equal to
< less than
> greater than
<= less than or equal to
>= greater than or equal to
between() between two values (inclusive)
! not
is.na() is NA
!is.na() is not NA
%in% is in (a vector)
| or
& and
xor() exclusive or - ‘one or the other but not both’

Filtering a table:

# this is the same as '&'
mpg %>% 
    filter(manufacturer == "audi",
           year == 1999) %>%
    head()
# A tibble: 6 × 11
  manufacturer model      displ  year   cyl trans  drv     cty   hwy fl    class
  <chr>        <chr>      <dbl> <int> <int> <chr>  <chr> <int> <int> <chr> <chr>
1 audi         a4           1.8  1999     4 auto(… f        18    29 p     comp…
2 audi         a4           1.8  1999     4 manua… f        21    29 p     comp…
3 audi         a4           2.8  1999     6 auto(… f        16    26 p     comp…
4 audi         a4           2.8  1999     6 manua… f        18    26 p     comp…
5 audi         a4 quattro   1.8  1999     4 manua… 4        18    26 p     comp…
6 audi         a4 quattro   1.8  1999     4 auto(… 4        16    25 p     comp…

slice()

Used to select certain rows of the table based on row position

Related slice functions:

  • slice_head()
  • slice_tail()
  • slice_sample()
  • slice_max()
  • slice_min()
# select rows 10-20
slice(df, 10:20)

# slice top 10 rows
slice_head(df, n = 10)

# select bottom rows
slice_tail(df, prop = .1)

# select 10 rows at random from df
slice_sample(df, 10)
# select rows at random based on proportion of df
slice_sample(df, prop = .1)

df %>% slice_max(hwy, n = 10)
df %>% slice_min(hwy, prop = .1)
1
select the bottom 10% of the df
2
order by the selected column and select the top/bottom rows
slice_sample() Note

slice-sample can be done in two main ways, by taking the sampled rows ‘out’ of the data frame (so when subsequent sample slices are made, they cannot be chosen again), or by ‘replacing’ the sample (so in subsequent sample slices the same rows can be chosen again).

This is done using the replace = argument:

  • df %>% slice_sample(prop = .1, replace = True)
  • df %>% slice_sample(prop = .1, replace = False)

replace = False is the default.

  • You can set a seed for reproducibility. Using the same seed will yield the same random selection each time, useful for code reviews where your results need to be reproduced.
# use base r for this
set.seed(200)

arrange()

arrange([data], desc([column_a], [column_b], ...))

  • Sort rows by columns or values that you define
  • Analogous to ORDER BY in SQL syntax
  • Can wrap columns in desc() to arrange in descending order
# arrange by year asc
arrange(df, year)
#  arrange by year desc
arrange(df, desc(year))
# arrange by multiple columns
arrange(df, desc(cyl), displ) 

# or using the pipe
df %>% arrange(desc(year))

distinct()

distinct([df], [col_a], [col_b], ...)

  • Removes rows with duplicate values
  • Specify columns in distinct to select only those column
# will create a table where the combination 
# of manufacturer and model is distinct
mpg %>% 
  select(manufacturer, model) %>% 
  distinct()
1
Because no columns are specified, distinct returns all columns
# A tibble: 38 × 2
   manufacturer model             
   <chr>        <chr>             
 1 audi         a4                
 2 audi         a4 quattro        
 3 audi         a6 quattro        
 4 chevrolet    c1500 suburban 2wd
 5 chevrolet    corvette          
 6 chevrolet    k1500 tahoe 4wd   
 7 chevrolet    malibu            
 8 dodge        caravan 2wd       
 9 dodge        dakota pickup 4wd 
10 dodge        durango 4wd       
# ℹ 28 more rows
mpg %>% 
    group_by(model) %>%
    distinct(manufacturer, .keep_all = T) %>%
    head()
1
distinct will respect group_by and return distinct values for each group
2
keep_all will keep the first row of values from all other columns
# A tibble: 6 × 11
# Groups:   model [6]
  manufacturer model       displ  year   cyl trans drv     cty   hwy fl    class
  <chr>        <chr>       <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi         a4            1.8  1999     4 auto… f        18    29 p     comp…
2 audi         a4 quattro    1.8  1999     4 manu… 4        18    26 p     comp…
3 audi         a6 quattro    2.8  1999     6 auto… 4        15    24 p     mids…
4 chevrolet    c1500 subu…   5.3  2008     8 auto… r        14    20 r     suv  
5 chevrolet    corvette      5.7  1999     8 manu… r        16    26 p     2sea…
6 chevrolet    k1500 taho…   5.3  2008     8 auto… 4        14    19 r     suv