Tidy selection

A powerful tool when working with columns

r
tidyverse
dplyr
Make programming with the tidyverse easier
Author: Matthew Scott

Published: August 10, 2024

Introduction

Tidy selection is a principle that makes it easier to work with columns in a dataset. Behind the principle is the tidyselect package [1], which is used by dplyr and tidyr. It means you don’t have to type an exact column name to select a column: you can select by name, position or type.

Tidy selection is supported by dplyr verbs such as select(), rename(), relocate(), across() and pull(), and works alongside selection helpers such as where(), any_of() and all_of().
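As a rough sketch of what this looks like outside select(), the snippet below uses the starwars dataset that ships with dplyr. The object names (front, renamed, heights, as_factors) are only illustrative.

```r
library(dplyr)

# relocate(): move every list column to the front of the tibble
front <- starwars %>% relocate(where(is.list))

# rename(): tidy selection also accepts positions, here column 1
renamed <- starwars %>% rename(character = 1)

# pull(): extract a single column as a plain vector
heights <- starwars %>% pull(height)

# across(): apply a function to a tidy-selected set of columns
as_factors <- starwars %>% mutate(across(ends_with("color"), as.factor))
```

Each verb takes the same selection syntax, so anything shown for select() in the rest of this post should carry over.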

What does this mean for your programming?

Let’s use the starwars dataset (bundled with dplyr) to see how these helper functions can be useful. Here is the dataset:

library(dplyr)

glimpse(starwars)
Rows: 87
Columns: 14
$ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
$ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
$ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
$ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
$ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
$ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
$ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
$ sex        <chr> "male", "none", "none", "male", "female", "male", "female",…
$ gender     <chr> "masculine", "masculine", "masculine", "masculine", "femini…
$ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
$ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
$ films      <list> <"A New Hope", "The Empire Strikes Back", "Return of the J…
$ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp…
$ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",…

Selecting by name

This can be done with the exact column name:

starwars %>% select(name, height, mass) %>% head(3)
# A tibble: 3 × 3
  name           height  mass
  <chr>           <int> <dbl>
1 Luke Skywalker    172    77
2 C-3PO             167    75
3 R2-D2              96    32

Or inexactly, using helper functions such as starts_with() or contains(). Note: see more helper function options in my dplyr essentials blog.

Starts with a prefix

starwars %>% 
    select(starts_with("S", ignore.case = TRUE)) %>% 
    head(3)
# A tibble: 3 × 4
  skin_color  sex   species starships
  <chr>       <chr> <chr>   <list>   
1 fair        male  Human   <chr [2]>
2 gold        none  Droid   <chr [0]>
3 white, blue none  Droid   <chr [0]>

Ends with an exact suffix

starwars %>% select(ends_with("s")) %>% head(3)
# A tibble: 3 × 5
   mass species films     vehicles  starships
  <dbl> <chr>   <list>    <list>    <list>   
1    77 Human   <chr [5]> <chr [2]> <chr [2]>
2    75 Droid   <chr [6]> <chr [0]> <chr [0]>
3    32 Droid   <chr [7]> <chr [0]> <chr [0]>

Contains a literal string

starwars %>% select(contains("color")) %>% head(3)
# A tibble: 3 × 3
  hair_color skin_color  eye_color
  <chr>      <chr>       <chr>    
1 blond      fair        blue     
2 <NA>       gold        yellow   
3 <NA>       white, blue red      

Matches a regular expression or a stringr pattern

starwars %>% 
    # matches column names where 's' appears twice in a row
    select(matches("[s]{2}")) %>%
    head(3)
# A tibble: 3 × 1
   mass
  <dbl>
1    77
2    75
3    32
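Because matches() takes a regular expression, it can express patterns that starts_with(), ends_with() and contains() cannot express in a single call. A small sketch, assuming we want every column whose name ends in either "_color" or "_year" (the variable name color_or_year is just for illustration):

```r
library(dplyr)

# Anchored regex: an underscore, then "color" or "year", at the end of the name
color_or_year <- starwars %>%
    select(matches("_(color|year)$")) %>%
    names()

color_or_year
# "hair_color" "skin_color" "eye_color" "birth_year"
```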

Use of boolean operators

Tidy selection allows you to use Boolean operators (&, |, !) to specify columns as well:

starwars %>% 
    # ! means 'not': keep columns that end with 'color' and
    # don't start with 'h'. This excludes 'hair_color'
    select(ends_with("color") & 
           !starts_with("h")) %>%
    head(3)
# A tibble: 3 × 2
  skin_color  eye_color
  <chr>       <chr>    
1 fair        blue     
2 gold        yellow   
3 white, blue red      
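The other operators work the same way. As a sketch, | takes the union of two selections and ! on its own takes the complement of one. The variable names either and not_chr are only illustrative.

```r
library(dplyr)

# Union: columns that start with 'h' OR end with 'year'
either <- starwars %>%
    select(starts_with("h") | ends_with("year")) %>%
    names()

# Complement: every column that is NOT a character column
not_chr <- starwars %>%
    select(!where(is.character)) %>%
    names()
```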

Position

You don’t have to select by name. You can use the position of the column in the dataframe instead:

starwars %>% 
    # select the 1st, 3rd, and 4th to 7th columns
    select(1, 3, 4:7) %>%
    head(3)
# A tibble: 3 × 6
  name            mass hair_color skin_color  eye_color birth_year
  <chr>          <dbl> <chr>      <chr>       <chr>          <dbl>
1 Luke Skywalker    77 blond      fair        blue              19
2 C-3PO             75 <NA>       gold        yellow           112
3 R2-D2             32 <NA>       white, blue red               33
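Ranges are not limited to numbers. A short sketch, assuming the column order shown by glimpse() earlier: the : operator also accepts column names, and ! can drop columns by position.

```r
library(dplyr)

# A range by name: every column from name through to mass
first_three <- starwars %>% select(name:mass)

# Drop by position: everything except column 1 and columns 12 to 14
dropped <- starwars %>% select(!c(1, 12:14))
```

Selecting by name range tends to survive small changes to a dataset better than hard-coded positions, although both depend on column order.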

You can also use last_col(), which automatically finds the last column in the dataframe. This saves you having to work out how many columns the dataframe has:

# this:
starwars %>% select(1, length(starwars))

# is the same as this, which is more explicit:
# its meaning is easier to infer from plain English
starwars %>% select(1, last_col())
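last_col() also takes an offset argument that counts backwards from the end, which is handy for selecting the last few columns. A minimal sketch (the variable names are illustrative):

```r
library(dplyr)

# offset = 1 steps one column back from the end
second_last <- starwars %>% select(last_col(1))

# a range from the third-to-last column to the last column
last_three <- starwars %>% select(last_col(2):last_col())
```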

Type

Tidy selection also lets you specify columns by their data type:

starwars %>% 
    # where() makes it explicit that you are selecting with a predicate;
    # bare predicates such as select(is.list) are deprecated in recent tidyselect versions
    select(name, where(is.list)) %>%
    head(3)
# A tibble: 3 × 4
  name           films     vehicles  starships
  <chr>          <list>    <list>    <list>   
1 Luke Skywalker <chr [5]> <chr [2]> <chr [2]>
2 C-3PO          <chr [6]> <chr [0]> <chr [0]>
3 R2-D2          <chr [7]> <chr [0]> <chr [0]>

In this case we can use the columns’ data types to ensure every numeric column is a double (maybe we need this for machine learning purposes):

starwars %>% 
    select(name, where(is.numeric)) %>% 
    # if is.numeric selected any integer columns, convert them to doubles
    # (across() + where() is the current replacement for the superseded mutate_if())
    mutate(across(where(is.integer), as.double)) %>%
    head(3)
# A tibble: 3 × 4
  name           height  mass birth_year
  <chr>           <dbl> <dbl>      <dbl>
1 Luke Skywalker    172    77         19
2 C-3PO             167    75        112
3 R2-D2              96    32         33
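Selecting by type also pairs well with summarise(). As a sketch, assuming we want the mean of every numeric column with missing values ignored (means is an illustrative name):

```r
library(dplyr)

# One row, with one mean per numeric column (height, mass, birth_year)
means <- starwars %>%
    summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

means
```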

Using environment variables or function variables with tidyverse

This is related to the data masking property that tidy evaluation employs [2], rather than its tidy select properties, but it is still useful to know when extending tidy selection.

With environment variables

To use a variable defined in the global environment (one you assigned earlier in your script), you need to tell tidyselect to look outside the dataframe [3]. In a selection you can unquote the variable with !!, or wrap a character vector in all_of(). The .env$ pronoun plays the same role in data-masking verbs such as filter():

cols <- c("homeworld", "sex", "eye_color")

starwars %>% select(!!cols) %>% head(3)
# A tibble: 3 × 3
  homeworld sex   eye_color
  <chr>     <chr> <chr>    
1 Tatooine  male  blue     
2 Tatooine  none  yellow   
3 Naboo     none  red      
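For a character vector of column names, all_of() and any_of() are an alternative to !!. The sketch below reuses cols from above; the column "midichlorians" is a made-up name that does not exist in starwars, included only to show how the two helpers differ.

```r
library(dplyr)

cols <- c("homeworld", "sex", "eye_color")

# all_of(): strict, every name must exist in the data
strict <- starwars %>% select(all_of(cols))

# any_of(): lenient, silently skips names that are missing
wanted <- c(cols, "midichlorians")  # hypothetical, not a real column
lenient <- starwars %>% select(any_of(wanted))

# all_of(wanted) would raise an error because of the missing column
```

all_of() is the safer default when a missing column should be treated as a bug, and any_of() suits optional columns.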

In functions

You need to ‘embrace’ ({{ var }}) your tidyselect syntax when it is provided as an argument to a function [4]:

#' Select all characters that appear on the
#' given homeworld and supply their numeric
#' stats.
#' Only return rows where all numeric columns 
#' aren't NA.
starwars_toptrumps <- function(df = starwars, homeworld){
    df %>% 
        select(name, homeworld, where(is.numeric)) %>% 
        # use the function argument inside filter()
        filter(homeworld == {{ homeworld }}) %>%
        # tidyr's drop_na() checks all numeric columns and
        # only returns rows that don't include NA
        tidyr::drop_na(where(is.numeric))
}

# call the function and feed a 'homeworld' into the filter
starwars_toptrumps(homeworld = "Naboo")
# A tibble: 4 × 5
  name          homeworld height  mass birth_year
  <chr>         <chr>      <int> <dbl>      <dbl>
1 R2-D2         Naboo         96    32         33
2 Palpatine     Naboo        170    75         82
3 Padmé Amidala Naboo        185    45         46
4 Jar Jar Binks Naboo        196    66         52
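Embracing also works when the argument is itself a column selection. A minimal sketch, where select_with_name() is a hypothetical helper written for this post and cols can be any tidy-select expression:

```r
library(dplyr)

# Hypothetical helper: always keep `name`, plus whatever the caller selects
select_with_name <- function(df, cols) {
    df %>% select(name, {{ cols }})
}

# The caller can pass helper functions...
colors <- select_with_name(starwars, ends_with("color"))

# ...or bare column names
stats <- select_with_name(starwars, c(height, mass))
```

Without the embrace, select() would look for a column literally called cols rather than using the expression the caller supplied.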

References

1. posit. Technical description of tidyselect. [cited 27 Dec 2024]. Available: https://tidyselect.r-lib.org/articles/syntax.html
2. posit. Programming with dplyr. [cited 27 Dec 2024]. Available: https://dplyr.tidyverse.org/articles/programming.html#data-masking
3. posit. Technical description of tidyselect. [cited 27 Dec 2024]. Available: https://tidyselect.r-lib.org/articles/syntax.html#data-expressions-and-env-expressions
4. posit. Programming with dplyr. [cited 27 Dec 2024]. Available: https://dplyr.tidyverse.org/articles/programming.html#indirection-1