Introduction to the tidyverse

Data Science flavoured R

tutorial

tidyverse

intro

Tidyverse is a collection of R packages designed for data science

Author

Matthew Scott

Published

July 20, 2024

A picture I drew one time — Red brush strokes

Tidyverse

Tidyverse is a collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

Installation

Install every package in the collection with install.packages("tidyverse"), then attach the core packages to your session with library(tidyverse).

Tip

Installing tidyverse usually pulls in many packages your code never uses, which is inefficient. It is better practice (and helps your learning) to install only the packages you need. See below for popular tidyverse packages.

Popular Tidyverse packages

dplyr - Solve the most common data manipulation challenges (NB dbplyr allows you to use remote database tables by converting dplyr code to SQL)
readr - Read and write tabular data in formats such as CSV and TSV. (NB readxl and writexl work with Excel files, and googlesheets4 works with Google Sheets)
stringr - Set of functions designed to make working with strings as easy as possible. It also builds regular expression (regex) patterns into its syntax. Many common data cleaning and preparation tasks involve strings: detecting matches, subsetting strings, mutating strings, ordering, …
tidyr - A set of functions to help tidy data (each column is a variable, each row an observation, and each cell a single value). separate_wider_delim(), hoist(), pivot_longer(), …
ggplot2 - A declarative package for making graphics. See also R Graphics Cookbook
purrr - Provides a complete set of tools for working with functions and vectors. (The map() family can efficiently replace for loops). A good place to start learning is here.
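To make the point about map() replacing for loops concrete, here is a minimal sketch. The nums list and the variable names are made up for illustration; map_dbl() is the purrr variant of map() that returns a numeric vector.

```r
library(purrr)

# A small made-up list of numeric vectors
nums <- list(c(1, 2, 3), c(4, 5, 6))

# for loop version: pre-allocate the result, then fill it element by element
means_loop <- numeric(length(nums))
for (i in seq_along(nums)) {
  means_loop[i] <- mean(nums[[i]])
}

# purrr version: apply mean() to each element and collect a numeric vector
means_map <- map_dbl(nums, mean)

means_map
```

Both versions compute the same values; the map_dbl() call simply states the intent in one line and guarantees the type of the result.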

Note

The tidyverse contains more packages than these. Other helpful ones to know about include httr, lubridate, glue, modelr and forcats.
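As a small illustration of the tidy data idea described for tidyr above, the sketch below reshapes a made-up "wide" table so that each row is one observation. The scores_wide data and its column names are invented for this example.

```r
library(tidyr)

# Made-up wide table: one row per student, one column per test
scores_wide <- data.frame(
  student = c("Ana", "Ben"),
  test1 = c(70, 85),
  test2 = c(90, 60)
)

# pivot_longer() gives one row per student-test pair:
# the old column names go into `test`, the cell values into `score`
scores_long <- pivot_longer(
  scores_wide,
  cols = c(test1, test2),
  names_to = "test",
  values_to = "score"
)

scores_long
```

After reshaping, test is a variable in its own column rather than being spread across column names, which is the shape most dplyr and ggplot2 functions expect.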

Installing & importing packages

# install the whole tidyverse
install.packages("tidyverse")
# or install only the packages you need
install.packages("dplyr")
install.packages("ggplot2")

# attach the core tidyverse packages to your session
library(tidyverse)
# or attach only the packages you installed
library(dplyr)
library(ggplot2)
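If you follow the tip about installing packages individually, one possible approach is to install only what is missing. The sketch below is an assumption about how you might do that, not a tidyverse feature: missing_packages() is a hypothetical helper written for this example, and the install line is left commented out so nothing is installed by accident.

```r
# Hypothetical helper: return the names in `pkgs` that cannot be loaded,
# which in practice means they are not installed
missing_packages <- function(pkgs) {
  # requireNamespace() returns TRUE when the package's namespace can be loaded
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

wanted <- c("dplyr", "ggplot2")
to_install <- missing_packages(wanted)

# Uncomment to install whatever is missing
# if (length(to_install) > 0) install.packages(to_install)
```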

Inspecting a dataset

An essential first step in any data analysis task is inspecting your data visually. Some packages come with datasets you can work with, so you’ll want to see what they look like, or you can inspect your own data.

Toy datasets

Toy datasets are useful for practice. They are bundled with some packages and become available once you install and load the relevant package. Some examples are:

mpg from ggplot2
starwars from dplyr
storms from dplyr
band_members from dplyr (one of three small tables, alongside band_instruments and band_instruments2, that are useful for demonstrating joins)
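As a quick illustration of the joins mentioned for band_members, the sketch below combines it with band_instruments, which also ships with dplyr. The variable names both and all_members are made up for this example.

```r
library(dplyr)

# inner_join() keeps only the names that appear in both tables
both <- band_members %>%
  inner_join(band_instruments, by = "name")

# left_join() keeps every row of band_members and fills
# missing instrument values with NA
all_members <- band_members %>%
  left_join(band_instruments, by = "name")

both
all_members
```

Comparing the two results row by row is a simple way to see which names exist in only one of the tables.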

library(dplyr)

# open the help page (data dictionary) for ggplot2's built-in mpg dataset
# help("mpg")

# load the dataset into a variable
df <- ggplot2::mpg

# see an information-rich summary
# glimpse(df)

# see dimension of object 
# (number of rows and columns)
# dim(df)

# see top n rows
df %>% head(n = 5)

1: library(dplyr) is needed so we can access the %>% ‘pipe’ operator
2: The ggplot2::mpg notation tells R to look in the ggplot2 package for the dataset mpg
3: You can also use tail() to see the bottom n rows. Calling head() without n returns the top 6 rows by default.

# A tibble: 5 × 11
  manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
  <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
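The inspection functions that are commented out in the code above can be run the same way. The following sketch simply calls each of them on the same dataset; the names shape and last_rows are made up so the results can be reused.

```r
library(dplyr)

df <- ggplot2::mpg

# number of rows and columns
shape <- dim(df)
shape

# one line per column: name, type and the first few values
glimpse(df)

# bottom 3 rows, using the pipe as before
last_rows <- df %>% tail(n = 3)
last_rows
```

glimpse() is often the most convenient of these for wide tables, because columns that would be truncated in the printed tibble are all listed.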