Introduction to the tidyverse

Data Science flavoured R

tutorial
r
tidyverse
intro
Tidyverse is a collection of R packages designed for data science
Author

Matthew Scott

Published

July 20, 2024

A picture I drew one time

Red brush strokes

Tidyverse

Tidyverse is a collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

Installation: run install.packages("tidyverse") once to install every package in the collection, then call library(tidyverse) in each session to load the core packages.

Tip

Installing tidyverse usually pulls in many packages your code never uses, which wastes time and disk space. It is better practice (and helps your learning) to install only the packages you need. See below for popular tidyverse packages.

Installing & importing packages

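# install the whole collection (only needed once)
install.packages("tidyverse")
# or install specific packages
install.packages("dplyr")
install.packages("ggplot2")

# now load them into your session
library(tidyverse)  # attaches the core packages, including dplyr and ggplot2
# or load only the packages you need
library(dplyr)
library(ggplot2)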
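
If you follow the tip and install packages individually, it helps to check what is already on your machine first. The snippet below is a minimal sketch of that idea. The helper name missing_pkgs is hypothetical (it is not part of any package), and the install call is left commented out so nothing is downloaded by accident.

```r
# Hypothetical helper: return the packages in `pkgs` that are not installed yet.
missing_pkgs <- function(pkgs) {
  # requireNamespace() returns TRUE when a package can be loaded, FALSE otherwise
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_pkgs(c("dplyr", "ggplot2"))

if (length(to_install) > 0) {
  message("Not installed yet: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install them
}
```

Running this on a machine that already has both packages prints nothing, which is the point: you only install what is missing.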

Inspecting a dataset

An essential first step in any data analysis task is inspecting your data visually. Some packages come with datasets you can work with, so you'll want to see what they look like, or you can inspect your own data.

Toy datasets

Toy datasets are bundled with some packages and become available once you install and load the package. Some examples are:

  • mpg from ggplot2
  • starwars from dplyr
  • storms from dplyr
  • band_members from dplyr (a small dataset; together with band_instruments and band_instruments2 it forms a set of three tables useful for demonstrating joins)
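
To find out which toy datasets a package ships with, you can ask R directly. This is a small sketch that assumes dplyr is already installed; the variable names listing, dataset_names and band are only for illustration.

```r
# data() with a `package` argument returns a listing; its `results`
# matrix has one row per dataset, with the dataset name in the "Item" column.
listing <- data(package = "dplyr")
dataset_names <- listing$results[, "Item"]
print(dataset_names)

# a toy dataset can be used without attaching the package, via `::`
band <- dplyr::band_members
print(band)
```

The same call works for other packages, for example data(package = "ggplot2").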
library(dplyr)

# open the help page (data dictionary) for ggplot2's built-in mpg dataset
# help("mpg")

# load the dataset into a variable
df <- ggplot2::mpg

# see an information-rich summary
# glimpse(df)

# see dimension of object 
# (number of rows and columns)
# dim(df)

# see top n rows
df %>% head(n = 5)

Notes on the code above:

  • library(dplyr): we need this package so we can access the %>% 'pipe' operator.
  • ggplot2::mpg: this notation tells R to look in the ggplot2 package for the dataset mpg.
  • head(n = 5): you can also use tail() to see the bottom n rows. Calling head() without n defaults to the top 6 rows.

# A tibble: 5 × 11
  manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
  <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
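
The calls that are commented out in the snippet above can be run the same way. Here is a minimal sketch that puts them together, assuming dplyr and ggplot2 are installed:

```r
library(dplyr)

df <- ggplot2::mpg

# number of rows and columns
dim(df)

# one line per column: name, type and the first few values
glimpse(df)

# bottom 3 rows
df %>% tail(n = 3)
```

For mpg, dim() reports 234 rows and 11 columns, which matches the 11 columns shown in the head() output.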