09 - Tidy data and the Tidyverse

Introduction to tidy data and how to convert between wide and long data with the tidyr R package
module 2
week 2
R
programming
tidyr
here
tidyverse
Author
Affiliations

This lecture, as the rest of the course, is adapted from the version Stephanie C. Hicks designed and maintained in 2021 and 2022. Check the recent changes to this file through the GitHub history.

“Happy families are all alike; every unhappy family is unhappy in its own way.” —- Leo Tolstoy

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” —- Hadley Wickham

Pre-lecture materials

Read ahead

Read ahead

Before class, you can prepare by reading the following materials:

  1. Tidy Data paper published in the Journal of Statistical Software
  2. https://r4ds.had.co.nz/tidy-data
  3. tidyr cheat sheet from RStudio

Acknowledgements

Material for this lecture was borrowed and adopted from

Learning objectives

Learning objectives

At the end of this lesson you will:

  • Define tidy data
  • Be able to transform non-tidy data into tidy data
  • Be able to transform wide data into long data
  • Be able to separate character columns into multiple columns
  • Be able to unite multiple character columns into one column

Tidy data

As we learned in the last lesson, one unifying concept of the tidyverse is the notion of tidy data.

As defined by Hadley Wickham in his 2014 paper published in the Journal of Statistical Software, a tidy dataset has the following properties:

  1. Each variable forms a column.

  2. Each observation forms a row.

  3. Each type of observational unit forms a table.

Artwork by Allison Horst on tidy data

[Source: Artwork by Allison Horst]

The purpose of defining tidy data is to highlight the fact that most data do not start out life as tidy.

In fact, much of the work of data analysis may involve simply making the data tidy (at least this has been our experience).

  • Once a dataset is tidy, it can be used as input into a variety of other functions that may transform, model, or visualize the data.
Example

As a quick example, consider the following data illustrating religion and income survey data with the number of respondents with income range in column name.

This is in a classic table format:

library(tidyr)
relig_income
# A tibble: 18 × 11
   religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
   <chr>      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>
 1 Agnostic      27        34        60        81        76       137        122
 2 Atheist       12        27        37        52        35        70         73
 3 Buddhist      27        21        30        34        33        58         62
 4 Catholic     418       617       732       670       638      1116        949
 5 Don’t k…      15        14        15        11        10        35         21
 6 Evangel…     575       869      1064       982       881      1486        949
 7 Hindu          1         9         7         9        11        34         47
 8 Histori…     228       244       236       238       197       223        131
 9 Jehovah…      20        27        24        24        21        30         15
10 Jewish        19        19        25        25        30        95         69
11 Mainlin…     289       495       619       655       651      1107        939
12 Mormon        29        40        48        51        56       112         85
13 Muslim         6         7         9        10         9        23         16
14 Orthodox      13        17        23        32        32        47         38
15 Other C…       9         7        11        13        13        14         18
16 Other F…      20        33        40        46        49        63         46
17 Other W…       5         2         3         4         2         7          3
18 Unaffil…     217       299       374       365       341       528        407
# ℹ 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>,
#   `Don't know/refused` <dbl>

While this format is canonical and is useful for quickly observing the relationship between multiple variables, it is not tidy.

This format violates the tidy form because there are variables in the columns.

  • In this case the variables are religion, income bracket, and the number of respondents, which is the third variable, is presented inside the table.

Converting this data to tidy format would give us

library(tidyverse)

relig_income %>%
    pivot_longer(-religion, names_to = "income", values_to = "respondents") %>%
    mutate(religion = factor(religion), income = factor(income))
# A tibble: 180 × 3
   religion income             respondents
   <fct>    <fct>                    <dbl>
 1 Agnostic <$10k                       27
 2 Agnostic $10-20k                     34
 3 Agnostic $20-30k                     60
 4 Agnostic $30-40k                     81
 5 Agnostic $40-50k                     76
 6 Agnostic $50-75k                    137
 7 Agnostic $75-100k                   122
 8 Agnostic $100-150k                  109
 9 Agnostic >150k                       84
10 Agnostic Don't know/refused          96
# ℹ 170 more rows

Some of these functions you have seen before, others might be new to you. Let’s talk about each one in the context of the tidyverse R packages.

The “Tidyverse”

There are a number of R packages that take advantage of the tidy data form and can be used to do interesting things with data. Many (but not all) of these packages are written by Hadley Wickham and the collection of packages is often referred to as the “tidyverse” because of their dependence on and presumption of tidy data.

Note

A subset of the “Tidyverse” packages include:

  • ggplot2: a plotting system based on the grammar of graphics

  • magrittr: defines the %>% operator for chaining functions together in a series of operations on data

  • dplyr: a suite of (fast) functions for working with data frames

  • tidyr: easily tidy data with pivot_wider() and pivot_longer() functions (also separate() and unite())

A complete list can be found here (https://www.tidyverse.org/packages).

We will be using these packages quite a bit.

The “tidyverse” package can be used to install all of the packages in the tidyverse at once.

For example, instead of starting an R script with this:

library(dplyr)
library(tidyr)
library(readr)
library(ggplot2)

You can start with this:

library(tidyverse)

In the example above, let’s talk about what we did using the pivot_longer() function.

We will also talk about pivot_wider().

pivot_longer()

The tidyr package includes functions to transfer a data frame between long and wide.

  • Wide format data tends to have different attributes or variables describing an observation placed in separate columns.
  • Long format data tends to have different attributes encoded as levels of a single variable, followed by another column that contains tha values of the observation at those different levels.
Example

In the section above, we showed an example that used pivot_longer() to convert data into a tidy format.

The key problem with the tidyness of the data is that the income variables are not in their own columns, but rather are embedded in the structure of the columns.

To fix this, you can use the pivot_longer() function to gather values spread across several columns into a single column, here with the column names gathered into an income column.

Note: when gathering, exclude any columns that you do not want “gathered” (religion in this case) by including the column names with a the minus sign in the pivot_longer() function.

For example:

# Gather everything EXCEPT religion to tidy data
relig_income %>%
    pivot_longer(-religion, names_to = "income", values_to = "respondents")
# A tibble: 180 × 3
   religion income             respondents
   <chr>    <chr>                    <dbl>
 1 Agnostic <$10k                       27
 2 Agnostic $10-20k                     34
 3 Agnostic $20-30k                     60
 4 Agnostic $30-40k                     81
 5 Agnostic $40-50k                     76
 6 Agnostic $50-75k                    137
 7 Agnostic $75-100k                   122
 8 Agnostic $100-150k                  109
 9 Agnostic >150k                       84
10 Agnostic Don't know/refused          96
# ℹ 170 more rows

Even if your data is in a tidy format, pivot_longer() is occasionally useful for pulling data together to take advantage of faceting, or plotting separate plots based on a grouping variable. We will talk more about that in a future lecture.

pivot_wider()

The pivot_wider() function is less commonly needed to tidy data. It can, however, be useful for creating summary tables.

Example

You use the summarize() function in dplyr to summarize the total number of respondents per income category.

relig_income %>%
    pivot_longer(-religion, names_to = "income", values_to = "respondents") %>%
    mutate(religion = factor(religion), income = factor(income)) %>%
    group_by(income) %>%
    summarize(total_respondents = sum(respondents)) %>%
    pivot_wider(
        names_from = "income",
        values_from = "total_respondents"
    ) %>%
    knitr::kable()
<$10k >150k $10-20k $100-150k $20-30k $30-40k $40-50k $50-75k $75-100k Don’t know/refused
1930 2608 2781 3197 3357 3302 3085 5185 3990 6121

Notice in this example how pivot_wider() has been used at the very end of the code sequence to convert the summarized data into a shape that offers a better tabular presentation for a report.

Note

In the pivot_wider() call, you first specify the name of the column to use for the new column names (income in this example) and then specify the column to use for the cell values (total_respondents here).

Example of pivot_longer()

Let’s try another dataset. This data contain an excerpt of the Gapminder data on life expectancy, GDP per capita, and population by country.

library(gapminder)
gapminder
# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ℹ 1,694 more rows

If we wanted to make lifeExp, pop and gdpPercap (all measurements that we observe) go from a wide table into a long table, what would we do?

# try it yourself
Example

One more! Try using pivot_longer() to convert the the following data that contains made-up revenues for three companies by quarter for years 2006 to 2009.

Afterward, use group_by() and summarize() to calculate the average revenue for each company across all years and all quarters.

Bonus: Calculate a mean revenue for each company AND each year (averaged across all 4 quarters).

df <- tibble(
    "company" = rep(1:3, each = 4),
    "year" = rep(2006:2009, 3),
    "Q1" = sample(x = 0:100, size = 12),
    "Q2" = sample(x = 0:100, size = 12),
    "Q3" = sample(x = 0:100, size = 12),
    "Q4" = sample(x = 0:100, size = 12),
)
df
# A tibble: 12 × 6
   company  year    Q1    Q2    Q3    Q4
     <int> <int> <int> <int> <int> <int>
 1       1  2006    99     6    54    47
 2       1  2007    28    79    90     9
 3       1  2008     7    72    69    24
 4       1  2009    16    56     6   100
 5       2  2006    42    58    75    25
 6       2  2007    64     1   100     6
 7       2  2008    43    88    37    77
 8       2  2009    95    74    17    44
 9       3  2006    34    47    77    38
10       3  2007    73    31    31    54
11       3  2008     4    49    93     0
12       3  2009    57     4    45    96
# try it yourself

separate() and unite()

The same tidyr package also contains two useful functions:

  • unite(): combine contents of two or more columns into a single column
  • separate(): separate contents of a column into two or more columns

First, we combine the first three columns into one new column using unite().

gapminder %>%
    unite(
        col = "country_continent_year",
        country:year,
        sep = "_"
    )
# A tibble: 1,704 × 4
   country_continent_year lifeExp      pop gdpPercap
   <chr>                    <dbl>    <int>     <dbl>
 1 Afghanistan_Asia_1952     28.8  8425333      779.
 2 Afghanistan_Asia_1957     30.3  9240934      821.
 3 Afghanistan_Asia_1962     32.0 10267083      853.
 4 Afghanistan_Asia_1967     34.0 11537966      836.
 5 Afghanistan_Asia_1972     36.1 13079460      740.
 6 Afghanistan_Asia_1977     38.4 14880372      786.
 7 Afghanistan_Asia_1982     39.9 12881816      978.
 8 Afghanistan_Asia_1987     40.8 13867957      852.
 9 Afghanistan_Asia_1992     41.7 16317921      649.
10 Afghanistan_Asia_1997     41.8 22227415      635.
# ℹ 1,694 more rows

Next, we show how to separate the columns into three separate columns using separate() using the col, into and sep arguments.

gapminder %>%
    unite(
        col = "country_continent_year",
        country:year,
        sep = "_"
    ) %>%
    separate(
        col = "country_continent_year",
        into = c("country", "continent", "year"),
        sep = "_"
    )
# A tibble: 1,704 × 6
   country     continent year  lifeExp      pop gdpPercap
   <chr>       <chr>     <chr>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia      1952     28.8  8425333      779.
 2 Afghanistan Asia      1957     30.3  9240934      821.
 3 Afghanistan Asia      1962     32.0 10267083      853.
 4 Afghanistan Asia      1967     34.0 11537966      836.
 5 Afghanistan Asia      1972     36.1 13079460      740.
 6 Afghanistan Asia      1977     38.4 14880372      786.
 7 Afghanistan Asia      1982     39.9 12881816      978.
 8 Afghanistan Asia      1987     40.8 13867957      852.
 9 Afghanistan Asia      1992     41.7 16317921      649.
10 Afghanistan Asia      1997     41.8 22227415      635.
# ℹ 1,694 more rows

Post-lecture materials

Final Questions

Here are some post-lecture questions to help you think about the material discussed.

Questions
  1. Using prose, describe how the variables and observations are organised in a tidy dataset versus an non-tidy dataset.

  2. What do the extra and fill arguments do in separate()? Experiment with the various options for the following two toy datasets.

tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
    separate(x, c("one", "two", "three"))

tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
    separate(x, c("one", "two", "three"))
  1. Both unite() and separate() have a remove argument. What does it do? Why would you set it to FALSE?

  2. Compare and contrast separate() and extract(). Why are there three variations of separation (by position, by separator, and with groups), but only one unite()?

Additional Resources

Tip

R session information

options(width = 120)
sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.1 (2023-06-16)
 os       macOS Ventura 13.5
 system   aarch64, darwin20
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/New_York
 date     2023-08-17
 pandoc   3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
 package     * version date (UTC) lib source
 cli           3.6.1   2023-03-23 [1] CRAN (R 4.3.0)
 colorout      1.2-2   2023-05-06 [1] Github (jalvesaq/colorout@79931fd)
 colorspace    2.1-0   2023-01-23 [1] CRAN (R 4.3.0)
 digest        0.6.33  2023-07-07 [1] CRAN (R 4.3.0)
 dplyr       * 1.1.2   2023-04-20 [1] CRAN (R 4.3.0)
 evaluate      0.21    2023-05-05 [1] CRAN (R 4.3.0)
 fansi         1.0.4   2023-01-22 [1] CRAN (R 4.3.0)
 fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.3.0)
 forcats     * 1.0.0   2023-01-29 [1] CRAN (R 4.3.0)
 gapminder   * 1.0.0   2023-03-10 [1] CRAN (R 4.3.0)
 generics      0.1.3   2022-07-05 [1] CRAN (R 4.3.0)
 ggplot2     * 3.4.3   2023-08-14 [1] CRAN (R 4.3.1)
 glue          1.6.2   2022-02-24 [1] CRAN (R 4.3.0)
 gtable        0.3.3   2023-03-21 [1] CRAN (R 4.3.0)
 hms           1.1.3   2023-03-21 [1] CRAN (R 4.3.0)
 htmltools     0.5.6   2023-08-10 [1] CRAN (R 4.3.0)
 htmlwidgets   1.6.2   2023-03-17 [1] CRAN (R 4.3.0)
 jsonlite      1.8.7   2023-06-29 [1] CRAN (R 4.3.0)
 knitr         1.43    2023-05-25 [1] CRAN (R 4.3.0)
 lifecycle     1.0.3   2022-10-07 [1] CRAN (R 4.3.0)
 lubridate   * 1.9.2   2023-02-10 [1] CRAN (R 4.3.0)
 magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.3.0)
 munsell       0.5.0   2018-06-12 [1] CRAN (R 4.3.0)
 pillar        1.9.0   2023-03-22 [1] CRAN (R 4.3.0)
 pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.3.0)
 purrr       * 1.0.2   2023-08-10 [1] CRAN (R 4.3.0)
 R6            2.5.1   2021-08-19 [1] CRAN (R 4.3.0)
 readr       * 2.1.4   2023-02-10 [1] CRAN (R 4.3.0)
 rlang         1.1.1   2023-04-28 [1] CRAN (R 4.3.0)
 rmarkdown     2.24    2023-08-14 [1] CRAN (R 4.3.1)
 rstudioapi    0.15.0  2023-07-07 [1] CRAN (R 4.3.0)
 scales        1.2.1   2022-08-20 [1] CRAN (R 4.3.0)
 sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.3.0)
 stringi       1.7.12  2023-01-11 [1] CRAN (R 4.3.0)
 stringr     * 1.5.0   2022-12-02 [1] CRAN (R 4.3.0)
 tibble      * 3.2.1   2023-03-20 [1] CRAN (R 4.3.0)
 tidyr       * 1.3.0   2023-01-24 [1] CRAN (R 4.3.0)
 tidyselect    1.2.0   2022-10-10 [1] CRAN (R 4.3.0)
 tidyverse   * 2.0.0   2023-02-22 [1] CRAN (R 4.3.0)
 timechange    0.2.0   2023-01-11 [1] CRAN (R 4.3.0)
 tzdb          0.4.0   2023-05-12 [1] CRAN (R 4.3.0)
 utf8          1.2.3   2023-01-31 [1] CRAN (R 4.3.0)
 vctrs         0.6.3   2023-06-14 [1] CRAN (R 4.3.0)
 withr         2.5.0   2022-03-03 [1] CRAN (R 4.3.0)
 xfun          0.40    2023-08-09 [1] CRAN (R 4.3.0)
 yaml          2.3.7   2023-01-23 [1] CRAN (R 4.3.0)

 [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────