This lecture, as the rest of the course, is adapted from the version Stephanie C. Hicks designed and maintained in 2021 and 2022. Check the recent changes to this file through the GitHub history.
“Happy families are all alike; every unhappy family is unhappy in its own way.” -- Leo Tolstoy
“Tidy datasets are all alike, but every messy dataset is messy in its own way.” -- Hadley Wickham
Pre-lecture materials
Read ahead
Before class, you can prepare by reading the following materials:
Tidy Data paper published in the Journal of Statistical Software
The purpose of defining tidy data is to highlight the fact that most data do not start out life as tidy.
In fact, much of the work of data analysis may involve simply making the data tidy (at least this has been our experience).
Once a dataset is tidy, it can be used as input into a variety of other functions that may transform, model, or visualize the data.
Example
As a quick example, consider the following survey data on religion and income, in which each income range is a column name and each cell holds the number of respondents in that range.
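A minimal sketch of this example, assuming the data are the `relig_income` dataset that ships with tidyr. The output column names `income` and `respondents` are illustrative choices rather than anything fixed by the dataset:

```r
library(tidyr)
library(dplyr)

# relig_income ships with tidyr: one row per religion,
# one column per income range, cells are respondent counts
relig_income

# Gather the income-range columns into two columns:
# `income` (the former column names) and `respondents` (the counts)
relig_tidy <- relig_income %>%
  pivot_longer(-religion, names_to = "income", values_to = "respondents") %>%
  mutate(religion = factor(religion), income = factor(income))

relig_tidy
```

In the tidy version, each row is one religion and income range combination, so the table gets longer and narrower.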
Some of these functions you have seen before, others might be new to you. Let’s talk about each one in the context of the tidyverse R packages.
The “Tidyverse”
There are a number of R packages that take advantage of the tidy data form and can be used to do interesting things with data. Many (but not all) of these packages are written by Hadley Wickham and the collection of packages is often referred to as the “tidyverse” because of their dependence on and presumption of tidy data.
Note
A subset of the “Tidyverse” packages includes:
ggplot2: a plotting system based on the grammar of graphics
magrittr: defines the %>% operator for chaining functions together in a series of operations on data
dplyr: a suite of (fast) functions for working with data frames
tidyr: easily tidy data with pivot_wider() and pivot_longer() functions (also separate() and unite())
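To illustrate the chaining idea behind `%>%`, here is a small sketch on a made-up numeric vector. It compares a nested call with the equivalent piped call:

```r
library(magrittr)

x <- c(1, 4, 9)

# Nested call: read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped call: read left to right, each result feeds the next function
piped <- x %>% sqrt() %>% mean() %>% round(1)

# Both forms compute the same value
identical(nested, piped)
```

The piped form tends to be easier to read once a sequence of operations on a data frame grows beyond two or three steps.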
In the religion and income example above, we used the pivot_longer() function. Let’s talk about what it did.
We will also talk about pivot_wider().
pivot_longer()
The tidyr package includes functions to convert a data frame between long and wide formats.
Wide format data tends to have different attributes or variables describing an observation placed in separate columns.
Long format data tends to have different attributes encoded as levels of a single variable, followed by another column that contains the values of the observation at those different levels.
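The two layouts can be sketched with a tiny, hypothetical table (the `subject`, `height`, and `weight` columns and their values are made up for illustration):

```r
library(tidyr)
library(dplyr)

# Hypothetical wide data: one row per subject, one column per attribute
wide <- tibble(
  subject = c("a", "b"),
  height  = c(170, 182),
  weight  = c(65, 80)
)

# Long format: attribute names become levels of `attribute`,
# and their values move into `value`
long <- wide %>%
  pivot_longer(c(height, weight), names_to = "attribute", values_to = "value")

long

# pivot_wider() reverses the reshaping
back <- long %>%
  pivot_wider(names_from = attribute, values_from = value)

back
```

Here the wide table has one row per subject, while the long table has one row per subject and attribute pair.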
Example
In the section above, we showed an example that used pivot_longer() to convert data into a tidy format.
The key problem with the tidiness of the data is that the income variables are not in their own columns, but rather are embedded in the structure of the columns.
To fix this, you can use the pivot_longer() function to gather values spread across several columns into a single column, here with the column names gathered into an income column.
Note: when gathering, exclude any columns that you do not want “gathered” (religion in this case) by prefixing their names with a minus sign in the pivot_longer() call.
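As a sketch of the minus-sign selection, the following uses a hypothetical two-row miniature of the survey table and checks that excluding `religion` gives the same result as naming the gathered columns explicitly:

```r
library(tidyr)
library(dplyr)

# Hypothetical miniature of the survey table (values for illustration only)
survey <- tibble(
  religion  = c("Agnostic", "Atheist"),
  `<$10k`   = c(27, 12),
  `$10-20k` = c(34, 27)
)

# Exclude `religion` from gathering with a minus sign ...
by_exclusion <- survey %>%
  pivot_longer(-religion, names_to = "income", values_to = "respondents")

# ... or name the columns to gather explicitly
by_inclusion <- survey %>%
  pivot_longer(c(`<$10k`, `$10-20k`),
               names_to = "income", values_to = "respondents")

# The two selections should describe the same set of columns
all.equal(by_exclusion, by_inclusion)
```

Excluding columns is usually more convenient when there are many columns to gather and only a few to keep fixed.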
Even if your data is in a tidy format, pivot_longer() is occasionally useful for pulling data together to take advantage of faceting, or plotting separate plots based on a grouping variable. We will talk more about that in a future lecture.
pivot_wider()
The pivot_wider() function is less commonly needed to tidy data. It can, however, be useful for creating summary tables.
Example
In this example, you can use the summarize() function in dplyr to compute the total number of respondents per income category.
Notice in this example how pivot_wider() has been used at the very end of the code sequence to convert the summarized data into a shape that offers a better tabular presentation for a report.
Note
In the pivot_wider() call, you first specify the name of the column to use for the new column names (income in this example) and then specify the column to use for the cell values (total_respondents here).
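A minimal sketch of such a pipeline, assuming the `relig_income` dataset from tidyr and using the column names `income`, `respondents`, and `total_respondents` as illustrative choices:

```r
library(tidyr)
library(dplyr)

income_totals <- relig_income %>%
  # Tidy the data: one row per religion and income range
  pivot_longer(-religion, names_to = "income", values_to = "respondents") %>%
  # Total respondents per income range, summed over religions
  group_by(income) %>%
  summarize(total_respondents = sum(respondents)) %>%
  # Spread the income ranges back into columns for a compact table
  pivot_wider(names_from = "income", values_from = "total_respondents")

income_totals
```

The result is a single-row table with one column per income range, which is often easier to read in a report than the long summary.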
Example of pivot_longer()
Let’s try another dataset. These data contain an excerpt of the Gapminder data on life expectancy, GDP per capita, and population by country.
library(gapminder)
gapminder
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ℹ 1,694 more rows
If we wanted to make lifeExp, pop and gdpPercap (all measurements that we observe) go from a wide table into a long table, what would we do?
# try it yourself
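One possible approach, sketched under the assumption that the three measurements should share a single value column. The names `measurement` and `value` are illustrative choices:

```r
library(gapminder)
library(tidyr)
library(dplyr)

gapminder_long <- gapminder %>%
  # Gather the three measurement columns; the remaining columns
  # (country, continent, year) are repeated for each measurement
  pivot_longer(c(lifeExp, pop, gdpPercap),
               names_to = "measurement", values_to = "value")

gapminder_long
```

Each original row becomes three rows, one per measurement.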
Example
One more! Try using pivot_longer() to convert the following data that contains made-up revenues for three companies by quarter for years 2006 to 2009.
Afterward, use group_by() and summarize() to calculate the average revenue for each company across all years and all quarters.
Bonus: Calculate a mean revenue for each company AND each year (averaged across all 4 quarters).
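A sketch of one way to approach this exercise. The table below is a hypothetical stand-in for the revenue data, so the object name `revenue_wide`, the company labels, the quarter columns `Q1` to `Q4`, and the values are all assumptions:

```r
library(tidyr)
library(dplyr)

# Hypothetical stand-in: 3 companies x 4 years, one column per quarter
set.seed(1)
revenue_wide <- tibble(
  company = rep(c("A", "B", "C"), each = 4),
  year    = rep(2006:2009, times = 3),
  Q1 = runif(12, 0, 100),
  Q2 = runif(12, 0, 100),
  Q3 = runif(12, 0, 100),
  Q4 = runif(12, 0, 100)
)

# Gather the quarter columns into `quarter` and `amount`
revenue_long <- revenue_wide %>%
  pivot_longer(Q1:Q4, names_to = "quarter", values_to = "amount")

# Average revenue per company across all years and quarters
by_company <- revenue_long %>%
  group_by(company) %>%
  summarize(mean_revenue = mean(amount))

# Bonus: average revenue per company and year (over the 4 quarters)
by_company_year <- revenue_long %>%
  group_by(company, year) %>%
  summarize(mean_revenue = mean(amount), .groups = "drop")

by_company
by_company_year
```

Once the quarters are in a single column, both summaries differ only in the grouping variables passed to group_by().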
Both unite() and separate() have a remove argument. What does it do? Why would you set it to FALSE?
Compare and contrast separate() and extract(). Why are there three variations of separation (by position, by separator, and with groups), but only one unite()?
Additional Resources
Tip
Tidy Data paper published in the Journal of Statistical Software