Project 3
This project, like the rest of the course, is adapted from the version Stephanie C. Hicks designed and maintained in 2021 and 2022. Check the recent changes to this file through the GitHub history.
Background
Due date: October 20th at 11:59pm
The goal of this assignment is to practice wrangling special data types (including dates, character strings, and factors) and visualizing results while practicing our tidyverse skills.
To submit your project
Please write up your project using R Markdown and process it with knitr. Compile your document as an HTML file and submit your HTML file to the dropbox on Courseplus. Please show all your code (i.e., make sure to set `echo = TRUE`) for each of the answers to each part.
Load data
The datasets for this part of the assignment come from TidyTuesday.
Data dictionary available here:
Specifically, we will explore album sales and lyrics from two artists (Beyoncé and Taylor Swift). The data are available from TidyTuesday from September 2020, which I have provided for you below.
However, to avoid re-downloading data, we will check to see if those files already exist using an `if()` statement:
library("here")
rds_files <- c("b_lyrics.RDS", "ts_lyrics.RDS", "sales.RDS")

## Check whether we have all 3 files
if (any(!file.exists(here("data", rds_files)))) {
    ## If we don't, then download the data
    b_lyrics <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-29/beyonce_lyrics.csv")
    ts_lyrics <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-29/taylor_swift_lyrics.csv")
    sales <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-29/sales.csv")
## Then save the data objects to RDS files
saveRDS(b_lyrics, file = here("data", "b_lyrics.RDS"))
saveRDS(ts_lyrics, file = here("data", "ts_lyrics.RDS"))
saveRDS(sales, file = here("data", "sales.RDS"))
}
The above code will only run if it cannot find the path to b_lyrics.RDS on your computer. Then, we can just read in these files every time we knit the R Markdown document, instead of re-downloading them every time.
Let’s load the datasets:
b_lyrics <- readRDS(here("data", "b_lyrics.RDS"))
ts_lyrics <- readRDS(here("data", "ts_lyrics.RDS"))
sales <- readRDS(here("data", "sales.RDS"))
Part 1: Explore album sales
In this section, the goal is to explore the sales of studio albums from Beyoncé and Taylor Swift.
Notes
- In each of the subsections below that ask you to create a plot, you must create a title, subtitle, x-axis label, and y-axis label with units where applicable. For example, if your axis says “sales” as an axis label, change it to “sales (in millions)”.
Part 1A
In this section, we will do some data wrangling.
- Use `lubridate` to create a column called `released` that is a `Date` class. However, to be able to do this, you first need to use `stringr` to search for a pattern that matches things like "(US)[51]" in a string like "September 1, 2006 (US)[51]" and remove them. (Note: to get full credit, you must create the regular expression.)
- Use `forcats` to create a factor called `country` (Note: you may need to collapse some factor levels).
- Transform the `sales` into a unit that is album sales in millions of dollars.
- Keep only album sales from the UK, the US, or the World.
- Auto print your final wrangled tibble data frame.
# Add your solution here
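One possible sketch of the wrangling above, assuming the raw `sales` tibble has `artist`, `country`, `sales`, and `released` columns (the exact column names and factor levels are assumptions you should check against the data dictionary):

```r
library(tidyverse)
library(lubridate)

sales_clean <- sales %>%
    mutate(
        # Strip suffixes like " (US)[51]" before parsing the date
        released = str_remove(released, "\\s*\\([A-Z]+\\)\\[\\d+\\]"),
        released = mdy(released),
        # Collapse spelling variants into one level (the levels here are assumptions)
        country = fct_collapse(as.factor(country), World = c("World", "WW")),
        # Convert raw sales counts to millions
        sales = sales / 1e6
    ) %>%
    filter(country %in% c("UK", "US", "World"))

sales_clean
```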
Part 1B
In this section, we will do some more data wrangling followed by summarization using wrangled data from Part 1A.
- Keep only album sales from the US.
- Create a new column called `years_since_release` corresponding to the number of years since the release of each album from Beyoncé and Taylor Swift. This should be a whole number, and you should round down to "14" if you get a non-whole number like "14.12" years. (Hint: you may find the `interval()` function from `lubridate` helpful here, but this is not the only way to do it.)
- Calculate the most recent, oldest, and the median years since albums were released for both Beyoncé and Taylor Swift.
# Add your solution here
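A hedged sketch using the `interval()` hint; `sales_clean` is the hypothetical wrangled tibble from Part 1A:

```r
library(tidyverse)
library(lubridate)

sales_us <- sales_clean %>%
    filter(country == "US") %>%
    # Integer division of an interval by a period rounds down to whole years
    mutate(years_since_release = interval(released, today()) %/% years(1))

sales_us %>%
    group_by(artist) %>%
    summarize(
        most_recent  = min(years_since_release),
        oldest       = max(years_since_release),
        median_years = median(years_since_release)
    )
```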
Part 1C
Using the wrangled data from Part 1A:
- Calculate the total album sales for each artist and for each `country` (only sales from the UK, US, and World).
  - Note: assume that the World sales do not include the UK and US ones.
- Using the total album sales, create a percent stacked barchart using `ggplot2` of the percentage of sales of studio albums (in millions) along the y-axis for the two artists along the x-axis, colored by the `country`.
# Add your solution here
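A sketch of the percent stacked barchart, again assuming the `sales_clean` tibble from Part 1A; the labels are placeholders:

```r
library(tidyverse)

sales_clean %>%
    group_by(artist, country) %>%
    summarize(total_sales = sum(sales), .groups = "drop") %>%
    ggplot(aes(x = artist, y = total_sales, fill = country)) +
    # position = "fill" rescales each bar to 100%, giving a percent stacked barchart
    geom_col(position = "fill") +
    scale_y_continuous(labels = scales::percent) +
    labs(
        title = "Album sales by market",
        subtitle = "Beyoncé vs. Taylor Swift",
        x = "Artist",
        y = "Percent of total album sales"
    )
```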
Part 1D
Using the wrangled data from Part 1A, use `ggplot2` to create a bar plot of the sales of studio albums (in millions) along the x-axis for each of the album titles along the y-axis.
Note:
- You only need to consider the global World sales (you can ignore US and UK sales for this part). Hint: how would you abbreviate WorldWide?
- The title of the album must be clearly readable along the y-axis.
- Each bar should be colored by which artist made that album.
- The bars should be ordered from albums with the most sales (top) to the least sales (bottom). (Note: you must use functions from `forcats` for this step.)
# Add your solution here
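One way to get the required ordering with `forcats` (column names are assumptions from Part 1A):

```r
library(tidyverse)

sales_clean %>%
    filter(country == "World") %>%
    # fct_reorder() sorts levels ascending by sales; ggplot2 draws the first
    # level at the bottom of the y-axis, so the largest bar lands on top
    mutate(title = fct_reorder(title, sales)) %>%
    ggplot(aes(x = sales, y = title, fill = artist)) +
    geom_col() +
    labs(
        title = "Worldwide studio album sales",
        subtitle = "Ordered from most to least sold",
        x = "Sales (in millions)",
        y = "Album title"
    )
```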
Part 1E
Using the wrangled data from Part 1A, use `ggplot2` to create a scatter plot of sales of studio albums (in millions) along the y-axis by the release date for each album along the x-axis.
Note:
- The points should be colored by the artist.
- There should be three scatter plots (one for UK, US and world sales) faceted by rows.
# Add your solution here
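A minimal sketch of the row-faceted scatter plot, assuming the `sales_clean` tibble from Part 1A:

```r
library(tidyverse)

sales_clean %>%
    ggplot(aes(x = released, y = sales, color = artist)) +
    geom_point() +
    # Facet by rows: one panel each for UK, US, and World
    facet_grid(country ~ .) +
    labs(
        title = "Album sales over time",
        subtitle = "By market and artist",
        x = "Release date",
        y = "Sales (in millions)"
    )
```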
Part 2: Exploring sentiment of lyrics
In Part 2, we will explore the lyrics in the `b_lyrics` and `ts_lyrics` datasets.
Part 2A
Using `ts_lyrics`, create a new column called `line` containing the character string for each line of Taylor Swift’s songs (i.e., one row per line of lyrics).
- How many lines in Taylor Swift’s lyrics contain the word “hello”? For full credit, show all the rows in `ts_lyrics` that have “hello” in the `line` column and report how many rows there are in total.
- How many lines in Taylor Swift’s lyrics contain the word “goodbye”? For full credit, show all the rows in `ts_lyrics` that have “goodbye” in the `line` column and report how many rows there are in total.
# Add your solution here
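A possible sketch, assuming `ts_lyrics` stores the full lyrics of each song in a `Lyrics` column (that column name is an assumption). Note that `unnest_tokens()` lowercases text by default, which also makes the `str_detect()` search case-insensitive:

```r
library(tidyverse)
library(tidytext)

# Split each song's lyrics into one row per line
ts_lines <- ts_lyrics %>%
    unnest_tokens(line, Lyrics, token = "lines")

hello_lines <- ts_lines %>%
    filter(str_detect(line, "hello"))
hello_lines
nrow(hello_lines)

goodbye_lines <- ts_lines %>%
    filter(str_detect(line, "goodbye"))
goodbye_lines
nrow(goodbye_lines)
```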
Part 2B
Repeat the same analysis for `b_lyrics` as described in Part 2A.
# Add your solution here
Part 2C
Using the `b_lyrics` dataset,
- Tokenize each lyrical line by words.
- Remove the “stopwords”.
- Calculate the total number of times each word appears in the lyrics.
- Using the “bing” sentiment lexicon, add a sentiment column to the summarized data frame.
- Sort the rows from most frequent to least frequent words.
- Only keep the top 25 most frequent words.
- Auto print the wrangled tibble data frame.
- Use `ggplot2` to create a bar plot with the top words on the y-axis and the frequency of each word on the x-axis. Color each bar by the sentiment of each word from the “bing” sentiment lexicon. Bars should be ordered from most frequent at the top to least frequent at the bottom of the plot.
- Create a word cloud of the top 25 most frequent words.
# Add your solution here
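A hedged sketch of the pipeline, assuming `b_lyrics` has one lyric line per row in a `line` column (an assumption worth checking):

```r
library(tidyverse)
library(tidytext)
library(wordcloud)

b_top25 <- b_lyrics %>%
    unnest_tokens(word, line) %>%           # tokenize each line into words
    anti_join(stop_words, by = "word") %>%  # remove the stopwords
    count(word, sort = TRUE) %>%            # word counts, most frequent first
    inner_join(get_sentiments("bing"), by = "word") %>%  # attach "bing" sentiment
    slice_head(n = 25)                      # keep the top 25 words

b_top25

b_top25 %>%
    # Ascending order puts the most frequent word at the top of the y-axis
    mutate(word = fct_reorder(word, n)) %>%
    ggplot(aes(x = n, y = word, fill = sentiment)) +
    geom_col() +
    labs(x = "Frequency", y = "Word")

wordcloud(words = b_top25$word, freq = b_top25$n)
```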
Part 2D
Repeat the same analysis as above in Part 2C, but for `ts_lyrics`.
# Add your solution here
Part 2E
Using the `ts_lyrics` dataset,
- Tokenize each lyrical line by words.
- Remove the “stopwords”.
- Calculate the total number of times each word appears in the lyrics of each Album.
- Using the “afinn” sentiment lexicon, add a sentiment value column to the summarized data frame.
- Calculate the average sentiment score for each Album.
- Auto print the wrangled tibble data frame.
- Join the wrangled data frame from Part 1A (album sales in millions) filtered down to US sales with the wrangled data frame from the previous steps above (average sentiment score for each album).
- Using `ggplot2`, create a scatter plot of the average sentiment score for each album (y-axis) and the album release date along the x-axis. Make the size of each point the album sales in millions.
- Add a horizontal line at y-intercept = 0.
- Write 2-3 sentences interpreting the plot, answering the question “How has the sentiment of Taylor Swift’s albums changed over time?” Add a title, subtitle, and useful axis labels.
# Add your solution here
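One possible sketch, assuming `ts_lyrics` has `Album` and `Lyrics` columns, that album titles match the `title` column of the hypothetical `sales_clean` tibble from Part 1A, and that the per-album average weights each word's "afinn" value by its count (all of these are assumptions):

```r
library(tidyverse)
library(tidytext)

ts_album_sentiment <- ts_lyrics %>%
    unnest_tokens(word, Lyrics) %>%
    anti_join(stop_words, by = "word") %>%
    count(Album, word) %>%
    inner_join(get_sentiments("afinn"), by = "word") %>%
    group_by(Album) %>%
    # Average sentiment per album, weighting each word by how often it appears
    summarize(mean_sentiment = weighted.mean(value, n))

ts_album_sentiment

ts_album_sentiment %>%
    inner_join(
        filter(sales_clean, country == "US"),
        by = c("Album" = "title")
    ) %>%
    ggplot(aes(x = released, y = mean_sentiment, size = sales)) +
    geom_point() +
    geom_hline(yintercept = 0) +
    labs(
        title = "Sentiment of Taylor Swift's albums over time",
        subtitle = "Point size reflects US album sales (in millions)",
        x = "Release date",
        y = "Average \"afinn\" sentiment score"
    )
```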
R session information
options(width = 120)
sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
setting value
version R version 4.4.1 (2024-06-14)
os macOS Sonoma 14.5
system aarch64, darwin20
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/New_York
date 2024-08-20
pandoc 3.2 @ /opt/homebrew/bin/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
cli 3.6.3 2024-06-21 [1] CRAN (R 4.4.0)
colorout * 1.3-0.2 2024-05-03 [1] Github (jalvesaq/colorout@c6113a2)
digest 0.6.36 2024-06-23 [1] CRAN (R 4.4.0)
evaluate 0.24.0 2024-06-10 [1] CRAN (R 4.4.0)
fastmap 1.2.0 2024-05-15 [1] CRAN (R 4.4.0)
here * 1.0.1 2020-12-13 [1] CRAN (R 4.4.0)
htmltools 0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)
htmlwidgets 1.6.4 2023-12-06 [1] CRAN (R 4.4.0)
jsonlite 1.8.8 2023-12-04 [1] CRAN (R 4.4.0)
knitr 1.48 2024-07-07 [1] CRAN (R 4.4.0)
rlang 1.1.4 2024-06-04 [1] CRAN (R 4.4.0)
rmarkdown 2.27 2024-05-17 [1] CRAN (R 4.4.0)
rprojroot 2.0.4 2023-11-05 [1] CRAN (R 4.4.0)
rstudioapi 0.16.0 2024-03-24 [1] CRAN (R 4.4.0)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.4.0)
xfun 0.46 2024-07-18 [1] CRAN (R 4.4.0)
yaml 2.3.10 2024-07-26 [1] CRAN (R 4.4.0)
[1] /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────