genius tutorial
knitr::opts_chunk$set(eval = FALSE)
Introducing genius
You want to start analysing song lyrics, where do you go? There have been music information retrieval papers written on the topic of programmatically extracting lyrics from the web. Dozens of people have gone through the laborious task of scraping song lyrics from websites. Even a recent winner of the Shiny competition scraped lyrics from Genius.com.
I too have been there. Scraping websites is not always the best use of your time. genius
is an R package that will enable you to programatically download song lyrics in a tidy format ready for analysis. To begin using the package, it first must be installed, and loaded. In addition to genius
, we will need our standard data manipulation tools from the tidyverse
.
install.packages("genius")
Single song lyrics
The simplest method of extracting song lyrics is to get just a single song at a time. This is done with the genius_lyrics()
function. It takes two main arguments: artist
and song
. These are the quoted name of the artist and song. Additionally there is a third argument info
which determines what extra metadata you can get. The possible values are title
, simple
, artist
, features
, and all
. I recommend trying them all to see how they work.
In this example we will work to retrieve the song lyrics for the upcoming musician Renny Conti.
floating <- genius_lyrics("renny conti", "people floating")
floating
Album Lyrics
Now that you have the intuition for obtaining lyrics for a single song, we can now create a larger dataset for the lyrics of an entire album using genius_album()
. Similar to genius_lyrics()
, the arguments are artist
, album
, and info
.
In the exercise below the lyrics for Snail Mail’s album Lush. Try retrieving the lyrics for an album of your own choosing.
lush <- genius_album("Snail Mail", "Lush")
lush
Adding Lyrics to a data frame
Multiple songs
A common use for lyric analysis is to compare the lyrics of one artist to another. In order to do that, you could potentially retrieve the lyrics for multiple songs and albums and then join them together. This has one major issue in my mind, it makes you create multiple object taking up precious memory. For this reason, the function add_genius()
was developed. This enables you to create a tibble with a column for an artists name and their album or song title. add_genius()
will then go through the entire tibble and add song lyrics for the tracks and albums that are available.
Let’s try this with a tibble of three songs.
Multiple albums
add_genius()
also extends this functionality to albums.
albums <- tribble(
~ artist, ~ title,
"Andrew Bird", "Armchair Apocrypha",
"Andrew Bird", "Things are really great here sort of"
)
album_lyrics <- albums %>%
add_genius(artist, title, type = "album")
album_lyrics
What is important to note here is that the warnings for this function are somewhat informative. When a 404 error occurs, this may be because that the song does not exist in Genius. Or, that the song is actually an instrumental which is the case here with Andrew Bird.
Albums and Songs
In the scenario that you want to mix single songs and lyrics, you can supply a column with the type value of each row. The example below illustrates this. First a tibble with artist, track or album title, and type columns are created. Next, the tibble is piped to add_genius()
with the unquote column names for the artist, title, and type columns. This will then iterate over each row and fetch the appropriate song lyrics.
Self-similarity
Another feature of genius
is the ability to create self-similarity matrices to visualize lyrical patterns within a song. This idea was taken from Colin Morris’ wonderful javascript based Song Sim project. Colin explains the interpretation of a self-similarity matrix in their TEDx talk. An even better description of the interpretation is available in this post.
To use Colin’s example we will look at the structure of Ke$ha’s Tik Tok.
The function calc_self_sim()
will create a self-similarity matrix of a given song. The main arguments for this function are the tibble (df
), and the column containing the lyrics (lyric_col
). Ideally this is one line per observation as is default from the output of genius_*()
. The tidy output compares every ith word with every word in the song. This measures repetition of words and will show us the structure of the lyrics.
tik_tok <- genius_lyrics("Ke$ha", "Tik Tok")
tt_self_sim <- calc_self_sim(tik_tok, lyric, output = "tidy")
tt_self_sim
tt_self_sim %>%
ggplot(aes(x = x_id, y = y_id, fill = identical)) +
geom_tile() +
scale_fill_manual(values = c("white", "black")) +
theme_minimal() +
theme(legend.position = "none",
axis.text = element_blank()) +
scale_y_continuous(trans = "reverse") +
labs(title = "Tik Tok", subtitle = "Self-similarity matrix", x = "", y = "",
caption = "The matrix displays that there are three choruses with a bridge between the last two. The bridge displays internal repetition.")