So being the hip millennial that I am, Spotify has always endeared itself to me with its commitment to data-based analytics. Their “Based on your Tastes In…” type recommendations are often pretty great, their “Labs” blog comes through with quality #DataScience, and their API provides a ton of useful track data if you’re, say, trying to find a final project topic for your “Statistical Learning” class. Speaking of which…
A Topic for my “Statistical Learning” Course
Whether I’m coding, biking, cooking, or resignedly laboring through p-sets I am basically always also listening to music. Typically I listen to music by album; to me it’s the most intuitive way to remember and group the artists, genres, or “vibes” that I want (or at least what I’m habituated to). Furthermore albums have the advantage of being 1+ hours long, so if I want to put on music but not worry about DJ-ing every 3-4 minutes they are ideal. However this isn’t perfect as many artists vary up the track type over the course of an album, so of that 1+ hours I usually wind up listening to only 20-30 minutes max. Long story short I’ve recently gotten interested in making my own playlists. It seems fun to choose a loose theme that ties together songs from different artists or even different genres, and I think that it would be convenient if I could select 50 tracks tailored to a few specific moods or associations.
Back when one still burned CDs I used to make mix tapes, although they were really just collections of sweet, classic rock bangers arranged without rhyme, reason, or structure other than “this solo is sick dude”. No longer being an unsophisticated, teenaged scrub I wanted to take a more mindful approach to playlist construction, at least with regard to choosing what goes into each playlist in the first place. So I set down some loose rules. First, I wanted any given artist to only be featured once in my playlists. The idea here was that I didn’t want a playlist to become a “Best Of” album for one or two specific artists who fit a narrow genre. Second, I wanted each playlist to be sort of “intersubjectively verifiable”. That is to say, I wanted someone other than myself to be able to listen to the tracks I chose, maybe read a one or two sentence description, and then be able to decide which playlist some out-of-sample track belongs in.
It turns out that this wasn’t straightforward. I tend to add my first 10 tracks all at once, and then slowly chuck new tracks in on the fly as they come up during the course of my usual listening. This meant that my themes would start to drift over time, and eventually I’d have to go through and purge a bunch of songs all at once to get back to the original idea. Being a savvy mathologist (and needing a project topic) I tried to hit this problem with some Statistical Learning.
So before I launch into the analysis side of things, I wanted to talk about the playlists quickly. I put together two 21-track playlists which I titled BL33P C O R E (B3C) and chewy beats (CB). The former is sort of intended to be something between IDM and Ambient without being as boring as either. Also a track can qualify by featuring a ton of bleeps. The latter playlist is supposed to be comprised of (loosely) “beats that get stuck in your teeth”, so bass-heavy with a prominent and hooky beat/rhythm. You can listen to the two playlists at the bottom of this section and decide for yourself if they should be grouped together or not. The prototypical songs for each are “A Paw in My Face” by The Field and “Buggin’ Out” by A Tribe Called Quest.
BL33P C O R E
So for (almost) any track in its library Spotify provides measurements of 11 different “audio features”. You can read about the specifics here, but these are basically things like:
- Danceability: a number from 0 to 1 with 1 being the most “danceable”
- Speechiness: a number from 0 to 1 indicating how “vocal” a track is. Rap, for example, typically falls between .33 and .66 on this scale, whereas a podcast or something is near 1.
- Acousticness: a measure of how confident Spotify is that the track is acoustic, between 0 and 1
My goal was to design an algorithm (called a “classifier”) that, for any given track, will accept these 11 features as inputs and return a score between 0 and 1 based on how confident it is that the track should be in B3C (indicated by scores near 0) or CB (indicated by scores near 1). By rounding these scores to 0 or 1 we are effectively “classifying” a single track as belonging to one playlist or the other. The simple place to make the rounding cutoff is at .5 (so if the score is >.5 we set it to 1, and set it to 0 if it’s <.5), but we could set the cutoffs anywhere. For example, if we wanted to only make a classification when we were confident then we could assign scores >.9 to 1 and scores <.1 to 0, and refuse to classify anything else. This type of algorithm falls into the field of “Machine Learning” or “Statistical Learning” depending on whether you’re a brainless pleb or a sophisticated intellectual (respectively). The general idea is to use techniques from (convex) optimization and statistics to find algorithmic representations of reference patterns (a so-called training set), and then make decisions based on how well new observations match them. Support vector machines and neural nets (or more broadly, deep learning) are other approaches that fall under this umbrella.
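To make the rounding step concrete, here is a minimal Python sketch of the two cutoff schemes described above. The function name `classify` and the example scores are made up for illustration; this is not the code from the project.

```python
def classify(score, lower=0.5, upper=0.5):
    """Turn a score in [0, 1] into a playlist label, or None if undecided.

    Scores above `upper` are called CB, scores below `lower` are called B3C,
    and anything in between (or exactly on a cutoff) is left unclassified.
    """
    if score > upper:
        return "CB"
    if score < lower:
        return "B3C"
    return None


# Hypothetical classifier scores for four tracks.
scores = [0.03, 0.45, 0.62, 0.97]

# Simple rounding at .5: every track gets a label.
simple = [classify(s) for s in scores]

# Cautious version: only label a track if the score is >.9 or <.1.
cautious = [classify(s, lower=0.1, upper=0.9) for s in scores]

print(simple)    # ['B3C', 'B3C', 'CB', 'CB']
print(cautious)  # ['B3C', None, None, 'CB']
```

With the cautious cutoffs the two middling tracks are simply left out, which is the “refuse to classify anything else” behavior.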
My goal in using this classifier was to validate the themes behind each playlist. Basically I’m operating under the belief that if the design choices for my playlists are “legitimate” (ie. definable in some way that different people could unambiguously agree on them) then there should be some patterns in the 11 Spotify audio features for each playlist. My classifier should be able to pick up on these patterns, and recommend me similar tracks from outside of these two playlists that match the pattern and (ideally) fit the theme of the playlist. If the recommendations are good, then the theme is probably clear and “true”, but if they’re bad then the themes are weak and so am I. Put another way I’m trying to leverage the rule “Garbage In, Garbage Out” (GIGO). If my classifier churns out garbage, then I probably fed it garbage in the form of my shitty playlists. If it churns out not-garbage, however, then at least I know that I have a sweet career as a DJ to fall back on if this academia thing falls through.
A few quick caveats here, to hedge a little against the inevitable stink of my own failure (recall that I know how this whole project turns out). This line of reasoning assumes that the playlist themes are detectable in the Spotify audio features, but that doesn’t have to be true. For example, I could make a playlist with the selection criteria “Songs During Which the Singer’s Vocal Range Exceeds one Octave”. This isn’t really something that the audio features are designed to measure, so my classifier probably wouldn’t be able to find me songs that belong in this playlist. It’s also possible that I already put every song that belongs in the playlist into it. If I make a playlist that’s just “The Most Depressing Radiohead Songs” then my classifier will have nothing left to recommend me, so it might just start returning garbage depending on how discriminating I’ve told it to be.
I’ll talk about the specifics of this algorithm from a non-technical starting point after I go over the results, but here’s a quick roll-up for people who like to read the last page of a book first (which assumes a technical background, so feel free to skip to the next section). My classifier is a multivariable logistic regression fit with a LASSO penalty and a predictive decision boundary of 0.5. This penalty automatically performs variable selection, while the binomial model fit is simple to interpret (as compared to something like an SVM or k-means classifier), which was important for calibration of the model and dissecting my results. Furthermore this type of classifier is easily ported over to a Bayesian framework. You can swap the LASSO penalty for a Laplace prior (or something else, I don’t think the MLE/MAP equivalence holds because we’re not in Gaussian Kansas anymore, so the Laplace isn’t quite as special) and then you get a whole posterior over the parameters, which is useful if you want to incorporate prediction uncertainty into your prediction decision in a straightforward manner (although I didn’t do that for this project).
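For the curious, here is a rough, dependency-free Python sketch of what “logistic regression with a LASSO penalty” means in practice. It is not the R code I actually ran. It assumes a plain proximal-gradient fit (a gradient step on the logistic loss followed by soft-thresholding), and the toy data, function names, step size, and penalty values are all invented for illustration.

```python
import math


def sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def soft_threshold(w, t):
    """Shrink w toward zero by t, snapping to exactly 0 inside [-t, t]."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def fit_lasso_logistic(X, y, penalty, step=0.5, iters=2000):
    """Fit an L1-penalized logistic regression by proximal gradient descent.

    X is a list of feature rows, y a list of 0/1 labels. The intercept is
    left unpenalized. Returns (intercept, weights).
    """
    n, p = len(X), len(X[0])
    intercept, weights = 0.0, [0.0] * p
    for _ in range(iters):
        grad_b, grad_w = 0.0, [0.0] * p
        for row, label in zip(X, y):
            s = intercept + sum(w * x for w, x in zip(weights, row))
            err = sigmoid(s) - label
            grad_b += err / n
            for j in range(p):
                grad_w[j] += err * row[j] / n
        intercept -= step * grad_b
        # Gradient step, then the soft-threshold that produces exact zeros.
        weights = [soft_threshold(w - step * g, step * penalty)
                   for w, g in zip(weights, grad_w)]
    return intercept, weights


# Toy data: the first feature separates the labels, the second does not.
X = [[0.1, 0.5], [0.2, 0.4], [0.8, 0.5], [0.9, 0.4]]
y = [0, 0, 1, 1]

_, light = fit_lasso_logistic(X, y, penalty=0.01)
_, heavy = fit_lasso_logistic(X, y, penalty=10.0)
print(light)  # the first weight comes out positive
print(heavy)  # [0.0, 0.0]: a heavy penalty zeroes every weight
```

The soft-threshold step is where the variable selection comes from: any weight whose gradient can’t beat the penalty gets pinned at exactly zero.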
So basically I trained my classifier on the two playlists and then applied it to every song in my Spotify library. I put the Top 10 most confident predictions (as measured by how close their scores were to 0 or 1) into two playlists (in descending order of confidence, so the first track is the most confident pick, etc.). Check ’em out below.
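As a sketch of that selection step (again in Python, with made-up track names and scores, and a hypothetical helper called `top_confident`): scores nearest 0 are the most confident B3C picks and scores nearest 1 are the most confident CB picks.

```python
def top_confident(scores, n=10):
    """Return the n most confident picks for each playlist.

    `scores` maps track name -> classifier score in [0, 1].
    B3C picks are ordered from the score nearest 0 upward, CB picks from
    the score nearest 1 downward, so the first track is the surest pick.
    """
    b3c = sorted((t for t, s in scores.items() if s < 0.5),
                 key=lambda t: scores[t])[:n]
    cb = sorted((t for t, s in scores.items() if s > 0.5),
                key=lambda t: scores[t], reverse=True)[:n]
    return b3c, cb


# Hypothetical library scores.
library_scores = {
    "track_a": 0.02, "track_b": 0.91, "track_c": 0.48,
    "track_d": 0.99, "track_e": 0.10, "track_f": 0.55,
}

b3c_picks, cb_picks = top_confident(library_scores, n=2)
print(b3c_picks)  # ['track_a', 'track_e']
print(cb_picks)   # ['track_d', 'track_b']
```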
BL33P C O R E Recommendations:
chewy beats Recommendations:
How’d We Do?
So, in my opinion, neither of these recommended playlists fits very well with the original, although I don’t think it’s entirely my fault (see the caveats above). The chewy beats recommendations are IMO generally better than the BL33P C O R E ones, so that’s a little comforting. Now let’s dig into how the classifier chose to assign the track scores so we can understand why the recommendations were off, and why it performed better for one playlist than the other.
Basically the way logistic regression works is by taking the numerical value from each feature, multiplying each value by a “feature weight” (which can be a positive or negative number) and summing the resulting products. To make it more concrete, say that a track had a “Danceability” score of .2, a “Speechiness” of .5, and a “Valence” of .1 (so basically something by Tom Waits). For this hypothetical track we compute the sum S = .2*A + .5*B + .1*C, where A, B, and C are the feature weights assigned to each feature. We then feed the value S into a special function (the logistic function) whose output is close to 1 if S is large in the positive direction (like 1000) and close to 0 if S is large in the negative direction (like -1000). Recall that a score near 0 means confidently B3C and a score near 1 means confidently CB.
Since large feature weights drive up the value of S we can use them to understand how the classifier “thought” about the problem and what features it looked at to make its decision. The magnitude of a weight tells us how useful the corresponding feature was in making the decision between the two playlists, while the sign (whether it’s positive or negative) tells us which playlist that feature was indicative of. Examples:
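Here is the Tom Waits example worked through in a few lines of Python. The feature values are the ones from the paragraph above; the weights A, B, and C are numbers I picked purely for illustration (they are not the fitted weights), and I’ve left out the intercept term a real fit would usually include. Note that the logistic function’s output always lands between 0 and 1.

```python
import math


def logistic(s):
    """Logistic function 1 / (1 + e^-s), written to avoid overflow."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


# Feature values from the hypothetical track in the text.
danceability, speechiness, valence = 0.2, 0.5, 0.1

# Made-up feature weights, for illustration only.
A, B, C = 2.0, 4.0, -1.0

S = danceability * A + speechiness * B + valence * C
score = logistic(S)

print(S)                # about 2.3
print(round(score, 3))  # 0.909, so this track leans CB
print(logistic(1000))   # 1.0
print(logistic(-1000))  # 0.0
```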
- If a weight is 0 then that feature was not useful for distinguishing between the playlists.
- If A = 20, for example, then this means that a track with a high “Danceability” is very probably a chewy beat.
- If C = -2 then this means that a track with a high “Valence” is somewhat probably BL33P C O R E.
If we look at the actual weights the classifier ended up using (which it calculated using the training data), we see that the two most important features for CB were “Danceability” with a feature weight of 6.39 and “Speechiness” with a feature weight of 8.64. This isn’t wildly surprising, as looking through the playlist it’s largely composed of hip hop, R&B, and rap, genres which are often danceable and lyrically oriented. On the other hand the B3C feature weights were at most about half as large, the two biggest being “Acousticness” at -1.55 and “Liveness” at -3.30. Glossing over some of the subtlety around variable scale, this set of weights indicates (to me) that tracks could pretty easily be identified as chewy beats based on their features, but that often the decision to assign a track to BL33P C O R E was made just because the track wasn’t clearly in CB (this argument can be made more rigorous by cranking up the LASSO penalty and observing which variables drop out of the classifier; the first to go were always the B3C features).
This goes a good way towards explaining the prediction discrepancy between the two playlists. We basically have that B3C wasn’t very identifiable in terms of the provided feature data, so that’s why the suggested playlist is such a hodgepodge. We can see how the recommendations reflect the features the classifier thought were important (there’s a healthy dose of live-sounding jazz, for example). The selections for CB, on the other hand, seem at least slightly more consistent with each other and with the original playlist (although that Julianna Barwick pick, for example, was probably chosen solely because it was “Speechy”, which suggests to me that CB wasn’t perfectly defined in the feature data either).
One track that had a really interesting effect on the overall model behavior was “Sunspell” by Geotic. This was a track that I had personally classed as B3C because of its higher-pitch, soft synth tone and spacey vibe; however, it’s “clearly” a chewy beat (at least according to my girlfriend and my binomial classifier). I ended up leaving it in the mix because I wanted to make sure the list included some tracks with a light groove, but this had a pretty big impact on the classifications. When it was included in the playlist my within-sample scores were pretty evenly spread between 0 and 1, ie. the classifier wasn’t really too confident about anything (it’s not overfit). Taking it out, however, sent everything to basically a perfect 1 or 0; suddenly it became very confident about what was BL33P C O R E and what was a chewy beat. Leaving it in was a modelling choice I made to keep the predictions for the playlist from getting too similar to what I had chosen for the original list. Variety is the spice of life.
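To see how those four weights pull a score around, here is a small Python sketch that breaks the sum S into per-feature contributions. The weights are the four reported above; the track’s feature values are invented, and the remaining features and any intercept are left out because I haven’t listed them here, so this is only a partial sum and not a real score.

```python
# The four feature weights reported above (positive leans CB, negative B3C).
reported_weights = {
    "danceability": 6.39,
    "speechiness": 8.64,
    "acousticness": -1.55,
    "liveness": -3.30,
}

# Invented feature values for a hypothetical track.
hypothetical_track = {
    "danceability": 0.8,
    "speechiness": 0.3,
    "acousticness": 0.1,
    "liveness": 0.2,
}


def contributions(weights, features):
    """Per-feature share of the weighted sum: weight times feature value."""
    return {name: weights[name] * features[name] for name in weights}


parts = contributions(reported_weights, hypothetical_track)

# Features ordered by how hard they push, in either direction.
ranked = sorted(parts, key=lambda name: abs(parts[name]), reverse=True)

# Partial sum over just these four features.
partial_sum = sum(parts.values())

for name in ranked:
    print(f"{name:>13}: {parts[name]:+.3f}")
print(f"partial sum: {partial_sum:.3f}")
```

For this made-up track the CB-leaning features swamp the B3C-leaning ones, which matches the story above: the classifier has a much easier time saying “chewy beat” than saying “BL33P C O R E”.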
The Lessons are Learned but the Damage is Irreversible
Basically classification is hard and so is playlist construction. Next up I’m hoping to do a pedagogical writeup of logistic regression from the “machine learning standpoint” (sort of the approach that I used here, employing LASSO penalties and stuff) as well as from the Bayesian perspective. I’d also like to do a rundown of how to use the Spotify API with ‘httr’ to pay forward all the copy+paste script kiddie-ing that I’ve done in this project.
Let me know what you think of my analysis and playlists! I’ll try to post the cleaned track data on GitHub or something along with my analysis script in the next few days as well.
Updates: the code I used is available here, which also includes the .Rdata file if you want to load up my dataframes without running the whole thing. I apologize if it’s an unreadable mess, but I refuse to improve.