Constructing Spotify Playlists Using Statistical Learning

So, being the hip millennial that I am, Spotify has always endeared itself to me with its commitment to data-based analytics. Their “Based on your Tastes In…” type recommendations are often pretty great, their “Labs” blog comes through with quality #DataScience, and their API provides a ton of useful track data if you’re, say, trying to find a final project topic for your “Statistical Learning” class. Speaking of which…

A Topic for my “Statistical Learning Course”

Whether I’m coding, biking, cooking, or resignedly laboring through p-sets I am basically always also listening to music.  Typically I listen to music by album; to me it’s the most intuitive way to remember and group the artists, genres, or “vibes” that I want (or at least what I’m habituated to).  Furthermore albums have the advantage of being 1+ hours long, so if I want to put on music but not worry about DJ-ing every 3-4 minutes they are ideal.  However this isn’t perfect as many artists vary up the track type over the course of an album, so of that 1+ hours I usually wind up listening to only 20-30 minutes max.  Long story short I’ve recently gotten interested in making my own playlists.  It seems fun to choose a loose theme that ties together songs from different artists or even different genres, and I think that it would be convenient if I could select 50 tracks tailored to a few specific moods or associations.

Back when one still burned CDs I used to make mix tapes, although they were really just collections of sweet, classic rock bangers arranged without rhyme, reason, or structure other than “this solo is sick dude”.  No longer being an unsophisticated, teenaged scrub I wanted to take a more mindful approach to playlist construction, at least with regards to choosing what goes in to each playlist in the first place. So I set down some loose rules. First, I wanted any given artist to only be featured once in my playlists. The idea here was that I didn’t want a playlist to become a “Best Of” album for one or two specific artists who fit a narrow genre. Second, I wanted each playlist to be sort of “intersubjectively verifiable”.  That is to say, I wanted someone other than myself to be able to listen to the tracks I chose, maybe read a one or two sentence description, and then be able to decide which playlist some out-of-sample track belongs in.

It turns out that this wasn’t straightforward. I tend to add my first 10 tracks all at once, and then slowly chuck new tracks in on the fly as they come up during the course of my usual listening. This meant that my themes would start to drift over time, and eventually I’d have to go through and purge a bunch of songs all at once to get back to the original idea. Being a savvy mathologist (and needing a project topic) I tried to hit this problem with some Statistical Learning.

The Playlists

So before I launch into the analysis side of things, I wanted to talk about the playlists quickly. I put together two 21-track playlists, which I titled BL33P C O R E (B3C) and chewy beats (CB). The former is sort of intended to be something between IDM and Ambient without being as boring as either. Also a track can qualify by featuring a ton of bleeps. The latter playlist is supposed to be comprised of (loosely) “beats that get stuck in your teeth”, so bass-heavy with a prominent and hooky beat/rhythm. You can listen to the two playlists at the bottom of this section and decide for yourself if they should be grouped together or not. The prototypical songs for each are “A Paw in My Face” by The Field and “Buggin’ Out” by A Tribe Called Quest.


chewy beats

The Analysis

So for (almost) any track in its library Spotify provides measurements of 11 different “audio features”.  You can read about the specifics here, but these are basically things like:

  • Danceability: a number from 0 to 1, with 1 being the most “danceable”
  • Speechiness: a number from 0 to 1 indicating how “vocal” a track is.  Rap, for example, typically falls between .33 and .66 on this scale, whereas a podcast or something is near 1.
  • Acousticness: a measure of how confident Spotify is that the track is acoustic, between 0 and 1

My goal was to design an algorithm (called a “classifier”) that for any given track accepts these 11 features as inputs and returns a score between 0 and 1 based on how confident it is that the track should be in B3C (indicated by scores near 0) or CB (indicated by scores near 1). By rounding these scores to 0 or 1 we are effectively “classifying” a single track as belonging to one playlist or the other. The simple place to make the rounding cutoff is at .5 (so if the score is >.5 we set it to 1, and set it to 0 if it’s <.5), but we could set the cutoffs anywhere. For example, if we wanted to only make a classification when we were confident, then we could round scores >.9 to 1 and scores <.1 to 0, and refuse to classify anything else. This type of algorithm falls into the field of “Machine Learning” or “Statistical Learning” depending on whether you’re a brainless pleb or a sophisticated intellectual (respectively). The general idea is to use techniques from (convex) optimization and statistics to find algorithmic representations of reference patterns (a so-called training set), and then make decisions based on how well new observations match those patterns. Support vector machines and neural nets (or more broadly, deep learning) are other approaches that fall under this umbrella.
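To make the rounding step concrete, here is a minimal Python sketch of it. The function name `classify` and its `lower`/`upper` parameters are made up for illustration (they are not from my analysis script), and what to do with a score of exactly .5 is my own arbitrary choice here: it goes unclassified.

```python
def classify(score, lower=0.5, upper=0.5):
    """Round a score in [0, 1] to a playlist label, or abstain.

    Returns 0 (B3C) if score < lower, 1 (CB) if score > upper,
    and None (refuse to classify) for anything in between.
    The names and the tie handling are illustrative assumptions.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score > upper:
        return 1
    if score < lower:
        return 0
    return None


print(classify(0.7))            # simple .5 cutoff: prints 1
print(classify(0.7, 0.1, 0.9))  # cautious cutoffs: prints None
```

With the default cutoffs this is plain rounding; passing `lower=0.1, upper=0.9` gives the "only classify if confident" version described above.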

My goal in using this classifier was to validate the themes behind each playlist.  Basically I’m operating under the belief that if the design choices for my playlists are “legitimate” (ie. definable in some way that different people could unambiguously agree on them) then there should be some patterns in the 11 Spotify audio features for each playlist.  My classifier should be able to pick up on these patterns, and recommend me similar tracks from outside of these two playlists that match the pattern and (ideally) fit the theme of the playlist.  If the recommendations are good, then the theme is probably clear and “true”, but if they’re bad then the themes are weak and so am I.  Put another way I’m trying to leverage the rule “Garbage In, Garbage Out” (GIGO).  If my classifier churns out garbage, then I probably fed it garbage in the form of my shitty playlists.  If it churns out not-garbage, however, then at least I know that I have a sweet career as a DJ to fall back on if this academia thing falls through.

A few quick caveats here, to hedge a little against the inevitable stink of my own failure (recall that I know how this whole project turns out). This line of reasoning assumes that the playlist themes are detectable in the Spotify audio features, but that doesn’t have to be true. For example, I could make a playlist with the selection criteria “Songs During Which the Singer’s Vocal Range Exceeds one Octave”. This isn’t really something that the audio features are designed to measure, so my classifier probably wouldn’t be able to find me songs that belong in this playlist. It’s also possible that I already took all the songs that belong in the playlist and put them in it. If I make a playlist that’s just “The Most Depressing Radiohead Songs” then my classifier will have nothing left to recommend me, so it might just start returning garbage depending on how discriminating I’ve told it to be.

I’ll talk about the specifics of this algorithm from a non-technical starting point after I go over the results, but here’s a quick roll-up for people who like to read the last page of a book first (it assumes a technical background, so feel free to skip to the next section). My classifier is a multivariable logistic regression fit with a LASSO penalty and a predictive decision boundary of 0.5. This penalty automatically performs variable selection, while the binomial model fit is simple to interpret (as compared to something like an SVM or k-means classifier), which was important for calibrating the model and dissecting my results. Furthermore this type of classifier is easily ported over to a Bayesian framework. You can swap the LASSO penalty for a Laplace prior on the weights (the LASSO fit is the MAP estimate under that prior, and this holds even though we’re not in Gaussian Kansas anymore, since the penalized log-likelihood and the log-posterior differ only by a constant for any likelihood) and then you get a whole posterior over the parameters, which is useful if you want to incorporate prediction uncertainty into your prediction decision in a straightforward manner (although I didn’t do that for this project).
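For the technically inclined, here is a rough, standard-library-only Python sketch of what an L1-penalized (LASSO) logistic regression fit looks like. To be clear, this is not the script I actually ran. The optimizer is plain proximal gradient descent, which I'm swapping in because it is short to write down, and the data, function names, and tuning values (`lam`, `step`, `iters`) are all made up for illustration.

```python
import math


def sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def soft_threshold(w, t):
    """Proximal step for the L1 penalty: shrink w toward 0 by t."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def fit_lasso_logistic(X, y, lam=0.05, step=0.5, iters=2000):
    """Minimize mean logistic loss + lam * sum(|w_j|) (intercept unpenalized)."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(iters):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err / n
            for j in range(p):
                grad_w[j] += err * xi[j] / n
        b -= step * grad_b
        # Gradient step on the smooth loss, then shrink toward zero.
        w = [soft_threshold(wj - step * gj, step * lam)
             for wj, gj in zip(w, grad_w)]
    return w, b


# Toy data: feature 0 separates the classes, feature 1 is pure noise.
X = [[0.10, 0.5], [0.20, 0.4], [0.15, 0.6],
     [0.80, 0.5], [0.90, 0.4], [0.85, 0.6]]
y = [0, 0, 0, 1, 1, 1]
weights, intercept = fit_lasso_logistic(X, y)
print(weights, intercept)
```

The soft-thresholding step is where the "automatic variable selection" comes from: a feature whose gradient never beats the penalty gets its weight pinned at zero, which is what should happen to the noise feature in the toy data.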

The Results

So basically I trained my classifier on the two playlists and then applied it to every song in my Spotify library.  I put the Top 10 most confident predictions (as measured by how close their scores were to 0 or 1) into two playlists (in descending order of confidence, so the first track is the most confident pick, etc.).  Check ’em out below.
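The "most confident predictions" step can be sketched in a few lines of Python. The track names and scores below are invented placeholders, and `top_confident` is a hypothetical helper, not something from my actual script.

```python
def top_confident(scores, n=10, cutoff=0.5):
    """Split tracks by the cutoff and return the n most confident per side.

    scores maps a track name to its classifier score in [0, 1].
    B3C picks are ordered from lowest score up (closest to 0 first);
    CB picks are ordered from highest score down (closest to 1 first).
    """
    b3c = sorted((t for t, s in scores.items() if s < cutoff),
                 key=scores.get)[:n]
    cb = sorted((t for t, s in scores.items() if s > cutoff),
                key=scores.get, reverse=True)[:n]
    return b3c, cb


# Invented scores for illustration only.
library_scores = {"track_a": 0.97, "track_b": 0.03, "track_c": 0.55,
                  "track_d": 0.20, "track_e": 0.81}
b3c_recs, cb_recs = top_confident(library_scores, n=2)
print(b3c_recs)  # ['track_b', 'track_d']
print(cb_recs)   # ['track_a', 'track_e']
```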

BL33P C O R E  Recommendations:

chewy beats Recommendations:

How’d We Do?

So, in my opinion, neither of these recommended playlists fits very well with the original, although I don’t think it’s entirely my fault (see the caveats above). The chewy beats recommendations are, IMO, generally better than the BL33P C O R E ones, so that’s a little comforting. Now let’s dig into how the classifier chose to assign the track scores so we can understand why the recommendations were off, and why it performed better for one playlist than the other.

Basically the way logistic regression works is by taking the numerical value of each feature, multiplying each value by a “feature weight” (which can be a positive or negative number), and summing the resulting products. To make it more concrete, say that a track had a “Danceability” score of .2, a “Speechiness” of .5, and a “Valence” of .1 (so basically something by Tom Waits). For this hypothetical track we compute the sum S = .2*A + .5*B + .1*C, where A, B, and C are the feature weights assigned to each feature. We then feed the value S into a special function (the logistic function) whose output is close to 1 if S is large in the positive direction (like 1000) and close to 0 if S is large in the negative direction (like -1000). Recall that a score near 0 means confidently B3C and a score near 1 means confidently CB.

Since weights that are large in magnitude have the biggest effect on the value of S, we can use them to understand how the classifier “thought” about the problem and what features it looked at to make its decision. The magnitude of a weight tells us how useful the corresponding feature was in making the decision between the two playlists, while the sign (whether it’s positive or negative) tells us which playlist that feature was indicative of. Examples:

  • If a weight is 0 then that feature was not useful for distinguishing between the playlists.
  • If A = 20, for example, then this means that a track with a high “Danceability” is very probably a chewy beat.
  • If C = -2 then this means that a track with a high “Valence” is somewhat probably BL33P C O R E.
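Here is the Tom Waits example worked through in Python. The weights A = 20 and C = -2 are the illustrative values from the examples above (not the fitted ones), B = 0 is another made-up value standing in for "a useless feature", and I'm leaving out any intercept term to match the simplified formula for S.

```python
import math


def logistic(s):
    """Logistic function, written to avoid overflow for very negative s."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


# Hypothetical track features from the text.
features = {"danceability": 0.2, "speechiness": 0.5, "valence": 0.1}
# Illustrative weights: A = 20, B = 0, C = -2 (not the fitted values).
weights = {"danceability": 20.0, "speechiness": 0.0, "valence": -2.0}

S = sum(features[name] * weights[name] for name in features)
score = logistic(S)
print(S, score)  # S is 3.8 and the score comes out at about 0.98

print(logistic(1000), logistic(-1000))  # 1.0 0.0
```

Even with a low “Danceability” of .2, a weight as large as 20 drags the score most of the way to 1, so this track would be called a chewy beat.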

If we look at the actual weights the classifier ended up using (which it calculated using the training data), we see that the two most important features for CB were “Danceability” with a feature weight of 6.39 and “Speechiness” with a feature weight of 8.64. This isn’t wildly surprising, as looking through the playlist it’s largely composed of hip hop, R&B, and rap, genres which are often danceable and lyrically oriented. On the other hand the B3C feature weights were at most about half as large, the two biggest being “Acousticness” at -1.55 and “Liveness” at -3.30. Glossing over some of the subtlety around variable scale, this set of weights indicates (to me) that tracks could pretty easily be identified as chewy beats based on their features, but that often the decision to assign a track to BL33P C O R E was made just because the track wasn’t clearly in CB (this argument can be made more rigorous by cranking up the LASSO penalty and observing which variables drop out of the classifier; the first to go were always the B3C features).

This goes a good way towards explaining the prediction discrepancy between the two playlists. Basically B3C wasn’t very identifiable in terms of the provided feature data, which is why the suggested playlist is such a hodgepodge. We can see how the recommendations reflect the features the classifier thought were important (there’s a healthy dose of live-sounding jazz, for example). The selections for CB, on the other hand, seem at least slightly more consistent with each other and with the original playlist (although that Julianna Barwick pick, for example, was probably chosen solely because it was “Speechy”, which suggests to me that CB wasn’t perfectly defined in the feature data either).

One track that had a really interesting effect on the overall model behavior was “Sunspell” by Geotic. This was a track that I had personally classed as B3C because of its higher-pitch, soft synth tone and spacey vibe; however, it’s “clearly” a chewy beat (at least according to my girlfriend and my binomial classifier). I ended up leaving it in the mix because I wanted to make sure the list included some tracks with a light groove, but this had a pretty big impact on the classifications. When it was included in the playlist my within-sample scores were pretty evenly spread between 0 and 1, ie. the classifier wasn’t really too confident about anything (it’s not overfit). Taking it out, however, sent everything to basically a perfect 1 or 0; suddenly it became very confident about what was BL33P C O R E and what was a chewy beat. Leaving it in was a modelling choice I made to keep the predictions for the playlist from getting too similar to what I had chosen for the original list. Variety is the spice of life.

The Lessons are Learned but the Damage is Irreversible

Basically classification is hard and so is playlist construction. Next up I’m hoping to do a pedagogical writeup of logistic regression from the “machine learning standpoint” (sort of the approach that I used here, employing LASSO penalties and stuff) as well as from the Bayesian perspective. I’d also like to do a rundown of how to use the Spotify API with ‘httr’ to pay forward all the copy+paste script kiddie-ing that I’ve done in this project.

Let me know what you think of my analysis and playlists!  I’ll try to post the cleaned track data on GitHub or something along with my analysis script in the next few days as well.

Updates: the code I used is available here, which also includes the .Rdata file if you want to load up my dataframes without running the whole thing.  I apologize if it’s an unreadable mess, but I refuse to improve.

WolframAlpha: Not Just for Cheating on Calc I Homework Anymore!

Today I’ve been doing some work on a project for a Data Assimilation class: an implementation of an ensemble Kalman Filter that uses an SEIR model and Google Flu Trends from 2003 to track flu incidence and model parameters in a big coupled model of the 10 Health and Human Services surveillance regions. When I first put everything together I assumed that each region was just a 10th of the total US population, because it was simpler than trying to track down actual population data. I know this is a pretty bad assumption, and I think it has been causing some inference quirks like concluding that the outbreak was a complete pandemic in every region (this, plus living in Boulder and having just visited Las Vegas, has caused “The Stand” to loom in the back of my mind a lot over the last week).
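For anyone who hasn't met an SEIR model before, here is a bare-bones Python sketch of a single region's dynamics, just to show where the regional population enters. This is the textbook compartment model with a forward Euler step, not my project code: there is no Kalman filter or coupling between regions here, and the parameter values and the function name `seir_step` are invented for illustration.

```python
def seir_step(state, population, beta, sigma, gamma, dt=1.0):
    """Advance (S, E, I, R) counts by one forward Euler step.

    beta:  transmission rate, sigma: rate of leaving the exposed class,
    gamma: recovery rate. population is the region's total N, which
    appears in the infection term beta * S * I / N.
    """
    s, e, i, r = state
    new_exposed = beta * s * i / population * dt
    new_infectious = sigma * e * dt
    new_recovered = gamma * i * dt
    return (s - new_exposed,
            e + new_exposed - new_infectious,
            i + new_infectious - new_recovered,
            r + new_recovered)


# Made-up numbers: a region of 1000 people with 10 initially exposed.
population = 1000.0
state = (990.0, 10.0, 0.0, 0.0)
for _ in range(10):
    state = seir_step(state, population, beta=0.5, sigma=0.25, gamma=0.2)
print(state)
```

The four compartments always add back up to the region's population, which is why getting that number badly wrong for a region throws off everything downstream of it.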

Anyways, I’ve been trying to do better with my population estimates. Unfortunately the HHS website was a total bust for easy-to-locate population numbers. Various abuses of Google’s fancy search bar such as “population MA+NY” also turned up bupkis. “What I really want”, I thought to myself, “is a piece of software that can interpret my mangled, semi-symbolic queries, search a giant database, and then return the queried value to me. It’d be something like…a…computational knowledge…engine…” Cue flashback to freshman year of undergrad; the “MyMathLab” homework website open in one tab and WolframAlpha in the other, feverishly copy+pasting problems 10 minutes before midnight.

I was actually a pretty big fan of WolframAlpha for my entire undergrad career. At the time I was totally unfamiliar with Mathematica, and so having another tool for troubleshooting or double checking my calculus (especially one that could accept pretty mangled or gnarly input) was invaluable in some of my upper level physics courses. I even went so far as to buy the phone app; it was only $2.99, but I still think that indicates a certain amount of affection and loyalty for the software. Iron Man has JARVIS, Holmes has Watson, and I have Stephen Wolfram (apologies to Dr. Wolfram if you are, for some reason, reading this).

Back to the present: I took my search efforts over to WolframAlpha and beheld glorious success.  The website can actually accept a query of the form “(Arkansas+Louisiana+New Mexico+Oklahoma,+Texas population in 2003)/(population of United States in 2003)” and return a value (that I’m just going to assume is accurate.  Error bars would be mindblowing, but beggars can’t be choosers).  That’s more or less the point of this post.  WolframAlpha (and Mathematica) really is an amazing product.  I’m not sure what the upper bound of sophistication would be if you were to try and fully integrate it into your inference procedures, but even at this level it’s really amazing.  And now they also make apps for the iPhone that provide reference for various specialized topics like cat breeds.

It’s Been a Weird Day

If you look at the timestamp on this post you can pretty easily make a general inference about where its title comes from. Clinton has lost the election to Trump, and I think a lot of others and I are spending the day coming to terms with that. It feels like nobody really knows what’s going to come next or what actions we should take.

Last night and this morning my girlfriend and I talked about both leaving the state and country, largely out of fear of political violence by the Trump White House, but it’s not clear to me how likely that outcome is and over what timescales we might expect it to develop.  There doesn’t seem to be an obvious calculus for this kind of decision making.  It certainly reminds me of Pascal’s wager, where an outcome is so overwhelmingly costly  (in Pascal it’s eternal damnation, but in this case it’s death/worse) that pretty much any decision algorithm returns only one choice (believe for Pascal, flee for me).

However as the day has gone on I’ve begun to get a “stand together” kind of vibe from my liberal corner of the social media, and I think this changes the decision calculus a little.  Staying offers the opportunity to effect positive social change (or at least neutralize negative social change) among the people I love and care about.  In many ethical systems I’ve encountered this is something that is considered to be of equal worth to ‘life itself’, but it also seems like the balance between this benefit and the cost of death or persecution is still modulated in some ways by the various probabilities of the outcomes.

Furthermore it’s deeply unclear to me what actions I can take to effect this change.  I see vague things like ‘participate’, and more specific suggestions such as volunteering for worthy organizations, but on some selfish level this feels beneath me?  That’s not an easy thing to admit, and I would like to qualify it a little bit.  I’m fairly well educated, and I have a number of ‘valuable’ skills (the standard ones that they present when they’re trying to shill a STEM education), and so ‘just volunteering’ feels a little like I’d be under-utilizing my talents.  On the other hand it’s not clear to me that these are talents that facilitate creating positive social change, nor is it clear if I possess them in sufficient strength to be using them in a productive way.

It’s possible that I’m going too all-or-nothing; that I see only the option of devoting myself to social progress against a lifetime of sequestering myself away from any social goal, and that maybe true progress comes when I find a way to compromise between these lifestyles.  Maybe volunteering for 5 hours a week or something is enough.  When I type it out now that seems obvious, but still feels vaguely unfulfilling in some ways.

I’ve always felt like there’s some kind of fog in my head causing me to jump to conclusions or decisions all at once, and then fixate on or around them without regard to other possible perspectives.  It’s like once I have an answer to a problem I can no longer develop alternatives in my head.  Something like functional fixation, but all the time for everything ever.  I’m not sure how this directly relates to the main post, but it’s certainly a factor somehow.

Long story short, I think this blog is going to be about how I relate myself and my background to promoting good.  I originally wanted to do some stuff on how algorithms are dangerous when they intersect with our lives unchecked, but I got beat to the punch by blogs like, alas.  So it’s all kind of a work in progress.  More to come for sure as I wrap my head around all of this.

A First Post and also a Project


I am starting this blog because it seems somewhat professional these days to have a grad school blog.  I’m not 100% sure what I want my regular posts to focus on, or even if I want to make regular posts at all.  My first motivation in starting this blog was actually just to find a way to upload a project I wrote for our undergraduate differential equations course.  I’m realizing now that WordPress may make me pay them to host files.

Edit (10:29am): They do not charge; will upload file to fresh page.