Music Personalization: Realtime Platforms

  1. 1.
    Music Personalization:
    Realtime Platforms
    ♫ + ML + You = ❤
    CrunchConf, Budapest, October 30, 2015

  2. 2.

    Esh Kumar
    Machine Learning & Data Products @ Spotify NYC
    @eshvk

  3. 3.

    Who am I?
    • UT Austin Machine Learning
    • Building Large Scale Recommendation Systems @
    Mozilla, StumbleUpon & Spotify

  4. 4.

    75 M+ Active Users

  5. 5.

    58 Markets

  6. 6.

    1 TB of Logs/Day

  7. 7.

    1200+ Node Hadoop Cluster

  8. 8.

    Products
    •Discover … to find new albums
    •Discover Weekly … A weekly Playlist
    •Editorial Playlist Recommendations
    •Radio

  9. 9.

    Music Personalization
    •Understanding People
    ➡ User Experience, Cultural Variations
    •Understanding Content
    ➡ Genres, Cultural knowledge
    •Models
    ➡ Collaborative Filtering, Content Based
    ML
    Content
    User

  10. 10.

    Music Personalization
    •Understanding People
    ➡ User Experience, Cultural Variations
    •Understanding Content
    ➡ Genres, Cultural knowledge
    •Models
    ➡ Collaborative Filtering, Content Based
    • News, Blogs, NLP

  11. 11.

    Music Personalization
    •Understanding People
    ➡ User Experience, Cultural Variations
    •Understanding Content
    ➡ Genres, Cultural knowledge
    •Models
    ➡ Collaborative Filtering, Content Based
    • News, Blogs, NLP
    • Manually tag attributes
    • Curation

  12. 12.

    Music Personalization
    •Understanding People
    ➡ User Experience, Cultural Variations
    •Understanding Content
    ➡ Genres, Cultural knowledge
    •Models
    ➡ Collaborative Filtering, Content Based
    • News, Blogs, NLP
    • Manually tag attributes
    • Curation
    • CF

  13. 13.

    30 Million Songs…
    What To Play?
    75 Million Users … 1 Person Every 3 Secs…

  14. 14.

    Recommendation Systems
    • Predict user response to options.
    • Rich field: Matrix completion, ranking, text models,
    latent factor models.
    • Several conferences annually: RecSys, NIPS, ICML, etc.
    • Industry researchers include NFLX, GOOG, MS and
    more…

  15. 15.

    Collaborative Filtering
    Hey,
    I like tracks P, Q, R, S!
    Well,
    I like tracks Q, R, S, T!
    Then you should check out
    track P!
    Nice! Btw try track T!
    Model you based on songs you played…
    Predict your future based on similar users…
    Millions of users and billions of streams…
    …. so there is someone like you out there
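
    The dialogue above can be sketched as a tiny user-to-user recommender. This is only an illustration of the idea, not the production model: the user names, the `recommend` helper, and the choice of Jaccard overlap as the similarity measure are all assumptions made for the example.

```python
def recommend(target, plays):
    """Suggest tracks that similar users played but `target` has not."""
    mine = plays[target]
    scores = {}
    for other, theirs in plays.items():
        if other == target or not (mine | theirs):
            continue
        # Jaccard overlap: shared tracks / all tracks either user played.
        overlap = len(mine & theirs) / len(mine | theirs)
        # Every track the other user has that we lack gets that user's weight.
        for track in theirs - mine:
            scores[track] = scores.get(track, 0.0) + overlap
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical listeners matching the dialogue on the slide.
plays = {
    "alice": {"P", "Q", "R", "S"},
    "bob": {"Q", "R", "S", "T"},
}
print(recommend("alice", plays))  # ['T']
print(recommend("bob", plays))    # ['P']
```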

  16. 16.

    Collaborative Filtering
    The Netflix Prize.
    A million dollars for beating NFLX’s
    best algorithms by ~10%.

  17. 17.

    Similarity
    Our problem is to figure out how similar two
    items are.
    Mathematically, this means modeling a function
    Similarity(x,y) for all users and items, if possible.

  18. 18.

    How do we do this?
    Matrix Completion. A matrix expresses a system. We model the
    data in the form of a matrix. For example, play counts for all songs
    and all users could be:
    \[
    \text{Users}\left\{
    \overbrace{
    \begin{pmatrix}
    s_{1,1} & s_{1,2} & 14 & \cdots & s_{1,n} \\
    s_{2,1} & s_{2,2} & 2 & \cdots & s_{2,n} \\
    \vdots & \vdots & \vdots & \ddots & \vdots \\
    s_{m,1} & s_{m,2} & 1 & \cdots & s_{m,n}
    \end{pmatrix}
    }^{\text{Song Plays}}
    \right.
    \]
    Column 3: "Call Me Maybe". Row m: Esh.
    Esh listened to "Call Me Maybe" once…

    \[
    \begin{pmatrix}
    s_{1,1} & \cdots & s_{1,n} \\
    \vdots & \ddots & \vdots \\
    s_{m,1} & \cdots & s_{m,n}
    \end{pmatrix}
    \approx
    \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_m \end{pmatrix}
    \begin{pmatrix} t_1 & t_2 & \cdots & t_n \end{pmatrix}
    \]

  19. 19.

    Matrix Completion is well studied …
    Start with random vectors around the origin. Run alternating least
    squares or gradient descent or stochastic gradient descent… All this
    is Hadoopable™.
    \[
    \text{Users}\left\{
    \overbrace{
    \begin{pmatrix}
    s_{1,1} & s_{1,2} & 14 & \cdots & s_{1,n} \\
    s_{2,1} & s_{2,2} & 2 & \cdots & s_{2,n} \\
    \vdots & \vdots & \vdots & \ddots & \vdots \\
    s_{m,1} & s_{m,2} & 1 & \cdots & s_{m,n}
    \end{pmatrix}
    }^{\text{Song Plays}}
    \right.
    \]
    Column 3: "Call Me Maybe". Row m: Esh.
    Esh listened to "Call Me Maybe" once…

    \[
    \begin{pmatrix}
    s_{1,1} & \cdots & s_{1,n} \\
    \vdots & \ddots & \vdots \\
    s_{m,1} & \cdots & s_{m,n}
    \end{pmatrix}
    \approx
    \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_m \end{pmatrix}
    \begin{pmatrix} t_1 & t_2 & \cdots & t_n \end{pmatrix}
    \]
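
    A minimal sketch of the stochastic gradient descent option, assuming a squared-error objective with L2 regularization over the observed play counts only. The toy play counts, the function names, and the hyperparameters (`dim`, `lr`, `reg`, `steps`) are illustrative assumptions, not values from the talk.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def squared_error(U, T, plays):
    """Sum of squared errors over the observed (user, track) entries."""
    return sum((s - dot(U[i], T[j])) ** 2 for (i, j), s in plays.items())


def factorize(plays, n_users, n_tracks, dim=2, steps=2000, lr=0.02, reg=0.01, seed=0):
    rng = random.Random(seed)
    # Start with small random vectors around the origin.
    U = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_users)]
    T = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_tracks)]
    for _ in range(steps):
        for (i, j), s in plays.items():
            err = s - dot(U[i], T[j])
            for k in range(dim):
                u, t = U[i][k], T[j][k]
                # Gradient step on the squared error plus the L2 penalty.
                U[i][k] += lr * (err * t - reg * u)
                T[j][k] += lr * (err * u - reg * t)
    return U, T


# Hypothetical observed play counts: (user index, track index) -> plays.
plays = {(0, 0): 5, (0, 1): 3, (1, 0): 4, (1, 2): 1, (2, 1): 1, (2, 2): 5}
U, T = factorize(plays, n_users=3, n_tracks=3)
print(squared_error(U, T, plays))  # far below the 77 an all-zero guess would give
print(dot(U[0], T[2]))             # a filled-in guess for an unobserved entry
```

    Only the observed entries drive the updates; the learned user and track vectors then supply predictions for every missing cell, which is the "completion" step.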

  20. 20.

    30 Million Songs…
    What To Play?
    75 Million People … 1 Person Every 3 Secs…

  21. 21.

    1.5 Billion Playlists

  22. 22.

    Language Models
    • Language models work well too. For example, a
    playlist could be considered as a document and
    you could learn the latent vectors for tracks
    (words).
    • Then represent a User as a linear combination
    of their Tracks.
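
    One way to read "linear combination" is a play-count-weighted average of track vectors. The sketch below assumes that weighting; the slide does not say which coefficients are used, and the `user_vector` helper and the two-dimensional track vectors are made up for illustration.

```python
def user_vector(play_counts, track_vectors):
    """Weighted average of track vectors, weighted by normalized play counts."""
    dim = len(next(iter(track_vectors.values())))
    total = sum(play_counts.values())
    vec = [0.0] * dim
    for track, count in play_counts.items():
        for k, x in enumerate(track_vectors[track]):
            vec[k] += (count / total) * x
    return vec


# Hypothetical 2-dimensional track vectors.
track_vectors = {"t1": [1.0, 0.0], "t2": [0.0, 1.0]}
print(user_vector({"t1": 3, "t2": 1}, track_vectors))  # [0.75, 0.25]
```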

  23. 23.

    word2vec
    Words with similar contexts have similar
    meaning

  24. 24.

    word2vec

  25. 25.

    word2vec
    Target Word
    Context Word

  26. 26.

    word2vec
    Target Words and Corresponding Contexts
              shining  bright  trees  dark  green
    stars          61      50     10    30      1
    sun            71      60      5     2      0
    cucumber        2       1     15     3     40
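
    The counts in the table are enough to see "similar contexts, similar meaning" numerically. The sketch below compares the rows with cosine similarity. Note that word2vec itself learns dense vectors rather than comparing raw counts, so this only illustrates the underlying intuition.

```python
import math

contexts = ["shining", "bright", "trees", "dark", "green"]
counts = {
    "stars": [61, 50, 10, 30, 1],
    "sun": [71, 60, 5, 2, 0],
    "cucumber": [2, 1, 15, 3, 40],
}


def cosine(a, b):
    """Cosine of the angle between two count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(round(cosine(counts["stars"], counts["sun"]), 2))       # 0.94
print(round(cosine(counts["stars"], counts["cucumber"]), 2))  # 0.12
```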

  27. 27.

    word2vec
    Playlists --(Read)--> CPU <--(Get Vectors & Update)--> Vectors

  28. 28.

    Vectors are awesome!
    •A unique fingerprint for every user, track,
    album, artist & even playlist, all in the same
    space.
    •Similarity is easily computable: Euclidean
    distance or cosine similarity.

  29. 29.

    Approximate Nearest Neighbors
    •Fast approximate nearest neighbor search.
    • Locality Sensitive Hashing
    • https://github.com/spotify/annoy
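
    A minimal sketch of locality sensitive hashing with random hyperplanes, assuming cosine-style similarity. It is not how Annoy works internally (Annoy builds a forest of random-projection trees), and the function names and toy vectors are hypothetical.

```python
import random
from collections import defaultdict


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes through the origin."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec, planes):
    # One bit per hyperplane: which side of the plane the vector falls on.
    return tuple(sum(v * p for v, p in zip(vec, plane)) >= 0 for plane in planes)


def build_index(vectors, planes):
    """Group item names by signature so lookups only scan one bucket."""
    buckets = defaultdict(list)
    for name, vec in vectors.items():
        buckets[signature(vec, planes)].append(name)
    return buckets


def candidates(query, planes, buckets):
    """Items sharing the query's bucket; rank these exactly afterwards."""
    return buckets.get(signature(query, planes), [])


vectors = {"a": [1.0, 2.0], "b": [-3.0, 0.5], "c": [2.0, 4.0]}
planes = make_hyperplanes(dim=2, n_bits=4)
buckets = build_index(vectors, planes)
print(candidates([1.0, 2.0], planes, buckets))  # includes 'a' and 'c'
```

    Vectors pointing in the same direction always share a bucket, and nearby vectors usually do. That is the trade: a small chance of missing a neighbor in exchange for scanning far fewer items.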

  30. 30.

    Vectors are great for Infrastructure too…
    •Machine Learning can be decomposed &
    abstracted away.
    •A Lambda Architecture involving Machine
    Learning becomes eas(ier).
    •Platforms for Personalization become
    possible….

  31. 31.

    The Record Store…
    The List Maker …
    How do you scale this?

  32. 32.

    Tools of the trade
    • Build models in Python. (NumPy, SciPy )
    • Jobs in Scalding + Luigi ( https://github.com/spotify/luigi )
    • Storm for real time.
    • In house RPC for serving requests.

  33. 33.

    Storm 101
    • Realtime Stream Processing.
    • Like Hadoop but easier.
    • Fault tolerant.
    • Java, Clojure (yay!) and more!

  34. 34.

    Storm @ Spotify
    • Major users are Ads & Personalization!
    • Every team manages its own cluster. For personalization, we have
    a 12 node cluster.
    • A relatively new technology compared to Hadoop™.

  35. 35.

    So why Storm?
    • Hadoop is slowwww. The daily User Vector job takes ~16 hours to
    run. Small Data FTW!
    • New Users are important; they need a friend!
    • What moment are you in? Gym, running, etc.?

  36. 36.

    Getting Data Across The Globe

  37. 37.

    HDFS
    Kafka
    Pipeline …
    User Listens
    Playlists
    Realtime Listens
    Spout

  38. 38.

    HDFS
    Kafka
    Pipeline …
    User Listens
    Playlists
    Realtime Listens
    Spout
    User Vector
    Generation Job
    Latent
    Vector
    Models
    Track, Artist, Album
    Vectors

  39. 39.

    HDFS
    Kafka
    Pipeline …
    User Listens
    Playlists
    Realtime Listens
    Spout
    User Vector
    Generation Job
    Latent
    Vector
    Models
    Track, Artist, Album
    Vectors
    Compressed
    Listening History
    Bolts
    Cassandra
    Cassandra

  40. 40.

    HDFS
    Kafka
    Pipeline + Platform
    User Listens
    Playlists
    Realtime Listens
    Spout
    User Vector
    Generation Job
    Latent
    Vector
    Models
    Track, Artist, Album
    Vectors
    Compressed
    Listening History
    Bolts
    Cassandra
    Cassandra
    Backend
    Systems
    •Top Albums
    •Top Tracks
    •Top Playlists

  41. 41.

    Discover New User
    •Going from two weeks of no
    recommendations to recommendations as
    soon as a user plays a track.
    •Successful A/B test
    •First team to build a production ready
    personalization feature using Storm.
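
    A rough sketch of how a realtime step could give a new user recommendations after the first play, assuming the user vector is kept as a running mean of the track vectors they have played. The `UserVectorStore` class, its method names, and the in-memory dictionaries (standing in for an external store) are hypothetical; this is not the actual Storm topology.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class UserVectorStore:
    """In-memory stand-in for the store a streaming job would write to."""

    def __init__(self, track_vectors):
        self.track_vectors = track_vectors
        self.vectors = {}  # user -> running mean of played track vectors
        self.counts = {}   # user -> number of listens seen so far

    def on_listen(self, user, track):
        """Handle one realtime listen event."""
        t = self.track_vectors[track]
        n = self.counts.get(user, 0)
        old = self.vectors.get(user, [0.0] * len(t))
        # Incremental mean: new = old + (t - old) / (n + 1)
        self.vectors[user] = [o + (x - o) / (n + 1) for o, x in zip(old, t)]
        self.counts[user] = n + 1

    def top_tracks(self, user, k=2):
        """Rank all tracks by cosine similarity to the user's vector."""
        u = self.vectors[user]
        ranked = sorted(
            self.track_vectors,
            key=lambda name: cosine(u, self.track_vectors[name]),
            reverse=True,
        )
        return ranked[:k]


# Hypothetical track vectors produced by an offline batch job.
track_vectors = {"rock1": [1.0, 0.0], "rock2": [0.9, 0.1], "jazz1": [0.0, 1.0]}
store = UserVectorStore(track_vectors)
store.on_listen("new_user", "rock1")
print(store.top_tracks("new_user"))  # ['rock1', 'rock2']
```

    The update touches only one user's vector per event, so a single listen is enough to produce a ranking. A real system would likely filter out tracks the user has already played.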

  42. 42.

    Lessons Learnt …
    • Boring technology works well. Complicated Storm
    Topology = Bad. (Dan McKinley)
    • Storm is nice. Would have preferred reusing batch
    Scalding Code. Maybe Spark Streaming?
    • Grow your API from one use case to another. Don’t
    solve for everything at one time.

  43. 43.

    Join the band!
    • Machine Learning, Data & Backend Gigs.
    • Now touring in New York, Boston & Stockholm!
    • https://www.spotify.com/jobs/

  44. 44.

    Thanks !
    Esh Kumar
    @eshvk
