Picking What to Watch Next

build a recommendation system

* start downloading the datasets *


30 mins - understand recommenders

10 mins - have a look at our project:


50 mins - work on data preparation

50 mins - work on model

10 mins - try it out

What is a recommender?

  • Based on the data about the users and items
  • Make personalized recommendations
  • Important to online businesses where little or no human curation is involved

Types of recommenders

  • content-based systems
  • collaborative filtering systems
  • hybrid systems
    (just the combination of the 2 👍🏻)

Content-based systems

  • If you have watched Cowboy Bebop the anime
  • You probably would like Cowboy Bebop the live-action series
  • (not always true 🤦🏻‍♀️)

Collaborative filtering systems

  • If you and I have both watched a lot of the same anime
  • And I watched Cowboy Bebop the anime
  • Then you may like Cowboy Bebop the anime too
  • (there is popularity bias 🤷🏻‍♀️)
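
A minimal sketch of the user-based idea above, in plain Python. The users, titles other than Cowboy Bebop, and ratings are invented for illustration, and cosine similarity over shared titles is only one possible similarity measure.

```python
from math import sqrt

# Toy ratings: user -> {title: rating}. All values are made up for illustration.
ratings = {
    "me":  {"Cowboy Bebop": 5, "Anime A": 4, "Anime B": 5},
    "you": {"Anime A": 4, "Anime B": 5},
    "sam": {"Anime A": 1, "Anime B": 2, "Anime C": 5},
}

def cosine(u, v):
    """Cosine similarity computed over the titles both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[t] * v[t] for t in shared)
    norm_u = sqrt(sum(u[t] ** 2 for t in shared))
    norm_v = sqrt(sum(v[t] ** 2 for t in shared))
    return dot / (norm_u * norm_v)

target = "you"

# Find the other user whose ratings look most like the target's.
neighbour = max(
    (u for u in ratings if u != target),
    key=lambda u: cosine(ratings[target], ratings[u]),
)

# Suggest what the neighbour has rated but the target has not seen yet.
suggestions = [t for t in ratings[neighbour] if t not in ratings[target]]
print(neighbour, suggestions)
```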

In our workshop

Data: ratings.csv and movies.csv from MovieLens Datasets (full version)

Method: collaborative filtering

Problem: Unpopular movies, inactive users -> sparse data points

Preparing the data

  • there are long tails in the data
  • many movies received only a few ratings (unpopular)
  • many users only rate a few movies (inactive)
  • we only want to use the popular movies and active users to reduce bias

Preparing the data

  • do it in _prep_data
  • create a filter for popular movies
  • create a filter for active users
  • only use movies that are popular and users that are active
  • keep only the ratings that satisfy both filters
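
One way the two filters could look, sketched with pandas (an assumption; the workshop's _prep_data may do it differently). The column names follow the MovieLens ratings.csv layout; the toy rows and the threshold names and values are hypothetical.

```python
import pandas as pd

# Toy stand-in for ratings.csv (MovieLens columns: userId, movieId, rating).
ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 2, 3, 3, 4],
    "movieId": [10, 20, 30, 10, 20, 30, 10, 20, 40],
    "rating":  [4.0, 3.5, 5.0, 4.5, 3.0, 4.0, 5.0, 2.5, 1.0],
})

# Hypothetical thresholds; the full dataset needs much larger values.
MOVIE_RATING_THRES = 2  # keep movies with at least this many ratings
USER_RATING_THRES = 2   # keep users with at least this many ratings

# Filter for popular movies: count ratings per movie.
movie_counts = ratings.groupby("movieId").size()
popular_movies = movie_counts[movie_counts >= MOVIE_RATING_THRES].index

# Filter for active users: count ratings per user.
user_counts = ratings.groupby("userId").size()
active_users = user_counts[user_counts >= USER_RATING_THRES].index

# Keep only the ratings that satisfy both filters.
filtered = ratings[
    ratings["movieId"].isin(popular_movies) & ratings["userId"].isin(active_users)
]
print(len(ratings), "->", len(filtered), "ratings kept")
```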

Preparing the data

  • create an m x n array
  • m: number of movies, n: number of users
  • create a mapper from movie title to index
  • the m x n array is sparse (not every user rates every movie)
  • transform the array into a SciPy sparse matrix
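
The steps above can be sketched as follows, assuming pandas for the pivot (not confirmed by the slides). The names hashmap and movie_user_mat_sparse are borrowed from the pickle file names used later in the workshop; the toy rows and titles are invented.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy stand-ins for the (already filtered) ratings.csv and movies.csv.
ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 20],
    "rating":  [4.0, 3.5, 4.5, 4.0, 2.5],
})
movies = pd.DataFrame({
    "movieId": [10, 20, 30],
    "title":   ["Movie A", "Movie B", "Movie C"],
})

# m x n array: one row per movie, one column per user, 0 where there is no rating.
movie_user_mat = ratings.pivot(
    index="movieId", columns="userId", values="rating"
).fillna(0)

# Mapper from movie title to row index in the array.
titles = movies.set_index("movieId").loc[movie_user_mat.index, "title"]
hashmap = {title: i for i, title in enumerate(titles)}

# Store only the non-zero entries.
movie_user_mat_sparse = csr_matrix(movie_user_mat.values)
print(movie_user_mat_sparse.shape, movie_user_mat_sparse.nnz)
```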

Our model

  • collaborative filtering
  • user-based or item-based
  • KNN model: similar movies are close to each other (nearest neighbours)
  • similar means they received similar ratings from users
  • feed the m x n array (sparse matrix) into KNN model
  • run inference to get the top n nearest neighbours

Our model

  • using scikit-learn NearestNeighbors:
    self.model = NearestNeighbors()
  • parameters are set in set_model_params
  • in _inference get the distances and indices of the top n_neighbours
  • try different settings
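
A small sketch of fitting and querying the model on a toy matrix. The cosine metric, brute-force algorithm and n_neighbors=2 are assumed settings for illustration, not necessarily what set_model_params uses in the project.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Rows are movies, columns are users (0 means not rated). Toy values.
movie_user_mat_sparse = csr_matrix(np.array([
    [5.0, 4.0, 0.0, 1.0],  # movie index 0
    [4.0, 5.0, 0.0, 1.0],  # movie index 1: rated much like movie 0
    [0.0, 1.0, 5.0, 4.0],  # movie index 2
    [1.0, 0.0, 4.0, 5.0],  # movie index 3
]))

model = NearestNeighbors()
# Assumed settings; try other metrics and neighbour counts.
model.set_params(metric="cosine", algorithm="brute", n_neighbors=2)
model.fit(movie_user_mat_sparse)

# Query with the row of movie 0; the closest match is the movie itself.
distances, indices = model.kneighbors(movie_user_mat_sparse[0], n_neighbors=2)
print(indices[0], distances[0])
```

Because the query movie is part of the fitted data, it comes back as its own nearest neighbour, so a recommender would normally ask for one extra neighbour and drop the first result.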

Run our model in CLI

  • run the Python script from the command line with options
  • Example:
    python knn_recommender.py --movie_name "Iron Man" --top_n 10
  • Details about the options: see project repo
  • after the data is processed, it is pickled for faster reruns (and used in the PyScript version)
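
The caching idea can be sketched with the standard pickle module. The helper load_or_build is hypothetical, and a temporary folder stands in for the project folder; only the file name hashmap.p comes from the workshop.

```python
import pickle
import tempfile
from pathlib import Path

hashmap = {"Movie A": 0, "Movie B": 1}  # toy title-to-index mapper

cache_dir = Path(tempfile.mkdtemp())    # stand-in for the project folder
cache_file = cache_dir / "hashmap.p"

def load_or_build(path, build):
    """Return the pickled object if the file exists, otherwise build and pickle it."""
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    obj = build()
    with path.open("wb") as f:
        pickle.dump(obj, f)
    return obj

first = load_or_build(cache_file, lambda: hashmap)  # first run: builds and writes the pickle
second = load_or_build(cache_file, lambda: {})      # rerun: loads the pickle, build is skipped
print(second)
```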

Run our model in browser (PyScript)

  • make sure you have the pickle files (hashmap.p and movie_user_mat_sparse.p)
  • start a local server: python -m http.server
  • open the server address in your browser (http://localhost:8000 by default) and select knn_recommender.html

What have we learnt?

  • Know about the different types of recommenders
  • Build a simple recommender
  • See what data quality issues we may have
    • popularity bias: bias toward popular items
    • cold-start problem: new items do not have enough ratings yet

References and Credits