I watch a lot of movies

Date written Jul 8, 2020
Date updated Aug 7, 2020
Filed under Data & Viz. in ref

For better or for worse, I have watched over 800 titles. It seems natural to put that data to good use, at least academically. I expect this post to serve three main purposes:

  1. a way for me to ask as many questions as possible on a simple dataset to keep the creative juices flowing

  2. to keep thinking about informative ways to visualize the data that is ubiquitous around us

  3. a quick reference for Altair, a declarative charting library

Data

IMDb lets users rate any movie or TV show on an integer scale of 1 to 10. Conveniently, it also collects those ratings into a list, which I have made public.

I wrote a tiny web spider using Playwright that collects some basic information into a CSV file [1]: title, release year, genres, ratings (including mine), and total number of votes. It is little more than a straightforward set of CSS path selectors. I then organize the data further in a Jupyter notebook using Pandas DataFrames to make charting easier.
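
As a rough sketch of that organization step, assume the spider writes one row per title with the genres packed into a single comma-separated column. The column names and the two sample rows below are made up for illustration and are not taken from the actual dataset. Under that assumption, Pandas can split the file into two tidy tables:

```python
import io

import pandas as pd

# Hypothetical stand-in for the spider's CSV output.
# Column names and rows are illustrative only, not the real data.
RAW_CSV = """title,year,genres,my_rating,imdb_rating,votes
Example Movie A,1999,"Action, Sci-Fi",9,8.7,1000
Example Movie B,2014,"Drama",7,7.9,500
"""

raw = pd.read_csv(io.StringIO(RAW_CSV))

# One row per title: everything except the multi-valued genres column.
movies = raw.drop(columns=["genres"])

# One row per (title, genre) pair, which is easier to aggregate and chart.
movie_genres = (
    raw[["title", "genres"]]
    .assign(genre=raw["genres"].str.split(", "))
    .explode("genre")
    .drop(columns=["genres"])
    .reset_index(drop=True)
)
```

Keeping genres in their own long-format table means a per-genre chart needs no string handling at plotting time.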

My rule of thumb for storing and organizing data is to use the format I would design for a typical relational database. All downstream analysis can then be pretty much summarized as operations in relational algebra: Cartesian product, projection, selection, union and difference [2].
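
To make the relational framing concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The schema (a `movies` table and a `movie_genres` table) and the sample rows are assumptions invented for this example, not the layout of the actual dataset. Each query is annotated with the relational operations it corresponds to:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema and rows, for illustration only.
conn.executescript(
    """
    CREATE TABLE movies (
        id INTEGER PRIMARY KEY, title TEXT, year INTEGER, my_rating INTEGER
    );
    CREATE TABLE movie_genres (
        movie_id INTEGER REFERENCES movies(id), genre TEXT
    );
    INSERT INTO movies VALUES
        (1, 'Example Movie A', 1999, 9),
        (2, 'Example Movie B', 2014, 7);
    INSERT INTO movie_genres VALUES
        (1, 'Action'), (1, 'Sci-Fi'), (2, 'Drama');
    """
)

# Selection (WHERE keeps matching rows) followed by projection (only the title column).
high_rated = conn.execute(
    "SELECT title FROM movies WHERE my_rating >= 8"
).fetchall()

# Cartesian product of the two tables, then a selection on matching ids:
# together these amount to a join. GROUP BY adds an aggregation on top.
genre_counts = conn.execute(
    """
    SELECT g.genre, COUNT(*)
    FROM movies AS m, movie_genres AS g
    WHERE m.id = g.movie_id
    GROUP BY g.genre
    ORDER BY g.genre
    """
).fetchall()

# Difference: all genres, minus the genres that appear on a high-rated title.
never_high_rated = conn.execute(
    """
    SELECT genre FROM movie_genres
    EXCEPT
    SELECT g.genre
    FROM movies AS m, movie_genres AS g
    WHERE m.id = g.movie_id AND m.my_rating >= 8
    ORDER BY genre
    """
).fetchall()

conn.close()
```

The same operations map onto DataFrames as boolean indexing (selection), column indexing (projection), and `merge` (product plus selection).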

Questions

This is a list of questions I've thought of visualizing so far for a qualitative inspection of the statistics.

Do not forget to scroll as some charts may be larger than they appear.

Count based

Movies watched by release year

Genre heatmap by release year

Votes based

Distribution of votes

Distribution of votes by year

Ratings based

Distribution over all ratings

Distribution by release year

Distribution by genre


  1. The code and data are available at activatedgeek/imdb-ratings.
  2. More SQL-esque notions would be the operations of table merge and join.