For better or for worse, I have watched over 800 titles. It seems natural to put that data to good use, at least academically. This post is meant to serve three main purposes:
a way for me to ask as many questions as possible on a simple dataset to keep the creative juices flowing
to keep thinking about informative ways to visualize the data that is ubiquitous around us
a quick reference for Altair, a declarative charting library
IMDb allows users to rate every movie or TV show on a scale of 1 to 10, restricted to integer ratings. Conveniently, it also collects these ratings into a list, which I have made public.
I wrote a tiny web spider using Playwright which collects some basic information (title, release year, genres, ratings (including mine) and total number of votes) into a CSV file. The code and data are available at activatedgeek/imdb-ratings. The spider is a pretty straightforward set of CSS path selectors. I do some further organization in a Jupyter notebook, using DataFrames to make charting easier.
My rule of thumb for storing or organizing data is to use the format I would design for a typical relational database. All downstream analysis can then be pretty much summarized via operations in relational algebra: Cartesian product, projection, selection, union and difference. The more SQL-esque equivalents would be table merge and join.
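As a rough sketch of how these operations map onto pandas DataFrames: the two toy tables below, and every column name in them (`title_id`, `my_rating`, and so on), are made up for illustration and are not the actual schema of the CSV file.

```python
import pandas as pd

# Hypothetical toy data; the real CSV columns may differ.
titles = pd.DataFrame({
    "title_id": [1, 2, 3],
    "title": ["A", "B", "C"],
    "year": [1999, 2010, 2019],
    "my_rating": [8, 6, 9],
})

# One row per (title, genre) pair, as in a normalized relational schema.
genres = pd.DataFrame({
    "title_id": [1, 1, 2, 3],
    "genre": ["Drama", "Crime", "Comedy", "Drama"],
})

# Selection: keep the rows that satisfy a predicate.
recent = titles[titles["year"] >= 2010]

# Projection: keep a subset of the columns.
ratings_only = titles[["title_id", "my_rating"]]

# Join: attach genres to titles on the shared key.
joined = titles.merge(genres, on="title_id", how="inner")

# An aggregate over the joined table: mean rating per genre.
mean_by_genre = joined.groupby("genre")["my_rating"].mean()

# Union and difference of two selections.
high = titles[titles["my_rating"] >= 8]
union = pd.concat([recent, high]).drop_duplicates()
difference = high[~high["title_id"].isin(recent["title_id"])]
```

Keeping genres in their own table avoids packing a list into a single cell, so a per-genre question becomes a join followed by a group-by.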
This is a list of questions I've thought of visualizing so far for a qualitative inspection of the statistics.
Do not forget to scroll as some charts may be larger than they appear.