
Data Sources

Jun 7, 2020. Last updated: Aug 9, 2021

I've always wondered about stupid questions like "how many rolls of toilet paper were sold per household?" (funnily enough, I actually found something here). For all the important things in life, though, it is surprising how much data is available out in the open. I think we live in a glorious time.

Our World in Data - Research and data to make progress against the world’s largest problems

UN Data - A world of information

World Bank Open Data - Free and open access to global development data

Numbeo - Numbeo is the world’s largest database of user contributed data about cities and countries worldwide. Numbeo provides current and timely information on world living conditions including cost of living, housing indicators, health care, traffic, crime and pollution.

NOTE: I suspect Numbeo has some data bias (most likely upwards), but in my limited experience it gives reasonable ballpark estimates.

Unicorn Nest - Dataset on fundraising. It is unclear how clean the data is, or how helpful it is for answering questions about network effects in fundraising, etc.

OpenCellid - The world's largest Open Database of Cell Towers.

ESO National Grid Electricity System - forecast and historic data for: electricity demand, interconnectors, pump storage pumping, wind generation and solar generation.

The Pile - The Pile is an 825 GiB diverse, open source language modelling data set that consists of 22 smaller, high-quality datasets combined together.

Spotify Podcast Dataset - This dataset consists of 100,000 episodes from different podcast shows on Spotify.

The Big Mac Index - A way to compare currencies, by The Economist.

List of datasets for machine learning research - Surprise, surprise! Wikipedia has a list, of course.

Mindat.org is the world's largest open database of minerals, rocks, meteorites and the localities they come from.

OpenTrees.org - Data on trees.

Living Planet Index is a measure of the state of the world's biological diversity based on population trends of vertebrate species from terrestrial, freshwater and marine habitats.

Data Visualization

Tools

Vega - Vega-Lite for JS and Altair for Python are excellent declarative abstractions for data visualization.

kepler.gl - Geospatial analysis built on top of React. I use this to host My Wine Map.
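
To make "declarative" a little more concrete for the Vega entry above: a Vega-Lite chart is just a JSON document that says what to draw (data, mark, encodings) rather than how to draw it. Here is a minimal sketch in plain Python, assuming the Vega-Lite v5 schema. The helper name `bar_chart_spec`, the field names, and the numbers are all made up for illustration; none of them come from the tools themselves.

```python
import json


def bar_chart_spec(rows, x_field, y_field):
    """Build a Vega-Lite bar chart spec as a plain dict (hypothetical helper)."""
    return {
        "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
        # Inline data: a list of records, one dict per row.
        "data": {"values": rows},
        # The mark says which shape to draw for each row.
        "mark": "bar",
        # The encoding maps data fields to visual channels.
        "encoding": {
            "x": {"field": x_field, "type": "nominal"},
            "y": {"field": y_field, "type": "quantitative"},
        },
    }


# Made-up numbers, purely for illustration.
rows = [
    {"category": "A", "count": 3},
    {"category": "B", "count": 7},
    {"category": "C", "count": 5},
]

spec = bar_chart_spec(rows, "category", "count")
print(json.dumps(spec, indent=2))
```

The printed JSON can be pasted into the online Vega editor to render the chart. Altair produces the same kind of specification from Python objects, so this dict is roughly what sits underneath an Altair chart.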

Books & Articles


  1. The customization of Vega is challenging. Ad-hoc flexibility is limited, unfortunately.

© 2021 Sanyam Kapoor