Datasets

443 industry datasets across e-commerce, economics, and ML.

E-Commerce

35 datasets
JD.com 2020 (MSOM-20)

2.5M customers (457k purchasers) and 31,868 SKUs from JD.com

BestBuy

Mobile website clicks (~42k) for Xbox games from BestBuy

Retail Rocket

2.76M events (views, carts, purchases) from 1.4M visitors

Open E-Commerce 1.0 (MIT)

1.8M Amazon purchases with demographics (age, gender, location). Real household e-commerce behavior at scale

Shopee

Dataset from Shopee's 2020 Code League competition

Flipkart

Sales dataset from Indian e-commerce platform Flipkart

Amazon Sessions (KDD Cup 23)

Sessions from 6 locales with 40k-500k products per locale

Coveo Shopping (SIGIR-21)

30M+ browsing events with query and image vectors for e-commerce search

Brazilian eCommerce

100,000 orders (2016-2018) structured in 9 relational tables from Olist

ASOS Digital Experiments

99 real A/B experiments with 24,153 time-granular snapshots for adaptive stopping research

Pakistan e-commerce

500k+ transactions (Mar 2016 - Aug 2018) from Pakistan's largest e-commerce platform

Polish Supermarket POS Dataset

Transaction-level checkout data with timestamps showing start/end times, basket size, and payment method. Distinguishes staffed vs self-service.

Open CDP

Omnichannel interaction tracking with AI-driven identity resolution

Alibaba Ads (IJCAI-18)

6 billion display ad/click logs over 8 days from 100M users

Rakuten SIGIR

E-commerce dataset for SIGIR workshop from Rakuten

Amazon Reviews (2023)

571M reviews (1996-2023), 33 categories, 48M items - comprehensive Amazon review dataset

Alibaba Industrial Dump (150GB)

Large-scale industrial dataset from Alibaba (150GB)

Online Shopping Intention

12,330 user sessions with numerical and categorical features for purchase prediction

OTTO Session-based Recommendations

12M+ e-commerce sessions with click → cart → order sequences. Real multi-stage conversion funnel data from German retailer

Alibaba Personalized Re-Ranking

Mobile shopping user click data on recommended items

Wayfair Search (WANDS)

233k human-annotated query-product judgments, 43k products

Google Merchandise

3 months of obfuscated GA4 e-commerce data (Nov 2020-Jan 2021)

Alibaba Cloud Theme

Themed dataset related to Alibaba Cloud services

JD.com Search

170,000 users' real search queries (2021-2022) from JD.com

M5Product

5 modalities (image, text, table, video, audio), 6M+ samples for multimodal learning

JD-pretrain-data

Encoded search queries and item data for intent detection

Alibaba Clickstream 2018

Clickstream data from Alibaba platforms (2018)

Alibaba User Behavior 2018

649M user interactions (clicks, carts, buys) on 25M items

Open e-commerce 1.0

1.85M Amazon purchases from 5,027 US consumers (2018-2022) linked to demographics (age, gender, income, education)

Alibaba Mobile 2021

Mobile user behavior data from Alibaba (2021)

Alibaba Brick and Mortar (IJCAI-16)

Online and offline check-ins/purchases from 1,000+ stores

OTTO Session Data

12M German e-commerce sessions with click → cart → order sequences. RecSys 2022 competition

Tmall Reviews

Product reviews from Tmall (Alibaba's B2C platform)

Flipkart Products

Product information scraped from Flipkart e-commerce platform

Home Depot Product Search

Human-rated relevance scores (1-3) for search terms and products

Transportation & Mobility

17 datasets
Chicago TNC Trips

100M+ rideshare trips with fares (unlike NYC which lacks fare data). Trip-level pricing for Uber/Lyft economic analysis

Flight Delay and Cancellation 2019-2023

5 years of flight delay and cancellation data including COVID-era disruptions for aviation operations research.

Uber Movement

Zone-to-zone travel times and street speeds for 50+ cities worldwide. Congestion patterns from actual Uber rides

Chicago Taxi Trip Data

200M+ taxi trip records from 2013-present with persistent Taxi ID enabling driver-level panel analysis. Includes complete fare decomposition (base, extras, tolls, tips) across 77 community areas. Uniquely suited for supply-side revealed preference analysis of labor supply decisions.

London Passenger Mode Choice (LPMC)

134K trips from London Travel Demand Survey with level-of-service variables computed for all four modes (walking, cycling, transit, driving). Complete choice-set construction for mode choice analysis.

2015 Flight Delays and Cancellations

5M+ US flights from 2015 with departure/arrival times and delay information. Model runways as single-server queues.

NGSIM Vehicle Trajectories

Vehicle trajectory data for traffic flow modeling

NYC TLC Trip Records

3B+ taxi and rideshare trips since 2009. Fares, tips, surge pricing, driver pay. The gold standard for marketplace analytics

Flight Delay Dataset 2018-2022

Recent flight delay data covering 2018-2022 for analyzing aviation queueing and delay patterns.

Grab Driving GPS Traces

GPS trace data from Grab ride-hailing platform

Natural Driving in Ohio

ADAS-equipped vehicles with driving behavior events

Airline Delay

Airline flight delays and carrier information

SwissMetro Dataset

Stated preference survey with explicit availability flags for train, SwissMetro (maglev), and car alternatives. Methodological gold standard for discrete choice modeling with Biogeme.

BTS Airline On-Time Performance

All US flights since 1987. Delays, cancellations, fares, capacity. Revenue management research goldmine

TfL Unified API

Transport for London real-time and historical transit data including arrivals, departures, and service disruptions across all modes.

MTA Open Data

NYC subway and bus real-time feeds and historical performance data including ridership, delays, and service metrics.

DiDi GAIA Open Data

Billions of GPS points and ride trajectories from China's largest ride-hailing platform. Driver behavior and urban mobility patterns

Healthcare

13 datasets
CMS Hospital Price Transparency

Hospital pricing data mandated since 2021. Negotiated rates, chargemaster prices across 6,000+ hospitals. Healthcare pricing research

NYC EMS Incident Dispatch Data

4.83M+ ambulance responses with second-level timestamps for dispatch, en route, and on-scene times. Ideal for M/M/c modeling of ambulance fleets.
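
As a rough illustration of the M/M/c modeling mentioned above, the sketch below computes the Erlang C probability that an arriving call has to wait, plus the mean time spent waiting. The function names and the example rates are assumptions chosen for illustration; they are not estimates from the dispatch data.

```python
from math import factorial


def erlang_c(arrival_rate: float, service_rate: float, servers: int) -> float:
    """Probability that an arrival must wait in an M/M/c queue (Erlang C)."""
    offered_load = arrival_rate / service_rate   # a = lambda / mu
    utilization = offered_load / servers         # rho = a / c
    if utilization >= 1:
        raise ValueError("unstable queue: need arrival_rate < servers * service_rate")
    # Weight of the "all servers busy" states, including the queue tail.
    busy = offered_load**servers / factorial(servers) / (1 - utilization)
    # Weight of the states with at least one idle server.
    idle = sum(offered_load**k / factorial(k) for k in range(servers))
    return busy / (idle + busy)


def mean_wait(arrival_rate: float, service_rate: float, servers: int) -> float:
    """Expected time in queue before service starts, in the units of the rates."""
    p_wait = erlang_c(arrival_rate, service_rate, servers)
    return p_wait / (servers * service_rate - arrival_rate)


# Hypothetical fleet: 2 calls per hour, each unit serves 2 calls per hour, 2 units.
p_wait = erlang_c(arrival_rate=2.0, service_rate=2.0, servers=2)
wait_hours = mean_wait(arrival_rate=2.0, service_rate=2.0, servers=2)
```

With real dispatch timestamps, the arrival rate and service rate would first have to be estimated per time window, and the exponential assumptions checked, before these formulas are trusted.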

OPTN Organ Transplant

Complete US organ donation records since 1987. Waiting lists, donor-recipient matches, outcomes. Market design and matching research

Medicare Provider Utilization

All Medicare providers with service utilization and payment data. CMS public use files for healthcare analytics

Hospital Triage and Patient History

Triage data with timestamps and patient history for modeling priority queues in emergency medicine.

Synthetic Patient Wait Time Records

Hospital wait time records for analyzing patient flow and queue dynamics in healthcare settings.

MIMIC-CXR

473,000 chest X-rays from 45,561 patients linked to treatments, outcomes, and clinical notes. Largest medical imaging dataset for causal analysis.

TCGA (The Cancer Genome Atlas)

11,000+ tumor samples across 33 cancer types with ~20,000 gene expression features per sample. Over 3M Cox models fit identifying 100K+ prognostic biomarkers.

NEMSIS Public Release Research Dataset

49M+ EMS activations nationally with minute-level timestamps. The definitive dataset for US emergency medical services research.

Hospital Emergency Dataset

Patient care and operational efficiency data from hospital emergency departments for healthcare operations research.

ER Wait Time

Simulated emergency room data with wait times, patient outcomes, and satisfaction scores for healthcare queueing analysis.

10X Genomics PBMC

Single-cell RNA sequencing datasets (3k, 4k, 8k, 68k cells) with 80-95% zero entries. Standard benchmark for zero-inflated count models.

SUPPORT Study

8,873 seriously ill hospitalized patients with ~40 clinical variables. Standard survival analysis benchmark.

Entertainment & Media

25 datasets
Spotify Million Playlist

1M playlists with 2M unique tracks from 300K artists. RecSys 2018 Challenge for playlist continuation research

Netflix Prize

100M+ anonymous movie ratings from 480k users

KuaiSAR

5M search actions, 14.6M recommendation events from 25k users

Stanford Amazon/Beer

Amazon product data and BeerAdvocate reviews from Stanford SNAP

Netflix Viewing Behavior

1.7M episodes/movies watched by 1,060 users over 1 year. Watch patterns, session length, preferences, predictability metrics

Netflix Engagement Reports

Hours viewed for every Netflix title (original and licensed) watched >50K hours. First public streaming metrics since 2021

YouTube Engagement Dataset

5M videos with watch percentage, engagement maps, Freebase topic labels. Video-level engagement metrics for content research

ContentWise Impressions

Industrial OTT video streaming data with full recommendation lists shown to users, including position in list and interaction types (view, detail, rating, purchase). CIKM 2020 benchmark.

MIND (Microsoft News Dataset)

15M+ impression logs from 1M users over 6 weeks with explicit clicked/non-clicked news articles. Gold standard for impression-aware recommendation research with article titles, abstracts, categories, and user history.

Disney California Adventure Wait Times

Historical ride wait time data from Disney California Adventure for theme park queueing analysis.

MicroLens

1 billion interactions from 34 million users on 1 million micro-videos

Bandcamp Music Sales

Music sales data (digital/physical) from Bandcamp platform

Spotify Music Streaming Sessions (MSSD)

150M+ listening sessions with skips, track features, and playlist context. The largest public music streaming behavior dataset

LastFM-1B

1 billion listening events with long-term user histories. Music recommendation and listening behavior research

YouTube User Watch History

1.8M videos watched by 243 users over 1.5 years. Recommendation engine performance, caching research, viewing patterns

Twitch Gamers Social Network

168K nodes with mutual follower relationships. 6 ML tasks including churn, affiliate status, view count prediction

Goodreads

Book information and user reviews from Goodreads platform

Metacritic Video Games

Video game reviews and metadata from Metacritic

TouringPlans Disney World Data

Posted vs actual wait times for Disney World attractions from 2012-present. Premium dataset for validating queue estimates.

Queue-Times API

Real-time wait time data API for 80+ theme parks worldwide. Live queueing data for attractions.

Twitch Streaming Dataset

16 days of viewer counts, stream metadata, game categories from Oct 2017. Live streaming platform dynamics

CoasterQueues

10-minute interval wait time data for 48 theme parks worldwide. CC-BY licensed CSV downloads.

YFCC100M

100M Flickr photos/videos with metadata under Creative Commons. Yahoo/Flickr dataset for multimedia research

YouTube-8M

8M videos with video-level features for large-scale video understanding. Google Research benchmark for video classification

NetEase Music (INFORMS)

Data from NetEase Cloud Music for INFORMS competition

MarTech & Customer Analytics

10 datasets
Retail Rocket Recommender System Dataset

4.5 months of behavior data from a real e-commerce site: 2.7M sessions, item properties, and purchase events. Designed for session-based recommendation research.

Online Retail II (UCI)

1M+ transactions from a UK-based online retailer (2009-2011). Contains invoice number, stock code, description, quantity, invoice date, unit price, customer ID, and country. Standard benchmark for RFM analysis and CLV modeling.
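
A minimal sketch of the RFM (recency, frequency, monetary) computation this dataset is commonly used for, using only the standard library. The dictionary keys below are hypothetical lowercase stand-ins for the columns listed above, and the sample rows are made up for illustration rather than taken from the data.

```python
from collections import defaultdict
from datetime import date


def rfm_table(transactions, as_of):
    """Return {customer_id: (recency_days, frequency, monetary)}.

    recency_days: days between the customer's last invoice date and `as_of`
    frequency:    number of distinct invoices
    monetary:     total of quantity * unit_price across all line items
    """
    last_purchase = {}
    invoices = defaultdict(set)
    spend = defaultdict(float)
    for row in transactions:
        cid = row["customer_id"]
        when = row["invoice_date"]
        if cid not in last_purchase or when > last_purchase[cid]:
            last_purchase[cid] = when
        invoices[cid].add(row["invoice"])
        spend[cid] += row["quantity"] * row["unit_price"]
    return {
        cid: ((as_of - last_purchase[cid]).days, len(invoices[cid]), round(spend[cid], 2))
        for cid in last_purchase
    }


# Made-up line items: one row per product on an invoice.
sample = [
    {"customer_id": 1, "invoice": "A", "invoice_date": date(2011, 12, 1), "quantity": 2, "unit_price": 3.5},
    {"customer_id": 1, "invoice": "A", "invoice_date": date(2011, 12, 1), "quantity": 1, "unit_price": 1.25},
    {"customer_id": 1, "invoice": "B", "invoice_date": date(2011, 12, 5), "quantity": 4, "unit_price": 2.0},
    {"customer_id": 2, "invoice": "C", "invoice_date": date(2011, 11, 10), "quantity": 10, "unit_price": 0.5},
]
table = rfm_table(sample, as_of=date(2011, 12, 10))
```

On the real file, rows with a missing customer ID and cancelled or returned items (negative quantities) would likely need to be filtered or handled explicitly before scoring.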

bayesm R Package Datasets

Marketing datasets including cheese, orangeJuice, margarine purchases, physician detailing, and conjoint surveys.

Dominick's Finer Foods (Kilts Center)

Chicago supermarket scanner data 1989-1994. 3,500+ UPCs across 29 categories with prices, promotions, demographics.

Amazon Reviews 2023

571 million product reviews across all Amazon categories with user IDs, timestamps, ratings, and review text. The largest public e-commerce review dataset for recommendation research.

Telco Customer Churn (IBM)

7,043 customers from a telecommunications company with 21 features including demographics, services, account information, and churn status. Industry-standard dataset for churn prediction benchmarking.

Instacart Market Basket Analysis

3 million grocery orders from 200,000 Instacart users with product details and order sequences. Released for a Kaggle competition to predict which products users will reorder.

MovieLens 25M

25 million ratings and 1 million tag applications across 62,000 movies by 162,000 users. The gold-standard benchmark for collaborative filtering research with rich metadata including genres, tags, and timestamps.

KDD Cup 2009 Customer Relationship Prediction

Orange Telecom CRM dataset with 50,000 customers and 230 anonymized features. Predict churn, appetency (propensity to buy), and up-selling. Classic benchmark for CRM analytics.

Amazon Fine Foods Reviews

500,000+ food product reviews from Amazon spanning 1999-2012. Includes user/product IDs, ratings, helpfulness votes, and full review text. Popular for sentiment analysis and review-based recommendations.

Causal Inference

15 datasets
DoubleML

Chernozhukov's double/debiased machine learning package with 401k and other example datasets.

Nevo Cereal Data

Semi-fabricated ready-to-eat cereal data for teaching BLP methods. 24 brands, 94 markets.

News Headlines A/B Test Dataset

Real A/B test outcomes on click-through rates paired with BERT embeddings of headline text. From Columbia Business School.

IBM LBIDD Benchmark

Scalable semi-synthetic causal inference data from birth records (~100 covariates) with various link functions for treatment effect estimation.

Criteo Uplift Prediction Dataset

~25 million rows with treatment indicators for benchmarking Individual Treatment Effect (ITE) estimation in advertising

Lalonde NSW Job Training

National Supported Work experiment data. Canonical dataset for propensity score and matching methods.

MHE Data Archive (Angrist-Pischke)

Replication data for Mostly Harmless Econometrics including STAR, NJ minimum wage, Mariel boatlift.

PyBLP (BLP Automobile Data)

Berry-Levinsohn-Pakes automobile demand data 1971-1990. Foundational dataset for IO demand estimation.

GRF (Generalized Random Forests)

Athey-Tibshirani-Wager causal forest package with example datasets for heterogeneous treatment effects.

Synth (Synthetic Control)

Abadie's synthetic control package with California Prop 99 and Basque Country terrorism datasets.

dSprite Causal Benchmark

64×64 heart-shaped sprite images (4,096 dimensions) with fully synthetic data and known treatment effect functions. Supports backdoor, front-door, IV, and proxy causal learning identification strategies.

GiveDirectly Kenya (Satellite Imagery)

High-resolution satellite imagery with CNN-extracted features for 4,578 households in a cash transfer RCT. Provides genuine ground truth from randomized experimental design.

News (Johansson et al.)

Bag-of-words text benchmark with thousands of word features for HTE estimation. Semi-synthetic outcomes (real text covariates, simulated treatment/outcomes) with known CATE functions.

ACIC Competition Series (2016-2023)

Competition-validated causal inference benchmarks with 50-200 covariates and known treatment effects. ACIC 2022 used Medicare expenditure data across 200 realizations.

Amazon Reviews (CausalNLP)

~5,000 product reviews with simulated treatment-outcome relationships and documented confounders for causal inference from text.

Grocery & Supermarkets

28 datasets
Indian Grocery (Flipkart Supermart)

Flipkart Supermart transaction and product details

Brazil Medical

Medicine sales data in Brazil

UK Gift Shop (Online Retail)

Online retail transactions (2010-2011) from UK gift retailer

Office Supplies (DMDA 2023)

Office supply sales for DMDA 2023 workshop challenge

Montgomery Liquor

Warehouse and retail liquor sales from Montgomery County, Maryland

Vietnam Supermarket

Sales and inventory snapshot data from Vietnamese supermarket

Turkish Drugs

Drug sales data from Turkey

Israeli Grocery

Grocery purchase data from Israel

Ukraine eCommerce (Fozzy)

E-commerce sales data from Fozzy Group retail chain in Ukraine

Tesco Grocery 1.0

Grocery purchases from Tesco stores via loyalty cards

NYC Shopping

Large sales dataset from New York City retail

Walmart Sales

General sales data including CPI and unemployment rate

dunnhumby - Breakfast at the Frat

Time series dataset featuring 156 weeks of store-level sales data for household staples in four categories: mouthwash, pretzels, frozen pizza, and boxed cereal. Includes pricing, promotional activity, and store characteristics.

Brazilian Drugs (ANVISA)

Sales data for controlled substances reported by ANVISA

Indian Sales

Sales forecasting dataset for small basket items in India

Dominicks Soft Drinks

Weekly scanner data on soft drink purchases from Dominick's Finer Foods

Ecuador Grocery (Favorita)

Unit sales data with store/item metadata and oil prices from Ecuador

Iowa Liquor

Monthly Class E liquor sales data with volume and pricing from Iowa

Walmart (M5)

Hierarchical sales data for 3,049 products across 10 stores

Brazilian Store Chain

Sales data from Brazilian retail chain

Italian Grocers

Receipt-level sales data from Italian grocery stores

Mexican Grocery

Data from a Mexican grocery store

Store Item Demand

50 items across 10 different stores over 5 years

Rossmann Store Sales

Historical sales data for 1,115 Rossmann drug stores

Polish Grocery

Yearly sales data (2018) from Polish grocery shop

dunnhumby - Carbo-Loading

Relational database containing 2 years of household-level transaction data across 38 stores. Includes 5,000 households purchasing 125 products from pasta sauce and pasta categories with full demographic profiles.

dunnhumby - The Complete Journey

Comprehensive household-level panel data from 2,500 frequent shoppers over 2 years. Includes full purchase history, coupon redemptions, direct mail campaigns, and complete demographic/lifestyle profiles - the gold standard for CPG analytics.

dunnhumby - Let's Get Sort-of-Real

Large-scale synthetic dummy dataset with 300M+ transactions for testing analytics pipelines and algorithms at scale. Mimics real retail data structures without privacy concerns - ideal for scalability testing and teaching.

Financial Services

18 datasets
Damodaran Corporate Finance Data

Industry-level betas, WACC, multiples, margins for 170+ countries. Most comprehensive free corporate finance data.

FI-2010 Limit Order Book

4.3M samples of NASDAQ Nordic limit order book data. 10 depth levels, 5 stocks, normalized features. Benchmark for price prediction

IEEE-CIS Fraud Detection

590K card-not-present transactions with 393 features from Vesta Corp. Real messy fraud data (3.5% fraud rate)

Hashed Multimodal Banking

Banking transactions and product purchases with hashed identifiers

Uniswap DEX Trading Data (Harvard Dataverse)

82.9M decentralized exchange transactions from Uniswap V3 across Ethereum mainnet and Layer-2 networks (June 2021-Dec 2022). Wallet-level panel enables analysis of how crypto traders respond to gas fees and exchange rates. Published in Nature Scientific Data (2025).

ORBITAAL Bitcoin Transactions

13 years of Bitcoin transaction graphs (2009-2022). Complete blockchain with labeled entities. Network analysis at scale

ORBITAAL Bitcoin Graph

13 years (2009-2021) of entity-level Bitcoin transaction networks with BTC/USD values

PaySim Synthetic Transactions

6M+ mobile money transactions simulating real fraud patterns. Agent-based model calibrated on real African mobile money logs

LOBSTER Order Book

NASDAQ limit order book data at millisecond precision. Level 1-10 depth, message-by-message reconstruction. Market microstructure research

Prosper Loan Data

113K P2P loans with borrower characteristics and outcomes

Prosper Loans

113K P2P loans with borrower characteristics, credit grades, and loan outcomes. Alternative to LendingClub for P2P lending research

LendingClub Loans

2.7M loans (2007-2019) with 151 features. Interest rates, credit scores, defaults. The canonical P2P lending dataset for credit risk modeling

Amazon Fraud Detection Benchmark

9 consolidated fraud datasets with unified format. Includes IEEE-CIS, credit card, e-commerce fraud. Benchmark for fraud ML research

Kenneth French Data Library

Fama-French factor returns (3-factor, 5-factor, momentum) from 1926-present. The authoritative source for asset pricing research.

Shiller CAPE & Stock Market Data

S&P 500 prices, earnings, dividends, CAPE ratio from 1871-present. Nobel laureate Robert Shiller's canonical dataset.

Google BigQuery Crypto

8 complete blockchain histories (Bitcoin, Ethereum, etc.) with daily updates. Transaction-level data for crypto analytics research

SEC EDGAR Filings

21M+ public company filings since 1994. 10-Ks, 8-Ks, proxy statements. Full text + structured XBRL data

Pastor-Stambaugh Liquidity Factors

Aggregated liquidity levels and traded liquidity factor from 1962-present for asset pricing research.

Transportation Economics & Technology

10 datasets
NYC TLC Trip Records

Complete trip-level data for all NYC taxi and for-hire vehicle trips including Uber and Lyft. Billions of records since 2009 with pickups, dropoffs, fares, and tips.

Chicago Rideshare Data

Trip-level data for all Transportation Network Provider (Uber/Lyft) trips in Chicago since 2018. Includes ~57 million trips annually with origins, destinations, and fares.

National Household Travel Survey (NHTS)

Comprehensive US travel behavior data since 1969 capturing daily non-commercial travel by all modes. The authoritative source on American travel patterns.

DB1B Airline Origin and Destination Survey

10% random sample of all US airline tickets with origin, destination, fare, and itinerary details. Quarterly since 1993. The gold standard for airline pricing research.

Transitland GTFS Feeds

Aggregated GTFS data from 2,500+ transit agencies across 55+ countries. The largest open transit data aggregator with REST and GraphQL APIs.

Citi Bike System Data

Trip-level records for NYC's bike-share system since 2013. ~2 million trips monthly in peak season with station-level origin-destination and duration data.

National Transit Database (NTD)

Definitive source for US transit statistics since 2002. Ridership, operating expenses, capital expenses, safety incidents for all federally-funded transit agencies.

MobilityData GTFS Catalog

Curated directory of 1,327+ GTFS feeds from transit agencies globally with quality metrics, update frequency, and standardized metadata.

OpenStreetMap Road Network

Open-source global road network data including road types, speeds, and connectivity. Downloadable via Overpass API or pre-processed extracts.

FHWA Highway Statistics

Annual data on US highway system including vehicle miles traveled, fuel consumption, road infrastructure, and highway financing since 1945.

Data Portals

36 datasets
Amazon AWS Open Data

Registry of Open Data with analysis-ready datasets

Census Business Dynamics Statistics

8M+ establishments with firm age data. Job creation/destruction, startups, exits. Longitudinal firm dynamics since 1977

JD.com Open Datasets

Open dataset portal for e-commerce and logistics from JD.com

IBM Developer Data

AI, data science, healthcare, and weather datasets from IBM

DrivenData Water Supply Forecasting (2024)

Western US water supply data from Bureau of Reclamation, $500K prize pool for seasonal forecasting

Amazon ShopBench (KDD Cup 2024)

57 tasks, 20K questions derived from real Amazon shopping data for LLM shopping assistants

RecSys Challenge 2025 (Synerise)

1M users, 6 months of real e-commerce behavior logs with 5 event types for universal behavioral modeling

Hugging Face Datasets

ML/NLP datasets hub with 100K+ datasets. Easy loading via Python library. Community-driven repository

KDD Cup

ACM SIGKDD annual data mining competition

RecSys Challenge 2024 (EB-NeRD)

2.3M users, 380M+ news impressions from Ekstra Bladet for news recommendation research

PatentsView

13M+ US patents (1976-present) with citations, inventors, assignees. Full patent text and claims. Innovation research at scale

Makridakis Competitions

Time series data for forecasting competitions (M1-M5)

NeurIPS Competition Data

Top-tier conference with competitions and benchmarks

Inside Airbnb Raw Data

Raw data files from Inside Airbnb project

Rakuten Data Release

E-commerce, advertising, and multimedia datasets from Rakuten

Marketing Science Databases

INFORMS conference with data-focused opportunities

RecSys Datasets Collection

Datasets from ACM Recommender Systems challenges

Yongfeng Dataset Collection

E-commerce and recommendation system datasets

Julian McAuley Datasets

Reviews, recommendations, and social network data

Yandex Datasets

Search ranking, translation quality, and ML task datasets

NBER Public Use Data Archive

Eclectic mix of economic, demographic, and enterprise data from NBER-affiliated research projects

Data Mining Cup

Industry-sponsored data mining competitions

DrivenData

Data science competitions for social impact

IJCAI Competitions

International AI conference with competitions

Microsoft Research

Research tools and datasets across multiple domains

MSOM Data Challenges

Manufacturing & Service Operations Management challenges

OpenML

Platform for sharing datasets, tasks, and ML code

Google Dataset Search

Universal search engine for datasets across the web. Meta-tool for discovering research data

USASpending Federal Awards

All federal contracts, grants, loans since 2001. 400+ variables, $50T+ in awards. Government procurement analytics

CodaLab

Platform for competitions, benchmarks, and reproducible research

Baidu AI Datasets

AI, NLP, computer vision, and autonomous driving datasets

GH Archive

Complete public GitHub timeline since 2011 with 3B+ events. Captures all public repository activity including pushes, PRs, issues, stars, and forks. BigQuery and hourly downloadable archives available.

Software Heritage Archive

World's largest source code archive with 5B+ files from 300M+ projects. Preserves entire Git history with persistent identifiers (SWHIDs). Universal reference for software provenance research.

Stack Exchange Data Dump

Complete Q&A history from all Stack Exchange sites including Stack Overflow. Posts, users, votes, comments, and tags in XML format. Updated quarterly via archive.org.

TravisTorrent

2.6M builds from 1,300 GitHub projects with test outcomes, build durations, and commit metadata. Links CI/CD behavior to code changes for build prediction and process mining research.

JetBrains Developer Ecosystem Survey

Annual survey with 23,000-32,000 developer responses covering technology adoption, work patterns, and industry trends. Raw CSV data available for independent analysis.

Healthcare Economics & Health-Tech

11 datasets
MIMIC-IV

Gold standard for freely accessible critical care EHR data from MIT/Beth Israel. Contains 364,627 unique patients, 546,028 hospitalizations, and 94,458 ICU stays (2008-2022). Includes demographics, vitals, labs, medications, procedures, and clinical notes.

FDA Adverse Event Reporting System (FAERS)

Database of 21+ million adverse event reports for drugs and therapeutic biologics. Free API access through openFDA. Supports pharmacovigilance and drug safety research.

UK Biobank

Prospective cohort of 500,000 UK participants aged 40-69 with genetic data, imaging, and longitudinal health records. Extensive phenotyping including MRI, accelerometry, and linked hospital records.

NHANES (National Health and Nutrition Examination Survey)

Unique combination of interviews and physical examinations including blood/urine samples. Covers nutrition, chronic diseases, and environmental exposures. ~5,000 participants annually.

Medicare Claims (ResDAC)

Comprehensive claims data covering 98%+ of adults 65+ in the United States. Includes inpatient, outpatient, physician, and prescription drug claims. Research Identifiable Files require 6-12 month DUA approval.

Medical Expenditure Panel Survey (MEPS)

Definitive U.S. data on healthcare expenditures, utilization, and insurance coverage. Surveys ~15,000 households annually with detailed spending by payer and service type. Free public use files.

Merative MarketScan

De-identified commercial claims from 273+ million unique patients since 1995. Includes Commercial Claims, Medicare Supplemental, and Multi-State Medicaid databases. Cited in 2,650+ peer-reviewed studies.

Behavioral Risk Factor Surveillance System (BRFSS)

World's largest ongoing health survey with 400,000+ adults annually across all states. Covers chronic conditions, risk behaviors, and preventive health. State-level estimates available.

All of Us Research Program

NIH precision medicine initiative enrolling 1+ million diverse U.S. participants. Includes EHR, surveys, wearables, and genomics. Cloud-based Researcher Workbench provides secure access.

National Health Interview Survey (NHIS)

CDC's flagship health survey covering ~35,000 households annually since 1957. Monitors health status, healthcare access, and health behaviors. Free public use files with extensive documentation.

Oregon Health Insurance Experiment

First large-scale Medicaid RCT. 90K waitlist lottery with health utilization, financial, and credit outcomes.

Advertising

22 datasets
Real-Time Advertisers Auction

Real-time advertiser auction dataset for RTB research

Yahoo A1 Search Advertising Dataset

Search advertising competition dataset with sponsored search auction features and click outcomes

Alibaba Ads Dataset

Advertising dataset from Alibaba for ad targeting and prediction

Avazu

Dataset for click-through rate prediction on mobile ads

Soso (KDD Cup 2012)

KDD Cup 2012 Track 2 for sponsored search CTR prediction

Criteo Display Advertising

342GB total with 13 integer features, 26 hashed categorical features

Outbrain Click Prediction Dataset

Content recommendation dataset with 2 billion page views and user engagement data from Outbrain

Criteo Kaggle CTR Dataset

Standard CTR prediction benchmark with ~45 million records across 7 days, widely used for model comparison

Avazu Click-Through Rate Dataset

Mobile advertising dataset with 40+ million ad click records from Avazu mobile advertising platform

iPinYou RTB Dataset

Real-time bidding dataset from Chinese DSP with ~35GB of bid requests, impressions, clicks, and conversions with bidding prices

Harvard Dataverse Auctions

Auction-related replication datasets from Harvard Dataverse

Criteo Attribution Dataset

30 days of advertising traffic with conversion attribution data for multi-touch attribution research

iPinYou RTB

Real-time bidding (RTB) dataset for CTR prediction

Tencent Social Ads

Social ad CTR prediction dataset from Tencent

Criteo Terabyte

342GB, 45M samples with 13 integer features and 26 hashed categorical features for CTR prediction

Criteo 1TB Click Logs

World's largest public ML advertising dataset with 4+ billion events, 13 integer and 26 categorical features across 24 days

Adform Display

Display advertising dataset with impressions and clicks

Upworthy News Headlines

32,487 headline/image experiments on 538M assignments

Outbrain Click Prediction

Click prediction based on browsing history from Outbrain

Yoyi

Computational advertising dataset from Chinese ad platform

ICPSR Auction Studies

Search results for auction studies from ICPSR

Criteo Counterfactual Learning

25M logged interactions with counterfactual propensity scores. Gold standard for offline policy evaluation and causal inference in ads
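To make the role of logged propensity scores concrete, here is a minimal sketch of an inverse propensity scoring (IPS) estimate for offline policy evaluation. The log tuples, the ad names, and the always_ad_a policy are toy values invented for illustration; they do not reflect the dataset's actual schema.

```python
def ips_estimate(logs, target_policy):
    """Inverse propensity scoring estimate of a target policy's mean reward.

    logs: iterable of (context, logged_action, reward, logging_propensity).
    target_policy(context, action): probability the target policy picks action.
    """
    total = 0.0
    for context, action, reward, propensity in logs:
        # Reweight each logged reward by how much more (or less) likely the
        # target policy is to take the logged action than the logging policy.
        weight = target_policy(context, action) / propensity
        total += weight * reward
    return total / len(logs)


# Toy logs (hypothetical): context, action shown, click reward, propensity.
logs = [
    ("u1", "ad_a", 1.0, 0.5),
    ("u2", "ad_b", 0.0, 0.5),
    ("u3", "ad_a", 0.0, 0.25),
    ("u4", "ad_b", 1.0, 0.75),
]


def always_ad_a(context, action):
    """Deterministic target policy that always shows ad_a."""
    return 1.0 if action == "ad_a" else 0.0


estimate = ips_estimate(logs, always_ad_a)
print(estimate)  # 0.5 on the toy logs above
```

Plain IPS can have high variance when propensities are small; clipped or self-normalized variants are common alternatives.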

Dataset Aggregators

24 datasets
Google Cloud Public Datasets

20+ petabytes across 200+ datasets with 1TB free BigQuery queries monthly

WRDS (Wharton Research Data Services)

350+ terabytes from CRSP, Compustat, TAQ - de facto standard for academic finance

Criteo AI Lab Datasets

World's largest public ML dataset - 1TB Click Logs with 4 billion advertising events

Alpha Vantage

NASDAQ-licensed stock data for 200,000+ tickers with free tier (25 requests/day)

Kaggle Datasets

50,000+ public datasets with free GPU notebooks and active ML community of 23M members

FRED (Federal Reserve Economic Data)

816,000+ US macroeconomic time series from 100+ sources with free API

World Inequality Database

Income and wealth inequality data for 100+ countries by Piketty, Saez, and Zucman

Hugging Face Datasets

659,000+ datasets across text, image, audio, and tabular with one-line data loaders

EU Open Data Portal

2 million datasets from 205 catalogues across 36 European countries

AWS Open Data Registry

300+ petabytes across hundreds of datasets - Common Crawl, satellite imagery, genomics

IMF Data

International macroeconomic forecasts, BOP, and financial statistics for 195 countries

UCI Machine Learning Repository

688 curated benchmark datasets since 1987 - gold standard for ML research

Papers With Code Datasets

Datasets linked to research papers, code implementations, and SOTA leaderboards

AEA Data and Code Repository

Replication packages for all AEA publications since 2019 with DOI-assigned packages

IPUMS

Harmonized microdata from US Census (1850-present), ACS, CPS, and 103+ countries' censuses

Nasdaq Data Link

250+ datasets from 400+ publishers with API access - formerly Quandl

World Bank Open Data

1,400+ development indicators for 217 economies spanning 50+ years with free API

OECD Data

Harmonized indicators for 38 member countries - gold standard for advanced economy comparisons

Zenodo

CERN-operated research data repository with DOI citations, accepts all file types up to 50GB

ICPSR

World's largest social science archive - 250,000+ files across 16,000 studies since 1962

Harvard Dataverse

Global network of 120+ Dataverse installations hosting 75,000+ datasets with free 1TB storage

Data.gov

370,000+ datasets from US federal, state, and local agencies

OpenML

ML benchmarking platform with standardized train-test splits for reproducible comparisons

Google Dataset Search

Search engine indexing 45M+ datasets from 13,000+ websites using schema.org metadata

Insurance & Actuarial

12 datasets
SOA Mortality Tables

Society of Actuaries mortality tables and experience studies used as industry standards for life insurance pricing

HCUP (Healthcare Cost and Utilization Project)

Largest collection of longitudinal hospital care data in the US with 100+ million records per year covering inpatient and emergency visits

FEMA NFIP Claims & Policies

National Flood Insurance Program data with 2M+ claims since 1978 and policy-level information for flood risk modeling

CMS Medicare & Medicaid Data

Public use files from Centers for Medicare & Medicaid Services including claims data, provider statistics, and program enrollment

French Motor TPL (freMTPL2)

French motor third-party liability insurance dataset with 678K policies and claims - the standard benchmark for insurance ML papers

NHTSA FARS (Fatality Analysis Reporting System)

Complete census of fatal traffic crashes in the United States since 1975 with vehicle, person, and crash-level details

AXA Driver Telematics (Kaggle)

Driving behavior dataset with 50K driver trips characterized by second-by-second GPS coordinates for usage-based insurance

Human Mortality Database

Detailed mortality and population data for 40+ countries with life tables and exposure-to-risk calculations

EM-DAT International Disaster Database

Global database of 26K+ natural and technological disasters since 1900 with human and economic impact data

MEPS (Medical Expenditure Panel Survey)

Nationally representative survey of healthcare utilization, expenditures, insurance coverage, and health status for the US civilian population

NOAA Storm Events Database

Detailed records of significant weather events including property and crop damage estimates from 1950-present

French Motor Third Party Liability (MTPL)

678,000 motor insurance policies with 9 features and highly zero-inflated claim counts. Canonical insurance benchmark in CASdatasets and sklearn.

Social & Web

13 datasets
Pushshift Reddit Archive

5.6B comments, 651M posts since 2005. Full Reddit history for social/economic research. 100+ papers published

Yelp Dataset

Business attributes, reviews, user data, and check-ins

Wikipedia Pageviews

296B views/year since 2007. Hourly pageview data for all Wikimedia projects. Attention metrics at scale

Google Trends Datastore

Search interest data for nowcasting. Economic indicators, demand prediction, event detection

Meta (Facebook) Research

1.1B+ public FB/IG posts with engagement metrics

Facebook URL Shares

38M URLs with 10T exposure numbers, fact-checking flags, interaction types (2017-2019). Social Science One initiative

Common Crawl

250TB/month web crawl. 9.5 PB archive since 2008. Product listings, pricing, economic text at web scale

Meta Content Library

Full Facebook/Instagram public archive via ICPSR application. Posts, Pages, groups, events for academic research

US 2020 Election Study

Facebook/Instagram impact on political attitudes. Published in Science/Nature 2023. SOMAR Michigan access

Stack Overflow Data Dump

Full Q&A archive + annual developer survey (49K+ responses). Salaries, tech adoption, developer analytics

Wikipedia Full Database Dump

Complete Wikipedia content and metadata in SQL/XML format, includes all revisions and edit history

OpenStreetMap Planet

84GB PBF (2TB+ uncompressed) complete world map database with full edit history, weekly updates

SNAP Facebook Ego Networks

4K users with social circles and anonymized node features. Stanford Network Analysis Project dataset

Labor Markets

14 datasets
BLS JOLTS

Monthly job openings, hires, separations by industry since 2000. Bureau of Labor Statistics time series

Economic Tracker (COVID)

High-frequency economic indicators during COVID. Consumer spending, employment, small business revenue. Weekly updates.

Glassdoor Reviews

Company ratings, salary reports, interview experiences. Employer review platform data for labor analytics

CEPII BACI Trade Data

Harmonized bilateral trade flows for 200+ countries, 5,000+ products from 1995-present.

Revelio Labs COSMOS

4.1B job postings from 6.6M companies. Deduplicated, parsed, enriched workforce data (commercial/academic partnerships)

JobHop (Flanders)

2.3M occupations and 391K resumes with real career trajectories mapped to ESCO codes. Labor mobility research

Penn World Table

Cross-country GDP, productivity, capital stocks for 185 countries 1950-2023. Essential for growth research.

China Shock Data (Autor-Dorn-Hanson)

Import penetration, employment, wages by commuting zone 1990-2019. Canonical trade shock data.

Opportunity Atlas (Chetty)

Census tract-level economic mobility outcomes for 20M+ children. Earnings, incarceration, employment by parental income and race.

NBER-CES Manufacturing Database

U.S. manufacturing industry data 1958-2018. Output, employment, TFP, investment for 459 industries.

Social Capital Atlas

County and ZIP-level social capital measures from Facebook data. Economic connectedness, cohesiveness, civic engagement.

Moving to Opportunity (MTO)

HUD housing mobility experiment 1994-1998. 4,608 families randomized to housing vouchers with long-term outcomes.

Stack Overflow Developer Survey

49K+ annual responses with salaries, tech adoption, and developer analytics

The Gig Economy (Edison Research)

Marketplace-Edison Research Poll 2018 (PDF). Reports whether gig work is a primary or secondary source of income, broken out by gender, age group (18-34, 35-54, 55+), and race/ethnicity (Hispanic, African-American, White).