Datasets
443 industry datasets across e-commerce, economics, and ML.
E-Commerce
35 datasets
JD.com 2020 (MSOM-20)
2.5M customers (457k purchasers) and 31,868 SKUs from JD.com

BestBuy
Mobile website clicks (~42k) for Xbox games from BestBuy

Retail Rocket
2.76M events (views, carts, purchases) from 1.4M visitors

Open E-Commerce 1.0 (MIT)
1.8M Amazon purchases with demographics (age, gender, location). Real household e-commerce behavior at scale

Shopee
Dataset from Shopee's 2020 Code League competition

Flipkart
Sales dataset from Indian e-commerce platform Flipkart

Amazon Sessions (KDD Cup 23)
Sessions from 6 locales with 40k-500k products per locale

Coveo Shopping (SIGIR-21)
30M+ browsing events with query and image vectors for e-commerce search

Brazilian eCommerce
100,000 orders (2016-2018) structured in 9 relational tables from Olist

ASOS Digital Experiments
99 real A/B experiments with 24,153 time-granular snapshots for adaptive stopping research

Pakistan e-commerce
500k+ transactions (Mar 2016 - Aug 2018) from Pakistan's largest e-commerce platform

Polish Supermarket POS Dataset
Transaction-level checkout data with start/end timestamps, basket size, and payment method. Distinguishes staffed from self-service checkouts.

Open CDP
Omnichannel interaction tracking with AI-driven identity resolution
Alibaba Ads (IJCAI-18)
6 billion display ad/click logs over 8 days from 100M users
Rakuten SIGIR
E-commerce dataset for SIGIR workshop from Rakuten

Amazon Reviews (2023)
571M reviews (1996-2023), 33 categories, 48M items - comprehensive Amazon review dataset
Alibaba Industrial Dump (150GB)
Large-scale industrial dataset from Alibaba (150GB)

Online Shopping Intention
12,330 user sessions with numerical and categorical features for purchase prediction

OTTO Session-based Recommendations
12M+ e-commerce sessions with click → cart → order sequences. Real multi-stage conversion funnel data from German retailer
Alibaba Personalized Re-Ranking
Mobile shopping user click data on recommended items

Wayfair Search (WANDS)
233k human-annotated query-product judgments, 43k products

Google Merchandise
3 months obfuscated GA4 e-commerce data (Nov 2020-Jan 2021)
Alibaba Cloud Theme
Themed dataset related to Alibaba Cloud services

JD.com Search
170,000 users' real search queries (2021-2022) from JD.com
M5Product
5 modalities (image, text, table, video, audio), 6M+ samples for multimodal learning

JD-pretrain-data
Encoded search queries and item data for intent detection
Alibaba Clickstream 2018
Clickstream data from Alibaba platforms (2018)
Alibaba User Behavior 2018
649M user interactions (clicks, carts, buys) on 25M items
Open e-commerce 1.0
1.85M Amazon purchases from 5,027 US consumers (2018-2022) linked to demographics (age, gender, income, education)
Alibaba Mobile 2021
Mobile user behavior data from Alibaba (2021)
Alibaba Brick and Mortar (IJCAI-16)
Online and offline check-ins/purchases from 1,000+ stores

OTTO Session Data
12M German e-commerce sessions with click → cart → order sequences. RecSys 2022 competition
Tmall Reviews
Product reviews from Tmall (Alibaba's B2C platform)

Flipkart Products
Product information scraped from Flipkart e-commerce platform

Home Depot Product Search
Human-rated relevance scores (1-3) for search terms and products
Transportation & Mobility
17 datasets
Chicago TNC Trips
100M+ rideshare trips with fares (unlike NYC which lacks fare data). Trip-level pricing for Uber/Lyft economic analysis

Flight Delay and Cancellation 2019-2023
5 years of flight delay and cancellation data including COVID-era disruptions for aviation operations research.

Uber Movement
Zone-to-zone travel times and street speeds for 50+ cities worldwide. Congestion patterns from actual Uber rides
Chicago Taxi Trip Data
200M+ taxi trip records from 2013-present with persistent Taxi ID enabling driver-level panel analysis. Includes complete fare decomposition (base, extras, tolls, tips) across 77 community areas. Uniquely suited for supply-side revealed preference analysis of labor supply decisions.
London Passenger Mode Choice (LPMC)
134K trips from London Travel Demand Survey with level-of-service variables computed for all four modes (walking, cycling, transit, driving). Complete choice-set construction for mode choice analysis.

2015 Flight Delays and Cancellations
5M+ US flights from 2015 with departure/arrival times and delay information. Model runways as single-server queues.

NGSIM Vehicle Trajectories
Vehicle trajectory data for traffic flow modeling

NYC TLC Trip Records
3B+ taxi and rideshare trips since 2009. Fares, tips, surge pricing, driver pay. The gold standard for marketplace analytics

Flight Delay Dataset 2018-2022
Recent flight delay data covering 2018-2022 for analyzing aviation queueing and delay patterns.

Grab Driving GPS Traces
GPS trace data from Grab ride-hailing platform

Natural Driving in Ohio
ADAS-equipped vehicles with driving behavior events

Airline Delay
Airline flight delays and carrier information
SwissMetro Dataset
Stated preference survey with explicit availability flags for train, SwissMetro (maglev), and car alternatives. Methodological gold standard for discrete choice modeling with Biogeme.
BTS Airline On-Time Performance
All US flights since 1987. Delays, cancellations, fares, capacity. Revenue management research goldmine

TfL Unified API
Transport for London real-time and historical transit data including arrivals, departures, and service disruptions across all modes.

MTA Open Data
NYC subway and bus real-time feeds and historical performance data including ridership, delays, and service metrics.

DiDi GAIA Open Data
Billions of GPS points and ride trajectories from China's largest ride-hailing platform. Driver behavior and urban mobility patterns
Healthcare
13 datasets
CMS Hospital Price Transparency
Hospital pricing data mandated since 2021. Negotiated rates, chargemaster prices across 6,000+ hospitals. Healthcare pricing research
NYC EMS Incident Dispatch Data
4.83M+ ambulance responses with second-level timestamps for dispatch, en route, and on-scene times. Ideal for M/M/c modeling of ambulance fleets.
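
As a rough illustration of the M/M/c modeling mentioned above, here is a minimal Erlang C sketch. It assumes Poisson arrivals, exponential service times, and identical ambulances, none of which the dataset guarantees. The function names and the example rates are hypothetical and are not taken from the dataset.

```python
from math import factorial

def erlang_c(arrival_rate, service_rate, servers):
    """Probability that an arriving call has to wait in an M/M/c queue."""
    offered_load = arrival_rate / service_rate   # a = lambda / mu, in Erlangs
    utilization = offered_load / servers         # rho, must be < 1 for stability
    if utilization >= 1:
        raise ValueError("unstable: need arrival_rate < servers * service_rate")
    # States in which at least one server is still free
    below = sum(offered_load**k / factorial(k) for k in range(servers))
    # All servers busy, including the geometric tail of queued calls
    full = offered_load**servers / factorial(servers) / (1 - utilization)
    return full / (below + full)

def mean_wait(arrival_rate, service_rate, servers):
    """Expected time in queue, in the same time unit as the rates."""
    p_wait = erlang_c(arrival_rate, service_rate, servers)
    return p_wait / (servers * service_rate - arrival_rate)

# Hypothetical fleet: 6 calls per hour, 40-minute jobs (1.5 per hour), 6 units
print(erlang_c(6.0, 1.5, 6), mean_wait(6.0, 1.5, 6))
```

In practice the arrival and service rates would be estimated from the dispatch, en route, and on-scene timestamps, and they usually vary by hour of day, so a single stationary M/M/c fit is only a starting point.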

OPTN Organ Transplant
Complete US organ donation records since 1987. Waiting lists, donor-recipient matches, outcomes. Market design and matching research

Medicare Provider Utilization
All Medicare providers with service utilization and payment data. CMS public use files for healthcare analytics

Hospital Triage and Patient History
Triage data with timestamps and patient history for modeling priority queues in emergency medicine.

Synthetic Patient Wait Time Records
Hospital wait time records for analyzing patient flow and queue dynamics in healthcare settings.
MIMIC-CXR
473,000 chest X-rays from 45,561 patients linked to treatments, outcomes, and clinical notes. Largest medical imaging dataset for causal analysis.
TCGA (The Cancer Genome Atlas)
11,000+ tumor samples across 33 cancer types with ~20,000 gene expression features per sample. Over 3M Cox models fit identifying 100K+ prognostic biomarkers.

NEMSIS Public Release Research Dataset
49M+ EMS activations nationally with minute-level timestamps. The definitive dataset for US emergency medical services research.

Hospital Emergency Dataset
Patient care and operational efficiency data from hospital emergency departments for healthcare operations research.

ER Wait Time
Simulated emergency room data with wait times, patient outcomes, and satisfaction scores for healthcare queueing analysis.
10X Genomics PBMC
Single-cell RNA sequencing datasets (3k, 4k, 8k, 68k cells) with 80-95% zero entries. Standard benchmark for zero-inflated count models.
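
To make "zero-inflated count model" concrete, the sketch below writes out the zero-inflated Poisson (ZIP) probability mass function, one common member of that family. The parameter names `pi` (share of structural zeros) and `lam` (Poisson mean) and the example values are assumptions for illustration, not estimates from the PBMC data.

```python
from math import exp, factorial

def zip_pmf(k, pi, lam):
    """P(X = k) for a zero-inflated Poisson.

    With probability pi the count is a structural zero; otherwise it is
    drawn from Poisson(lam), which can also produce a zero.
    """
    poisson = exp(-lam) * lam**k / factorial(k)
    structural_zero = pi if k == 0 else 0.0
    return structural_zero + (1 - pi) * poisson

# Hypothetical parameters: 80% structural zeros, mean 2 for expressed genes
print(zip_pmf(0, 0.8, 2.0))  # overall share of zeros implied by the model
```

A plain Poisson with the same mean would predict far fewer zeros, which is why data with 80-95% zero entries are a common stress test for these models.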
SUPPORT Study
8,873 seriously ill hospitalized patients with ~40 clinical variables. Standard survival analysis benchmark.
Fashion & Apparel
7 datasets
Alibaba Fashion Combo
Fashion item combinations from Alibaba for outfit recommendation
Diginetica Fashion
Clickstream and purchase data for fashion e-commerce

ASOS Experiments
99 real e-commerce experiments with daily checkpoints from ASOS

Dressipi Fashion (RecSys 2022)
Session interactions and item features from styling service

Indonesian Fashion
Fashion items for image classification tasks from Indonesia

Fashion-MNIST
70,000 28x28 grayscale images of 10 fashion categories from Zalando

Innerwear Data
Data scraped from Victoria's Secret and other innerwear retailers
Entertainment & Media
25 datasets
Spotify Million Playlist
1M playlists with 2M unique tracks from 300K artists. RecSys 2018 Challenge for playlist continuation research

Netflix Prize
100M+ anonymous movie ratings from 480k users

KuaiSAR
5M search actions, 14.6M recommendation events from 25k users

Stanford Amazon/Beer
Amazon product data and BeerAdvocate reviews from Stanford SNAP

Netflix Viewing Behavior
1.7M episodes/movies watched by 1,060 users over 1 year. Watch patterns, session length, preferences, predictability metrics

Netflix Engagement Reports
Hours viewed for every Netflix title (original and licensed) watched >50K hours. First public streaming metrics since 2021

YouTube Engagement Dataset
5M videos with watch percentage, engagement maps, Freebase topic labels. Video-level engagement metrics for content research
ContentWise Impressions
Industrial OTT video streaming data with full recommendation lists shown to users, including position in list and interaction types (view, detail, rating, purchase). CIKM 2020 benchmark.
MIND (Microsoft News Dataset)
15M+ impression logs from 1M users over 6 weeks with explicit clicked/non-clicked news articles. Gold standard for impression-aware recommendation research with article titles, abstracts, categories, and user history.

Disney California Adventure Wait Times
Historical ride wait time data from Disney California Adventure for theme park queueing analysis.

MicroLens
1 billion interactions from 34 million users on 1 million micro-videos

Bandcamp Music Sales
Music sales data (digital/physical) from Bandcamp platform

Spotify Music Streaming Sessions (MSSD)
150M+ listening sessions with skips, track features, and playlist context. The largest public music streaming behavior dataset

LastFM-1B
1 billion listening events with long-term user histories. Music recommendation and listening behavior research

YouTube User Watch History
1.8M videos watched by 243 users over 1.5 years. Recommendation engine performance, caching research, viewing patterns

Twitch Gamers Social Network
168K nodes with mutual follower relationships. 6 ML tasks including churn, affiliate status, view count prediction

Goodreads
Book information and user reviews from Goodreads platform
Metacritic Video Games
Video game reviews and metadata from Metacritic

TouringPlans Disney World Data
Posted vs actual wait times for Disney World attractions from 2012-present. Premium dataset for validating queue estimates.

Queue-Times API
Real-time wait time data API for 80+ theme parks worldwide. Live queueing data for attractions.

Twitch Streaming Dataset
16 days of viewer counts, stream metadata, game categories from Oct 2017. Live streaming platform dynamics
CoasterQueues
10-minute interval wait time data for 48 theme parks worldwide. CC-BY licensed CSV downloads.

YFCC100M
100M Flickr photos/videos with metadata under Creative Commons. Yahoo/Flickr dataset for multimedia research

YouTube-8M
8M videos with video-level features for large-scale video understanding. Google Research benchmark for video classification

NetEase Music (INFORMS)
Data from NetEase Cloud Music for INFORMS competition
MarTech & Customer Analytics
10 datasets
Retail Rocket Recommender System Dataset
4.5 months of behavior data from a real e-commerce site: 2.7M events, item properties, and purchase events. Designed for session-based recommendation research.
Online Retail II (UCI)
1M+ transactions from a UK-based online retailer (2009-2011). Contains invoice number, stock code, description, quantity, invoice date, unit price, customer ID, and country. Standard benchmark for RFM analysis and CLV modeling.
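
For readers new to RFM analysis, a minimal sketch of the three quantities follows. The rows are made-up placeholders shaped like the fields listed above (customer ID, invoice date, quantity, unit price). Counting distinct purchase days as frequency is a simplifying assumption; with the real file, distinct invoice numbers would be the more natural choice.

```python
from collections import defaultdict
from datetime import date

# Hypothetical rows: (customer_id, invoice_date, quantity, unit_price)
rows = [
    ("C1", date(2011, 11, 1), 2, 5.0),
    ("C1", date(2011, 12, 1), 1, 20.0),
    ("C2", date(2010, 3, 15), 10, 1.5),
]

def rfm(rows, as_of):
    """Return {customer_id: (recency_days, frequency, monetary)}."""
    last_seen = {}
    purchase_days = defaultdict(set)
    spend = defaultdict(float)
    for customer, day, quantity, price in rows:
        # Recency is measured from the most recent purchase date
        last_seen[customer] = max(last_seen.get(customer, day), day)
        # Frequency: number of distinct days with a purchase (assumption)
        purchase_days[customer].add(day)
        # Monetary: total spend across all line items
        spend[customer] += quantity * price
    return {
        c: ((as_of - last_seen[c]).days, len(purchase_days[c]), spend[c])
        for c in last_seen
    }

table = rfm(rows, as_of=date(2011, 12, 10))
print(table["C1"])
```

The resulting triples are typically binned into quantile scores or fed into a CLV model as summary statistics.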
bayesm R Package Datasets
Marketing datasets including cheese, orangeJuice, margarine purchases, physician detailing, and conjoint surveys.
Dominick's Finer Foods (Kilts Center)
Chicago supermarket scanner data 1989-1994. 3,500+ UPCs across 29 categories with prices, promotions, demographics.

Amazon Reviews 2023
571 million product reviews across all Amazon categories with user IDs, timestamps, ratings, and review text. The largest public e-commerce review dataset for recommendation research.

Telco Customer Churn (IBM)
7,043 customers from a telecommunications company with 21 features including demographics, services, account information, and churn status. Industry-standard dataset for churn prediction benchmarking.
Instacart Market Basket Analysis
3 million grocery orders from 200,000 Instacart users with product details and order sequences. Released for a Kaggle competition to predict which products users will reorder.

MovieLens 25M
25 million ratings and 1 million tag applications across 62,000 movies by 162,000 users. The gold-standard benchmark for collaborative filtering research with rich metadata including genres, tags, and timestamps.

KDD Cup 2009 Customer Relationship Prediction
Orange Telecom CRM dataset with 50,000 customers and 230 anonymized features. Predict churn, appetency (propensity to buy), and up-selling. Classic benchmark for CRM analytics.
Amazon Fine Foods Reviews
500,000+ food product reviews from Amazon spanning 1999-2012. Includes user/product IDs, ratings, helpfulness votes, and full review text. Popular for sentiment analysis and review-based recommendations.
Causal Inference
15 datasets
DoubleML
Chernozhukov's double/debiased machine learning package with 401k and other example datasets.
Nevo Cereal Data
Semi-fabricated ready-to-eat cereal data for teaching BLP methods. 24 brands, 94 markets.
News Headlines A/B Test Dataset
Real A/B test outcomes on click-through rates paired with BERT embeddings of headline text. From Columbia Business School.
IBM LBIDD Benchmark
Scalable semi-synthetic causal inference data from birth records (~100 covariates) with various link functions for treatment effect estimation.

Criteo Uplift Prediction Dataset
~25 million rows with treatment indicators for benchmarking Individual Treatment Effect (ITE) estimation in advertising
Lalonde NSW Job Training
National Supported Work experiment data. Canonical dataset for propensity score and matching methods.
MHE Data Archive (Angrist-Pischke)
Replication data for Mostly Harmless Econometrics including STAR, NJ minimum wage, Mariel boatlift.
PyBLP (BLP Automobile Data)
Berry-Levinsohn-Pakes automobile demand data 1971-1990. Foundational dataset for IO demand estimation.
GRF (Generalized Random Forests)
Athey-Tibshirani-Wager causal forest package with example datasets for heterogeneous treatment effects.
Synth (Synthetic Control)
Abadie's synthetic control package with California Prop 99 and Basque Country terrorism datasets.
dSprite Causal Benchmark
64×64 heart-shaped sprite images (4,096 dimensions) with fully synthetic data and known treatment effect functions. Supports backdoor, front-door, IV, and proxy causal learning identification strategies.
GiveDirectly Kenya (Satellite Imagery)
High-resolution satellite imagery with CNN-extracted features for 4,578 households in a cash transfer RCT. Provides genuine ground truth from randomized experimental design.
News (Johansson et al.)
Bag-of-words text benchmark with thousands of word features for HTE estimation. Semi-synthetic outcomes (real text covariates, simulated treatment/outcomes) with known CATE functions.
ACIC Competition Series (2016-2023)
Competition-validated causal inference benchmarks with 50-200 covariates and known treatment effects. ACIC 2022 used Medicare expenditure data across 200 realizations.
Amazon Reviews (CausalNLP)
~5,000 product reviews with simulated treatment-outcome relationships and documented confounders for causal inference from text.
Food & Delivery
5 datasets
Instacart
3.4M orders, 206k+ users, 49k+ products with reorder behavior

Cainiao Last-Mile (MSOM18)
Cainiao Last-Mile Delivery dataset from MSOM 2018

Mendeley Food Delivery Reviews
1.69M reviews from DoorDash, Grubhub, Uber Eats. Ratings, text reviews, restaurant metadata. Gig economy platform research
Ele.me Clickstream
Clickstream data from Ele.me food delivery platform
Ele.me Search
Search log dataset from Ele.me (Chinese food delivery)
Grocery & Supermarkets
28 datasets
Indian Grocery (Flipkart Supermart)
Flipkart Supermart transaction and product details

Brazil Medical
Medicine sales data in Brazil

UK Gift Shop (Online Retail)
Online retail transactions (2010-2011) from UK gift retailer

Office Supplies (DMDA 2023)
Office supply sales for DMDA 2023 workshop challenge
Montgomery Liquor
Warehouse and retail liquor sales from Montgomery County, Maryland

Vietnam Supermarket
Sales and inventory snapshot data from Vietnamese supermarket

Turkish Drugs
Drug sales data from Turkey

Israeli Grocery
Grocery purchase data from Israel

Ukraine eCommerce (Fozzy)
E-commerce sales data from Fozzy Group retail chain in Ukraine

Tesco Grocery 1.0
Grocery purchases from Tesco stores via loyalty cards

NYC Shopping
Large sales dataset from New York City retail

Walmart Sales
General sales data including CPI and unemployment rate
dunnhumby - Breakfast at the Frat
Time series dataset featuring 156 weeks of store-level sales data for household staples in four categories: mouthwash, pretzels, frozen pizza, and boxed cereal. Includes pricing, promotional activity, and store characteristics.

Brazilian Drugs (ANVISA)
Sales data for controlled substances reported by ANVISA

Indian Sales
Sales forecasting dataset for small basket items in India

Dominicks Soft Drinks
Weekly scanner data on soft drink purchases from Dominick's Finer Foods

Ecuador Grocery (Favorita)
Unit sales data with store/item metadata and oil prices from Ecuador

Iowa Liquor
Monthly Class E liquor sales data with volume and pricing from Iowa

Walmart (M5)
Hierarchical sales data for 3,049 products across 10 stores

Brazilian Store Chain
Sales data from Brazilian retail chain

Italian Grocers
Receipt-level sales data from Italian grocery stores

Mexican Grocery
Data from a Mexican grocery store

Store Item Demand
50 items across 10 different stores over 5 years

Rossmann Store Sales
1,115 Rossmann drug stores historical sales data

Polish Grocery
Yearly sales data (2018) from Polish grocery shop
dunnhumby - Carbo-Loading
Relational database containing 2 years of household-level transaction data across 38 stores. Includes 5,000 households purchasing 125 products from pasta sauce and pasta categories with full demographic profiles.
dunnhumby - The Complete Journey
Comprehensive household-level panel data from 2,500 frequent shoppers over 2 years. Includes full purchase history, coupon redemptions, direct mail campaigns, and complete demographic/lifestyle profiles - the gold standard for CPG analytics.
dunnhumby - Let's Get Sort-of-Real
Large-scale synthetic dummy dataset with 300M+ transactions for testing analytics pipelines and algorithms at scale. Mimics real retail data structures without privacy concerns - ideal for scalability testing and teaching.
Automotive
5 datasets
BLP US Car Data
Classic dataset (1971-1990) for demand model estimation

German Used Cars
Used car listings or sales in Germany

European Car Market
Car information including prices and attributes (1970-1999)

Indian Automobiles (Telangana)
Vehicle sales data for Telangana, India (2023)

Russian Car Market
Car sales information in Russia
Auctions & Marketplaces
8 datasets
Crypto Art (SuperRare)
Bids and transactions from SuperRare NFT platform
Online Auctions Collection
Collection of datasets from eBay and experimental auctions

Used Car Auction (PakWheels)
Listings from PakWheels Pakistani automobile marketplace

EU TED Procurement
800K+ procurement notices annually. All EU public contracts above thresholds. Structured XML since 2006. Cross-country procurement research

Romania Tenders
Public tender data (2007-2016) from Romania

Ukraine Procurement (ProZorro)
Public procurement data from ProZorro system

Art Auction (Artists for Lahaina)
Artists for Lahaina benefit art auction data (2023)

FCC Spectrum Auctions
87+ auctions (1994-present) with round-by-round bidding data. Complete bid histories, reserve prices, winners. Auction theory empirics
Financial Services
18 datasets
Damodaran Corporate Finance Data
Industry-level betas, WACC, multiples, margins for 170+ countries. Most comprehensive free corporate finance data.

FI-2010 Limit Order Book
4.3M samples of NASDAQ Nordic limit order book data. 10 depth levels, 5 stocks, normalized features. Benchmark for price prediction

IEEE-CIS Fraud Detection
590K card-not-present transactions with 393 features from Vesta Corp. Real messy fraud data (3.5% fraud rate)

Hashed Multimodal Banking
Banking transactions and product purchases with hashed identifiers
Uniswap DEX Trading Data (Harvard Dataverse)
82.9M decentralized exchange transactions from Uniswap V3 across Ethereum mainnet and Layer-2 networks (June 2021-Dec 2022). Wallet-level panel enables analysis of how crypto traders respond to gas fees and exchange rates. Published in Nature Scientific Data (2025).

ORBITAAL Bitcoin Transactions
13 years of Bitcoin transaction graphs (2009-2022). Complete blockchain with labeled entities. Network analysis at scale

ORBITAAL Bitcoin Graph
13 years (2009-2021) of entity-level Bitcoin transaction networks with BTC/USD values

PaySim Synthetic Transactions
6M+ mobile money transactions simulating real fraud patterns. Agent-based model calibrated on real African mobile money logs
LOBSTER Order Book
NASDAQ limit order book data at millisecond precision. Level 1-10 depth, message-by-message reconstruction. Market microstructure research

Prosper Loan Data
113K P2P loans with borrower characteristics and outcomes

Prosper Loans
113K P2P loans with borrower characteristics, credit grades, and loan outcomes. Alternative to LendingClub for P2P lending research

LendingClub Loans
2.7M loans (2007-2019) with 151 features. Interest rates, credit scores, defaults. The canonical P2P lending dataset for credit risk modeling

Amazon Fraud Detection Benchmark
9 consolidated fraud datasets with unified format. Includes IEEE-CIS, credit card, e-commerce fraud. Benchmark for fraud ML research
Kenneth French Data Library
Fama-French factor returns (3-factor, 5-factor, momentum) from 1926-present. The authoritative source for asset pricing research.
Shiller CAPE & Stock Market Data
S&P 500 prices, earnings, dividends, CAPE ratio from 1871-present. Nobel laureate Robert Shiller's canonical dataset.

Google BigQuery Crypto
8 complete blockchain histories (Bitcoin, Ethereum, etc.) with daily updates. Transaction-level data for crypto analytics research

SEC EDGAR Filings
21M+ public company filings since 1994. 10-Ks, 8-Ks, proxy statements. Full text + structured XBRL data
Pastor-Stambaugh Liquidity Factors
Aggregated liquidity levels and traded liquidity factor from 1962-present for asset pricing research.
Transportation Economics & Technology
10 datasets
NYC TLC Trip Records
Complete trip-level data for all NYC taxi and for-hire vehicle trips including Uber and Lyft. Billions of records since 2009 with pickups, dropoffs, fares, and tips.

Chicago Rideshare Data
Trip-level data for all Transportation Network Provider (Uber/Lyft) trips in Chicago since 2018. Includes ~57 million trips annually with origins, destinations, and fares.

National Household Travel Survey (NHTS)
Comprehensive US travel behavior data since 1969 capturing daily non-commercial travel by all modes. The authoritative source on American travel patterns.
DB1B Airline Origin and Destination Survey
10% random sample of all US airline tickets with origin, destination, fare, and itinerary details. Quarterly since 1993. The gold standard for airline pricing research.

Transitland GTFS Feeds
Aggregated GTFS data from 2,500+ transit agencies across 55+ countries. The largest open transit data aggregator with REST and GraphQL APIs.

Citi Bike System Data
Trip-level records for NYC's bike-share system since 2013. ~2 million trips monthly in peak season with station-level origin-destination and duration data.

National Transit Database (NTD)
Definitive source for US transit statistics since 2002. Ridership, operating expenses, capital expenses, safety incidents for all federally-funded transit agencies.

MobilityData GTFS Catalog
Curated directory of 1,327+ GTFS feeds from transit agencies globally with quality metrics, update frequency, and standardized metadata.

OpenStreetMap Road Network
Open-source global road network data including road types, speeds, and connectivity. Downloadable via Overpass API or pre-processed extracts.

FHWA Highway Statistics
Annual data on US highway system including vehicle miles traveled, fuel consumption, road infrastructure, and highway financing since 1945.
Data Portals
36 datasets
Amazon AWS Open Data
Registry of Open Data with analysis-ready datasets

Census Business Dynamics Statistics
8M+ establishments with firm age data. Job creation/destruction, startups, exits. Longitudinal firm dynamics since 1977

JD.com Open Datasets
Open dataset portal for e-commerce and logistics from JD.com

IBM Developer Data
AI, data science, healthcare, and weather datasets from IBM
DrivenData Water Supply Forecasting (2024)
Western US water supply data from Bureau of Reclamation, $500K prize pool for seasonal forecasting

Amazon ShopBench (KDD Cup 2024)
57 tasks, 20K questions derived from real Amazon shopping data for LLM shopping assistants

RecSys Challenge 2025 (Synerise)
1M users, 6 months of real e-commerce behavior logs with 5 event types for universal behavioral modeling

Hugging Face Datasets
ML/NLP datasets hub with 100K+ datasets. Easy loading via Python library. Community-driven repository

KDD Cup
ACM SIGKDD annual data mining competition

RecSys Challenge 2024 (EB-NeRD)
2.3M users, 380M+ news impressions from Ekstra Bladet for news recommendation research

PatentsView
13M+ US patents (1976-present) with citations, inventors, assignees. Full patent text and claims. Innovation research at scale
Makridakis Competitions
Time series data for forecasting competitions (M1-M5)

NeurIPS Competition Data
Top-tier conference with competitions and benchmarks

Inside Airbnb Raw Data
Raw data files from Inside Airbnb project

Rakuten Data Release
E-commerce, advertising, and multimedia datasets from Rakuten
Marketing Science Databases
INFORMS conference with data-focused opportunities

RecSys Datasets Collection
Datasets from ACM Recommender Systems challenges
Yongfeng Dataset Collection
E-commerce and recommendation system datasets

Julian McAuley Datasets
Reviews, recommendations, and social network data

Yandex Datasets
Search ranking, translation quality, and ML task datasets

NBER Public Use Data Archive
Eclectic mix of economic, demographic, and enterprise data from NBER-affiliated research projects
Data Mining Cup
Industry-sponsored data mining competitions
DrivenData
Data science competitions for social impact

IJCAI Competitions
International AI conference with competitions

Microsoft Research
Research tools and datasets across multiple domains
MSOM Data Challenges
Manufacturing & Service Operations Management challenges

OpenML
Platform for sharing datasets, tasks, and ML code
Google Dataset Search
Universal search engine for datasets across the web. Meta-tool for discovering research data
USASpending Federal Awards
All federal contracts, grants, loans since 2001. 400+ variables, $50T+ in awards. Government procurement analytics
CodaLab
Platform for competitions, benchmarks, and reproducible research

Baidu AI Datasets
AI, NLP, computer vision, and autonomous driving datasets
GH Archive
Complete public GitHub timeline since 2011 with 3B+ events. Captures all public repository activity including pushes, PRs, issues, stars, and forks. BigQuery and hourly downloadable archives available.
Software Heritage Archive
World's largest source code archive with 5B+ files from 300M+ projects. Preserves entire Git history with persistent identifiers (SWHIDs). Universal reference for software provenance research.
Stack Exchange Data Dump
Complete Q&A history from all Stack Exchange sites including Stack Overflow. Posts, users, votes, comments, and tags in XML format. Updated quarterly via archive.org.
TravisTorrent
2.6M builds from 1,300 GitHub projects with test outcomes, build durations, and commit metadata. Links CI/CD behavior to code changes for build prediction and process mining research.
JetBrains Developer Ecosystem Survey
Annual survey with 23,000-32,000 developer responses covering technology adoption, work patterns, and industry trends. Raw CSV data available for independent analysis.
AI & LLM
5 datasets
LMSYS-Chat-1M
1M real-world conversations with 25 state-of-the-art LLMs spanning 154 languages

DiQAD
100K real-world user dialogues with comprehensive 6-dimension quality assessment
ConvAI Dataset
4,750 human-to-bot dialogues with thumbs up/down feedback plus quality scores

USS (User Satisfaction Simulation)
6,800 dialogues with 5-level satisfaction scale labels across multiple domains

Arena Human Preference (55K)
55K+ real-world conversations with human preference labels from Chatbot Arena
Healthcare Economics & Health-Tech
11 datasets
MIMIC-IV
Gold standard for freely accessible critical care EHR data from MIT/Beth Israel. Contains 364,627 unique patients, 546,028 hospitalizations, and 94,458 ICU stays (2008-2022). Includes demographics, vitals, labs, medications, procedures, and clinical notes.
FDA Adverse Event Reporting System (FAERS)
Database of 21+ million adverse event reports for drugs and therapeutic biologics. Free API access through openFDA. Supports pharmacovigilance and drug safety research.

UK Biobank
Prospective cohort of 500,000 UK participants aged 40-69 with genetic data, imaging, and longitudinal health records. Extensive phenotyping including MRI, accelerometry, and linked hospital records.

NHANES (National Health and Nutrition Examination Survey)
Unique combination of interviews and physical examinations including blood/urine samples. Covers nutrition, chronic diseases, and environmental exposures. ~5,000 participants annually.

Medicare Claims (ResDAC)
Comprehensive claims data covering 98%+ of adults 65+ in the United States. Includes inpatient, outpatient, physician, and prescription drug claims. Research Identifiable Files require 6-12 month DUA approval.
Medical Expenditure Panel Survey (MEPS)
Definitive U.S. data on healthcare expenditures, utilization, and insurance coverage. Surveys ~15,000 households annually with detailed spending by payer and service type. Free public use files.

Merative MarketScan
De-identified commercial claims from 273+ million unique patients since 1995. Includes Commercial Claims, Medicare Supplemental, and Multi-State Medicaid databases. Cited in 2,650+ peer-reviewed studies.

Behavioral Risk Factor Surveillance System (BRFSS)
World's largest ongoing health survey with 400,000+ adults annually across all states. Covers chronic conditions, risk behaviors, and preventive health. State-level estimates available.

All of Us Research Program
NIH precision medicine initiative enrolling 1+ million diverse U.S. participants. Includes EHR, surveys, wearables, and genomics. Cloud-based Researcher Workbench provides secure access.

National Health Interview Survey (NHIS)
CDC's flagship health survey covering ~35,000 households annually since 1957. Monitors health status, healthcare access, and health behaviors. Free public use files with extensive documentation.
Oregon Health Insurance Experiment
First large-scale Medicaid RCT. 90K waitlist lottery with health utilization, financial, and credit outcomes.
Marketing Mix
1 dataset
Education
6 datasets
Athey's Course Datasets
Datasets related to causal inference and experimental design from Susan Athey

Stanford GSB Experiment Collection
Datasets for experimentation analysis from Stanford Graduate School of Business
Gamified Learning
Experiments on gamification in learning environments
ASSISTments Dataset
Data from online tutoring platform for educational data mining

Computational Neuroscience
Experimental data from neural recordings and behavior
PEP Experimental Research
Experimental research datasets from Partnership for Economic Policy
Travel & Hospitality
5 datasets
Expedia Personalized Sort
10M hotel search results with ranking positions, click/booking labels, prices, quality scores, and competitive OTA information. ICDM 2013 learning-to-rank benchmark.

Expedia Hotel
Hotel booking and search data from Expedia
Fliggy Travel
Travel-related data from Alibaba's online travel platform
Fliggy Transfers
Transfer-related data (flights, ground transport) from Fliggy
Trivago RecSys 2019
16M hotel search sessions with explicit impression lists showing which accommodations were displayed. Session-based recommendation challenge with filter usage and clickout actions.
Logistics & Supply Chain
8 datasets
DataCo Supply Chain
Synthetic supply chain dataset covering sales and returns
MSOM Pharma Manufacturing (2024)
Continuous pharmaceutical manufacturing data from MSD. Real production processes for operations management research

LaDe Last-Mile Delivery
10.6M+ packages, 619k trajectories with GPS data

Amazon Last Mile
9,184 historical routes across 5 US metro areas
Drone Delivery
Drone delivery logistics and operations dataset

LaDe (Cainiao)
10.6M+ packages with 619K trajectories and GPS data from Alibaba logistics

IMF PortWatch
Daily port activity across 1,985 ports worldwide using AIS satellite data from 90,000+ vessels. Free API access.

Port of Los Angeles Port Optimizer
Container dwell times and truck turn times since 2021 from the largest US port. Real operational metrics.
Advertising
22 datasets
Real-Time Advertisers Auction
Real-time advertiser auction dataset for RTB research

Yahoo A1 Search Advertising Dataset
Search advertising competition dataset with sponsored search auction features and click outcomes
Alibaba Ads Dataset
Advertising dataset from Alibaba for ad targeting and prediction

Avazu
Dataset for click-through rate prediction on mobile ads

Soso (KDD Cup 2012)
KDD Cup 2012 Track 2 for sponsored search CTR prediction

Criteo Display Advertising
342GB total with 13 integer features, 26 hashed categorical features

Outbrain Click Prediction Dataset
Content recommendation dataset with 2 billion page views and user engagement data from Outbrain

Criteo Kaggle CTR Dataset
Standard CTR prediction benchmark with ~45 million records across 7 days, widely used for model comparison

Avazu Click-Through Rate Dataset
Mobile advertising dataset with 40+ million ad click records from Avazu mobile advertising platform

iPinYou RTB Dataset
Real-time bidding dataset from Chinese DSP with ~35GB of bid requests, impressions, clicks, and conversions with bidding prices

Harvard Dataverse Auctions
Auction-related replication datasets from Harvard Dataverse

Criteo Attribution Dataset
30 days of advertising traffic with conversion attribution data for multi-touch attribution research

iPinYou RTB
Real-time bidding (RTB) dataset for CTR prediction

Tencent Social Ads
Social ad CTR prediction dataset from Tencent

Criteo Terabyte
342GB, 45M samples with 13 integer features and 26 hashed categorical features for CTR prediction
Criteo 1TB Click Logs
World's largest public ML advertising dataset with 4+ billion events, 13 integer and 26 categorical features across 24 days
Adform Display
Display advertising dataset with impressions and clicks

Upworthy News Headlines
32,487 headline/image experiments on 538M assignments

Outbrain Click Prediction
Click prediction based on browsing history from Outbrain

Yoyi
Computational advertising dataset from Chinese ad platform
ICPSR Auction Studies
Search results for auction studies from ICPSR
Criteo Counterfactual Learning
25M logged interactions with counterfactual propensity scores. Gold standard for offline policy evaluation and causal inference in ads
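
Offline policy evaluation from logged propensities is usually introduced through inverse propensity scoring (IPS). The sketch below is a minimal, generic illustration of that estimator and its self-normalized variant. It does not use the Criteo dataset's actual schema: the function names and the three input lists (rewards, logging-policy propensities, target-policy probabilities for the logged actions) are hypothetical.

```python
from typing import Sequence


def _importance_weights(logging_probs: Sequence[float],
                        target_probs: Sequence[float]) -> list:
    """Per-interaction weights: target probability / logged propensity."""
    if len(logging_probs) != len(target_probs):
        raise ValueError("inputs must have the same length")
    if any(p <= 0 for p in logging_probs):
        raise ValueError("logged propensities must be strictly positive")
    return [t / p for t, p in zip(target_probs, logging_probs)]


def ips_estimate(rewards: Sequence[float],
                 logging_probs: Sequence[float],
                 target_probs: Sequence[float]) -> float:
    """IPS estimate of the target policy's mean reward."""
    if not rewards or len(rewards) != len(logging_probs):
        raise ValueError("rewards must be non-empty and match the other inputs")
    weights = _importance_weights(logging_probs, target_probs)
    return sum(w * r for w, r in zip(weights, rewards)) / len(rewards)


def snips_estimate(rewards: Sequence[float],
                   logging_probs: Sequence[float],
                   target_probs: Sequence[float]) -> float:
    """Self-normalized IPS: divides by the sum of weights instead of n."""
    if not rewards or len(rewards) != len(logging_probs):
        raise ValueError("rewards must be non-empty and match the other inputs")
    weights = _importance_weights(logging_probs, target_probs)
    total = sum(weights)
    if total == 0:
        raise ValueError("target policy has no overlap with the logged actions")
    return sum(w * r for w, r in zip(weights, rewards)) / total
```

When the target policy equals the logging policy, every weight is 1 and both estimators reduce to the plain average of the logged rewards, which is a convenient sanity check.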
Dataset Aggregators
24 datasets
Google Cloud Public Datasets
20+ petabytes across 200+ datasets with 1TB free BigQuery queries monthly

WRDS (Wharton Research Data Services)
350+ terabytes from CRSP, Compustat, TAQ - de facto standard for academic finance
Criteo AI Lab Datasets
World's largest public ML dataset - 1TB Click Logs with 4 billion advertising events
Alpha Vantage
NASDAQ-licensed stock data for 200,000+ tickers with free tier (25 requests/day)

Kaggle Datasets
50,000+ public datasets with free GPU notebooks and active ML community of 23M members
FRED (Federal Reserve Economic Data)
816,000+ US macroeconomic time series from 100+ sources with free API

World Inequality Database
Income and wealth inequality data for 100+ countries by Piketty, Saez, and Zucman

Hugging Face Datasets
659,000+ datasets across text, image, audio, and tabular with one-line data loaders

EU Open Data Portal
2 million datasets from 205 catalogues across 36 European countries

AWS Open Data Registry
300+ petabytes across hundreds of datasets - Common Crawl, satellite imagery, genomics
IMF Data
International macroeconomic forecasts, BOP, and financial statistics for 195 countries

UCI Machine Learning Repository
688 curated benchmark datasets since 1987 - gold standard for ML research

Papers With Code Datasets
Datasets linked to research papers, code implementations, and SOTA leaderboards
AEA Data and Code Repository
Replication packages for all AEA publications since 2019 with DOI-assigned packages

IPUMS
Harmonized microdata from US Census (1850-present), ACS, CPS, and 103+ countries' censuses

Nasdaq Data Link
250+ datasets from 400+ publishers with API access - formerly Quandl
World Bank Open Data
1,400+ development indicators for 217 economies spanning 50+ years with free API

OECD Data
Harmonized indicators for 38 member countries - gold standard for advanced economy comparisons

Zenodo
CERN-operated research data repository with DOI citations, accepts all file types up to 50GB

ICPSR
World's largest social science archive - 250,000+ files across 16,000 studies since 1962

Harvard Dataverse
Global network of 120+ Dataverse installations hosting 75,000+ datasets with free 1TB storage
Data.gov
370,000+ datasets from US federal, state, and local agencies

OpenML
ML benchmarking platform with standardized train-test splits for reproducible comparisons
Google Dataset Search
Search engine indexing 45M+ datasets from 13,000+ websites using schema.org metadata
Energy
10 datasets
CAISO OASIS
California ISO market data including prices, generation, demand, and transmission

EIA-860 Generator Inventory
Annual survey of all U.S. electric generators including capacity, technology, ownership, and location

PJM Data Miner
Comprehensive market data from PJM, the largest U.S. regional transmission organization

FERC Form 714
Hourly electricity load data from major U.S. utilities and regional transmission organizations

Ember Global Electricity Data
Monthly electricity generation, capacity, and emissions data for 200+ countries

EIA-923 Power Plant Operations
Monthly power plant fuel consumption, generation, and emissions data for all U.S. generators

NREL NSRDB (National Solar Radiation Database)
High-resolution solar irradiance data covering the Americas with 30-minute temporal resolution

EPA CEMS (Continuous Emissions Monitoring)
Hourly emissions and generation data from U.S. power plants since 1995

gridstatus
Unified Python API for accessing real-time and historical data from all major U.S. ISOs

PUDL (Public Utility Data Liberation)
Cleaned and integrated dataset combining EIA, FERC, and EPA energy data into a unified database
Content Moderation
8 datasets
FakeNewsNet
23K news articles labeled fake/real with social context. Includes PolitiFact and GossipCop sources

CoAID COVID Misinformation
4,251 news articles and 296K claims about COVID-19 healthcare misinformation. Fact-checked with ground truth labels
LIAR Fact-Checking
12.8K fact-checked political statements with speaker metadata and 6-way truthfulness labels. PolitiFact benchmark
LIAR
12.8K fact-checked political statements with speaker metadata

ETHOS Hate Speech
998 online comments labeled for hate speech detection in English. Binary and multi-label annotations

HateXplain
20K social media posts with human rationales across 10 hate speech target categories. Explainable AI for content moderation

HateDay
Global representative sample of real-world hate speech across languages. 2024 benchmark for content moderation
Hate Speech Data Catalogue
50+ hate speech datasets across languages compiled at hatespeechdata.com. Meta-resource for content moderation research
Real Estate
6 datasets
Airbnb (Inside Airbnb)
6M+ listings, 190M+ reviews with pricing and amenities

UK Land Registry Price Paid
4.3GB of UK property sales transactions going back decades, messy real-world government data

Redfin Housing Market Data
Downloadable housing market data: home prices, sales, inventory, listings by metro/city/zip. Updated weekly from MLS

Chicago Property Data
Property assessment values and sales data from Cook County
NYC Property Sales
NYC property sales transactions across all boroughs

Zillow Research Data
Home values (ZHVI), rents (ZORI), inventory, and market heat indices across US metros and zip codes
Cybersecurity
3 datasets
NIST National Vulnerability Database (NVD)
Comprehensive database of 500,000+ CVE vulnerability entries with CVSS severity scores and free API access

CFR Cyber Operations Tracker
Database of publicly known state-sponsored cyber incidents since 2005 with threat actor attribution

CSIS Significant Cyber Incidents
Curated list of major cyber attacks with losses exceeding $1 million, maintained by leading security think tank
Manufacturing
3 datasets
Bosch Production Line Performance
1.18M parts tracked across 52 workstations with 1,156 timestamp features. One of the largest real manufacturing datasets.
NASA C-MAPSS Turbofan
Run-to-failure sensor data from turbofan engines. 21 sensors over time for predictive maintenance and remaining useful life estimation.
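
For run-to-failure data, a common way to build a remaining useful life (RUL) target is to subtract each row's cycle number from the last observed cycle of the same engine. The sketch below illustrates that labeling step only. The (unit_id, cycle) tuple layout and the function name are assumptions for illustration, not the dataset's documented file format.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def add_rul_labels(rows: Iterable[Tuple[int, int]]) -> List[Tuple[int, int, int]]:
    """Attach RUL = (last cycle of the unit) - (current cycle) to each row.

    Assumes every unit was run to failure, so its last recorded cycle
    is the failure point (hypothetical input layout: (unit_id, cycle)).
    """
    rows = list(rows)
    last_cycle = defaultdict(int)
    for unit_id, cycle in rows:
        last_cycle[unit_id] = max(last_cycle[unit_id], cycle)
    return [(unit_id, cycle, last_cycle[unit_id] - cycle)
            for unit_id, cycle in rows]
```

The final row of each unit gets an RUL of 0, and earlier rows count down toward it.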
OR-Library Job Shop Benchmarks
82+ classic job shop scheduling problem instances used as benchmarks in operations research literature.
Insurance & Actuarial
12 datasets
SOA Mortality Tables
Society of Actuaries mortality tables and experience studies used as industry standards for life insurance pricing

HCUP (Healthcare Cost and Utilization Project)
Largest collection of longitudinal hospital care data in the US with 100+ million records per year covering inpatient and emergency visits

FEMA NFIP Claims & Policies
National Flood Insurance Program data with 2M+ claims since 1978 and policy-level information for flood risk modeling
CMS Medicare & Medicaid Data
Public use files from Centers for Medicare & Medicaid Services including claims data, provider statistics, and program enrollment

French Motor TPL (freMTPL2)
French motor third-party liability insurance dataset with 678K policies and claims - the standard benchmark for insurance ML papers

NHTSA FARS (Fatality Analysis Reporting System)
Complete census of fatal traffic crashes in the United States since 1975 with vehicle, person, and crash-level details

AXA Driver Telematics (Kaggle)
Driving behavior dataset with 50K driver trips characterized by second-by-second GPS coordinates for usage-based insurance

Human Mortality Database
Detailed mortality and population data for 40+ countries with life tables and exposure-to-risk calculations

EM-DAT International Disaster Database
Global database of 26K+ natural and technological disasters since 1900 with human and economic impact data

MEPS (Medical Expenditure Panel Survey)
Nationally representative survey of healthcare utilization, expenditures, insurance coverage, and health status for the US civilian population

NOAA Storm Events Database
Detailed records of significant weather events including property and crop damage estimates from 1950-present
French Motor Third Party Liability (MTPL)
678,000 motor insurance policies with 9 features and highly zero-inflated claim counts. Canonical insurance benchmark in CASdatasets and sklearn.
Social & Web
13 datasets
Pushshift Reddit Archive
5.6B comments, 651M posts since 2005. Full Reddit history for social/economic research. 100+ papers published

Yelp Dataset
Business attributes, reviews, user data, and check-ins

Wikipedia Pageviews
296B views/year since 2007. Hourly pageview data for all Wikimedia projects. Attention metrics at scale
Google Trends Datastore
Search interest data for nowcasting. Economic indicators, demand prediction, event detection

Meta (Facebook) Research
1.1B+ public FB/IG posts with engagement metrics

Facebook URL Shares
38M URLs with 10T exposure numbers, fact-checking flags, interaction types (2017-2019). Social Science One initiative

Common Crawl
250TB/month web crawl. 9.5 PB archive since 2008. Product listings, pricing, economic text at web scale

Meta Content Library
Full Facebook/Instagram public archive via ICPSR application. Posts, Pages, groups, events for academic research

US 2020 Election Study
Facebook/Instagram impact on political attitudes. Published in Science/Nature 2023. SOMAR Michigan access

Stack Overflow Data Dump
Full Q&A archive + annual developer survey (49K+ responses). Salaries, tech adoption, developer analytics

Wikipedia Full Database Dump
Complete Wikipedia content and metadata in SQL/XML format, includes all revisions and edit history

OpenStreetMap Planet
84GB PBF (2TB+ uncompressed) complete world map database with full edit history, weekly updates
SNAP Facebook Ego Networks
4K users with social circles and anonymized node features. Stanford Network Analysis Project dataset
Sports & Athletics
8 datasets
Lahman Baseball Database
Complete historical baseball statistics from 1871-2024 including batting, pitching, fielding, and salaries for every MLB player and team

Statcast / Baseball Savant
MLB's official tracking data including pitch velocity, spin rate, exit velocity, launch angle, and player positioning from 2015-present

NHL API
Official NHL statistics and play-by-play data from 2010-present including shot locations, player stats, and game events

NBA Stats API
Official NBA statistics including shot charts, play-by-play, player tracking data, and historical records

StatsBomb Open Data
Free soccer event data with 3,400+ events per match including xG, freeze-frame data, and 360 player positioning. Includes Messi's complete La Liga career and multiple World Cups

nflverse
Comprehensive NFL play-by-play data from 1999-present with EPA, win probability, and player participation data

Jeff Sackmann Tennis Data
Comprehensive tennis match results and point-by-point data from 1973-present for ATP, WTA, and Grand Slam tournaments

Retrosheet
Play-by-play data for MLB games from 1911-2024 including detailed event files, game logs, and transaction records
Geospatial
5 datasets
Predicting Poverty Replication Data
Satellite imagery and survey data from Jean et al. (Science 2016) for predicting poverty in African countries using deep learning.
GHCN-Daily
Global Historical Climatology Network with 100,000+ weather stations globally, records from 1832-present. Primary source for extreme value analysis.

World Bank Light Every Night
30 years of nighttime satellite imagery (250 terabytes) from DMSP and VIIRS sensors. Foundational dataset for using lights as GDP proxy.
ERA5 Reanalysis
Hourly global climate fields from 1950-present at 0.25° resolution. ECMWF's state-of-the-art reanalysis product.
HYADES
39,206 weather stations with annual maxima precipitation spanning 16-200 years per station. Purpose-built for extreme precipitation analysis.
App Stores
3 datasets
Apple App Store Dataset
7,200 iOS apps with pricing, ratings, genres, in-app purchases. Apple app marketplace analysis

MobileRec
19.3M user reviews from 700K users across 10K apps in 48 categories. Google Play app recommendation research

Google Play Store Dataset
2.3M apps with ratings, reviews, categories, sizes, installs. Android app marketplace data
Creator Economy
3 datasets
Creator Economy Reports
Survey-based earnings breakdowns by platform (YouTube, TikTok, Instagram, Twitch). Influencer Marketing Factory research

Patreon Creator Data
279K+ active creators with membership tiers and patron counts. Creator economy platform metrics from Graphtreon
Social Blade
Public subscriber/follower counts and growth metrics across YouTube, Twitch, Instagram, Twitter, TikTok
Fraud Detection
1 dataset
Defense Economics
7 datasets
SIPRI Arms Industry Database
Financial data on the 100 largest defense companies worldwide including revenue, profits, and employment

SIPRI Military Expenditure Database
Comprehensive annual military spending data covering all countries since 1949 in local currency, constant/current USD, and GDP shares

NATO Defence Expenditure
Standardized defense spending data for all NATO members enabling alliance burden-sharing analysis and 2% GDP target tracking

Federal Procurement Data System (FPDS)
Comprehensive U.S. government contract database with 50+ million unclassified actions and 200+ data elements per transaction

SIPRI Arms Transfers Database
Most comprehensive public source on international transfers of major conventional weapons since 1950

DoD National Defense Budget Estimates (Green Book)
Detailed U.S. defense spending by program element, military department, and appropriation from FY1945 to present

USAspending.gov
User-friendly interface to federal spending data with bulk downloads in CSV/JSON and visualization tools
Labor Markets
14 datasets
BLS JOLTS
Monthly job openings, hires, separations by industry since 2000. Bureau of Labor Statistics time series
Economic Tracker (COVID)
High-frequency economic indicators during COVID. Consumer spending, employment, small business revenue. Weekly updates.
Glassdoor Reviews
Company ratings, salary reports, interview experiences. Employer review platform data for labor analytics
CEPII BACI Trade Data
Harmonized bilateral trade flows for 200+ countries, 5,000+ products from 1995-present.

Revelio Labs COSMOS
4.1B job postings from 6.6M companies. Deduplicated, parsed, enriched workforce data (commercial/academic partnerships)

JobHop (Flanders)
2.3M occupations and 391K resumes with real career trajectories mapped to ESCO codes. Labor mobility research
Penn World Table
Cross-country GDP, productivity, capital stocks for 185 countries 1950-2023. Essential for growth research.
China Shock Data (Autor-Dorn-Hanson)
Import penetration, employment, wages by commuting zone 1990-2019. Canonical trade shock data.
Opportunity Atlas (Chetty)
Census tract-level economic mobility outcomes for 20M+ children. Earnings, incarceration, employment by parental income and race.
NBER-CES Manufacturing Database
U.S. manufacturing industry data 1958-2018. Output, employment, TFP, investment for 459 industries.
Social Capital Atlas
County and ZIP-level social capital measures from Facebook data. Economic connectedness, cohesiveness, civic engagement.
Moving to Opportunity (MTO)
HUD housing mobility experiment 1994-1998. 4,608 families randomized to housing vouchers with long-term outcomes.

Stack Overflow Developer Survey
49K+ annual responses with salaries, tech adoption, and developer analytics
The Gig Economy (Edison Research)
Marketplace-Edison Research Poll 2018 on the gig economy as a primary or secondary source of income, with breakdowns by gender, age group (18-34, 35-54, 55+), and race/ethnicity (Hispanic, African-American, White).
Telecommunications
1 dataset
Space
2 datasets
Operations & Service
6 datasets
Bank of Israel Call Center (Technion)
250K+ calls with second-level timestamps from an Israeli bank. The most extensively validated call center dataset in OR literature with ~20% abandonment.
DataMOCCA US Bank Database
220M calls across 2.5 years from multiple call center sites with skills-based routing. Accessible via SEEStat software.

Call Center Data
Daily call center performance metrics including call volumes, handle times, and agent performance.

Queue Waiting Time Prediction
Call center data with arrival time, service start/end times, waiting time, and queue length. Perfect for validating queueing theory formulas.

Call Centre Queue Simulation
One year of simulated call center data with arrival times, handle times, and outcomes. Ideal for estimating arrival rates and validating Erlang-C staffing models.
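
As a rough illustration of the Erlang-C validation mentioned above, the sketch below estimates an arrival rate from timestamps and computes the Erlang-C probability of waiting and the mean queueing delay for an M/M/c system. The function names and the input format (arrival times as plain numbers in a single time unit) are assumptions, not part of the dataset.

```python
from typing import Sequence


def estimate_arrival_rate(arrival_times: Sequence[float]) -> float:
    """Arrivals per time unit: (n - 1) inter-arrival gaps over the observed span."""
    if len(arrival_times) < 2:
        raise ValueError("need at least two arrivals")
    span = max(arrival_times) - min(arrival_times)
    if span <= 0:
        raise ValueError("arrivals must span a positive interval")
    return (len(arrival_times) - 1) / span


def erlang_c(arrival_rate: float, service_rate: float, servers: int) -> float:
    """Probability that an arriving call must wait in an M/M/c queue."""
    offered_load = arrival_rate / service_rate  # in Erlangs
    if offered_load >= servers:
        return 1.0  # unstable system: every call waits
    # Erlang-B via the standard recursion, which avoids large factorials.
    blocking = 1.0
    for n in range(1, servers + 1):
        blocking = offered_load * blocking / (n + offered_load * blocking)
    utilization = offered_load / servers
    return blocking / (1.0 - utilization * (1.0 - blocking))


def mean_wait(arrival_rate: float, service_rate: float, servers: int) -> float:
    """Mean time in queue (excluding service) for a stable M/M/c system."""
    if arrival_rate >= servers * service_rate:
        raise ValueError("system is unstable; mean wait is unbounded")
    p_wait = erlang_c(arrival_rate, service_rate, servers)
    return p_wait / (servers * service_rate - arrival_rate)
```

Comparing `mean_wait` at the estimated arrival rate against the average waiting time observed in the data gives a quick check of how well the M/M/c assumptions hold.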
Israeli Bank Call Center (Technion)
12 months of call center data with customer arrivals, service times, and agent IDs. Standard queueing inference benchmark.
Technology & Infrastructure
7 datasets
Alibaba Cluster Trace v2018
280 GB of container scheduling data from Alibaba production clusters with job DAGs and resource utilization.

Web Log Dataset
Web server traffic logs for analyzing request patterns and server queue dynamics.

Server Logs
Apache server logs with timestamps and response times for modeling web server queues.
MAWI Traffic Archive
Daily 15-minute packet traces from Japan's WIDE backbone with microsecond precision. Free download without registration.

CAIDA Anonymized Internet Traces
Backbone packet captures from 10-100 Gbps commercial links with nanosecond precision. 80-200GB per hourly trace.

Synthetic Distributed System Logs
Log data from distributed systems showing request patterns and processing times across multiple nodes.

Google Cluster Workload Traces 2019
2.4 TiB of job scheduling data from 8 Borg clusters (~12,000 machines each) with microsecond timestamps for submissions and completions.
Energy & Utilities
2 datasets
Irish CER Smart Meter Trial
Randomized controlled trial with 5,000 Irish households tracked at 30-minute intervals over 18 months (2009-2010). Experimental time-of-use tariff structures (A/B/C/D) with known peak/off-peak prices. Combines RCT rigor with revealed preference analysis of time allocation decisions.
iFlex Norway Electricity Experiment
Field experiment with Norwegian households receiving hourly electricity price signals across two winter periods (2020-2021). Clean experimental identification of household response to price variation with complete tariff documentation for budget constraint reconstruction.
E-Commerce & Retail
1 dataset
Social & Web Data
1 dataset
Geospatial & Location
1 dataset
Data Portals & Aggregators
1 dataset
ML Benchmarks
7 datasets
HumanEval
164 hand-written Python programming problems with unit tests for evaluating code generation models. The benchmark that established the pass@k metric and launched the code generation era.
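
For reference, pass@k is usually computed with the unbiased estimator 1 - C(n - c, k) / C(n, k), where n samples are generated per problem and c of them pass the unit tests. The sketch below implements that formula for a single problem. The function name is illustrative, and in practice the per-problem values are averaged over the benchmark.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: total samples generated for the problem
    c: number of those samples that pass the tests
    k: sample budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 the estimate reduces to c / n, the plain fraction of correct samples.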
MBPP (Mostly Basic Python Problems)
974 crowd-sourced Python programming tasks designed to test basic programming skills. Entry-level problems with natural language descriptions and test cases. Google Research benchmark.
SWE-bench
2,294 real GitHub issues from popular Python repositories for evaluating autonomous coding agents. Tests end-to-end software engineering including issue understanding, codebase navigation, and patch generation.
CodeContests
13,610 competitive programming problems with test cases and solutions from Codeforces, CodeChef, and other platforms. DeepMind benchmark with difficulty ratings and multiple solution languages.
MultiPL-E
HumanEval and MBPP benchmarks translated to 22+ programming languages for evaluating multilingual code generation. Enables cross-language comparison of LLM coding capabilities.
Defects4J
854 reproducible bugs from 17 real-world Java projects with regression tests. Gold standard for automated program repair (APR) and fault localization research.
BugsInPy
493 real bugs from 17 Python projects with test cases for bug detection and repair research. Python-focused counterpart to Defects4J.
Software Engineering
4 datasets
QScored
Code quality metrics from 1.1B lines of code across 85K Java projects. Static analysis scores including maintainability, technical debt, and code smells from SonarQube.
Technical Debt Dataset
Longitudinal technical debt measurements from 33 Apache projects across 700+ releases. SQALE metrics, self-admitted technical debt (SATD) from comments, and evolution tracking.
StackPilot Code Snippets
30,746 Stack Overflow code snippets with execution results and correctness labels. Enables research on code quality in crowd-sourced knowledge bases.
Copilot Telemetry Traces
Developer interaction traces with GitHub Copilot showing suggestion acceptance/rejection patterns, edit sequences, and productivity metrics. Anonymized research dataset from 20 developers.


