AI & LLM
1M real-world conversations with 25 state-of-the-art LLMs spanning 154 languages
100K real-world user dialogues with comprehensive 6-dimension quality assessment
6,800 dialogues with 5-level satisfaction scale labels across multiple domains
4,750 human-to-bot dialogues with thumbs up/down feedback plus quality scores
55K+ real-world conversations with human preference labels from Chatbot Arena
E-Commerce
100,000 orders (2016-2018) structured in 9 relational tables from Olist
Omnichannel interaction tracking with AI-driven identity resolution
2.5M customers (457k purchasers) and 31,868 SKUs from JD.com
6 billion display ad/click logs over 8 days from 100M users
30M+ browsing events with query and image vectors for e-commerce search
2.76M events (views, carts, purchases) from 1.4M visitors
3 months of obfuscated GA4 e-commerce data (Nov 2020-Jan 2021)
500k+ transactions (Mar 2016 - Aug 2018) from Pakistan's largest e-commerce platform
1.8M Amazon purchases with demographics (age, gender, location). Real household e-commerce behavior at scale
Sessions from 6 locales with 40k-500k products per locale
170,000 users' real search queries (2021-2022) from JD.com
Encoded search queries and item data for intent detection
233k human-annotated query-product judgments, 43k products
Large-scale industrial dataset from Alibaba (150GB)
Online and offline check-ins/purchases from 1,000+ stores
Clickstream data from Alibaba platforms (2018)
649M user interactions (clicks, carts, buys) on 25M items
Mobile shopping user click data on recommended items
12,330 user sessions with numerical and categorical features for purchase prediction
571M reviews (1996-2023), 33 categories, 48M items - comprehensive Amazon review dataset
5 modalities (image, text, table, video, audio), 6M+ samples for multimodal learning
Human-rated relevance scores (1-3) for search terms and products
Product information scraped from Flipkart e-commerce platform
12M+ e-commerce sessions with click → cart → order sequences. Real multi-stage conversion funnel data from German retailer
1.85M Amazon purchases from 5,027 US consumers (2018-2022) linked to demographics (age, gender, income, education)
12M German e-commerce sessions with click → cart → order sequences. RecSys 2022 competition
57 tasks, 20K questions for LLM shopping assistant benchmark
99 real A/B experiments with 24,153 time-granular snapshots for adaptive stopping research
32,487 headline/image A/B experiments with 538M assignments
Food & Delivery
3.4M orders, 206k+ users, 49k+ products with reorder behavior
Clickstream data from Ele.me food delivery platform
Cainiao Last-Mile Delivery dataset from MSOM 2018
1.69M reviews from DoorDash, Grubhub, Uber Eats. Ratings, text reviews, restaurant metadata. Gig economy platform research
Advertising
32,487 headline/image experiments on 538M assignments
342GB total with 13 integer features, 26 hashed categorical features
Advertising dataset from Alibaba for ad targeting and prediction
Display advertising dataset with impressions and clicks
Click prediction based on browsing history from Outbrain
Real-time advertiser auction dataset for RTB research
Auction-related replication datasets from Harvard Dataverse
25M logged interactions with counterfactual propensity scores. Gold standard for offline policy evaluation and causal inference in ads
Billions of bid requests with complete auction logs: bid IDs, timestamps, user agents, regions, bid/paying prices, conversions
342GB, 45M samples with 13 integer features and 26 hashed categorical features for CTR prediction
40M samples for mobile ad click-through rate prediction
64M impressions for display advertising with impressions and clicks
Education
Data from online tutoring platform for educational data mining
Experiments on gamification in learning environments
Datasets for experimentation analysis from Stanford Graduate School of Business
Datasets related to causal inference and experimental design from Susan Athey
Experimental research datasets from Partnership for Economic Policy
Experimental data from neural recordings and behavior
Fashion & Apparel
99 real e-commerce experiments with daily checkpoints from ASOS
Fashion item combinations from Alibaba for outfit recommendation
Data scraped from Victoria's Secret and other innerwear retailers
Fashion items for image classification tasks from Indonesia
Clickstream and purchase data for fashion e-commerce
Session interactions and item features from styling service
70,000 28x28 grayscale images of 10 fashion categories from Zalando
Entertainment & Media
1 billion interactions from 34 million users on 1 million micro-videos
5M search actions, 14.6M recommendation events from 25k users
Amazon product data and BeerAdvocate reviews from Stanford SNAP
Data from NetEase Cloud Music for INFORMS competition
Music sales data (digital/physical) from Bandcamp platform
150M+ listening sessions with skips, track features, and playlist context. The largest public music streaming behavior dataset
1.7M episodes/movies watched by 1,060 users over 1 year. Watch patterns, session length, preferences, predictability metrics
Hours viewed for every Netflix title (original and licensed) watched >50K hours. First public streaming metrics since 2021
1.8M videos watched by 243 users over 1.5 years. Recommendation engine performance, caching research, viewing patterns
5M videos with watch percentage, engagement maps, Freebase topic labels. Video-level engagement metrics for content research
8M videos with video-level features for large-scale video understanding. Google Research benchmark for video classification
168K nodes with mutual follower relationships. 6 ML tasks including churn, affiliate status, view count prediction
16 days of viewer counts, stream metadata, game categories from Oct 2017. Live streaming platform economics
1 billion listening events with long-term user histories. Music recommendation and listening behavior research
1M playlists with 2M unique tracks from 300K artists. RecSys 2018 Challenge for playlist continuation research
100M Flickr photos/videos with metadata under Creative Commons. Yahoo/Flickr dataset for multimedia research
Grocery & Supermarkets
Hierarchical sales data for 3,049 products across 10 stores
Unit sales data with store/item metadata and oil prices from Ecuador
E-commerce sales data from Fozzy Group retail chain in Ukraine
Office supply sales for DMDA 2023 workshop challenge
Sales data for controlled substances reported by ANVISA
Warehouse and retail liquor sales from Montgomery County, Maryland
Monthly Class E liquor sales data with volume and pricing from Iowa
50 items across 10 different stores over 5 years
Grocery purchases from Tesco stores via loyalty cards
1,115 Rossmann drug stores historical sales data
Online retail transactions (2010-2011) from UK gift retailer
Sales and inventory snapshot data from Vietnamese supermarket
Flipkart Supermart transaction and product details
Weekly scanner data on soft drink purchases from Dominick's Finer Foods
Financial Services
Banking transactions and product purchases with hashed identifiers
21M+ public company filings since 1994. 10-Ks, 8-Ks, proxy statements. Full text + structured XBRL data
590K card-not-present transactions with 393 features from Vesta Corp. Real messy fraud data (3.5% fraud rate)
6M+ mobile money transactions simulating real fraud patterns. Agent-based model calibrated on real African mobile money logs
2.7M loans (2007-2019) with 151 features. Interest rates, credit scores, defaults. The canonical P2P lending dataset for credit risk modeling
9 consolidated fraud datasets with unified format. Includes IEEE-CIS, credit card, e-commerce fraud. Benchmark for fraud ML research
13 years of Bitcoin transaction graphs (2009-2022). Complete blockchain with labeled entities. Network economics at scale
NASDAQ limit order book data at millisecond precision. Level 1-10 depth, message-by-message reconstruction. Market microstructure research
4.3M samples of NASDAQ Nordic limit order book data. 10 depth levels, 5 stocks, normalized features. Benchmark for price prediction
113K P2P loans with borrower characteristics, credit grades, and loan outcomes. Alternative to LendingClub for P2P lending research
8 complete blockchain histories (Bitcoin, Ethereum, etc.) with daily updates. Transaction-level data for crypto economics research
2.7M loans (2007-2019) with borrower income, DTI, employment, defaults, and state macro variables
9 consolidated datasets for standardized fraud detection evaluation including IEEE-CIS (590K transactions)
6M+ mobile money transactions from agent-based model calibrated on African mobile money logs
13 years (2009-2021) of entity-level Bitcoin transaction networks with BTC/USD values
Automotive
Classic dataset (1971-1990) for demand model estimation
Vehicle sales data for Telangana, India (2023)
Travel & Hospitality
Transfer-related data (flights, ground transport) from Fliggy
Transportation & Mobility
All US flights since 1987. Delays, cancellations, fares, capacity. Revenue management research goldmine
ADAS-equipped vehicles with driving behavior events
Vehicle trajectory data for traffic flow modeling
GPS trace data from Grab ride-hailing platform
3B+ taxi and rideshare trips since 2009. Fares, tips, surge pricing, driver pay. The gold standard for marketplace economics
Zone-to-zone travel times and street speeds for 50+ cities worldwide. Congestion patterns from actual Uber rides
100M+ rideshare trips with fares (unlike NYC, which lacks fare data). Trip-level pricing for Uber/Lyft economic analysis
Billions of GPS points and ride trajectories from China's largest ride-hailing platform. Driver behavior and urban mobility patterns
Millions of GPS traces from Southeast Asian ride-hailing platform
Auctions & Marketplaces
Collection of datasets from eBay and experimental auctions
Public procurement data from ProZorro system
Artists for Lahaina benefit art auction data (2023)
Listings from PakWheels Pakistani automobile marketplace
87+ auctions (1994-present) with round-by-round bidding data. Complete bid histories, reserve prices, winners. Auction theory empirics
800K+ procurement notices annually. All EU public contracts above thresholds. Structured XML since 2006. Cross-country procurement research
Logistics & Supply Chain
10.6M+ packages, 619k trajectories with GPS data
Synthetic supply chain dataset covering sales and returns
Continuous pharmaceutical manufacturing data from MSD. Real production processes for operations management research
9,184 routes across 5 US metros with stop-level coordinates, service times, and driver knowledge
10.6M+ packages with 619K trajectories and GPS data from Alibaba logistics
Real Estate
6M+ listings, 190M+ reviews with pricing and amenities
Property assessment values and sales data from Cook County
NYC property sales transactions across all boroughs
4.3GB of UK property sales transactions going back decades. Messy real-world government data
Downloadable housing market data: home prices, sales, inventory, listings by metro/city/zip. Updated weekly from MLS
Home values (ZHVI), rents (ZORI), inventory, and market heat indices across US metros and zip codes
Social & Web
296B views/year since 2007. Hourly pageview data for all Wikimedia projects. Attention economics at scale
5.6B comments, 651M posts since 2005. Full Reddit history for social/economic research. 100+ papers published
Full Q&A archive + annual developer survey (49K+ responses). Salaries, tech adoption, developer economics
250TB/month web crawl. 9.5 PB archive since 2008. Product listings, pricing, economic text at web scale
Search interest data for nowcasting. Economic indicators, demand prediction, event detection
1.1B+ public FB/IG posts with engagement metrics
84GB PBF (2TB+ uncompressed) complete world map database with full edit history, weekly updates
Complete Wikipedia content and metadata in SQL/XML format, includes all revisions and edit history
Full Facebook/Instagram public archive via ICPSR application. Posts, Pages, groups, events for academic research
38M URLs with 10T exposure numbers, fact-checking flags, interaction types (2017-2019). Social Science One initiative
4K users with social circles and anonymized node features. Stanford Network Analysis Project dataset
Facebook/Instagram impact on political attitudes. Published in Science/Nature 2023. SOMAR Michigan access
Data Portals
Search ranking, translation quality, and ML task datasets
Research tools and datasets across multiple domains
Open dataset portal for e-commerce and logistics from JD.com
E-commerce, advertising, and multimedia datasets from Rakuten
AI, data science, healthcare, and weather datasets from IBM
AI, NLP, computer vision, and autonomous driving datasets
E-commerce and recommendation system datasets
Reviews, recommendations, and social network data
Time series data for forecasting competitions (M1-M5)
Datasets from ACM Recommender Systems challenges
INFORMS conference with data-focused opportunities
Platform for competitions, benchmarks, and reproducible research
1M users, 6 months of real e-commerce behavior logs with 5 event types for universal behavioral modeling
2.3M users, 380M+ news impressions from Ekstra Bladet for news recommendation research
57 tasks, 20K questions derived from real Amazon shopping data for LLM shopping assistants
Western US water supply data from Bureau of Reclamation, $500K prize pool for seasonal forecasting
Eclectic mix of economic, demographic, and enterprise data from NBER-affiliated research projects
All federal contracts, grants, loans since 2001. 400+ variables, $50T+ in awards. Government procurement economics
8M+ establishments with firm age data. Job creation/destruction, startups, exits. Longitudinal firm dynamics since 1977
13M+ US patents (1976-present) with citations, inventors, assignees. Full patent text and claims. Innovation economics at scale
Universal search engine for datasets across the web. Meta-tool for discovering research data
ML/NLP datasets hub with 100K+ datasets. Easy loading via Python library. Community-driven repository
Healthcare
Complete US organ donation records since 1987. Waiting lists, donor-recipient matches, outcomes. Market design and matching research
Hospital pricing data mandated since 2021. Negotiated rates, chargemaster prices across 6,000+ hospitals. Healthcare pricing research
All Medicare providers with service utilization and payment data. CMS public use files for healthcare economics
App Stores
19.3M user reviews from 700K users across 10K apps in 48 categories. Google Play app recommendation research
2.3M apps with ratings, reviews, categories, sizes, installs. Android app marketplace data
7,200 iOS apps with pricing, ratings, genres, in-app purchases. Apple app marketplace analysis
Labor Markets
2.3M occupations and 391K resumes with real career trajectories mapped to ESCO codes. Labor mobility research
Monthly job openings, hires, separations by industry since 2000. Bureau of Labor Statistics time series
Company ratings, salary reports, interview experiences. Employer review platform data for labor economics
4.1B job postings from 6.6M companies. Deduplicated, parsed, enriched workforce data (commercial/academic partnerships)
49K+ annual responses with salaries, tech adoption, and developer economics
Content Moderation
20K social media posts with human rationales across 10 hate speech target categories. Explainable AI for content moderation
Global representative sample of real-world hate speech across languages. 2024 benchmark for content moderation
998 online comments labeled for hate speech detection in English. Binary and multi-label annotations
50+ hate speech datasets across languages compiled at hatespeechdata.com. Meta-resource for content moderation research
4,251 news articles and 296K claims about COVID-19 healthcare misinformation. Fact-checked with ground truth labels
12.8K fact-checked political statements with speaker metadata and 6-way truthfulness labels. Politifact benchmark
23K news articles labeled fake/real with social context. Includes PolitiFact and GossipCop sources
4,251 news articles and 296K claims about COVID-19 healthcare misinformation
12.8K fact-checked political statements with speaker metadata
Creator Economy
279K+ active creators with membership tiers and patron counts. Creator economy platform metrics from Graphtreon
Public subscriber/follower counts and growth metrics across YouTube, Twitch, Instagram, Twitter, TikTok
Survey-based earnings breakdowns by platform (YouTube, TikTok, Instagram, Twitch). Influencer Marketing Factory research