Clustering & Segmentation
Group similar customers to personalize experiences • 61 papers
Customer Segmentation
Group customers by behavior for targeting
RFM Analysis for Customer Segmentation
Recency-Frequency-Monetary framework still widely used in retail and e-commerce.
Customer Lifetime Value Segmentation
Probability models (BG/NBD) for CLV estimation enabling value-based segmentation.
Spotify's Discover Weekly: Machine Learning Meets Human Curation
Clustering taste profiles to power personalized playlists at scale.
Customer Lifetime Value: Modeling and Recommendations
Framework for predicting individual-level CLV using past transaction data—foundational for value-based marketing.
Counting Your Customers: Who Are They and What Will They Do Next?
Pareto/NBD model for customer counting and transaction prediction—still used at scale in industry.
A Model of Customer Lifetime Value
Linking customer acquisition, retention, and expansion to CLV—connects marketing spend to lifetime value.
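The RFM framework listed above reduces to three per-customer numbers and a ranking step. The sketch below is a minimal illustration, assuming a hypothetical transaction list of (customer, date, amount) tuples and tercile-style rank scores; the function names and the sample data are invented for the example, and production systems usually use quantiles over far more customers.

```python
from collections import defaultdict
from datetime import date


def rfm_table(transactions, today):
    """Return {customer: (recency_days, frequency, monetary)}."""
    last_purchase = {}
    frequency = defaultdict(int)
    monetary = defaultdict(float)
    for customer, day, amount in transactions:
        last_purchase[customer] = max(last_purchase.get(customer, day), day)
        frequency[customer] += 1
        monetary[customer] += amount
    return {
        c: ((today - last_purchase[c]).days, frequency[c], monetary[c])
        for c in last_purchase
    }


def rank_score(values, higher_is_better=True, bins=3):
    """Rank-based score from 1 (worst) to `bins` (best)."""
    ordered = sorted(values, key=values.get, reverse=not higher_is_better)
    n = len(ordered)
    return {c: 1 + (i * bins) // n for i, c in enumerate(ordered)}


# Hypothetical sample data, for illustration only.
transactions = [
    ("ana", date(2024, 1, 5), 40.0),
    ("ana", date(2024, 3, 1), 60.0),
    ("ana", date(2024, 3, 20), 50.0),
    ("bo", date(2023, 11, 2), 20.0),
    ("cy", date(2024, 2, 10), 200.0),
    ("cy", date(2024, 2, 25), 100.0),
]
table = rfm_table(transactions, today=date(2024, 4, 1))

# Recent purchases are better, so a low recency value earns a high score.
r = rank_score({c: v[0] for c, v in table.items()}, higher_is_better=False)
f = rank_score({c: v[1] for c, v in table.items()})
m = rank_score({c: v[2] for c, v in table.items()})
segments = {c: f"{r[c]}{f[c]}{m[c]}" for c in table}
```

Each customer ends up with a three-digit code such as "332", which can then be mapped to named segments or fed into a clustering step.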
Classical Clustering Algorithms
Foundational partitioning, hierarchical, and density-based clustering algorithms
Algorithm AS 136: A K-Means Clustering Algorithm
Hartigan and Wong's k-means implementation, which converges to a local optimum; still the default algorithm in R's kmeans.
Hierarchical Grouping to Optimize an Objective Function
Ward's method for agglomerative clustering minimizing within-cluster variance.
Scalable K-Means++
Parallel k-means initialization achieving logarithmic rounds—default in Spark MLlib.
A Density-Based Algorithm for Discovering Clusters (DBSCAN)
Density-based clustering finding arbitrarily shaped clusters and handling noise—foundational for spatial clustering.
BIRCH: An Efficient Data Clustering Method for Very Large Databases
Hierarchical clustering using CF-trees for single-scan scalability—enables clustering of millions of records.
OPTICS: Ordering Points To Identify the Clustering Structure
Density-based ordering that exposes the cluster hierarchy without committing to a single epsilon; extends DBSCAN to clusters of varying density.
Mean Shift: A Robust Approach Toward Feature Space Analysis
Non-parametric mode-seeking algorithm for clustering without specifying number of clusters.
Web-Scale K-Means Clustering
Mini-batch k-means enabling streaming updates—Google's approach for web-scale clustering.
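The partitioning idea behind the k-means entries above can be sketched in a few lines. This is a plain Lloyd-style iteration, not the Hartigan-Wong AS 136 procedure and not the mini-batch variant; initial centers are supplied by the caller (a k-means++ style seeding step would go there), and the function names and sample points are assumptions made for the example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centers, max_iters=100):
    """Alternate assignment and mean-update steps until centers stop moving."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(old)  # keep an empty cluster's center in place
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


# Two well-separated groups, for illustration only.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(points, centers=[(0, 0), (10, 10)])
```

The result depends on the starting centers, which is why the initialization papers in this section matter in practice.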
Model-Based Clustering
Use probabilistic models for clustering
Finite Mixture Models
Comprehensive treatment of Gaussian and non-Gaussian mixture estimation.
Latent Class Analysis
Foundational discrete mixture model for categorical response patterns.
Variational Inference for Dirichlet Process Mixtures
Scalable non-parametric Bayesian clustering without specifying K.
Latent Dirichlet Allocation
Generative probabilistic model for topic modeling—foundational for discovering latent topics in text collections.
Hierarchical Dirichlet Processes
Non-parametric Bayesian approach for sharing clusters across grouped data—automatic topic number selection.
Model-Based Clustering, Discriminant Analysis, and Density Estimation
Gaussian mixture model framework with automatic model selection via BIC—implemented in R's mclust package.
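Model selection by BIC, as mentioned for Gaussian mixtures above, can be illustrated with a small one-dimensional EM fit. This is a sketch under simplifying assumptions (1-D data, caller-supplied initial means, unit initial variances, a fixed number of iterations); the function names and data are hypothetical, and the BIC here uses the "lower is better" sign convention, whereas some packages report the negated quantity.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(xs, init_means, iters=50):
    """EM for a 1-D Gaussian mixture; returns (weights, means, variances, loglik)."""
    k = len(init_means)
    weights = [1.0 / k] * k
    means = list(init_means)
    variances = [1.0] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a collapsed component
    loglik = sum(
        math.log(sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)))
        for x in xs
    )
    return weights, means, variances, loglik


def bic(loglik, n_params, n_points):
    """Bayesian information criterion, lower is better."""
    return -2 * loglik + n_params * math.log(n_points)


# Illustrative data with two obvious groups.
xs = [-0.2, 0.0, 0.1, 0.3, -0.1, 4.8, 5.0, 5.1, 5.3, 4.9]
_, means1, _, loglik1 = fit_gmm_1d(xs, [2.5])
_, means2, _, loglik2 = fit_gmm_1d(xs, [0.0, 5.0])

# Free parameters: k means + k variances + (k - 1) mixing weights.
bic1 = bic(loglik1, n_params=2, n_points=len(xs))
bic2 = bic(loglik2, n_params=5, n_points=len(xs))
```

On this data the two-component model gets the lower BIC, so it would be selected over the single Gaussian.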
Embedding-Based Clustering
Cluster using learned representations
Deep Embedded Clustering (DEC)
Joint representation learning and clustering via autoencoders.
Spectral Clustering and the High-Dimensional Stochastic Blockmodel
Theoretical foundation for spectral methods in network/embedding clustering.
Contrastive Clustering
Self-supervised contrastive objectives for cluster-friendly representations.
On Spectral Clustering: Analysis and an Algorithm
Foundational spectral clustering using Laplacian eigenvectors; published at NIPS 2001 and widely implemented.
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments (SwAV)
Self-supervised visual learning via online clustering; state-of-the-art for unsupervised image representations at publication (2020).
SCAN: Learning to Classify Images without Labels
Two-step unsupervised classification via representation learning then clustering—strong ImageNet results.
Variational Deep Embedding (VaDE)
Clustering with a variational autoencoder whose latent prior is a Gaussian mixture, trained end to end.
Deep Clustering for Unsupervised Learning of Visual Features (DeepCluster)
Alternates k-means clustering of CNN features with training on the resulting pseudo-labels for unsupervised feature learning; from Facebook AI Research.
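For the Deep Embedded Clustering entry above, the clustering half of the objective can be sketched without a neural network. The code assumes the embeddings have already been produced by some encoder and shows only the Student's t soft assignment and the sharpened target distribution as described in the DEC paper; the function names, the sample embeddings, and the centroids are made up for illustration, and the KL-divergence training loop is omitted.

```python
def soft_assignments(embeddings, centroids, alpha=1.0):
    """q[i][j]: Student's t similarity of embedding i to centroid j, row-normalized."""
    q = []
    for z in embeddings:
        kernel = [
            (1.0 + sum((a - b) ** 2 for a, b in zip(z, mu)) / alpha)
            ** (-(alpha + 1.0) / 2.0)
            for mu in centroids
        ]
        total = sum(kernel)
        q.append([k / total for k in kernel])
    return q


def target_distribution(q):
    """p[i][j] proportional to q[i][j]**2 / f_j, where f_j is the soft cluster size."""
    cluster_mass = [sum(row[j] for row in q) for j in range(len(q[0]))]
    p = []
    for row in q:
        weights = [qij ** 2 / fj for qij, fj in zip(row, cluster_mass)]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p


# Hypothetical 2-D embeddings and centroids, for illustration only.
embeddings = [(0.0, 0.0), (0.2, 0.0), (4.0, 4.0), (4.1, 3.9)]
centroids = [(0.0, 0.0), (4.0, 4.0)]
q = soft_assignments(embeddings, centroids)
p = target_distribution(q)
hard_labels = [max(range(len(row)), key=row.__getitem__) for row in q]
```

Training then nudges the encoder and centroids so that q moves toward the sharper p, which is what couples representation learning to clustering.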
Segmentation for Targeting
Create actionable segments for personalization
Heterogeneous Treatment Effects and Optimal Targeting
Causal forests for estimating HTEs and deriving optimal targeting policies.
Uplift Modeling for Clinical Trial Data
Foundational uplift/treatment-effect modeling enabling segment-specific interventions.
Personalization at Spotify Using Cassandra
Large-scale user segmentation powering real-time recommendations.
Netflix Artwork Personalization
Segment-based image selection improving engagement through visual personalization.
A Contextual-Bandit Approach to Personalized News Article Recommendation (LinUCB)
Linear UCB algorithm for personalization at Yahoo—foundational contextual bandit for segment-based recommendations.
The Microsoft Decision Service
Production system for personalization via contextual bandits—deployed across Microsoft products.
Online Clustering of Bandits
Dynamic user clustering for bandits—learns segment structure while optimizing recommendations.
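The LinUCB entry above can be made concrete with a small sketch of the disjoint-arm variant: each arm keeps a ridge-regression estimate of reward and adds an exploration bonus based on uncertainty in the current context. The class name, the toy two-segment simulation, and the reward rule are all assumptions for illustration; the inverse matrix is maintained with a Sherman-Morrison rank-one update to keep the example dependency-free.

```python
import math


class LinUCBArm:
    """One arm of disjoint LinUCB, storing A^-1 and b for ridge regression."""

    def __init__(self, dim, alpha=1.0):
        self.alpha = alpha
        self.a_inv = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim

    def _a_inv_times(self, vec):
        return [sum(row[j] * vec[j] for j in range(len(vec))) for row in self.a_inv]

    def ucb(self, x):
        theta = self._a_inv_times(self.b)       # ridge estimate A^-1 b
        a_inv_x = self._a_inv_times(x)
        mean = sum(t * xi for t, xi in zip(theta, x))
        width = math.sqrt(sum(xi * v for xi, v in zip(x, a_inv_x)))
        return mean + self.alpha * width

    def update(self, x, reward):
        # Sherman-Morrison update of A^-1 after A <- A + x x^T.
        a_inv_x = self._a_inv_times(x)
        denom = 1.0 + sum(xi * v for xi, v in zip(x, a_inv_x))
        for i in range(len(x)):
            for j in range(len(x)):
                self.a_inv[i][j] -= a_inv_x[i] * a_inv_x[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]


def choose(arms, x):
    return max(range(len(arms)), key=lambda a: arms[a].ucb(x))


# Toy simulation: two one-hot user segments, and arm k pays off only for segment k.
arms = [LinUCBArm(dim=2), LinUCBArm(dim=2)]
contexts = [[1.0, 0.0], [0.0, 1.0]]
for t in range(50):
    segment = t % 2
    x = contexts[segment]
    arm = choose(arms, x)
    reward = 1.0 if arm == segment else 0.0
    arms[arm].update(x, reward)

final_choices = [choose(arms, x) for x in contexts]
```

After a few rounds of exploration the policy settles on a different arm per segment, which is the sense in which contextual bandits learn segment-specific treatments.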
Music & Audio Clustering
Organize music and audio content for discovery and recommendations
Million Song Dataset
Benchmark dataset for music analysis enabling audio feature clustering research at scale.
Content-Based Music Information Retrieval: Current Directions and Future Challenges
Survey of audio feature extraction for music similarity and clustering—foundational for MIR.
WaveNet: A Generative Model for Raw Audio
Deep generative model for audio—enables learned audio embeddings for clustering.
librosa: Audio and Music Signal Analysis in Python
Standard Python library for audio feature extraction—MFCCs, spectrograms for clustering.
Automatic Tagging Using Deep Convolutional Neural Networks
CNN for music auto-tagging enabling tag-based clustering and organization.
Spotify's Audio Features and Track Analysis
Audio analysis API powering playlist generation and music clustering at scale.
Music Genre Classification with the Million Song Dataset
Benchmark for genre classification—evaluates clustering approaches on real music data.
Video & Movie Clustering
Organize video content for streaming recommendations
Deep Neural Networks for YouTube Recommendations
Two-stage deep architecture (candidate generation followed by ranking) whose learned user and video embeddings drive recommendations at YouTube scale.
The Netflix Recommender System: Algorithms, Business Value, and Innovation
Overview of Netflix's recommendation system including movie clustering and personalization.
YouTube-8M: A Large-Scale Video Classification Benchmark
8 million videos with labels for video understanding—benchmark for video clustering research.
Embarrassingly Shallow Autoencoders for Sparse Data (EASE)
Shallow linear item-item model with a closed-form solution for collaborative filtering on sparse data; competitive with deep models on benchmarks such as the Netflix Prize dataset.
Matrix Factorization Techniques for Recommender Systems
Latent-factor models that were central to the Netflix Prize-winning solution; foundational for content clustering.
Variational Autoencoders for Collaborative Filtering
VAE with a multinomial likelihood for collaborative filtering on implicit feedback; co-authored by Netflix researchers.
Game & UGC Clustering
Organize games, user-generated content, and player experiences
Player Modeling in Video Games
Survey of player modeling techniques including behavior clustering and segmentation.
Predicting Player Churn in Video Games Using Survival Analysis and Clustering
Combining survival models with player clusters for churn prediction in games.
Analyzing User Behavior in MMORPGs
Foundational study of player motivations and behavioral clustering in online games.
Deep Learning for Video Game Content Generation
Survey of ML for game content—includes UGC clustering and content organization.
Game Data Mining
Comprehensive guide to mining game data including player segmentation techniques.
Text & Document Clustering
Organize text content and documents for search and discovery
A Dirichlet Multinomial Mixture Model-based Approach for Short Text Clustering (GSDMM)
Collapsed Gibbs sampling for short text clustering—handles sparse, short documents like tweets.
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Efficient sentence embeddings for semantic similarity—enables fast document clustering.
Self-Training with Contrastive Clustering for Short Text Clustering (STC2)
State-of-the-art short text clustering combining contrastive learning with self-training.
A Survey of Text Clustering Algorithms
Comprehensive survey of document clustering methods from TF-IDF to neural approaches.
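As a baseline for the text clustering entries above, documents can be turned into TF-IDF vectors and compared by cosine similarity before any clustering algorithm is applied. The sketch below assumes whitespace tokenization and one common TF-IDF weighting (term frequency times log inverse document frequency); the function names and sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append(
            {
                term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
                for term, count in counts.items()
            }
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Illustrative documents covering two topics.
docs = [
    "cheap flights to paris",
    "paris flights and hotels",
    "python clustering tutorial",
    "clustering text with python",
]
vectors = tfidf_vectors(docs)
travel_similarity = cosine(vectors[0], vectors[1])
cross_topic_similarity = cosine(vectors[0], vectors[2])
```

Very short texts share few terms, so many pairs score exactly zero; that sparsity is the problem the short-text and sentence-embedding papers in this section address.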
Visual Content Clustering
Organize images and visual content for discovery and search
Unifying Visual Embeddings for Visual Search at Pinterest
Multi-task visual embeddings for image clustering and similarity search at Pinterest scale.
Deep Residual Learning for Image Recognition (ResNet)
ResNet architecture enabling powerful visual features for image clustering.
Learning Transferable Visual Models From Natural Language Supervision (CLIP)
Vision-language model enabling zero-shot image classification from text prompts; its embeddings are also widely used for image clustering.
Pinterest Visual Search: The Evolution and Beyond
Evolution of visual search and image clustering at Pinterest—practical lessons at scale.