Customer Segmentation with K-Means and Feature Engineering for a Streaming Platform

Leveraged K-Means clustering and advanced feature engineering on subscriber behavior data to identify high-value customer segments and power targeted retention strategies.

Python, Vertex AI, BigQuery, KMeans, Pandas, Scikit-learn, GCP, Customer Segmentation

Project Overview

This project involved developing a robust customer segmentation strategy for a major regional sports streaming network. Using unsupervised learning techniques, specifically K-Means clustering with advanced feature engineering, the solution identified distinct viewer segments based on their viewing behavior patterns.

The segmentation was designed to support personalized content recommendations, targeted retention campaigns, and churn mitigation strategies. By understanding the different types of subscribers and their unique viewing habits, the streaming platform was able to deliver more relevant experiences to each customer segment.

Role & Impact

As the lead data scientist on a cross-functional team, I was responsible for all aspects of the machine learning solution. This included developing the feature engineering approach, designing and implementing the clustering models, and orchestrating the end-to-end pipeline deployment.

The solution resulted in actionable, interpretable clusters that were immediately put to use by the marketing team for campaign targeting. The identification of "super-users" was particularly valuable for proactive retention efforts and premium feature upselling.

Business Challenge

The streaming platform faced several challenges in effectively targeting and retaining its subscribers:

  • Fragmented subscriber behavior data across multiple channels and platforms
  • No existing segmentation framework for lifecycle marketing and targeting
  • Rising customer acquisition costs making retention increasingly important
  • Need to scale personalized outreach without increasing campaign management overhead

Solution Architecture

We designed a scalable architecture on Google Cloud Platform that processed subscriber data on a weekly basis:

  • Data Source: BigQuery view serving as the source for weekly user behavior summaries
  • Orchestration: Vertex AI Pipelines with Kubeflow Pipeline components to manage the workflow
  • Storage: Cloud Storage for intermediate datasets and model artifacts
  • Activation: Results stored back in BigQuery for use in Customer Data Platform and email tools

Machine Learning Implementation

The core of the solution leveraged unsupervised learning techniques to identify natural groupings in subscriber behavior:

  • Feature Engineering: Created 19+ engineered features to capture viewing behavior patterns, including:
    • Total minutes watched per time period
    • Binge behavior metrics (consecutive viewing)
    • Sport-specific affinity scores
    • Weekday vs. weekend viewing ratios
    • Day-part preferences (morning, prime time, late night)
    • Content type preferences (live vs. replay, highlights vs. full games)
  • Dimensionality Reduction: Applied Principal Component Analysis (PCA) to reduce feature space while preserving variance
    • Optimized for maximum variance explanation with minimal components
    • Improved cluster separation and model performance
  • Cluster Analysis: Implemented K-Means with automated parameter selection
    • Used the elbow method to identify optimal number of clusters
    • Validated clusters using silhouette scores
    • Applied business-relevant labels post-clustering: Super, High, Moderate, Low, Dormant
  • Pipeline Automation: Developed a weekly pipeline that:
    1. Extracted the latest user behavior data
    2. Applied feature engineering and standardization
    3. Performed clustering and assigned segment labels
    4. Stored results with interpretable metadata
  • Strategic Filtering: Implemented business rules to filter the Dormant cluster from re-engagement campaigns to avoid unintentional churn triggers
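The modeling steps above (standardization, PCA, a scan over k, post-hoc labels, and the Dormant exclusion) can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the production pipeline: the feature columns, the 90% variance threshold, the k range, the use of total minutes to rank clusters, and the helper names `reduce_features`, `scan_k`, `label_segments`, and `campaign_audience` are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Business labels from the write-up, ordered from least to most engaged.
LABELS = ["Dormant", "Low", "Moderate", "High", "Super"]


def reduce_features(features: pd.DataFrame, variance: float = 0.90) -> np.ndarray:
    """Standardize, then keep the fewest principal components that explain
    at least `variance` of the total variance (0.90 is an assumed threshold)."""
    scaled = StandardScaler().fit_transform(features)
    return PCA(n_components=variance, svd_solver="full").fit_transform(scaled)


def scan_k(components: np.ndarray, k_values=range(2, 9), seed: int = 0) -> pd.DataFrame:
    """Fit K-Means for each k; record inertia (for an elbow plot) and silhouette."""
    rows = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(components)
        rows.append({
            "k": k,
            "inertia": model.inertia_,
            "silhouette": silhouette_score(components, model.labels_),
        })
    return pd.DataFrame(rows).set_index("k")


def label_segments(components: np.ndarray, engagement: pd.Series, seed: int = 0) -> pd.Series:
    """Cluster into len(LABELS) groups and name them by mean engagement rank."""
    model = KMeans(n_clusters=len(LABELS), n_init=10, random_state=seed).fit(components)
    clusters = pd.Series(model.labels_, index=engagement.index)
    # Order cluster ids by their mean engagement, lowest first.
    order = engagement.groupby(clusters).mean().sort_values().index
    names = {cluster_id: LABELS[rank] for rank, cluster_id in enumerate(order)}
    return clusters.map(names).rename("segment")


def campaign_audience(segments: pd.Series, excluded=("Dormant",)) -> pd.Index:
    """Ids eligible for re-engagement campaigns, with excluded segments removed."""
    return segments[~segments.isin(list(excluded))].index


# Synthetic stand-in for the weekly feature table: five engagement tiers.
rng = np.random.default_rng(0)
tier = np.repeat(np.arange(5), 40)
features = pd.DataFrame({
    "total_minutes": 20 + tier * 300 + rng.normal(0, 5, tier.size),
    "binge_sessions": tier * 2 + rng.normal(0, 0.1, tier.size),
    "weekend_ratio": 0.1 + tier * 0.2 + rng.normal(0, 0.01, tier.size),
})

components = reduce_features(features)
diagnostics = scan_k(components)  # inspect inertia and silhouette per k
segments = label_segments(components, features["total_minutes"])
audience = campaign_audience(segments)
```

Naming clusters by a ranked engagement metric is only one way to attach the Super-to-Dormant labels. It assumes the clusters order cleanly along a single axis, which real segments may not.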

Outcomes

Super User Identification

Targeted Retention

Identified "Super Users" for proactive retention offers (e.g., pause instead of cancel) and premium feature upselling.

Dormant User Management

Churn Prevention

Excluded "Dormant" users from re-engagement campaigns, since a reminder can prompt an inactive subscriber to cancel, reducing negative campaign effects.

Lifecycle Messaging

Personalized Communication

Enabled lifecycle stage-specific messaging through CDP and email journeys, improving engagement across all user segments.

Technical Challenges

Implementing this customer segmentation solution required overcoming several complex technical hurdles:

  • Sparse Behavioral Signals: Many low-activity users had minimal viewing data, making it difficult to create meaningful features. We developed specialized techniques to handle sparse behavioral signals while still capturing meaningful patterns.
  • PCA Tuning: Balancing dimensionality reduction with interpretability required careful tuning of the PCA components. Too few components lost critical information, while too many introduced noise and reduced cluster separation.
  • Pipeline Automation: Creating a reliable pipeline from BigQuery to Cloud Storage to model processing without manual intervention required robust error handling and data quality validation steps.
  • Feature Consistency: Ensuring feature definitions remained consistent across pipeline versions as new data sources became available was challenging. We implemented strict schema validation and version control for feature definitions.
  • Cluster Stability: Maintaining consistent cluster definitions over time despite evolving user behavior required techniques to align new clusters with historical segments for consistent targeting.
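The write-up does not say which techniques were used for sparse users or for schema validation, so the following is only one plausible shape for them. It assumes a hypothetical weekly summary table with the columns shown, checks that the expected columns exist and are numeric, and uses additive smoothing so that ratio features for near-empty users fall back toward a neutral 0.5 instead of becoming undefined or extreme. The column names, `prior_minutes`, and both function names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric columns expected in the weekly behavior summary.
REQUIRED_NUMERIC = ["weekday_minutes", "weekend_minutes", "live_minutes", "replay_minutes"]


def validate_schema(df: pd.DataFrame) -> None:
    """Fail fast if a required column is missing or is not numeric."""
    missing = [c for c in ["user_id", *REQUIRED_NUMERIC] if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    non_numeric = [c for c in REQUIRED_NUMERIC if not pd.api.types.is_numeric_dtype(df[c])]
    if non_numeric:
        raise TypeError(f"non-numeric columns: {non_numeric}")


def build_features(df: pd.DataFrame, prior_minutes: float = 30.0) -> pd.DataFrame:
    """Build a few behavior features that stay well-defined for sparse users.

    Each ratio is smoothed by adding `prior_minutes` of pseudo-viewing split
    50/50, so a user with no activity gets exactly 0.5 rather than 0/0.
    """
    validate_schema(df)
    total = df["weekday_minutes"] + df["weekend_minutes"]
    half_prior = 0.5 * prior_minutes
    out = pd.DataFrame({"user_id": df["user_id"]})
    out["log_total_minutes"] = np.log1p(total)  # compress the heavy right tail
    out["weekend_ratio"] = (df["weekend_minutes"] + half_prior) / (total + prior_minutes)
    out["live_share"] = (df["live_minutes"] + half_prior) / (
        df["live_minutes"] + df["replay_minutes"] + prior_minutes
    )
    return out


weekly = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "weekday_minutes": [0.0, 30.0],
    "weekend_minutes": [0.0, 240.0],
    "live_minutes": [0.0, 270.0],
    "replay_minutes": [0.0, 0.0],
})
features = build_features(weekly)
```

Here `u1` has no viewing at all and lands on 0.5 for both ratios, while `u2` keeps most of its observed weekend and live preference. A larger `prior_minutes` pulls low-activity users harder toward the neutral value.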

Design Tradeoffs & Decisions

Several key architectural and modeling decisions were made to ensure the solution was robust, interpretable, and actionable:

  • Algorithm Selection: Chose K-Means over DBSCAN for simplicity and speed in the pipeline context. While DBSCAN might have found more complex cluster shapes, K-Means ran faster and always returned a fixed number of segments, which kept results more consistent across weekly runs.
  • Dimensionality Reduction: Applied PCA for noise reduction and improved cluster separation, trading some feature interpretability for more distinct and stable clusters.
  • Feature Preservation: Stored both original features and PCA components with cluster assignments to enable downstream explainability and business interpretation of the segments.
  • Metadata Integration: Rejected reclustering based on account metadata (age, location, etc.) due to high cardinality and privacy concerns, instead focusing purely on behavioral signals.
  • Weekly Cadence: Selected a weekly processing frequency to balance freshness of segments with computational efficiency and stability of the clusters over time.
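Because the model is refit every week and K-Means cluster ids are arbitrary from one fit to the next, some step has to map each new cluster back to a historical segment. The write-up does not describe how this was done. One common option, sketched here as an assumption, is a one-to-one matching of new centroids to the previous week's centroids by Euclidean distance, using SciPy's `linear_sum_assignment`. The function name `align_clusters` and the toy centroids are hypothetical, and both sets of centroids are assumed to live in the same feature space.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def align_clusters(prev_centroids: np.ndarray, new_centroids: np.ndarray) -> dict:
    """Map each new cluster id to a previous cluster id, one-to-one, so that
    the total centroid-to-centroid distance is minimized."""
    # cost[i, j] = distance between new centroid i and previous centroid j
    diff = new_centroids[:, None, :] - prev_centroids[None, :, :]
    cost = np.linalg.norm(diff, axis=2)
    new_ids, prev_ids = linear_sum_assignment(cost)
    return {int(n): int(p) for n, p in zip(new_ids, prev_ids)}


# Toy example: this week's fit found the same three groups in a different order.
prev_centroids = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
new_centroids = np.array([[9.8, 0.3], [0.2, -0.1], [5.1, 4.7]])

mapping = align_clusters(prev_centroids, new_centroids)

# Rewrite this week's raw labels into last week's id scheme.
new_labels = np.array([0, 0, 1, 2])
aligned = np.array([mapping[int(c)] for c in new_labels])
```

If PCA is refit each week, the component axes can flip or rotate between fits, so comparing centroids in the standardized original feature space is likely safer than comparing them in PCA space.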

Technologies Used

Cloud Infrastructure

  • Google Cloud Platform
  • BigQuery
  • Cloud Storage

ML & AI

  • Vertex AI
  • Scikit-learn
  • PCA
  • K-Means

Development

  • Python
  • Pandas
  • NumPy
  • Jupyter

Integrations

  • Customer Data Platform
  • Tableau
  • Pipeline Scheduler

Why It Matters

Business Perspective

Targeting users with relevant messages across the right lifecycle stage improves retention and customer experience. One-size-fits-all messaging hurts engagement and contributes to churn.

In subscription businesses, understanding the different audience segments is critical for sustainable growth—it's far more cost-effective to retain existing subscribers than to acquire new ones.

ML/DS Perspective

Effective segmentation combines modeling with business alignment—it's not just about statistical clusters, but creating interpretable personas that marketing teams can actually use.

The true challenge of unsupervised learning isn't just finding patterns in the data—it's translating those patterns into actionable segments that align with business realities and can drive meaningful interventions.

Conclusion

This project demonstrates the power of unsupervised learning techniques to derive meaningful customer segments from complex behavioral data. The solution successfully:

  • Created a scalable ML segmentation framework embedded into weekly marketing workflows
  • Enabled better personalization and lifecycle-appropriate messaging across user types
  • Identified high-value subscribers for proactive retention interventions
  • Protected potentially vulnerable segments from counterproductive campaigns
  • Demonstrated how unsupervised learning paired with automated weekly data pipelines can unlock tangible marketing ROI

Beyond the immediate business impact, this project showcases how machine learning can transform raw behavioral data into actionable customer insights that drive meaningful business outcomes.