

Aziz Ozmen, PhD
aziz.ozmen@gc4ss.org
Senior Security Analyst
Global Center for Security Studies
Abstract
The digital transformation of retail has produced an unprecedented wealth of customer behavioral data, creating both opportunities and challenges for marketing organizations. Among the most valuable applications of this data is the estimation of Customer Lifetime Value (CLV) and the prediction of customer churn—activities that directly impact profitability and long-term business sustainability. This paper investigates the distinct yet complementary roles of Data Analysts and Data Scientists in the development and deployment of CLV estimation and churn prediction models for e-commerce environments. Moving beyond traditional RFM (Recency, Frequency, Monetary) analysis, this study explores how survival analysis, cohort analysis, and machine learning techniques—including K-means clustering, Apriori association rule learning, and gradient boosting models—enable organizations to predict which customers are at risk of churn and to personalize retention interventions. Through a synthesis of academic literature, industry implementations, and methodological frameworks from 2014 to 2022, this paper delineates the division of labor: Data Scientists engineer predictive models such as XGBoost classifiers and survival analysis frameworks, while Data Analysts perform cohort analysis using SQL, validate model outputs, and translate computational findings into actionable marketing strategies. The paper concludes by addressing persistent challenges—including the non-contractual nature of e-commerce relationships, class imbalance in churn datasets, and the interpretability of black-box models—and proposes an integrated operational framework for customer analytics.
Keywords: Customer Lifetime Value, Churn Prediction, Survival Analysis, Cohort Analysis, K-Means Clustering, Apriori Algorithm, XGBoost, Data Science, E-commerce Analytics
The aphorism that "it is cheaper to retain an existing customer than to acquire a new one" has been empirically validated across multiple industries. As documented by Reichheld and Sasser (1990), a mere 5% improvement in customer retention leads to profit increases of 85% in the banking sector, 50% in insurance brokerage, and 30% in the automotive industry. In the e-commerce sector, where customer acquisition costs continue to rise and competition intensifies, the ability to predict which customers are likely to churn—and to intervene before they do—has become a strategic imperative.
The challenge, however, is substantial. Unlike contractual settings such as telecommunications or subscription services where customers explicitly terminate their relationships, e-commerce operates in a non-contractual environment. As noted by researchers at the University of Essex, "in a non-contractual setting such as retail, customers can change their purchasing habits at any moment, and typically the longer a customer takes to make their next purchase, the lower the probability is of that customer returning at all". This uncertainty creates a censoring problem: analysts cannot definitively distinguish between a customer who has permanently churned and one who is simply experiencing a long inter-purchase pause.
The central thesis of this paper is that effective CLV estimation and churn prediction require a clear division of labor between Data Scientists, who build predictive models and algorithmic frameworks, and Data Analysts, who perform descriptive analyses, validate model outputs, and translate findings into business strategy. While machine learning models—particularly gradient boosting algorithms like XGBoost—can achieve high predictive accuracy, they require careful feature engineering, handling of class imbalance, and domain-specific validation that only human analysts can provide.
This paper is structured as follows. Section II reviews the foundational concepts of CLV estimation and churn prediction, including the specific challenges of the e-commerce context. Section III delineates the distinct roles of Data Analysts and Data Scientists within the customer analytics pipeline, drawing on industry literature and job architecture analyses. Section IV presents the technical framework, focusing on cohort analysis, K-means clustering, Apriori association rule learning, and XGBoost-based churn prediction. Section V discusses the collaborative workflow and persistent challenges, followed by a conclusion on the future trajectory of customer analytics.
Customer Lifetime Value (CLV) represents the total net profit a company can expect to generate from a customer over the entire duration of their relationship. As summarized by Jasek and colleagues (2019), CLV is "a key issue for companies that are introducing a CLV managerial approach in their online B2C relationship stores". The selection of an appropriate CLV model depends on several assumptions specific to the online retail environment, including the non-contractual nature of the relationship, continuous purchase timing (anytime, not just at regular intervals), and the variable-spending environment where transaction values fluctuate.
The academic literature has produced numerous probabilistic CLV models. In their comparative analysis of eleven selected CLV models applied to e-commerce datasets from Central and Eastern Europe—representing annual revenues in the hundreds of millions of euros and nearly 2.3 million customers—Jasek and colleagues found that the BG/NBD (Beta Geometric/Negative Binomial Distribution) and Pareto/NBD models achieved "overall good and consistent results" and could be "considered stable with significant lifts from the baseline Status quo model". These probabilistic models, originally developed by Fader and Hardie, estimate the probability that a customer is still alive (i.e., has not churned) based on historical purchase patterns.
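To make the P(alive) idea concrete, here is a minimal sketch of fitting a BG/NBD model, assuming the open-source lifetimes Python package and its bundled CDNOW sample dataset; the frequency/recency/T column names follow that package's summary-data convention and are not drawn from the studies cited above.

```python
from lifetimes import BetaGeoFitter
from lifetimes.datasets import load_cdnow_summary

# Summary data: frequency (repeat purchases), recency of last purchase,
# and T (customer age), from the CDNOW sample bundled with lifetimes.
data = load_cdnow_summary(index_col=[0])

# Fit the BG/NBD model to historical purchase patterns.
bgf = BetaGeoFitter(penalizer_coef=0.001)
bgf.fit(data["frequency"], data["recency"], data["T"])

# P(alive): the model's probability that each customer has not yet churned.
data["p_alive"] = bgf.conditional_probability_alive(
    data["frequency"], data["recency"], data["T"]
)
print(data.sort_values("p_alive").head())
```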
More recently, researchers have proposed architecture frameworks that integrate multiple analytical approaches. Abdurrahman and colleagues (2022) proposed an architecture that "evaluates business strategy using customer segmentation and customer lifetime value prediction, churn prediction, uplift modelling, and survival analysis". This integrated approach recognizes that no single method is sufficient; effective customer analytics requires a portfolio of techniques.
Customer churn—the act of a customer ending their relationship with a service provider—has been extensively studied across industries. In the telecom sector, where annual churn rates range from 20% to 40%, researchers have demonstrated that "the cost of retaining existing customers is 5–10 times lower than the cost of obtaining new customers" and that "decreasing the churn rate by 5% increases the profit from 25% to 85%".
The e-commerce churn prediction problem has distinct characteristics. Unlike telecom, where churn is often signaled by a contract cancellation, e-commerce churn must be inferred from purchasing behavior. Survival analysis, a branch of statistics originally developed for medical research, has proven particularly valuable for this context. The survival function S(t) = P(T > t) represents the probability that a customer remains active beyond time t. The hazard function, which defines the event rate at time t conditional on survival up to that time, is expressed as λ(t) = lim_{Δt→0} P(t ≤ T < t + Δt | T ≥ t) / Δt.
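As a concrete illustration of estimating S(t) under censoring, here is a minimal sketch using the lifelines library on synthetic inter-purchase durations; the 120-day observation cutoff and all data are hypothetical, not drawn from the studies discussed here.

```python
import numpy as np
from lifelines import KaplanMeierFitter

rng = np.random.default_rng(7)

# Hypothetical data: days until next purchase for 500 customers.
durations = rng.exponential(scale=60.0, size=500)

# Censoring flag: 1 = repurchase observed, 0 = no purchase by the cutoff.
# For censored customers we cannot tell churn from a long pause.
observed = (durations < 120).astype(int)
durations = np.minimum(durations, 120.0)

# Kaplan-Meier estimate of the survival function S(t) = P(T > t).
kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=observed)
print(kmf.survival_function_.head())
```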
Recent advances have combined survival analysis with deep learning. Researchers have proposed "a deep survival framework to predict which customers are at risk of stopping to purchase with retail companies in non-contractual settings". By leveraging recurrent neural networks to learn survival model parameters, this approach aims to "obtain individual-level survival models for purchasing behavior based only on individual customer behavior and avoid time-consuming feature engineering processes".
Beyond prediction, understanding customer behavior requires segmentation and pattern discovery. K-means clustering, an unsupervised learning algorithm, partitions customers into distinct groups based on behavioral similarity. As demonstrated by Husein and colleagues (2022), "a clustering technique approach is proposed to classify customer data which is evaluated using the Davies Bouldin, Calinski Harabasz and Silhouette methods to determine the optimal number of clusters". Their research found that K-means clustering produced 5 clusters with 76% better accuracy than Spectral Clustering and Gaussian Mixture Model methods.
The Apriori algorithm, originally developed by Agrawal and Srikant (1994) for market basket analysis, identifies frequent itemsets and association rules. The core principle, as articulated by Leskovec, Rajaraman, and Ullman (2014), is that "a large set cannot be frequent unless all its subsets are"—the Apriori property that enables efficient pruning of the search space. In the customer analytics context, association rules can identify products that are frequently co-purchased, enabling targeted cross-selling and personalized recommendations.
The successful deployment of customer analytics requires a clear understanding of the distinct contributions of Data Analysts and Data Scientists. As noted in industry analysis, while both roles are extremely important, "the terms 'data scientist' and 'data analyst' are often used interchangeably by marketers as if they are one and the same," leading to confusion about responsibilities and expectations.
The Data Analyst in customer analytics is primarily focused on descriptive and diagnostic analytics. According to industry analysis, "arguably the most important role of a data analyst is collecting, sorting and studying different sets of information" with the goal of "pinning down a fixed value to some process or function so it can be assessed and compared over time". The data "has to be regulated, normalized and calibrated so that it can be taken out of context and used as standalone information or paired with other data without losing its integrity".
In the specific context of churn and CLV analysis, the Analyst's core competency is cohort analysis. As documented in technical literature on SQL-based retention analysis, rolling retention (also known as cohort analysis) is defined as "the percentage of returning users measured at a regular interval, typically weekly or monthly, grouped by their sign-up week/month, also known as cohort". By grouping users based on when they signed up, analysts can gain insight into how product, marketing, and sales initiatives have impacted retention. The Analyst answers questions such as: "How well did these new users stick around compared to users that signed up a week prior?" or "How many of the dormant users who received discount offers came back and stayed on the product?"
The technical implementation of cohort analysis requires advanced SQL skills, including the use of window functions such as FIRST_VALUE to calculate first purchase dates and week number calculations. The Analyst must be proficient in creating pivot-table-style reports using SUM functions with CASE statements to produce the retention matrix that visualizes cohort retention over time.
Beyond cohort analysis, the Analyst performs:

- Descriptive reporting: retention rates, revenue summaries, and dashboards (e.g., in Tableau) for marketing stakeholders.
- Model validation: checking churn scores and CLV estimates against domain knowledge, such as known seasonal purchase cycles.
- Strategy translation: converting computational findings into actionable retention interventions, such as targeted discount offers for at-risk segments.
The Data Scientist in customer analytics represents "a kind of evolution from the traditional data or business analyst role". While formal training is similar, "the thing that sets data scientists apart is strong business acumen coupled with the ability to communicate findings to senior leaders in a way that can influence how the organization approaches a business challenge". As one industry expert describes, a data scientist is "somebody who is inquisitive, who can stare at data and spot trends. It's almost like a renaissance individual who really wants to learn and bring change to an organization".
In the churn prediction context, the Data Scientist's primary responsibility is developing and optimizing predictive models. A leading approach in the literature is the use of XGBoost (Extreme Gradient Boosting), a decision-tree-based ensemble algorithm that "accurately predicts a target class by combining simple and weak models". XGBoost has been shown to achieve high performance in churn prediction tasks. In a telecom churn prediction study, researchers proposed "a stacking model consisting of two levels with four algorithms: Xgboost (XGB), Logistic regression (LR), Decision tree (DT) and Naive Bayes classifier (NBC)". The results demonstrated that "the proposed customer churn predictions have accuracies of 96.12% and 98.09% for the original and new churn datasets, respectively".
The Data Scientist is also responsible for:

- Feature engineering: deriving behavioral predictors (recency, frequency, monetary value, and related signals) from raw transaction data.
- Segmentation and valuation: building K-means clusterings and probabilistic CLV estimates alongside churn classifiers.
- Model maintenance: hyperparameter optimization, handling of class imbalance, and ongoing validation of predictive performance.
| Feature | Customer Analytics Data Analyst | Customer Analytics Data Scientist |
|---|---|---|
| Primary Output | Cohort retention reports, dashboards, descriptive statistics | Churn prediction models, CLV estimates, customer segmentations |
| Core Tools | SQL (window functions, cohort queries), Tableau, Excel | Python (XGBoost, scikit-learn, PyTorch), R, Spark |
| Statistical Focus | Descriptive statistics, retention rates, cohort comparisons | Predictive modeling, survival analysis, hyperparameter optimization |
| Domain Knowledge | Marketing metrics, customer behavior, business KPIs | Machine learning algorithms, survival analysis, feature engineering |
| Typical Question | "What are the common characteristics of lost customers?" | "What is the probability that this new customer will churn within 90 days?" |
| Temporal Scope | Historical and current (what happened, what is happening) | Future-oriented (what will happen, what might happen) |
Cohort analysis is the foundational technique for understanding customer retention. As documented in technical literature, the process involves three main steps, condensed into a code sketch after the list:
Step 1: Bucketing visits by time period. Using SQL, the Analyst groups customer activity into weekly or monthly cohorts. A query using DATE_TRUNC or similar functions "squashes" all logins or purchases in each period into one row per customer per period.
Step 2: Normalizing visits relative to first activity. Using the FIRST_VALUE window function partitioned by customer_id and ordered by activity date, the Analyst calculates each customer's first activity date and then computes the week_number as the difference between current activity date and first activity date, divided by the number of seconds in a week.
Step 3: Creating the retention matrix. Using SUM with CASE statements, the Analyst creates a pivot table where rows represent cohorts (by first activity week) and columns represent week numbers (0, 1, 2, ...). Each cell contains the count of customers in that cohort who were active in that week. Retention percentages are calculated by dividing each week's count by the week 0 count.
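The steps above are described in SQL; as a compact illustration, here is a minimal pandas sketch of the same three steps, using synthetic transactions. The customer_id and order_date column names are hypothetical stand-ins for a real transactions table.

```python
import numpy as np
import pandas as pd

# Hypothetical transactions table: one row per purchase.
rng = np.random.default_rng(0)
tx = pd.DataFrame({
    "customer_id": rng.integers(1, 200, size=2000),
    "order_date": pd.Timestamp("2022-01-01")
                  + pd.to_timedelta(rng.integers(0, 120, size=2000), unit="D"),
})

# Step 1: bucket activity into weeks (one row per customer per week).
tx["week"] = tx["order_date"].dt.to_period("W").dt.start_time
visits = tx[["customer_id", "week"]].drop_duplicates()

# Step 2: normalize each visit relative to the customer's first activity
# (the pandas analogue of the FIRST_VALUE window function).
visits["cohort_week"] = visits.groupby("customer_id")["week"].transform("min")
visits["week_number"] = (visits["week"] - visits["cohort_week"]).dt.days // 7

# Step 3: pivot into the retention matrix and convert to percentages
# (the pandas analogue of SUM with CASE statements).
counts = visits.pivot_table(index="cohort_week", columns="week_number",
                            values="customer_id", aggfunc="nunique")
retention = counts.div(counts[0], axis=0).round(3)
print(retention)
```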
The resulting cohort analysis answers the Analyst's core question: "What are the common characteristics of lost customers?" By comparing the retention patterns of different cohorts—for example, customers acquired through different marketing channels or during different promotional periods—the Analyst can identify which acquisition strategies produce the most loyal customers.
Customer segmentation enables targeted marketing strategies. K-means clustering partitions customers into K distinct groups based on behavioral features such as recency, frequency, and monetary value (RFM), along with other derived features.
As demonstrated by Husein and colleagues (2022), the optimal number of clusters should be determined using evaluation metrics. The silhouette score measures how similar a point is to its own cluster compared to other clusters, with values ranging from -1 to +1 (higher is better). The Davies-Bouldin index measures the average similarity between clusters, with lower values indicating better separation.
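A minimal scikit-learn sketch of this metric-driven selection of K follows; the gamma-distributed RFM matrix is a hypothetical stand-in for real customer features, not data from the cited study.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score
from sklearn.preprocessing import StandardScaler

# Hypothetical RFM feature matrix: one row per customer, columns =
# recency (days), frequency (orders), monetary value (total spend).
rng = np.random.default_rng(42)
rfm = rng.gamma(shape=2.0, scale=50.0, size=(1000, 3))
X = StandardScaler().fit_transform(rfm)  # K-means is scale-sensitive

# Sweep candidate cluster counts and score each solution.
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    sil = silhouette_score(X, labels)      # higher is better (-1 to +1)
    dbi = davies_bouldin_score(X, labels)  # lower is better
    print(f"k={k}: silhouette={sil:.3f}, davies_bouldin={dbi:.3f}")
```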
The clustering process yields interpretable segments: high-value loyal customers, at-risk but historically valuable customers, low-engagement window shoppers, seasonal buyers, and bargain-seekers. Each segment suggests different retention strategies.
The Apriori algorithm, introduced by Agrawal and Srikant (1994) and comprehensively treated by Leskovec, Rajaraman, and Ullman (2014), identifies frequent itemsets—sets of items that appear together in many transactions. The algorithm operates on the Apriori property: all subsets of a frequent itemset must also be frequent. This property enables efficient pruning: once an itemset is identified as infrequent, its supersets need not be considered.
In the customer analytics context, association rules reveal product affinities. A rule such as {coffee, creamer} → {sugar} with support of 0.05 (5% of transactions contain all three items) and confidence of 0.8 (80% of transactions containing coffee and creamer also contain sugar) enables personalized recommendations and cross-selling campaigns.
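A minimal sketch of mining such rules, assuming the mlxtend library and a toy one-hot transaction matrix (the items and thresholds are illustrative only):

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Toy one-hot transaction matrix: rows = transactions, columns = items.
transactions = pd.DataFrame(
    [
        {"coffee": 1, "creamer": 1, "sugar": 1, "bread": 0},
        {"coffee": 1, "creamer": 1, "sugar": 1, "bread": 1},
        {"coffee": 1, "creamer": 0, "sugar": 0, "bread": 1},
        {"coffee": 0, "creamer": 0, "sugar": 1, "bread": 1},
        {"coffee": 1, "creamer": 1, "sugar": 0, "bread": 0},
    ],
    dtype=bool,
)

# Frequent itemsets above a minimum support threshold (Apriori pruning).
itemsets = apriori(transactions, min_support=0.2, use_colnames=True)

# Derive rules such as {coffee, creamer} -> {sugar}, filtered by confidence.
rules = association_rules(itemsets, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```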
XGBoost (Extreme Gradient Boosting) has emerged as a leading algorithm for churn prediction due to its handling of mixed data types, built-in regularization to prevent overfitting, and ability to handle missing values. As documented in the telecom churn prediction literature, XGBoost achieves high accuracy when properly tuned.
The XGBoost model optimizes a regularized objective function: L(θ) = Σ_i l(y_i, ŷ_i) + Σ_k Ω(f_k), where l is a differentiable convex loss function (e.g., log loss for binary classification) and Ω(f) = γT + (1/2)λ||w||² is the regularization term that penalizes model complexity, with T the number of leaves in a tree and w the vector of leaf weights.
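A minimal sketch of training such a classifier follows, assuming the xgboost package and a synthetic imbalanced dataset as a stand-in for real churn data; the scale_pos_weight setting and precision-recall evaluation anticipate the class-imbalance discussion in Section V.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Synthetic stand-in for an e-commerce churn dataset (~10% churners).
X, y = make_classification(
    n_samples=5000, n_features=20, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# scale_pos_weight reweights the minority (churn) class; gamma and
# reg_lambda correspond to the γ and λ terms in the objective above.
model = XGBClassifier(
    n_estimators=300,
    max_depth=4,
    learning_rate=0.1,
    scale_pos_weight=(y_tr == 0).sum() / (y_tr == 1).sum(),
    gamma=1.0,
    reg_lambda=1.0,
    eval_metric="aucpr",
)
model.fit(X_tr, y_tr)

# Evaluate with a precision-recall metric rather than raw accuracy.
probs = model.predict_proba(X_te)[:, 1]
print(f"average precision: {average_precision_score(y_te, probs):.3f}")
```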
The Data Scientist's responsibilities include:

- Tuning hyperparameters such as tree depth, learning rate, and the regularization terms γ and λ.
- Correcting for class imbalance so that the model does not simply predict the majority (non-churn) class.
- Validating predictive performance with metrics appropriate to imbalanced data, such as precision-recall curves.
- Producing interpretable outputs (e.g., SHAP explanations) that marketing stakeholders can act on.
Effective customer analytics requires seamless collaboration between Analysts and Scientists: the Analyst's cohort analyses and segment profiles inform the Scientist's feature engineering, and the Scientist's churn scores and CLV estimates return to the Analyst for validation and translation into retention campaigns. This workflow must contend with several persistent challenges.
Non-contractual uncertainty: Unlike subscription businesses where churn is explicit, e-commerce churn must be inferred. The choice of a "churn definition" (e.g., no purchase for 90 days) is arbitrary and affects model performance.
Class imbalance: In typical e-commerce datasets, churn rates may be 5–15%. Models trained on imbalanced data tend to predict the majority class. The Data Scientist must apply corrective techniques such as resampling or class weighting, and the Analyst must evaluate models using precision-recall curves rather than raw accuracy.
Seasonality: Customer purchasing behavior varies by season (holidays, sales events). A customer who appears to have churned may simply be between seasonal purchase cycles. The Analyst must account for seasonality in validation.
Interpretability vs. accuracy trade-off: Black-box models like XGBoost achieve high accuracy but are difficult to explain to marketing stakeholders. The Data Scientist may need to provide SHAP explanations (sketched below) or consider simpler, more interpretable models for certain use cases.
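Continuing the hypothetical XGBoost sketch from Section IV, a minimal illustration of SHAP-based explanation, assuming the shap package:

```python
import shap  # assumes the shap package and the fitted model from Section IV

# TreeExplainer computes exact SHAP values for tree ensembles.
explainer = shap.TreeExplainer(model)      # `model` from the XGBoost sketch
shap_values = explainer.shap_values(X_te)  # one row of attributions per customer

# Global view: which features drive churn risk across the test set.
shap.summary_plot(shap_values, X_te)
```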
The integration of advanced analytics into customer relationship management has transformed marketing from an art into a science. This paper has argued that effective CLV estimation and churn prediction require a clear division of labor between Data Analysts and Data Scientists. The Analyst performs cohort analysis, validates model outputs, and translates findings into strategy. The Scientist builds predictive models—XGBoost classifiers, survival analysis frameworks, clustering algorithms—that forecast which customers are at risk and why.
The academic literature reviewed—from probabilistic CLV models (Jasek et al., 2019) to ensemble learning for churn prediction (Telecom Churn Study, 2021) to deep survival frameworks (Equihua et al., 2022)—demonstrates significant technical progress. The foundational techniques of cohort analysis, K-means clustering, and association rule learning remain essential components of the customer analytics toolkit.
The future trajectory will likely involve tighter integration of real-time data, enabling immediate intervention when a customer exhibits churn-risk signals. Personalization at scale—delivering the right offer to the right customer at the right moment—will become increasingly automated. However, the human roles of validation, interpretation, and strategy translation will remain essential. The Data Analyst who can answer "What are the common characteristics of lost customers?" and the Data Scientist who can predict "What is the probability that this new customer will churn?" together form the backbone of data-driven customer relationship management.