

Aziz Ozmen, PhD
aziz.ozmen@gc4ss.org
Senior Security Analyst
Global Center for Security Studies
Abstract
The digital transformation of retail has produced an unprecedented wealth of customer behavioral data, creating both opportunities and challenges for marketing organizations. Among the most valuable applications of this data is the estimation of Customer Lifetime Value (CLV) and the prediction of customer churn—activities that directly impact profitability and long-term business sustainability. This paper investigates the distinct yet complementary roles of Data Analysts and Data Scientists in the development and deployment of CLV estimation and churn prediction models for e-commerce environments. Moving beyond traditional RFM (Recency, Frequency, Monetary) analysis, this study explores how survival analysis, cohort analysis, and machine learning techniques—including K-means clustering, Apriori association rule learning, and gradient boosting models—enable organizations to predict which customers are at risk of churn and to personalize retention interventions. Through a synthesis of academic literature, industry implementations, and methodological frameworks from 2014 to 2022, this paper delineates the division of labor: Data Scientists engineer predictive models such as XGBoost classifiers and survival analysis frameworks, while Data Analysts perform cohort analysis using SQL, validate model outputs, and translate computational findings into actionable marketing strategies. The paper concludes by addressing persistent challenges—including the non-contractual nature of e-commerce relationships, class imbalance in churn datasets, and the interpretability of black-box models—and proposes an integrated operational framework for customer analytics.
Keywords: Customer Lifetime Value, Churn Prediction, Survival Analysis, Cohort Analysis, K-Means Clustering, Apriori Algorithm, XGBoost, Data Science, E-commerce Analytics
The aphorism that "it is cheaper to retain an existing customer than to acquire a new one" has been empirically validated across multiple industries. As documented by Reichheld and Sasser (1990), a mere 5% improvement in customer retention leads to profit increases of 85% in the banking sector, 50% in insurance brokerage, and 30% in the automotive industry. In the e-commerce sector, where customer acquisition costs continue to rise and competition intensifies, the ability to predict which customers are likely to churn—and to intervene before they do—has become a strategic imperative.
The challenge, however, is substantial. Unlike contractual settings such as telecommunications or subscription services where customers explicitly terminate their relationships, e-commerce operates in a non-contractual environment. As noted by researchers at the University of Essex, "in a non-contractual setting such as retail, customers can change their purchasing habits at any moment, and typically the longer a customer takes to make their next purchase, the lower the probability is of that customer returning at all". This uncertainty creates a censoring problem: analysts cannot definitively distinguish between a customer who has permanently churned and one who is simply experiencing a long inter-purchase pause.
The central thesis of this paper is that effective CLV estimation and churn prediction require a clear division of labor between Data Scientists, who build predictive models and algorithmic frameworks, and Data Analysts, who perform descriptive analyses, validate model outputs, and translate findings into business strategy. While machine learning models—particularly gradient boosting algorithms like XGBoost—can achieve high predictive accuracy, they require careful feature engineering, handling of class imbalance, and domain-specific validation that only human analysts can provide.
This paper is structured as follows. Section II reviews the foundational concepts of CLV estimation and churn prediction, including the specific challenges of the e-commerce context. Section III delineates the distinct roles of Data Analysts and Data Scientists within the customer analytics pipeline, drawing on industry literature and job architecture analyses. Section IV presents the technical framework, focusing on cohort analysis, K-means clustering, Apriori association rule learning, and XGBoost-based churn prediction. Section V discusses the collaborative workflow and persistent challenges, followed by a conclusion on the future trajectory of customer analytics.
Customer Lifetime Value (CLV) represents the total net profit a company can expect to generate from a customer over the entire duration of their relationship. As summarized by Jasek and colleagues (2019), CLV is "a key issue for companies that are introducing a CLV managerial approach in their online B2C relationship stores". The selection of an appropriate CLV model depends on several assumptions specific to the online retail environment, including the non-contractual nature of the relationship, continuous purchase timing (anytime, not just at regular intervals), and the variable-spending environment where transaction values fluctuate.
The academic literature has produced numerous probabilistic CLV models. In their comparative analysis of eleven selected CLV models applied to e-commerce datasets from Central and Eastern Europe—representing annual revenues in the hundreds of millions of euros and nearly 2.3 million customers—Jasek and colleagues found that the BG/NBD (Beta Geometric/Negative Binomial Distribution) and Pareto/NBD models achieved "overall good and consistent results" and could be "considered stable with significant lifts from the baseline Status quo model". These probabilistic models, originally developed by Fader and Hardie, estimate the probability that a customer is still alive (i.e., has not churned) based on historical purchase patterns.
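To make the P(alive) idea concrete, here is a minimal sketch of fitting a BG/NBD model, assuming the open-source lifetimes Python package and its bundled CDNOW sample dataset; the frequency/recency/T column names follow that package's summary-data convention and are not drawn from the studies cited above.

```python
from lifetimes import BetaGeoFitter
from lifetimes.datasets import load_cdnow_summary

# Summary data: frequency (repeat purchases), recency of last purchase,
# and T (customer age), from the CDNOW sample bundled with lifetimes.
data = load_cdnow_summary(index_col=[0])

# Fit the BG/NBD model to historical purchase patterns.
bgf = BetaGeoFitter(penalizer_coef=0.001)
bgf.fit(data["frequency"], data["recency"], data["T"])

# P(alive): the model's probability that each customer has not yet churned.
data["p_alive"] = bgf.conditional_probability_alive(
    data["frequency"], data["recency"], data["T"]
)
print(data.sort_values("p_alive").head())
```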
More recently, researchers have proposed architecture frameworks that integrate multiple analytical approaches. Abdurrahman and colleagues (2022) proposed an architecture that "evaluates business strategy using customer segmentation and customer lifetime value prediction, churn prediction, uplift modelling, and survival analysis". This integrated approach recognizes that no single method is sufficient; effective customer analytics requires a portfolio of techniques.
Customer churn—the act of a customer ending their relationship with a service provider—has been extensively studied across industries. In the telecom sector, where annual churn rates range from 20% to 40%, researchers have demonstrated that "the cost of retaining existing customers is 5–10 times lower than the cost of obtaining new customers" and that "decreasing the churn rate by 5% increases the profit from 25% to 85%".
The e-commerce churn prediction problem has distinct characteristics. Unlike telecom, where churn is often signaled by a contract cancellation, e-commerce churn must be inferred from purchasing behavior. Survival analysis, a branch of statistics originally developed for medical research, has proven particularly valuable for this context. The survival function S(t) = P(T > t) represents the probability that a customer remains active beyond time t. The hazard function, which defines the event rate at time t conditional on survival up to that time, is expressed as λ(t) = lim_{Δt→0} P(t ≤ T < t + Δt | T ≥ t) / Δt.
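As a concrete illustration of estimating S(t) under censoring, here is a minimal sketch using the lifelines library on synthetic inter-purchase durations; the 120-day observation cutoff and all data are hypothetical, not drawn from the studies discussed here.

```python
import numpy as np
from lifelines import KaplanMeierFitter

rng = np.random.default_rng(7)

# Hypothetical data: days until next purchase for 500 customers.
durations = rng.exponential(scale=60.0, size=500)

# Censoring flag: 1 = repurchase observed, 0 = no purchase by the cutoff.
# For censored customers we cannot tell churn from a long pause.
observed = (durations < 120).astype(int)
durations = np.minimum(durations, 120.0)

# Kaplan-Meier estimate of the survival function S(t) = P(T > t).
kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=observed)
print(kmf.survival_function_.head())
```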
Recent advances have combined survival analysis with deep learning. Researchers have proposed "a deep survival framework to predict which customers are at risk of stopping to purchase with retail companies in non-contractual settings". By leveraging recurrent neural networks to learn survival model parameters, this approach aims to "obtain individual-level survival models for purchasing behavior based only on individual customer behavior and avoid time-consuming feature engineering processes".
Beyond prediction, understanding customer behavior requires segmentation and pattern discovery. K-means clustering, an unsupervised learning algorithm, partitions customers into distinct groups based on behavioral similarity. As demonstrated by Husein and colleagues (2022), "a clustering technique approach is proposed to classify customer data which is evaluated using the Davies Bouldin, Calinski Harabasz and Silhouette methods to determine the optimal number of clusters". Their research found that K-means clustering produced 5 clusters with 76% better accuracy than Spectral Clustering and Gaussian Mixture Model methods.
The Apriori algorithm, originally developed by Agrawal and Srikant (1994) for market basket analysis, identifies frequent itemsets and association rules. The core principle, as articulated by Leskovec, Rajaraman, and Ullman (2014), is that "a large set cannot be frequent unless all its subsets are"—the Apriori property that enables efficient pruning of the search space. In the customer analytics context, association rules can identify products that are frequently co-purchased, enabling targeted cross-selling and personalized recommendations.
The successful deployment of customer analytics requires a clear understanding of the distinct contributions of Data Analysts and Data Scientists. As noted in industry analysis, while both roles are extremely important, "the terms 'data scientist' and 'data analyst' are often used interchangeably by marketers as if they are one and the same," leading to confusion about responsibilities and expectations.
The Data Analyst in customer analytics is primarily focused on descriptive and diagnostic analytics. According to industry analysis, "arguably the most important role of a data analyst is collecting, sorting and studying different sets of information" with the goal of "pinning down a fixed value to some process or function so it can be assessed and compared over time". The data "has to be regulated, normalized and calibrated so that it can be taken out of context and used as standalone information or paired with other data without losing its integrity".
In the specific context of churn and CLV analysis, the Analyst's core competency is cohort analysis. As documented in technical literature on SQL-based retention analysis, rolling retention (also known as cohort analysis) is defined as "the percentage of returning users measured at a regular interval, typically weekly or monthly, grouped by their sign-up week/month, also known as cohort". By grouping users based on when they signed up, analysts can gain insight into how product, marketing, and sales initiatives have impacted retention. The Analyst answers questions such as: "How well did these new users stick around compared to users that signed up a week prior?" or "How many of the dormant users who received discount offers came back and stayed on the product?"
The technical implementation of cohort analysis requires advanced SQL skills, including the use of window functions such as FIRST_VALUE to calculate first purchase dates and week number calculations. The Analyst must be proficient in creating pivot-table-style reports using SUM functions with CASE statements to produce the retention matrix that visualizes cohort retention over time.
Beyond cohort analysis, the Analyst performs:

- Descriptive reporting: retention rates, revenue summaries, and dashboards (e.g., in Tableau) for marketing stakeholders.
- Model validation: checking churn scores and CLV estimates against domain knowledge, such as known seasonal purchase cycles.
- Strategy translation: converting computational findings into actionable retention interventions, such as targeted discount offers for at-risk segments.
The Data Scientist in customer analytics represents "a kind of evolution from the traditional data or business analyst role". While formal training is similar, "the thing that sets data scientists apart is strong business acumen coupled with the ability to communicate findings to senior leaders in a way that can influence how the organization approaches a business challenge". As one industry expert describes, a data scientist is "somebody who is inquisitive, who can stare at data and spot trends. It's almost like a renaissance individual who really wants to learn and bring change to an organization".
In the churn prediction context, the Data Scientist's primary responsibility is developing and optimizing predictive models. A leading approach in the literature is the use of XGBoost (Extreme Gradient Boosting), a decision-tree-based ensemble algorithm that "accurately predicts a target class by combining simple and weak models". XGBoost has been shown to achieve high performance in churn prediction tasks. In a telecom churn prediction study, researchers proposed "a stacking model consisting of two levels with four algorithms: Xgboost (XGB), Logistic regression (LR), Decision tree (DT) and Naive Bayes classifier (NBC)". The results demonstrated that "the proposed customer churn predictions have accuracies of 96.12% and 98.09% for the original and new churn datasets, respectively".
The Data Scientist is also responsible for:

- Feature engineering: deriving behavioral predictors (recency, frequency, monetary value, and related signals) from raw transaction data.
- Segmentation and valuation: building K-means clusterings and probabilistic CLV estimates alongside churn classifiers.
- Model maintenance: hyperparameter optimization, handling of class imbalance, and ongoing validation of predictive performance.
| Feature | Customer Analytics Data Analyst | Customer Analytics Data Scientist |
|---|---|---|
| Primary Output | Cohort retention reports, dashboards, descriptive statistics | Churn prediction models, CLV estimates, customer segmentations |
| Core Tools | SQL (window functions, cohort queries), Tableau, Excel | Python (XGBoost, scikit-learn, PyTorch), R, Spark |
| Statistical Focus | Descriptive statistics, retention rates, cohort comparisons | Predictive modeling, survival analysis, hyperparameter optimization |
| Domain Knowledge | Marketing metrics, customer behavior, business KPIs | Machine learning algorithms, survival analysis, feature engineering |
| Typical Question | "What are the common characteristics of lost customers?" | "What is the probability that this new customer will churn within 90 days?" |
| Temporal Scope | Historical and current (what happened, what is happening) | Future-oriented (what will happen, what might happen) |
Cohort analysis is the foundational technique for understanding customer retention. As documented in technical literature, the process involves three main steps, condensed into a code sketch after the list:
Step 1: Bucketing visits by time period. Using SQL, the Analyst groups customer activity into weekly or monthly cohorts. A query using DATE_TRUNC or similar functions "squashes" all logins or purchases in each period into one row per customer per period.
Step 2: Normalizing visits relative to first activity. Using the FIRST_VALUE window function partitioned by customer_id and ordered by activity date, the Analyst calculates each customer's first activity date and then computes the week_number as the difference between current activity date and first activity date, divided by the number of seconds in a week.
Step 3: Creating the retention matrix. Using SUM with CASE statements, the Analyst creates a pivot table where rows represent cohorts (by first activity week) and columns represent week numbers (0, 1, 2, ...). Each cell contains the count of customers in that cohort who were active in that week. Retention percentages are calculated by dividing each week's count by the week 0 count.
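The steps above are described in SQL; as a compact illustration, here is a minimal pandas sketch of the same three steps, using synthetic transactions. The customer_id and order_date column names are hypothetical stand-ins for a real transactions table.

```python
import numpy as np
import pandas as pd

# Hypothetical transactions table: one row per purchase.
rng = np.random.default_rng(0)
tx = pd.DataFrame({
    "customer_id": rng.integers(1, 200, size=2000),
    "order_date": pd.Timestamp("2022-01-01")
                  + pd.to_timedelta(rng.integers(0, 120, size=2000), unit="D"),
})

# Step 1: bucket activity into weeks (one row per customer per week).
tx["week"] = tx["order_date"].dt.to_period("W").dt.start_time
visits = tx[["customer_id", "week"]].drop_duplicates()

# Step 2: normalize each visit relative to the customer's first activity
# (the pandas analogue of the FIRST_VALUE window function).
visits["cohort_week"] = visits.groupby("customer_id")["week"].transform("min")
visits["week_number"] = (visits["week"] - visits["cohort_week"]).dt.days // 7

# Step 3: pivot into the retention matrix and convert to percentages
# (the pandas analogue of SUM with CASE statements).
counts = visits.pivot_table(index="cohort_week", columns="week_number",
                            values="customer_id", aggfunc="nunique")
retention = counts.div(counts[0], axis=0).round(3)
print(retention)
```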
The resulting cohort analysis answers the Analyst's core question: "What are the common characteristics of lost customers?" By comparing the retention patterns of different cohorts—for example, customers acquired through different marketing channels or during different promotional periods—the Analyst can identify which acquisition strategies produce the most loyal customers.
Customer segmentation enables targeted marketing strategies. K-means clustering partitions customers into K distinct groups based on behavioral features such as recency, frequency, and monetary value (RFM), along with other derived features.
As demonstrated by Husein and colleagues (2022), the optimal number of clusters should be determined using evaluation metrics. The silhouette score measures how similar a point is to its own cluster compared to other clusters, with values ranging from -1 to +1 (higher is better). The Davies-Bouldin index measures the average similarity between clusters, with lower values indicating better separation.
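A minimal scikit-learn sketch of this metric-driven selection of K follows; the gamma-distributed RFM matrix is a hypothetical stand-in for real customer features, not data from the cited study.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score
from sklearn.preprocessing import StandardScaler

# Hypothetical RFM feature matrix: one row per customer, columns =
# recency (days), frequency (orders), monetary value (total spend).
rng = np.random.default_rng(42)
rfm = rng.gamma(shape=2.0, scale=50.0, size=(1000, 3))
X = StandardScaler().fit_transform(rfm)  # K-means is scale-sensitive

# Sweep candidate cluster counts and score each solution.
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    sil = silhouette_score(X, labels)      # higher is better (-1 to +1)
    dbi = davies_bouldin_score(X, labels)  # lower is better
    print(f"k={k}: silhouette={sil:.3f}, davies_bouldin={dbi:.3f}")
```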
The clustering process yields interpretable segments: high-value loyal customers, at-risk but historically valuable customers, low-engagement window shoppers, seasonal buyers, and bargain-seekers. Each segment suggests different retention strategies.
The Apriori algorithm, introduced by Agrawal and Srikant (1994) and comprehensively treated by Leskovec, Rajaraman, and Ullman (2014), identifies frequent itemsets—sets of items that appear together in many transactions. The algorithm operates on the Apriori property: all subsets of a frequent itemset must also be frequent. This property enables efficient pruning: once an itemset is identified as infrequent, its supersets need not be considered.
In the customer analytics context, association rules reveal product affinities. A rule such as {coffee, creamer} → {sugar} with support of 0.05 (5% of transactions contain all three items) and confidence of 0.8 (80% of transactions containing coffee and creamer also contain sugar) enables personalized recommendations and cross-selling campaigns.
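A minimal sketch of mining such rules, assuming the mlxtend library and a toy one-hot transaction matrix (the items and thresholds are illustrative only):

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Toy one-hot transaction matrix: rows = transactions, columns = items.
transactions = pd.DataFrame(
    [
        {"coffee": 1, "creamer": 1, "sugar": 1, "bread": 0},
        {"coffee": 1, "creamer": 1, "sugar": 1, "bread": 1},
        {"coffee": 1, "creamer": 0, "sugar": 0, "bread": 1},
        {"coffee": 0, "creamer": 0, "sugar": 1, "bread": 1},
        {"coffee": 1, "creamer": 1, "sugar": 0, "bread": 0},
    ],
    dtype=bool,
)

# Frequent itemsets above a minimum support threshold (Apriori pruning).
itemsets = apriori(transactions, min_support=0.2, use_colnames=True)

# Derive rules such as {coffee, creamer} -> {sugar}, filtered by confidence.
rules = association_rules(itemsets, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```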
XGBoost (Extreme Gradient Boosting) has emerged as a leading algorithm for churn prediction due to its handling of mixed data types, built-in regularization to prevent overfitting, and ability to handle missing values. As documented in the telecom churn prediction literature, XGBoost achieves high accuracy when properly tuned.
The XGBoost model optimizes a regularized objective function: L(θ) = Σ_i l(y_i, ŷ_i) + Σ_k Ω(f_k), where l is a differentiable convex loss function (e.g., log loss for binary classification) and Ω(f) = γT + (1/2)λ||w||² is the regularization term that penalizes model complexity, with T the number of leaves in a tree and w the vector of leaf weights.
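A minimal sketch of training such a classifier follows, assuming the xgboost package and a synthetic imbalanced dataset as a stand-in for real churn data; the scale_pos_weight setting and precision-recall evaluation anticipate the class-imbalance discussion in Section V.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Synthetic stand-in for an e-commerce churn dataset (~10% churners).
X, y = make_classification(
    n_samples=5000, n_features=20, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# scale_pos_weight reweights the minority (churn) class; gamma and
# reg_lambda correspond to the γ and λ terms in the objective above.
model = XGBClassifier(
    n_estimators=300,
    max_depth=4,
    learning_rate=0.1,
    scale_pos_weight=(y_tr == 0).sum() / (y_tr == 1).sum(),
    gamma=1.0,
    reg_lambda=1.0,
    eval_metric="aucpr",
)
model.fit(X_tr, y_tr)

# Evaluate with a precision-recall metric rather than raw accuracy.
probs = model.predict_proba(X_te)[:, 1]
print(f"average precision: {average_precision_score(y_te, probs):.3f}")
```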
The Data Scientist's responsibilities include:

- Tuning hyperparameters such as tree depth, learning rate, and the regularization terms γ and λ.
- Correcting for class imbalance so that the model does not simply predict the majority (non-churn) class.
- Validating predictive performance with metrics appropriate to imbalanced data, such as precision-recall curves.
- Producing interpretable outputs (e.g., SHAP explanations) that marketing stakeholders can act on.
Effective customer analytics requires seamless collaboration between Analysts and Scientists: the Analyst's cohort analyses and segment profiles inform the Scientist's feature engineering, and the Scientist's churn scores and CLV estimates return to the Analyst for validation and translation into retention campaigns. This workflow must contend with several persistent challenges.
Non-contractual uncertainty: Unlike subscription businesses where churn is explicit, e-commerce churn must be inferred. The choice of a "churn definition" (e.g., no purchase for 90 days) is arbitrary and affects model performance.
Class imbalance: In typical e-commerce datasets, churn rates may be 5–15%. Models trained on imbalanced data tend to predict the majority class. The Data Scientist must apply corrective techniques such as resampling or class weighting, and the Analyst must evaluate models using precision-recall curves rather than raw accuracy.
Seasonality: Customer purchasing behavior varies by season (holidays, sales events). A customer who appears to have churned may simply be between seasonal purchase cycles. The Analyst must account for seasonality in validation.
Interpretability vs. accuracy trade-off: Black-box models like XGBoost achieve high accuracy but are difficult to explain to marketing stakeholders. The Data Scientist may need to provide SHAP explanations (sketched below) or consider simpler, more interpretable models for certain use cases.
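Continuing the hypothetical XGBoost sketch from Section IV, a minimal illustration of SHAP-based explanation, assuming the shap package:

```python
import shap  # assumes the shap package and the fitted model from Section IV

# TreeExplainer computes exact SHAP values for tree ensembles.
explainer = shap.TreeExplainer(model)      # `model` from the XGBoost sketch
shap_values = explainer.shap_values(X_te)  # one row of attributions per customer

# Global view: which features drive churn risk across the test set.
shap.summary_plot(shap_values, X_te)
```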
The integration of advanced analytics into customer relationship management has transformed marketing from an art into a science. This paper has argued that effective CLV estimation and churn prediction require a clear division of labor between Data Analysts and Data Scientists. The Analyst performs cohort analysis, validates model outputs, and translates findings into strategy. The Scientist builds predictive models—XGBoost classifiers, survival analysis frameworks, clustering algorithms—that forecast which customers are at risk and why.
The academic literature reviewed—from probabilistic CLV models (Jasek et al., 2019) to ensemble learning for churn prediction (Telecom Churn Study, 2021) to deep survival frameworks (Equihua et al., 2022)—demonstrates significant technical progress. The foundational techniques of cohort analysis, K-means clustering, and association rule learning remain essential components of the customer analytics toolkit.
The future trajectory will likely involve tighter integration of real-time data, enabling immediate intervention when a customer exhibits churn-risk signals. Personalization at scale—delivering the right offer to the right customer at the right moment—will become increasingly automated. However, the human roles of validation, interpretation, and strategy translation will remain essential. The Data Analyst who can answer "What are the common characteristics of lost customers?" and the Data Scientist who can predict "What is the probability that this new customer will churn?" together form the backbone of data-driven customer relationship management.