{"id":2816,"date":"2022-05-06T21:42:34","date_gmt":"2022-05-06T20:42:34","guid":{"rendered":"https:\/\/www.gc4ss.org\/?p=2816"},"modified":"2026-04-09T16:25:21","modified_gmt":"2026-04-09T15:25:21","slug":"from-transaction-to-prediction-the-roles-of-data-analysts-and-data-scientists-in-customer-lifetime-value-estimation-and-churn-reduction","status":"publish","type":"post","link":"https:\/\/www.gc4ss.org\/?p=2816","title":{"rendered":"From Transaction to Prediction: The Roles of Data Analysts and Data Scientists in Customer Lifetime Value Estimation and Churn Reduction"},"content":{"rendered":"<div id=\"pl-2816\"  class=\"panel-layout\" ><div id=\"pg-2816-0\"  class=\"panel-grid panel-no-style\" ><div id=\"pgc-2816-0-0\"  class=\"panel-grid-cell\" ><div id=\"panel-2816-0-0-0\" class=\"so-panel widget widget_sow-editor panel-first-child panel-last-child\" data-index=\"0\" ><div\n\t\t\t\n\t\t\tclass=\"so-widget-sow-editor so-widget-sow-editor-base\"\n\t\t\t\n\t\t>\n<div class=\"siteorigin-widget-tinymce textwidget\">\n\t<p><!-- wp:paragraph --><\/p>\n<p style=\"padding-left: 40px;\"><span style=\"color: #ff0000;\"><strong>Aziz Ozmen, PhD<\/strong><\/span><br \/><a title=\"\" href=\"mailto:aziz.ozmen@gc4ss.org\">aziz.ozmen@gc4ss.org<\/a><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><strong>\u00a0 \u00a0 \u00a0 \u00a0 \u00a0Senior<\/strong> <strong>Security Analyst<\/strong><br \/><strong>\u00a0 \u00a0 \u00a0 \u00a0 \u00a0Global Center for Security Studies<\/strong><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:image {\"id\":2747,\"sizeSlug\":\"large\",\"linkDestination\":\"none\"} --><\/p>\n<figure class=\"wp-block-image size-large\">\n<h1>From Transaction to Prediction: The Roles of Data Analysts and Data Scientists in Customer Lifetime Value Estimation and Churn Reduction<\/h1>\n<p><strong>Abstract<\/strong><\/p>\n<p>The digital transformation of retail has produced an unprecedented wealth of customer behavioral data, creating both 
opportunities and challenges for marketing organizations. Among the most valuable applications of this data is the estimation of Customer Lifetime Value (CLV) and the prediction of customer churn\u2014activities that directly impact profitability and long-term business sustainability. This paper investigates the distinct yet complementary roles of Data Analysts and Data Scientists in the development and deployment of CLV estimation and churn prediction models for e-commerce environments. Moving beyond traditional RFM (Recency, Frequency, Monetary) analysis, this study explores how survival analysis, cohort analysis, and machine learning techniques\u2014including K-means clustering, Apriori association rule learning, and gradient boosting models\u2014enable organizations to predict which customers are at risk of churn and to personalize retention interventions. Through a synthesis of academic literature, industry implementations, and methodological frameworks from 2014 to 2022, this paper delineates the division of labor: Data Scientists engineer predictive models such as XGBoost classifiers and survival analysis frameworks, while Data Analysts perform cohort analysis using SQL, validate model outputs, and translate computational findings into actionable marketing strategies. The paper concludes by addressing persistent challenges\u2014including the non-contractual nature of e-commerce relationships, class imbalance in churn datasets, and the interpretability of black-box models\u2014and proposes an integrated operational framework for customer analytics.<\/p>\n<p><strong>Keywords: <\/strong>Customer Lifetime Value, Churn Prediction, Survival Analysis, Cohort Analysis, K-Means Clustering, Apriori Algorithm, XGBoost, Data Science, E-commerce Analytics<\/p>\n<h1>1.\u00a0\u00a0 Introduction<\/h1>\n<p>The aphorism that \"it is cheaper to retain an existing customer than to acquire a new one\" has been empirically validated across multiple industries. 
As documented by Reichheld (1990), a mere 5% improvement in customer retention leads to profit increases of 85% in the banking sector, 50% in insurance brokerage, and 30% in the automotive industry. In the e-commerce sector, where customer acquisition costs continue to rise and competition intensifies, the ability to predict which customers are likely to churn\u2014and to intervene before they do\u2014has become a strategic imperative.<\/p>\n<p>The challenge, however, is substantial. Unlike contractual settings such as telecommunications or subscription services, where customers explicitly terminate their relationships, e-commerce operates in a non-contractual environment. As noted by researchers at the University of Essex, \"in a non-contractual setting such as retail, customers can change their purchasing habits at any moment, and typically the longer a customer takes to make their next purchase, the lower the probability is of that customer returning at all\". This uncertainty creates a censoring problem: analysts cannot definitively distinguish between a customer who has permanently churned and one who is simply experiencing a long inter-purchase pause.<\/p>\n<p>The central thesis of this paper is that effective CLV estimation and churn prediction require a clear division of labor between Data Scientists, who build predictive models and algorithmic frameworks, and Data Analysts, who perform descriptive analyses, validate model outputs, and translate findings into business strategy. While machine learning models\u2014particularly gradient boosting algorithms like XGBoost\u2014can achieve high predictive accuracy, they require careful feature engineering, handling of class imbalance, and domain-specific validation that only human analysts can provide.<\/p>\n<p>This paper is structured as follows. Section II reviews the foundational concepts of CLV estimation and churn prediction, including the specific challenges of the e-commerce context. 
Section III delineates the distinct roles of Data Analysts and Data Scientists within the customer analytics pipeline, drawing on industry literature and job architecture analyses. Section IV presents the technical framework, focusing on cohort analysis, K-means clustering, Apriori association rule learning, and XGBoost-based churn prediction. Section V discusses the collaborative workflow and persistent challenges, followed by a conclusion on the future trajectory of customer analytics.<\/p>\n<h1>2.\u00a0\u00a0 The Foundations of Customer Lifetime Value and Churn Prediction<\/h1>\n<h1>2.1\u00a0\u00a0 Defining Customer Lifetime Value<\/h1>\n<p>Customer Lifetime Value (CLV) represents the total net profit a company can expect to generate from a customer over the entire duration of their relationship. As summarized by Jasek and colleagues (2019), CLV is \"a key issue for companies that are introducing a CLV managerial approach in their online B2C relationship stores\". The selection of an appropriate CLV model depends on several assumptions specific to the online retail environment, including the non-contractual nature of the relationship, continuous purchase timing (anytime, not just at regular intervals), and the variable-spending environment where transaction values fluctuate.<\/p>\n<p>The academic literature has produced numerous probabilistic CLV models. In their comparative analysis of eleven selected CLV models applied to e-commerce datasets from Central and Eastern Europe\u2014representing annual revenues in the hundreds of millions of euros and nearly 2.3 million customers\u2014Jasek and colleagues found that the BG\/NBD (Beta Geometric\/Negative Binomial Distribution) and Pareto\/NBD models achieved \"overall good and consistent results\" and could be \"considered stable with significant lifts from the baseline Status quo model\". 
These probabilistic models, originally developed by Fader and Hardie, estimate the probability that a customer is still alive (i.e., has not churned) based on historical purchase patterns.<\/p>\n<p>More recently, researchers have proposed architecture frameworks that integrate multiple analytical approaches. Abdurrahman and colleagues (2022) proposed an architecture that \"evaluates business strategy using customer segmentation and customer lifetime value prediction, churn prediction, uplift modelling, and survival analysis\". This integrated approach recognizes that no single method is sufficient; effective customer analytics requires a portfolio of techniques.<\/p>\n<h1>2.2\u00a0\u00a0 The Churn Prediction Problem<\/h1>\n<p>Customer churn\u2014the act of a customer ending their relationship with a service provider\u2014has been extensively studied across industries. In the telecom sector, where annual churn rates range from 20% to 40%, researchers have demonstrated that \"the cost of retaining existing customers is 5\u201310 times lower than the cost of obtaining new customers\" and that \"decreasing the churn rate by 5% increases the profit from 25% to 85%\".<\/p>\n<p>The e-commerce churn prediction problem has distinct characteristics. Unlike telecom, where churn is often signaled by a contract cancellation, e-commerce churn must be inferred from purchasing behavior. Survival analysis, a branch of statistics originally developed for medical research, has proven particularly valuable for this context. The survival function S(t) = P(T &gt; t) represents the probability that a customer remains active beyond time t. The hazard function, which defines the event rate at time t conditional on survival up to that time, is expressed as \u03b3(t) = lim_{\u0394t\u21920} P(t \u2264 T &lt; t+\u0394t | T \u2265 t) \/ \u0394t.<\/p>\n<p>Recent advances have combined survival analysis with deep learning. 
Researchers have proposed \"a deep survival framework to predict which customers are at risk of stopping to purchase with retail companies in non-contractual settings\". By leveraging recurrent neural networks to learn survival model parameters, this approach aims to \"obtain individual-level survival models for purchasing behavior based only on individual customer behavior and avoid time-consuming feature engineering processes\".<\/p>\n<h1>2.3\u00a0\u00a0 Customer Segmentation and Association Rules<\/h1>\n<p>Beyond prediction, understanding customer behavior requires segmentation and pattern discovery. K-means clustering, an unsupervised learning algorithm, partitions customers into distinct groups based on behavioral similarity. As demonstrated by Husein and colleagues (2022), \"a clustering technique approach is proposed to classify customer data which is evaluated using the Davies Bouldin, Calinski Harabasz and Silhouette methods to determine the optimal number of clusters\". Their research found that K-means clustering produced 5 clusters with 76% better accuracy than Spectral Clustering and Gaussian Mixture Model methods.<\/p>\n<p>The Apriori algorithm, originally developed by Agrawal and Srikant (1994) for market basket analysis, identifies frequent itemsets and association rules. The core principle, as articulated by Leskovec, Rajaraman, and Ullman (2014), is that \"a large set cannot be frequent unless all its subsets are\"\u2014the Apriori property that enables efficient pruning of the search space. In the customer analytics context, association rules can identify products that are frequently purchased together, enabling targeted cross-selling and personalized recommendations.<\/p>\n<h1>3.\u00a0\u00a0 Role Delineation: Data Analyst vs. Data Scientist in Customer Analytics<\/h1>\n<p>The successful deployment of customer analytics requires a clear understanding of the distinct contributions of Data Analysts and Data Scientists. 
As noted in industry analysis, while both roles are extremely important, \"the terms 'data scientist' and 'data analyst' are often used interchangeably by marketers as if they are one and the same,\" leading to confusion about responsibilities and expectations.<\/p>\n<h1>3.1\u00a0\u00a0 The Customer Analytics Data Analyst: The Cohort Investigator<\/h1>\n<p>The Data Analyst in customer analytics is primarily focused on descriptive and diagnostic analytics. According to industry analysis, \"arguably the most important role of a data analyst is collecting, sorting and studying different sets of information\" with the goal of \"pinning down a fixed value to some process or function so it can be assessed and compared over time\". The data \"has to be regulated, normalized and calibrated so that it can be taken out of context and used as standalone information or paired with other data without losing its integrity\".<\/p>\n<p>In the specific context of churn and CLV analysis, the Analyst's core competency is cohort analysis. As documented in technical literature on SQL-based retention analysis, rolling retention (also known as cohort analysis) is defined as \"the percentage of returning users measured at a regular interval, typically weekly or monthly, grouped by their sign-up week\/month, also known as cohort\". By grouping users based on when they signed up, analysts can gain insight into how product, marketing, and sales initiatives have impacted retention. The Analyst answers questions such as: \"How well did these new users stick around compared to users that signed up a week prior?\" or \"How many of the dormant users who received discount offers came back and stayed on the product?\"<\/p>\n<p>The technical implementation of cohort analysis requires advanced SQL skills, including the use of window functions such as FIRST_VALUE to calculate first purchase dates and week number calculations. 
The Analyst must be proficient in creating pivot-table-style reports using SUM functions with CASE statements to produce the retention matrix that visualizes cohort retention over time.<\/p>\n<p>Beyond cohort analysis, the Analyst performs:<\/p>\n<ul>\n<li><strong>Data validation and quality assurance<\/strong>: Ensuring that the input data\u2014transaction logs, customer profiles, clickstream data\u2014is clean, complete, and properly formatted.<\/li>\n<li><strong>Exploratory data analysis<\/strong>: Generating summary statistics, visualizing distributions, and identifying patterns that inform feature engineering.<\/li>\n<li><strong>False positive investigation<\/strong>: When the Data Scientist's churn prediction model flags a customer as high-risk, the Analyst investigates whether that prediction is accurate or a false positive driven by unusual but legitimate behavior (e.g., seasonal purchasing patterns).<\/li>\n<li><strong>Stakeholder communication<\/strong>: Translating analytical findings into reports, dashboards, and presentations that marketing managers can act on.<\/li>\n<\/ul>\n<h1>3.2\u00a0\u00a0 The Customer Analytics Data Scientist: The Predictive Modeler<\/h1>\n<p>The Data Scientist in customer analytics represents \"a kind of evolution from the traditional data or business analyst role\". While formal training is similar, \"the thing that sets data scientists apart is strong business acumen coupled with the ability to communicate findings to senior leaders in a way that can influence how the organization approaches a business challenge\". As one industry expert describes, a data scientist is \"somebody who is inquisitive, who can stare at data and spot trends. It's almost like a renaissance individual who really wants to learn and bring change to an organization\".<\/p>\n<p>In the churn prediction context, the Data Scientist's primary responsibility is developing and optimizing predictive models. 
A leading approach in the literature is the use of XGBoost (Extreme Gradient Boosting), a decision-tree-based ensemble algorithm that \"accurately predicts a target class by combining simple and weak models\". XGBoost has been shown to achieve high performance in churn prediction tasks. In a telecom churn prediction study, researchers proposed \"a stacking model consisting of two levels with four algorithms: Xgboost (XGB), Logistic regression (LR), Decision tree (DT) and Naive Bayes classifier (NBC)\". The results demonstrated that \"the proposed customer churn predictions have accuracies of 96.12% and 98.09% for the original and new churn datasets, respectively\".<\/p>\n<p>The Data Scientist is also responsible for:<\/p>\n<ul>\n<li><strong>Feature engineering<\/strong>: Transforming raw transaction data into predictive features. This includes constructing behavioral variables such as recency (days since last purchase), frequency (number of purchases in a time window), and monetary value (average or total spend), as well as more complex features like purchase regularity, category diversity, and engagement.<\/li>\n<\/ul>\n<ul>\n<li><strong>Handling class imbalance<\/strong>: Churn datasets are typically imbalanced, with far more non-churn than churn observations. The Data Scientist must apply techniques such as SMOTE (Synthetic Minority Over-sampling Technique) or class weighting, and must evaluate models with appropriate metrics (precision, recall, F1, AUC-PR) rather than accuracy.<\/li>\n<li><strong>Model evaluation and validation<\/strong>: Using appropriate cross-validation strategies, hyperparameter tuning (e.g., using Optuna or Grid Search), and ensuring that models generalize to holdout data.<\/li>\n<\/ul>\n<ul>\n<li><strong>Survival analysis implementation<\/strong>: For non-contractual settings, the Data Scientist may implement survival models such as Kaplan-Meier estimators or Cox Proportional Hazard models. 
The Kaplan-Meier estimator of the survival function is defined as \u015c(t) = \u220f_{k: t_k &lt; t} (1 - d_k \/ n_k), where d_k is the number of individuals that experienced the event at time t_k and n_k is the total number at risk at that time.<\/li>\n<\/ul>\n<ul>\n<li><strong>Unsupervised learning for segmentation<\/strong>: Implementing K-means clustering and evaluating cluster quality using metrics such as the Davies-Bouldin index, the Calinski-Harabasz index, and the silhouette score.<\/li>\n<\/ul>\n<h1>3.3\u00a0\u00a0 Comparative Summary<\/h1>\n<table>\n<tbody>\n<tr>\n<td width=\"96\">\n<p><strong>Feature<\/strong><\/p>\n<\/td>\n<td width=\"252\">\n<p><strong>Customer Analytics Data Analyst<\/strong><\/p>\n<\/td>\n<td width=\"338\">\n<p><strong>Customer Analytics Data Scientist<\/strong><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td width=\"96\">\n<p><strong>Primary Output<\/strong><\/p>\n<\/td>\n<td width=\"252\">\n<p>Cohort retention reports, dashboards, descriptive statistics<\/p>\n<\/td>\n<td width=\"338\">\n<p>Churn prediction models, CLV estimates, customer segmentations<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td width=\"96\">\n<p><strong>Core Tools<\/strong><\/p>\n<\/td>\n<td width=\"252\">\n<p>SQL (window functions, cohort queries), Tableau, Excel<\/p>\n<\/td>\n<td width=\"338\">\n<p>Python (XGBoost, scikit-learn, PyTorch), R, Spark<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td width=\"96\">\n<p><strong>Statistical Focus<\/strong><\/p>\n<\/td>\n<td width=\"252\">\n<p>Descriptive statistics, retention rates, cohort comparisons<\/p>\n<\/td>\n<td width=\"338\">\n<p>Predictive modeling, survival analysis, hyperparameter optimization<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td width=\"96\">\n<p><strong>Domain Knowledge<\/strong><\/p>\n<\/td>\n<td width=\"252\">\n<p>Marketing metrics, 
customer behavior, business KPIs<\/p>\n<\/td>\n<td width=\"338\">\n<p>Machine learning algorithms, survival analysis, feature engineering<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td width=\"96\">\n<p><strong>Typical Question<\/strong><\/p>\n<\/td>\n<td width=\"252\">\n<p>\"What are the common characteristics of lost customers?\"<\/p>\n<\/td>\n<td width=\"338\">\n<p>\"What is the probability that this new customer will churn within 90 days?\"<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td width=\"96\">\n<p><strong>Temporal Scope<\/strong><\/p>\n<\/td>\n<td width=\"252\">\n<p>Historical and current (what happened, what is happening)<\/p>\n<\/td>\n<td width=\"338\">\n<p>Future-oriented (what will happen, what might happen)<\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h1>4.\u00a0\u00a0 Technical Framework: From Cohort Analysis to XGBoost<\/h1>\n<h1>4.1\u00a0\u00a0 Cohort Analysis with SQL<\/h1>\n<p>Cohort analysis is the foundational technique for understanding customer retention. As documented in technical literature, the process involves three main steps:<\/p>\n<p><strong>Step 1: Bucketing visits by time period. <\/strong>Using SQL, the Analyst groups customer activity into weekly or monthly cohorts. A query using DATE_TRUNC or similar functions \"squashes\" all logins or purchases in each period into one row per customer per period.<\/p>\n<p><strong>Step 2: Normalizing visits relative to first activity. <\/strong>Using the FIRST_VALUE window function partitioned by customer_id and ordered by activity date, the Analyst calculates each customer's first activity date and then computes the week_number as the difference in seconds between the current activity date and the first activity date, divided by the number of seconds in a week (604,800).<\/p>\n<p><strong>Step 3: Creating the retention matrix. 
<\/strong>Using SUM with CASE statements, the Analyst creates a pivot table where rows represent cohorts (by first activity week) and columns represent week numbers (0, 1, 2, ...). Each cell contains the count of customers in that cohort who were active in that week. Retention percentages are calculated by dividing each week's count by the week 0 count.<\/p>\n<p>The resulting cohort analysis answers the Analyst's core question: \"What are the common characteristics of lost customers?\" By comparing the retention patterns of different cohorts\u2014for example, customers acquired through different marketing channels or during different promotional periods\u2014the Analyst can identify which acquisition strategies produce the most loyal customers.<\/p>\n<h1>4.2\u00a0\u00a0 K-Means Clustering for Customer Segmentation<\/h1>\n<p>Customer segmentation enables targeted marketing strategies. K-means clustering partitions customers into K distinct groups based on behavioral features such as recency, frequency, monetary value (RFM), as well as other derived features.<\/p>\n<p>As demonstrated by Husein and colleagues (2022), the optimal number of clusters should be determined using evaluation metrics. The silhouette score measures how similar a point is to its own cluster compared to other clusters, with values ranging from -1 to +1 (higher is better). The Davies-Bouldin index measures the average similarity between clusters, with lower values indicating better separation.<\/p>\n<p>The clustering process yields interpretable segments: high-value loyal customers, at-risk but historically valuable customers, low-engagement window shoppers, seasonal buyers, and bargain-seekers. 
Each segment suggests different retention strategies.<\/p>\n<h1>4.3\u00a0\u00a0 Apriori Algorithm for Association Rule Learning<\/h1>\n<p>The Apriori algorithm, introduced by Agrawal and Srikant (1994) and comprehensively treated by Leskovec, Rajaraman, and Ullman (2014), identifies frequent itemsets\u2014sets of items that appear together in many transactions. The algorithm operates on the Apriori property: all subsets of a frequent itemset must also be frequent. This property enables efficient pruning: once an itemset is identified as infrequent, its supersets need not be considered.<\/p>\n<p>In the customer analytics context, association rules reveal product affinities. A rule such as {coffee, creamer} \u2192 {sugar} with support of 0.05 (5% of transactions contain all three items) and confidence of 0.8 (80% of transactions containing coffee and creamer also contain sugar) enables personalized recommendations and cross-selling campaigns.<\/p>\n<h1>4.4\u00a0\u00a0 XGBoost for Churn Prediction<\/h1>\n<p>XGBoost (Extreme Gradient Boosting) has emerged as a leading algorithm for churn prediction due to its handling of mixed data types, built-in regularization to prevent overfitting, and ability to handle missing values. 
As documented in the telecom churn prediction literature, XGBoost achieves high accuracy when properly tuned.<\/p>\n<p>The XGBoost model optimizes a regularized objective function: L(\u03b8) = \u03a3_i l(y_i, \u0177_i) + \u03a3_k \u03a9(f_k), where l is a differentiable convex loss function (e.g., log loss for binary classification), f_k is the k-th tree in the ensemble, and \u03a9(f) = \u03b3T + (1\/2)\u03bb||w||\u00b2 is the regularization term that penalizes model complexity, with T the number of leaves in the tree and w the vector of leaf weights.<\/p>\n<p>The Data Scientist's responsibilities include:<\/p>\n<ul>\n<li><strong>Feature engineering<\/strong>: Creating predictive features from raw transaction data, including recency, frequency, monetary aggregates, and behavioral sequences.<\/li>\n<li><strong>Hyperparameter tuning<\/strong>: Using frameworks such as Optuna to optimize tree depth, learning rate, subsample ratios, and regularization.<\/li>\n<li><strong>Handling class imbalance<\/strong>: Applying techniques such as scale_pos_weight parameter adjustment or SMOTE oversampling.<\/li>\n<li><strong>Model interpretation<\/strong>: Using SHAP (SHapley Additive exPlanations) values to explain which features drive individual predictions, enabling marketing teams to understand why specific customers are flagged as high-risk.<\/li>\n<\/ul>\n<h1>5.\u00a0\u00a0 The Collaborative Workflow and Persistent Challenges<\/h1>\n<h1>5.1\u00a0\u00a0 Integrated Workflow<\/h1>\n<p>Effective customer analytics requires seamless collaboration between Analysts and Scientists:<\/p>\n<ol>\n<li><strong>Data preparation (Analyst)<\/strong>: The Analyst extracts, cleans, and validates transaction data, ensuring consistency and handling missing values.<\/li>\n<li><strong>Exploratory analysis (Analyst)<\/strong>: The Analyst generates cohort retention reports and summary statistics, identifying patterns.<\/li>\n<li><strong>Feature engineering (Scientist, with Analyst input)<\/strong>: The Scientist constructs predictive features; 
the Analyst provides domain expertise on which behavioral signals are most indicative of churn.<\/li>\n<li><strong>Model development (Scientist)<\/strong>: The Scientist trains and validates XGBoost or survival models, tuning hyperparameters and evaluating performance.<\/li>\n<li><strong>Output validation (Analyst)<\/strong>: The Analyst reviews model predictions, investigating false positives and false negatives to identify systematic errors.<\/li>\n<li><strong>Strategy translation (Analyst)<\/strong>: The Analyst translates model outputs into actionable marketing campaigns: \"Contact these 5,000 customers with a 10% discount offer; these 2,000 high-value at-risk customers should receive personalized outreach.\"<\/li>\n<\/ol>\n<h1>5.2\u00a0\u00a0 Persistent Challenges<\/h1>\n<p><strong>Non-contractual uncertainty<\/strong>: Unlike subscription businesses where churn is explicit, e-commerce churn must be inferred. The choice of a \"churn definition\" (e.g., no purchase for 90 days) is to some degree arbitrary, and because it determines the training labels it directly affects model performance.<\/p>\n<p><strong>Class imbalance<\/strong>: In typical e-commerce datasets, churn rates may be 5-15%. Models trained on imbalanced data tend to predict the majority class. The Data Scientist must apply appropriate techniques, and the Analyst must evaluate models using precision-recall curves rather than accuracy.<\/p>\n<p><strong>Seasonality<\/strong>: Customer purchasing behavior varies by season (holidays, sales events). A customer who appears to have churned may simply be between seasonal purchase cycles. The Analyst must account for seasonality in validation.<\/p>\n<p><strong>Interpretability vs. accuracy trade-off<\/strong>: Black-box models like XGBoost achieve high accuracy but are difficult to explain to marketing stakeholders. 
The Data Scientist may need to provide SHAP explanations or consider simpler, more interpretable models for certain use cases.<\/p>\n<h1>6.\u00a0\u00a0 Conclusion<\/h1>\n<p>The integration of advanced analytics into customer relationship management has transformed marketing from an art into a science. This paper has argued that effective CLV estimation and churn prediction require a clear division of labor between Data Analysts and Data Scientists. The Analyst performs cohort analysis, validates model outputs, and translates findings into strategy. The Scientist builds predictive models\u2014XGBoost classifiers, survival analysis frameworks, clustering algorithms\u2014that forecast which customers are at risk and why.<\/p>\n<p>The academic literature reviewed\u2014from probabilistic CLV models (Jasek et al., 2019) to ensemble learning for churn prediction (Telecom Churn Study, 2021) to deep survival frameworks (Equihua et al., 2022)\u2014demonstrates significant technical progress. The foundational techniques of cohort analysis, K-means clustering, and association rule learning remain essential components of the customer analytics toolkit.<\/p>\n<p>The future trajectory will likely involve tighter integration of real-time data, enabling immediate intervention when a customer exhibits churn-risk signals. Personalization at scale\u2014delivering the right offer to the right customer at the right moment\u2014will become increasingly automated. However, the human roles of validation, interpretation, and strategy translation will remain essential. The Data Analyst who can answer \"What are the common characteristics of lost customers?\" and the Data Scientist who can predict \"What is the probability that this new customer will churn?\" together form the backbone of data-driven customer relationship management.<\/p>\n<h1>7.\u00a0\u00a0 References<\/h1>\n<ul>\n<li>Abdurrahman, , Agarwal, C., &amp; Ramasamy, L. (2022). 
Architecture for evaluating customer retention strategies. ECS Transactions, 107(1), 1569. <a href=\"https:\/\/doi.org\/10.1149\/10701.1569ecst\">https:\/\/doi.org\/10.1149\/10701.1569ecst<\/a><\/li>\n<li>Jasek, , Vrana, L., Sperkova, L., Smutny, Z., &amp; Kobulsky, M. (2019). Comparative analysis of selected probabilistic customer lifetime value models in online shopping. Journal of Business Economics and Management, 20(3), 398-423. <a href=\"https:\/\/doi.org\/10.3846\/jbem.2019.9597\">https:\/\/doi.org\/10.3846\/jbem.2019.9597<\/a><\/li>\n<li>Husein, A. M., Setiawan, D., Sumangunsong, A. R. K., Simatupang, A., &amp; Yasmin, S. A. (2022). Combination grouping techniques and association rules for marketing analysis-based customer SinkrOn, 7(4), 1998-2007. <a href=\"https:\/\/jurnal.polgan.ac.id\/index.php\/sinkron\/article\/view\/12596\">https:\/\/jurnal.polgan.ac.id\/index.php\/sinkron\/article\/view\/12596<\/a><\/li>\n<li>Leskovec, , Rajaraman, A., &amp; Ullman, J. D. (2014). Frequent itemsets. In Mining of massive datasets (Chapter 6). Cambridge University Press. <a href=\"https:\/\/doi.org\/10.1017\/CBO9781139924801\">https:\/\/doi.org\/10.1017\/CBO9781139924801<\/a><\/li>\n<li>Telecom churn prediction study. (2021). Telecom churn prediction system based on ensemble learning using feature grouping. Applied Sciences, 11(11), 4742. <a href=\"https:\/\/www.mdpi.com\/2076-3417\/11\/11\/4742\">https:\/\/www.mdpi.com\/2076-3417\/11\/11\/4742<\/a><\/li>\n<li>Treasure (2016, July 21). Rolling retention done right in SQL. Treasure Data Blog. <a href=\"https:\/\/www.treasuredata.com\/blog\/rolling-retention-done-right-in-sql\">https:\/\/www.treasuredata.com\/blog\/rolling-retention-done-right-in-sql<\/a><\/li>\n<li>(2015, November 15). Data analysts vs. data scientists: What's the difference? Econsultancy. 
<a href=\"https:\/\/econsultancy.com\/data-analysts-vs-data-scientists-what-s-the-difference\/\">https:\/\/econsultancy.com\/data-analysts-vs-data-scientists-what-s-the-difference\/<\/a><\/li>\n<li>Equihua, J. P., Nordmark, H., Ali, M., &amp; Lausen, B. (2022). Modelling customer churn for the retail industry in a deep learning-based sequential framework. arXiv preprint, arXiv:2304.00575. <a href=\"https:\/\/ar5iv.labs.arxiv.org\/html\/2304.00575\">https:\/\/ar5iv.labs.arxiv.org\/html\/2304.00575<\/a><\/li>\n<li>Jasek, P., Vrana, L., Sperkova, L., Smutny, Z., &amp; Kobulsky, M. (2019). Predictive performance of customer lifetime value models in e-commerce and the use of non-financial data. Prague Economic Papers, 28(6), 648-669. <a href=\"https:\/\/doi.org\/10.18267\/j.pep.714\">https:\/\/doi.org\/10.18267\/j.pep.714<\/a><\/li>\n<\/ul>\n<\/figure>\n<p><!-- \/wp:paragraph --><\/p>\n<\/div>\n<\/div><\/div><\/div><\/div><\/div>","protected":false},"excerpt":{"rendered":"<p>Aziz Ozmen, PhDaziz.ozmen@gc4ss.org \u00a0 \u00a0 \u00a0 \u00a0 \u00a0Senior Security Analyst\u00a0 \u00a0 \u00a0 \u00a0 \u00a0Global Center for Security Studies From Transaction to Prediction: The Roles of Data<span class=\"excerpt-hellip\"> 
[\u2026]<\/span><\/p>\n","protected":false},"author":519,"featured_media":2831,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[1],"tags":[228,224,226,223,230,231,227,225,229],"class_list":["post-2816","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized","tag-apriori-algorithm","tag-churn-prediction","tag-cohort-analysis","tag-customer-lifetime-value","tag-data-science","tag-e-commerce-analytics","tag-k-means-clustering","tag-survival-analysis","tag-xgboost"],"aioseo_notices":[],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"https:\/\/i0.wp.com\/www.gc4ss.org\/wp-content\/uploads\/2022\/05\/aozmen2022.webp?fit=1536%2C1024&ssl=1","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p9RaMN-Jq","jetpack-related-posts":[{"id":2825,"url":"https:\/\/www.gc4ss.org\/?p=2825","url_meta":{"origin":2816,"position":0},"title":"The Force Multiplier: Institutionalizing the Data Analyst and Data Scientist in Modern Cybersecurity Operations","author":"Aziz Ozmen","date":"July 6, 2025","format":false,"excerpt":"Aziz Ozmen, PhDaziz.ozmen@gc4ss.org \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 Senior Security Analyst\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 Global Center for Security Studies The Force Multiplier: Institutionalizing the Data Analyst and Data Scientist in Modern Cybersecurity Operations Abstract The contemporary 
cybersecurity landscape is characterized by an unprecedented volume, velocity, and variety of data,\u2026","rel":"","context":"In &quot;Cyber Security&quot;","block_context":{"text":"Cyber Security","link":"https:\/\/www.gc4ss.org\/?cat=56"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/www.gc4ss.org\/wp-content\/uploads\/2025\/07\/aozmen2025.png?fit=1081%2C400&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/www.gc4ss.org\/wp-content\/uploads\/2025\/07\/aozmen2025.png?fit=1081%2C400&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/www.gc4ss.org\/wp-content\/uploads\/2025\/07\/aozmen2025.png?fit=1081%2C400&ssl=1&resize=525%2C300 1.5x, https:\/\/i0.wp.com\/www.gc4ss.org\/wp-content\/uploads\/2025\/07\/aozmen2025.png?fit=1081%2C400&ssl=1&resize=700%2C400 2x, https:\/\/i0.wp.com\/www.gc4ss.org\/wp-content\/uploads\/2025\/07\/aozmen2025.png?fit=1081%2C400&ssl=1&resize=1050%2C600 3x"},"classes":[]},{"id":2821,"url":"https:\/\/www.gc4ss.org\/?p=2821","url_meta":{"origin":2816,"position":1},"title":"Silent Signals: The Role of Data Analysts and Data Scientists in Algorithmic Market Anomaly Detection for Fraud and Insider Trading Identification","author":"Aziz Ozmen","date":"February 6, 2024","format":false,"excerpt":"Aziz Ozmen, PhDaziz.ozmen@gc4ss.org \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 Senior Security Analyst\u00a0 \u00a0 \u00a0 \u00a0 \u00a0Global Center for Security Studies Silent Signals: The Role of Data Analysts and Data Scientists in Algorithmic Market Anomaly Detection for Fraud and Insider Trading Identification Abstract The digitization of global financial markets has produced an\u2026","rel":"","context":"In \"Silent Signals: The Role of Data Analysts and Data Scientists in Algorithmic Market Anomaly Detection for Fraud and Insider Trading Identification\"","block_context":{"text":"Silent Signals: The Role of Data Analysts and Data Scientists in Algorithmic Market Anomaly Detection for Fraud and Insider Trading 
Identification","link":"https:\/\/www.gc4ss.org\/?tag=silent-signals-the-role-of-data-analysts-and-data-scientists-in-algorithmic-market-anomaly-detection-for-fraud-and-insider-trading-identification"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/www.gc4ss.org\/wp-content\/uploads\/2024\/02\/aozmen2024.png?fit=1081%2C400&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/www.gc4ss.org\/wp-content\/uploads\/2024\/02\/aozmen2024.png?fit=1081%2C400&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/www.gc4ss.org\/wp-content\/uploads\/2024\/02\/aozmen2024.png?fit=1081%2C400&ssl=1&resize=525%2C300 1.5x, https:\/\/i0.wp.com\/www.gc4ss.org\/wp-content\/uploads\/2024\/02\/aozmen2024.png?fit=1081%2C400&ssl=1&resize=700%2C400 2x, https:\/\/i0.wp.com\/www.gc4ss.org\/wp-content\/uploads\/2024\/02\/aozmen2024.png?fit=1081%2C400&ssl=1&resize=1050%2C600 3x"},"classes":[]},{"id":2823,"url":"https:\/\/www.gc4ss.org\/?p=2823","url_meta":{"origin":2816,"position":2},"title":"The Digital Pulse: Leveraging Natural Language Processing and Social Media Analytics for Real-Time Societal Sentiment Tracking","author":"Aziz Ozmen","date":"August 6, 2024","format":false,"excerpt":"Aziz Ozmen, PhDaziz.ozmen@gc4ss.org Senior Security AnalystGlobal Center for Security Studies The Digital Pulse: Leveraging Natural Language Processing and Social Media Analytics for Real-Time Societal Sentiment Tracking Abstract The advent of social media has transformed the public sphere into an unprecedented source of real-time, high-velocity textual data. 
This paper investigates the\u2026","rel":"","context":"In \"Computational Social Science\"","block_context":{"text":"Computational Social Science","link":"https:\/\/www.gc4ss.org\/?tag=computational-social-science"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/www.gc4ss.org\/wp-content\/uploads\/2024\/08\/aozmen2024-2.png?fit=1081%2C400&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/www.gc4ss.org\/wp-content\/uploads\/2024\/08\/aozmen2024-2.png?fit=1081%2C400&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/www.gc4ss.org\/wp-content\/uploads\/2024\/08\/aozmen2024-2.png?fit=1081%2C400&ssl=1&resize=525%2C300 1.5x, https:\/\/i0.wp.com\/www.gc4ss.org\/wp-content\/uploads\/2024\/08\/aozmen2024-2.png?fit=1081%2C400&ssl=1&resize=700%2C400 2x, https:\/\/i0.wp.com\/www.gc4ss.org\/wp-content\/uploads\/2024\/08\/aozmen2024-2.png?fit=1081%2C400&ssl=1&resize=1050%2C600 3x"},"classes":[]},{"id":2829,"url":"https:\/\/www.gc4ss.org\/?p=2829","url_meta":{"origin":2816,"position":3},"title":"From Open Source to Actionable Intelligence: The Role of Data Analysts and Data Scientists in NLP-Driven Cyber Threat Intelligence","author":"Aziz Ozmen","date":"March 6, 2026","format":false,"excerpt":"Aziz Ozmen, PhDaziz.ozmen@gc4ss.org Senior Security AnalystGlobal Center for Security Studies From Open Source to Actionable Intelligence: The Role of Data Analysts and Data Scientists in NLP-Driven Cyber Threat Intelligence Abstract The digital ecosystem is awash with unstructured textual data relevant to cybersecurity: threat intelligence reports, dark web forums, vulnerability disclosures,\u2026","rel":"","context":"In &quot;Cyber Security&quot;","block_context":{"text":"Cyber 
Security","link":"https:\/\/www.gc4ss.org\/?cat=56"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/www.gc4ss.org\/wp-content\/uploads\/2026\/03\/aozmen2026.png?fit=1081%2C400&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/www.gc4ss.org\/wp-content\/uploads\/2026\/03\/aozmen2026.png?fit=1081%2C400&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/www.gc4ss.org\/wp-content\/uploads\/2026\/03\/aozmen2026.png?fit=1081%2C400&ssl=1&resize=525%2C300 1.5x, https:\/\/i0.wp.com\/www.gc4ss.org\/wp-content\/uploads\/2026\/03\/aozmen2026.png?fit=1081%2C400&ssl=1&resize=700%2C400 2x, https:\/\/i0.wp.com\/www.gc4ss.org\/wp-content\/uploads\/2026\/03\/aozmen2026.png?fit=1081%2C400&ssl=1&resize=1050%2C600 3x"},"classes":[]},{"id":2677,"url":"https:\/\/www.gc4ss.org\/?p=2677","url_meta":{"origin":2816,"position":4},"title":"Interagency Network Approach To Information Sharing In Combating Terrorism","author":"Ismail Sahin","date":"November 1, 2023","format":false,"excerpt":"Ismail Sahin, PhDismail.sahin@gc4ss.org Expert Global Center for Security Studies Introduction The aftermath of the 9\/11 terrorist attacks highlighted critical issues in information sharing among law enforcement agencies. The failure to track the movements leading to 9\/11 was attributed to inadequate sharing of counter-terrorism information among different agencies (the U.S. 
Congress\u2026","rel":"","context":"In &quot;Conflicting Zones&quot;","block_context":{"text":"Conflicting Zones","link":"https:\/\/www.gc4ss.org\/?cat=43"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/www.gc4ss.org\/wp-content\/uploads\/2024\/02\/isahin2.png?fit=1200%2C686&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/www.gc4ss.org\/wp-content\/uploads\/2024\/02\/isahin2.png?fit=1200%2C686&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/www.gc4ss.org\/wp-content\/uploads\/2024\/02\/isahin2.png?fit=1200%2C686&ssl=1&resize=525%2C300 1.5x, https:\/\/i0.wp.com\/www.gc4ss.org\/wp-content\/uploads\/2024\/02\/isahin2.png?fit=1200%2C686&ssl=1&resize=700%2C400 2x, https:\/\/i0.wp.com\/www.gc4ss.org\/wp-content\/uploads\/2024\/02\/isahin2.png?fit=1200%2C686&ssl=1&resize=1050%2C600 3x"},"classes":[]},{"id":884,"url":"https:\/\/www.gc4ss.org\/?p=884","url_meta":{"origin":2816,"position":5},"title":"What Works, What Doesn&#8217;t, What&#8217;s Promising: Exploring the Role of Evidence-Based Policing","author":"Emirhan Darcan","date":"August 23, 2018","format":false,"excerpt":"Emirhan Darcan, PhD emirhan.darcan@gc4ss.org Expert Global Center for Security Studies Evidence-based policing is determined by statistical evidence gathered after the implementation of a program. Instead of other policing strategies, this program is malleable, dependent on data and studies being conducted. Currently, police agencies operate under set conditions. 
The men and\u2026","rel":"","context":"In &quot;Issues in Law Enforcement&quot;","block_context":{"text":"Issues in Law Enforcement","link":"https:\/\/www.gc4ss.org\/?cat=58"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/www.gc4ss.org\/wp-content\/uploads\/2018\/08\/Emirhan-2.webp?fit=1200%2C686&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/www.gc4ss.org\/wp-content\/uploads\/2018\/08\/Emirhan-2.webp?fit=1200%2C686&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/www.gc4ss.org\/wp-content\/uploads\/2018\/08\/Emirhan-2.webp?fit=1200%2C686&ssl=1&resize=525%2C300 1.5x, https:\/\/i0.wp.com\/www.gc4ss.org\/wp-content\/uploads\/2018\/08\/Emirhan-2.webp?fit=1200%2C686&ssl=1&resize=700%2C400 2x, https:\/\/i0.wp.com\/www.gc4ss.org\/wp-content\/uploads\/2018\/08\/Emirhan-2.webp?fit=1200%2C686&ssl=1&resize=1050%2C600 3x"},"classes":[]}],"_links":{"self":[{"href":"https:\/\/www.gc4ss.org\/index.php?rest_route=\/wp\/v2\/posts\/2816","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.gc4ss.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.gc4ss.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.gc4ss.org\/index.php?rest_route=\/wp\/v2\/users\/519"}],"replies":[{"embeddable":true,"href":"https:\/\/www.gc4ss.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=2816"}],"version-history":[{"count":0,"href":"https:\/\/www.gc4ss.org\/index.php?rest_route=\/wp\/v2\/posts\/2816\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.gc4ss.org\/index.php?rest_route=\/wp\/v2\/media\/2831"}],"wp:attachment":[{"href":"https:\/\/www.gc4ss.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=2816"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.gc4ss.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=2816"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.gc4ss.org\/index.php?rest_route=%2Fwp%2Fv2%
2Ftags&post=2816"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}