The Force Multiplier: Institutionalizing the Data Analyst and Data Scientist in Modern Cybersecurity Operations


Aziz Ozmen, PhD
aziz.ozmen@gc4ss.org

Senior Security Analyst
Global Center for Security Studies

Abstract

The contemporary cybersecurity landscape is characterized by an unprecedented volume, velocity, and variety of data, rendering traditional, signature-based defense mechanisms increasingly obsolete. This paper argues that the integration of dedicated Data Analysts and Data Scientists into cybersecurity frameworks is no longer a supplementary luxury but an operational necessity. Moving beyond the conventional role of the security analyst who investigates known indicators of compromise, this research explores how data-centric roles transform cybersecurity from a reactive discipline into a predictive and prescriptive science. By synthesizing findings from academic literature and industry case studies, this paper delineates the distinct yet complementary functions of Data Analysts and Scientists in threat detection, anomaly identification, and automated response. The analysis reveals that while Data Scientists focus on developing algorithmic models and statistical baselines for large-scale pattern recognition, Data Analysts serve as crucial interpreters who translate these computational outputs into actionable intelligence for Security Operations Centers (SOCs). The paper concludes by addressing the persistent challenge of the skills gap and proposes an integrated operational model that leverages both roles to achieve a mature, data-driven security posture. This investigation relies on peer-reviewed conference proceedings, academic journal articles, and books from 2016 to 2023 to ensure contemporary relevance and scholarly rigor.

Keywords: Cybersecurity Analytics, Data Science, Threat Detection, Security Operations Center (SOC), Anomaly Detection, Machine Learning, Cyber Threat Intelligence

1.   Introduction

For the better part of the last decade, the cybersecurity industry has operated on a reactive logic: identify the threat, write a signature, and deploy a patch. However, the digital ecosystem has evolved beyond the capacity of human-centric analysis alone. The proliferation of Internet of Things (IoT) devices, cloud computing architectures, and remote work infrastructures has greatly expanded the attack surface. According to industry estimates, the volume of security data generated daily is so immense that human analysts are effectively drowning in alerts, many of them false positives, which leads to "alert fatigue." It is within this crisis of scale that Data Science emerges not merely as a tool, but as the fundamental logic layer for modern defense.

The central thesis of this paper is that the convergence of Data Science and Cybersecurity—termed "Cybersecurity Data Science" (CSDS)—represents a distinct epistemological shift in how we understand digital risk. This shift moves the profession away from a rule-based paradigm (if/then logic) toward a probabilistic and behavioral paradigm (statistical likelihood of malice). To operationalize this shift, organizations require two distinct but symbiotic roles: the Data Analyst and the Data Scientist. While often conflated, these roles perform different labor. The Data Scientist is responsible for the engineering of statistical models and the creation of predictive algorithms, while the Data Analyst focuses on the interrogation of existing datasets to answer specific security questions and visualize trends for human decision-makers.

This paper is structured as follows: Section 2 reviews the evolution of cybersecurity data and the scientific need for analytics. Section 3 delineates the specific roles of Data Analysts and Scientists in the security context. Section 4 analyzes the practical applications, from SIEM optimization to User and Entity Behavior Analytics (UEBA). Section 5 discusses the collaborative operational model and the challenges of integration, followed by a conclusion on the future trajectory of the field.

2.   The Evolution of Cybersecurity Data and the Case for Analytics

To understand the necessity of data specialists, one must first understand the nature of the data itself. Historically, cybersecurity data consisted primarily of log files from firewalls and intrusion detection systems (IDS). These were structured, finite, and relatively low-volume.

Today, security-relevant data encompasses network flow records, endpoint process creation events, authentication logs, email headers, Dark Web scraping, and full packet captures. The transition from a "shortage of data" to a "data deluge" has fundamentally broken the manual analysis model.

2.1   The Scientific Foundation of CSDS

The application of data science to security is essentially an exercise in applied epistemology. As argued by Mongeau and Hajdasinski in Cybersecurity Data Science: Best Practices in an Emerging Profession, CSDS is best understood as a diagnostic process grounded in the scientific method. The practitioner forms a hypothesis (e.g., "User X is exhibiting credential theft behavior"), collects data (authentication logs, VPN access times, geolocation), tests the hypothesis against a statistical baseline, and validates or rejects the premise.
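The hypothesis-testing loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' method: it assumes that a user's daily count of failed authentications is roughly Poisson-distributed around a historical baseline rate, and the function names, the example rate, and the 0.01 significance level are all hypothetical choices.

```python
import math


def poisson_tail_p_value(observed: int, baseline_rate: float) -> float:
    """Return P(X >= observed) for X ~ Poisson(baseline_rate).

    Assumption (not from the source): failed logins per day are Poisson.
    """
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    if observed <= 0:
        return 1.0
    # Sum P(X = k) for k < observed, computed in log space for stability.
    below = sum(
        math.exp(-baseline_rate + k * math.log(baseline_rate) - math.lgamma(k + 1))
        for k in range(observed)
    )
    return max(0.0, 1.0 - below)


def credential_theft_suspected(observed: int, baseline_rate: float,
                               alpha: float = 0.01) -> bool:
    """Reject the null hypothesis 'behavior matches baseline' when p < alpha."""
    return poisson_tail_p_value(observed, baseline_rate) < alpha


if __name__ == "__main__":
    # Hypothetical user who averages 2 failed logins per day.
    for count in (3, 10):
        p = poisson_tail_p_value(count, baseline_rate=2.0)
        print(count, round(p, 6), credential_theft_suspected(count, 2.0))
```

A small p-value only rejects the "normal behavior" hypothesis; confirming credential theft would still require the additional evidence (VPN access times, geolocation) that the paragraph mentions.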

Furthermore, recent scholarship in the Journal of Computing Sciences in Colleges emphasizes the necessity of treating cybersecurity as a hard science. This requires access to high-fidelity datasets that enable verification, validation, and replication of experimental work, the hallmarks of the hypothetico-deductive model. Without Data Scientists to clean, normalize, and model this data, and Data Analysts to query it, cybersecurity remains a "protoscience," reliant on anecdotal evidence rather than empirical rigor.

2.2   Bridging the Semantic Gap

One of the persistent difficulties in cybersecurity is the "semantic gap": the disconnect between low-level system events (e.g., a specific registry key change) and high-level security abstractions (e.g., a ransomware infection). Data analytics serves as the bridge across this gap.

By applying statistical analysis to thousands of low-level events, analysts can identify correlations that signify a high-level attack chain. This process, however, is computationally intensive. As noted by Ardagna et al. (2021), implementing robust security controls can inadvertently degrade the accuracy of analytics if security experts lack data science training. To solve this, they propose a "Model-Based Big Data Analytics-as-a-Service" (MBDAaaS) middleware, which acts as a collaborative layer allowing security experts and data scientists to work without trampling on each other's domain constraints.

3.   Delineating the Roles: Data Analyst vs. Data Scientist in Security

While the titles are sometimes used interchangeably in smaller organizations, the academic and operational literature distinguishes them based on scope, temporality, and technical depth. In the context of cybersecurity, these differences become critical for team composition and incident response workflows.

3.1   The Security Data Analyst: The Interrogator

The Data Analyst in a Security Operations Center (SOC) is primarily focused on descriptive and diagnostic analytics. Their job is to answer specific questions: "What happened during the intrusion window?" or "How many failed logins originated from this IP range?"
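The second question above ("How many failed logins originated from this IP range?") is a typical descriptive query. In a SOC it would normally be written in SQL or SPL; the sketch below expresses the same logic in Python using only the standard library. The event field names (`src_ip`, `outcome`) and the sample records are hypothetical, and the addresses come from reserved documentation ranges.

```python
import ipaddress
from collections import Counter


def failed_logins_in_range(events, cidr):
    """Count failed logins per source IP for addresses inside `cidr`."""
    network = ipaddress.ip_network(cidr)
    counts = Counter()
    for event in events:
        if event["outcome"] != "failure":
            continue  # only failed authentications are of interest
        if ipaddress.ip_address(event["src_ip"]) in network:
            counts[event["src_ip"]] += 1
    return counts


# Hypothetical authentication log records.
SAMPLE_EVENTS = [
    {"user": "alice", "src_ip": "203.0.113.10", "outcome": "failure"},
    {"user": "alice", "src_ip": "203.0.113.10", "outcome": "failure"},
    {"user": "bob", "src_ip": "203.0.113.77", "outcome": "failure"},
    {"user": "bob", "src_ip": "198.51.100.5", "outcome": "failure"},
    {"user": "carol", "src_ip": "203.0.113.10", "outcome": "success"},
]

if __name__ == "__main__":
    per_ip = failed_logins_in_range(SAMPLE_EVENTS, "203.0.113.0/24")
    print(sum(per_ip.values()), dict(per_ip))
```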

Analysts are experts in Structured Query Language (SQL), Splunk Search Processing Language (SPL), and data visualization tools. According to a job architecture analysis by RealmOne (2025), a Level 2 Data Analyst in cybersecurity is expected to "serve as behavioral scientists for AI systems," defining what constitutes 'normal' behavior by analyzing telemetry data and establishing statistical baselines. The analyst’s output is typically a dashboard, a trend report, or a specific dataset handed off to incident responders.

Crucially, the analyst works with structured and semi-structured data that has already been ingested. Their value lies in speed and context; they can pivot between disparate data sources (e.g., correlating EDR alerts with HR termination lists) to validate a threat. As Champlain College's industry analysis notes, "the cybersecurity analyst looks only at data that can assist in monitoring online networks for suspicious activity, maintaining firewalls... and generally protecting an organization's digital assets."

3.2   The Security Data Scientist: The Algorithmic Architect

If the Analyst looks at the present and recent past, the Data Scientist looks toward the future. The Data Scientist builds the predictive models that the Analyst eventually queries. This role requires expertise in machine learning (ML) frameworks (such as TensorFlow and PyTorch), advanced statistics, and feature engineering. A job description for a Security Data Scientist (Cyberr, 2024) emphasizes the need to "build and deploy machine learning models that detect cyber threats such as malware, phishing, insider threats, and advanced persistent threats (APTs)."

The Data Scientist addresses the problem of "unknown unknowns"—threats for which no signature exists. By applying unsupervised learning algorithms to network traffic, a Data Scientist can detect a zero-day exploit based on its deviation from expected entropy. However, the models created by Data Scientists are probabilistic, not deterministic. They produce risk scores ("This traffic is 89% likely to be malicious"), not binary verdicts. This probabilistic output requires a specialized interface—often provided by the Analyst—to be useful to human decision-makers.
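To illustrate what "deviation from expected entropy" and a probabilistic risk score might look like, the following sketch computes the Shannon entropy of a payload, compares it with a baseline learned from benign samples, and squashes the deviation into a score between 0 and 1. It is an assumption-laden toy: the baseline payloads, the logistic mapping, and the offset of 3 standard deviations are hypothetical, and the score is not a calibrated probability.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def shannon_entropy(payload: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0 to 8)."""
    if not payload:
        return 0.0
    total = len(payload)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(payload).values()
    )


def risk_score(payload: bytes, baseline_payloads, offset: float = 3.0) -> float:
    """Map the entropy deviation from the baseline to a score in (0, 1)."""
    entropies = [shannon_entropy(p) for p in baseline_payloads]
    mu = mean(entropies)
    sigma = pstdev(entropies) or 1e-9  # guard against a zero-variance baseline
    z = abs(shannon_entropy(payload) - mu) / sigma
    # Logistic squashing: deviations near `offset` sigmas score about 0.5.
    return 1.0 / (1.0 + math.exp(-(z - offset)))


# Hypothetical benign traffic used to learn the baseline.
BASELINE = [
    b"GET /index.html HTTP/1.1",
    b"POST /login HTTP/1.1",
    b"GET /images/logo.png HTTP/1.1",
]

if __name__ == "__main__":
    # A payload using every byte value once has maximal entropy, as
    # encrypted or packed content typically does.
    print(risk_score(bytes(range(256)), BASELINE))
```

The output is a score rather than a verdict, which is why an analyst-facing threshold or dashboard is still needed to act on it.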

3.3   Comparative Summary

| Feature | Security Data Analyst | Security Data Scientist |
| --- | --- | --- |
| Primary Focus | Descriptive & Diagnostic | Predictive & Prescriptive |
| Time Horizon | Real-time & Historical (Recent) | Future & Historical (Long-term) |
| Key Output | Dashboards, Alerts, Incident Data | ML Models, Statistical Baselines, Algorithms |
| Tools | SQL, Splunk, Tableau, Excel | Python (Pandas, Scikit-learn), R, TensorFlow |
| Statistical Goal | Correlation & Aggregation | Causation & Prediction |
| Threat Type | Known threats, IOCs | Unknown threats, anomalies, Zero-days |

4.   Practical Applications and Operational Integration

The theoretical delineation of roles has practical implications for how security tools are deployed and how incidents are managed. The most mature application of this synergy is found in User and Entity Behavior Analytics (UEBA) and the modern SOC workflow.

4.1   From SIEM to UEBA: The Analytics Evolution

The traditional Security Information and Event Management (SIEM) system is a repository. It collects logs and allows for rule-based correlation. However, the integration of Data Science transforms a SIEM into a UEBA platform. This transformation relies on the Data Scientist to train a model on 90 days of authentication behavior to learn what "normal" looks like for a specific user. Once the model is deployed, the Data Analyst queries the output. For example, if an employee in accounting suddenly downloads 10GB of data at 3 AM, the model flags the deviation. The Analyst then pulls the relevant logs (the descriptive data) to confirm whether this was a backup routine or data exfiltration.
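The baseline-then-flag workflow described above can be sketched with a robust statistic. The example assumes a single feature (one user's daily download volume in MB), a 90-day history, and a median/MAD-based score with a conventional cutoff of 3.5. A real UEBA model would use many more features, and all names and numbers here are hypothetical.

```python
from statistics import median


def robust_z_score(value: float, history) -> float:
    """Deviation of `value` from the history's median, scaled by the MAD."""
    center = median(history)
    mad = median([abs(x - center) for x in history])
    scale = 1.4826 * mad  # makes the MAD comparable to a standard deviation
    if scale == 0:
        return 0.0 if value == center else float("inf")
    return (value - center) / scale


def is_anomalous(value: float, history, threshold: float = 3.5) -> bool:
    """Flag values whose robust z-score exceeds the threshold."""
    return abs(robust_z_score(value, history)) > threshold


# Hypothetical 90-day history: 200 to 290 MB per day with a weekly rhythm.
HISTORY_MB = [200 + (day % 7) * 15 for day in range(90)]

if __name__ == "__main__":
    for volume_mb in (260, 10 * 1024):  # an ordinary day vs. a 10 GB download
        print(volume_mb, is_anomalous(volume_mb, HISTORY_MB))
```

The median and MAD are used instead of the mean and standard deviation so that a few past spikes in the history do not inflate the baseline. The flag only says the volume is unusual; deciding between a backup routine and exfiltration remains the Analyst's task.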

This division of labor prevents the "alert fatigue" endemic to the industry. The machine handles the volume (identifying deviation), the Data Scientist ensures the machine is accurate (tuning false positive rates), and the Data Analyst handles the velocity (investigating the true positive).
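One way to picture the "tuning false positive rates" step is as threshold selection against analyst-labeled outcomes. The sketch below picks the lowest alerting threshold whose false positive rate stays within a budget. The scores, labels, and the 25% budget are invented for illustration, and the source does not prescribe this particular procedure.

```python
def pick_threshold(scores, labels, max_fpr):
    """Return (threshold, fpr, tpr) for the lowest threshold with fpr <= max_fpr.

    `labels` holds analyst verdicts: 1 = confirmed malicious, 0 = benign.
    An alert fires when score >= threshold. Returns None if no threshold fits.
    """
    negatives = sum(1 for y in labels if y == 0)
    positives = sum(1 for y in labels if y == 1)
    if negatives == 0 or positives == 0:
        raise ValueError("need at least one benign and one malicious example")
    for threshold in sorted(set(scores)):
        false_pos = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        true_pos = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fpr = false_pos / negatives
        if fpr <= max_fpr:
            return threshold, fpr, true_pos / positives
    return None


# Hypothetical model scores paired with analyst verdicts.
SCORES = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
LABELS = [0, 0, 0, 1, 0, 1, 1, 1]

if __name__ == "__main__":
    print(pick_threshold(SCORES, LABELS, max_fpr=0.25))
```

Choosing the lowest qualifying threshold keeps detection coverage as high as the false positive budget allows, which is the trade-off between alert fatigue and missed threats.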

4.2   Collaboration for Threat Hunting

"Threat Hunting" is the proactive search for adversaries lurking in the network. It is an iterative process that requires both roles. Staheli et al. (2016) at MIT Lincoln Laboratory addressed this via the development of the Cyber Analyst Real-Time Integrated Notebook Application (CARINA). They observed that existing tools limited collaboration, forcing analysts into "individual record keeping which hinders their ability to reflect on their own work and transition analytic insights to others." The solution was a collaborative environment combining visualization and annotation, allowing Data Scientists to push algorithmic recommendations to Analysts, and Analysts to annotate data for retraining the Scientists' models.

4.3   The Middleware Solution

Returning to the work of Ardagna et al. (2021), the "Model-Based Big Data Analytics-as-a-Service" (MBDAaaS) concept provides a structural blueprint for this relationship.

In this model, the security expert (Analyst) configures the front-end controls and defines the threat scenarios, while the Data Scientist configures the back-end analytics models. The middleware allows them to deploy an analytics process without one side undermining the other. For example, if the security expert imposes encryption that anonymizes IP addresses, the Data Scientist's geolocation model breaks. The MBDAaaS acts as the translation layer, reconciling security utility with analytical accuracy.
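The conflict in the IP address example can be made concrete with a toy comparison of two anonymization choices. This is not how MBDAaaS works internally; it is a hypothetical sketch assuming a geolocation lookup keyed by /24 network prefix, in which a salted hash destroys the prefix while truncation to the /24 keeps enough structure for coarse geolocation. The lookup table, salt, and region names are invented.

```python
import hashlib
import ipaddress

# Hypothetical geolocation table keyed by /24 prefix (documentation ranges).
GEO_BY_PREFIX = {
    "203.0.113.0/24": "Region-A",
    "198.51.100.0/24": "Region-B",
}


def hash_ip(ip: str, salt: str = "demo-salt") -> str:
    """Strong anonymization: an opaque token with no network structure left."""
    return hashlib.sha256((salt + ip).encode("utf-8")).hexdigest()[:16]


def truncate_ip(ip: str, prefix_len: int = 24) -> str:
    """Weaker anonymization: drop host bits, keep the network prefix."""
    network = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
    return str(network)


def geolocate(token: str):
    """Look up a region by prefix; returns None when the token is unusable."""
    return GEO_BY_PREFIX.get(token)


if __name__ == "__main__":
    ip = "203.0.113.57"
    print(geolocate(hash_ip(ip)))      # the hashed token cannot be geolocated
    print(geolocate(truncate_ip(ip)))  # the truncated prefix still can
```

A translation layer of the kind the paper describes would need to surface this trade-off explicitly, so that the privacy control and the analytic requirement are negotiated rather than discovered after the model fails.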

5.   Challenges, the Skills Gap, and the Future

Despite the clear theoretical benefits, the implementation of dedicated data roles in cybersecurity faces significant obstacles, primarily revolving around human capital and data accessibility.

5.1   The Persistent Skills Gap

The dual-domain expert, the "Unicorn" who is a master of both network security and advanced statistical modeling, is exceedingly rare. Most professionals are trained in one domain and lack fluency in the other. As highlighted by Champlain College, "the average cybersecurity professional may need to complete data analytics training before pursuing work as a cybersecurity analyst," while the data scientist must learn "the broader concepts and methodologies of cybersecurity." This training pipeline is long and expensive. The industry has responded by trying to automate the middle layer (as seen with MBDAaaS), but for the foreseeable future, organizations must learn to manage teams of specialists rather than hiring singular geniuses.

5.2   The Data Availability Paradox

For Data Scientists to build accurate models, they need massive, labeled datasets of "attack" and "benign" traffic. However, due to privacy regulations (GDPR, CCPA) and corporate liability concerns, sharing raw security data is legally perilous. A 2023 study notes that "synthetic data generation" is emerging as a solution to this paradox. By using Generative Adversarial Networks (GANs) to create artificial network traffic that retains the statistical properties of real attacks without exposing Personally Identifiable Information (PII), researchers can train models without breaching privacy.

5.3   Future Outlook: Autonomous Security

The endgame of integrating Data Science into cybersecurity is the move toward "Autonomous Security"—or what is colloquially known as AI-driven SOCs. In this future, the Data Scientist will shift from building static models to building adaptive models that retrain themselves in real-time. The Data Analyst will shift from manual log review to "supervisory control," monitoring the performance of the AI and intervening only when the model exhibits low confidence. The routine triage of alerts will become fully automated, allowing human intellect to focus on strategic defense, deception operations, and root cause analysis.
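The "supervisory control" arrangement described above amounts to routing alerts by model confidence. The sketch below assumes the model emits a risk score between 0 and 1 and treats distance from 0.5 as a crude confidence measure; the action names and the 0.6 confidence floor are hypothetical.

```python
def route_alert(risk_score: float, min_confidence: float = 0.6) -> str:
    """Decide whether an alert is handled automatically or by an analyst.

    Confidence is approximated as the score's distance from 0.5, rescaled
    to the range 0 to 1. This is an illustrative proxy, not a calibrated one.
    """
    if not 0.0 <= risk_score <= 1.0:
        raise ValueError("risk_score must be between 0 and 1")
    confidence = abs(risk_score - 0.5) * 2
    if confidence < min_confidence:
        return "analyst_review"  # the human intervenes on uncertain cases
    return "auto_contain" if risk_score > 0.5 else "auto_close"


if __name__ == "__main__":
    for score in (0.03, 0.55, 0.97):
        print(score, route_alert(score))
```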

6.   Conclusion

The era of the lone security generalist peering at firewall logs is ending. The complexity and scale of modern cyber threats necessitate a division of cognitive labor that only Data Analysts and Data Scientists can provide. This paper has argued that these roles, while distinct, are interdependent. The Data Scientist builds the engine of prediction; the Data Analyst steers the vehicle of investigation. Without the Scientist, the Analyst is blind to unknown threats. Without the Analyst, the Scientist’s models are untethered from operational reality.

For academic programs and industry leaders, the implication is clear: curricula must move beyond teaching basic networking or Python in isolation. The future of digital defense lies in interdisciplinary programs that force security experts to learn statistical rigor and data scientists to learn adversarial thinking. As the adage goes, "Everyone has a plan until they get punched in the mouth." In cybersecurity, the punch is the data deluge. The only defense is a data-literate workforce.

7.   References

  • Champlain College. (2024, April 16). Advancing Your Career: Is Cybersecurity Analytics Really Needed? https://online.champlain.edu/blog/advancing-cybersecurity-analytics-career
  • Staheli, D., Mancuso, V., Harnasch, R., Fulcher, C., Chmielinski, M., Kearns, A., Kelly, S., & Vuksani, E. (2016). Collaborative Data Analysis and Discovery for Cyber Security. In Twelfth Symposium on Usable Privacy and Security (SOUPS 2016). USENIX Association.
  • Ardagna, C. A., Bellandi, V., Damiani, E., Bezzi, M., & Hebert, C. (2021). Big Data Analytics-as-a-Service: Bridging the gap between security experts and data scientists. Computers & Electrical Engineering, 93, 107215. Elsevier.
  • Mongeau, S., & Hajdasinski, A. (2021). Cybersecurity Data Science: Best Practices in an Emerging Profession. Springer International Publishing.
  • RealmOne. (2025). Data Analyst 2 - Cybersecurity Job Description. Dice.com.
  • (2023). The Hunt for Cybersecurity Data: Exploring the Availability of Open Datasets for Cybersecurity Scientific Research. Journal of Computing Sciences in Colleges, 39(3). Consortium for Computing Sciences in Colleges.
  • Cyberr. (2024). Security Data Scientist - Job Description. Arc.dev.
