From Open Source to Actionable Intelligence: The Role of Data Analysts and Data Scientists in NLP-Driven Cyber Threat Intelligence

September 14, 2025


Aziz Ozmen, PhD
aziz.ozmen@gc4ss.org

Senior Security Analyst
Global Center for Security Studies


Abstract

The digital ecosystem is awash with unstructured textual data relevant to cybersecurity: threat intelligence reports, dark web forums, vulnerability disclosures, social media discussions of breaches, and technical blogs analyzing adversary tactics. For security operations centers (SOCs), the challenge is no longer the scarcity of intelligence but its overwhelming abundance. This paper investigates the integration of Natural Language Processing (NLP) into cyber threat intelligence (CTI) workflows, focusing specifically on the distinct yet complementary roles of Data Analysts and Data Scientists in processing open-source intelligence (OSINT). Moving beyond traditional indicator-of-compromise (IOC) feeds, this study explores how NLP techniques—ranging from named entity recognition (NER) to large language model (LLM)-based summarization—enable the automated extraction, classification, and credibility assessment of threat information from public sources. Through a synthesis of academic literature, industry job architectures, and open-source software implementations from 2021 to 2025, this paper delineates the division of labor: Data Scientists engineer and fine-tune NLP models for domain-specific tasks such as attack technique extraction and threat actor attribution, while Data Analysts validate model outputs, perform manual triage of ambiguous intelligence, and translate computational insights into operational decisions. The paper concludes by addressing persistent challenges—including data quality, model hallucination, and the need for human-in-the-loop validation—and proposes an integrated operational framework for NLP-enhanced CTI.

Keywords: Cyber Threat Intelligence, Natural Language Processing, Open-Source Intelligence, Data Science, Named Entity Recognition, Large Language Models, Security Operations Center

1. Introduction

The discipline of Cyber Threat Intelligence (CTI) rests on a seemingly simple premise: to defend against adversaries, one must understand them. Understanding, in this context, requires the systematic collection, processing, analysis, and dissemination of information about threat actors, their capabilities, their motivations, and their indicators of compromise. Yet the execution of this premise has become extraordinarily complex. As Arazzi et al. (2023) observe in their comprehensive survey, CTI has gained a paramount role in cybersecurity precisely because of the significant increase in the variety and number of cyber-attacks and malware samples. Security practitioners now rely on CTI to promptly recognise attack indicators, collect information about attack methods, and respond accurately and in a timely manner.

CTI’s raw material is predominantly unstructured text. Threat reports published by vendors and government agencies, posts on dark web forums, discussions in hacker communities, vulnerability announcements, and social media commentary on breaches—all of these sources contain valuable intelligence. However, they are not designed for machine consumption. They are written in natural language, replete with jargon, implicit references, and narrative structures that resist automated processing. As noted by Arazzi et al. (2023), “much of the information is represented by unstructured text data, such as threat reports, social media posts, news articles, and hacker forums”. The volume is staggering: a single security analyst cannot read every relevant threat report published daily, let alone monitor dark web forums in real time.

This is where Natural Language Processing (NLP) enters the picture. NLP encompasses a suite of computational techniques for analyzing, understanding, and generating human language. In the CTI context, NLP promises to automate the extraction of structured intelligence from unstructured text: identifying malware names, extracting IP addresses, domains, and file hashes, mapping attack techniques to the MITRE ATT&CK framework, and even assessing the credibility of competing threat attributions. However, as this paper will argue, NLP is not a magic wand. It requires careful engineering, domain-specific adaptation, and—crucially—a clear division of labor between Data Scientists who build the models and Data Analysts who validate and operationalize their outputs.

The central thesis of this paper is that effective NLP-driven CTI depends on the institutionalization of two distinct data roles within security teams. The Data Scientist is responsible for the technical pipeline: selecting or fine-tuning models, preprocessing text, handling domain adaptation (e.g., training on cybersecurity corpora), and evaluating model performance. The Data Analyst, by contrast, serves as the human-in-the-loop: reviewing low-confidence extractions, manually annotating ambiguous cases for model retraining, correlating NLP outputs with other intelligence sources, and producing actionable briefs for SOC personnel and leadership.

This paper is structured as follows. Section II reviews the landscape of OSINT sources for CTI and the NLP techniques applied to them. Section III delineates the specific roles of Data Analysts and Data Scientists within the NLP-CTI pipeline, drawing on real-world job descriptions and open-source implementations. Section IV presents three case studies: named entity recognition for attack attribution, LLM-based attack technique identification, and credibility assessment of threat intelligence. Section V discusses persistent challenges—including data quality, model hallucination, and the limits of automation—followed by a conclusion on the future trajectory of human-AI collaboration in CTI.

2. The OSINT Landscape and NLP Techniques

2.1 Sources of Open-Source Cyber Threat Intelligence

Open-source intelligence (OSINT) for cybersecurity encompasses any publicly accessible information that can inform defensive operations. The diversity of sources is both a strength and a challenge. Arazzi et al. (2023) categorize CTI data sources into several major types:

  • Threat Intelligence Reports: Published by cybersecurity vendors (e.g., CrowdStrike, Mandiant, Kaspersky), government agencies (CISA, NSA, Europol), and industry groups (ISACs). These reports provide detailed analysis of specific threat actors, campaigns, or malware.
  • Social Media: Twitter (now X) and LinkedIn are used by security researchers to share indicators, discuss emerging threats, and announce vulnerabilities. The velocity of information on social media is high, but so is the noise.
  • Hacker Forums and Dark Web: Communities such as BreachForums and Russian-language dark web markets contain discussions of exploit development, credential sales, and attack planning. Accessing and parsing these sources requires technical sophistication.
  • Vulnerability Databases: NVD (National Vulnerability Database), CVE Details, and vendor security bulletins provide structured and semi-structured data on software vulnerabilities.
  • Technical Blogs and Paste Sites: Individual researchers publish analyses on platforms like Medium, GitHub, and personal blogs. Paste sites (Pastebin, Ghostbin) are used by threat actors to share stolen data or leaked credentials.

The challenge, as noted by industry practitioners, is that these sources produce data in different formats, at different velocities, and with varying levels of reliability. A job description for a Security Data Analyst specializing in NLP lists the target sources explicitly: “threat intelligence reports, dark web text, vulnerability bulletins, malicious code descriptions”. The analyst is expected to perform “text parsing, cleaning, and in-depth analysis to uncover hidden threat correlations, attack intent, and sensitive information”.
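The parsing and cleaning step typically begins with normalizing the “defanged” indicator notation that threat reports use to keep URLs and IPs non-clickable. A minimal sketch, assuming the common defanging conventions (the patterns here are illustrative, not drawn from the cited job posting):

```python
import re

def refang(text: str) -> str:
    """Normalize common 'defanged' notation used in threat reports
    (hxxp://, example[.]com, user[@]host) back to machine-readable form."""
    text = re.sub(r"hxxp(s?)://", r"http\1://", text, flags=re.IGNORECASE)
    text = text.replace("[.]", ".").replace("(.)", ".")
    text = text.replace("[:]", ":").replace("[@]", "@")
    return text

print(refang("Payload served from hxxps://evil[.]example[.]com/a.bin"))
# Payload served from https://evil.example.com/a.bin
```

Normalizing first means every downstream extractor sees one canonical spelling of each indicator instead of a defanged variant per source.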

2.2 Core NLP Techniques for CTI

The application of NLP to CTI spans several technical tasks, each with its own methodological requirements.

  • Named Entity Recognition (NER) is perhaps the most fundamental NLP technique for CTI. NER involves identifying and classifying entities in text into predefined categories. In the cybersecurity domain, these categories include malware names, threat actor aliases, attack techniques, tools, vulnerabilities, and indicators of compromise (IP addresses, domains, file hashes). The AttackER dataset, introduced by Deka et al. (2024), represents a significant advance in this area. The authors created the first dataset specifically designed for cyber-attack attribution using NER, incorporating 18 distinct entity types based on the STIX 2.1 framework. They divided certain entities into sub-classes to remove ambiguity: for example, “Tools” are split into “GENERAL_TOOLS” and “ATTACK_TOOLS” to distinguish between legitimate software and malicious tools.
  • Text Classification is used to categorize threat reports by type (e.g., malware analysis, campaign report), severity, or relevance to specific sectors. As Arazzi et al. (2023) note, supervised classification is one of the primary techniques used for CTI extraction.
  • Relation Extraction goes beyond identifying individual entities to understanding the relationships between them. For example, a threat report might state that “APT28 used Zebrocy malware against diplomatic targets.” An NER system would identify “APT28” as a threat actor, “Zebrocy” as malware, and “diplomatic targets” as a victim sector. A relation extraction system would additionally capture that APT28 used Zebrocy, and that the attack targeted diplomatic entities.
  • Summarization addresses the volume problem: generating concise abstracts of lengthy threat reports so that analysts can triage them efficiently. This can be extractive (selecting key sentences from the original text) or abstractive (generating novel paraphrases).
  • Credibility Assessment is an emerging task. Wu et al. (2025) introduced KGV (Knowledge Graph-based Verifier), the first framework integrating LLMs with knowledge graphs for automated CTI credibility assessment. Their system constructs paragraph-level semantic graphs where nodes represent text segments connected through similarity analysis, then evaluates the consistency and plausibility of threat intelligence claims. The authors also created and publicly released CTI-200, the first dataset specifically for credibility assessment, distinct from existing datasets that focus on identification rather than evaluation.
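For the simplest indicator classes, pattern matching offers a useful baseline before any learned NER model is applied. A minimal sketch with deliberately simplified patterns (contextual entities such as malware families or threat actor aliases still require a trained model like those discussed above):

```python
import re

# Illustrative patterns for regex-friendly IOC classes only.
IOC_PATTERNS = {
    "ipv4":   r"\b(?:\d{1,3}\.){3}\d{1,3}\b",
    "md5":    r"\b[a-fA-F0-9]{32}\b",
    "sha256": r"\b[a-fA-F0-9]{64}\b",
    "cve":    r"\bCVE-\d{4}-\d{4,7}\b",
}

def extract_iocs(text: str) -> dict:
    """Return {ioc_type: [matches]} for simple indicator classes."""
    return {name: re.findall(pat, text) for name, pat in IOC_PATTERNS.items()}

report = ("APT28 dropped a payload (MD5 0f343b0931126a20f133d67c2b018a3b) "
          "from 192.168.10.5 exploiting CVE-2023-23397.")
print(extract_iocs(report))
```

The word-boundary anchors keep a 32-character prefix of a SHA-256 hash from being misread as an MD5; note that the regex layer knows nothing about “APT28,” which is exactly the gap NER models fill.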

2.3 The Large Language Model Revolution

The release of transformer-based models (Vaswani et al., 2017) and subsequent large language models (LLMs) such as GPT-4, Llama, and Mistral has transformed NLP for CTI. Unlike earlier machine learning approaches that required task-specific training data for each new application, LLMs can perform zero-shot or few-shot learning on novel tasks. However, their application to cybersecurity is not straightforward.

The CyberSOCEval benchmark, released by Meta and CrowdStrike, provides sobering data on current LLM capabilities in SOC environments. According to the evaluation, current AI systems achieve accuracy scores ranging from approximately 15% to 28% on malware analysis tasks and 43% to 53% on threat intelligence reasoning. The benchmark includes 609 malware analysis questions and 588 threat intelligence questions, evaluating models on JSON logs, MITRE ATT&CK mappings, and complex attack chains. Critically, the researchers found that “reasoning models leveraging test-time scaling did not demonstrate the performance improvements observed in coding and mathematics domains,” suggesting that cybersecurity reasoning requires specialized training.

Similarly, Neri et al. (2025) evaluated CTI extraction methods for identifying attack techniques from threat reports using the MITRE ATT&CK framework. They found “significant challenges, including class imbalance, overfitting, and domain-specific complexity, which impede accurate technique extraction”. Their proposed solution—a two-step pipeline combining LLM summarization with a retrained SciBERT model on an augmented dataset—achieved improvements in F1-scores, with several attack techniques surpassing 0.90.

These findings have direct implications for role definition. If state-of-the-art LLMs achieve only 15-28% accuracy on malware analysis, then fully automated CTI is not viable. Human analysts—specifically, Data Analysts with cybersecurity domain knowledge—must remain in the loop to validate, correct, and supplement model outputs.

3. Role Delineation: Data Scientist vs. Data Analyst in NLP-CTI

The successful deployment of NLP in CTI requires a clear understanding of who does what. Confusion between the roles of Data Scientist and Data Analyst leads to inefficient pipelines, frustrated practitioners, and—most critically—missed threats. This section delineates the responsibilities of each role based on job architecture analyses from industry and government sources.

3.1 The NLP Data Scientist: Model Architect and Engineer

The Data Scientist in the NLP-CTI pipeline is responsible for the technical infrastructure that transforms raw text into structured intelligence. This role requires deep expertise in machine learning, NLP architectures, and software engineering, combined with sufficient cybersecurity knowledge to understand the domain relevance of different extraction tasks.

According to job postings for security-focused NLP roles, the Data Scientist is expected to:

  • Fine-tune transformer models for cybersecurity-specific tasks using frameworks such as Hugging Face Transformers and PyTorch. This includes domain adaptation: taking a general-purpose model like BERT or RoBERTa and continuing its training on cybersecurity corpora (threat reports, forum posts, vulnerability descriptions).
  • Implement core NLP techniques including text classification, named entity recognition, relation extraction, text summarization, semantic matching, and intent recognition.
  • Track and evaluate emerging technologies such as large model fine-tuning, prompt engineering, text provenance, and multimodal intelligence fusion.
  • Design validation frameworks to measure model performance using appropriate metrics (precision, recall, F1-score) with particular attention to class imbalance—a persistent problem in CTI where certain threat actors or attack techniques are vastly overrepresented in training data.
  • Deploy models in production environments with attention to latency, scalability, and security. The threat-intelligence-pipeline open-source project, for example, implements a complete Flask-based backend with scikit-learn for threat classification, NLTK and TextBlob for NLP processing, and real-time WebSocket communication.
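The class-imbalance concern in the validation bullet above can be made concrete with a from-scratch per-class F1 computation (the technique IDs and counts are invented for illustration):

```python
def per_class_f1(y_true, y_pred):
    """Per-class F1 computed from scratch. With imbalanced CTI labels,
    per-class (or macro-averaged) F1 exposes weak minority-class
    performance that raw accuracy hides."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[c] = round(f1, 3)
    return scores

# Nine samples of a common technique, one rare one the model always misses:
y_true = ["T1059"] * 9 + ["T1055"]
y_pred = ["T1059"] * 10
print(per_class_f1(y_true, y_pred))  # accuracy is 0.9, but T1055 F1 is 0.0
```

A model that never predicts the rare technique still scores 90% accuracy, which is why the evaluation frameworks described above report per-class metrics rather than a single aggregate.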

A Data Scientist position supporting information operations at Leidos/NSS further elaborates these requirements, emphasizing the need to “collect, ingest, and preprocess multi-source data sets from HUMINT, SIGINT, OSINT, open-source platforms, social media, sensor data, and commercial datasets” and to “apply advanced analytical techniques—including machine learning (ML), natural language processing (NLP), deep learning (DL), and predictive modeling—to extract operational insights”.

Crucially, the Data Scientist is not expected to be a cybersecurity practitioner. They need sufficient domain knowledge to understand what entities are relevant and how to evaluate model outputs, but their primary expertise is computational. As one job description notes, the ideal candidate has “security domain knowledge (such as threat intelligence frameworks, attack chain models)” as a requirement—but not necessarily SOC operations experience.

3.2 The Data Analyst: Validator, Interpreter, and Bridge

If the Data Scientist builds the engine, the Data Analyst ensures it is driving in the right direction. The Analyst’s role is fundamentally interpretive and evaluative. They work with the outputs of the Data Scientist’s models—NER extractions, classification labels, summarizations—and perform several critical functions.

First, the Analyst validates model outputs. NLP models, even state-of-the-art LLMs, make errors. They miss entities, misclassify threat actors, and hallucinate relationships that do not exist in the source text. The Analyst reviews low-confidence extractions, corrects errors, and provides feedback for model retraining. In the COREII Scout system developed at Idaho National Laboratory, analysts review and classify entities extracted via NER using the “Attack Chain Estimator (ACE),” adding their comments before an LLM generates a final report. This human-in-the-loop validation is not a luxury; it is a necessity given current model limitations.

Second, the Analyst performs manual triage of ambiguous intelligence. Not all extracted information is equally valuable or credible. The Analyst applies domain expertise to assess whether a reported indicator is likely to be a false positive, whether an attribution claim is plausible given known adversary behavior, and whether a vulnerability disclosure is relevant to the organization’s specific technology stack.

Third, the Analyst translates computational outputs into operational decisions. A dashboard showing NER extractions is not actionable. The Analyst produces briefings for SOC personnel, threat hunters, and leadership that answer questions like: “What new indicators should we block?” “Which threat actors are actively targeting our sector?” “What vulnerabilities require immediate patching?”

Fourth, the Analyst performs qualitative validation through random sampling. As noted in the DataCamp industry report, a best practice is to “randomly sample tweets that the model classified as ‘highly negative’ but with low confidence, reading them manually, and identifying recurring patterns that the model fails to capture”. In the CTI context, this means sampling threat reports where the model showed low confidence in its extractions and identifying systematic errors—e.g., the model consistently misses a particular threat actor’s new alias.
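The review-and-sampling workflow described above can be sketched as a simple routing rule; the threshold, sample rate, and field names below are illustrative assumptions, not values from any cited system:

```python
import random

def triage(extractions, threshold=0.75, sample_rate=0.05, rng=None):
    """Route each model extraction to automated ingestion or to the
    analyst review queue: everything below the confidence threshold is
    reviewed, plus a random sample of high-confidence items so that
    systematic errors in 'confident' outputs are still caught."""
    rng = rng or random.Random(0)
    auto, review = [], []
    for item in extractions:
        if item["confidence"] < threshold or rng.random() < sample_rate:
            review.append(item)
        else:
            auto.append(item)
    return auto, review

items = [
    {"entity": "Zebrocy", "type": "malware", "confidence": 0.96},
    {"entity": "Fancy Bear", "type": "threat_actor", "confidence": 0.52},
]
auto, review = triage(items)
```

The random sample of high-confidence items is what makes the qualitative spot-check possible: without it, the Analyst only ever sees the cases the model already knows it is unsure about.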

The job requirements for security Data Analysts reflect this hybrid of technical and domain skills. A posting for a “Security Data Analyst – NLP Direction” lists “OSINT natural language data processing in the security field” as the primary responsibility, with emphasis on “text parsing, cleaning, and in-depth analysis”. The Analyst must be proficient in SQL, data visualization (Tableau, Power BI), and statistical testing—but the distinguishing competence is the ability to “uncover hidden threat correlations, attack intent, and sensitive information”.

3.3 Comparative Summary

| Feature | NLP Data Scientist (CTI) | NLP Data Analyst (CTI) |
|---|---|---|
| Primary Output | Fine-tuned models, inference pipelines, and entity extraction systems | Validated intelligence, dashboards, actionable briefs |
| Core Tools | PyTorch, Hugging Face Transformers, scikit-learn, spaCy, NLTK | SQL, Pandas, Tableau, Python (for analysis, not modelling) |
| Statistical Focus | Model evaluation (precision, recall, F1, AUC), bias detection | Descriptive statistics, hypothesis testing, and confidence assessment |
| Domain Knowledge | Cybersecurity fundamentals (MITRE ATT&CK, threat frameworks) | SOC operations, threat actor behavior, and intelligence analysis |
| Typical Question | “How can we improve recall for APT29-related NER?” | “Is this extracted IOC credible given the source and context?” |
| Validation Role | Cross-validation, test set evaluation, and ablation studies | Manual review of low-confidence outputs, error correction |


4. Case Studies in NLP-Driven CTI

4.1 Case Study: Named Entity Recognition for Attack Attribution

Attack attribution—determining the identity or location of an attacker—is one of the most challenging tasks in cybersecurity. It is also one of the most information-intensive, requiring analysts to synthesize data from multiple sources and recognize patterns across disparate reports. Deka et al. (2024) argue that “cybersecurity experts perform the attribution process manually, as currently there is no tool that can automate or provide support for such a complex process”.

Their contribution, the AttackER dataset, is designed to fill this gap. Using NER, the authors annotated cybersecurity texts with 18 entity types, including threat actor names, malware families, attack tools, victim identities, and attack motivations. They then trained transformer-based models (using Hugging Face and spaCy) and fine-tuned LLMs (GPT-3.5, Llama-2, Mistral-7B) on the dataset.

The results were positive, particularly for LLMs, demonstrating that fine-tuned models can automatically extract attribution-relevant entities from text. However, the authors emphasize that their work is “the first step in using AI and NLP techniques to support and in the future automate the attribution process”. In the current state, the Data Scientist builds the model; the Data Analyst reviews its extractions, particularly for high-stakes attribution decisions where false attribution could have diplomatic or legal consequences.

4.2 Case Study: LLM-Based Attack Technique Identification

Identifying the specific techniques used in an attack (e.g., T1055 Process Injection, T1112 Modify Registry) is essential for defensive response and threat hunting. However, mapping narrative threat reports to the MITRE ATT&CK framework is labor-intensive.

Neri et al. (2025) evaluated multiple configurations for this task, including the Threat Report ATT&CK Mapper (TRAM) and open-source LLMs such as Llama 2. Their findings revealed significant challenges: class imbalance (some techniques appear in many reports, others rarely), overfitting, and domain-specific complexity. Their proposed solution—a two-step pipeline where an LLM first summarizes the report and a retrained SciBERT model then performs the classification on an augmented dataset—achieved F1-scores exceeding 0.90 for several techniques.
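The two-step structure can be expressed abstractly; the summarizer and classifier below are toy stand-ins to show the data flow, not the actual LLM and SciBERT models Neri et al. used:

```python
def two_step_pipeline(report_text, summarizer, classifier, top_k=3):
    """Sketch of the summarize-then-classify pattern: a summarizer
    condenses the report, then a fine-tuned classifier maps the summary
    to scored MITRE ATT&CK technique IDs. Both callables are stand-ins
    for the actual models."""
    summary = summarizer(report_text)
    scored = classifier(summary)            # [(technique_id, score), ...]
    scored.sort(key=lambda kv: kv[1], reverse=True)
    return [tid for tid, _ in scored[:top_k]]

# Toy stand-ins with hard-coded scores, purely to exercise the pipeline:
fake_summarizer = lambda text: text[:200]
fake_classifier = lambda s: [("T1566", 0.91), ("T1059", 0.40), ("T1027", 0.12)]
print(two_step_pipeline("Spearphishing email delivered a macro dropper...",
                        fake_summarizer, fake_classifier, top_k=2))
# ['T1566', 'T1059']
```

Summarizing first shrinks a multi-page narrative to the sentences that actually carry technique evidence, which is one plausible reason the two-step design mitigates the noise that hampers direct classification of full reports.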

The practical implication for role definition is that the Data Scientist is responsible for designing, implementing, and evaluating such pipelines. The Data Analyst, in turn, uses the pipeline’s outputs to prioritize threat hunting activities: “Focus on Technique X, which our model indicates is being actively used against organizations in our sector.”

4.3 Case Study: Credibility Assessment with KGV

Not all threat intelligence is equally trustworthy. Threat reports may contain errors, reflect analyst biases, or be deliberately misleading (e.g., false flag operations where one adversary mimics another’s techniques). Wu et al. (2025) address this problem with KGV (Knowledge Graph-based Verifier), which integrates LLMs with knowledge graphs for automated credibility assessment.

KGV constructs paragraph-level semantic graphs where nodes represent text segments connected through similarity analysis, then evaluates the consistency of claims. The system outperformed state-of-the-art fact reasoning methods on the CTI-200 dataset, achieving a 5.7% improvement in F1. Notably, compared to entity-based knowledge graphs for equivalent-length texts, KGV reduced node quantities by nearly two-thirds while improving precision by 1.7% and cutting response time by 46.7%.
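A drastically simplified sketch of the paragraph-graph idea, substituting plain token-level Jaccard similarity for KGV’s LLM-derived semantics (threshold and texts are illustrative):

```python
import itertools

def paragraph_graph(paragraphs, min_sim=0.2):
    """Build a paragraph-level similarity graph in the spirit of KGV:
    nodes are paragraphs; an edge links any pair whose token-set Jaccard
    similarity clears a threshold. KGV itself uses richer LLM-based
    semantics; Jaccard is a stand-in here."""
    tokens = [set(p.lower().split()) for p in paragraphs]
    edges = []
    for i, j in itertools.combinations(range(len(paragraphs)), 2):
        union = tokens[i] | tokens[j]
        sim = len(tokens[i] & tokens[j]) / len(union) if union else 0.0
        if sim >= min_sim:
            edges.append((i, j, round(sim, 2)))
    return edges

paras = ["APT28 deployed Zebrocy malware in the campaign",
         "the campaign used Zebrocy malware against embassies",
         "patch management guidance for Windows servers"]
print(paragraph_graph(paras))  # only the two related paragraphs connect
```

Because unrelated paragraphs simply fail to connect, a paragraph-level graph stays far sparser than an entity-level one, which is consistent with the node-count reduction the authors report.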

In an operational context, the Data Scientist implements KGV or a similar credibility assessment framework. The Data Analyst receives its outputs—confidence scores and inconsistency flags—and makes a judgment: trust this intelligence, distrust it, or investigate further.

5. Challenges and Limitations

5.1 Data Quality and Annotation Bottlenecks

NLP models are only as good as their training data. In the cybersecurity domain, high-quality labeled datasets are scarce. As Deka et al. (2024) note, “NER is underexplored in domains like cybersecurity due to the complex nature of the involved entities”. The AttackER dataset is a significant contribution precisely because it addresses this gap, but one dataset is insufficient for the diversity of CTI sources and tasks.

Moreover, annotation requires cybersecurity expertise. Annotators must distinguish between a legitimate tool and an attack tool, recognize threat actor aliases, and understand attack chain relationships. This expertise is expensive and scarce. The Data Analyst often serves as the annotator for model retraining, which creates a tension: the Analyst’s time is divided between operational intelligence work and data labeling.

5.2 Model Hallucination and the Limits of LLMs

LLMs are prone to hallucination—generating plausible-sounding but factually incorrect information. In the CTI context, this is unacceptable. A model that hallucinates a new indicator of compromise could lead a SOC to block legitimate traffic. A model that hallucinates an attribution could lead to misguided legal or diplomatic actions.

The CyberSOCEval results are instructive: current LLMs achieve only 15-28% accuracy on malware analysis tasks. This is not a criticism of the models; it is a recognition that cybersecurity reasoning is genuinely difficult and requires specialized knowledge that general-purpose LLMs do not possess. The implication is clear: fully automated CTI is not currently viable. The Data Analyst’s validation role is not a temporary workaround; it is a permanent necessity.
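One lightweight guard against hallucinated indicators is a grounding check: accept only extractions that appear verbatim in the source document and route everything else to an analyst. A minimal sketch (verbatim matching is a deliberately conservative, illustrative choice; real pipelines would also normalize defanging and whitespace first):

```python
def grounded_iocs(extracted, source_text):
    """Split model-extracted indicators into those that appear verbatim
    in the source document and those that do not; ungrounded items are
    possible hallucinations and require analyst review before any
    blocking action."""
    grounded, suspect = [], []
    for ioc in extracted:
        (grounded if ioc in source_text else suspect).append(ioc)
    return grounded, suspect

report = "C2 traffic observed to 203.0.113.7 over TCP 443."
ok, flagged = grounded_iocs(["203.0.113.7", "198.51.100.9"], report)
print(ok, flagged)  # ['203.0.113.7'] ['198.51.100.9']
```

A check like this cannot catch every error, but it guarantees the specific failure mode described above—an IOC the model invented out of whole cloth—never reaches a blocklist unreviewed.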

5.3 Adversarial Considerations

Threat actors are aware that defenders use NLP. They can adapt their language to evade detection: using novel malware names that are not in the model’s vocabulary, discussing attacks in code or encrypted channels, or deliberately planting false intelligence to mislead defenders. As Arazzi et al. (2023) note, adversarial attacks on CTI systems are a genuine concern. Defenders must assume that their NLP pipelines will face active evasion attempts.

5.4 Ethical and Legal Considerations

NLP-driven CTI raises privacy and legal questions. Monitoring dark web forums may involve observing communications of individuals who are not charged with any crime. Scraping social media for threat intelligence may violate platform terms of service or privacy expectations. The Data Scientist and Data Analyst must operate within legal and ethical boundaries, which vary by jurisdiction and organizational policy.

6. Conclusion

The integration of Natural Language Processing into cyber threat intelligence represents a maturing field with significant operational potential. This paper has argued that realizing this potential requires the institutionalization of two distinct data roles. The Data Scientist builds the models: fine-tuning transformers for NER, designing credibility assessment frameworks, and evaluating performance metrics. The Data Analyst validates the outputs: reviewing low-confidence extractions, assessing credibility, and translating computational insights into actionable intelligence for SOC operations.

The case studies examined—AttackER for attribution, LLM-based technique identification, and KGV for credibility assessment—demonstrate the technical progress that has been made. The benchmark results from CyberSOCEval remind us, however, that current models are far from perfect. Accuracy rates of 15-28% on malware analysis tasks mean that the human analyst is not an optional extra; the human is the center of the operation.

The future trajectory of NLP in CTI will likely involve tighter human-AI collaboration, not replacement. The Data Analyst will become a supervisor of multiple AI agents, intervening only when models signal low confidence or when the stakes of a decision are particularly high. The Data Scientist will focus on improving model robustness, reducing hallucination, and developing domain-specific architectures. Together, they form a partnership that is greater than the sum of its parts: the scale and speed of computation combined with the judgment and context-awareness of human expertise.

7. References

  • Arazzi, M., Arikkat, D. R., Nicolazzo, S., Nocera, A., Rehiman K. A., R., Vinod, P., & Conti, M. (2023). NLP-based techniques for cyber threat intelligence. arXiv preprint, arXiv:2311.08807.
  • Deka, P., Rajapaksha, S., Rani, R., Almutairi, A., & Karafili, E. (2024). AttackER: Towards enhancing cyber-attack attribution with a named entity recognition dataset. arXiv preprint, arXiv:2408.05149.
  • Meta & CrowdStrike. (2025). CyberSOCEval: Open source benchmark for LLMs in security operations center environments. Cyber Security News.
  • Neri, D., et al. (2025). Towards effective identification of attack techniques in cyber threat intelligence reports using large language models. arXiv preprint, arXiv:2505.03147.
  • Pluth, Kennedy, S., Ramos, E., Quach, A., Wen, S., & Manley, R. (2025). COREII Scout. OSTI.GOV, DOE Code 168739. https://doi.org/10.11578/dc.20251105.1
  • Threat Intelligence Pipeline. (2025). Real-time threat intelligence with NLP analysis and MITRE ATT&CK integration. GitHub repository. https://github.com/ghelaw01/threat-intelligence-pipeline
  • Wu, et al. (2025). KGV: Integrating large language models with knowledge graphs for cyber threat intelligence credibility assessment. arXiv preprint, arXiv:2408.08088.
  • Zhengzhou Boshi Talent. (2025). Security data analyst – NLP direction [Job description]. Liepin.com.
  • Yupao Direct. (2026). Security data analyst – NLP direction [Job description]. Yupao.com.
  • Leidos. (2025). Data scientist – Information operations [Job description]. ClearanceJobs.com.
