

Aziz Ozmen, PhD
aziz.ozmen@gc4ss.org
Senior Security Analyst
Global Center for Security Studies
Abstract
The digitization of global financial markets has produced an unprecedented deluge of high-frequency trading data, creating both opportunities and vulnerabilities. Among the most persistent threats to market integrity are insider trading and algorithmic fraud—activities that leave subtle but detectable traces in transactional data. This paper investigates the distinct yet complementary roles of Data Analysts and Data Scientists in the detection of anomalous market patterns, specifically focusing on unusual trading volumes and synchronized order placements. Moving beyond traditional rule-based surveillance systems, this study explores how time series analysis and isolation forest
algorithms enable the automated identification of suspicious activities that precede price-sensitive events. Through a synthesis of academic literature, industry job architectures, and empirical studies from 2019 to 2023, this paper delineates the division of labor: Data Scientists engineer and optimize unsupervised anomaly detection models, while Data Analysts validate model outputs, perform contextual investigations, and translate computational findings into actionable intelligence for compliance officers and regulators. The paper concludes by addressing persistent challenges—including false positive reduction, model explainability, and the evolving sophistication of fraudulent techniques—and proposes an integrated operational framework for financial market surveillance.
Keywords: Insider Trading Detection, Market Anomaly Detection, High-Frequency Trading, Isolation Forest, Time Series Analysis, Data Science, Financial Fraud Surveillance
The integrity of financial markets rests on a foundational principle: all investors should have equal access to material information. Insider trading—the illegal practice of trading securities based on non-public, material information—violates this principle, eroding public trust and distorting capital allocation. Similarly, algorithmic manipulation, including spoofing (placing orders with intent to cancel before execution) and layering (creating artificial demand),
undermines price discovery mechanisms. As markets have become increasingly electronic and high-frequency, the scale of transactional data has grown beyond the capacity of human surveillance alone.
According to Tony Sio, Head of Regulatory Strategy and Innovation at Nasdaq Anti-Financial Crime, "The problem around financial crime is finding patterns of abuse among huge data sets." Sio further notes that "AI, in particular the different styles and techniques that are coming across now, are incredibly suited to solve this problem." This observation was echoed by both
U.S. Securities and Exchange Commission Chair Gary Gensler and Financial Conduct Authority Chief Executive Nikhil Rathi in their 2023 speeches, with Gensler calling AI "the most transformative technology of our time" and noting its growing application in compliance, trading algorithms, and market surveillance functions.
The central thesis of this paper is that effective algorithmic anomaly detection for insider trading and fraud requires a clear division of labor between Data Scientists, who build and optimize the computational models, and Data Analysts, who validate outputs, provide domain context, and translate findings into actionable intelligence. While machine learning models—particularly unsupervised approaches like isolation forests—can identify statistical anomalies at scale, they cannot distinguish between a legitimate trading strategy and illegal insider activity without human validation.
This paper is structured as follows. Section II reviews the landscape of insider trading and market manipulation, including the characteristic signals they produce in trading data. Section III delineates the distinct roles of Data Analysts and Data Scientists within the anomaly detection pipeline, drawing on real-world job descriptions and industry implementations. Section IV presents the technical framework, focusing on time series analysis and isolation forest algorithms. Section V discusses the collaborative workflow, including model development, output validation, and investigation. Section VI addresses persistent challenges—false positives, model explainability, and adversarial adaptation—followed by a conclusion on the future trajectory of human-AI collaboration in market surveillance.
Insider trading occurs when individuals trade securities based on material, non-public information. While some forms of insider trading are legal (e.g., corporate executives trading according to pre-established 10b5-1 plans), illegal insider trading involves the exploitation of confidential information for personal gain or to avoid losses. As documented by Deng and colleagues in their study of the Chinese securities market, insider trading has existed since the birth of modern financial markets and remains a persistent enforcement challenge for regulators worldwide.
The challenge of detection lies in the signal-to-noise ratio. Legitimate trading occurs for myriad reasons: portfolio rebalancing, tax considerations, liquidity needs, and genuine informational advantages derived from public research. Illegal insider trading must be distinguished from this noisy background. As noted in a study by Deriu and colleagues (2022) on dimensionality reduction for insider trading detection, "Identification of market abuse is an extremely complicated activity that requires the analysis of large and complex datasets."
Characteristic signatures of potential insider trading include: sharp increases in trading volume in the days preceding a price-sensitive announcement, strongly directional positions established ahead of such events, departures from an investor's own historical trading baseline, and synchronized trading among accounts with no obvious connection.
Beyond traditional insider trading, the high-frequency trading era has given rise to new forms of algorithmic manipulation. Spoofing involves placing orders with the intent to cancel them before execution, creating a false impression of supply or demand to manipulate prices. Layering is a more sophisticated variant where multiple spoof orders are placed at different price levels to create an illusion of market depth.
These activities leave detectable patterns in order book data: orders that are consistently canceled within milliseconds of execution, patterns of order placement and cancellation that correlate with price movements in the opposite direction, and anomalous order-to-trade ratios. The detection of such patterns requires analysis at microsecond granularity, far beyond human capability.
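The order book signatures described above can be computed directly from an event log. The sketch below uses a toy log with an illustrative field layout (trader, order id, event type, millisecond timestamp; none of this is drawn from a real surveillance system) to derive per-trader order-to-trade ratios and the share of orders canceled within a few milliseconds of placement:

```python
# Minimal sketch: per-trader order-to-trade and fast-cancel ratios from an
# order event log. Field layout and the 50 ms "fast cancel" cutoff are
# illustrative assumptions.
from collections import defaultdict

# (trader, order_id, event, timestamp_ms) -- toy event log
events = [
    ("A", 1, "new", 0),  ("A", 1, "cancel", 2),    # canceled after 2 ms
    ("A", 2, "new", 10), ("A", 2, "cancel", 12),
    ("A", 3, "new", 20), ("A", 3, "trade", 500),
    ("B", 4, "new", 0),  ("B", 4, "trade", 900),
]

orders, trades, fast_cancels = defaultdict(int), defaultdict(int), defaultdict(int)
placed = {}  # order_id -> placement timestamp
for trader, oid, ev, ts in events:
    if ev == "new":
        orders[trader] += 1
        placed[oid] = ts
    elif ev == "trade":
        trades[trader] += 1
    elif ev == "cancel" and ts - placed[oid] < 50:  # canceled within 50 ms
        fast_cancels[trader] += 1

for t in sorted(orders):
    otr = orders[t] / max(trades[t], 1)   # order-to-trade ratio
    fcr = fast_cancels[t] / orders[t]     # share of orders fast-canceled
    print(t, round(otr, 2), round(fcr, 2))
```

A trader whose order-to-trade ratio and fast-cancel share are both far above the venue norm would be a natural candidate for spoofing review.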
Regulatory bodies have increasingly embraced AI and machine learning for market surveillance. Rathi stated that the FCA itself is "using AI methods to monitor portfolios and identify risk behaviors." Gensler similarly indicated that the SEC "could benefit from staff making greater use of AI in their market surveillance." This regulatory endorsement reflects a broader recognition that traditional rule-based surveillance systems—which trigger alerts based on fixed thresholds (e.g., volume exceeds 200% of 30-day average)—are inadequate for detecting sophisticated abuse. As one industry analysis notes, these legacy systems have historically had difficulty "keeping pace with expanding trading volumes."
The successful deployment of algorithmic anomaly detection for market surveillance requires a clear understanding of the distinct contributions of Data Scientists and Data Analysts. This section delineates these roles based on job architecture analyses, academic literature, and industry implementations.
The Data Scientist in financial market surveillance is responsible for the technical infrastructure that transforms raw trading data into actionable anomaly signals. According to a job posting for a Principal Data Scientist at BNY Mellon's Financial Crimes Risk group—described as a "founding member" opportunity in a newly formed data science and decision support team—the Data Scientist is expected to "build an Augmented Intelligence platform for risk probabilities across a series of key indicating metrics" using "classification and anomaly detection techniques to perform statistical, econometric, and drill-down analysis underpinned by machine-learning models."
The core technical responsibilities of the Data Scientist in this domain include engineering features from raw order and trade data, selecting and tuning unsupervised anomaly detection algorithms such as isolation forests, building and maintaining the inference pipelines that score incoming activity, and evaluating model performance through cross-validation and backtesting.
Crucially, the Data Scientist is not expected to be a compliance expert or forensic investigator. Their expertise is computational and statistical. However, as the BNY Mellon posting indicates, "deep understanding of financial engineering and mathematical finance concepts" is highly desirable, as is the ability to "present research findings to a wide range of audiences."
If the Data Scientist builds the detection engine, the Data Analyst ensures it is targeting the right threats and that its outputs are actionable. The Analyst's role is fundamentally interpretive and investigative.
Drawing on general frameworks for data analytics roles, the financial Data Analyst's responsibilities center on validating the outputs of the Data Scientist's models: distinguishing between true positives (suspicious activity warranting investigation) and false positives (legitimate trading that merely appears anomalous). This requires domain expertise: understanding what constitutes normal trading behavior for specific securities, market conditions, and investor types.
| Feature | Financial Data Scientist | Financial Data Analyst |
|---|---|---|
| Primary Output | Anomaly detection models, inference pipelines, risk scores | Investigated cases, dashboards, regulatory referrals |
| Core Tools | Python (scikit-learn, PyTorch), Spark, time series databases | SQL, Tableau/PowerBI, surveillance platforms |
| Statistical Focus | Model evaluation (precision, recall, F1), hyperparameter tuning | Descriptive statistics, hypothesis testing, pattern recognition |
| Domain Knowledge | Machine learning, time series analysis, financial engineering | Market microstructure, trading regulations, investigative procedures |
| Typical Question | "How can we improve recall for spoofing detection?" | "Is this flagged pattern legitimate trading or manipulation?" |
| Validation Role | Cross-validation, backtesting, performance monitoring | Manual review of anomalies, false positive identification |
Time series analysis forms the foundation of algorithmic market surveillance. Trading data is inherently temporal: each transaction and order carries a timestamp, and patterns of interest unfold over time. As noted in a comprehensive review by Blázquez-García and colleagues (2021), time series anomaly detection encompasses three primary categories: point anomalies (individual data points deviating from expected patterns), contextual anomalies (points anomalous in specific contexts but not globally), and collective anomalies (sequences of points that together form anomalous patterns).
For insider trading detection, contextual anomalies are particularly relevant. A large trade that would be normal during high-volume periods may be anomalous during quiet trading hours.
Similarly, a pattern of order cancellations that would be unremarkable for a high-frequency market maker may be highly suspicious for a retail investor.
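A simple way to operationalize contextual anomalies is to score each observation against a rolling local baseline rather than a global one. The sketch below does this for synthetic daily volumes; the 30-day window and 4-sigma threshold are arbitrary illustrative choices, not calibrated surveillance parameters:

```python
# Sketch: flagging contextually anomalous daily volumes with a rolling
# z-score. Window length and threshold are illustrative assumptions.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
volume = pd.Series(rng.normal(1_000_000, 100_000, 250))  # synthetic daily volumes
volume.iloc[200] = 2_500_000  # injected spike on day 200

# Baseline built from the *preceding* 30 days only; shift(1) keeps the
# current day out of its own baseline.
baseline_mean = volume.rolling(30).mean().shift(1)
baseline_std = volume.rolling(30).std().shift(1)
zscore = (volume - baseline_mean) / baseline_std

flags = zscore.abs() > 4
print(flags[flags].index.tolist())  # the injected spike should be flagged
```

The same volume flagged here might be unremarkable in a higher-volume regime, which is exactly the contextual distinction a fixed global threshold misses.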
Chen and colleagues (2024) proposed a novel approach for insider trading detection in industrial chains using "logistics time interval characteristics" and a dynamic sliding window method. Their approach recognizes that the temporal span of suspicious trading can vary significantly depending on the nature of the non-public information and the trading strategy employed. By using an adaptive window that expands or contracts based on detected patterns, their method achieved a 20.68% improvement in F1 score compared to standard isolation forest implementations.
Isolation Forest, introduced by Liu, Ting, and Zhou (2008), has emerged as a leading algorithm for anomaly detection in financial surveillance. Unlike traditional methods that attempt to model "normal" behavior and identify deviations, isolation forest explicitly isolates anomalies by exploiting two key properties: anomalies are few and they are different.
The algorithm works by recursively partitioning the feature space using random splits.
Anomalies are "easier to isolate" than normal points—they require fewer splits to be separated into their own region. The path length from root to leaf serves as an anomaly score: shorter paths indicate higher anomaly likelihood.
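As a concrete illustration, scikit-learn's IsolationForest implements this scheme. The sketch below runs it on synthetic two-dimensional features (an assumed volume z-score and order-to-trade ratio; both the features and the contamination rate are illustrative choices):

```python
# Minimal sketch: isolation forest on synthetic trading features.
# Feature semantics and the 1% contamination rate are assumptions.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Synthetic "normal" records: [volume z-score, order-to-trade ratio]
normal = rng.normal(loc=[0.0, 2.0], scale=[1.0, 0.5], size=(1000, 2))
# A few extreme records: heavy volume, cancellation-dominated behavior
anomalies = np.array([[6.0, 9.0], [7.5, 8.0], [5.5, 10.0]])
X = np.vstack([normal, anomalies])

model = IsolationForest(n_estimators=200, contamination=0.01, random_state=0)
model.fit(X)

# decision_function: lower scores correspond to shorter average path
# lengths (easier to isolate); predict() returns -1 for flagged points.
labels = model.predict(anomalies)
print(labels)  # the injected outliers should come back as -1
```

Note that the model never sees labels: the injected points are flagged purely because random splits isolate them quickly.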
The suitability of isolation forest for financial fraud detection stems from several characteristics: it runs in linear time with low memory requirements, it requires no labeled training data, it makes no distributional assumptions about "normal" behavior, and it scales well to the high-dimensional feature spaces typical of order book data.
A study on fraud detection in carbon emission allowance markets compared multiple unsupervised methods including Isolation Forest, One-Class SVM, Autoencoder, DBSCAN, LOF, K-Means, Elliptic Envelope, and PCA. The researchers found that "models such as Isolation Forest and Elliptic Envelope significantly improve trading outcomes, with notable increases in net profit and win rate while reducing drawdown." This finding demonstrates that anomaly detection can be integrated directly into trading strategies to avoid manipulated markets.
Several recent studies have advanced the application of unsupervised learning to insider trading detection. Deriu and colleagues (2022, 2024) proposed two complementary unsupervised methods for market surveillance:
The first method uses "clustering to identify, in the vicinity of a price sensitive event such as a takeover bid, discontinuities in the trading activity of an investor with respect to his/her own past trading history and on the present trading activity of his/her peers." This approach captures both temporal anomalies (deviation from one's own baseline) and social anomalies (deviation from peer behavior).
The second method aims to identify "small groups of investors that act coherently around price sensitive events, pointing to potential insider rings, i.e. a group of synchronized traders displaying strong directional trading in rewarding position in a period before the price sensitive event." This approach is particularly important because insider trading often involves coordinated activity among multiple individuals—family members, business associates, or professional networks.
The researchers applied their methods to "investor resolved data of Italian stocks around takeover bids" as a case study, demonstrating the practical applicability of unsupervised approaches to real-world surveillance challenges.
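The two deviation axes used by these methods (deviation from one's own history, deviation from peers) can be illustrated with a much simpler numpy sketch. This is not the authors' implementation, and all data below is synthetic; it only shows how combining the two z-scores narrows the flagged set:

```python
# Illustrative sketch (NOT the Deriu et al. method): flag investors whose
# pre-event trading deviates both from their own baseline and from peers.
# Matrix shape, thresholds, and the injected "insiders" are all synthetic.
import numpy as np

rng = np.random.default_rng(1)
n_investors, n_windows = 200, 50  # the last window precedes the event
activity = rng.normal(0, 1, (n_investors, n_windows))  # net directional volume
activity[:3, -1] = 6.0  # three "insiders" trade heavily before the event

pre_event = activity[:, -1]
own_hist = activity[:, :-1]

# Temporal anomaly: deviation from the investor's own historical baseline
z_own = (pre_event - own_hist.mean(axis=1)) / own_hist.std(axis=1)
# Social anomaly: deviation from the peer cross-section in the same window
z_peer = (pre_event - pre_event.mean()) / pre_event.std()

flagged = np.where((z_own > 3) & (z_peer > 2))[0]
print(flagged)  # should include the three injected insiders
```

Requiring both conditions is what suppresses, for example, a volatile momentum trader (anomalous versus peers but not versus their own history).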
The collaborative process begins with the Data Scientist developing the anomaly detection model. This involves engineering features from raw order and trade data, selecting an algorithm suited to the problem (such as isolation forest or a time series method), tuning hyperparameters, and backtesting against historical cases of confirmed abuse.
Once the model is deployed, the Analyst takes primary responsibility: triaging the anomalies it flags, investigating each alert in context (news flow, corporate filings, the trader's history and peer behavior), documenting false positives, and escalating substantiated cases to compliance officers or regulators.
Both roles contribute to iterative model improvement: the Analyst's dispositions of alerts become labeled feedback, which the Data Scientist uses to retune features, thresholds, and algorithms as trading behavior and evasion techniques evolve.
False positives—legitimate trading flagged as suspicious—are the bane of automated
surveillance systems. A high false positive rate overwhelms Analysts, leading to alert fatigue and potentially causing genuine threats to be missed. The Optuna-optimized model described by
Priyadarshi and Kumar (2024) reduced false positives by 20%, but this still leaves substantial room for improvement.
False positives arise from multiple sources: legitimate trading strategies that mimic suspicious patterns (e.g., momentum trading before earnings), market-wide events that affect many securities simultaneously, and data quality issues that introduce spurious signals. The Analyst's role in distinguishing these from genuine threats is irreplaceable.
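One lever the two roles share for managing false positives is the alert threshold itself. The sketch below, on synthetic data with analyst-reviewed labels standing in as ground truth, sweeps the anomaly-score threshold and reports the resulting precision/recall trade-off (all quantities here are illustrative, not production settings):

```python
# Hedged sketch: threshold sweep on isolation forest scores to trade off
# false positives (precision) against missed cases (recall).
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (980, 3)), rng.normal(5, 1, (20, 3))])
y_true = np.array([0] * 980 + [1] * 20)  # 1 = analyst-confirmed suspicious

# Negate decision_function so that higher score = more anomalous
scores = -IsolationForest(random_state=0).fit(X).decision_function(X)

for q in (0.95, 0.98, 0.99):
    thresh = np.quantile(scores, q)
    y_pred = (scores > thresh).astype(int)
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    print(f"alert on top {100 * (1 - q):.0f}%: precision={p:.2f} recall={r:.2f}")
```

Raising the threshold cuts the Analyst's alert queue (fewer false positives) at the cost of recall, which is precisely the trade-off that alert fatigue forces surveillance teams to confront.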
Machine learning models, particularly deep neural networks, are often "black boxes"—they produce outputs without revealing how those outputs were derived. This poses a challenge for market surveillance, where investigators may need to explain, in legal proceedings, why a particular trade was flagged as suspicious.
The explainability challenge is noted in the regulatory context: Sio observed that one potential concern with AI surveillance is "explainability, for example in that brokers who are obligated to act in the best interest of their clients will have to prove AI models do that." This suggests that interpretable models (e.g., decision trees, logistic regression) may be preferred in some contexts despite potentially lower performance.
Sophisticated insider traders adapt their behavior to evade detection. If a particular pattern becomes known to be monitored, traders may shift to alternative strategies. The detection models must therefore be continuously updated—a task requiring ongoing collaboration between Data Scientists (who develop new algorithms) and Analysts (who identify emerging evasion techniques).
Market surveillance operates within legal frameworks that constrain data collection and analysis. Privacy regulations may limit the use of personally identifiable information; evidentiary standards may require that flagged activities meet specific thresholds before referral. The Analyst must navigate these constraints while ensuring that legitimate investigations proceed.
The detection of insider trading and algorithmic manipulation in modern financial markets is fundamentally a data problem. The volume, velocity, and complexity of trading data exceed human analytical capacity, necessitating computational approaches. Yet the problem is not solvable by algorithms alone. The low base rate of illegal activity, the subtlety of suspicious signals, and the high cost of false positives require human judgment at critical junctures.
This paper has argued that effective algorithmic market surveillance requires a clear division of labor between Data Scientists and Data Analysts. The Data Scientist builds the technical infrastructure: engineering features, selecting and optimizing anomaly detection algorithms (particularly isolation forests and time series methods), and evaluating model performance. The Data Analyst validates outputs, conducts deep investigations of flagged anomalies, and translates computational findings into actionable intelligence for compliance and enforcement.
The academic literature reviewed—from the Optuna-optimized CNN achieving 87.5% accuracy on insider trading detection to the clustering-based identification of insider rings—demonstrates significant technical progress. The regulatory endorsements from the SEC and FCA suggest that AI-based surveillance is moving from experimental to operational. However, the persistent challenges of false positives, model explainability, and adversarial adaptation ensure that the human analyst will remain central to market integrity efforts for the foreseeable future.
The future trajectory will likely involve tighter integration of human and machine intelligence: Analysts training models through their annotations, models learning from Analyst feedback, and systems that adaptively allocate attention based on risk. The goal is not to replace human judgment but to augment it—allowing Analysts to focus their expertise on the most promising leads while machines handle the volume.