Silent Signals: The Role of Data Analysts and Data Scientists in Algorithmic Market Anomaly Detection for Fraud and Insider Trading Identification


Aziz Ozmen, PhD
aziz.ozmen@gc4ss.org

Senior Security Analyst
Global Center for Security Studies

Abstract

The digitization of global financial markets has produced an unprecedented deluge of high-frequency trading data, creating both opportunities and vulnerabilities. Among the most persistent threats to market integrity are insider trading and algorithmic fraud—activities that leave subtle but detectable traces in transactional data. This paper investigates the distinct yet complementary roles of Data Analysts and Data Scientists in the detection of anomalous market patterns, specifically focusing on unusual trading volumes and synchronized order placements. Moving beyond traditional rule-based surveillance systems, this study explores how time series analysis and isolation forest algorithms enable the automated identification of suspicious activities that precede price-sensitive events. Through a synthesis of academic literature, industry job architectures, and empirical studies from 2019 to 2023, this paper delineates the division of labor: Data Scientists engineer and optimize unsupervised anomaly detection models, while Data Analysts validate model outputs, perform contextual investigations, and translate computational findings into actionable intelligence for compliance officers and regulators. The paper concludes by addressing persistent challenges—including false positive reduction, model explainability, and the evolving sophistication of fraudulent techniques—and proposes an integrated operational framework for financial market surveillance.

Keywords: Insider Trading Detection, Market Anomaly Detection, High-Frequency Trading, Isolation Forest, Time Series Analysis, Data Science, Financial Fraud Surveillance

1.   Introduction

The integrity of financial markets rests on a foundational principle: all investors should have equal access to material information. Insider trading—the illegal practice of trading securities based on non-public, material information—violates this principle, eroding public trust and distorting capital allocation. Similarly, algorithmic manipulation, including spoofing (placing orders with intent to cancel before execution) and layering (creating artificial demand), undermines price discovery mechanisms. As markets have become increasingly electronic and high-frequency, the scale of transactional data has grown beyond the capacity of human surveillance alone.

According to Tony Sio, Head of Regulatory Strategy and Innovation at Nasdaq Anti-Financial Crime, "The problem around financial crime is finding patterns of abuse among huge data sets." Sio further notes that "AI, in particular the different styles and techniques that are coming across now, are incredibly suited to solve this problem." This observation was echoed by both U.S. Securities and Exchange Commission Chair Gary Gensler and Financial Conduct Authority Chief Executive Nikhil Rathi in their 2023 speeches, with Gensler calling AI "the most transformative technology of our time" and noting its growing application in compliance, trading algorithms, and market surveillance functions.

The central thesis of this paper is that effective algorithmic anomaly detection for insider trading and fraud requires a clear division of labor between Data Scientists, who build and optimize the computational models, and Data Analysts, who validate outputs, provide domain context, and translate findings into actionable intelligence. While machine learning models—particularly unsupervised approaches like isolation forests—can identify statistical anomalies at scale, they cannot distinguish between a legitimate trading strategy and illegal insider activity without human validation.

This paper is structured as follows. Section II reviews the landscape of insider trading and market manipulation, including the characteristic signals they produce in trading data. Section III delineates the distinct roles of Data Analysts and Data Scientists within the anomaly detection pipeline, drawing on real-world job descriptions and industry implementations. Section IV presents the technical framework, focusing on time series analysis and isolation forest algorithms. Section V discusses the collaborative workflow, including model development, output validation, and investigation. Section VI addresses persistent challenges—false positives, model explainability, and adversarial adaptation—followed by a conclusion on the future trajectory of human-AI collaboration in market surveillance.

2.   The Landscape of Insider Trading and Market Anomalies

2.1   Defining Insider Trading and Its Signatures

Insider trading occurs when individuals trade securities based on material, non-public information. While some forms of insider trading are legal (e.g., corporate executives trading according to pre-established 10b5-1 plans), illegal insider trading involves the exploitation of confidential information for personal gain or to avoid losses. As documented by Deng and colleagues in their study of the Chinese securities market, insider trading has existed since the birth of modern financial markets and remains a persistent enforcement challenge for regulators worldwide.

The challenge of detection lies in the signal-to-noise ratio. Legitimate trading occurs for myriad reasons: portfolio rebalancing, tax considerations, liquidity needs, and genuine informational advantages derived from public research. Illegal insider trading must be distinguished from this noisy background. As noted in a study by Deriu and colleagues (2022) on dimensionality reduction for insider trading detection, "Identification of market abuse is an extremely complicated activity that requires the analysis of large and complex datasets”.

Characteristic signatures of potential insider trading include:

  • Abnormal Trading Volume: A sudden, statistically significant increase in trading activity for a specific security, particularly in the days or weeks preceding a price-sensitive event (e.g., earnings announcement, merger bid, regulatory decision).
  • Synchronized Trading: Multiple accounts exhibiting highly correlated trading patterns—same security, same direction (buy or sell), similar timing—suggesting coordination based on shared non-public information.
  • Unusual Profitability: Trades that consistently generate abnormal returns relative to market benchmarks, particularly when the trader has no apparent informational advantage.
  • Timing Anomalies: Trading activity that occurs immediately before price-moving news, with the timing suggesting foreknowledge of the announcement.
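The abnormal-volume signature above can be screened for with a simple rolling z-score. The sketch below uses synthetic data and an arbitrary 4-sigma cutoff; both the schema and the threshold are illustrative assumptions, not a production rule:

```python
import numpy as np
import pandas as pd

# Hypothetical daily share volumes for one security: 60 quiet days, with a
# spike on the final day ahead of an assumed earnings announcement.
rng = np.random.default_rng(7)
volume = pd.Series(rng.normal(1_000_000, 50_000, 60).round())
volume.iloc[-1] = 1_600_000  # abnormal volume on the last day

# Rolling z-score against a trailing 30-day baseline (shift(1) excludes today,
# so the spike cannot contaminate its own baseline)
baseline = volume.shift(1).rolling(30)
zscore = (volume - baseline.mean()) / baseline.std()

# Flag any day whose volume sits more than 4 standard deviations above baseline
flagged_days = zscore[zscore > 4].index.tolist()
```

In practice the baseline window, the threshold, and the treatment of market-wide volume shocks would all be calibrated per security.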

2.2   Algorithmic Manipulation: Spoofing and Layering

Beyond traditional insider trading, the high-frequency trading era has given rise to new forms of algorithmic manipulation. Spoofing involves placing orders with the intent to cancel them before execution, creating a false impression of supply or demand to manipulate prices. Layering is a more sophisticated variant where multiple spoof orders are placed at different price levels to create an illusion of market depth.

These activities leave detectable patterns in order book data: orders that are consistently canceled within milliseconds of execution, patterns of order placement and cancellation that correlate with price movements in the opposite direction, and anomalous order-to-trade ratios. The detection of such patterns requires analysis at microsecond granularity, far beyond human capability.
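As a rough illustration of how such order book signatures can be quantified, the sketch below computes per-account cancellation rates from a toy event log; the schema and the 0.9 cutoff are assumptions for demonstration only:

```python
import pandas as pd

# Toy order-event log with a hypothetical schema: one row per lifecycle event.
# Account "A" cancels every order it places; account "B" lets its orders fill.
events = pd.DataFrame({
    "account": ["A"] * 6 + ["B"] * 6,
    "event":   ["new", "cancel"] * 3 + ["new", "fill"] * 3,
})

counts = events.groupby("account")["event"].value_counts().unstack(fill_value=0)
counts["cancel_rate"] = counts["cancel"] / counts["new"]

# A cancellation rate near 1.0, sustained across many orders, is one spoofing
# signature; a real system would also weight order size and the milliseconds
# between placement and cancellation.
suspicious = counts[counts["cancel_rate"] > 0.9].index.tolist()
```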

2.3   The Regulatory Imperative

Regulatory bodies have increasingly embraced AI and machine learning for market surveillance. Rathi stated that the FCA itself is "using AI methods to monitor portfolios and identify risk behaviors”. Gensler similarly indicated that the SEC "could benefit from staff making greater use of AI in their market surveillance”. This regulatory endorsement reflects a broader recognition that traditional rule-based surveillance systems—which trigger alerts based on fixed thresholds (e.g., volume exceeds 200% of 30-day average)—are inadequate for detecting sophisticated abuse. As one industry analysis notes, these legacy systems have historically had difficulty "keeping pace with expanding trading volumes”.

3.   Role Delineation: Data Analyst vs. Data Scientist in Market Anomaly Detection

The successful deployment of algorithmic anomaly detection for market surveillance requires a clear understanding of the distinct contributions of Data Scientists and Data Analysts. This section delineates these roles based on job architecture analyses, academic literature, and industry implementations.

3.1   The Financial Data Scientist: Model Architect and Engineer

The Data Scientist in financial market surveillance is responsible for the technical infrastructure that transforms raw trading data into actionable anomaly signals. According to a job posting for a Principal Data Scientist at BNY Mellon's Financial Crimes Risk group—described as a "founding member" opportunity in a newly formed data science and decision support team—the Data Scientist is expected to "build an Augmented Intelligence platform for risk probabilities across a series of key indicating metrics" using "classification and anomaly detection techniques to perform statistical, econometric, and drill-down analysis underpinned by machine-learning models."

The core technical responsibilities of the Data Scientist in this domain include:

  • Algorithm Selection and Implementation: Choosing appropriate anomaly detection methods for specific surveillance tasks. For insider trading detection, researchers have successfully employed unsupervised learning approaches including principal component analysis, autoencoders, and clustering techniques. For spoofing and layering detection, isolation forests have proven particularly effective due to their efficiency with high-dimensional data.
  • Feature Engineering: Transforming raw tick data into meaningful features for anomaly detection. This includes constructing variables such as order-to-trade ratios, cancellation rates, volume-weighted average price deviations, and temporal clustering metrics. The quality of feature engineering directly determines detection performance.
  • Hyperparameter Optimization: Fine-tuning model parameters to balance sensitivity and specificity. As demonstrated in a study on insider trading detection in the Indian stock market using multi-channel convolutional neural networks, hyperparameter optimization using frameworks like Optuna can reduce false positive rates by 20% across all time windows. The researchers found that optimized deep learning models achieved accuracy rates of 87.50%, 75.00%, and 62.50% for 30-, 60-, and 90-day time windows respectively, significantly outperforming benchmark models like logistic regression and random forest.
  • Model Evaluation and Validation: Assessing model performance using appropriate metrics. In anomaly detection for fraud, where the positive class (illegal trades) is extremely rare, accuracy is a misleading metric. Data Scientists must focus on precision, recall, F1-score, and area under the precision-recall curve, with particular attention to false positive rates that could overwhelm compliance teams.
  • Production Deployment: Implementing models in real-time or near-real-time surveillance systems, with attention to latency requirements. As noted in the BNY Mellon job description, this requires "deep knowledge of time series database technologies" and "handling real-time transactional data feeds across asset classes and regions."
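The point about misleading accuracy can be made concrete with scikit-learn's metrics. The numbers below are fabricated for illustration and assume a 1% base rate of abusive trades:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical ground truth for 1,000 trades; only 10 (1%) are abusive.
y_true = [1] * 10 + [0] * 990

# A degenerate model that flags nothing scores 99% accuracy but has zero recall.
y_nothing = [0] * 1000
acc_nothing = accuracy_score(y_true, y_nothing)   # 0.99, yet catches no abuse

# A useful model: catches 8 of 10 abusive trades at the cost of 20 false alerts.
y_model = [1] * 8 + [0] * 2 + [1] * 20 + [0] * 970
precision = precision_score(y_true, y_model)      # 8 / 28
recall = recall_score(y_true, y_model)            # 8 / 10
f1 = f1_score(y_true, y_model)                    # 16 / 38
```

This is why precision-recall trade-offs, not raw accuracy, drive model selection in this domain.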

Crucially, the Data Scientist is not expected to be a compliance expert or forensic investigator. Their expertise is computational and statistical. However, as the BNY Mellon posting indicates, "deep understanding of financial engineering and mathematical finance concepts" is highly desirable, as is the ability to "present research findings to a wide range of audiences”.

3.2   The Financial Data Analyst: Validator, Investigator, and Bridge

If the Data Scientist builds the detection engine, the Data Analyst ensures it is targeting the right threats and that its outputs are actionable. The Analyst's role is fundamentally interpretive and investigative.

Drawing on general frameworks for data analytics roles, the financial Data Analyst's responsibilities include:

  • Output Validation and Triage: The Analyst reviews the anomalies flagged by the Data Scientist's models, distinguishing between true positives (suspicious activity warranting investigation) and false positives (legitimate trading that merely appears anomalous). This requires domain expertise: understanding what constitutes normal trading behavior for specific securities, market conditions, and investor types.

  • Contextual Investigation: For flagged anomalies, the Analyst performs deeper investigation, examining related accounts, historical trading patterns, and external context (e.g., upcoming corporate events, news announcements). This investigation may involve querying multiple data sources using SQL, pulling additional information from surveillance platforms, and documenting findings.
  • Pattern Recognition and Feedback: The Analyst identifies recurring false positive patterns and communicates these to the Data Scientist for model improvement. For example, if the model consistently flags end-of-quarter window dressing by mutual funds as suspicious, the Analyst documents this pattern so the Data Scientist can adjust feature weights or add exclusion rules.
  • Visualization and Reporting: The Analyst creates dashboards and reports that translate model outputs into formats usable by compliance officers, legal teams, and regulators. This includes time-series visualizations of suspicious activity, network graphs of coordinated trading, and summary statistics by security, trader, or time period.
  • Case Management: When the Analyst determines that flagged activity warrants formal investigation, they prepare case files that document the evidence, articulate the suspicious patterns, and provide the analytical basis for referral to compliance or regulators. As noted in the General Motors advanced analytics internship description, the Analyst role requires proficiency in "data visualization techniques" and the ability to "identify practical insights, make recommendations, and communicate results to business and technical audiences."
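One way contextual investigation might be partially automated is by joining the alert queue against a corporate-events calendar, so alerts preceding price-sensitive events rise to the top of the triage pile. The sketch below assumes toy schemas and an arbitrary 1-5 day lookback window:

```python
import pandas as pd

# Hypothetical alert queue and a corporate-events calendar (toy schemas).
alerts = pd.DataFrame({
    "account": ["X", "Y", "Z"],
    "ticker":  ["ACME", "ACME", "BETA"],
    "date":    pd.to_datetime(["2023-03-10", "2023-03-11", "2023-03-11"]),
})
events = pd.DataFrame({
    "ticker":     ["ACME"],
    "event_date": pd.to_datetime(["2023-03-13"]),
    "event":      ["merger announcement"],
})

# Prioritize alerts landing 1-5 days before a price-sensitive event;
# BETA has no upcoming event, so its alert stays in the ordinary queue.
merged = alerts.merge(events, on="ticker", how="left")
days_before = (merged["event_date"] - merged["date"]).dt.days
priority = merged[days_before.between(1, 5)]
```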
3.3   Comparative Summary

 

| Feature | Financial Data Scientist | Financial Data Analyst |
|---|---|---|
| Primary Output | Anomaly detection models, inference pipelines, risk scores | Investigated cases, dashboards, regulatory referrals |
| Core Tools | Python (scikit-learn, PyTorch), Spark, time series databases | SQL, Tableau/PowerBI, surveillance platforms |
| Statistical Focus | Model evaluation (precision, recall, F1), hyperparameter tuning | Descriptive statistics, hypothesis testing, pattern recognition |
| Domain Knowledge | Machine learning, time series analysis, financial engineering | Market microstructure, trading regulations, investigative procedures |
| Typical Question | "How can we improve recall for spoofing detection?" | "Is this flagged pattern legitimate trading or manipulation?" |
| Validation Role | Cross-validation, backtesting, performance monitoring | Manual review of anomalies, false positive identification |

4.   Technical Framework: Time Series Analysis and Isolation Forests

4.1   Time Series Analysis for Anomaly Detection

Time series analysis forms the foundation of algorithmic market surveillance. Trading data is inherently temporal: each transaction and order carries a timestamp, and patterns of interest unfold over time. As noted in a comprehensive review by Blázquez-García and colleagues (2021), time series anomaly detection encompasses three primary categories: point anomalies (individual data points deviating from expected patterns), contextual anomalies (points anomalous in specific contexts but not globally), and collective anomalies (sequences of points that together form anomalous patterns).

For insider trading detection, contextual anomalies are particularly relevant. A large trade that would be normal during high-volume periods may be anomalous during quiet trading hours. Similarly, a pattern of order cancellations that would be unremarkable for a high-frequency market maker may be highly suspicious for a retail investor.

Chen and colleagues (2024) proposed a novel approach for insider trading detection in industrial chains using "logistics time interval characteristics" and a dynamic sliding window method. Their approach recognizes that the temporal span of suspicious trading can vary significantly depending on the nature of the non-public information and the trading strategy employed. By using an adaptive window that expands or contracts based on detected patterns, their method achieved a 20.68% improvement in F1 score compared to standard isolation forest implementations.

4.2   Isolation Forest for Anomaly Detection

Isolation Forest, introduced by Liu, Ting, and Zhou (2008), has emerged as a leading algorithm for anomaly detection in financial surveillance. Unlike traditional methods that attempt to model "normal" behavior and identify deviations, isolation forest explicitly isolates anomalies by exploiting two key properties: anomalies are few and they are different.

The algorithm works by recursively partitioning the feature space using random splits. Anomalies are "easier to isolate" than normal points—they require fewer splits to be separated into their own region. The path length from root to leaf serves as an anomaly score: shorter paths indicate higher anomaly likelihood.

The suitability of isolation forest for financial fraud detection stems from several characteristics:

  • Efficiency: Isolation forest has linear time complexity, making it scalable to the massive datasets generated by high-frequency markets.
  • Handling of High Dimensionality: Financial data may include dozens or hundreds of features (price, volume, order book depth, cancellation rates, and so on). Isolation forest performs well even when many features are irrelevant.
  • No Assumption of Distribution: Unlike parametric methods that assume normal distributions, isolation forest makes no distributional assumptions—important because trading data is typically heavy-tailed and non-normal.
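A minimal sketch of this workflow with scikit-learn's IsolationForest follows; the features, the outlier values, and the 1% contamination setting are illustrative assumptions rather than a tuned surveillance configuration:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic per-account features: [volume z-score, cancellation rate].
# 500 ordinary accounts plus one spoofer-like outlier.
normal = np.column_stack([rng.normal(0, 1, 500), rng.uniform(0.0, 0.3, 500)])
outlier = np.array([[4.5, 0.97]])   # volume spike plus near-total cancellation
X = np.vstack([normal, outlier])

forest = IsolationForest(n_estimators=200, contamination=0.01, random_state=0)
labels = forest.fit_predict(X)      # -1 = anomaly, +1 = normal
scores = forest.score_samples(X)    # lower = easier to isolate = more anomalous
```

Note that the model only ranks statistical unusualness; deciding whether the flagged account is a spoofer or a legitimate market maker remains the Analyst's job.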

A study on fraud detection in carbon emission allowance markets compared multiple unsupervised methods including Isolation Forest, One-Class SVM, Autoencoder, DBSCAN, LOF, K-Means, Elliptic Envelope, and PCA. The researchers found that "models such as Isolation Forest and Elliptic Envelope significantly improve trading outcomes, with notable increases in net profit and win rate while reducing drawdown”. This finding demonstrates that anomaly detection can be integrated directly into trading strategies to avoid manipulated markets.

4.3   Hybrid Approaches: Unsupervised Learning for Insider Trading

Several recent studies have advanced the application of unsupervised learning to insider trading detection. Deriu and colleagues (2022, 2024) proposed two complementary unsupervised methods for market surveillance:

The first method uses "clustering to identify, in the vicinity of a price sensitive event such as a takeover bid, discontinuities in the trading activity of an investor with respect to his/her own past trading history and on the present trading activity of his/her peers." This approach captures both temporal anomalies (deviation from one's own baseline) and social anomalies (deviation from peer behavior).

The second method aims to identify "small groups of investors that act coherently around price sensitive events, pointing to potential insider rings, i.e. a group of synchronized traders displaying strong directional trading in rewarding position in a period before the price sensitive event”. This approach is particularly important because insider trading often involves coordinated activity among multiple individuals—family members, business associates, or professional networks.

The researchers applied their methods to "investor resolved data of Italian stocks around takeover bids" as a case study, demonstrating the practical applicability of unsupervised approaches to real-world surveillance challenges.
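A drastically simplified version of the synchronized-trading idea (not the authors' actual method) can be sketched as a pairwise correlation screen over signed daily order flow; the data and the 0.9 cutoff are fabricated for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
days = 40

# Hypothetical daily signed net volume per account (+ = net buy, - = net sell).
flows = {f"acct_{i}": rng.normal(0, 100, days) for i in range(5)}
leader = rng.normal(0, 100, days)
flows["ring_a"] = leader + rng.normal(0, 10, days)  # near-copies of the same
flows["ring_b"] = leader + rng.normal(0, 10, days)  # directional bets
flows = pd.DataFrame(flows)

corr = flows.corr()
np.fill_diagonal(corr.values, 0.0)  # ignore self-correlation

# Account pairs trading in near-lockstep are candidate insider rings.
pairs = corr.stack()
synchronized = sorted({tuple(sorted(p)) for p in pairs[pairs > 0.9].index})
```

A real implementation would restrict the correlation window to the run-up before a price-sensitive event, as the cited work does, rather than correlating over all days.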

5.   The Collaborative Workflow: From Data to Investigation

5.1   Model Development Phase

The collaborative process begins with the Data Scientist developing the anomaly detection model. This involves:

  1. Data Acquisition and Preparation: The Data Scientist ingests trading data from exchange feeds, order books, and trade repositories. This data must be cleaned, normalized, and aligned temporally.
  2. Feature Engineering: Working with input from Analysts regarding known suspicious patterns, the Data Scientist constructs features that capture potential signals of abuse: volume z-scores, cancellation rates, order-to-trade ratios, temporal clustering metrics, and correlation measures across accounts.
  3. Model Selection and Training: The Data Scientist selects appropriate algorithms (isolation forest, autoencoders, clustering) and trains them on historical data, using periods with known enforcement actions as validation.
  4. Threshold Calibration: The Data Scientist sets anomaly score thresholds that determine which activities are flagged for review. This involves trade-offs between sensitivity (catching more true positives) and specificity (avoiding false positives).
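The threshold calibration step can be tied directly to the Analysts' review capacity. The sketch below, using random stand-in features and an assumed 0.5% review budget, converts that budget into a score cutoff:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(3)
X = rng.normal(0, 1, (2000, 4))   # stand-in for an engineered feature matrix

forest = IsolationForest(random_state=3).fit(X)
scores = forest.score_samples(X)  # lower = more anomalous

# Calibrate the alert threshold to the review budget: if the Analyst team can
# triage roughly 0.5% of activity, alert only on the bottom 0.5% of scores.
review_budget = 0.005
threshold = np.quantile(scores, review_budget)
n_alerts = int((scores <= threshold).sum())  # roughly 0.5% of 2000 rows
```

In production the budget itself becomes a tunable knob, raised when recall matters more and lowered when alert fatigue sets in.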

5.2   Operational Deployment Phase

Once the model is deployed, the Analyst takes primary responsibility:

  1. Alert Triage: The Analyst reviews flagged anomalies, examining the underlying trading data, the model's confidence scores, and any available context about the security or trader.
  2. Initial Validation: The Analyst determines whether the flagged pattern warrants deeper investigation. Many anomalies will be resolved as false positives—legitimate trading that appears unusual due to known events (e.g., index rebalancing, option expiration).
  3. Deep Investigation: For suspicious patterns, the Analyst conducts deeper investigation: querying historical data for similar patterns, examining related accounts, checking for connections to price-sensitive events, and documenting the trading timeline.
  4. Case Referral: When the Analyst determines that evidence supports a reasonable suspicion of illegal activity, they prepare a case file for referral to compliance, legal, or regulatory authorities.

5.3   Continuous Improvement

Both roles contribute to iterative model improvement:

  • The Analyst documents false positives and patterns the model misses, providing feedback to the Data Scientist.
  • The Data Scientist uses this feedback to adjust features, retrain models, or develop new detection approaches.
  • The team periodically reviews model performance against known enforcement actions and emerging manipulation techniques.

6.   Challenges and Limitations

6.1   The False Positive Problem

False positives—legitimate trading flagged as suspicious—are the bane of automated surveillance systems. A high false positive rate overwhelms Analysts, leading to alert fatigue and potentially causing genuine threats to be missed. The Optuna-optimized model described by Priyadarshi and Kumar (2024) reduced false positives by 20%, but this still leaves substantial room for improvement.

False positives arise from multiple sources: legitimate trading strategies that mimic suspicious patterns (e.g., momentum trading before earnings), market-wide events that affect many securities simultaneously, and data quality issues that introduce spurious signals. The Analyst's role in distinguishing these from genuine threats is irreplaceable.
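One common mitigation is the documented exclusion rule described in Section 3.2: recurring benign patterns are auto-closed rather than deleted, so the audit trail and training labels survive. A sketch with a hypothetical alert schema and a made-up rebalance-date rule:

```python
import pandas as pd

# Hypothetical alert log; index-rebalance days are a documented false positive
# source, so volume alerts on those dates are auto-closed rather than deleted.
alerts = pd.DataFrame({
    "alert_id": [1, 2, 3, 4],
    "date": pd.to_datetime(["2023-06-16", "2023-06-16",
                            "2023-06-20", "2023-06-21"]),
    "reason": ["volume", "volume", "sync", "volume"],
})
rebalance_days = pd.to_datetime(["2023-06-16"])  # quarterly index rebalance

auto_closed = alerts["date"].isin(rebalance_days) & (alerts["reason"] == "volume")
review_queue = alerts[~auto_closed]  # only alerts 3 and 4 reach the Analyst
```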

6.2   Model Explainability

Machine learning models, particularly deep neural networks, are often "black boxes"—they produce outputs without revealing how those outputs were derived. This poses a challenge for market surveillance, where investigators may need to explain, in legal proceedings, why a particular trade was flagged as suspicious.

The explainability challenge is noted in the regulatory context: Sio observed that one potential concern with AI surveillance is "explainability, for example in that brokers who are obligated to act in the best interest of their clients will have to prove AI models do that”. This suggests that interpretable models (e.g., decision trees, logistic regression) may be preferred in some contexts despite potentially lower performance.

6.3   Adversarial Adaptation

Sophisticated insider traders adapt their behavior to evade detection. If a particular pattern becomes known to be monitored, traders may shift to alternative strategies. The detection models must therefore be continuously updated—a task requiring ongoing collaboration between Data Scientists (who develop new algorithms) and Analysts (who identify emerging evasion techniques).

6.4   Data Privacy and Legal Constraints

Market surveillance operates within legal frameworks that constrain data collection and analysis. Privacy regulations may limit the use of personally identifiable information; evidentiary standards may require that flagged activities meet specific thresholds before referral. The Analyst must navigate these constraints while ensuring that legitimate investigations proceed.

7.   Conclusion

The detection of insider trading and algorithmic manipulation in modern financial markets is fundamentally a data problem. The volume, velocity, and complexity of trading data exceed human analytical capacity, necessitating computational approaches. Yet the problem is not solvable by algorithms alone. The low base rate of illegal activity, the subtlety of suspicious signals, and the high cost of false positives require human judgment at critical junctures.

This paper has argued that effective algorithmic market surveillance requires a clear division of labor between Data Scientists and Data Analysts. The Data Scientist builds the technical infrastructure: engineering features, selecting and optimizing anomaly detection algorithms (particularly isolation forests and time series methods), and evaluating model performance. The Data Analyst validates outputs, conducts deep investigations of flagged anomalies, and translates computational findings into actionable intelligence for compliance and enforcement.

The academic literature reviewed—from the Optuna-optimized CNN achieving 87.5% accuracy on insider trading detection to the clustering-based identification of insider rings—demonstrates significant technical progress. The regulatory endorsements from the SEC and FCA suggest that AI-based surveillance is moving from experimental to operational. However, the persistent challenges of false positives, model explainability, and adversarial adaptation ensure that the human analyst will remain central to market integrity efforts for the foreseeable future.

The future trajectory will likely involve tighter integration of human and machine intelligence: Analysts training models through their annotations, models learning from Analyst feedback, and systems that adaptively allocate attention based on risk. The goal is not to replace human judgment but to augment it—allowing Analysts to focus their expertise on the most promising leads while machines handle the volume.

8.   References

  • Priyadarshi, P., & Kumar (2024). Detecting insider trading in the Indian stock market: An optimized deep learning approach. Computational Economics, 65, 3923-3943. https://doi.org/10.1007/s10614-024-10697-z
  • Chen, Di, K., Tao, H., Jiang, Y., & Li, P. (2024). Insider trading detection algorithm in industrial chain based on logistics time interval characteristics. In J. S. Park, H. Takizawa, H. Shen, & J. J. Park (Eds.), Parallel and distributed computing, applications and technologies (Lecture Notes in Electrical Engineering, Vol. 1112). Springer. https://doi.org/10.1007/978-981-99-8211-0_12
  • Mazzarisi, Ravagnani, A., Deriu, P., Lillo, F., Medda, F., & Russo, A. (2022). A machine learning approach to support decision in insider trading detection. arXiv. https://arxiv.org/abs/2212.05912
