Data Drift Detection

Overview

Automated statistical monitoring of production datasets for drift against training baselines, for code comment classifiers (Java, Python, Pharo). Uses Deepchecks, with a SciPy fallback, to identify statistically significant shifts in text properties and label distributions.

Architecture

DriftDetector (turing/monitoring/drift_detector.py) — statistical test orchestration and result aggregation

BaselineManager (turing/monitoring/baseline_manager.py) — reference distributions (length, word count, labels)

SyntheticDataGenerator (turing/monitoring/synthetic_data_generator.py) — controlled drift scenarios for validation

Reports — TXT and JSON output in reports/monitoring/
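As an illustration of the reference distributions listed above (length, word count, labels), here is a minimal sketch of what a baseline could contain. The function name `compute_baseline`, the returned keys, and the label values in the usage note are assumptions made for this example; they are not taken from `baseline_manager.py`.

```python
from collections import Counter
from statistics import mean


def compute_baseline(texts, labels):
    """Summarize reference data as the distributions a drift check compares against."""
    if not texts or len(texts) != len(labels):
        raise ValueError("texts and labels must be non-empty and of equal length")
    lengths = [len(t) for t in texts]              # characters per comment
    word_counts = [len(t.split()) for t in texts]  # whitespace-separated tokens
    label_counts = Counter(labels)
    total = len(labels)
    return {
        "lengths": lengths,
        "word_counts": word_counts,
        "mean_length": mean(lengths),
        "mean_word_count": mean(word_counts),
        # Relative frequency of each class in the reference data.
        "label_frequencies": {k: v / total for k, v in label_counts.items()},
    }
```

For example, `compute_baseline(["returns the sum", "todo"], ["summary", "todo"])` yields lengths `[15, 4]` and word counts `[3, 1]`. Keeping the raw value lists, not only the means, matters because a two-sample test needs the full reference sample.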

Detection Methods

Deepchecks (primary)

TextPropertyDrift: distributional property comparison (length, word frequency, lexical diversity)
Label Distribution: chi-square test on class frequencies

SciPy (fallback)

Kolmogorov–Smirnov: text length and word count distribution
Chi-square: label distribution

Both engines return the same output structure, so downstream consumers do not need to know which one ran.
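The SciPy fallback tests can be sketched as follows. This is a minimal illustration, not the code in `drift_detector.py`: the function names and the returned keys are assumptions, while `scipy.stats.ks_2samp` and `scipy.stats.chi2_contingency` are the real SciPy calls for the two tests named above.

```python
from collections import Counter

from scipy.stats import chi2_contingency, ks_2samp


def ks_drift(reference_values, production_values):
    """Two-sample Kolmogorov-Smirnov test on one numeric text property."""
    result = ks_2samp(reference_values, production_values)
    return {"statistic": float(result.statistic), "p_value": float(result.pvalue)}


def text_property_drift(reference_texts, production_texts):
    """Run the KS test on character length and on word count."""
    return {
        "length": ks_drift(
            [len(t) for t in reference_texts],
            [len(t) for t in production_texts],
        ),
        "word_count": ks_drift(
            [len(t.split()) for t in reference_texts],
            [len(t.split()) for t in production_texts],
        ),
    }


def label_drift(reference_labels, production_labels):
    """Chi-square test of homogeneity on the two label count vectors."""
    ref_counts = Counter(reference_labels)
    prod_counts = Counter(production_labels)
    classes = sorted(set(ref_counts) | set(prod_counts))
    # 2 x K contingency table: one row per dataset, one column per class.
    table = [
        [ref_counts.get(c, 0) for c in classes],
        [prod_counts.get(c, 0) for c in classes],
    ]
    chi2, p_value, _dof, _expected = chi2_contingency(table)
    return {"statistic": float(chi2), "p_value": float(p_value)}
```

Note that `chi2_contingency` applies Yates' continuity correction by default when the table is 2 x 2 (two classes), which makes the test slightly more conservative.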

Statistical Thresholds

Level	p-value	Action
No drift	p > 0.05	Continue monitoring
Warning	0.01 < p ≤ 0.05	Increase sampling frequency
Alert	p ≤ 0.01	Trigger retraining
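The table maps directly onto a small classification function. The sketch below assumes the default cutoffs shown above; the function name and the level strings are illustrative and not part of the documented API.

```python
ACTIONS = {
    "no_drift": "Continue monitoring",
    "warning": "Increase sampling frequency",
    "alert": "Trigger retraining",
}


def classify_p_value(p_value, warning=0.05, alert=0.01):
    """Map a test p-value to a drift level using inclusive upper bounds."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError(f"p-value out of range: {p_value}")
    if p_value <= alert:
        return "alert"
    if p_value <= warning:
        return "warning"
    return "no_drift"
```

With these cutoffs, `classify_p_value(0.087)` gives `"no_drift"`, `classify_p_value(0.03)` gives `"warning"`, and `classify_p_value(0.003)` gives `"alert"`. The boundary values 0.05 and 0.01 fall into the stricter level, matching the `≤` in the table.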

API

from turing.monitoring.drift_detector import DriftDetector

detector = DriftDetector()
results = detector.detect_all_drifts(
    production_texts=production_batch,
    production_labels=production_labels,
    reference_texts=training_texts,
    reference_labels=training_labels,
)

if results["overall"]["alert"]:
    initiate_retraining_pipeline()

JSON Output

{
  "text_property": {"drifted": false, "alert": false, "p_value": 0.087},
  "label_distribution": {"drifted": true, "alert": true, "p_value": 0.003},
  "overall": {"drifted": true, "alert": true, "num_drifts": 1}
}
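The sample output suggests that `overall` is derived from the individual checks. The sketch below reproduces it under that assumption: `drifted` and `alert` are true if any check sets them, and `num_drifts` counts the drifted checks. The `summarize` helper is hypothetical; the actual aggregation logic in `DriftDetector` may differ.

```python
import json


def summarize(checks):
    """Derive an "overall" entry from per-check results."""
    drifted = [name for name, result in checks.items() if result["drifted"]]
    return {
        "drifted": bool(drifted),
        "alert": any(result["alert"] for result in checks.values()),
        "num_drifts": len(drifted),
    }


# The sample report from this section, parsed as a consumer would read it.
report = json.loads("""
{
  "text_property": {"drifted": false, "alert": false, "p_value": 0.087},
  "label_distribution": {"drifted": true, "alert": true, "p_value": 0.003},
  "overall": {"drifted": true, "alert": true, "num_drifts": 1}
}
""")
checks = {name: result for name, result in report.items() if name != "overall"}
```

Here `summarize(checks)` matches `report["overall"]`, which is a useful consistency check when validating reports downstream.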

Validation

python -m turing.CLI_runner.verify_drift_detection --language java --n-samples 100

Scenarios: normal, short_text, long_text, corrupted_vocab, class_imbalance
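To make the scenario names concrete, here is one way such perturbations could be generated. This is a sketch only: the function signatures, default parameters, and perturbation rules are assumptions for illustration and do not describe `SyntheticDataGenerator`.

```python
import random


def normal(texts):
    """Control scenario: return the data unchanged."""
    return list(texts)


def short_text(texts, keep_words=2):
    """Truncate every text to its first few words."""
    return [" ".join(t.split()[:keep_words]) for t in texts]


def long_text(texts, repeat=5):
    """Inflate every text by repeating it."""
    return [" ".join([t] * repeat) for t in texts]


def corrupted_vocab(texts, rate=0.5, seed=0):
    """Replace a fraction of words with random lowercase tokens of equal length."""
    rng = random.Random(seed)
    letters = "abcdefghijklmnopqrstuvwxyz"

    def corrupt(word):
        if rng.random() < rate:
            return "".join(rng.choice(letters) for _ in range(len(word)))
        return word

    return [" ".join(corrupt(w) for w in t.split()) for t in texts]


def class_imbalance(texts, labels, majority, ratio=0.9, seed=0):
    """Resample with replacement so about `ratio` of the batch is `majority`."""
    rng = random.Random(seed)
    pairs = list(zip(texts, labels))
    major = [p for p in pairs if p[1] == majority]
    minor = [p for p in pairs if p[1] != majority]
    if not major or not minor:
        raise ValueError("need samples both inside and outside the majority class")
    n_major = round(len(pairs) * ratio)
    sample = rng.choices(major, k=n_major) + rng.choices(minor, k=len(pairs) - n_major)
    rng.shuffle(sample)
    new_texts, new_labels = zip(*sample)
    return list(new_texts), list(new_labels)
```

`short_text` and `long_text` shift the length and word count distributions, `corrupted_vocab` changes the vocabulary while preserving length, and `class_imbalance` shifts only the labels, so each scenario exercises a different check.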

Dynamic Model Management

The system runs a continuous optimization cycle in which the deployed model is re-selected whenever critical drift is detected:

Alert → Retraining: critical drift (p ≤ 0.01) triggers automatic retraining
Multi-Model Evaluation: all architectures (LinearSVM, GRU-RNN, SentenceBERT, CodeBERTa, DeBERTaV3, XGBoost) evaluated on new dataset
Best Model Deployment: classifier with highest F1/AUC becomes baseline
Baseline Recomputation: thresholds and distributions recalculated

Rationale:

Models optimized for short comments (Java) fail on long docstrings (Python)
Vocabulary shift invalidates pre-computed embeddings
Class imbalance changes favor different architectures
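The Best Model Deployment step amounts to ranking the retrained candidates on a validation metric. The source lists "F1/AUC" without saying how the two are combined, so this sketch ranks on a single metric chosen by the caller. The function name and the score layout are assumptions, and the numbers are placeholders, not measured results.

```python
def select_best_model(scores, metric="f1"):
    """Return the name of the candidate with the highest value of `metric`."""
    if not scores:
        raise ValueError("no candidate scores supplied")
    return max(scores, key=lambda name: scores[name][metric])


# Placeholder numbers for illustration only, not measured results.
candidate_scores = {
    "LinearSVM": {"f1": 0.1, "auc": 0.3},
    "CodeBERTa": {"f1": 0.3, "auc": 0.2},
    "XGBoost": {"f1": 0.2, "auc": 0.1},
}
best_by_f1 = select_best_model(candidate_scores)
best_by_auc = select_best_model(candidate_scores, metric="auc")
```

With these placeholder values the two metrics disagree (`best_by_f1` is `"CodeBERTa"`, `best_by_auc` is `"LinearSVM"`), which is why the ranking metric should be fixed before the evaluation runs.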

Operational Procedures

Deployment: validate 5/5 synthetic scenarios, verify baseline, test notification channels

Alert Response (p ≤ 0.01): notify ops → root cause analysis → retraining → log incident

Warning Response (0.01 < p ≤ 0.05): increase sampling 2x → segment analysis within 24h

GitHub Actions Pipeline

Triggers

Schedule: 1st of month, 00:00 UTC — monthly reports for annual trend tracking and seasonal patterns
Manual: on-demand via GitHub UI
Push: workflow changes

Jobs

Execution: setup → {pull-data, generate-datasets} → drift-detection (matrix 3 languages) → aggregate-notify

Setup: output language matrix [java, python, pharo]
Pull Data: DVC pull from DagsHub S3, cached by commit SHA
Generate Datasets: parquet → CSV → TF-IDF (5000 features), cache results
Drift Detection: runs in parallel across the 3 languages, executes the 5 test scenarios, and generates JSON/TXT reports
Aggregate: Step Summary, PR comments, CI status (fail on alert)
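The Aggregate job can be sketched as a function that turns the per-language reports into a Markdown summary and an exit code. The function names and the table layout are assumptions. `GITHUB_STEP_SUMMARY` is the environment variable GitHub Actions provides for job summaries, and the report shape follows the JSON Output section.

```python
import os


def aggregate(reports):
    """Build a Markdown summary and an exit code from per-language reports."""
    lines = [
        "| Language | Drifted | Alert | Drifts |",
        "| --- | --- | --- | --- |",
    ]
    any_alert = False
    for language in sorted(reports):
        overall = reports[language]["overall"]
        any_alert = any_alert or overall["alert"]
        lines.append(
            f"| {language} | {overall['drifted']} | {overall['alert']} "
            f"| {overall['num_drifts']} |"
        )
    # A non-zero exit code fails the CI job when any language raised an alert.
    return "\n".join(lines) + "\n", 1 if any_alert else 0


def write_step_summary(markdown):
    """Append to the GitHub Actions step summary if the runner provides one."""
    path = os.environ.get("GITHUB_STEP_SUMMARY")
    if path:
        with open(path, "a", encoding="utf-8") as handle:
            handle.write(markdown)
```

A workflow step would call `aggregate`, pass the summary to `write_step_summary`, and finish with `sys.exit(code)` so that an alert in any language marks the run as failed.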

Dependencies

Package	Version	Reason
numpy	1.26.4	Deepchecks incompatible with numpy≥2.0
fsspec[http]	≤2025.10.0	datasets library constraint
deepchecks[nlp]	0.19.1	text analysis

Access Control

The drift-detection-env environment requires approval before its secrets (MLflow, DagsHub, S3) are exposed to the workflow.

Troubleshooting

Deepchecks import error: install deepchecks[nlp]==0.19.1, not separate packages

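False positives: verify baseline statistics first. Drift is flagged when p falls at or below a cutoff, so raising the cutoffs (e.g. warning at 0.10, alert at 0.02) makes detection more sensitive, not less; to reduce false positives, lower the cutoffs below the defaults (0.05 and 0.01) instead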

Fallback mode: expected when Deepchecks is unavailable; the SciPy engine then runs and produces the same output structure
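Engine selection of this kind is commonly implemented with a guarded import. The sketch below shows the pattern, with an injectable importer so it can be exercised without either library installed. The function name and return values are assumptions, not the logic in `drift_detector.py`.

```python
import importlib


def select_engine(importer=importlib.import_module):
    """Return "deepchecks" if it imports cleanly, otherwise fall back to "scipy"."""
    try:
        importer("deepchecks")
    except ImportError:
        # Covers a missing package as well as an import failing inside it.
        return "scipy"
    return "deepchecks"
```

Attempting the import, rather than only checking that the package is installed, also catches a broken installation that raises `ImportError` on load. An incompatible dependency may raise a different exception type, which this sketch does not handle.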