Data Drift Detection

Overview

Automated statistical monitoring of drift in production datasets relative to training baselines, for code comment classifiers (Java, Python, Pharo). Uses Deepchecks, with a SciPy fallback, to identify significant shifts in text properties and label distributions.

Architecture

DriftDetector (turing/monitoring/drift_detector.py) — statistical test orchestration and result aggregation

BaselineManager (turing/monitoring/baseline_manager.py) — reference distributions (length, word count, labels)

SyntheticDataGenerator (turing/monitoring/synthetic_data_generator.py) — controlled drift scenarios for validation

Reports — TXT and JSON output in reports/monitoring/
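As a rough illustration of the reference distributions the BaselineManager is described as holding, the sketch below computes text length, word count, and label counts from a reference dataset. The function name `compute_baseline`, the dictionary keys, and the sample labels are hypothetical and are not taken from the project code.

```python
from collections import Counter


def compute_baseline(texts, labels):
    """Summarize a reference dataset as three simple distributions.

    Hypothetical helper: the real BaselineManager API may differ.
    """
    return {
        "lengths": [len(t) for t in texts],              # characters per comment
        "word_counts": [len(t.split()) for t in texts],  # whitespace-separated tokens
        "label_counts": dict(Counter(labels)),           # class frequencies
    }


# Sample comments and labels are invented for illustration.
baseline = compute_baseline(
    ["// returns the user id", "TODO: remove this hack"],
    ["summary", "todo"],
)
```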

Detection Methods

Deepchecks (primary)

  • TextPropertyDrift: distributional property comparison (length, word frequency, lexical diversity)
  • Label Distribution: chi-squared test on class frequencies

SciPy (fallback)

  • Kolmogorov–Smirnov (two-sample): text length and word count distributions
  • Chi-squared: label distribution

Both engines produce the same output structure, so downstream consumers do not need to know which one ran.
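The two fallback tests can be sketched directly with SciPy. This is a minimal illustration under assumptions, not the project's implementation: the function names are hypothetical, text length is measured in characters, and the chi-squared test is run as a test of homogeneity on a 2 x k table of label counts. Inputs are assumed to be non-empty.

```python
from collections import Counter

from scipy.stats import chi2_contingency, ks_2samp


def length_drift_pvalue(reference_texts, production_texts):
    """Two-sample Kolmogorov-Smirnov test on character lengths."""
    ref = [len(t) for t in reference_texts]
    prod = [len(t) for t in production_texts]
    return float(ks_2samp(ref, prod).pvalue)


def label_drift_pvalue(reference_labels, production_labels):
    """Chi-squared test of homogeneity on reference vs production label counts."""
    ref, prod = Counter(reference_labels), Counter(production_labels)
    # Use the union of classes so both rows have the same columns.
    classes = sorted(set(ref) | set(prod))
    table = [[ref[c] for c in classes], [prod[c] for c in classes]]
    _, p_value, _, _ = chi2_contingency(table)
    return float(p_value)
```

A small p-value from either function means the production sample is unlikely to come from the same distribution as the reference sample.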

Statistical Thresholds

| Threshold | Value | Action |
|---|---|---|
| No drift | p > 0.05 | Continue monitoring |
| Warning | 0.01 < p ≤ 0.05 | Increase sampling frequency |
| Alert | p ≤ 0.01 | Trigger retraining |
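The threshold bands translate into a small classification function. The sketch below is illustrative; the function name, the returned strings, and the parameter names are assumptions, while the cutoffs are the ones listed above.

```python
def classify_p_value(p_value, warning=0.05, alert=0.01):
    """Map a p-value to one of the three bands: no_drift, warning, alert."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError(f"p-value out of range: {p_value}")
    if p_value <= alert:
        return "alert"      # p <= 0.01: trigger retraining
    if p_value <= warning:
        return "warning"    # 0.01 < p <= 0.05: increase sampling frequency
    return "no_drift"       # p > 0.05: continue monitoring
```

Both boundaries are inclusive on the lower band, so p = 0.05 is a warning and p = 0.01 is an alert.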

API

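from turing.monitoring.drift_detector import DriftDetector

# production_* hold the batch under test; training_* hold the reference data.
detector = DriftDetector()
results = detector.detect_all_drifts(
    production_texts=production_batch,
    production_labels=production_labels,
    reference_texts=training_texts,
    reference_labels=training_labels,
)

# "overall" aggregates the individual checks (see JSON Output).
if results["overall"]["alert"]:
    initiate_retraining_pipeline()  # placeholder for the retraining entry point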

JSON Output

{
  "text_property": {"drifted": false, "alert": false, "p_value": 0.087},
  "label_distribution": {"drifted": true, "alert": true, "p_value": 0.003},
  "overall": {"drifted": true, "alert": true, "num_drifts": 1}
}
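The "overall" block in the sample above is consistent with a simple aggregation over the per-check entries: drifted if any check drifted, alert if any check alerted, and a count of drifted checks. The sketch below reproduces that reading; the aggregation rule is inferred from the sample output and `summarize` is a hypothetical name.

```python
def summarize(checks):
    """Aggregate per-check results into an overall block (inferred rule)."""
    drifted = [name for name, result in checks.items() if result["drifted"]]
    return {
        "drifted": bool(drifted),
        "alert": any(result["alert"] for result in checks.values()),
        "num_drifts": len(drifted),
    }


# Per-check values copied from the sample JSON output.
checks = {
    "text_property": {"drifted": False, "alert": False, "p_value": 0.087},
    "label_distribution": {"drifted": True, "alert": True, "p_value": 0.003},
}
overall = summarize(checks)
```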

Validation

python -m turing.CLI_runner.verify_drift_detection --language java --n-samples 100

Scenarios: normal, short_text, long_text, corrupted_vocab, class_imbalance
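To give a sense of what a controlled drift scenario might look like, the sketch below builds two of the listed scenarios from a clean dataset. How SyntheticDataGenerator actually constructs them is not documented here, so the function names, parameters, and perturbation rules are all assumptions.

```python
import random


def short_text(texts, max_words=3):
    """Truncate every text to its first few words to shift the length distribution."""
    return [" ".join(t.split()[:max_words]) for t in texts]


def class_imbalance(texts, labels, majority, keep_ratio=0.1, seed=0):
    """Keep every sample of `majority` and roughly `keep_ratio` of the other classes."""
    rng = random.Random(seed)  # seeded so the scenario is reproducible
    kept = [
        (text, label)
        for text, label in zip(texts, labels)
        if label == majority or rng.random() < keep_ratio
    ]
    return [text for text, _ in kept], [label for _, label in kept]
```

Feeding the perturbed data to the detector as the production batch, with the untouched data as reference, should raise drift on text properties for `short_text` and on labels for `class_imbalance`.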

Dynamic Model Management

The system operates in a continuous optimization cycle—models are never static:

  1. Alert → Retraining: critical drift (p ≤ 0.01) triggers automatic retraining
  2. Multi-Model Evaluation: all architectures (LinearSVM, GRU-RNN, SentenceBERT, CodeBERTa, DeBERTaV3, XGBoost) evaluated on new dataset
  3. Best Model Deployment: classifier with highest F1/AUC becomes baseline
  4. Baseline Recomputation: thresholds and distributions recalculated

Rationale:

  • Models optimized for short comments (Java) fail on long docstrings (Python)
  • Vocabulary shift invalidates pre-computed embeddings
  • Class imbalance changes favor different architectures
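The deployment step (step 3) amounts to picking the top-scoring candidate after retraining. The text says "highest F1/AUC" without stating how the two metrics are combined, so the sketch below assumes F1 is the primary metric and AUC breaks ties. The function name is hypothetical and the scores are made up for illustration.

```python
def select_best_model(scores):
    """Return the model name with the highest F1, using AUC to break ties."""
    return max(scores, key=lambda name: (scores[name]["f1"], scores[name]["auc"]))


# Made-up scores for illustration only; they are not results from the project.
scores = {
    "LinearSVM": {"f1": 0.81, "auc": 0.88},
    "CodeBERTa": {"f1": 0.86, "auc": 0.91},
    "XGBoost": {"f1": 0.86, "auc": 0.89},
}
best = select_best_model(scores)
```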

Operational Procedures

Deployment: validate 5/5 synthetic scenarios, verify baseline, test notification channels

Alert Response (p ≤ 0.01): notify ops → root cause analysis → retraining → log incident

Warning Response (0.01 < p ≤ 0.05): increase sampling 2x → segment analysis within 24h

GitHub Actions Pipeline

Triggers

  • Schedule: 1st of month, 00:00 UTC — monthly reports for annual trend tracking and seasonal patterns
  • Manual: on-demand via GitHub UI
  • Push: workflow changes

Jobs

Execution: setup → {pull-data, generate-datasets} → drift-detection (matrix 3 languages) → aggregate-notify

  1. Setup: output language matrix [java, python, pharo]
  2. Pull Data: DVC pull from DagsHub S3, cached by commit SHA
  3. Generate Datasets: parquet → CSV → TF-IDF (5000 features), cache results
  4. Drift Detection: runs the 3 languages in parallel, each against the 5 test scenarios, and generates JSON/TXT reports
  5. Aggregate: Step Summary, PR comments, CI status (fail on alert)
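The "fail on alert" behaviour of the aggregate job can be sketched as a script that scans the JSON reports and returns a non-zero exit code when any report carries an alert. This assumes one JSON report per language in the report directory, each with the "overall" block shown under JSON Output; the file layout and function names are assumptions.

```python
import json
from pathlib import Path


def reports_with_alerts(report_dir):
    """Return the file stems of reports whose "overall" block has alert set."""
    alerts = []
    for path in sorted(Path(report_dir).glob("*.json")):
        report = json.loads(path.read_text(encoding="utf-8"))
        if report.get("overall", {}).get("alert"):
            alerts.append(path.stem)
    return alerts


def exit_code(report_dir="reports/monitoring"):
    """Return 1 if any report alerts (fails the CI job), otherwise 0."""
    alerts = reports_with_alerts(report_dir)
    if alerts:
        print("Drift alert in: " + ", ".join(alerts))
        return 1
    print("No drift alerts")
    return 0
```

In a workflow step this would typically end with `sys.exit(exit_code())`, so that the job status reflects the result.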

Dependencies

| Package | Version | Reason |
|---|---|---|
| numpy | 1.26.4 | Deepchecks incompatible with numpy≥2.0 |
| fsspec[http] | ≤2025.10.0 | datasets library constraint |
| deepchecks[nlp] | 0.19.1 | text analysis |

Access Control

Environment drift-detection-env requires approval before exposing secrets (MLflow, DagsHub, S3).

Troubleshooting

Deepchecks import error: install deepchecks[nlp]==0.19.1, not separate packages

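False positives: verify baseline statistics first. Drift is flagged when p falls at or below a cutoff, so false positives are reduced by lowering the cutoffs; raising them (e.g. warning at p=0.10, alert at 0.02) makes detection more sensitive and flags drift more often.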

Fallback mode: expected when Deepchecks is unavailable; the SciPy engine produces the same output structure.
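One way to decide between the primary and fallback engines is to check whether the package can be found before importing it. The sketch below uses only the standard library; `select_engine` is a hypothetical name and this is not necessarily how DriftDetector chooses its engine.

```python
import importlib.util


def select_engine(preferred="deepchecks", fallback="scipy"):
    """Return the name of the first engine whose top-level package is installed."""
    for name in (preferred, fallback):
        if importlib.util.find_spec(name) is not None:
            return name
    raise RuntimeError(f"neither {preferred!r} nor {fallback!r} is installed")
```

`find_spec` only confirms that the package is present. An installed Deepchecks can still fail at import time, for example with an incompatible numpy version, so a production check may also need to catch `ImportError` around the actual import.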