Data Drift Detection
Overview
Automated statistical monitoring that compares production datasets against training baselines for the code comment classifiers (Java, Python, Pharo). It uses Deepchecks, with a SciPy fallback, to identify statistically significant shifts in text properties and label distributions.
Architecture
DriftDetector (turing/monitoring/drift_detector.py) — statistical test orchestration and result aggregation
BaselineManager (turing/monitoring/baseline_manager.py) — reference distributions (length, word count, labels)
SyntheticDataGenerator (turing/monitoring/synthetic_data_generator.py) — controlled drift scenarios for validation
Reports — TXT and JSON output in reports/monitoring/
Detection Methods
Deepchecks (primary)
- TextPropertyDrift: distributional property comparison (length, word frequency, lexical diversity)
- Label Distribution: chi-square test on class frequencies
SciPy (fallback)
- Kolmogorov–Smirnov: text length and word count distribution
- Chi-square: label distribution
Both engines emit the same output structure, so downstream consumers do not need to know which one ran.
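The fallback tests listed above can be sketched with SciPy directly. This is a minimal illustration, not the project's `DriftDetector` code: the function names `text_stat_drift` and `label_drift`, the `alpha` parameter, and the returned keys are assumptions chosen for the example.

```python
from collections import Counter

from scipy.stats import chi2_contingency, ks_2samp


def text_stat_drift(reference_texts, production_texts, alpha=0.05):
    """Two-sample KS tests on character length and word count."""
    measures = {
        "length": len,
        "word_count": lambda text: len(text.split()),
    }
    results = {}
    for name, measure in measures.items():
        reference = [measure(text) for text in reference_texts]
        production = [measure(text) for text in production_texts]
        p_value = float(ks_2samp(reference, production).pvalue)
        results[name] = {"p_value": p_value, "drifted": p_value <= alpha}
    return results


def label_drift(reference_labels, production_labels, alpha=0.05):
    """Chi-square test of homogeneity on per-class label counts."""
    # Use the union of classes so both rows share the same columns.
    classes = sorted(set(reference_labels) | set(production_labels))
    reference_counts = Counter(reference_labels)
    production_counts = Counter(production_labels)
    table = [
        [reference_counts[c] for c in classes],
        [production_counts[c] for c in classes],
    ]
    p_value = float(chi2_contingency(table)[1])
    return {"p_value": p_value, "drifted": p_value <= alpha}
```

Both helpers assume non-empty inputs; `chi2_contingency` raises an error if one of the label lists is empty.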
Statistical Thresholds
| Level | Condition | Action |
|---|---|---|
| No drift | p > 0.05 | Continue monitoring |
| Warning | 0.01 < p ≤ 0.05 | Increase sampling frequency |
| Alert | p ≤ 0.01 | Trigger retraining |
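The table maps onto a small classification function. The sketch below is illustrative only; the name `classify_p_value`, the level strings, and the `ACTIONS` mapping are assumptions for the example, with the cutoffs taken from the table.

```python
# Actions copied from the threshold table; the level keys are illustrative.
ACTIONS = {
    "no_drift": "Continue monitoring",
    "warning": "Increase sampling frequency",
    "alert": "Trigger retraining",
}


def classify_p_value(p_value, warning=0.05, alert=0.01):
    """Map a p-value to a drift level using inclusive upper bounds."""
    if p_value <= alert:
        return "alert"
    if p_value <= warning:
        return "warning"
    return "no_drift"
```

Checking the alert cutoff first keeps the boundaries consistent with the table: a p-value of exactly 0.01 is an alert, and exactly 0.05 is a warning.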
API
```python
from turing.monitoring.drift_detector import DriftDetector

detector = DriftDetector()

# Compare a production batch against the training reference data.
results = detector.detect_all_drifts(
    production_texts=production_batch,
    production_labels=production_labels,
    reference_texts=training_texts,
    reference_labels=training_labels,
)

# Critical drift (p ≤ 0.01): hand off to the retraining pipeline.
if results["overall"]["alert"]:
    initiate_retraining_pipeline()
```
JSON Output
```json
{
  "text_property": {"drifted": false, "alert": false, "p_value": 0.087},
  "label_distribution": {"drifted": true, "alert": true, "p_value": 0.003},
  "overall": {"drifted": true, "alert": true, "num_drifts": 1}
}
```
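One way the `overall` entry could be derived from the per-check p-values is sketched below. It assumes that `drifted` means p ≤ 0.05 and `alert` means p ≤ 0.01, which matches the thresholds table and the example above but is not confirmed against the detector's source; `build_report` is a hypothetical helper name.

```python
def build_report(p_values, warning=0.05, alert=0.01):
    """Build a report dict from a {check_name: p_value} mapping."""
    checks = {
        name: {"drifted": p <= warning, "alert": p <= alert, "p_value": p}
        for name, p in p_values.items()
    }
    overall = {
        # Any single drifted or alerting check flags the whole run.
        "drifted": any(c["drifted"] for c in checks.values()),
        "alert": any(c["alert"] for c in checks.values()),
        "num_drifts": sum(c["drifted"] for c in checks.values()),
    }
    return {**checks, "overall": overall}
```

Calling `build_report({"text_property": 0.087, "label_distribution": 0.003})` reproduces the structure shown in the JSON example.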
Validation
Scenarios: normal, short_text, long_text, corrupted_vocab, class_imbalance
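To make the scenario names concrete, the sketch below shows one plausible way to derive each from a clean dataset. The transformations (keeping 3 words, repeating text 3 times, shuffling characters, keeping 10% of minority samples) are assumptions for illustration and may differ from what `SyntheticDataGenerator` does.

```python
import random
from collections import Counter


def make_scenario(texts, labels, scenario, seed=0):
    """Return (texts, labels) transformed for a named drift scenario."""
    rng = random.Random(seed)
    if scenario == "normal":
        return list(texts), list(labels)
    if scenario == "short_text":
        # Truncate every text to its first three words.
        return [" ".join(t.split()[:3]) for t in texts], list(labels)
    if scenario == "long_text":
        # Repeat every text three times to inflate its length.
        return [" ".join([t] * 3) for t in texts], list(labels)
    if scenario == "corrupted_vocab":
        # Shuffle the characters inside each word to break the vocabulary.
        def corrupt(word):
            return "".join(rng.sample(word, len(word)))
        return [" ".join(corrupt(w) for w in t.split()) for t in texts], list(labels)
    if scenario == "class_imbalance":
        # Keep the majority class and roughly 10% of everything else.
        majority = Counter(labels).most_common(1)[0][0]
        keep = [i for i, label in enumerate(labels)
                if label == majority or rng.random() < 0.1]
        return [texts[i] for i in keep], [labels[i] for i in keep]
    raise ValueError(f"unknown scenario: {scenario}")
```

The `normal` scenario should produce no drift, which makes it a useful check against false positives before the drifted scenarios are run.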
Dynamic Model Management
The system operates in a continuous optimization cycle—models are never static:
- Alert → Retraining: critical drift (p ≤ 0.01) triggers automatic retraining
- Multi-Model Evaluation: all architectures (LinearSVM, GRU-RNN, SentenceBERT, CodeBERTa, DeBERTaV3, XGBoost) evaluated on new dataset
- Best Model Deployment: classifier with highest F1/AUC becomes baseline
- Baseline Recomputation: thresholds and distributions recalculated
Rationale:
- Models optimized for short comments (Java) fail on long docstrings (Python)
- Vocabulary shift invalidates pre-computed embeddings
- Class imbalance changes favor different architectures
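The deployment step reduces to picking the top-scoring candidate after re-evaluation. The sketch below ranks by F1 and breaks ties by AUC; that ordering, the `select_best_model` name, and the shape of the `scores` mapping are assumptions, since the text only says "highest F1/AUC".

```python
def select_best_model(scores):
    """Return the model name with the highest F1, breaking ties by AUC.

    `scores` maps a model name to a dict with "f1" and "auc" entries.
    """
    if not scores:
        raise ValueError("no candidate models were evaluated")
    return max(scores, key=lambda name: (scores[name]["f1"], scores[name]["auc"]))
```

For example, given placeholder scores (not measured results) such as `{"LinearSVM": {"f1": 0.5, "auc": 0.6}, "CodeBERTa": {"f1": 0.5, "auc": 0.7}}`, the tie on F1 is resolved in favor of `"CodeBERTa"`.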
Operational Procedures
Deployment: validate 5/5 synthetic scenarios, verify baseline, test notification channels
Alert Response (p ≤ 0.01): notify ops → root cause analysis → retraining → log incident
Warning Response (0.01 < p ≤ 0.05): increase sampling 2x → segment analysis within 24h
GitHub Actions Pipeline
Triggers
- Schedule: 1st of month, 00:00 UTC — monthly reports for annual trend tracking and seasonal patterns
- Manual: on-demand via GitHub UI
- Push: workflow changes
Jobs
Execution: setup → {pull-data, generate-datasets} → drift-detection (matrix 3 languages) → aggregate-notify
- Setup: output language matrix [java, python, pharo]
- Pull Data: DVC pull from DagsHub S3, cached by commit SHA
- Generate Datasets: parquet → CSV → TF-IDF (5000 features), cache results
- Drift Detection: runs in parallel across the 3 languages, covers the 5 test scenarios, and generates JSON/TXT reports
- Aggregate: Step Summary, PR comments, CI status (fail on alert)
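The "fail on alert" behavior of the aggregate job can be sketched as a small script that scans the JSON reports and returns a non-zero exit code when any of them alerts. It assumes one JSON report per language (for example `java.json`) with the `overall.alert` field from the JSON Output section; the file naming and function names are hypothetical.

```python
import json
from pathlib import Path


def collect_alerts(report_dir):
    """Return the stems of all JSON reports whose overall.alert is true."""
    alerts = []
    for path in sorted(Path(report_dir).glob("*.json")):
        report = json.loads(path.read_text(encoding="utf-8"))
        if report.get("overall", {}).get("alert"):
            alerts.append(path.stem)
    return alerts


def main(report_dir="reports/monitoring"):
    """Print a short summary and return a process exit code."""
    alerts = collect_alerts(report_dir)
    if alerts:
        print("Drift alert for: " + ", ".join(alerts))
        return 1
    print("No drift alerts.")
    return 0
```

In a workflow step this would typically be invoked as `sys.exit(main())`, so that a returned 1 marks the CI run as failed.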
Dependencies
| Package | Version | Reason |
|---|---|---|
| numpy | 1.26.4 | Deepchecks incompatible with numpy≥2.0 |
| fsspec[http] | ≤2025.10.0 | datasets library constraint |
| deepchecks[nlp] | 0.19.1 | text analysis |
Access Control
The drift-detection-env environment requires approval before its secrets (MLflow, DagsHub, S3) are exposed to the workflow.
Troubleshooting
Deepchecks import error: install deepchecks[nlp]==0.19.1, not separate packages
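False positives: verify baseline statistics first. Drift is flagged when p falls at or below a cutoff, so raising the cutoffs (e.g. p=0.10, alert=0.02) flags more drift, not less; to reduce false positives, lower the cutoffs below the default 0.05 and 0.01.
Fallback mode: expected when Deepchecks is unavailable; the SciPy engine produces the same output structure.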
