Monitoring Stack

The entire monitoring stack is defined in the docker-compose file and started together with the rest of the services.

Uptime monitoring

Uptime monitoring is implemented to verify the public availability of both backend and frontend services:

  • API
      • Monitored endpoint: /status
      • Purpose: verifies that the API is reachable and responsive
  • Frontend
      • Monitored resource: application index page
      • Purpose: verifies that the frontend is correctly served and accessible

Both checks are executed every 10 minutes.
This interval was chosen deliberately, since the application is not currently in an active release or production phase.

In a production or release scenario, the monitoring frequency should be increased to ensure faster detection of service outages.

For each monitored service (API and frontend), an availability badge is included in the repository README, providing a high-level view of the current uptime status directly from the monitoring provider.

Grafana

Grafana provides comprehensive visualization dashboards for real-time monitoring of Prometheus metrics collected from the FastAPI backend. Two custom JSON dashboards follow observability best practices: ML API Performance (request rates, P95/P99 latency, error rates, prediction counts) and System Health (HTTP status codes, throughput, in-progress requests).

Dashboards are automatically provisioned via grafana/provisioning/dashboards/dashboards.yml, with Grafana running on port 4444 (credentials: admin/admin_password) and persistent storage via the grafana_data volume. The service depends on Prometheus readiness, with health checks ensuring a reliable startup in the docker-compose stack.

Access the dashboards at http://localhost:4444 after running docker compose up -d; verify that the Prometheus target is UP at http://localhost:9090/targets and add http://prometheus:9090 as the data source. Generate traffic via the /status, /models, and /predict endpoints to populate the metrics.
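
A few scripted requests are enough to populate the panels. A minimal sketch using requests, where the host port and the /predict payload shape are assumptions:

```python
import requests

BASE = "http://localhost:8080"  # assumed host port of the API service

# Hit the lightweight endpoints a few times so the rate panels have data.
for _ in range(20):
    requests.get(f"{BASE}/status")
    requests.get(f"{BASE}/models")

# Trigger some inference traffic; the request body shape is an assumption.
for comment in ["Returns the user id.", "TODO: refactor this loop."]:
    requests.post(
        f"{BASE}/predict",
        params={"model_type": "transformer"},
        json={"comment": comment, "language": "python"},
    )
```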

Prometheus

Prometheus metrics collection integrates directly into FastAPI via prometheus-fastapi-instrumentator (v7.1.0), tracking request/response sizes, latency, and total request counts across endpoints. The /metrics endpoint, which exposes these metrics for Prometheus scraping, is tested in tests/test_monitoring.py: the suite validates metric availability and confirms that counters (e.g., request totals) increment correctly after each API request. Prometheus (localhost:9090) is configured via prometheus.yml, enabling PromQL queries such as fastapi_model_http_requests_total. Alertmanager (localhost:9093) handles alerting based on alerts.yml, but the alert state is also visible directly in the Prometheus UI (localhost:9090) under the Alerts section, without visiting Alertmanager separately.
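
The shape of such a counter check can be sketched as follows (an illustrative simplification, not the project's actual test code; the app import path is an assumption):

```python
from fastapi.testclient import TestClient
from api.main import app  # import path assumed

client = TestClient(app)

def _counter_total(metrics: str, name: str) -> float:
    # Sum the counter across all of its label combinations.
    return sum(
        float(line.rsplit(" ", 1)[-1])
        for line in metrics.splitlines()
        if line.startswith(name + "{") or line.startswith(name + " ")
    )

def test_request_counter_increments():
    before = _counter_total(client.get("/metrics").text,
                            "fastapi_model_http_requests_total")
    client.get("/status")  # one extra request must move the counter
    after = _counter_total(client.get("/metrics").text,
                           "fastapi_model_http_requests_total")
    assert after > before
```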

Drift Detection

Data drift occurs when the statistical properties of incoming production data shift from the training data distribution, potentially degrading model performance in tasks like code comment classification. This is particularly challenging for text data, where raw string comparisons fail to capture semantic changes.

We introduced a drift detection pipeline in /monitoring/drift/drift_detection.py that leverages the fine-tuned CodeBERT Transformer model to embed text inputs into 768-dimensional semantic vectors. These embeddings enable robust statistical comparisons, bypassing direct text matching limitations.
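
A minimal sketch of the embedding step, assuming a Hugging Face checkpoint of the fine-tuned model and mean pooling over the last hidden state (the checkpoint path below is a placeholder):

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_PATH = "microsoft/codebert-base"  # placeholder; the fine-tuned checkpoint is assumed

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModel.from_pretrained(MODEL_PATH)

@torch.no_grad()
def embed(comments: list[str]) -> torch.Tensor:
    # Tokenize, run the encoder, then mean-pool tokens into one 768-d vector per comment.
    batch = tokenizer(comments, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state        # (batch, seq_len, 768)
    mask = batch["attention_mask"].unsqueeze(-1)     # ignore padding tokens
    return (hidden * mask).sum(1) / mask.sum(1)      # (batch, 768)
```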

Drift Detection Methodology

Validation Strategy Evolution

Initially, the drift detection pipeline compared the test set against the training set. However, test sets typically contain minor distribution shifts that lead to false positive drift detections. To address this limitation, the current approach validates detectors against subsets of the training data:

  • Reference: 1,000 samples from the training set
  • Test 1 (Fake Production): 150 different samples from the training set
  • Test 2 (Anomaly): Generated garbage/corrupted data

This approach ensures detectors can distinguish between natural train-test variance and actual anomalies.
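
Concretely, the two validation subsets can be carved out of the embedded training data as disjoint slices (the seed and variable names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
idx = rng.permutation(len(train_embeddings))  # train_embeddings: (n_train, 768), assumed

x_ref = train_embeddings[idx[:1000]]            # Reference window
x_fake_prod = train_embeddings[idx[1000:1150]]  # Test 1: "fake production" sample
```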

Improved Garbage Data Generation

The anomaly dataset was significantly enhanced to reflect realistic production drift scenarios, including the following perturbations (a sketch follows the list):

  • Variable renaming: Comments with obfuscated or non-descriptive variable names
  • Comment length inflation: Artificially lengthened comments with padding
  • Different coding style: Comments written in inconsistent or non-idiomatic styles
  • Cross-language comments: Comments mixing syntax or patterns from different programming languages
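
A condensed sketch of how such perturbations can be generated; the implementations below are illustrative, not the pipeline's exact logic:

```python
import random

def corrupt(comment: str) -> str:
    """Apply one of the perturbations listed above to a clean comment (illustrative)."""
    strategy = random.choice(["rename", "inflate", "style", "cross_lang"])
    if strategy == "rename":
        # Obfuscate identifier-like words into non-descriptive names.
        return " ".join(f"var_{i}" if w.isidentifier() else w
                        for i, w in enumerate(comment.split()))
    if strategy == "inflate":
        # Pad the comment well beyond its natural length.
        return comment + " padding" * 30
    if strategy == "style":
        # Non-idiomatic formatting: shouting case, underscores instead of spaces.
        return comment.upper().replace(" ", "_")
    # Mix in comment syntax from other languages.
    return f"/* {comment} */ <!-- end -->"
```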

Classifier Architecture Upgrade

The drift detector was upgraded from a simple linear classifier to a PyTorch Multi-Layer Perceptron (MLP) to capture non-linear relationships in the 768-dimensional embedding space. This aligns with Alibi Detect's official recommendations for improved sensitivity to subtle distribution shifts.
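
A sketch of this setup with Alibi Detect's PyTorch backend (layer sizes and epochs are illustrative; x_ref is the reference embedding matrix from above):

```python
import torch.nn as nn
from alibi_detect.cd import ClassifierDrift

# Small MLP that learns to distinguish reference from candidate embeddings.
mlp = nn.Sequential(
    nn.Linear(768, 256),
    nn.ReLU(),
    nn.Dropout(0.1),
    nn.Linear(256, 2),  # two classes: reference batch vs. production batch
)

# x_ref: (n_ref, 768) numpy array of reference embeddings (assumed available).
cd = ClassifierDrift(x_ref, mlp, backend="pytorch", p_val=0.05, epochs=5)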

Drift Detection Algorithms and Results

Three Alibi Detect methods process the embeddings: KSDrift (Kolmogorov-Smirnov test), MMDDrift (Maximum Mean Discrepancy), and ClassifierDrift (which trains a distinguisher model).
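
All three follow Alibi Detect's fit-on-reference, predict-on-candidate API. A minimal sketch for the two statistical detectors, with x_ref and x_test standing for the reference and candidate embedding matrices:

```python
from alibi_detect.cd import KSDrift, MMDDrift

ks = KSDrift(x_ref, p_val=0.05)                       # per-dimension K-S test
mmd = MMDDrift(x_ref, backend="pytorch", p_val=0.05)  # kernel two-sample test

for detector in (ks, mmd):
    pred = detector.predict(x_test)
    print(pred["data"]["is_drift"], pred["data"]["p_val"])
```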

Testing was performed on training subset data (no drift expected) versus generated garbage/anomaly inputs (drift expected) for each language:

| Language | Algorithm  | Test 1: Baseline (Expectation: No Drift) | Test 2: Garbage Data (Expectation: Drift) | Analysis |
|----------|------------|------------------------------------------|-------------------------------------------|----------|
| Python   | KSDrift    | No Drift (P-val: 0.45)                   | Detected (P-val: 0.00)                    | Perfectly calibrated. |
| Python   | MMDDrift   | No Drift (P-val: 0.24)                   | Detected (P-val: 0.00)                    | Correctly identifies distribution match. |
| Python   | Classifier | No Drift (Score: 0.03)                   | Detected (Score: 0.63)                    | Clear separation between valid (0.03) and invalid (0.63) data. |
| Java     | KSDrift    | No Drift (P-val: 0.52)                   | Detected (P-val: 0.00)                    | Perfectly calibrated. |
| Java     | MMDDrift   | No Drift (P-val: 0.40)                   | Detected (P-val: 0.00)                    | Stable baseline. |
| Java     | Classifier | No Drift (Score: 0.14)                   | Detected (Score: 0.45)                    | Score increases by ~3x on garbage data. |
| Pharo    | KSDrift    | No Drift (P-val: 0.57)                   | Detected (P-val: 0.00)                    | Perfectly calibrated. |
| Pharo    | MMDDrift   | No Drift (P-val: 0.61)                   | Detected (P-val: 0.00)                    | Stable baseline. |
| Pharo    | Classifier | No Drift (Score: 0.05)                   | Detected (Score: 0.82)                    | Excellent separation (0.05 vs 0.82). |

Running the drift detection scripts

The drift detection script can be started with python monitoring/drift/drift_detection.py.

Unit tests on drift detection can be executed with pytest tests/test_drift_detection.py.

Locust

To verify the robustness of the API under several concurrent connections, the Locust library was used. It provides a high-traffic simulation that tests the API's stability under load.

The load testing environment is fully containerized and integrated into the existing docker-compose workflow. A dedicated tests/load_test/Dockerfile builds a minimal image with Locust and pandas dependencies. The docker-compose.yml adds a load_test service exposing the Locust UI on port 8089, with internal networking to http://api:8080 for isolated testing.

Core User Simulation

The core user simulation is designed to approximate how real clients interact with the API, both in terms of traffic mix and payload variety. It emphasizes realistic comment content, endpoint coverage, and differentiated load across routes.

Request mix and task weights

The CodeCommentUser class defines multiple tasks with different weights, which control how often each endpoint is hit relative to the others. Higher-weight tasks (e.g., prediction on /predict?model_type=transformer) run much more frequently, reflecting the fact that inference is the main workload of the service. Lower-weight tasks (such as /models and /privacy) still run regularly, ensuring that discovery and informational endpoints are exercised under load without dominating traffic.
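
Its structure is roughly as follows (the weights, wait times, and payload are illustrative values, not the project's exact configuration):

```python
from locust import HttpUser, between, task

class CodeCommentUser(HttpUser):
    wait_time = between(1, 3)  # think time between requests

    @task(10)  # inference dominates the traffic mix
    def predict(self):
        self.client.post("/predict?model_type=transformer",
                         json={"comment": "Returns the user id."})

    @task(2)  # discovery endpoint, exercised regularly but not dominant
    def models(self):
        self.client.get("/models")

    @task(1)  # lightweight informational endpoint
    def privacy(self):
        self.client.get("/privacy")
```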

Realistic comment sampling

Instead of sending synthetic or static strings, the user behavior loads real code comment data for Java, Python, and Pharo from CSV files and keeps them in memory. For each request, a language is chosen at random and a comment is sampled uniformly from that language’s dataset. This produces non-repeating, semantically rich inputs, which better stress the Transformer model’s embedding and classification path and avoids cache-friendly, overly optimistic performance characteristics.
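
The sampling step can be sketched as follows (the file paths and column name are assumptions):

```python
import random
import pandas as pd

# Load each language's comments once at startup and keep them in memory.
COMMENTS = {
    lang: pd.read_csv(f"data/{lang}_comments.csv")["comment"].dropna().tolist()
    for lang in ("java", "python", "pharo")
}

def sample_comment() -> tuple[str, str]:
    # Pick a language at random, then a comment uniformly from its dataset.
    lang = random.choice(list(COMMENTS))
    return lang, random.choice(COMMENTS[lang])
```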

Endpoints covered by the simulation

The simulated user interacts with several endpoints to mirror a typical client lifecycle:

  • Root (/): Basic connectivity and welcome message, useful as a quick sanity check under load.
  • Health (/status): Periodic health checks mimicking liveness/readiness probes, ensuring the monitoring surface stays responsive even during heavy inference traffic.
  • Models (/models): A model discovery call representing clients querying available model types or metadata before calling /predict.
  • Privacy (/privacy): A lightweight informational endpoint that users or UIs may hit to retrieve privacy policy information.
  • Predict (/predict?model_type=transformer): The main, high-cost operation that receives randomly sampled comments and triggers full Transformer inference, responsible for the majority of CPU/RAM usage during the test.