Tests
To run the full test suite, or specific parts of it, use one of the following commands:
- make test to run data tests followed by all pytest-based tests under tests/
- make data_tests to run only the Great Expectations and Deepchecks data checks
- make api_tests to run just the FastAPI endpoint tests
helpers.py
This module centralizes shared configuration for the tests, including the list of supported languages, the number of labels per language, and the available model types (SetFit, Random Forest, Transformer). It also provides utility functions to resolve label indices, lazily load train/test CSV splits from data/raw, control sample sizes via an environment variable, and locate or check for trained models on disk. Many other test modules import these helpers to stay concise and consistent. A rough sketch of what such helpers look like is shown below.
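The following is a hypothetical sketch of the kind of helpers described above, not the actual contents of helpers.py; the constant names, the TEST_SAMPLE_SIZE environment variable, the data/raw file naming, and the models/ directory layout are all assumptions for illustration.

```python
import os
from functools import lru_cache
from pathlib import Path

import pandas as pd

LANGUAGES = ["java", "python", "pharo"]       # supported languages (assumed names)
DATA_DIR = Path("data/raw")                    # raw CSV location described above
SAMPLE_ENV_VAR = "TEST_SAMPLE_SIZE"            # assumed env var controlling sample size


@lru_cache(maxsize=None)
def load_split(language: str, split: str) -> pd.DataFrame:
    """Lazily load a train/test CSV split, optionally subsampled via an env var."""
    df = pd.read_csv(DATA_DIR / f"{language}_{split}.csv")
    limit = os.getenv(SAMPLE_ENV_VAR)
    return df.head(int(limit)) if limit else df


def model_path(language: str, model_type: str) -> Path:
    """Locate a trained model on disk so tests can skip when it is missing."""
    return Path("models") / language / model_type
```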
test_api.py
This file contains unit tests for the FastAPI service using TestClient, covering both “happy path” predictions and validation failures. It exercises the root, status, and models endpoints, checks that real models can be invoked for Java, Python, and Pharo, and verifies that the Pydantic request schema rejects malformed or incomplete payloads (e.g., missing language, unsupported language, missing text or model type).
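An illustrative pytest sketch of this style of test is shown below; it assumes the FastAPI app is importable as api.main.app and that the prediction endpoint lives at /predict and expects language, model_type, and text fields, which are assumptions rather than the documented routes.

```python
from fastapi.testclient import TestClient

from api.main import app  # hypothetical import path

client = TestClient(app)


def test_root_endpoint_is_up():
    assert client.get("/").status_code == 200


def test_predict_rejects_unsupported_language():
    payload = {"language": "cobol", "model_type": "setfit", "text": "returns the id"}
    response = client.post("/predict", json=payload)
    # Pydantic request validation should reject the payload (422 Unprocessable Entity).
    assert response.status_code == 422
```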
test_data_quality.py
This script implements dataset-level quality checks using Great Expectations, treating each raw and transformer-processed CSV (for every language and split) as a separate data asset. It programmatically builds expectation suites that check for the presence and non-nullness of key columns, reasonable comment lengths, valid partition flags, and well-formed multi-hot label vectors. It then runs checkpoints for all configured datasets and optionally generates HTML Data Docs reports under reports/data_tests/great_expectations; the process exits with a non-zero status if any dataset fails validation.
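A minimal sketch of building expectations programmatically is shown below, assuming the Great Expectations 0.17/0.18 fluent pandas API; the column names comment_sentence and partition, the length bounds, and the datasource/asset names are illustrative assumptions, not the project's exact configuration.

```python
import great_expectations as gx
import pandas as pd

df = pd.read_csv("data/raw/java_train.csv")  # assumed file name

context = gx.get_context()
datasource = context.sources.add_pandas("nlbse")
asset = datasource.add_dataframe_asset("java_train")
batch_request = asset.build_batch_request(dataframe=df)
validator = context.get_validator(batch_request=batch_request)

# Structural expectations in the spirit described above.
validator.expect_column_values_to_not_be_null("comment_sentence")
validator.expect_column_value_lengths_to_be_between("comment_sentence", min_value=1, max_value=512)
validator.expect_column_values_to_be_in_set("partition", [0, 1])

result = validator.validate()
assert result.success  # non-zero exit in the real script when this fails
```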
test_data_validation.py
This module holds a narrative HTML/Markdown template that documents the results of running deepchecks.nlp data checks on the NLBSE dataset for each language. It explains, in prose, the purpose of the Deepchecks suite (duplicates, conflicting labels, special characters, label drift, and train–test sample mix), summarises the overall suite status, and describes how to interpret per‑label distribution tables and each individual check when no failures are raised.
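The Deepchecks suite that this template documents can be run along the following lines; this is a hedged sketch using deepchecks.nlp, where train_texts, train_labels, test_texts, and test_labels are hypothetical placeholders for the per-language comment sentences and their multi-hot labels.

```python
from deepchecks.nlp import TextData
from deepchecks.nlp.suites import data_integrity, train_test_validation

# Hypothetical lists of comment sentences and their labels per split.
train = TextData(raw_text=train_texts, label=train_labels, task_type="text_classification")
test = TextData(raw_text=test_texts, label=test_labels, task_type="text_classification")

# Duplicates, conflicting labels, and special characters are covered by the
# integrity suite; label drift and train-test samples mix by the validation suite.
data_integrity().run(train).save_as_html("integrity_report.html")
train_test_validation().run(train_dataset=train, test_dataset=test).save_as_html("train_test_report.html")
```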
test_preprocessing.py
These tests focus on the preprocessing utilities used by the transformer models, ensuring that label strings are parsed correctly, the combo column is created or preserved as expected, and the supersampling procedure behaves sensibly. Integration-style tests also exercise the load_or_prepare_data function end to end, writing tiny raw CSVs to a temporary directory, running preprocessing (with and without supersampling), and asserting that processed artifacts, derived columns (like labels_array), and CSV outputs are created consistently.
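The integration-style tests roughly follow the shape below; the signature of load_or_prepare_data, its import path, the column names, and the processed output layout are assumptions used only to illustrate the tmp_path pattern.

```python
import pandas as pd

from src.preprocessing import load_or_prepare_data  # hypothetical import path


def test_load_or_prepare_data_creates_processed_artifacts(tmp_path):
    raw = pd.DataFrame(
        {
            "comment_sentence": ["Returns the id.", "Deprecated: use foo()."],
            "labels": ["[1, 0, 0]", "[0, 1, 0]"],
            "partition": [0, 1],
        }
    )
    raw.to_csv(tmp_path / "java_train.csv", index=False)

    # Hypothetical call: run preprocessing without supersampling into tmp_path.
    dataset = load_or_prepare_data(language="java", data_dir=tmp_path, supersample=False)

    assert "labels_array" in dataset.columns      # derived multi-hot column
    assert (tmp_path / "processed").exists()      # processed CSV outputs written out
```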
test_sync_models.py
This module validates the MLflow-based model synchronization logic that pulls “champion” models from the registry to the local filesystem. It replaces the real MLflow client and artifact download function with lightweight stubs, runs sync_best_models_to_disk into a temporary directory, and checks that only the expected language/model combination is materialized, that transformer models are correctly flattened from an inner model/ subdirectory into the final api/<lang>/transformer folder, and that wrapper directories are removed.
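The stubbing approach looks roughly like the sketch below; apart from sync_best_models_to_disk, the module path src.sync_models, the download_artifacts name, and the function's keyword arguments are assumptions made for illustration.

```python
from pathlib import Path

from src.sync_models import sync_best_models_to_disk  # hypothetical import path


class DummyMlflowClient:
    def get_model_version_by_alias(self, name, alias):
        ...  # return a stub version pointing at a fake artifact URI


def fake_download(uri, dst):
    # Mimic a transformer artifact with an inner model/ subdirectory to flatten.
    Path(dst, "model").mkdir(parents=True, exist_ok=True)


def test_only_champion_models_are_materialized(tmp_path, monkeypatch):
    monkeypatch.setattr("src.sync_models.MlflowClient", lambda: DummyMlflowClient())
    monkeypatch.setattr("src.sync_models.download_artifacts", fake_download)

    sync_best_models_to_disk(output_dir=tmp_path)  # hypothetical signature

    # Transformer models should be flattened into api/<lang>/transformer,
    # and the temporary wrapper directories removed.
    assert (tmp_path / "api" / "java" / "transformer").exists()
```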
test_minimum_functionality.py
These tests implement minimum functionality checks for the different model families by combining golden examples and structural assertions. The golden-example tests (currently marked as expected-to-fail while models are still improving) compare predicted multi-label vectors against hand-crafted target labels for representative comments, while additional tests verify that all models return integer NumPy arrays with the correct 2D shape and number of rows for several real comments from each language’s test split.
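A sketch of the two styles of check is given below; the java_predictor fixture, its predict API, the number of labels, and the hand-crafted target vector are assumptions chosen to illustrate the golden-example and shape-assertion pattern.

```python
import numpy as np
import pytest


@pytest.mark.xfail(reason="models are still improving", strict=False)
def test_java_deprecation_golden_example(java_predictor):
    comment = "@deprecated Use newMethod() instead."
    prediction = java_predictor.predict([comment])        # hypothetical predictor API
    expected = np.array([[0, 1, 0, 0, 0, 0, 0]])           # hand-crafted target labels (assumed)
    assert np.array_equal(prediction, expected)


def test_predictions_have_expected_shape(java_predictor):
    comments = ["Returns the id.", "Sets the name.", "Closes the stream."]
    prediction = java_predictor.predict(comments)
    assert prediction.ndim == 2 and prediction.shape[0] == len(comments)
    assert np.issubdtype(prediction.dtype, np.integer)
```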
test_invariance.py
This module examines various invariance properties of the predictors, such as robustness to whitespace changes, duplicate inputs, benign text transformations, punctuation edits, and simple typos. Some tests are marked with xfail to document known sensitivities (e.g., tokenizer behavior and lack of augmentation), while others, like the duplicate-input check, are expected to pass and help ensure deterministic, consistent predictions for identical inputs across all languages and model types.
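The duplicate-input and whitespace checks follow the pattern sketched below; the predictor fixture and its predict API are assumptions, and the xfail reason paraphrases the known sensitivities mentioned above.

```python
import numpy as np
import pytest


def test_identical_inputs_get_identical_predictions(predictor):
    comment = "Returns the number of open connections."
    out = predictor.predict([comment, comment])
    # Deterministic models must give the same prediction for duplicate inputs.
    assert np.array_equal(out[0], out[1])


@pytest.mark.xfail(reason="tokenizer whitespace sensitivity; no augmentation yet")
def test_whitespace_invariance(predictor):
    comment = "Returns the number of open connections."
    noisy = "  Returns the number   of open connections. "
    assert np.array_equal(predictor.predict([comment]), predictor.predict([noisy]))
```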
test_directional.py
These tests perform directional behavioral checks: they assess whether predictions respond monotonically when extra evidence for a label is added to the input. For each language and label, the module locates a real example where the label is active, constructs a “strengthened” version of the text by prefixing a semantic trigger phrase (such as “Deprecated:” or “Example:”), and asserts that the model’s confidence for that label does not decrease, thereby checking that the classifier’s outputs move in a sensible direction when the comment becomes more explicit.
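A hedged sketch of one such directional check follows; the label index, the trigger phrase wiring, and the predict_proba-style confidence API are assumptions used to illustrate the monotonicity assertion.

```python
def test_deprecation_confidence_does_not_decrease(predictor):
    base = "This method will be removed in a future release."
    strengthened = "Deprecated: " + base             # prefix a semantic trigger phrase

    label_idx = 1                                     # assumed index of the deprecation label
    base_conf = predictor.predict_proba([base])[0][label_idx]
    strong_conf = predictor.predict_proba([strengthened])[0][label_idx]

    # Adding explicit evidence for the label should not lower its confidence.
    assert strong_conf >= base_conf
```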
test_data_quality and test_data_validation interplay
Together, test_data_quality.py and test_data_validation.py provide complementary dataset checks: the former enforces structural expectations and generates GX reports, while the latter offers Deepchecks-based semantic diagnostics and human-readable explanations of the checks and their outcomes. Running make data_tests executes these components and ensures that both raw and processed datasets are structurally sound and free from obvious issues like duplicates, label conflicts, or severe distribution drift.
test_monitoring.py
This module tests Prometheus metrics exposure and prediction counter functionality using TestClient and prometheus_client.parser. It verifies that the /metrics endpoint returns valid Prometheus text format (with # HELP and # TYPE headers) and checks that the prediction_count_total metric increments correctly after successful predictions.
The parametrized test simulates predictor calls for transformer models across supported languages, monkeypatching get_predictor to use a DummyPredictor stub, then asserts the metric appears in /metrics output with correct labels (language and model_type). The _find_sample helper parses metric families to locate samples matching specific label combinations for detailed validation.
These tests ensure the monitoring stack captures inference metrics reliably, enabling Grafana dashboards and alerting on prediction volume by language/model.
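The sample lookup can be done with prometheus_client.parser as sketched below; the client fixture is a simplified stand-in for the TestClient plus DummyPredictor setup described above, while the /metrics route and the prediction_count_total metric name follow that description.

```python
from prometheus_client.parser import text_string_to_metric_families


def find_sample(metrics_text, metric_name, labels):
    """Return the first sample of metric_name whose labels include the given labels."""
    for family in text_string_to_metric_families(metrics_text):
        for sample in family.samples:
            if sample.name == metric_name and labels.items() <= sample.labels.items():
                return sample
    return None


def test_prediction_counter_has_language_and_model_labels(client):
    metrics_text = client.get("/metrics").text
    sample = find_sample(
        metrics_text,
        "prediction_count_total",
        {"language": "java", "model_type": "transformer"},
    )
    assert sample is not None
```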
test_drift_detection.py
This module validates the drift detection pipeline that monitors data distribution changes in code comment embeddings using the fine-tuned Transformer model. The test suite is organized into multiple test classes:
TestDriftClassifier: Validates the PyTorch MLP classifier used for drift detection, ensuring correct initialization of the neural network architecture (input dimension of 768, output dimension of 2), forward pass tensor operations, and device placement (CPU/GPU/MPS).
TestEmbeddingFunction: Tests the embedding generation pipeline that loads the fine-tuned CodeBERT model and computes 768-dimensional semantic vectors. It verifies that the embedding function is callable, produces correct output shapes (batch_size × 768), handles both list and NumPy array inputs, and returns embeddings with correct dtype (float32).
TestDriftDetectionIntegration: Confirms that all required Alibi Detect modules (KSDrift, MMDDrift, ClassifierDrift) are properly installed and available for use.
TestDriftDetectionDataHandling: Ensures data preprocessing works correctly, including parsing of label columns from string representations and validation of empty dataset handling.
TestDriftDetectionRun: Performs end-to-end integration testing of the complete drift detection pipeline. It loads real data from the processed transformer datasets, requires a minimum of 150 samples (to enable drift detector training), and verifies that the detector correctly:
- Reports no drift for baseline test data (samples from the same training distribution)
- Reports drift detected for garbage/anomalous data (generated corrupted inputs)
- Produces appropriate output messages for both test scenarios across Python, Java, and Pharo languages
Tests are parameterized across all three supported languages and skip gracefully when required models or datasets are unavailable (e.g., in CI environments without cached model artifacts).
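For reference, the Alibi Detect usage that the integration test exercises looks roughly like the sketch below; embed stands in for the fine-tuned CodeBERT embedding function producing 768-dimensional float32 vectors, and the helper name is hypothetical.

```python
import numpy as np
from alibi_detect.cd import KSDrift


def run_drift_check(embed, reference_texts, candidate_texts, p_val=0.05):
    """Fit a KS drift detector on reference embeddings and score candidate texts."""
    x_ref = embed(reference_texts)                 # (n_ref, 768) float32 embeddings
    detector = KSDrift(x_ref, p_val=p_val)
    preds = detector.predict(embed(candidate_texts))
    return bool(preds["data"]["is_drift"])


# Baseline samples drawn from the training distribution should report no drift,
# while corrupted/garbage inputs should trip the detector.
```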