CodeCommentClassification overview

CodeCommentClassification is an end-to-end pipeline to classify comment sentences into language-specific categories and to aggregate results at file/PR level so reviewers can focus on rationale, usage notes, deprecations, examples, and other high-value signals.

The project targets the NLBSE’26 baselines and aims to surpass them, providing reproducible training, evaluation, and inference.

Core choices:

  • Task: multi-label text classification at sentence level
  • Scope: three languages with per-language models (Java, Python, Pharo)
  • Usage: batch predictions on submissions (pre-review), summaries per file/PR
  • Human-in-the-loop: reviewer confirmations/overrides feed threshold recalibration (sketched below)
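
For illustration, the recalibration step could look like the following minimal sketch. It assumes reviewer feedback is collected as per-label 0/1 confirmations; the function name, search grid, and F1 criterion are illustrative choices, not the project's verbatim logic.

```python
import numpy as np

def recalibrate_thresholds(probs: np.ndarray, confirmed: np.ndarray,
                           grid: np.ndarray = np.linspace(0.1, 0.9, 17)) -> np.ndarray:
    """Pick, per label, the decision threshold that maximizes F1 on feedback.

    probs:     (n_sentences, n_labels) model scores for reviewed sentences
    confirmed: (n_sentences, n_labels) 0/1 labels after reviewer
               confirmations and overrides
    """
    thresholds = np.empty(probs.shape[1])
    for j in range(probs.shape[1]):
        best_f1, best_t = -1.0, 0.5
        for t in grid:
            pred = probs[:, j] >= t
            tp = np.sum(pred & (confirmed[:, j] == 1))
            fp = np.sum(pred & (confirmed[:, j] == 0))
            fn = np.sum(~pred & (confirmed[:, j] == 1))
            f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
            if f1 > best_f1:
                best_f1, best_t = f1, t
        thresholds[j] = best_t
    return thresholds
```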

This is an excerpt from the repository's README.md; for additional information, please refer to that file.

Model Selection

We progressed from underwhelming Random Forest results to SetFit as a solid baseline, then achieved our final milestone with a high-performance CodeBERT-based pipeline for multi-label code comment classification across Java, Python, and Pharo datasets.

Initial Random Forest attempts failed due to poor handling of semantic nuances in code comments, prompting a shift to SetFit as the effective baseline.

Milestone 4 introduced the CodeBERT enhancements: label-aware supersampling during CSV preprocessing to balance classes, plus a custom TransformerTrainer combining BCEWithLogitsLoss with positive class weights, a WeightedRandomSampler, and linear warmup/decay schedulers, all tracked via MLflow.
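
A condensed sketch of that weighted training setup, using toy tensors in place of the real dataset and model; the actual TransformerTrainer wiring may differ:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler
from transformers import get_linear_schedule_with_warmup

num_labels, num_epochs = 7, 3
# Toy multi-hot label matrix standing in for the supersampled CSV data.
train_labels = (torch.rand(256, num_labels) > 0.8).float()
train_inputs = torch.randint(0, 50_000, (256, 64))  # fake token ids
train_dataset = TensorDataset(train_inputs, train_labels)

# Positive class weights: up-weight rare labels in the multi-label loss.
label_freq = train_labels.mean(dim=0).clamp(min=1e-6)
pos_weight = (1.0 - label_freq) / label_freq
loss_fn = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)

# WeightedRandomSampler oversamples sentences that carry rare labels.
sample_weights = (train_labels / label_freq).sum(dim=1).clamp(min=1.0)
sampler = WeightedRandomSampler(sample_weights, num_samples=len(sample_weights))
loader = DataLoader(train_dataset, batch_size=16, sampler=sampler)

# Linear warmup followed by linear decay over all training steps.
dummy_param = torch.nn.Parameter(torch.zeros(1))  # stands in for model params
optimizer = torch.optim.AdamW([dummy_param], lr=2e-5)
total_steps = len(loader) * num_epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=int(0.1 * total_steps), num_training_steps=total_steps
)
```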

CodeBERT Model

CodeBERT, a bimodal transformer pretrained on code and natural language, excels at code-understanding tasks by generating contextual embeddings for comments, enabling strong multi-label classification (e.g., Java Macro F1 0.7457, Micro F1 0.8364; Python Macro F1 0.6385).
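
For illustration, a multi-label head can be attached to the pretrained CodeBERT checkpoint via Hugging Face transformers; this sketch uses an invented label set and an untrained head, whereas real predictions come from the fine-tuned checkpoints:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

LABELS = ["summary", "usage", "deprecation"]  # illustrative label set

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base",
    num_labels=len(LABELS),
    problem_type="multi_label_classification",  # sigmoid scores + BCE loss
)

batch = tokenizer(
    ["// Deprecated: use Foo#bar instead."],
    return_tensors="pt", truncation=True, padding=True,
)
with torch.no_grad():
    probs = torch.sigmoid(model(**batch).logits)[0]
# Labels are decided independently: one sentence may receive several.
predicted = [lbl for lbl, p in zip(LABELS, probs) if p > 0.5]
```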

Sync

The api/sync_models.py module automatically downloads the latest champion MLflow models for each language (python, java, pharo) from the remote registry to local disk at API startup. For each language it:

  • searches registered models prefixed by the language name
  • resolves the "<lang>-champion" alias to a concrete model version
  • downloads the artifacts to a normalized directory structure (models/api/<lang>/<model_type>/)
  • flattens transformer model subdirectories for seamless predictor loading

This ensures the API always serves the best-performing versions without manual intervention.
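
A minimal sketch of the alias-based sync, assuming an MLflow version with registry aliases (≥ 2.3); the directory naming here uses the registered model name rather than the project's exact <model_type> mapping:

```python
from pathlib import Path

import mlflow
from mlflow.exceptions import MlflowException
from mlflow.tracking import MlflowClient

LANGS = ["python", "java", "pharo"]

def sync_champions(dest_root: str = "models/api") -> None:
    client = MlflowClient()
    for lang in LANGS:
        # Find registered models whose name starts with the language prefix.
        for rm in client.search_registered_models(filter_string=f"name LIKE '{lang}%'"):
            try:
                # Resolve the "<lang>-champion" alias to a concrete version.
                version = client.get_model_version_by_alias(rm.name, f"{lang}-champion")
            except MlflowException:
                continue  # this model has no champion alias
            dst = Path(dest_root) / lang / rm.name
            dst.mkdir(parents=True, exist_ok=True)
            # Download that version's artifacts into the local model directory.
            mlflow.artifacts.download_artifacts(
                artifact_uri=f"models:/{rm.name}/{version.version}",
                dst_path=str(dst),
            )
            print(f"synced {rm.name} v{version.version} -> {dst}")
```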

Modules

CodeCommentClassification consists of two main modules: a FastAPI-based API for model serving and inference, and a Vite-powered React frontend for user-friendly file upload, comment extraction, and result visualization across Java, Python, and Pharo codebases.

API Module

The API module runs as a secure FastAPI web service in a Python 3.11 Docker container, automatically syncing the latest champion models from MLflow at startup. It exposes endpoints for dynamic model listing and core prediction, accepting code comments with specified language and model type, then returning multi-label classifications using SetFit, Random Forest, or Transformer models via lazy-loaded predictors.
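
As a sketch, a client might exercise the service like this; the endpoint paths and payload fields are illustrative, so consult the API's OpenAPI docs for the actual schema:

```python
import requests

API = "http://localhost:8080"  # assumed local deployment port

# Hypothetical endpoints: list the served models, then request a prediction.
models = requests.get(f"{API}/models", timeout=10).json()
print(models)

payload = {
    "comment": "Returns the cached value, or None if it has expired.",
    "language": "python",
    "model_type": "transformer",  # or "setfit" / "random_forest"
}
resp = requests.post(f"{API}/predict", json=payload, timeout=30)
resp.raise_for_status()
print(resp.json())  # e.g. predicted labels with confidence scores
```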

Frontend Module

The frontend provides an intuitive React interface where users upload source files; it auto-detects the language from extensions, extracts individual comments, queries available models from the API, and batches classification requests while caching results to avoid duplicates. Results appear overlaid as labels and scores directly on syntax-highlighted code lines, with real-time status updates throughout the seamless parse-extract-classify-visualize workflow.

CI and CD

The project implements a streamlined GitHub Actions CI/CD pipeline that automates testing and deployment for the dockerized FastAPI API, React frontend, and ML models:

  • parallel CI workflows triggered on push/PR events using Python 3.11.13 venvs: ci-api (endpoint tests), ci-models (CodeBERT/SetFit validation), and ci-ruff (linting)
  • automated CD container builds and pushes to the registry on merges to main for seamless production orchestration

Monitoring Stack

The monitoring stack is fully containerized in docker-compose for comprehensive observability and integrates:

  • uptime monitoring (API /status and the frontend index, checked every 10 minutes)
  • Prometheus metrics via prometheus-fastapi-instrumentator (tested in test_monitoring.py)
  • CodeBERT-powered drift detection (ClassifierDrift)
  • Locust load testing with realistic comment sampling
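
The Prometheus wiring with prometheus-fastapi-instrumentator typically takes only a couple of lines; this sketch adds an illustrative /status probe alongside the instrumentation:

```python
from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()

@app.get("/status")
def status() -> dict:
    return {"status": "ok"}  # probed by the uptime monitor every 10 minutes

# Exposes request count/latency metrics at /metrics for Prometheus to scrape.
Instrumentator().instrument(app).expose(app)
```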

Dockerized deployment

The entire CodeCommentClassification project is fully dockerized through a comprehensive docker-compose orchestration that deploys five interconnected services, all with automatic restarts and proper dependency chains for robust, production-ready deployment across development and production environments:

  • the FastAPI backend API (port 8080) with health checks, MLflow environment integration, hot-reload volumes, and resource limits (1 CPU / 2 GB RAM)
  • the React frontend (port 80), which waits for API health before starting
  • Prometheus (port 9090) for metrics collection
  • Alertmanager (port 9093) for alert handling
  • Grafana for visualization dashboards