
Turing – Code Comment Classification with LLMs

Background

Code comments play a vital role in software development. They help explain complex algorithms, clarify design decisions, and improve overall code maintainability. Automatically classifying these comments can significantly enhance code comprehension, documentation quality, and developer productivity.

The NLBSE'26 competition provides datasets for Java, Python, and Pharo, along with baseline models built using Sentence Transformers. This project leverages these datasets to build a robust code comment classification system using modern language models.


Development Workflow

Milestone 1: ML Canvas and AI Risk Analysis

This milestone focuses on aligning technical objectives with business goals and assessing potential risks.

ML Canvas

The ML Canvas is a strategic tool that bridges the gap between technical work and business goals. It provides a high-level overview of the machine learning project, including:

  • Target users and stakeholders
  • Expected outcomes
  • Model predictions
  • Data sources and collection methods
  • Features used

The canvas facilitates collaboration between technical and non-technical team members, ensuring the ML solution aligns with real-world objectives.

Documentation: ML Canvas

AI Risk Analysis

The AI Risk Analysis evaluates potential risks associated with deploying machine learning models. Key areas of focus include:

  • Data bias and fairness issues
  • Security vulnerabilities
  • Impact of incorrect predictions on developer productivity and code maintenance

This step helps ensure that the ML system is safe, reliable, and ethically sound.

Documentation: AI Risk Analysis


Milestone 2: DVC and MLflow Integration

Goal: Build a reproducible and trackable machine learning workflow.

  • Reproducible Pipeline: Developed a modeling pipeline using DVC (Data Version Control) to manage datasets, intermediate outputs, and model artifacts
  • Experiment Tracking: Integrated MLflow to log model parameters, metrics, and artifacts, enabling easy comparison and reproducibility of experiments

Documentation: Model Selection & Tags


Milestone 3: Data and Code Quality

This milestone focused on data integrity, code quality, and robust testing.

Code Structure & CLI

  • Refactored main scripts (e.g., dataset.py) into modular, class-based CLI applications
  • Structured the training pipeline in a clear, modular way (modeling/train.py) for maintainability and reusability

Data Versioning & Validation

  • Implemented Deepchecks to monitor data quality and validate datasets for consistency

Code Quality (CI)

  • Used Ruff for linting and code formatting
  • Enforced coding standards through GitHub Actions to ensure automated quality checks

Testing (CI)

  • Set up Pytest for unit testing
  • Added behavioral tests to validate model logic, robustness, and expected performance
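
A behavioral test checks a property of the predictions rather than a fixed score. The sketch below shows an invariance test and a directional test in Pytest style. The `classify` function is a hypothetical keyword rule standing in for the trained model, and the label names are illustrative only.

```python
def classify(comment: str) -> str:
    """Hypothetical stand-in for the trained model: a simple keyword rule."""
    text = comment.lower()
    if "deprecated" in text:
        return "deprecation"
    if "todo" in text or "fixme" in text:
        return "todo"
    return "summary"


def test_invariance_to_whitespace_and_case():
    # Invariance: harmless formatting changes should not change the label.
    base = "TODO: handle empty input"
    variants = ["  todo: handle empty input  ", "TODO:   handle empty input"]
    assert all(classify(v) == classify(base) for v in variants)


def test_directional_keyword_changes_label():
    # Directional: adding a strong signal should move the prediction.
    assert classify("Returns the user name") == "summary"
    assert classify("Deprecated. Returns the user name") == "deprecation"
```

With a real model, `classify` would wrap the inference call and the same two test functions could stay unchanged.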

Milestone 4: API Implementation

In this milestone, we developed an HTTP POST endpoint that sends text to the deployed MLflow model and returns its predictions.
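
As a rough illustration of how a client might call such an endpoint, the sketch below builds a POST request without sending it. The `/predict` route, the `texts` JSON field, and the `language` query parameter are assumptions made for this example; the Model API User Guide describes the actual contract.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_prediction_request(base_url: str, language: str, texts: list[str]) -> Request:
    """Build (but do not send) a POST request carrying comments to classify."""
    # Hypothetical route and payload shape; check the API docs for the real ones.
    url = f"{base_url.rstrip('/')}/predict?{urlencode({'language': language})}"
    body = json.dumps({"texts": texts}).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_prediction_request(
    "http://localhost:8000", "java", ["// Returns the user name"]
)
print(request.get_method(), request.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would send it to a running server.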

Additional Features Implemented

  • CodeBERTa Model: Integrated a transformer-based model to improve accuracy and F1 score over classical baselines
  • Test Report Generator: Automated generation of data and code quality reports
  • Model Unit Tests: Comprehensive unit tests for all models to ensure correctness
  • API & Model Documentation: Provided detailed documentation on API usage, endpoint behavior, and model information

Documentation: Model API User Guide

  • Model Card: Added a detailed model card describing the architecture, training data, intended uses, evaluation metrics, and environmental impact

Documentation: Model Card for Java, Model Card for Python, Model Card for Pharo

  • Dataset Card: Provides information about the dataset's contents, intended context, creation process, and other relevant considerations for users

Documentation: Dataset Card


Milestone 5: Containerization, CI/CD, HF Spaces and GUI

Goal: Enable containerized deployment, automation, and provide an interactive GUI for model inference.

MLflow Model Tagging

  • Implemented a model tagging mechanism in MLflow, assigning tags to each model based on attributes such as language, dataset, and model type
  • The tag best_model is automatically assigned to the best-performing model for each programming language immediately after training any new model. The tag is removed from the previous best model, ensuring that only the current best model has the best_model tag
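
The retagging rule can be sketched with an in-memory registry. This is not the MLflow API: `ModelRecord` and `retag_best_model` are hypothetical names, the scores are made up, and using F1 as the ranking metric is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class ModelRecord:
    name: str
    language: str
    f1: float
    tags: dict[str, str] = field(default_factory=dict)


def retag_best_model(models: list[ModelRecord], language: str) -> ModelRecord:
    """Give `best_model` to the top-F1 model for `language`; remove it from the others."""
    candidates = [m for m in models if m.language == language]
    if not candidates:
        raise ValueError(f"no models registered for {language!r}")
    best = max(candidates, key=lambda m: m.f1)
    for model in candidates:
        model.tags.pop("best_model", None)  # clear any previous holder
    best.tags["best_model"] = "true"
    return best


# Made-up scores for illustration only.
registry = [
    ModelRecord("baseline-java", "java", 0.71, {"best_model": "true"}),
    ModelRecord("codeberta-java", "java", 0.78),
    ModelRecord("baseline-python", "python", 0.66, {"best_model": "true"}),
]
best = retag_best_model(registry, "java")
print(best.name)
```

Only models of the requested language are touched, so the tag on the Python model is left alone when a Java model is retrained.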

Containerization and Deployment

  • Added a Dockerfile and docker-compose.yml to enable containerized deployment of the FastAPI application, including user permissions, dependency installation, and environment variable configuration
  • The container selects its models on its own: at startup, it uses the best_model MLflow tag to download the best model for each language and serves all subsequent API requests with it. This ensures that the system always serves the best-performing models
  • Updated .dockerignore to exclude unnecessary files and directories from Docker builds, improving build performance and security

Automation, CI/CD and HF Spaces

  • Built and uploaded the Docker image to a Hugging Face Space to enable containerized deployment

API: https://turing-team-turing-space.hf.space/docs

  • Introduced a GitHub Actions workflow (.github/workflows/push-folders.yml) to automatically sync the turing folder to the Hugging Face Space when changes are pushed to relevant branches

Model Inference GUI

  • Implemented a graphical user interface for model classification inference using the Gradio library
  • The GUI is accessible via a localhost endpoint and is included in the Docker image for seamless deployment

GUI: https://turing-team-turing-space.hf.space/gradio

  • Added functionality for user feedback: after performing inference, users can select the correct category from a dropdown menu. Feedback is automatically saved to a CSV file for future analysis and model improvement
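
A minimal sketch of the feedback logging step, using only the standard library. The column names and the `save_feedback` helper are assumptions for illustration; the project's actual CSV layout may differ.

```python
import csv
import tempfile
from datetime import datetime, timezone
from pathlib import Path

FIELDS = ["timestamp", "language", "comment", "predicted", "corrected"]


def save_feedback(path: Path, language: str, comment: str,
                  predicted: str, corrected: str) -> None:
    """Append one feedback row, writing the header if the file is new."""
    is_new = not path.exists()
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "language": language,
            "comment": comment,
            "predicted": predicted,
            "corrected": corrected,
        })


# Write two rows to a temporary file and read them back.
with tempfile.TemporaryDirectory() as tmp:
    feedback_file = Path(tmp) / "feedback.csv"
    save_feedback(feedback_file, "java", "// call close() after use", "summary", "usage")
    save_feedback(feedback_file, "python", "# returns the name", "summary", "summary")
    with feedback_file.open(newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))
print(len(rows), rows[0]["corrected"])
```

Storing both the predicted and the corrected label keeps the disagreements easy to filter when the file is later used for analysis.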

Documentation: GUI User Guide


Milestone 6: Monitoring

This milestone focuses on system stability, real-time observability, and data quality assurance through load testing, drift detection, and centralized logging pipelines.

Load Testing and Inference Optimization

  • Integrated Locust to simulate concurrent user traffic and dynamic language switching, ensuring the API handles stress gracefully.
  • Refactored the inference engine to use pure PyTorch instead of the HuggingFace Trainer. This eliminates disk-access errors during high-concurrency scenarios.
  • Implemented a caching mechanism for models, ensuring that language-specific models are loaded once and reused, significantly reducing latency and avoiding repeated disk I/O.
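
The load-once behavior can be sketched with `functools.lru_cache`. The `get_model` function and its placeholder return value are hypothetical; the project's cache may be implemented differently.

```python
from functools import lru_cache

load_calls = []  # records every real load so the caching effect is visible


@lru_cache(maxsize=None)
def get_model(language: str):
    """Load the model for `language` once; later calls reuse the cached object."""
    load_calls.append(language)
    # Placeholder for the expensive step (reading weights from disk).
    return {"language": language, "weights": object()}


first = get_model("java")
second = get_model("java")   # served from the cache, no second load
get_model("python")          # a different language triggers one new load
print(first is second, load_calls)
```

One caveat with this approach: if several threads request the same uncached language at the same moment, the loader may run more than once, so loading all models at startup or guarding the loader with a lock is a reasonable addition under heavy concurrency.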

Observability with Prometheus and Grafana

  • Exposed a /metrics endpoint for Prometheus to scrape, tracking custom business metrics such as total HTTP requests grouped by language, processed code comment volume, and character counts.
  • Integrated with Grafana Cloud to visualize real-time API performance and application health through dedicated dashboards.
  • Refactored API routes to accept language parameters via query strings, enabling more granular metric tracking and analysis.
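
To make the per-language metrics concrete, here is a dependency-free sketch that keeps three counters and renders them in the Prometheus text exposition format. The metric names are hypothetical, and a real service would more likely rely on a Prometheus client library than on hand-written rendering.

```python
from collections import Counter

requests_by_language = Counter()
comments_by_language = Counter()
characters_by_language = Counter()


def record_request(language: str, comments: list[str]) -> None:
    """Update the counters for one prediction request."""
    requests_by_language[language] += 1
    comments_by_language[language] += len(comments)
    characters_by_language[language] += sum(len(c) for c in comments)


def render_metrics() -> str:
    """Render the counters as Prometheus text: name{label="value"} number."""
    lines = []
    for name, counter in (
        ("turing_requests_total", requests_by_language),
        ("turing_comments_total", comments_by_language),
        ("turing_comment_characters_total", characters_by_language),
    ):
        lines.append(f"# TYPE {name} counter")
        for language, value in sorted(counter.items()):
            lines.append(f'{name}{{language="{language}"}} {value}')
    return "\n".join(lines) + "\n"


record_request("java", ["// ok", "// hi"])
record_request("python", ["# note"])
record_request("java", ["// x"])
print(render_metrics())
```

Because the language arrives as an explicit parameter, it can be used directly as the label value, which is what makes the per-language breakdown possible.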

Data Drift Detection Pipeline

  • Engineered a comprehensive drift detection pipeline using Deepchecks to compare production data against training baselines.
  • Implemented modules for synthetic drift generation (e.g., corrupted vocab, class imbalance) to validate the detection logic and ensure robust reporting.
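
The two synthetic drift types named above can be sketched as small, seeded transformations. The function names, the `<unk>` junk token, and the sample rows are assumptions for illustration; passing the results to the drift checks is not shown.

```python
import random


def corrupt_vocab(texts: list[str], rate: float, seed: int = 0) -> list[str]:
    """Replace roughly `rate` of the tokens with a junk token."""
    rng = random.Random(seed)
    corrupted = []
    for text in texts:
        tokens = [("<unk>" if rng.random() < rate else tok) for tok in text.split()]
        corrupted.append(" ".join(tokens))
    return corrupted


def imbalance_classes(rows: list[tuple[str, str]], keep_label: str,
                      keep_other: float, seed: int = 0) -> list[tuple[str, str]]:
    """Keep every `keep_label` row but only about `keep_other` of the rest."""
    rng = random.Random(seed)
    return [(text, label) for text, label in rows
            if label == keep_label or rng.random() < keep_other]


# Illustrative rows: (comment text, label).
rows = [
    ("returns the name", "summary"),
    ("call close() after use", "usage"),
    ("see Parser", "pointer"),
]
drifted_texts = corrupt_vocab([text for text, _ in rows], rate=1.0)
skewed_rows = imbalance_classes(rows, keep_label="summary", keep_other=0.0)
print(drifted_texts[0], skewed_rows)
```

Extreme settings such as `rate=1.0` make a useful sanity check: if the detection logic does not flag fully corrupted text, it is unlikely to catch subtler drift.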

Centralized Logging and Uptime with Better Stack

  • Implemented centralized log management using Better Stack, migrating the infrastructure from loguru to the standard Python logging library to ensure full ecosystem compatibility.
  • Configured the LogtailHandler to route application logs directly to the cloud dashboard for real-time analysis via the Live Tail interface.
  • Established uptime monitoring for the Hugging Face Space, featuring automated incident alerts that notify the team immediately upon system downtime or detection of critical ERROR level logs.
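
The logging setup can be sketched with the standard `logging` module alone. `InMemoryHandler` is a hypothetical stand-in for the remote handler: in the deployed system the LogtailHandler would take its place, and its configuration is not shown here because the source does not describe it.

```python
import logging


class InMemoryHandler(logging.Handler):
    """Stand-in for a remote log handler; it only collects formatted records."""

    def __init__(self) -> None:
        super().__init__()
        self.lines: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.lines.append(self.format(record))


def build_logger(name: str, remote: logging.Handler) -> logging.Logger:
    """Attach `remote` to a named logger that forwards INFO and above."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output through the root logger
    remote.setFormatter(logging.Formatter("%(levelname)s %(name)s %(message)s"))
    logger.addHandler(remote)
    return logger


remote = InMemoryHandler()
log = build_logger("turing.api", remote)
log.debug("not shipped")                  # below INFO, dropped
log.info("model loaded for %s", "java")
log.error("prediction failed")            # the level that would trigger an alert
print(remote.lines)
```

Because any `logging.Handler` subclass plugs into the same interface, swapping the stand-in for a cloud handler leaves the application's logging calls unchanged.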
