OWNML MACHINE LEARNING CANVAS

Value Proposition

This project aims to enhance code comprehension and maintenance by developing a machine learning model capable of automatically classifying code comments. The primary beneficiaries are software developers, maintainers, and documentation managers. They struggle with inefficient code comprehension and maintenance in large, complex codebases. The sheer volume of unstructured comments makes it difficult and time-consuming to locate specific information, such as design rationale, implementation details, or pending tasks. This slows down development, complicates the onboarding of new team members, and increases the risk of introducing bugs during maintenance. The comment classifier integrates into the developer's workflow by providing its output as a foundational layer for other software engineering tools. The purpose is to enhance these tools' capabilities, enabling more effective code comprehension and analysis. This allows for functionalities such as targeted searching and filtering of comments by their category, and provides high-level analytics on the codebase's documentation patterns.

Prediction Task

The task is multi-label classification. Predictions are made on individual code comment sentences, each of which can belong to one or more of the categories defined for its language. There are 18 categories in total across three programming languages (7 for Java, 5 for Python, and 6 for Pharo). The labels for each language are as follows:

Java: [summary, Ownership, Expand, usage, Pointer, deprecation, rational]
Python: [Usage, Parameters, DevelopmentNotes, Expand, Summary]
Pharo: [Keyimplementationpoints, Example, Responsibilities, Intent, Keymessages, Collaborators]
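
A minimal sketch of how the label sets above could be stored and how one sentence's categories could be turned into a multi-hot target vector. The names LABELS and encode_labels are hypothetical, and the label order is taken from the lists above rather than from the dataset itself.

```python
# Label sets copied from the lists above; the ordering is an assumption.
LABELS = {
    "java": ["summary", "Ownership", "Expand", "usage",
             "Pointer", "deprecation", "rational"],
    "python": ["Usage", "Parameters", "DevelopmentNotes", "Expand", "Summary"],
    "pharo": ["Keyimplementationpoints", "Example", "Responsibilities",
              "Intent", "Keymessages", "Collaborators"],
}


def encode_labels(language, active):
    """Return a multi-hot vector with 1 at each active category."""
    names = LABELS[language]
    unknown = set(active) - set(names)
    if unknown:
        raise ValueError(f"unknown labels for {language}: {sorted(unknown)}")
    return [1 if name in active else 0 for name in names]


# A Python comment sentence that is both a usage note and a summary.
print(encode_labels("python", {"Usage", "Summary"}))  # [1, 0, 0, 0, 1]
```

Because a sentence can carry several categories at once, the target is a vector of independent 0/1 flags rather than a single class index.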

Decisions

The ML system’s predictions support a structured process for code comprehension and maintenance. They add value by letting tools filter, highlight, or group comment sentences by category.

Based on the predictions, a developer can decide to filter comments selectively to focus on a specific task, refactor code to match the documented design, or update documentation for new features.

Impact Simulation

The impact of this classification model is measured in terms of its contribution to developer productivity and code comprehension. A correct classification reduces the cognitive load on developers and saves time during code navigation and maintenance. Conversely, an incorrect classification can lead to minor inefficiencies, such as a developer overlooking a relevant comment or spending extra time verifying its purpose. The pre-deployment impact is simulated through a two-step evaluation process. First, the model is evaluated on a test set: deployment is conditional on maintaining or improving Precision, Recall, and F1 scores. Second, the inference engine is stress-tested with Locust to ensure it meets latency requirements and stability targets under concurrent load.
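
The test-set gate described above could be sketched as follows. This is an illustration under stated assumptions, not the project's actual evaluation code: the text does not say which averaging is used, so micro-averaging is assumed here, and the names micro_prf and passes_gate as well as the tolerance parameter are hypothetical.

```python
def micro_prf(y_true, y_pred):
    """Micro-averaged precision, recall, and F1 over multi-hot rows."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1:
                fp += 1
            elif t == 1:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def passes_gate(candidate, baseline, tolerance=0.0):
    """True if no metric falls below the baseline by more than tolerance."""
    return all(candidate[m] >= baseline[m] - tolerance
               for m in ("precision", "recall", "f1"))


# Two sentences, two categories: one false positive, no false negatives.
scores = micro_prf([[1, 0], [0, 1]], [[1, 1], [0, 1]])
baseline = {"precision": 0.6, "recall": 0.9, "f1": 0.75}
print(passes_gate(scores, baseline))  # True
```

A candidate that improves one metric while regressing another fails the gate, which matches the "maintaining or improving" condition on all three scores.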

Making Predictions

Predictions are primarily performed in real time. The system serves instant classifications for code comments through a web interface and must meet strict low-latency requirements to ensure a seamless developer experience.

The computation for inference will be performed on standard research hardware, such as a local machine or cloud-based server, utilizing GPU resources to accelerate the processing phase.

Data Sources

The sole data source for this project is the official dataset provided for the NLBSE'26 competition, which is hosted on and accessed via the Hugging Face Hub (NLBSE/nlbse26-code-comment-classification). This dataset consists of both entities (code comment sentences for Java, Python, and Pharo) and observed outcomes (the ground-truth, multi-label annotations).

Data Collection

The NLBSE'26 dataset is made up of 1,733 manually labeled class comments and 9,361 sentences from these comments, distributed into categories specific to each programming language. These comments were extracted from 20 open-source projects written in one of three programming languages: Java, Pharo, or Python. The dataset provides the associated code class for each sentence as well as the source code of the software projects. Each comment sentence can belong to one or more categories of its language (from a minimum of 5 to a maximum of 7 categories, depending on the language). Each category represents the type of information that the sentence conveys.

While the training dataset is fixed, a strategy for continuous data collection is implemented in the deployment phase. The Gradio interface incorporates a human-in-the-loop feedback mechanism: users can validate or correct model predictions directly via the UI. These corrections are logged to a persistent storage system, creating a repository of real-world examples for future model monitoring and fine-tuning.
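
One way the feedback logging could look is sketched below. The storage backend is not specified in this document, so an append-only JSON Lines file is assumed, and the function name log_feedback and the record fields are hypothetical.

```python
import json
import time
from pathlib import Path


def log_feedback(path, sentence, language, predicted, corrected):
    """Append one user validation or correction as a JSON line."""
    record = {
        "timestamp": time.time(),
        "language": language,
        "sentence": sentence,
        "predicted": sorted(predicted),
        "corrected": sorted(corrected),
        # True when the user confirmed the prediction without changes.
        "was_correct": set(predicted) == set(corrected),
    }
    with Path(path).open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
    return record
```

A callback wired to the UI's confirm or correct action could call log_feedback("feedback.jsonl", sentence, "java", predicted, corrected). Keeping both the predicted and the corrected labels lets later monitoring measure how often users disagree with the model.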

Features

During prediction, the model works with input representations that simplify and enrich the original data. The main text is converted into dense numerical vectors (embeddings) that capture subtle shades of meaning. Specifically, code comments are processed via subword tokenization to generate sequences of Input IDs and Attention Masks. These features are consumed by pre-trained Transformer architectures (such as CodeBERTa and GraphCodeBERT), which have been fine-tuned to extract contextual semantic representations optimized for the multi-label classification task.
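
To illustrate how Input IDs and Attention Masks relate to each other, here is a toy sketch. It deliberately uses whitespace splitting and a tiny hand-built vocabulary as a stand-in for the real subword tokenizer, so the IDs are not those a CodeBERTa or GraphCodeBERT tokenizer would produce; build_vocab, encode_batch, PAD_ID, and UNK_ID are hypothetical names.

```python
PAD_ID = 0  # assumed padding ID for this toy vocabulary
UNK_ID = 1  # assumed ID for out-of-vocabulary tokens


def build_vocab(sentences):
    """Assign an integer ID to every whitespace token seen."""
    vocab = {"<pad>": PAD_ID, "<unk>": UNK_ID}
    for sentence in sentences:
        for token in sentence.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def encode_batch(sentences, vocab, max_length):
    """Return padded input IDs and matching attention masks."""
    input_ids, attention_mask = [], []
    for sentence in sentences:
        tokens = sentence.lower().split()
        ids = [vocab.get(tok, UNK_ID) for tok in tokens][:max_length]
        padding = max_length - len(ids)
        input_ids.append(ids + [PAD_ID] * padding)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * padding)
    return input_ids, attention_mask


vocab = build_vocab(["returns the size"])
ids, mask = encode_batch(["returns the size", "returns foo"], vocab, 4)
print(ids)   # [[2, 3, 4, 0], [2, 1, 0, 0]]
print(mask)  # [[1, 1, 1, 0], [1, 1, 0, 0]]
```

The attention mask has the same shape as the ID sequence, which is what allows sentences of different lengths to be batched together.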

Building Models

The goal is to develop and deploy three distinct, optimized models, one for each programming language (Java, Python, and Pharo). Rather than following a continuous development cycle, the modeling phase will focus on experimenting with distinct model architectures to identify the most effective design for each language. Each candidate model will be evaluated by comparing its average F1-score on a standardized test set against that of previous versions. Efficiency metrics such as inference runtime and GFLOPS will also be considered, so that gains in F1 are weighed against their computational cost.
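
The comparison of candidates could be sketched as below. The selection rule is an assumption made for illustration: this document does not state how F1 and efficiency are traded off, so the sketch filters by a hypothetical runtime budget, then picks the highest average F1 and breaks ties by lower GFLOPS. The names average_f1 and select_model and the candidate fields are hypothetical, and the numbers are invented example values, not results.

```python
def average_f1(per_category_f1):
    """Mean F1 across the categories of one language."""
    return sum(per_category_f1.values()) / len(per_category_f1)


def select_model(candidates, max_runtime_s):
    """Pick the best candidate within the runtime budget, or None."""
    eligible = [c for c in candidates if c["runtime_s"] <= max_runtime_s]
    if not eligible:
        return None
    # Higher average F1 wins; lower GFLOPS breaks ties.
    return max(eligible, key=lambda c: (average_f1(c["f1"]), -c["gflops"]))


# Invented example values for two hypothetical Python-language candidates.
candidates = [
    {"name": "model_a", "f1": {"Usage": 0.70, "Summary": 0.80},
     "runtime_s": 1.2, "gflops": 900.0},
    {"name": "model_b", "f1": {"Usage": 0.75, "Summary": 0.85},
     "runtime_s": 3.5, "gflops": 2500.0},
]
print(select_model(candidates, max_runtime_s=2.0)["name"])  # model_a
```

With a looser budget of 5.0 seconds, the same call would return model_b, since it has the higher average F1.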

Monitoring

The system employs a comprehensive monitoring strategy to guarantee both operational stability and model reliability. Real-time infrastructure health and API performance are tracked using Prometheus and Grafana, while Better Stack ensures external uptime and incident alerting. To maintain strict service levels, Locust is utilized for periodic stress testing, verifying that the inference engine sustains low latency under concurrent load. On the data side, Deepchecks continuously audits the pipeline for data drift and performance degradation, ensuring the classifier remains accurate over time.