Model Card for GraphCodeBERT-Comment-Classification: Python
A fine-tuned GraphCodeBERT model for multi-label classification of source code comments, capable of identifying several comment types (e.g., Summary, Usage, Parameters) in Python code.
Model Details
Model Description
This model is a fine-tuned version of microsoft/graphcodebert-base tailored for classifying code comments. It utilizes a Transformer-based architecture (AutoModelForSequenceClassification) to perform multi-label classification, meaning a single comment can be assigned multiple categories simultaneously.
The model classifies comments into 5 distinct categories: Usage, Parameters, Development Notes, Expand, and Summary.
Model info:
- Developed by: Turing Group
- Model type: Multi-label classification
- Language(s) (NLP): English (code comments)
- Finetuned from model: microsoft/graphcodebert-base
Model Sources
- Repository: Turing group repository
- Demo: API Link
Uses
Direct Use
The model is intended for the automatic analysis of source code repositories. It can be used to categorize comments to help developers understand codebases. The model takes a string (comment) as input and outputs a binary vector representing the active categories.
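As a minimal sketch of how the output vector might be read, the snippet below maps a binary vector back to category names. The label order is taken from the Results table in this card; the helper name `decode_labels` is hypothetical and is not part of the released code.

```python
# Label order assumed from the Results table of this card (index -> name).
PYTHON_LABELS = ["Usage", "Parameters", "Development Notes", "Expand", "Summary"]


def decode_labels(binary_vector):
    """Return the names of the categories whose entry is 1 (hypothetical helper)."""
    if len(binary_vector) != len(PYTHON_LABELS):
        raise ValueError("expected one entry per category")
    return [name for name, flag in zip(PYTHON_LABELS, binary_vector) if flag == 1]


# A comment tagged as both Usage and Summary.
print(decode_labels([1, 0, 0, 0, 1]))  # ['Usage', 'Summary']
```

An all-zero vector decodes to an empty list, meaning the model assigned no category to the comment.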
Downstream Use
This model is designed to be integrated into larger software engineering ecosystems. Below are specific downstream tasks enabled by the model's multi-label classification capabilities:
- IDE Plugin Integration: Filtering and organizing comments within editors.
- Maintenance Search Tools: Filtering codebases by comment category.
- Automated Documentation Auditing: Enhancing code review workflows with intelligent comment analysis.
- Semantic Code Navigation: Enabling advanced search capabilities beyond simple keyword matching.
Out-of-Scope Use
This model is not designed for code generation or completion. It should not be used for general-purpose natural language tasks unrelated to software engineering contexts (e.g., sentiment analysis of social media).
Bias, Risks, and Limitations
The model is trained on the NLBSE'26 Code Comment Classification dataset, which consists of comments extracted from open-source repositories. Consequently, the model may inherit biases and limitations of these projects and of the labeling process, such as:
- Domain Bias: A bias towards specific programming domains or styles prevalent in the collected repositories.
- Label Noise: Potential inconsistencies in the ground truth labels inherent to crowd-sourced or automatically mined datasets.
Recommendations
Users should validate the model's predictions on their specific codebase, as comment conventions vary significantly between projects. It is recommended to manually audit a sample of predictions before integrating the model into automated decision-making pipelines.
Training Details
Training Data
The model utilizes the NLBSE'26 Code Comment Classification dataset (hosted on Hugging Face).
For detailed information about the dataset, see the Dataset Card.
- Source: The dataset consists of source code comments extracted from various open-source repositories.
- Composition: It contains labeled data for multi-label classification, covering multiple programming languages (Java, Python, Pharo).
- Data Quality: The training set is a curated subset of the original data. It has been processed to remove duplicates, ambiguous labels, and non-informative comments (noise) to ensure high-quality supervision.
- Class Balance: Minority classes were augmented (see Preprocessing) to mitigate class imbalance, so that under-represented categories have more examples for training.
Training Procedure
Preprocessing
The training pipeline implements a "Smart Cleaning" and "Safe Augmentation" strategy to prepare the data before it reaches the model:
1. Data Cleaning & Filtering: Prior to tokenization, the raw text undergoes heuristic filtering:
- Deduplication: Removal of exact duplicates and semantic conflicts (identical text with different labels).
- Noise Reduction: Comments are filtered out if they are too short (< 2 tokens), too long, or consist primarily of code symbols rather than natural language.
- Normalization: Text is converted to lowercase and comment markers (e.g., //, /*) are stripped.
2. Data Augmentation: To handle imbalance, a Safe Augmentation technique is applied to minority classes:
- Method: Synonym replacement via WordNet and random case injection.
- Safety: A strict "protected list" of code keywords (e.g., return, if, void) is used to prevent the augmentation process from corrupting the semantic logic of code snippets.
3. Model Input Processing:
- Tokenization: Uses the AutoTokenizer for graphcodebert-base.
- Encoding: Labels are transformed into binary indicator (multi-hot) vectors to support multi-label classification.
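The cleaning and augmentation steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's pipeline: the function names, the `MAX_TOKENS` limit, the `#` marker, and the tiny `SYNONYMS` dictionary (a stand-in for WordNet) are all hypothetical. The code-symbol filter and random case injection are left out for brevity.

```python
import random
import re

# Hypothetical values: the card does not publish the exact lists or limits.
PROTECTED = {"return", "if", "void"}
SYNONYMS = {"remove": ["delete"], "return": ["yield"]}  # stand-in for WordNet
MAX_TOKENS = 128  # assumed upper bound for "too long"


def normalize(comment):
    """Lowercase the text and strip leading/trailing comment markers."""
    text = comment.strip().lower()
    text = re.sub(r"^(//+|/\*+|#+)\s*", "", text)  # '#' added here for Python
    text = re.sub(r"\s*\*+/$", "", text)
    return text.strip()


def clean(rows):
    """rows: list of (comment, labels) pairs; labels is a tuple of 0/1 flags."""
    seen = {}
    conflicts = set()
    for comment, labels in rows:
        text = normalize(comment)
        n_tokens = len(text.split())
        if n_tokens < 2 or n_tokens > MAX_TOKENS:
            continue  # too short or too long
        if text in seen and seen[text] != tuple(labels):
            conflicts.add(text)  # identical text with different labels
        seen.setdefault(text, tuple(labels))
    # Dicts keep insertion order, so the first copy of each text survives.
    return [(text, labels) for text, labels in seen.items() if text not in conflicts]


def augment(text, rng):
    """Swap unprotected words for a synonym; protected keywords stay untouched."""
    out = []
    for word in text.split():
        if word not in PROTECTED and word in SYNONYMS:
            word = rng.choice(SYNONYMS[word])
        out.append(word)
    return " ".join(out)


rows = [
    ("// Remove the item", (0, 0, 0, 0, 1)),
    ("// remove the item", (0, 0, 0, 0, 1)),  # duplicate after normalization
    ("# return the cache", (1, 0, 0, 0, 0)),
    ("# Return the cache", (0, 1, 0, 0, 0)),  # same text, conflicting labels
    ("# todo", (0, 0, 1, 0, 0)),              # fewer than 2 tokens
]
cleaned = clean(rows)
print(cleaned)
print(augment("remove and return the item", random.Random(0)))
```

In this sketch the word "return" has a synonym entry but is never replaced, because the protected list takes precedence over the synonym lookup.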
Training Hyperparameters
- Model: microsoft/graphcodebert-base.
- Optimizer: AdamW (adamw_torch).
- Learning Rate: 3e-05.
- Batch Size: 32 (Train), 64 (Eval).
- Epochs: 20 (with Early Stopping patience of 3).
- Precision: Mixed Precision (fp16) enabled for CUDA devices.
- Max Length: 128 tokens.
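For readers reproducing the setup, the values above could be collected as shown below. The keys are the Hugging Face `TrainingArguments` argument names these settings are assumed to correspond to (the `adamw_torch` value suggests the `Trainer` API, but the card does not show the actual training script).

```python
# Hyperparameters from this card, keyed by the TrainingArguments
# argument names they are assumed to map to.
training_args = {
    "learning_rate": 3e-5,
    "per_device_train_batch_size": 32,
    "per_device_eval_batch_size": 64,
    "num_train_epochs": 20,
    "optim": "adamw_torch",
    "fp16": True,  # the card enables this only on CUDA devices
}

# Settings that usually live outside TrainingArguments in a Trainer setup.
early_stopping_patience = 3  # e.g. passed to EarlyStoppingCallback
max_length = 128             # tokenizer truncation length
base_model = "microsoft/graphcodebert-base"
```

With early stopping at a patience of 3, training may end well before the 20-epoch limit.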
Evaluation
Testing Data, Factors & Metrics
Testing Data
The testing data corresponds to the official test split of the NLBSE'26 dataset. Critically, no augmentation was applied to the test set to ensure evaluation against realistic, unmodified data distributions.
Factors
Evaluation is performed separately for each target programming language, as the label definitions and counts differ between languages.
Metrics
- F1 Score: The primary metric used for checkpoint selection.
- Accuracy: Subset accuracy (a prediction counts as correct only if every label of the comment matches).
- Decision Threshold: A sigmoid threshold of 0.5 is used for binary predictions.
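The metrics above can be illustrated with a small pure-Python sketch. The function names and the example logits are made up for illustration, and treating a probability of exactly 0.5 as positive is an assumption; the card does not say whether the comparison is strict.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(logits, threshold=0.5):
    """Turn one row of raw logits into a 0/1 vector."""
    return [1 if sigmoid(z) >= threshold else 0 for z in logits]


def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose full label vector is predicted exactly."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return hits / len(y_true)


def f1_for_label(y_true, y_pred, index):
    """F1 for a single label column; 0.0 when the label never occurs."""
    pairs = [(t[index], p[index]) for t, p in zip(y_true, y_pred)]
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Made-up logits for two comments, five categories each.
logits = [[2.0, -1.0, -3.0, 0.5, 1.2], [-0.4, 1.5, -2.0, -0.1, -0.7]]
y_true = [[1, 0, 0, 0, 1], [0, 1, 0, 0, 0]]
y_pred = [predict(row) for row in logits]

print(y_pred)                           # [[1, 0, 0, 1, 1], [0, 1, 0, 0, 0]]
print(subset_accuracy(y_true, y_pred))  # 0.5
```

The first comment gets one extra label (index 3), so it counts as wrong under subset accuracy even though four of its five labels are correct. This strictness is why subset accuracy tends to be lower than per-label scores.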
Results
| Label | Category Name | Precision | Recall | F1-Score |
|---|---|---|---|---|
| 0 | Usage | 0.72 | 0.71 | 0.72 |
| 1 | Parameters | 0.78 | 0.71 | 0.74 |
| 2 | Development Notes | 0.67 | 0.12 | 0.21 |
| 3 | Expand | 0.42 | 0.49 | 0.45 |
| 4 | Summary | 0.80 | 0.64 | 0.71 |
| | Weighted Avg | 0.70 | 0.60 | 0.63 |
Python Test Metrics:
- Accuracy: 0.5138
- F1-Score: 0.5666
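The F1-Score of 0.5666 differs from the weighted average of 0.63 in the table. One possible reading, which the card does not confirm, is that it is a macro (unweighted) average over the five categories. The check below uses the rounded per-class values from the table:

```python
# Per-class F1 values copied from the Results table in this card.
per_class_f1 = {
    "Usage": 0.72,
    "Parameters": 0.74,
    "Development Notes": 0.21,
    "Expand": 0.45,
    "Summary": 0.71,
}

# Macro average: every category counts equally, regardless of its frequency.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(f"{macro_f1:.3f}")  # 0.566
```

The result agrees with 0.5666 to within the rounding of the table entries. Under this reading, the low score for Development Notes (0.21) pulls the macro average well below the weighted one.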
Efficiency & Computational Cost
- Average Runtime: 1.770s
- Computational Cost: 14,781 GFLOPs
Environmental Impact
- Hardware Type: The code is designed to run on NVIDIA GPUs (CUDA) to leverage hardware acceleration.
- Energy Optimization: Mixed Precision training (fp16) was enabled to reduce memory usage and energy consumption on compatible hardware.
Citation
BibTeX:
@inproceedings{nlbse26,
title={NLBSE'26 Tool Competition: Code Comment Classification},
author={NLBSE Organizers},
booktitle={Proceedings of the 2026 International Workshop on Natural Language-based Software Engineering},
year={2026}
}
Model Card Authors
Turing Group (SE4AI Course)
Model Card Contact
GitHub: https://github.com/se4ai2526-uniba/Turing