Model Card for GraphCodeBERT-Comment-Classification: Python

A fine-tuned GraphCodeBERT model for multi-label classification of source code comments. It identifies comment types (e.g., Summary, Usage, Parameters, ...) in Python code.

Model Details

Model Description

This model is a fine-tuned version of microsoft/graphcodebert-base tailored for classifying code comments. It uses a Transformer-based architecture (AutoModelForSequenceClassification) to perform multi-label classification, meaning a single comment can be assigned several categories at once.

The model classifies comments into 5 categories: Usage, Parameters, Development Notes, Expand, and Summary.

Model info:

  • Developed by: Turing group
  • Model type: Multi-label classification
  • Language(s) (NLP): English (code comments)
  • Finetuned from model: microsoft/graphcodebert-base

Model Sources

Uses

Direct Use

The model is intended for the automatic analysis of source code repositories. It can be used to categorize comments to help developers understand codebases. The model takes a string (comment) as input and outputs a binary vector representing the active categories.
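As a rough illustration of the input/output contract described above, the sketch below turns raw logits into the binary vector and the matching category names. The label order is taken from the results table in this card. The `model_id` argument, the helper names, and the `>=` comparison at the threshold are assumptions for illustration, not part of the released code.

```python
import math

# Label order as listed in the results table of this card.
LABELS = ["Usage", "Parameters", "Development Notes", "Expand", "Summary"]


def logits_to_labels(logits, threshold=0.5):
    """Apply a sigmoid to each logit and threshold it independently."""
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    vector = [1 if p >= threshold else 0 for p in probs]
    names = [name for name, bit in zip(LABELS, vector) if bit]
    return vector, names


def classify(comment, model_id):
    """Run one comment through the model.

    `model_id` is a placeholder: pass the Hub ID or local path of the
    fine-tuned checkpoint, which this card does not state.
    """
    # Imported lazily so the helper above works without these packages.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()
    inputs = tokenizer(comment, truncation=True, max_length=128, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    return logits_to_labels(logits)
```

Because each category is thresholded on its own, the returned vector may contain zero, one, or several active categories.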

Downstream Use

This model is designed to be integrated into larger software engineering ecosystems. Below are specific downstream tasks enabled by the model's multi-label classification capabilities:

  • IDE Plugin Integration: Filtering and organizing comments within editors.
  • Maintenance Search Tools: Filtering codebases by comment category.
  • Automated Documentation Auditing: Enhancing code review workflows with intelligent comment analysis.
  • Semantic Code Navigation: Enabling advanced search capabilities beyond simple keyword matching.
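One way such an integration could look is sketched below: comments are pulled out of a Python source string with the standard `tokenize` module and then filtered by predicted category. The function names are hypothetical, and `classify` stands for any callable that maps a comment string to a collection of category names (for example, a thin wrapper around this model).

```python
import io
import tokenize


def extract_comments(source):
    """Return (line_number, text) for every '#' comment in Python source."""
    comments = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            comments.append((tok.start[0], tok.string.lstrip("#").strip()))
    return comments


def filter_by_category(source, category, classify):
    """Keep only the comments whose predicted categories include `category`."""
    return [
        (line, text)
        for line, text in extract_comments(source)
        if category in classify(text)
    ]
```

Note that this sketch only sees `#` comments. Docstrings are string literals, so a tool that needs them would have to collect them separately.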

Out-of-Scope Use

This model is not designed for code generation or completion. It should not be used for general-purpose natural language tasks unrelated to software engineering contexts (e.g., sentiment analysis of social media).

Bias, Risks, and Limitations

The model is trained on the NLBSE'26 Code Comment Classification dataset, which consists of comments extracted from open-source repositories. Consequently, the model may inherit biases present in the source code of these projects, such as:

  • Domain Bias: A bias towards specific programming domains or styles prevalent in the collected repositories.
  • Label Noise: Potential inconsistencies in the ground truth labels inherent to crowd-sourced or automatically mined datasets.

Recommendations

Users should validate the model's predictions on their specific codebase, as comment conventions vary significantly between projects. It is recommended to manually audit a sample of predictions before integrating the model into automated decision-making pipelines.

Training Details

Training Data

The model utilizes the NLBSE'26 Code Comment Classification dataset (hosted on Hugging Face).

For detailed information about the dataset, see the Dataset Card.

  • Source: The dataset consists of source code comments extracted from various open-source repositories.
  • Composition: It contains labeled data for multi-label classification, covering multiple programming languages (Java, Python, Pharo).
  • Data Quality: The training set is a curated subset of the original data. It has been processed to remove duplicates, ambiguous labels, and non-informative comments (noise) to ensure high-quality supervision.
  • Class Balance: Minority classes are augmented (see Preprocessing) to mitigate class imbalance, so that under-represented categories have more training examples.

Training Procedure

Preprocessing

The training pipeline implements a "Smart Cleaning" and "Safe Augmentation" strategy to prepare the data before it reaches the model:

1. Data Cleaning & Filtering: Prior to tokenization, the raw text undergoes heuristic filtering:

  • Deduplication: Removal of exact duplicates and semantic conflicts (identical text with different labels).
  • Noise Reduction: Comments are filtered out if they are too short (< 2 tokens), too long, or consist primarily of code symbols rather than natural language.
  • Normalization: Text is converted to lowercase and comment markers (e.g., //, /*) are stripped.

2. Data Augmentation: To handle imbalance, a Safe Augmentation technique is applied to minority classes:

  • Method: Synonym replacement via WordNet and random case injection.
  • Safety: A strict "protected list" of code keywords (e.g., return, if, void) is used to prevent the augmentation process from corrupting the semantic logic of code snippets.

3. Model Input Processing:

  • Tokenization: Uses the AutoTokenizer for graphcodebert-base.
  • Encoding: Labels are transformed into multi-hot (binary indicator) vectors, with one position per category, to support multi-label classification.
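The three steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the project's actual pipeline: the function names, the `#` marker, the `max_tokens` limit, the 0.5 symbol ratio, the choice to drop every copy of a conflicting comment, and the `synonyms` mapping (a stand-in for WordNet lookups) are all hypothetical.

```python
import random
import re

LABELS = ["Usage", "Parameters", "Development Notes", "Expand", "Summary"]
# The card names return, if, void as examples; the full list is not given.
PROTECTED = {"return", "if", "void"}
# "//" and "/*" come from the card; "#" is an assumption for Python comments.
LEADING = re.compile(r"^\s*(?://+|/\*+|#+)\s*")
TRAILING = re.compile(r"\s*\*+/\s*$")


def normalize(text):
    """Strip comment markers and lowercase the text."""
    return TRAILING.sub("", LEADING.sub("", text)).strip().lower()


def is_noise(text, min_tokens=2, max_tokens=128, min_text_ratio=0.5):
    """Flag comments that are too short, too long, or mostly symbols."""
    tokens = text.split()
    if not (min_tokens <= len(tokens) <= max_tokens):
        return True
    wordlike = sum(ch.isalpha() or ch.isspace() for ch in text)
    return wordlike / len(text) < min_text_ratio


def clean(rows):
    """rows: iterable of (comment, labels). Drops noise, duplicates, conflicts."""
    seen, conflicts = {}, set()
    for comment, labels in rows:
        text = normalize(comment)
        if is_noise(text):
            continue
        labels = tuple(labels)
        if text in seen and seen[text] != labels:
            conflicts.add(text)  # identical text with different labels
        seen.setdefault(text, labels)
    return [(t, l) for t, l in seen.items() if t not in conflicts]


def augment(text, rng, synonyms=None, prob=0.3):
    """Synonym replacement and case injection that skip protected keywords."""
    synonyms = synonyms or {}
    out = []
    for word in text.split():
        if word.lower() in PROTECTED:
            out.append(word)  # never touch code keywords
            continue
        if word in synonyms and rng.random() < prob:
            word = rng.choice(synonyms[word])
        if rng.random() < prob:
            word = word.capitalize()
        out.append(word)
    return " ".join(out)


def encode_labels(active, label_names=LABELS):
    """Multi-hot vector: 1.0 where the category applies, 0.0 elsewhere."""
    return [1.0 if name in active else 0.0 for name in label_names]
```

The label vector uses floats because the multi-label loss in `transformers` (binary cross-entropy with logits) expects float targets. Passing a seeded `random.Random` instance as `rng` keeps the augmentation reproducible.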

Training Hyperparameters

  • Model: microsoft/graphcodebert-base.
  • Optimizer: AdamW (adamw_torch).
  • Learning Rate: 3e-05.
  • Batch Size: 32 (Train), 64 (Eval).
  • Epochs: 20 (with Early Stopping patience of 3).
  • Precision: Mixed Precision (fp16) enabled for CUDA devices.
  • Max Length: 128 tokens.
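For readers who want to reproduce a similar setup with the Hugging Face `Trainer`, the hyperparameters above map onto `TrainingArguments` roughly as sketched below. Several details are assumptions rather than facts from this card: that the batch sizes are per-device, the `output_dir` value, the function names, and the use of `problem_type="multi_label_classification"`.

```python
MAX_LENGTH = 128             # passed to the tokenizer as max_length
EARLY_STOPPING_PATIENCE = 3  # passed to EarlyStoppingCallback


def build_training_kwargs(use_cuda, output_dir="graphcodebert-comments-python"):
    """Keyword arguments for transformers.TrainingArguments."""
    return {
        "output_dir": output_dir,           # hypothetical path
        "optim": "adamw_torch",
        "learning_rate": 3e-5,
        "per_device_train_batch_size": 32,  # assumes batch size is per device
        "per_device_eval_batch_size": 64,
        "num_train_epochs": 20,
        "fp16": bool(use_cuda),             # mixed precision only on CUDA
    }


def build_trainer_parts(num_labels=5):
    """Create tokenizer, model, arguments, and callbacks (downloads weights)."""
    # Imported lazily so the kwargs helper works without these packages.
    import torch
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        EarlyStoppingCallback,
        TrainingArguments,
    )

    base = "microsoft/graphcodebert-base"
    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForSequenceClassification.from_pretrained(
        base,
        num_labels=num_labels,
        problem_type="multi_label_classification",  # assumption, see lead-in
    )
    args = TrainingArguments(**build_training_kwargs(torch.cuda.is_available()))
    callbacks = [EarlyStoppingCallback(early_stopping_patience=EARLY_STOPPING_PATIENCE)]
    return tokenizer, model, args, callbacks
```

Early stopping in `Trainer` additionally needs periodic evaluation, `load_best_model_at_end=True`, and a `metric_for_best_model`. These are left out of the sketch because the evaluation-strategy argument is named differently across `transformers` versions.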

Evaluation

Testing Data, Factors & Metrics

Testing Data

The testing data corresponds to the official test split of the NLBSE'26 dataset. Critically, no augmentation was applied to the test set to ensure evaluation against realistic, unmodified data distributions.

Factors

Evaluation is performed separately for each target programming language, as the label definitions and counts differ between languages.

Metrics

  • F1 Score: The primary metric used for checkpoint selection.
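  • Accuracy: Subset (exact-match) accuracy. A prediction counts as correct only if every category of the comment is predicted correctly.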
  • Decision Threshold: A sigmoid threshold of 0.5 is used for binary predictions.
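To make the two metrics concrete, here is a small self-contained sketch that computes per-category F1 and subset accuracy from binary label vectors. The function names are illustrative, and the actual evaluation code may rely on a library such as scikit-learn instead.

```python
def f1_per_class(y_true, y_pred):
    """F1 for each category, computed from binary label vectors."""
    scores = []
    for c in range(len(y_true[0])):
        pairs = [(t[c], p[c]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose whole label vector is predicted exactly."""
    exact = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return exact / len(y_true)
```

Subset accuracy is the stricter of the two: a comment with one wrong category out of five counts as a full miss, while the per-category F1 scores still credit the four correct decisions.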

Results

Label  Category Name      Precision  Recall  F1-Score
0      Usage              0.72       0.71    0.72
1      Parameters         0.78       0.71    0.74
2      Development Notes  0.67       0.12    0.21
3      Expand             0.42       0.49    0.45
4      Summary            0.80       0.64    0.71
       Weighted Avg       0.70       0.60    0.63

Python Test Metrics:

  • Accuracy: 0.5138
  • F1-Score: 0.5666
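The card does not say how the overall F1-Score of 0.5666 is averaged. A quick cross-check against the per-category table suggests it is a macro (unweighted) average, since the mean of the five rounded F1 values lands very close to it. This is an inference from the reported numbers, not a statement from the authors.

```python
# Per-category F1 scores copied from the results table.
f1_by_category = {
    "Usage": 0.72,
    "Parameters": 0.74,
    "Development Notes": 0.21,
    "Expand": 0.45,
    "Summary": 0.71,
}

# Macro average: every category counts equally, regardless of its frequency.
macro_f1 = sum(f1_by_category.values()) / len(f1_by_category)
print(round(macro_f1, 3))  # 0.566
```

The weighted average in the table (0.63) is higher because it weights each category by its number of test examples, which reduces the influence of the weak Development Notes score.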

Efficiency & Computational Cost

  • Average Runtime: 1.770s
  • Computational Cost: 14,781 GFLOPs

Environmental Impact

  • Hardware Type: The code is designed to run on NVIDIA GPUs (CUDA) to leverage hardware acceleration.
  • Energy Optimization: Mixed Precision training (fp16) was enabled to reduce memory usage and energy consumption on compatible hardware.

Citation

BibTeX:

@inproceedings{nlbse26,
  title={NLBSE'26 Tool Competition: Code Comment Classification},
  author={NLBSE Organizers},
  booktitle={Proceedings of the 2026 International Workshop on Natural Language-based Software Engineering},
  year={2026}
}

Model Card Authors

Turing Group (SE4AI Course)

Model Card Contact

GitHub: https://github.com/se4ai2526-uniba/Turing