NLBSE'26 Code Comment Classification Dataset
Dataset Description
Purpose and Overview
This dataset is provided for the NLBSE 2026 Tool Competition. The objective is to classify source code comments into specific software engineering categories based on their semantic content. The problem is formulated as a Multi-Label Text Classification task, where a single comment may belong to one or more categories simultaneously.
Dataset: https://huggingface.co/datasets/NLBSE/nlbse26-code-comment-classification
The dataset contains 1,733 manually labeled class comments decomposed into 9,361 sentences extracted from 20 open-source projects. It covers three distinct programming languages: Java, Python, and Pharo.
For each sentence, the dataset includes:
- The natural language text of the comment
- The associated class name from the source code
- The sentence combined with its class name (in the combo field)
- Manual annotations validating the information type
A total of 487 sentences carry more than one label.
Supported Tasks
- Multi-Label Classification: Predicting the correct set of tags for a given comment
Dataset Structure
Training and Test Splits
The dataset is divided into six splits, providing a training and testing set for each language:
- Java: java_train, java_test
- Python: python_train, python_test
- Pharo: pharo_train, pharo_test
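Each split can be requested by name. The sketch below assumes the Hugging Face `datasets` library is installed and that the repository exposes the six splits under exactly the names listed above. The helpers `split_names` and `load_language` are illustrative names, not part of the dataset or of any library.

```python
REPO_ID = "NLBSE/nlbse26-code-comment-classification"
LANGUAGES = ("java", "python", "pharo")


def split_names(language):
    """Return the (train, test) split names for one language."""
    if language not in LANGUAGES:
        raise ValueError(f"unknown language: {language!r}")
    return f"{language}_train", f"{language}_test"


def load_language(language):
    """Download and return the (train, test) splits for one language.

    Assumption: the split names match the list above. Requires network
    access and the third-party `datasets` package.
    """
    from datasets import load_dataset  # imported lazily on purpose

    train_name, test_name = split_names(language)
    train = load_dataset(REPO_ID, split=train_name)
    test = load_dataset(REPO_ID, split=test_name)
    return train, test
```

Calling `load_language("java")` would then return the Java training and testing splits. The import is kept inside the function so that `split_names` stays usable without the library installed.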
Data Instances
An example of a data instance is as follows:
{
"index": 1,
"comment_sentence": "this impl delegates to the old filesystem",
"combo": "this impl delegates to the old filesystem | Abfss.java",
"class": "Abfss.java",
"labels": "[0, 0, 1, 0, 0, 0, 0]",
"partition": 0
}
Data Fields
- comment_sentence: the actual sentence string, which is part of a (multi-line) class comment
- combo: the sentence string followed by the class name, separated by " | "; used to train the baselines
- partition: the split the instance belongs to; 0 identifies training instances and 1 identifies testing instances
- labels: the ground-truth categories, given as a binary list in which a 1 at position i marks membership in the category with index i. Each sample can belong to one or more categories
- class: the class name referring to the source code file where the sentence comes from
- index: unique identifier for the sample within the dataset
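In the example instance the labels value is a string rather than a list, so it may need parsing before use. The following is a minimal sketch, assuming the string is always a Python-style list literal as in that example. `parse_labels` is a hypothetical helper, and the record is a trimmed copy of the example instance.

```python
import ast


def parse_labels(raw):
    """Convert a labels value into a list of 0/1 integers.

    Accepts either a string such as "[0, 0, 1, 0, 0, 0, 0]" or an
    already-parsed sequence of numbers.
    """
    values = ast.literal_eval(raw) if isinstance(raw, str) else list(raw)
    labels = [int(v) for v in values]
    if any(v not in (0, 1) for v in labels):
        raise ValueError(f"labels must be binary, got {labels}")
    return labels


# Trimmed copy of the example instance shown earlier.
record = {
    "comment_sentence": "this impl delegates to the old filesystem",
    "class": "Abfss.java",
    "labels": "[0, 0, 1, 0, 0, 0, 0]",
}
vector = parse_labels(record["labels"])
print(vector)  # [0, 0, 1, 0, 0, 0, 0]
```

`ast.literal_eval` is used instead of `eval` because it only accepts literals and cannot execute arbitrary code.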
Label Taxonomy
Each programming language has its own set of categories. Position i of the labels vector in a data instance corresponds to index i in the lists below.
Java (7 Classes)
The Java dataset contains 6,595 sentences distributed across the following categories:
- Index 0 - Summary: High-level description of the method or class
- Index 1 - Ownership: License, author, or copyright information
- Index 2 - Expand: Detailed explanation or elaboration
- Index 3 - Usage: Examples or instructions on how to use the code
- Index 4 - Pointer: References to other parts of the code or external docs
- Index 5 - Deprecation: Information about deprecated methods
- Index 6 - Rational: Design rationale or reasons for implementation choices
Python (5 Classes)
The Python dataset contains 1,658 sentences distributed across the following categories:
- Index 0 - Usage: How to use the function or class
- Index 1 - Parameters: Description of arguments and parameters
- Index 2 - DevelopmentNotes: Notes for developers, TODOs, or fixmes
- Index 3 - Expand: Detailed descriptions
- Index 4 - Summary: Brief overview
Pharo (6 Classes)
The Pharo dataset contains 1,108 sentences distributed across the following categories:
- Index 0 - Keyimplementationpoints: Critical implementation details
- Index 1 - Example: Code examples
- Index 2 - Responsibilities: What the class or method is responsible for
- Index 3 - Intent: The purpose of the code
- Index 4 - Keymessages: Important messages sent or received
- Index 5 - Collaborators: Other classes this code interacts with
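The index-to-category mapping above can be written as a lookup table for turning a binary vector back into category names. This is a sketch: the names are copied from the lists above, while `LABEL_NAMES` and `decode_labels` are illustrative names that do not ship with the dataset.

```python
# Category names per language, in the index order listed above.
LABEL_NAMES = {
    "java": ["Summary", "Ownership", "Expand", "Usage",
             "Pointer", "Deprecation", "Rational"],
    "python": ["Usage", "Parameters", "DevelopmentNotes",
               "Expand", "Summary"],
    "pharo": ["Keyimplementationpoints", "Example", "Responsibilities",
              "Intent", "Keymessages", "Collaborators"],
}


def decode_labels(language, vector):
    """Return the category names whose position in `vector` is 1."""
    names = LABEL_NAMES[language]
    if len(vector) != len(names):
        raise ValueError(
            f"{language} expects {len(names)} labels, got {len(vector)}"
        )
    return [name for name, flag in zip(names, vector) if flag == 1]


print(decode_labels("java", [0, 0, 1, 0, 0, 0, 0]))  # ['Expand']
```

The length check guards against decoding a vector with the wrong language's taxonomy, which would otherwise silently yield wrong names.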
Data Statistics and Quality
The quality analysis of the dataset was conducted on the raw data using Deepchecks, a framework for validating datasets and detecting anomalies in machine learning pipelines. This analysis helped identify potential issues such as class imbalance, duplicate entries, conflicting labels, and distributional inconsistencies before preprocessing. It also provided insight into the variability of text lengths, comment structures, and multi-label occurrences.
Class Distribution
The dataset is highly imbalanced across all languages:
- Java: Summary and Usage are the majority classes. Deprecation and Rational are rare minority classes
- Python: Usage dominates the distribution
- Pharo: Example is the most frequent class
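Imbalance of this kind can be checked by summing each column of the label vectors. The sketch below is not the Deepchecks analysis described above but a plain-Python approximation. The toy vectors are made up for illustration and are not taken from the dataset, and `class_counts` and `multi_label_count` are hypothetical helper names.

```python
def class_counts(label_vectors):
    """Return the number of positive samples for each label index."""
    vectors = list(label_vectors)
    if not vectors:
        return []
    width = len(vectors[0])
    counts = [0] * width
    for vector in vectors:
        if len(vector) != width:
            raise ValueError("all label vectors must have the same length")
        for i, flag in enumerate(vector):
            counts[i] += flag
    return counts


def multi_label_count(label_vectors):
    """Return how many samples carry more than one label."""
    return sum(1 for vector in label_vectors if sum(vector) > 1)


# Made-up vectors with five labels each (illustration only).
toy = [
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0],
    [1, 0, 0, 1, 0],
]
print(class_counts(toy))       # [3, 1, 0, 1, 1]
print(multi_label_count(toy))  # 2
```

Dividing each count by the number of samples gives per-class frequencies, which are easier to compare across the three languages.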
Known Issues
- Duplicate Entries: The training set contains redundant comments
- Conflicting Labels: A small percentage of identical comments may map to different labels
- Code/Text Mix: Some comment_sentence entries may consist primarily of commented-out code rather than natural language description
- Length Outliers: Text lengths range from a single token to more than 800 characters
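Duplicate entries and conflicting labels can be detected by grouping records on the sentence text. The following is a minimal sketch, assuming labels have already been parsed into lists and that exact string equality is an acceptable definition of "identical". The records are invented for illustration, and `find_duplicates_and_conflicts` is a hypothetical helper.

```python
from collections import defaultdict


def find_duplicates_and_conflicts(records):
    """Group records by sentence text.

    Returns (duplicates, conflicts): the sentences that occur more than
    once, and the subset that occurs with more than one label vector.
    """
    seen = defaultdict(list)
    for record in records:
        seen[record["comment_sentence"]].append(tuple(record["labels"]))

    duplicates = {text for text, labels in seen.items() if len(labels) > 1}
    conflicts = {text for text, labels in seen.items() if len(set(labels)) > 1}
    return duplicates, conflicts


# Invented records with two labels each (illustration only).
toy = [
    {"comment_sentence": "returns the value", "labels": [1, 0]},
    {"comment_sentence": "returns the value", "labels": [1, 0]},
    {"comment_sentence": "see the other class", "labels": [0, 1]},
    {"comment_sentence": "see the other class", "labels": [1, 0]},
    {"comment_sentence": "used once", "labels": [0, 1]},
]
duplicates, conflicts = find_duplicates_and_conflicts(toy)
print(sorted(duplicates))  # ['returns the value', 'see the other class']
print(sorted(conflicts))   # ['see the other class']
```

Whether to drop duplicates, merge conflicting labels, or keep both is a modelling decision that this sketch leaves open.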
Citation
@inproceedings{nlbse26,
title={NLBSE'26 Tool Competition: Code Comment Classification},
author={NLBSE Organizers},
booktitle={Proceedings of the 2026 International Workshop on Natural Language-based Software Engineering},
year={2026}
}
Dataset Card Authors
Turing Group (SE4AI Course)