Data

The NLBSE Code Comment Classification Dataset is a multi-label text classification dataset of 9,361 English code comment sentences covering three programming languages (Java, Python, and Pharo). Each sentence is annotated with semantic categories describing the comment’s purpose. The data is split into six CSV files, one train file and one test file per language. Each language has its own label taxonomy (7 labels for Java, 5 for Python, 6 for Pharo), and labels are encoded as multi-hot vectors. The dataset is intended for research on automatic code comment classification and software documentation analysis, and is released under the MIT license.
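
To illustrate the multi-hot encoding, the following minimal sketch decodes a label vector into label names. The label names here (`label_0`, `label_1`, ...) are hypothetical placeholders, as are the helper function names; only the per-language label counts come from the description above. The actual label names and column layout are documented with the data itself.

```python
from typing import Dict, List, Sequence

# Label counts per language, as stated in the dataset description.
LABEL_COUNTS: Dict[str, int] = {"java": 7, "python": 5, "pharo": 6}


def placeholder_labels(language: str) -> List[str]:
    """Return hypothetical label names (label_0, label_1, ...) for a language."""
    return [f"label_{i}" for i in range(LABEL_COUNTS[language])]


def decode_multi_hot(vector: Sequence[int], labels: Sequence[str]) -> List[str]:
    """Map a multi-hot vector to the names of the labels whose entry is 1."""
    if len(vector) != len(labels):
        raise ValueError(f"expected {len(labels)} entries, got {len(vector)}")
    return [name for bit, name in zip(vector, labels) if bit == 1]


# A Python comment sentence tagged with the first and fourth categories.
python_labels = placeholder_labels("python")
active = decode_multi_hot([1, 0, 0, 1, 0], python_labels)
print(active)
```

Because the task is multi-label, a vector may have more than one entry set to 1, so the decoder returns a list rather than a single name.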

For additional information, see data/README.md.