Models Overview

This project provides transformer-based models for multi-label classification of code comments. Each target programming language has its own fine-tuned model: CodeBERTa for Java, and GraphCodeBERT for Python and Pharo. A RandomForest-TFIDF baseline is included for comparison.

Model Architecture

Baseline Model

RandomForest-TFIDF

The RandomForest-TFIDF model combines:

TF-IDF (Term Frequency–Inverse Document Frequency) for converting text into numerical features
Random Forest Classifier, an ensemble of decision trees trained on these features

Key Characteristics:

Simple and interpretable baseline
Works well for keyword-driven text classification
Fast to train with minimal computational resources
Used mainly as a benchmark to measure the improvement offered by the transformer models
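
The combination described above can be sketched with scikit-learn. This is a minimal illustration, not the project's training code: the toy comments, the label names, and the hyperparameters (`ngram_range`, `n_estimators`) are all assumptions made for the example. Labels are encoded as a binary indicator matrix, one column per label, which `RandomForestClassifier` accepts directly for multi-label targets.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; the real label sets are defined in the model cards.
comments = [
    "Returns the sum of two integers.",
    "TODO: remove this class in the next release.",
    "Usage: call init() before calling run().",
    "Returns the user name. Deprecated, use getLogin() instead.",
]
label_names = ["summary", "deprecation", "usage"]  # illustrative names only

# One row per comment, one column per label (1 = label applies).
y = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
])

# TF-IDF turns each comment into a sparse feature vector;
# the random forest is then trained on those vectors.
baseline = make_pipeline(
    TfidfVectorizer(lowercase=True, ngram_range=(1, 2)),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
baseline.fit(comments, y)

# predict() returns one 0/1 row per input comment.
predictions = baseline.predict(["Returns the number of items."])
```

Each row of `predictions` can be mapped back to label names by pairing its columns with `label_names`.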

Transformer-Based Models

CodeBERTa

Primary model for Java code comment classification

Architecture: Fine-tuned from huggingface/CodeBERTa-small-v1
Task: Multi-label classification
Capabilities:
Captures semantic context and relationships beyond simple keywords
Provides higher accuracy and F1 scores compared to classical models
Ideal for production use in scenarios requiring deep understanding of code comments

Details: See Java Model Card
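
As a rough sketch of what multi-label classification means for a model like this, the snippet below loads the base checkpoint named above with a 7-label head and shows how raw logits become a set of labels. It assumes the Hugging Face `transformers` library; the function names and the 0.5 threshold are illustrative choices, and the classification head is randomly initialised until it is fine-tuned. The published fine-tuned weights are not loaded here because their hosting location is given in the Java Model Card, not in this overview.

```python
import math

BASE_CHECKPOINT = "huggingface/CodeBERTa-small-v1"  # base model named in this document
NUM_LABELS = 7  # Java label count from this document


def load_classifier():
    """Load the base checkpoint with a fresh multi-label classification head.

    Imports are local so the decoding helper below can be used without
    transformers installed. Calling this downloads the checkpoint.
    """
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(BASE_CHECKPOINT)
    model = AutoModelForSequenceClassification.from_pretrained(
        BASE_CHECKPOINT,
        num_labels=NUM_LABELS,
        # Selects a per-label binary cross-entropy loss during training.
        problem_type="multi_label_classification",
    )
    return tokenizer, model


def decode_multilabel(logits, threshold=0.5):
    """Apply a sigmoid to each logit independently, then threshold to 0/1."""
    probabilities = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [1 if p >= threshold else 0 for p in probabilities]
```

Unlike single-label classification, no softmax is taken across labels: each label is decided on its own, so a comment can receive several labels or none.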

GraphCodeBERT

Primary model for Python and Pharo code comment classification

Architecture: Fine-tuned from GraphCodeBERT-small, a graph-based code representation model
Task: Multi-label classification
Advantages:
Captures code structure and semantic relationships through a graph-based code representation
Better performance on structural code patterns
Improved context understanding for domain-specific programming languages

Details: See Python Model Card and Pharo Model Card

Language-Specific Models

Java (CodeBERTa)

Model: CodeBERTa-Comment-Classification-Java
Classes: 7 label categories
Base Model: huggingface/CodeBERTa-small-v1
Documentation: Java Model Card

Python (GraphCodeBERT)

Model: GraphCodeBERT-Comment-Classification-Python
Classes: 5 label categories
Base Model: GraphCodeBERT-small
Documentation: Python Model Card

Pharo (GraphCodeBERT)

Model: GraphCodeBERT-Comment-Classification-Pharo
Classes: 6 label categories
Base Model: GraphCodeBERT-small
Documentation: Pharo Model Card
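
One way to use the per-language settings listed above in code is a small lookup table. The sketch below is an assumption about how such a registry might look, not part of the project: `ModelSpec`, `MODEL_REGISTRY`, and `get_model_spec` are hypothetical names, while the model names, base models, and label counts are copied from this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    """Settings for one language-specific classifier."""
    model_name: str
    base_model: str
    num_labels: int


# Values taken from the Language-Specific Models listing.
MODEL_REGISTRY = {
    "java": ModelSpec(
        "CodeBERTa-Comment-Classification-Java", "huggingface/CodeBERTa-small-v1", 7
    ),
    "python": ModelSpec(
        "GraphCodeBERT-Comment-Classification-Python", "GraphCodeBERT-small", 5
    ),
    "pharo": ModelSpec(
        "GraphCodeBERT-Comment-Classification-Pharo", "GraphCodeBERT-small", 6
    ),
}


def get_model_spec(language: str) -> ModelSpec:
    """Return the spec for a language, ignoring case and surrounding spaces."""
    key = language.strip().lower()
    if key not in MODEL_REGISTRY:
        supported = ", ".join(sorted(MODEL_REGISTRY))
        raise ValueError(
            f"Unsupported language {language!r}; expected one of: {supported}"
        )
    return MODEL_REGISTRY[key]
```

For example, `get_model_spec("Java").num_labels` gives the size of the output layer needed for the Java classifier.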

Model Comparison

Model	Type	Languages	Labels	Use Case
CodeBERTa	Transformer	Java	7	Production classification
GraphCodeBERT	Transformer (graph-based)	Python, Pharo	5 (Python), 6 (Pharo)	Advanced code understanding
RandomForest-TFIDF	Classical	All	N/A	Baseline comparison

For detailed information about each model, see the individual model cards: Java Model Card, Python Model Card, and Pharo Model Card.