
Models Overview

This project implements transformer-based models for multi-label classification of code comments. Each supported language has its own fine-tuned model: CodeBERTa for Java, and GraphCodeBERT for Python and Pharo. A RandomForest-TFIDF model serves as the baseline for comparison.


Model Architecture

Baseline Model

RandomForest-TFIDF

The RandomForest-TFIDF model combines:

  • TF-IDF (Term Frequency–Inverse Document Frequency) for converting text into numerical features
  • Random Forest Classifier, an ensemble of decision trees trained on these features

Key Characteristics:

  • Simple and interpretable baseline
  • Works well for keyword-driven text classification
  • Fast to train with minimal computational resources
  • Used mainly for benchmarking and comparison to evaluate improvements offered by transformer models
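
To make the feature-extraction half of the baseline concrete, here is a minimal sketch of TF-IDF weighting in plain Python. It is not the project's implementation: the whitespace tokenizer, the smoothed IDF variant, and the sample comments are all assumptions chosen for illustration. In practice the resulting vectors would be passed to a Random Forest classifier from a machine learning library.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) with one TF-IDF weight per vocabulary term."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(documents)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed inverse document frequency (assumed variant); rarer terms score higher.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocabulary}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        # Term frequency (relative count) multiplied by IDF.
        vectors.append([counts[t] / total * idf[t] for t in vocabulary])
    return vocabulary, vectors


# Hypothetical code comments, used only to exercise the function.
comments = [
    "returns the parsed config",
    "todo remove this deprecated method",
    "returns the user name",
]
vocabulary, vectors = tfidf_vectors(comments)

# Show the non-zero weights for the first comment.
for term, weight in zip(vocabulary, vectors[0]):
    if weight > 0:
        print(f"{term}: {weight:.3f}")
```

Terms that occur in several comments (such as "returns") receive a lower weight than terms unique to one comment (such as "parsed"), which is why this representation suits keyword-driven classification.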

Transformer-Based Models

CodeBERTa

Primary model for Java code comment classification

  • Architecture: Fine-tuned from huggingface/CodeBERTa-small-v1
  • Task: Multi-label classification
  • Capabilities:
      • Captures semantic context and relationships beyond simple keywords
      • Achieves higher accuracy and F1 scores than classical models
      • Suited to production use where a deep understanding of code comments is required

Details: See Java Model Card
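
In multi-label classification, each label is decided independently, so a comment can receive zero, one, or several labels. The sketch below shows one common way to decode a model's raw outputs under that setup: apply a sigmoid to each logit and keep the labels above a threshold. The label names, the 0.5 threshold, and the `decode_multilabel` helper are hypothetical and not taken from the project.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decode_multilabel(logits, label_names, threshold=0.5):
    """Return every label whose independent probability reaches the threshold."""
    if len(logits) != len(label_names):
        raise ValueError("expected one logit per label")
    probabilities = [sigmoid(z) for z in logits]
    return [name for name, p in zip(label_names, probabilities) if p >= threshold]


# Hypothetical label set; the real categories are listed in the model cards.
LABELS = ["summary", "usage", "deprecation"]

print(decode_multilabel([2.1, -0.3, 0.8], LABELS))
# ['summary', 'deprecation']
```

Unlike a softmax over all classes, the per-label sigmoid does not force the probabilities to sum to one, which is what allows several labels to be selected at once.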

GraphCodeBERT

Primary model for Python and Pharo code comment classification

  • Architecture: Fine-tuned from GraphCodeBERT-small, a graph-based code representation model
  • Task: Multi-label classification
  • Advantages:
      • Captures code structure and semantic relationships through a graph-based code representation
      • Performs better on structural code patterns
      • Improves context understanding for domain-specific programming languages

Details: See Python Model Card and Pharo Model Card


Language-Specific Models

Java (CodeBERTa)

  • Model: CodeBERTa-Comment-Classification-Java
  • Classes: 7 label categories
  • Base Model: huggingface/CodeBERTa-small-v1
  • Documentation: Java Model Card

Python (GraphCodeBERT)

  • Model: GraphCodeBERT-Comment-Classification-Python
  • Classes: 5 label categories
  • Base Model: GraphCodeBERT-small
  • Documentation: Python Model Card

Pharo (GraphCodeBERT)

  • Model: GraphCodeBERT-Comment-Classification-Pharo
  • Classes: 6 label categories
  • Base Model: GraphCodeBERT-small
  • Documentation: Pharo Model Card
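
The language-to-model mapping above can be expressed as a small lookup table. This is only a sketch of one way to organize it: the `ModelSpec` class, the `MODEL_REGISTRY` dictionary, and the `model_for` function are hypothetical names, while the model names, base models, and label counts are copied from the lists above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    """Static description of one language-specific classifier."""
    name: str
    base_model: str
    num_labels: int


# Values taken from the per-language lists in this section.
MODEL_REGISTRY = {
    "java": ModelSpec(
        "CodeBERTa-Comment-Classification-Java",
        "huggingface/CodeBERTa-small-v1",
        7,
    ),
    "python": ModelSpec(
        "GraphCodeBERT-Comment-Classification-Python",
        "GraphCodeBERT-small",
        5,
    ),
    "pharo": ModelSpec(
        "GraphCodeBERT-Comment-Classification-Pharo",
        "GraphCodeBERT-small",
        6,
    ),
}


def model_for(language):
    """Look up the model spec for a language name, ignoring case and padding."""
    key = language.strip().lower()
    try:
        return MODEL_REGISTRY[key]
    except KeyError:
        raise ValueError(f"no model registered for language {language!r}") from None


spec = model_for("Java")
print(spec.name, spec.num_labels)
```

Keeping the label count next to the model name makes it easy to check that a classification head is sized correctly for each language.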

Model Comparison

Model               Type         Languages      Labels                 Use Case
CodeBERTa           Transformer  Java           7                      Production classification
GraphCodeBERT       Graph-based  Python, Pharo  5 (Python), 6 (Pharo)  Advanced code understanding
RandomForest-TFIDF  Classical    All            N/A                    Baseline comparison
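
Comparing the transformer models against the baseline requires a metric that copes with several labels per comment. The sketch below computes a micro-averaged F1 score over multi-hot label vectors. Whether the project reports micro, macro, or per-label F1 is not stated here, so treat the choice of averaging, the `micro_f1` name, and the sample data as assumptions.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over multi-hot label vectors (lists of 0/1 values)."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t and p:
                tp += 1  # label correctly predicted
            elif p:
                fp += 1  # label predicted but not present
            elif t:
                fn += 1  # label present but missed
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical ground truth and predictions for two comments and three labels.
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1]]

print(f"micro-F1: {micro_f1(y_true, y_pred):.3f}")
# micro-F1: 0.667
```

Micro-averaging pools the counts across all labels before computing the score, so frequent labels influence the result more than rare ones.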

For detailed information about each model, see the individual model cards: