Models Overview
This project implements transformer models for multi-label code comment classification, each fine-tuned for its target programming language. CodeBERTa (Java) and GraphCodeBERT (Python and Pharo) are the primary models; a RandomForest-TFIDF model serves as the baseline.
Model Architecture
Baseline Model
RandomForest-TFIDF
The RandomForest-TFIDF model combines:
- TF-IDF (Term Frequency–Inverse Document Frequency) for converting text into numerical features
- Random Forest Classifier, an ensemble of decision trees trained on these features
Key Characteristics:
- Simple and interpretable baseline
- Works well for keyword-driven text classification
- Fast to train with minimal computational resources
- Used mainly for benchmarking and comparison to evaluate improvements offered by transformer models
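A baseline of this kind could be sketched with scikit-learn as shown below. This is a minimal illustration, not the project's actual training code: the toy comments, the label names in `LABELS`, and the hyperparameters (`ngram_range`, `n_estimators`) are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical label names and toy comments, for illustration only.
LABELS = ["summary", "usage", "deprecation"]
comments = [
    "Returns the sum of two integers.",
    "Example: call add(1, 2) to get 3.",
    "Deprecated, use add_safe instead.",
    "Computes the average. Example: mean([1, 2]).",
]
# Multi-label targets: one row per comment, one column per label (1 = label applies).
y = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
])

# TF-IDF turns each comment into a sparse numeric vector;
# the random forest is then trained on those vectors.
baseline = make_pipeline(
    TfidfVectorizer(lowercase=True, ngram_range=(1, 2)),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
baseline.fit(comments, y)

# predict() returns one 0/1 indicator per label for each input comment.
predictions = baseline.predict(["Deprecated: see the usage example."])
predicted_labels = [name for name, flag in zip(LABELS, predictions[0]) if flag == 1]
```

`RandomForestClassifier` accepts a two-dimensional indicator matrix as the target, so no one-vs-rest wrapper is needed for the multi-label case in this sketch.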
Transformer-Based Models
CodeBERTa
Primary model for Java code comment classification
- Architecture: Fine-tuned from huggingface/CodeBERTa-small-v1
- Task: Multi-label classification
- Capabilities:
- Captures semantic context and relationships beyond simple keywords
- Provides higher accuracy and F1 scores compared to classical models
- Ideal for production use in scenarios requiring deep understanding of code comments
Details: See Java Model Card
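To make the multi-label setup concrete, the sketch below shows one way the base checkpoint could be loaded with a multi-label head using the Hugging Face `transformers` library, and how raw logits could be turned into label sets. The function names `load_codeberta_classifier` and `decode_multilabel`, the 0.5 threshold, and the label names in the usage lines are assumptions for illustration; the fine-tuned project weights and real label names are described in the Java Model Card.

```python
import math
from typing import List, Sequence


def load_codeberta_classifier(num_labels: int = 7):
    """Load the base checkpoint with an untrained multi-label classification head."""
    # Imported lazily so the decoding helper below works without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    checkpoint = "huggingface/CodeBERTa-small-v1"  # base model named in this document
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint,
        num_labels=num_labels,
        # Selects a binary cross-entropy loss, so each label is scored independently.
        problem_type="multi_label_classification",
    )
    return tokenizer, model


def decode_multilabel(
    logits: Sequence[float],
    label_names: Sequence[str],
    threshold: float = 0.5,  # assumed cut-off, not taken from the project
) -> List[str]:
    """Apply a sigmoid to each logit and keep labels at or above the threshold."""
    if len(logits) != len(label_names):
        raise ValueError("logits and label_names must have the same length")
    probabilities = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [name for name, p in zip(label_names, probabilities) if p >= threshold]


# Hypothetical label names and logits, for illustration only.
example_labels = ["summary", "usage", "deprecation"]
example_result = decode_multilabel([2.0, -1.0, 0.3], example_labels)
```

Unlike single-label classification, no softmax is applied across labels: a comment may receive several labels or none, depending on how many probabilities clear the threshold.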
GraphCodeBERT
Primary model for Python and Pharo code comment classification
- Architecture: Fine-tuned from GraphCodeBERT-small, a graph-based code representation model
- Task: Multi-label classification
- Advantages:
- Captures code structure and semantic relationships through a graph-based representation of code
- Better performance on structural code patterns
- Improved context understanding for domain-specific programming languages
Details: See Python Model Card and Pharo Model Card
Language-Specific Models
Java (CodeBERTa)
- Model: CodeBERTa-Comment-Classification-Java
- Classes: 7 label categories
- Base Model: huggingface/CodeBERTa-small-v1
- Documentation: Java Model Card
Python (GraphCodeBERT)
- Model: GraphCodeBERT-Comment-Classification-Python
- Classes: 5 label categories
- Base Model: GraphCodeBERT-small
- Documentation: Python Model Card
Pharo (GraphCodeBERT)
- Model: GraphCodeBERT-Comment-Classification-Pharo
- Classes: 6 label categories
- Base Model: GraphCodeBERT-small
- Documentation: Pharo Model Card
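The language-to-model mapping above can be expressed as a small lookup table. The sketch below is illustrative only: `ModelSpec`, `MODEL_REGISTRY`, and `get_model_spec` are hypothetical names, and the model and base-model strings are copied from the entries above rather than verified as repository identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    """Settings for one language-specific classifier (hypothetical structure)."""
    model_name: str
    base_model: str
    num_labels: int


# Values taken from the per-language entries in this document.
MODEL_REGISTRY = {
    "java": ModelSpec(
        "CodeBERTa-Comment-Classification-Java", "huggingface/CodeBERTa-small-v1", 7
    ),
    "python": ModelSpec(
        "GraphCodeBERT-Comment-Classification-Python", "GraphCodeBERT-small", 5
    ),
    "pharo": ModelSpec(
        "GraphCodeBERT-Comment-Classification-Pharo", "GraphCodeBERT-small", 6
    ),
}


def get_model_spec(language: str) -> ModelSpec:
    """Return the spec for a language, ignoring case and surrounding whitespace."""
    key = language.strip().lower()
    if key not in MODEL_REGISTRY:
        supported = ", ".join(sorted(MODEL_REGISTRY))
        raise ValueError(f"Unsupported language {language!r}; expected one of: {supported}")
    return MODEL_REGISTRY[key]
```

For example, `get_model_spec("Java").num_labels` yields the size of the classification head to use for Java comments.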
Model Comparison
| Model | Type | Languages | Labels | Use Case |
|---|---|---|---|---|
| CodeBERTa | Transformer | Java | 7 | Production classification |
| GraphCodeBERT | Transformer (graph-based) | Python, Pharo | 5 (Python), 6 (Pharo) | Advanced code understanding |
| RandomForest-TFIDF | Classical | All | N/A | Baseline comparison |
For detailed information about each model, see the individual model cards: