Notebooks

These notebooks implement the main experimental baselines for the NLBSE Code Comment Classification task and use MLflow (via DagsHub) to track all training runs, metrics, and artifacts for reproducibility.

General approach

Both notebooks load the NLBSE Code Comment Classification dataset (Java, Python, Pharo) and treat the problem as multi‑label sentence classification at the comment level.

- The SetFit_baseline notebook builds SetFit models on top of a sentence-transformer backbone, using the Hugging Face datasets library and the SetFitModel/Trainer API to fine‑tune a shared encoder across languages and evaluate per‑label precision, recall, and F1 (see the first sketch below).
- The RandomForest_baseline notebook provides a classical ML baseline: it vectorizes comments with TfidfVectorizer, trains language‑specific RandomForestClassifier models (with GridSearchCV for hyperparameter search), and computes custom multi‑label metrics for each documentation category (see the second sketch below).
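A minimal sketch of the SetFit workflow for a single language split follows; the dataset identifier, backbone checkpoint, and column names here are assumptions for illustration, not the notebook's exact configuration (which is recorded in the MLflow runs).

```python
import numpy as np
from datasets import load_dataset
from setfit import SetFitModel, Trainer, TrainingArguments
from sklearn.metrics import precision_recall_fscore_support

# Assumed dataset id and columns ("text", multi-hot "label"); the notebook's
# actual configuration may differ.
ds = load_dataset("NLBSE/code-comment-classification", "java")

model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-mpnet-base-v2",  # assumed backbone
    multi_target_strategy="one-vs-rest",               # one binary head per label
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(batch_size=16, num_epochs=1),
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
)
trainer.train()

# Per-label precision/recall/F1 table, as reported in the notebook.
y_true = np.array(ds["test"]["label"])
y_pred = np.array(model.predict(ds["test"]["text"]))
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average=None, zero_division=0
)
for i, (p, r, f) in enumerate(zip(prec, rec, f1)):
    print(f"label {i}: P={p:.3f} R={r:.3f} F1={f:.3f}")
```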
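And a corresponding sketch of the classical baseline for one language; the toy comments and the parameter grid are illustrative only, not the notebook's actual data or search space.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy comments with multi-hot labels (columns: summary, usage); the notebooks
# load these from the per-language dataset splits instead.
texts = [
    "Returns the number of elements in the list.",
    "Example: list.add(item) before calling size().",
    "Computes the hash code for this object.",
    "Use close() to release the underlying resources.",
    "Checks whether the queue is empty.",
    "Call connect() before sending any request.",
]
y = np.array([[1, 0], [0, 1], [1, 0], [0, 1], [1, 0], [0, 1]])

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("rf", RandomForestClassifier(random_state=42)),  # handles multi-label y natively
])
grid = GridSearchCV(
    pipe,
    param_grid={"rf__n_estimators": [100, 300], "rf__max_depth": [None, 20]},
    scoring="f1_macro",
    cv=2,  # tiny toy data; the notebooks use more folds
)
grid.fit(texts, y)

# Custom per-label metrics over a held-out set (the training set here, for brevity).
y_pred = grid.predict(texts)
_, _, f1, _ = precision_recall_fscore_support(y, y_pred, average=None, zero_division=0)
print("best params:", grid.best_params_)
print("per-label F1:", f1)
```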

Use of MLflow

MLflow is initialized through dagshub.init(..., mlflow=True), which connects the notebooks to the DagsHub-hosted repository and enables remote experiment logging.

In the SetFit notebook, each training run logs the training loss progression and the per‑language, per‑category precision/recall/F1 table, allowing direct comparison of SetFit performance across Java, Python, and Pharo labels in the MLflow UI.

The RandomForest notebook uses MLflow more extensively (a sketch of the pattern follows this list):

- the best hyperparameters found by GridSearchCV are logged as parameters;
- per‑class metrics (precision, recall, and F1 for every label) and aggregates such as mean F1 and average inference time are recorded as metrics;
- the fitted RandomForest model is stored as an artifact, via mlflow.sklearn.log_model or mlflow.log_artifact as a fallback.
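A hedged sketch of that logging pattern, reusing grid, f1, and texts from the RandomForest sketch above; the repo coordinates, run name, and metric keys are placeholders rather than the notebook's exact choices.

```python
import time

import dagshub
import mlflow
import mlflow.sklearn

# Connects MLflow to the DagsHub-hosted tracking server; fill in the repo.
dagshub.init(repo_owner="your-user", repo_name="your-repo", mlflow=True)

with mlflow.start_run(run_name="rf_java"):  # hypothetical run name
    # Best hyperparameters found by GridSearchCV, logged as parameters.
    mlflow.log_params(grid.best_params_)

    # Per-class and aggregate metrics.
    for i, score in enumerate(f1):
        mlflow.log_metric(f"f1_label_{i}", float(score))
    mlflow.log_metric("mean_f1", float(f1.mean()))

    # Average inference time per comment.
    t0 = time.perf_counter()
    grid.predict(texts)
    mlflow.log_metric("avg_inference_time", (time.perf_counter() - t0) / len(texts))

    # Fitted model as an artifact, with a plain-file fallback.
    try:
        mlflow.sklearn.log_model(grid.best_estimator_, "model")
    except Exception:
        import joblib
        joblib.dump(grid.best_estimator_, "model.joblib")
        mlflow.log_artifact("model.joblib")
```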

Each notebook execution results in a fully tracked experiment run: configurations, scores, runtimes, and model binaries are versioned automatically and can later be inspected, compared, or promoted directly from the MLflow/DagsHub interface. For finer detail on architectures, hyperparameters, and per‑label results, the MLflow runs and artifacts created by these notebooks are the authoritative reference.