Model Selection with MLflow Tags

A system for tracking training datasets and selecting the best model per language using MLflow tags.

What It Does

Automatically logs which dataset was used to train each model
Tracks best-performing models per language
Enables reproducible model selection without hardcoding values
Maintains clean separation between different training runs

Quick Start

Tag Best Models

After training, identify and tag the best model per language:

python scripts/tag_best_models.py tag

This command:
- Searches all training runs in the experiment
- Finds the best model for each language based on f1_score
- Applies tags: best_model=true, best_model_{language}=true
- Records dataset name and selection timestamp
- Removes tags from other runs
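The tags written to a winning run can be pictured as a small dictionary. The sketch below only illustrates the shape: the helper name `build_selection_tags` and the ISO 8601 timestamp format are assumptions, while the tag keys are the ones listed in the Tags Reference section.

```python
from datetime import datetime, timezone


def build_selection_tags(language, metric, dataset_name, now=None):
    """Return the tags a winning run would receive (hypothetical helper)."""
    now = now or datetime.now(timezone.utc)
    return {
        "best_model": "true",
        f"best_model_{language}": "true",
        "selection_metric": metric,
        "selection_date": now.isoformat(),  # timestamp format is an assumption
        "dataset_name": dataset_name,
    }


# "my_dataset" is a made-up dataset name used for illustration
tags = build_selection_tags("java", "f1_score", "my_dataset")
print(sorted(tags))
```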

View Tagged Models

python scripts/tag_best_models.py show

Shows all tagged models with:
- Language and run information
- Dataset used for training
- Complete metrics

Start API

uvicorn turing.api.app:app --reload

The API automatically loads and uses the tagged best models.

How It Works

Training Phase

During training, the system logs:

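import mlflow

from turing.dataset import DatasetManager

# Resolve the dataset name once so every run records the same value
dataset = DatasetManager()
dataset_name = dataset.get_dataset_name()

# model_name, lang, and model are assumed to be defined by the training loop
with mlflow.start_run(run_name=f"{model_name}_{lang}"):
    mlflow.set_tag("Language", lang)
    mlflow.set_tag("dataset_name", dataset_name)
    mlflow.log_params(model.params)
    # training...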

Tags applied:
- Language: Language code
- dataset_name: Name of dataset folder
- model_name: Model identifier
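Because each run carries these tags, runs can later be filtered by them. MLflow search filters address tags as `tags.<key>`. The helper `build_run_filter` below is hypothetical and only assembles the filter string, which could then be passed as `filter_string` to `mlflow.search_runs`.

```python
def build_run_filter(language, dataset_name=None):
    """Assemble an MLflow search filter for the training tags (hypothetical helper)."""
    clauses = [f"tags.Language = '{language}'"]
    if dataset_name is not None:
        clauses.append(f"tags.dataset_name = '{dataset_name}'")
    # Multiple clauses are joined with "and"
    return " and ".join(clauses)


print(build_run_filter("java"))
print(build_run_filter("python", "my_dataset"))  # made-up dataset name
```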

Selection Phase

The tagging script processes runs:

def tag_best_models(experiment_name, metric="f1_score", languages=None):
    # For each language:
    # 1. Find all runs with that language tag
    # 2. Order by metric (highest first)
    # 3. Tag the best run
    # 4. Remove tags from others
    ...  # outline only; see scripts/tag_best_models.py for the implementation
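One way the four steps could map onto the MLflow client API is sketched below. This is not the project's `tag_best_models.py`. The function name `tag_best_run_for_language` is hypothetical, and `client` is assumed to behave like `mlflow.tracking.MlflowClient`, using its `search_runs`, `set_tag`, and `delete_tag` methods.

```python
def tag_best_run_for_language(client, experiment_id, language, metric="f1_score"):
    """Tag the top run for one language and untag the rest (illustrative sketch)."""
    # Steps 1 and 2: runs with this language tag, highest metric first
    runs = client.search_runs(
        experiment_ids=[experiment_id],
        filter_string=f"tags.Language = '{language}'",
        order_by=[f"metrics.{metric} DESC"],
    )
    if not runs:
        return None

    best, others = runs[0], runs[1:]
    lang_tag = f"best_model_{language}"

    # Step 3: tag the best run
    client.set_tag(best.info.run_id, "best_model", "true")
    client.set_tag(best.info.run_id, lang_tag, "true")
    client.set_tag(best.info.run_id, "selection_metric", metric)

    # Step 4: remove stale best-model tags from the other runs
    for run in others:
        for key in ("best_model", lang_tag):
            if key in run.data.tags:
                client.delete_tag(run.info.run_id, key)

    return best.info.run_id
```

Passing the client in as an argument keeps the selection logic easy to test with a stub, without a running tracking server.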

Dataset Name Resolution

Dataset names are extracted dynamically from the file system:

def get_dataset_name(self) -> str:
    return self.base_interim_path.name

Nothing is hardcoded: if the dataset directory is renamed or the path changes, the logged dataset name changes with it.
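The `.name` attribute suggests that `base_interim_path` is a `pathlib.Path`, which is an assumption here. Under that assumption, the resolution behaves like this small sketch. The directory `data/interim/my_dataset` is a made-up example path.

```python
from pathlib import Path

# Hypothetical dataset location; only the final path component is used
base_interim_path = Path("data/interim/my_dataset")
dataset_name = base_interim_path.name
print(dataset_name)  # my_dataset
```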

Model Loading

The inference engine uses this selection strategy:

1. Query MLflow for models tagged best_model=true
2. Fall back to the hardcoded registry if no tagged models are found
3. Use metric-based selection as a last resort

from turing.modeling.predict import ModelInference

inference = ModelInference(use_best_model_tags=True)
response = inference.predict_payload(request)
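The selection strategy is essentially a fallback chain. The sketch below illustrates that control flow only. `select_model` and the three loader callables are hypothetical stand-ins, not the actual `model_selector.py` code, and treating an exception as a miss is an assumption about how an unreachable tracking server might be handled.

```python
def select_model(language, from_tags, from_registry, from_metrics):
    """Try each selection source in priority order and return the first hit."""
    sources = (
        ("tags", from_tags),          # 1. runs tagged best_model=true
        ("registry", from_registry),  # 2. hardcoded registry
        ("metrics", from_metrics),    # 3. metric-based selection
    )
    for name, loader in sources:
        try:
            model = loader(language)
        except Exception:
            # Treat a failing source as a miss and move on to the next one
            model = None
        if model is not None:
            return name, model
    raise LookupError(f"no model available for language {language!r}")


# Example: no tagged run exists, so the registry entry is used
registry = {"java": "registry-java-model"}
source, model = select_model(
    "java",
    from_tags=lambda lang: None,
    from_registry=registry.get,
    from_metrics=lambda lang: "metric-java-model",
)
print(source, model)
```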

Configuration

Use tags for dynamic selection (default):

inference = ModelInference(use_best_model_tags=True)

Or use hardcoded fallback:

inference = ModelInference(use_best_model_tags=False)

Advanced Usage

Custom Metric

Select best models using a different metric:

python scripts/tag_best_models.py tag --metric precision
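For orientation, a command line with `tag` and `show` subcommands and a `--metric` option could be declared with `argparse` roughly as follows. This is a guess at the structure, not the actual parser in `scripts/tag_best_models.py`. The metric choices are taken from the Features list.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented commands (hypothetical layout)."""
    parser = argparse.ArgumentParser(prog="tag_best_models.py")
    sub = parser.add_subparsers(dest="command", required=True)

    tag = sub.add_parser("tag", help="tag the best run per language")
    tag.add_argument(
        "--metric",
        default="f1_score",
        choices=["f1_score", "precision", "recall", "accuracy"],
    )

    sub.add_parser("show", help="list currently tagged runs")
    return parser


args = build_parser().parse_args(["tag", "--metric", "precision"])
print(args.command, args.metric)
```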

Programmatic Tagging

from scripts.tag_best_models import tag_best_models

tag_best_models(
    experiment_name="fine-tuned-CodeBERTa",
    metric="f1_score",
    languages=["java", "python", "pharo"]
)

Tags Reference

Applied During Training

Tag	Purpose
`Language`	Language identifier
`dataset_name`	Dataset folder name
`model_name`	Model identifier

Applied During Selection

Tag	Purpose
`best_model`	Marks the best model
`best_model_{lang}`	Language-specific best indicator
`selection_metric`	Metric used for selection
`selection_date`	When selection occurred

Data Flow

Training Runs (MLflow)
        ↓
  tag_best_models.py
        ↓
  Tagged Best Models
        ↓
  model_selector.py
        ↓
  ModelInference
        ↓
  API /predict

Features

Dynamic Dataset Tracking: Dataset name auto-detected from path
Per-Language Selection: Separate best model for each language
Flexible Metrics: Choose f1_score, precision, recall, or accuracy
Fallback Safety: Works even if MLflow is unavailable
Clean Metadata: All selection decisions recorded with timestamps

Troubleshooting

No tagged models found

python scripts/tag_best_models.py show
mlflow ui  # Check experiment directly
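The same check can be scripted. A minimal sketch, assuming `client` behaves like `mlflow.tracking.MlflowClient` and that the tag value is stored as the string `true`. `list_tagged_runs` is a hypothetical helper.

```python
def list_tagged_runs(client, experiment_name):
    """Return (run_id, Language tag) pairs for runs tagged best_model=true."""
    experiment = client.get_experiment_by_name(experiment_name)
    if experiment is None:
        # A missing experiment is one reason no tagged models show up
        return []
    runs = client.search_runs(
        experiment_ids=[experiment.experiment_id],
        filter_string="tags.best_model = 'true'",
    )
    return [(run.info.run_id, run.data.tags.get("Language")) for run in runs]
```

With MLflow installed, this could be called as `list_tagged_runs(MlflowClient(), "fine-tuned-CodeBERTa")`, using the experiment name from the Programmatic Tagging example. An empty result means either the experiment name is wrong or the `tag` command has not been run yet.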

Wrong model selected

Re-run tagging with the correct metric, then confirm the result:

python scripts/tag_best_models.py tag --metric f1_score
python scripts/tag_best_models.py show

Dataset name not captured

Verify dataset name resolution:

from turing.dataset import DatasetManager
print(DatasetManager().get_dataset_name())

Files

turing/modeling/train.py - Logs Language and dataset_name tags
turing/dataset.py - Provides get_dataset_name() method
turing/modeling/model_selector.py - Queries and selects models
turing/modeling/predict.py - Loads selected models
turing/api/app.py - Serves predictions
scripts/tag_best_models.py - CLI for tagging operations