Model Selection with MLflow Tags
A system for tracking training datasets and selecting the best model per language using MLflow tags.
What It Does
- Automatically logs which dataset was used to train each model
- Tracks best-performing models per language
- Enables reproducible model selection without hardcoding values
- Maintains clean separation between different training runs
Quick Start
Tag Best Models
After training, run the tagging script (scripts/tag_best_models.py) to identify and tag the best model per language. The script:
- Searches all training runs in the experiment
- Finds the best model for each language based on f1_score
- Applies tags: best_model=true, best_model_{language}=true
- Records dataset name and selection timestamp
- Removes tags from other runs
View Tagged Models
Shows all tagged models with:
- Language and run information
- Dataset used for training
- Complete metrics
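As an illustration of what such a listing involves, here is a minimal sketch that collects the tagged runs into plain dictionaries. It assumes `client` behaves like `mlflow.tracking.MlflowClient`; the function name `summarize_tagged_runs` is hypothetical and is not taken from the project's code.

```python
def summarize_tagged_runs(client, experiment_id):
    """Return one summary dict per run tagged best_model=true.

    `client` is expected to expose MlflowClient-style search_runs().
    """
    runs = client.search_runs(
        experiment_ids=[experiment_id],
        filter_string="tags.best_model = 'true'",
    )
    return [
        {
            "run_id": run.info.run_id,
            "language": run.data.tags.get("Language"),
            "dataset_name": run.data.tags.get("dataset_name"),
            "metrics": dict(run.data.metrics),
        }
        for run in runs
    ]
```

An empty result means that no run currently carries the best_model tag, which is also the first thing to check when the API falls back to its hardcoded registry.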
Start API
The API automatically loads and uses the tagged best models.
How It Works
Training Phase
During training, the system logs:
import mlflow

dataset = DatasetManager()
dataset_name = dataset.get_dataset_name()
with mlflow.start_run(run_name=f"{model_name}_{lang}"):
    # Record the language and dataset used for this run
    mlflow.set_tag("Language", lang)
    mlflow.set_tag("dataset_name", dataset_name)
    mlflow.log_params(model.params)
    # training...
Tags applied:
- Language: Language code
- dataset_name: Name of dataset folder
- model_name: Model identifier
Selection Phase
The tagging script processes runs:
def tag_best_models(experiment_name, metric="f1_score", languages=None):
    """Tag the best run per language (outline of the steps)."""
    # For each language:
    #   1. Find all runs with that language tag
    #   2. Order by metric (highest first)
    #   3. Tag the best run
    #   4. Remove tags from others
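The four steps above can be sketched for a single language as follows. This is an illustration under stated assumptions, not the project's actual script: `client` is assumed to behave like `mlflow.tracking.MlflowClient`, and the helper name `tag_best_run` is hypothetical. The tag names match the Tags Reference section.

```python
from datetime import datetime, timezone


def tag_best_run(client, experiment_id, language, metric="f1_score"):
    """Tag the top-scoring run for one language and clear stale tags.

    Returns the run_id of the tagged run, or None if no run logged the metric.
    """
    lang_tag = f"best_model_{language}"

    # Steps 1 and 2: find runs for this language, highest metric first.
    runs = client.search_runs(
        experiment_ids=[experiment_id],
        filter_string=f"tags.Language = '{language}'",
        order_by=[f"metrics.{metric} DESC"],
    )
    # Ignore runs that never logged the metric, then sort locally so the
    # result does not depend on how the backend orders missing values.
    scored = [run for run in runs if metric in run.data.metrics]
    if not scored:
        return None
    scored.sort(key=lambda run: run.data.metrics[metric], reverse=True)
    best = scored[0]

    # Step 4: remove best-model markers from every other run.
    for run in runs:
        if run.info.run_id == best.info.run_id:
            continue
        for key in ("best_model", lang_tag):
            if key in run.data.tags:
                client.delete_tag(run.info.run_id, key)

    # Step 3: tag the winner and record how and when it was selected.
    selection_tags = {
        "best_model": "true",
        lang_tag: "true",
        "selection_metric": metric,
        "selection_date": datetime.now(timezone.utc).isoformat(),
    }
    for key, value in selection_tags.items():
        client.set_tag(best.info.run_id, key, value)
    return best.info.run_id
```

Against a real tracking server, the call would look like `tag_best_run(MlflowClient(), experiment_id, "java")`. Checking for a tag before deleting it avoids errors on runs that were never tagged.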
Dataset Name Resolution
Dataset names are derived dynamically from the file system rather than hardcoded, so a change to the dataset path automatically propagates to the logged dataset_name tag.
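A minimal sketch of this kind of resolution, assuming the dataset name is simply the last folder component of the dataset path. The helper name `dataset_name_from_path` and the example path are hypothetical and do not reproduce the project's DatasetManager code.

```python
from pathlib import Path


def dataset_name_from_path(dataset_path: str) -> str:
    """Return the final folder component of a dataset path.

    Only the path string is inspected; the file system is not touched.
    """
    return Path(dataset_path).name


# Hypothetical path: a trailing slash does not change the result.
name = dataset_name_from_path("data/processed/comments-v2/")
```

Because the name comes from the path itself, renaming or switching the dataset folder changes the tag on the next training run without any code edit.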
Model Loading
The inference engine uses this selection strategy:
- Query MLflow for models tagged best_model=true
- Fall back to hardcoded registry if no tags found
- Use metric-based selection as last resort
from turing.modeling.predict import ModelInference
inference = ModelInference(use_best_model_tags=True)
response = inference.predict_payload(request)
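The ordered fallback described above can be sketched as a chain of lookups tried in turn. This is an illustration only: `resolve_model` and the lookup callables are hypothetical, and the real logic lives in model_selector.py.

```python
from typing import Callable, Optional, Sequence, Tuple

# A lookup takes a language code and returns a model reference or None.
Lookup = Callable[[str], Optional[str]]


def resolve_model(
    language: str, strategies: Sequence[Tuple[str, Lookup]]
) -> Tuple[str, str]:
    """Try each named strategy in order and return the first hit.

    Returns (strategy_name, model_reference). A strategy that raises,
    for example because the tracking server is unreachable, is skipped.
    """
    for name, lookup in strategies:
        try:
            model_ref = lookup(language)
        except Exception:
            continue  # this source is unavailable, try the next one
        if model_ref:
            return name, model_ref
    raise LookupError(f"no model found for language {language!r}")
```

Passing the strategies in the order tags, registry, metric reproduces the priority listed above, and the returned strategy name makes it easy to log which source served a given request.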
Configuration
Use tags for dynamic selection (default):
Or use hardcoded fallback:
Advanced Usage
Custom Metric
Select best models using a different metric:
Programmatic Tagging
from scripts.tag_best_models import tag_best_models

tag_best_models(
    experiment_name="fine-tuned-CodeBERTa",
    metric="f1_score",
    languages=["java", "python", "pharo"],
)
Tags Reference
Applied During Training
| Tag | Purpose |
|---|---|
| Language | Language identifier |
| dataset_name | Dataset folder name |
| model_name | Model identifier |
Applied During Selection
| Tag | Purpose |
|---|---|
| best_model | Marks the best model |
| best_model_{lang} | Language-specific best indicator |
| selection_metric | Metric used for selection |
| selection_date | When selection occurred |
Data Flow
Training Runs (MLflow)
↓
tag_best_models.py
↓
Tagged Best Models
↓
model_selector.py
↓
ModelInference
↓
API /predict
Features
- Dynamic Dataset Tracking: Dataset name auto-detected from path
- Per-Language Selection: Separate best model for each language
- Flexible Metrics: Choose f1_score, precision, recall, or accuracy
- Fallback Safety: Works even if MLflow is unavailable
- Clean Metadata: All selection decisions recorded with timestamps
Troubleshooting
No tagged models found
Wrong model selected
Re-run tagging with correct metric:
Dataset name not captured
Verify dataset name resolution:
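As a complementary check on the MLflow side, a sketch like the following lists runs that lack the tag. It assumes `client` behaves like `mlflow.tracking.MlflowClient`; the helper name `runs_missing_tag` is hypothetical.

```python
def runs_missing_tag(client, experiment_id, tag="dataset_name"):
    """Return the run_ids of runs whose tag is absent or empty."""
    runs = client.search_runs(experiment_ids=[experiment_id])
    return [run.info.run_id for run in runs if not run.data.tags.get(tag)]
```

Any run IDs returned were logged without a dataset name, which suggests they predate the tagging in train.py or were created outside it.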
Files
- turing/modeling/train.py - Logs Language and dataset_name tags
- turing/dataset.py - Provides get_dataset_name() method
- turing/modeling/model_selector.py - Queries and selects models
- turing/modeling/predict.py - Loads selected models
- turing/api/app.py - Serves predictions
- scripts/tag_best_models.py - CLI for tagging operations