Get Started

This section explains how to download the dataset and run the full model training and evaluation workflow. You can reproduce the pipeline either using DVC or by running each step manually.

Prerequisites

Before getting started, ensure you have completed the following:

Python 3.12+ installed on your system
Project cloned from GitHub
Dependencies installed via pip
DVC configured for remote storage access

For detailed setup instructions, see the Installation Guide.
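Two items on this checklist can be verified programmatically. The following is a minimal sketch, not part of the project: the helper name check_prerequisites is hypothetical, and it only tests the interpreter version and whether a dvc executable is on PATH. It does not verify the clone, the installed dependencies, or remote storage credentials.

```python
import shutil
import sys

MIN_PYTHON = (3, 12)  # minimum version stated in the prerequisites


def check_prerequisites(version_info=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper for illustration. The parameters exist so the
    check can be exercised without changing the real environment.
    """
    if version_info is None:
        version_info = sys.version_info
    problems = []
    if tuple(version_info[:2]) < MIN_PYTHON:
        problems.append(
            f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required, "
            f"found {version_info[0]}.{version_info[1]}"
        )
    if which("dvc") is None:
        problems.append("dvc executable not found on PATH")
    return problems


for problem in check_prerequisites():
    print("Missing prerequisite:", problem)
```

If the script prints nothing, both checks passed.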

Download the Dataset

To begin training the model, you must first download the required dataset.

Navigate to the project's root directory
Use DVC to pull the raw dataset:

dvc pull

Note

Ensure DVC is installed and configured with access to the remote storage. The dvc pull command downloads the dataset files to the data/raw/ directory.
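After the pull finishes, one quick way to confirm that files actually arrived is to list the contents of data/raw/. This is a minimal sketch: list_raw_files is a hypothetical helper, and it makes no assumption about the file names or formats stored in that directory.

```python
from pathlib import Path


def list_raw_files(project_root="."):
    """Return the files under <project_root>/data/raw/, sorted by path.

    An empty list means the directory is missing or empty, which usually
    suggests that `dvc pull` did not download anything.
    """
    raw_dir = Path(project_root) / "data" / "raw"
    if not raw_dir.is_dir():
        return []
    return sorted(p for p in raw_dir.rglob("*") if p.is_file())
```

Called from the project's root directory, len(list_raw_files()) gives the number of downloaded files.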

Train and Evaluate the Model

Model training can be executed in two ways:

Using the DVC pipeline (recommended for reproducibility)
Running each step manually (for custom workflows)

Both approaches extract the features, train the model, and generate evaluation metrics.

Option 1: Run the DVC Pipeline (Recommended)

To reproduce the entire workflow automatically, run:

dvc repro

This command executes every stage defined in the DVC pipeline, from data preprocessing to model training and evaluation.

Pipeline stages executed:

Dataset conversion and preprocessing
Feature extraction
Model training
Model evaluation

Tip

Use dvc dag to visualize the dependency graph of the pipeline:

dvc dag

Info

The pipeline is defined in dvc.yaml and includes all necessary configuration for reproducible runs.
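When the pipeline is driven from a script rather than typed by hand, the command line can be assembled and run with subprocess. This is a minimal sketch: build_repro_command and run_repro are hypothetical helpers, not part of the project. It relies on two documented DVC features, optional stage targets and the --dry flag, which prints the commands without executing them. The actual stage names are whatever dvc.yaml defines.

```python
import subprocess


def build_repro_command(stage=None, dry_run=False):
    """Assemble the argument list for `dvc repro`."""
    command = ["dvc", "repro"]
    if dry_run:
        command.append("--dry")
    if stage is not None:
        # Stage names come from dvc.yaml; none are assumed here.
        command.append(stage)
    return command


def run_repro(stage=None, dry_run=False):
    """Run `dvc repro` from the current directory; raise if it fails."""
    subprocess.run(build_repro_command(stage, dry_run), check=True)
```

Calling run_repro(dry_run=True) from the project root previews the stages that would run without executing them.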

Option 2: Run the Steps Manually

If you prefer to execute each stage independently, follow the steps below.

1. Convert Dataset from Parquet to CSV

Converts the raw Parquet dataset into the CSV format required by the later steps:

python turing/CLI_runner/run_dataset.py parquet-to-csv

Note

This step processes the raw data and saves intermediate outputs to data/interim/.

2. Extract Features

Runs the feature extraction process:

python -m turing.features --use-combo-feature

Available options:

--use-combo-feature: Enables combined feature extraction for improved model performance
--output-dir: Specifies a custom output directory (default: data/interim/features/)

3. Train and Evaluate the Model

Finally, train the model and generate evaluation results:

python turing/modeling/train.py

What happens during training:

Model training on the processed dataset
Automatic best model tagging in MLflow
Generation of evaluation metrics and reports
Saving of model artifacts and predictions
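The three manual steps can also be chained in a small driver that stops at the first failure. This is a minimal sketch under stated assumptions: run_manual_steps is a hypothetical helper, the commands are copied from the steps above, and the script is assumed to be started from the project's root directory.

```python
import subprocess
import sys

# The three commands from the manual workflow, in order.
# sys.executable is used instead of "python" so the steps run with the
# same interpreter (and virtual environment) as this script.
MANUAL_STEPS = [
    [sys.executable, "turing/CLI_runner/run_dataset.py", "parquet-to-csv"],
    [sys.executable, "-m", "turing.features", "--use-combo-feature"],
    [sys.executable, "turing/modeling/train.py"],
]


def run_manual_steps(steps=MANUAL_STEPS, runner=subprocess.run):
    """Run each step in order and stop at the first failure.

    Returns the number of steps that completed successfully.
    """
    completed = 0
    for command in steps:
        result = runner(command)
        if result.returncode != 0:
            print("Step failed:", " ".join(command))
            break
        completed += 1
    return completed
```

A return value of 3 means all steps succeeded. Unlike dvc repro, this driver always reruns every step and does not skip stages whose inputs are unchanged.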

After Training

Once training is complete, you'll find the following outputs:

Model Artifacts

Location: /models/mlflow_temp_models/{language}/{model_name}/
Formats: Model weights, configuration files, and tokenizer data
Languages: Java, Python, Pharo (one folder per language)
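The location pattern above can be expanded into concrete paths with a few lines of pathlib. This is a minimal sketch with two labeled assumptions: the location is treated as relative to the project root, and the language folders are assumed to use lowercase names. The helper artifact_dir is hypothetical, and the model name is whatever your training run produced.

```python
from pathlib import Path

# Assumption: the location is relative to the project root and language
# folders are named in lowercase; adjust both if your checkout differs.
ARTIFACT_ROOT = Path("models") / "mlflow_temp_models"
LANGUAGES = ("java", "python", "pharo")


def artifact_dir(language, model_name, project_root="."):
    """Build the expected artifact directory for one language and model."""
    if language not in LANGUAGES:
        raise ValueError(f"unknown language: {language!r}")
    return Path(project_root) / ARTIFACT_ROOT / language / model_name
```

Checking artifact_dir(language, model_name).is_dir() for each language is a quick way to see which models were saved.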

Metrics and Reports

MLflow Dashboard: Access at http://localhost:5000 (if the MLflow server is running)
Unit Tests: reports/unit_tests/report.md
Behavioral Tests: reports/behavioral_tests/report.md
Data Analysis: reports/data/clean-k5000/

Logs

Training logs: Check the terminal output or the MLflow UI
DVC pipeline state: Run dvc status, or inspect dvc.lock for the recorded stage inputs and outputs

Next Steps

After successfully training and evaluating the model, explore these resources:

Deployment Workflow

REST API: Model API User Guide
HuggingFace Spaces: Deploy automatically with GitHub Actions

Try the Interactive GUI

Live Demo: Try the Classifier
GUI Documentation: GUI User Guide

Explore Model Details

Model Overview: Models Documentation
Language-Specific Models:
Java Model Card
Python Model Card
Pharo Model Card

Add New Models

Developer Guide: Adding a New Model
Model Selection: Model Tagging & Selection

Full Documentation

Home: Documentation Hub
Complete Guides: https://se4ai2526-uniba.github.io/Turing/

Troubleshooting

DVC Pull Fails

# Initialize DVC if not already done
dvc init

# Configure remote storage
dvc remote add -d myremote <remote-url>

# Try pulling again
dvc pull

Training Out of Memory

Reduce batch size in turing/config.py
Use GPU acceleration (see GPU Support)
Split training into smaller chunks

Module Import Errors

# Reinstall the project and its dependencies (run from the project root)
pip install -e .

# Or include the development extras
pip install -e ".[dev]"

MLflow Server Not Found

# Start MLflow server manually
mlflow ui --host 127.0.0.1 --port 5000
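Before opening the dashboard, it can help to check whether anything is listening on the expected port. This is a minimal sketch using only the standard library: is_port_open is a hypothetical helper, and a True result only shows that some process accepts TCP connections on that port, not that the process is MLflow.

```python
import socket


def is_port_open(host="127.0.0.1", port=5000, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

If is_port_open() returns False, start the server with the command above and try again.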

Questions?

For more information, refer to: