Get Started
This section explains how to download the dataset and run the full model training and evaluation workflow. You can reproduce the pipeline either using DVC or by running each step manually.
Prerequisites
Before getting started, ensure you have completed the following:
- Python 3.12+ installed on your system
- Project cloned from GitHub
- Dependencies installed via pip
- DVC configured for remote storage access
For detailed setup instructions, see the Installation Guide.
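The checklist above can be verified from a terminal before going further. The sketch below assumes a POSIX shell and that `python3`, `pip`, `dvc`, and `git` are the executable names on your system; `check_tool` is a hypothetical helper, not part of the project.

```shell
# Hypothetical helper: report whether a command is available on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Check each prerequisite; the tool names are assumptions about your setup.
for tool in python3 pip dvc git; do
    check_tool "$tool" || true
done

# Print the interpreter version so you can confirm it is 3.12 or newer.
if command -v python3 >/dev/null 2>&1; then
    python3 --version
fi
```

Once DVC is installed, running `dvc remote list` inside the project shows whether a remote has been configured.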
Download the Dataset
To begin training the model, you must first download the required dataset.
- Navigate to the project's root directory
- Use DVC to pull the raw dataset:
Note
Ensure DVC is installed and properly configured with access to the remote storage. This will download the dataset files to the data/raw/ directory.
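A minimal sketch of this step, assuming the standard DVC command-line interface and a remote that is already configured: `dvc pull` fetches the tracked data, and listing `data/raw/` confirms the files arrived. The wrapper `pull_dataset` is a hypothetical convenience, not a project script.

```shell
# Hypothetical wrapper: pull DVC-tracked data, then list data/raw/.
# Run it from the project's root directory.
pull_dataset() {
    dvc pull || return 1
    ls data/raw/
}
```

Call `pull_dataset` from the project root, or run `dvc pull` directly if you do not need the listing.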
Train and Evaluate the Model
Model training can be executed in two ways:
- Using the DVC pipeline (recommended for reproducibility)
- Running each step manually (for custom workflows)
Both approaches will produce the extracted features, train the model, and generate evaluation metrics.
Option 1: Run the DVC Pipeline (Recommended)
To reproduce the entire workflow automatically, simply run:
This command executes every stage defined in the DVC pipeline, from data preprocessing to model training and evaluation.
Pipeline stages executed:
- Dataset conversion and preprocessing
- Feature extraction
- Model training
- Model evaluation
Info
The pipeline is defined in dvc.yaml and includes all necessary configuration for reproducible runs.
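With the standard DVC command-line interface, a pipeline defined in dvc.yaml is typically reproduced with `dvc repro`; `dvc dag` prints the stage graph and `dvc status` reports which stages are out of date. The function `run_pipeline` below is a hypothetical wrapper for illustration; the actual stage names come from the project's dvc.yaml.

```shell
# Hypothetical wrapper around standard DVC commands.
run_pipeline() {
    dvc dag || return 1      # show the stage graph defined in dvc.yaml
    dvc repro || return 1    # run every stage whose inputs have changed
    dvc status               # report any stages that are still out of date
}
```

Because `dvc repro` skips stages whose dependencies are unchanged, a second run immediately after the first should do little or no work.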
Option 2: Run the Steps Manually
If you prefer to execute each stage independently, follow the steps below.
1. Convert Dataset to Parquet/CSV
Converts the dataset into the required format:
Note
This step processes the raw data and saves intermediate outputs to data/interim/.
2. Extract Features
Runs the feature extraction process:
Available options:
- --use-combo-feature: Enables combined feature extraction for improved model performance
- --output-dir: Specify custom output directory (default: data/interim/features/)
3. Train and Evaluate the Model
Finally, train the model and generate evaluation results:
What happens during training:
- Model training on the processed dataset
- Automatic best model tagging in MLflow
- Generation of evaluation metrics and reports
- Saving of model artifacts and predictions
After Training
Once training is complete, you'll find the following outputs:
Model Artifacts
- Location: /models/mlflow_temp_models/{language}/{model_name}/
- Formats: Model weights, configuration files, and tokenizer data
- Languages: Java, Python, Pharo (one folder per language)
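To make the layout concrete, the sketch below builds the artifact directory for each language and lists its contents if it exists. It assumes the location is relative to the project root and that the language folders are lowercase; `artifact_dir` and the model name `my_model` are hypothetical placeholders.

```shell
# Hypothetical helper: build the artifact directory for a language and model.
# Assumes paths are relative to the project root.
artifact_dir() {
    echo "models/mlflow_temp_models/$1/$2/"
}

# List what was saved for each language; my_model is a placeholder name.
for lang in java python pharo; do
    dir=$(artifact_dir "$lang" my_model)
    if [ -d "$dir" ]; then
        ls "$dir"
    else
        echo "not found: $dir"
    fi
done
```

Replace `my_model` with the model name shown in your training output or in the MLflow UI.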
Metrics and Reports
- MLflow Dashboard: Access at http://localhost:5000 (if MLflow server is running)
- Unit Tests: reports/unit_tests/report.md
- Behavioral Tests: reports/behavioral_tests/report.md
- Data Analysis: reports/data/clean-k5000/
Logs
- Training logs: Check terminal output or MLflow UI
- DVC logs: Review dvc.yaml execution history
Next Steps
After successfully training and evaluating the model, explore these resources:
Deployment Workflow
- REST API: Model API User Guide
- HuggingFace Spaces: Deploy automatically with GitHub Actions
Try the Interactive GUI
- Live Demo: Try the Classifier
- GUI Documentation: GUI User Guide
Explore Model Details
- Model Overview: Models Documentation
- Language-Specific Models:
- Java Model Card
- Python Model Card
- Pharo Model Card
Add New Models
- Developer Guide: Adding a New Model
- Model Selection: Model Tagging & Selection
Full Documentation
- Home: Documentation Hub
- Complete Guides: https://se4ai2526-uniba.github.io/Turing/
Troubleshooting
DVC Pull Fails
# Initialize DVC if not already done
dvc init
# Configure remote storage
dvc remote add -d myremote <remote-url>
# Try pulling again
dvc pull
Training Out of Memory
- Reduce batch size in turing/config.py
- Use GPU acceleration (see GPU Support)
- Split training into smaller chunks
Module Import Errors
# Reinstall the project and its dependencies in editable mode (run from the project root)
pip install -e .
# Or include the development extras
pip install -e ".[dev]"
MLflow Server Not Found
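One likely cause is that no MLflow server is listening on port 5000. Assuming MLflow is installed in the active environment and the runs are stored locally, the standard `mlflow ui` command starts the dashboard on the port used in the URL under Metrics and Reports. `start_mlflow_ui` is a hypothetical wrapper, not a project script.

```shell
# Hypothetical wrapper: start the MLflow UI on the port used in this guide.
# Pass a different port as the first argument if 5000 is already taken.
start_mlflow_ui() {
    port="${1:-5000}"
    mlflow ui --port "$port"
}
```

Run `start_mlflow_ui` from the project root, then open http://localhost:5000 in a browser. If the project logs to a remote tracking server instead, this local command will not show those runs.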
Questions?
For more information, refer to: