Get Started

This section explains how to download the dataset and run the full model training and evaluation workflow. You can reproduce the pipeline either using DVC or by running each step manually.


Prerequisites

Before getting started, ensure you have completed the following:

  • Python 3.12+ installed on your system
  • Project cloned from GitHub
  • Dependencies installed via pip
  • DVC configured for remote storage access

For detailed setup instructions, see the Installation Guide.
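
As a quick sanity check before continuing, the locally verifiable prerequisites can be tested from Python. This is a minimal sketch, not part of the project: the function name `check_prerequisites` is hypothetical, and it only confirms that the interpreter version is high enough and that a `dvc` executable is on the `PATH`. It cannot confirm that the project is cloned, that dependencies are installed, or that the DVC remote is reachable.

```python
import shutil
import sys

MIN_PYTHON = (3, 12)  # minimum version listed in the prerequisites above


def check_prerequisites() -> dict[str, bool]:
    """Return a pass/fail flag for each prerequisite that can be checked locally."""
    return {
        # True when the running interpreter is at least Python 3.12
        "python": sys.version_info >= MIN_PYTHON,
        # True when a `dvc` executable can be found on the PATH
        "dvc": shutil.which("dvc") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

Any item reported as missing points back to the Installation Guide.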


Download the Dataset

To begin training the model, you must first download the required dataset.

  1. Navigate to the project's root directory
  2. Use DVC to pull the raw dataset:
dvc pull

Note

Ensure DVC is installed and configured with access to the remote storage. The dvc pull command downloads the dataset files to the data/raw/ directory.
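
To confirm that the pull actually placed files in data/raw/, something like the following could be run from the project root. It is a sketch under the assumption that data/raw/ sits directly under the project root; the helper name `list_raw_files` is hypothetical, and no particular file names or formats are assumed.

```python
from pathlib import Path


def list_raw_files(project_root: str = ".") -> list[Path]:
    """Return every file under data/raw/, sorted; empty if the directory is missing."""
    raw_dir = Path(project_root) / "data" / "raw"
    if not raw_dir.is_dir():
        return []
    # rglob("*") walks subdirectories too; keep regular files only
    return sorted(p for p in raw_dir.rglob("*") if p.is_file())


if __name__ == "__main__":
    files = list_raw_files()
    print(f"{len(files)} file(s) found under data/raw/")
```

An empty result after dvc pull usually means the pull did not complete or the remote is not configured.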


Train and Evaluate the Model

Model training can be executed in two ways:

  • Using the DVC pipeline (recommended for reproducibility)
  • Running each step manually (for custom workflows)

Both approaches will produce the extracted features, train the model, and generate evaluation metrics.


Option 1: Use the DVC Pipeline

To reproduce the entire workflow automatically, run:

dvc repro

This command executes every stage defined in the DVC pipeline, from data preprocessing to model training and evaluation.

Pipeline stages executed:

  1. Dataset conversion and preprocessing
  2. Feature extraction
  3. Model training
  4. Model evaluation

Tip

Use dvc dag to visualize the dependency graph of the pipeline:

dvc dag

Info

The pipeline is defined in dvc.yaml and includes all necessary configuration for reproducible runs.


Option 2: Run the Steps Manually

If you prefer to execute each stage independently, follow the steps below.

1. Convert Dataset to Parquet/CSV

Converts the dataset into the required format:

python turing/CLI_runner/run_dataset.py parquet-to-csv

Note

This step processes the raw data and saves intermediate outputs to data/interim/.

2. Extract Features

Runs the feature extraction process:

python -m turing.features --use-combo-feature

Available options:

  • --use-combo-feature: Enables combined feature extraction for improved model performance
  • --output-dir: Specify custom output directory (default: data/interim/features/)

3. Train and Evaluate the Model

Finally, train the model and generate evaluation results:

python turing/modeling/train.py

What happens during training:

  • Model training on the processed dataset
  • Automatic best model tagging in MLflow
  • Generation of evaluation metrics and reports
  • Saving of model artifacts and predictions
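
The three manual steps above can also be chained from a small script that stops at the first failure. This is a sketch only: the commands are copied from the steps in this section, while the names `MANUAL_STEPS` and `run_manual_steps` and the `dry_run` flag are hypothetical and not part of the project. Unlike dvc repro, it reruns every step each time and does no caching.

```python
import subprocess
import sys

# Commands copied from the manual steps; sys.executable stands in for "python"
# so the steps run with the same interpreter as this script.
MANUAL_STEPS = [
    [sys.executable, "turing/CLI_runner/run_dataset.py", "parquet-to-csv"],
    [sys.executable, "-m", "turing.features", "--use-combo-feature"],
    [sys.executable, "turing/modeling/train.py"],
]


def run_manual_steps(dry_run: bool = False) -> list[str]:
    """Run each step in order and return the command lines that were handled."""
    handled = []
    for cmd in MANUAL_STEPS:
        handled.append(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code,
            # which prevents later steps from running on bad inputs
            subprocess.run(cmd, check=True)
    return handled


if __name__ == "__main__":
    for line in run_manual_steps(dry_run=True):
        print(line)
```

Run it from the project root; with dry_run=True it only prints the commands, which is a convenient way to review them first.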

After Training

Once training is complete, you'll find the following outputs:

Model Artifacts

  • Location: /models/mlflow_temp_models/{language}/{model_name}/
  • Formats: Model weights, configuration files, and tokenizer data
  • Languages: Java, Python, Pharo (one folder per language)
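
To see which models were written for each language, the {language}/{model_name}/ layout can be walked with a few lines of Python. This sketch assumes the artifact path is relative to the project root (the leading slash in the location above is treated as repository-relative, which is an assumption) and makes no assumption about the exact folder names. The helper name `find_model_dirs` is hypothetical.

```python
from pathlib import Path


def find_model_dirs(models_root: str = "models/mlflow_temp_models") -> dict[str, list[str]]:
    """Map each language folder to the sorted model sub-folders inside it."""
    root = Path(models_root)
    if not root.is_dir():
        return {}
    return {
        lang_dir.name: sorted(m.name for m in lang_dir.iterdir() if m.is_dir())
        for lang_dir in sorted(root.iterdir())
        if lang_dir.is_dir()
    }


if __name__ == "__main__":
    for language, models in find_model_dirs().items():
        print(f"{language}: {', '.join(models) or '(no models)'}")
```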

Metrics and Reports

  • MLflow Dashboard: Access at http://localhost:5000 (if MLflow server is running)
  • Unit Tests: reports/unit_tests/report.md
  • Behavioral Tests: reports/behavioral_tests/report.md
  • Data Analysis: reports/data/clean-k5000/
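
A short check like the one below can report which of the listed report outputs are absent after a run. It is a sketch that assumes the report paths are relative to the project root; the names `EXPECTED_REPORTS` and `missing_reports` are hypothetical.

```python
from pathlib import Path

# Paths taken from the list above, assumed relative to the project root
EXPECTED_REPORTS = (
    "reports/unit_tests/report.md",
    "reports/behavioral_tests/report.md",
    "reports/data/clean-k5000",
)


def missing_reports(project_root: str = ".") -> list[str]:
    """Return the expected report paths that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in EXPECTED_REPORTS if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_reports()
    print("all reports present" if not missing else f"missing: {missing}")
```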

Logs

  • Training logs: Check terminal output or MLflow UI
  • DVC logs: Review dvc.yaml execution history

Next Steps

After successfully training and evaluating the model, explore these resources:

Deployment Workflow

  • REST API: Model API User Guide
  • HuggingFace Spaces: Deploy automatically with GitHub Actions

Try the Interactive GUI

Explore Model Details

Add New Models

Full Documentation


Troubleshooting

DVC Pull Fails

# Initialize DVC if not already done
dvc init

# Configure remote storage
dvc remote add -d myremote <remote-url>

# Try pulling again
dvc pull

Training Out of Memory

  • Reduce batch size in turing/config.py
  • Use GPU acceleration (see GPU Support)
  • Split training into smaller chunks

Module Import Errors

# Reinstall the project and its dependencies (run from the project root)
pip install -e .

# Or include the development extras
pip install -e ".[dev]"

MLflow Server Not Found

# Start MLflow server manually
mlflow ui --host 127.0.0.1 --port 5000
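
Before restarting the server, it can help to check whether anything is already listening on the expected port. The sketch below uses a plain TCP connection attempt; the function name `is_server_listening` is hypothetical, and the default host and port mirror the command above. A True result only shows that some process accepts connections on that port, not that it is MLflow.

```python
import socket


def is_server_listening(host: str = "127.0.0.1", port: int = 5000, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False


if __name__ == "__main__":
    state = "reachable" if is_server_listening() else "not reachable"
    print(f"127.0.0.1:5000 is {state}")
```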

Questions?

For more information, refer to: