Locust Load Testing

To ensure the API is robust and capable of handling concurrent requests, we implemented a load testing strategy using Locust. This allows us to simulate realistic user traffic patterns, identify bottlenecks, and verify the stability of the inference engine under stress.

Key Improvements & Rationale

During initial stress testing, we identified critical concurrency issues

To address this, we refactored the inference pipeline: 1. Pure PyTorch Inference: Removed dependencies on HuggingFace Trainer for prediction, switching to direct PyTorch execution. 2. Caching: Implemented ModelInference caching (self.loaded_models). Models are now loaded once per language and reused, rather than reloading from disk for every request. 3. Dynamic Language Switching: The tests are designed to randomly switch between languages (Python, Java, Pharo) to stress-test the memory management

The Locustfile

The test logic is defined in locustfile.py. It simulates a user who: Selects a Language: Randomly picks a supported language (python, java, pharo). Sends Payload: Sends a language-specific code snippet to the /predict endpoint.

How to Run the Load Tests

Running the tests requires two separate terminal windows.

Start the API

In the first terminal, launch the FastAPI application:

uvicorn turing.api.app:app --reload

In the second terminal, navigate to the monitoring directory and start locust:

cd turing/monitoring/
locust

Access the Dashboard

Open your browser and navigate to http://localhost:8089.
Set the Number of users (e.g., 50).
Set the Spawn rate (e.g., 5 users/second).
Click Start Swarming.

Outputs and Reports

Locust provides real-time statistics in its web interface.

Some reports are in the reports directory: Location: reports/monitoring/locust/