# BenchLoop Documentation
BenchLoop is a Python library for managing, processing, and benchmarking datasets in SQLite databases, designed for AI pipelines, LLM prompt engineering, and dataset curation.
## Table of Contents
- Introduction
- Installation
- Project Structure
- Quickstart
- API Reference
- Advanced Usage
- Best Practices
- Troubleshooting
- FAQ
## Introduction
BenchLoop enables you to:
- Load and update structured data in SQLite.
- Run prompts row-by-row, substitute variables, and store LLM responses.
- Filter, export, and benchmark datasets with ease.
- Build reproducible, scalable data loops for AI/ML workflows.
## Installation
Requirements:
- Python 3.7+
- Standard libraries: `sqlite3`, `csv`, `json`, `os`, `difflib`
- For LLM calls: `openai` (install via `pip install openai`)
Install BenchLoop:
```bash
pip install benchloop
pip install openai  # For LLM prompt execution
```
## Project Structure
```text
benchloop/
├── __init__.py
├── loader.py
├── db_manager.py
├── prompt_runner.py
├── dataset_exporter.py
└── benchmarker.py
docs/
└── benchloop.md
main.py
```
- loader.py: Data ingestion (CSV, JSON, dicts) into SQLite.
- db_manager.py: Core DB operations and flexible filtering.
- prompt_runner.py: Row-wise prompt execution and LLM integration.
- dataset_exporter.py: Export datasets for training (JSONL, etc.).
- benchmarker.py: Benchmarking AI responses vs. ground truth.
- main.py: Example/test script for all features.
## Quickstart
```python
from benchloop.loader import load_table
from benchloop.prompt_runner import execute_prompt_on_table
from benchloop.dataset_exporter import export_training_dataset
from benchloop.benchmarker import benchmark_responses

# 1. Load data
load_table(
    table_name="products",
    data_source=[{"id": 1, "name": "Zapato", "price": "50"}],
    db_path="mydb.sqlite",
)

# 2. Run prompts and store LLM responses
execute_prompt_on_table(
    table_name="products",
    prompt_template="Describe the product {name} that costs {price} dollars.",
    columns_variables=["name", "price"],
    result_mapping={"response": "llm_response"},
    db_path="mydb.sqlite",
    model="gpt-4o",
    api_key="sk-...",
)

# 3. Export dataset for training
export_training_dataset(
    table_name="products",
    prompt_template="Describe the product {name} that costs {price} dollars.",
    response_column="llm_response",
    output_file="dataset.jsonl",
    db_path="mydb.sqlite",
    format="messages",
)

# 4. Benchmark responses
benchmark_responses(
    table_name="products",
    column_ai="llm_response",
    column_ground_truth="ground_truth",
    db_path="mydb.sqlite",
    benchmark_tag=None,
)
```
## API Reference
### load_table
Purpose: Load structured data into an SQLite table from CSV, JSON, or a list of dicts.
Location: benchloop/loader.py
```python
load_table(table_name: str, data_source: Union[str, List[Dict]], db_path: str)
```
- table_name: Name of the table.
- data_source: Path to CSV/JSON file or a list of dicts.
- db_path: Path to the SQLite database.
Features:
- Auto-creates table and columns as needed.
- Adds new columns on the fly.
- Prevents duplicate insertions (if `id` is present).
Example:
```python
load_table("mytable", "data.csv", "mydb.sqlite")
```
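As a quick sketch of the incremental behavior (the auto-column and `id` deduplication described above, applied over two loads):

```python
from benchloop.loader import load_table

# Initial load: creates the table with columns id, name, price.
load_table("products", [{"id": 1, "name": "Zapato", "price": "50"}], "mydb.sqlite")

# Second load: the new "category" key should add a column on the fly,
# and the row with the already-seen id=1 should be skipped as a duplicate.
load_table(
    "products",
    [
        {"id": 1, "name": "Zapato", "price": "50"},
        {"id": 2, "name": "Bolso", "price": "80", "category": "accessories"},
    ],
    "mydb.sqlite",
)
```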
### filter_rows
Purpose: Flexible filtering of rows with support for operators and include/exclude modes.
Location: benchloop/db_manager.py
```python
DBManager.filter_rows(table_name: str, filters: dict, mode: str, db_path: str) -> list
```
- table_name: Table to filter.
- filters: Dict of conditions (supports `=`, `!=`, `>`, `<`, `contains`, `startswith`, `IN`).
- mode: `'include'` or `'exclude'`.
- db_path: Path to the SQLite database.
Example:
```python
rows = DBManager.filter_rows("mytable", {"price": {">": 100}}, "include", "mydb.sqlite")
```
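A slightly larger sketch using the documented operators; note that how multiple conditions combine (presumably AND) is an assumption, not something the signature specifies:

```python
from benchloop.db_manager import DBManager

# Include rows whose name contains "Zapa" and whose price exceeds 10.
matches = DBManager.filter_rows(
    "products",
    {"name": {"contains": "Zapa"}, "price": {">": 10}},
    "include",
    "mydb.sqlite",
)

# Exclude rows whose name starts with "Test".
kept = DBManager.filter_rows(
    "products",
    {"name": {"startswith": "Test"}},
    "exclude",
    "mydb.sqlite",
)
```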
### execute_prompt_on_table
Purpose: Run prompts row-by-row, substitute variables, call LLM, and store responses.
Location: benchloop/prompt_runner.py
```python
execute_prompt_on_table(
    table_name: str,
    prompt_template: str,
    columns_variables: List[str],
    result_mapping: Dict[str, str],
    db_path: str,
    filters: Optional[Dict] = None,
    limit: Optional[int] = None,
    model: str = "gpt-4o",
    api_key: str = ""
)
```
- table_name: Table to process.
- prompt_template: Prompt with `{}` placeholders.
- columns_variables: List of columns to substitute.
- result_mapping: Dict mapping LLM response fields to DB columns.
- db_path: Path to the SQLite database.
- filters: Optional row filters.
- limit: Max rows to process.
- model: LLM model name.
- api_key: OpenAI API key.
Example:
```python
execute_prompt_on_table(
    "products",
    "Describe {name}",
    ["name"],
    {"response": "llm_response"},
    "mydb.sqlite",
    model="gpt-4o",
    api_key="sk-...",
)
```
### export_training_dataset
Purpose: Export a JSONL dataset for training, with prompt/response pairs.
Location: benchloop/dataset_exporter.py
```python
export_training_dataset(
    table_name: str,
    prompt_template: str,
    response_column: str,
    output_file: str,
    db_path: str,
    filters: Optional[Dict] = None,
    format: str = "messages"
)
```
- table_name: Table to export from.
- prompt_template: Prompt template with placeholders.
- response_column: Column with LLM response.
- output_file: Output JSONL file path.
- db_path: Path to the SQLite database.
- filters: Optional row filters.
- format: `"messages"` (OpenAI style) or `"input/output"`.
Example:
```python
export_training_dataset(
    "products",
    "Describe {name}",
    "llm_response",
    "dataset.jsonl",
    "mydb.sqlite",
)
```
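To sanity-check the output, read back the first exported line; the `messages` shape in the comment is an assumption based on the OpenAI chat fine-tuning format:

```python
import json

# Print the first exported example to verify its structure.
with open("dataset.jsonl") as f:
    first = json.loads(f.readline())
print(first)
# Assumed shape: {"messages": [{"role": "user", "content": "..."},
#                              {"role": "assistant", "content": "..."}]}
```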
### benchmark_responses
Purpose: Compare AI responses vs. ground truth and compute metrics.
Location: benchloop/benchmarker.py
```python
benchmark_responses(
    table_name: str,
    column_ai: str,
    column_ground_truth: str,
    db_path: str,
    benchmark_tag: Optional[str] = None,
    benchmark_column: str = "benchmark",
    similarity_threshold: float = 0.9
) -> Dict
```
- table_name: Table to benchmark.
- column_ai: Column with AI responses.
- column_ground_truth: Column with ground truth.
- db_path: Path to the SQLite database.
- benchmark_tag: Only include rows with this tag.
- benchmark_column: Name of the tag column.
- similarity_threshold: Fuzzy match threshold.
Example:
```python
metrics = benchmark_responses(
    "products",
    "llm_response",
    "ground_truth",
    "mydb.sqlite",
    benchmark_tag="PrecioTest",
)
```
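Since `difflib` appears in the requirements, the fuzzy match is plausibly a `SequenceMatcher` ratio compared against `similarity_threshold`; that is an assumption about internals, but this standalone sketch shows what a 0.9 threshold means in practice:

```python
import difflib

ai_answer = "A sturdy leather shoe for everyday wear."
ground_truth = "A sturdy leather shoe for every-day wear."

# SequenceMatcher.ratio() returns a similarity score in [0, 1].
ratio = difflib.SequenceMatcher(None, ai_answer, ground_truth).ratio()
print(round(ratio, 2), ratio >= 0.9)  # 0.99 True -> would count as a match
```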
## Advanced Usage
- Custom Prompt Templates: Use any columns in your prompt template.
- Chaining: Use filter_rows to select rows, then run prompts or export (see the sketch after this list).
- Multiple Benchmarks: Use the `benchmark` column to tag and compare different subsets.
- Export Formats: Use `format="input/output"` for simple datasets, or `"messages"` for OpenAI-style datasets.
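A chaining sketch that exports only a filtered subset, assuming `filters` accepts the same operator dict as `DBManager.filter_rows`:

```python
from benchloop.dataset_exporter import export_training_dataset

# Export only rows priced above 100, in the simpler input/output format.
export_training_dataset(
    table_name="products",
    prompt_template="Describe the product {name} that costs {price} dollars.",
    response_column="llm_response",
    output_file="expensive.jsonl",
    db_path="mydb.sqlite",
    filters={"price": {">": 100}},
    format="input/output",
)
```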
## Best Practices
- Always use unique `id` fields for deduplication.
- Use filters to avoid re-processing already-annotated rows.
- Store ground truth in a dedicated column for benchmarking.
- Version your datasets and keep track of prompt templates used.
## Troubleshooting
- No rows exported? Check your filters and that the response column is populated (see the check below).
- Prompt errors? Ensure all variables in the template exist as columns.
- Benchmark returns 0 rows? Check the benchmark tag and that ground truth is present.
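When debugging any of these, it can help to query the database directly with the standard `sqlite3` module. For example, to confirm the response column is populated (table and column names follow the examples above):

```python
import sqlite3

# Count rows whose llm_response column is actually populated.
con = sqlite3.connect("mydb.sqlite")
count = con.execute(
    "SELECT COUNT(*) FROM products "
    "WHERE llm_response IS NOT NULL AND llm_response != ''"
).fetchone()[0]
print(f"rows with llm_response populated: {count}")
con.close()
```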
## FAQ
- Q: Can I use this with any LLM provider?
  A: The prompt runner is designed for OpenAI, but you can adapt it for other APIs.
- Q: How do I add new columns?
  A: Just include new keys in your data dicts or CSV/JSON files; columns are added automatically.
- Q: Can I export only a subset of rows?
  A: Yes, use the `filters` parameter in `export_training_dataset`.
- Q: How do I benchmark only a specific subset?
  A: Use the `benchmark_tag` parameter in `benchmark_responses`.
BenchLoop is designed to make dataset curation, prompt engineering, and benchmarking fast, reproducible, and robust.
For more examples, see `main.py` or open an issue on the project repository.