Example: ML Research

This example shows how Ouro Loop's BOUND system applies to an autonomous ML experiment framework based on karpathy/autoresearch. The BOUND definition reframes single-metric optimization (val_bpb) as a formal constraint system, protecting the evaluation harness and data pipeline while granting the agent full autonomy over the training code. This is the closest example to autoresearch's original paradigm, demonstrating how Ouro Loop generalizes the ML experiment loop with explicit constraints.

Project Overview


Project	autoresearch — Autonomous ML Experiment Framework
Language	Python
Architecture	Single-file training (`train.py`), fixed evaluation harness (`prepare.py`), TSV result log
Domain	LLM training, autonomous experiment iteration

BOUND Definition

DANGER ZONES

Path	Risk
`prepare.py`	Fixed evaluation harness. Modifying this invalidates all comparisons.
`evaluate_bpb()`	The ground truth metric. Must never change.
Data pipeline	Tokenizer and dataloader are fixed constants.

NEVER DO

Never modify prepare.py — it is read-only
Never install new packages or add dependencies
Never modify the evaluation harness
Never skip logging results to results.tsv
Never exceed the 5-minute training time budget
Never commit results.tsv to git (keep untracked)

IRON LAWS

val_bpb is the only metric that matters — lower is better
Training always runs for exactly 5 minutes wall clock
Only train.py is modified — everything else is fixed
Every experiment is logged: commit hash, val_bpb, memory, status
Improvements keep the commit, regressions revert to previous
Simplicity wins: equal val_bpb with less code is a positive result

Development Workflow

# Run experiment
uv run train.py > run.log 2>&1
grep "^val_bpb:" run.log

# Log result (one tab-separated row in results.tsv)
# commit	val_bpb	memory_gb	status	description
# a1b2c3d	0.997900	44.0	keep	baseline
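The logging step can also be scripted. The following is a minimal sketch, not part of autoresearch itself: it assumes the log contains a line of the form `val_bpb: 0.997900` (as the `grep` pattern suggests) and that rows use the column order shown above. The helper names `parse_val_bpb` and `format_row` are hypothetical.

```python
import re
from typing import Optional

# Assumption: the training run prints a line such as "val_bpb: 0.997900".
VAL_BPB_RE = re.compile(r"^val_bpb:\s*([0-9]*\.?[0-9]+)", re.MULTILINE)


def parse_val_bpb(log_text: str) -> Optional[float]:
    """Return the last val_bpb found in the log, or None if none was printed."""
    matches = VAL_BPB_RE.findall(log_text)
    return float(matches[-1]) if matches else None


def format_row(commit: str, val_bpb: float, memory_gb: float,
               status: str, description: str) -> str:
    """Build one tab-separated row: commit, val_bpb, memory_gb, status, description."""
    return "\t".join(
        [commit, f"{val_bpb:.6f}", f"{memory_gb:.1f}", status, description]
    )
```

A caller would read `run.log`, pass its text to `parse_val_bpb`, and append the formatted row to `results.tsv`. A `None` result indicates that the run never reported a metric.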

How autoresearch Maps to BOUND

The autoresearch paradigm maps cleanly onto the BOUND system:

autoresearch Concept	BOUND Equivalent
`prepare.py` is read-only	DANGER ZONE: `prepare.py`
Only `train.py` is modified	IRON LAW: only `train.py` is modified
5-minute training budget	IRON LAW: training time exactly 5 minutes
val_bpb is the metric	IRON LAW: val_bpb is the only metric
Regression reverts	IRON LAW: regressions revert to previous
No new dependencies	NEVER DO: never install new packages
Log every experiment	IRON LAW: every experiment is logged

This mapping demonstrates that autoresearch was implicitly using a BOUND system all along — Ouro Loop makes that structure explicit and enforceable.

What the BOUND Teaches

This ML research BOUND demonstrates several patterns specific to experiment-driven development:

Fixed Evaluation Harness

The evaluation function (evaluate_bpb()) and data pipeline are DANGER ZONES because modifying them invalidates all previous comparisons. In ML research, the integrity of the evaluation is more important than any individual experiment result. If the agent could modify the evaluation function, it could "improve" val_bpb by lowering the bar rather than improving the model.
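One way to make this DANGER ZONE checkable is to pin a content hash of the harness before the loop starts and compare it before each run. This is a sketch under that assumption, not something Ouro Loop or autoresearch is known to do. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of the file's current contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def harness_unchanged(path: Path, expected_digest: str) -> bool:
    """True if the file still matches the digest pinned at the start."""
    return file_digest(path) == expected_digest
```

In use, the digest of `prepare.py` would be recorded once, and any experiment for which `harness_unchanged` returns `False` would be rejected before its val_bpb is compared with earlier results.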

Single Modifiable File

Constraining the agent to only modify train.py is an extreme form of BOUND — it limits the entire creative space to a single file. This constraint is powerful because it forces architectural creativity within a narrow scope. The agent must find better training strategies, not better infrastructure.
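The single-file constraint reduces to an allowlist check over the paths an experiment touched. The sketch below assumes the caller already has the list of changed paths, for example from `git diff --name-only`. The names `ALLOWED_PATHS` and `out_of_bound_paths` are hypothetical.

```python
from typing import Iterable, List

# Per the IRON LAW, train.py is the only file the agent may modify.
ALLOWED_PATHS = frozenset({"train.py"})


def out_of_bound_paths(changed: Iterable[str]) -> List[str]:
    """Return the changed paths that fall outside the allowlist, sorted."""
    return sorted(p for p in changed if p not in ALLOWED_PATHS)
```

An empty result means the experiment stayed inside the BOUND. Any returned path is a violation that should block the experiment.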

Time Budget as an IRON LAW

The 5-minute training budget plays the same role that a test coverage threshold plays in a financial system: it is a fixed bar the agent cannot negotiate. Here it prevents the agent from buying a lower val_bpb with more compute. Without this constraint, the agent might "improve" val_bpb simply by training for longer, which would not be a genuine algorithmic improvement.
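A fixed wall-clock budget can be sketched as a loop that stops starting new steps once a deadline passes. This is an illustration of the idea only and not necessarily how `train.py` enforces its budget. The function `run_with_budget` and its injectable `clock` parameter are hypothetical.

```python
import time
from typing import Callable


def run_with_budget(step: Callable[[], None], budget_seconds: float,
                    clock: Callable[[], float] = time.monotonic) -> int:
    """Call step() until the wall-clock budget is spent; return the step count."""
    deadline = clock() + budget_seconds
    steps = 0
    # The deadline is checked only between steps, so a step that is
    # already running when the deadline passes will finish late.
    while clock() < deadline:
        step()
        steps += 1
    return steps
```

With `budget_seconds=300`, a faster training step completes more steps in the same five minutes, so an efficiency gain shows up as a better val_bpb rather than a shorter run.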

Revert-on-Regression

The IRON LAW "regressions revert to previous" implements autoresearch's core loop: try something, measure, keep or discard. This is autonomous remediation in its simplest form — the agent does not need to diagnose why a regression happened, it just reverts and tries something different.
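The keep-or-discard decision itself is small enough to write down. In this sketch the `keep` status comes from the result log shown earlier. The `discard` and `crash` labels, the `Experiment` type, and the `decide` function are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Experiment:
    commit: str
    val_bpb: Optional[float]  # None if the run crashed or printed no metric


def decide(best: Experiment, candidate: Experiment) -> str:
    """Compare a candidate against the current best; lower val_bpb is better."""
    if candidate.val_bpb is None:
        return "crash"    # hypothetical status: no metric, so revert
    if candidate.val_bpb < best.val_bpb:
        return "keep"     # improvement: the candidate commit becomes the new best
    return "discard"      # hypothetical status: regression or tie, so revert
```

On any result other than `keep`, the caller would return the working tree to `best.commit`, for instance with `git reset --hard`, and move on to the next idea without diagnosing the failure.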

Simplicity Metric

The rule "equal val_bpb with less code is a positive result" is unusual — most systems optimize for a single numeric metric. This IRON LAW adds a second dimension: code simplicity. It prevents the agent from adding complexity that does not improve the metric, which is a common failure mode in ML experiment iteration.
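The simplicity rule can be read as a tie-breaker on code size. The sketch below uses the count of non-blank, non-comment lines as a stand-in for "less code", which is an assumption because the source does not define how simplicity is measured. The `tolerance` parameter is also an assumption, and its default of 0.0 means exact equality.

```python
def count_code_lines(source: str) -> int:
    """Count lines that are neither blank nor pure '#' comments."""
    return sum(
        1 for line in source.splitlines()
        if line.strip() and not line.strip().startswith("#")
    )


def is_positive_result(best_bpb: float, best_loc: int,
                       cand_bpb: float, cand_loc: int,
                       tolerance: float = 0.0) -> bool:
    """True for a lower val_bpb, or an equal val_bpb reached with less code."""
    if cand_bpb < best_bpb:
        return True  # genuine metric improvement
    # Tie within tolerance: the result is positive only if the code shrank.
    return cand_bpb - best_bpb <= tolerance and cand_loc < best_loc
```

Under this reading, a change that deletes code without moving val_bpb is kept, while a change that adds code without moving val_bpb is not.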

Applicable Domains

This BOUND pattern applies to any autonomous experiment loop:

ML model training and hyperparameter search
Compiler optimization passes
Algorithm benchmarking and comparison
Performance regression testing
A/B test experiment frameworks

The key pattern is: fix the evaluation, fix the budget, let the agent iterate freely on the implementation.
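Stripped of the ML specifics, the pattern is a loop over candidate implementations with a fixed scoring function. The sketch below is domain-agnostic and assumes that lower scores are better, as with val_bpb. The function `iterate` and its parameters are hypothetical.

```python
from typing import Callable, Iterable, Tuple, TypeVar

T = TypeVar("T")


def iterate(baseline: T, candidates: Iterable[T],
            evaluate: Callable[[T], float]) -> Tuple[T, float]:
    """Return the best implementation seen and its score (lower is better)."""
    best, best_score = baseline, evaluate(baseline)
    for candidate in candidates:
        score = evaluate(candidate)   # the evaluation is fixed for every candidate
        if score < best_score:        # keep improvements, drop everything else
            best, best_score = candidate, score
    return best, best_score
```

The `evaluate` callable is passed in and never modified by the loop, which mirrors the rule that the agent may change the implementation but not the harness that scores it.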