Process RL · Poolside Laguna Hackathon · May 2026
Can we improve agent behavior by training the control loop, not just the task skill layer? Process RL turns observed terminal-agent failures into executable environments designed for training the control loop over skills the model already partially expresses. This release covers the baseline analysis, generated environments, executable gates, and reward design.
Motivation
Terminal agents can often perform the local operations a task requires, but they do not reliably bind those operations into task-level control. Meta-control is the behavioral layer that binds atomic skills into task completion: tracking progress, pivoting after failed approaches, verifying semantic state, and stopping at the right time. Process RL targets recurring meta-control failures by converting observed traces into executable environments where the shortest reliable solution path forces the missing control behavior. The point is concrete: Laguna can expose the right local move, then fail to carry it through terminal state. The trace below shows one instance. The model names the right control move, but the action never changes.
if ':' in stripped and not stripped.endswith(',') a…if ':' in stripped and not stripped.endswith(',') a…if ':' in stripped and not stripped.endswith(',') a…if ':' in stripped and not stripped.endswith(',') a…if ':' in stripped and not stripped.endswith(',') a…if ':' in stripped and not stripped.endswith(',') a…Baseline
Across the 100-task Laguna-XS.2 baseline, failures cluster around meta-control transitions the model can often describe but does not reliably execute.
The baseline separates two capabilities. Latent, atomic capability is whether the model can write a parser, inspect a file, run tests, edit code, and read an error. Realized terminal-agent capability is whether it maintains a progress ledger, changes action class after a failed approach, verifies the actual deliverable, and stops at the right time. The 28 clean solves are the positive control. Basic terminal operations are present. The failures sit in the binding between those operations and task-level control.
At the command level Laguna runs a closed loop and often fixes an immediate error on the next turn. At the task level the loop is open. No durable progress ledger spans the whole task, and the same six failure shapes recur. The repeated pivot above is the most common. A second is premature completion, where in bash-log-processor-fix the model writes “the script is complete and working” and stops without the check that would show the work is incomplete.
exit_code=0 as successTreats a clean command exit as task completion without inspecting the requested deliverable.
Re-issues the same or byte-identical action after the observation has stopped changing, until the turn cap.
Verbalizes a pivot while emitting the same action class, so state never advances.
Ends after one or two turns when the evidence so far cannot justify completion.
Issues many plausible commands with no maintained progress ledger to converge them.
After a migration or rewrite, verifies the stale source of truth instead of the new one.
These shapes appear across parsing, search, servers, migrations, estimation, and code repair. Because the same control failures recur across task categories, the deficit reads like a control behavior rather than one missing task skill, which is what makes it a candidate for training.
Behavior axes
The failure taxonomy becomes the axes a generated environment must exercise. Each axis is a control behavior the model already expresses sometimes. The training hypothesis is that targeted, gated environments can make those behaviors more reliable.
Inspect the semantic deliverable; never treat a clean command exit as success.
After a dead end, change the action class in a way that advances state.
Convert high action diversity into convergence rather than wandering.
Continue while feedback is insufficient; stop only after verified completion.
Verify the new source of truth, not stale state.
Track satisfied and unsatisfied constraints across turns.
Method
Use observed terminal-agent failure traces to generate behavior-conditioned executable environments for meta-control training. The pipeline conditions the Endless Terminals generation funnel on cards abstracted from Laguna's own failures (not hand-written tasks) and admits only environments that actually run.
The conditioning unit is a behavior card. Each of the six capability axes is one card that records the observed Laguna behavior, the TBLite traces it was drawn from, the affordance a generated task must contain, the required pivot, and the reward observable. The card content comes from the LLM trace analysis above. The card is then frozen, and the funnel samples cards by a fixed seed, so a given seed reproduces the same conditioning. No model invents a card at generation time.
The generation model conditions on the card rather than the raw rollouts. It writes a realistic task that a capable terminal agent can solve, where the shortest reliable path has to exercise the target capability. The full pipeline lives in the endless-terminals repository, and each stage gates the next.
Read Laguna-XS.2 TBLite rollouts turn by turn and extract quote-grounded meta-control failures.
Distill each failure axis into a card with observed behavior, source traces, required pivot, task affordance, and reward observable.
Condition on the card, not raw traces, so each task isolates one recurring control behavior.
Emit task instructions, container setup, initial-state checks, final-state verifiers, and privileged checkpoint truth.
Build the container, start it under Apptainer, pass initial state, run a benign command, and exercise the final verifier.
Run repeated pass@k and keep the trainable band: neither always solved nor always impossible.
Use a stronger reference only for Laguna-zero tasks to separate too-hard from broken.
Admit training tasks only when grouped rollouts produce live reward variance.
Export Harbor / Prime-RL wrappers only after executable admission and calibration pass.
The reward is a conservative spine. Final success dominates; shaping is gated so a partial-progress harvester that never finishes scores well below a completed task that stops correctly.
Anti-hack rule. Positive checkpoint reward requires ordered prefix advancement, not independent checkpoint farming, and final success must remain reachable. Final-state assertions are ordered and prefix-gated:
The released training configuration uses Prime-RL for async grouped rollouts, Verifiers for terminal-harness scoring, the
Laguna renderer for native tool formatting, a vLLM backend with weight sync, and W&B plus Weave
for metrics and rollout lineage. The training environment is environments/meta_control.
It varies the reward shape, not the optimizer: a gated baseline, a stronger anti-inertia
penalty, and a stop-verification emphasis, at a conservative learning rate near 1e-6.
Gate zero is the XML execution contract. Laguna emits XML-like native tool calls in assistant content. The harness must prove execution, not merely renderer availability:
# raw Laguna XML must drive real state change, end to end raw Laguna XML content → parsed tool-call metadata → sandbox command execution → trajectory records action and observation → verifier / reward signal changes after state mutation
If XML is treated as plain text, rollouts become “no tools,” state never changes, and reward variance collapses. The trajectory invariant follows: the assistant's emitted XML is preserved byte-for-byte in assistant content, and parsed tool metadata is added alongside it, never substituted in a way that mutates conversation history Prime-RL relies on for exact-prefix trajectories.
Training signal
We did not complete the full training run inside the hackathon window. The early curve below is included to show the first proof point: ProcessRL produces a non-flat reward signal under Laguna native tool execution.
Read this narrowly. The curve tests whether the released environments, wrappers, and verifier rewards can produce usable RL signal for Laguna. It is not a held-out transfer result, and it is not a claim that a trained policy has improved.
Artifacts
The released split is a candidate RL corpus for calibration and reward-design work. Each task packages a terminal environment, public instructions, hidden final-state checks, and verifier-facing state, with privileged truth kept out of the policy-visible prompt.
53 training and 14 held-out behavior-conditioned terminal environments generated from the trace analysis on this page.
releasedGenerator, behavior cards, executable-admission and reward-variance filters, calibration
scripts, export code, and the environments/meta_control training environment.
End-to-end path from traces to behavior cards to generated environments, executable gates, pass@k calibration, and the trainability filter.
releasedThe 100-task baseline, behavior taxonomy, and per-trace analyst reports, regenerated from
jobs/ by reports/build_index.py.
Training configs ship in the repo training environment.
Full traces
The headline charts above summarize the baseline. The complete per-task table and per-trace notes are behind the toggles below.
100 reconciled task result files; 92 reward-bearing trials; mean reward 0.336; 28 clean solves, 4 partials, 60 zero-reward, 8 runtime / infrastructure. Native tool calling, Harbor concurrency 8, 40-turn cap.
| task | score | turns | mode |
|---|---|---|---|
| amuse-install | 1 | 19 | Clean solve |
| anomaly-detection-ranking | 1 | 29 | Clean solve |
| auth_token_race_condition | 1 | 5 | Clean solve |
| basic-message-queue | 1 | 35 | Clean solve |
| broken-python | 1 | 15 | Clean solve |
| california-housing-api | 1 | 26 | Clean solve |
| chained-forensic-extraction_20260101_011957 | 1 | 23 | Clean solve |
| convolutional-layers | 1 | 13 | Clean solve |
| csv-json-jsonl-merger | 1 | 10 | Clean solve |
| grpc-plant-position-server | 1 | 12 | Clean solve |
| iris-dataset-classification | 1 | 6 | Clean solve |
| jsonl-aggregator | 1 | 5 | Clean solve |
| log-summary | 1 | 9 | Clean solve |
| maven-slf4j-conflict | 1 | 27 | Clean solve |
| mlflow-register | 1 | 12 | Clean solve |
| mtls-cert-rotation | 1 | 12 | Clean solve |
| okhttp-trailers-crash | 1 | 5 | Clean solve |
| prediction-model-evaluation | 1 | 21 | Clean solve |
| protein-sequence | 1 | 9 | Clean solve |
| raft-log-repair-concurrent-access | 1 | 3 | Clean solve |
| reproducibility-and-envsetup | 1 | 30 | Clean solve |
| schedule-vacation | 1 | 13 | Clean solve |
| sign-vector-game | 1 | 18 | Clean solve |
| simple-database-query-tool | 1 | 28 | Clean solve |
| smiles-data-lab | 1 | 13 | Clean solve |
| sql-injection-forensics | 1 | 8 | Clean solve |
| submission_a63937a5_20251224_152124 | 1 | 23 | Clean solve |
| sympy-bug-fix | 1 | 40 | Clean solve |
| bash-log-processor-fix | 0.747 | 33 | Partial progress |
| security-breach-incident-response | 0.683 | 17 | Partial progress |
| security-incident-log-analysis | 0.692 | 11 | Partial progress |
| tsl-test-case-generation | 0.780 | 40 | Partial progress |
| bracket-sequence-restoration | 0 | 40 | Genuine loop (40-turn cap) |
| pdf-table-parsing | 0 | 40 | Genuine loop (40-turn cap) |
| pgn-chess-repair-puzzles | 0 | 40 | Genuine loop (40-turn cap) |
| acl-permissions-inheritance | 0 | 40 | Unbounded search (40-turn cap) |
| api-endpoint-permission-canonicalizer | 0 | 40 | Unbounded search (40-turn cap) |
| book-portfolio-analysis | 0 | 40 | Unbounded search (40-turn cap) |
| corrupted-filesystem-recovery | 0 | 40 | Unbounded search (40-turn cap) |
| cpp-daemon-sighup-segfault | 0 | 40 | Unbounded search (40-turn cap) |
| db-migration-local-storage | 0 | 40 | Unbounded search (40-turn cap) |
| floor-plan-geometry | 0 | 40 | Unbounded search (40-turn cap) |
| game-of-stones | 0 | 40 | Unbounded search (40-turn cap) |
| git-repo-forensics | 0 | 40 | Unbounded search (40-turn cap) |
| multi-server-configuration | 0 | 40 | Unbounded search (40-turn cap) |
| parking-lot-pathfinding | 0 | 40 | Unbounded search (40-turn cap) |
| playing-card-recognition | 0 | 40 | Unbounded search (40-turn cap) |
| publisher-market-analysis | 0 | 40 | Unbounded search (40-turn cap) |
| react-typescript-debugg | 0 | 40 | Unbounded search (40-turn cap) |
| symlink-chain-traversal | 0 | 40 | Unbounded search (40-turn cap) |
| systemd-log-monitoring | 0 | 40 | Unbounded search (40-turn cap) |
| token-auth-websocket | 0 | 40 | Unbounded search (40-turn cap) |
| build-system-task-ordering | 0 | 2 | Early stop / protocol abort |
| hydra-debug-slurm-mode | 0 | 1 | Early stop / protocol abort |
| multi-labeller | 0 | 3 | Early stop / protocol abort |
| todos-api | 0 | 3 | Early stop / protocol abort |
| breast-cancer-mlflow | 0 | 15 | Runtime / infrastructure |
| container-registry-optimization | n/a | 0 | Runtime / infrastructure |
| cosign-keyless-signing | 0 | 17 | Runtime / infrastructure |
| ekf-localization | 0 | 38 | Runtime / infrastructure |
| etl_checkpoint_resume_bug | n/a | 0 | Runtime / infrastructure |
| fix-js-network-controller | 0 | 5 | Runtime / infrastructure |
| grid-pathfinding | 0 | 15 | Runtime / infrastructure |
| image-tile-identification | n/a | 22 | Runtime / infrastructure |
| legal-summary-extraction | n/a | 94 | Runtime / infrastructure |
| mech-system | 0 | 10 | Runtime / infrastructure |
| monorepo-changelog-cli | 0 | 11 | Runtime / infrastructure |
| network-log-normalization | n/a | 0 | Runtime / infrastructure |
| pandas-numpy-data-analysis | n/a | 13 | Runtime / infrastructure |
| permutation-construction-100k | 0 | 12 | Runtime / infrastructure |
| publisher-market-analysis-v2 | n/a | 64 | Runtime / infrastructure |
| python-api-rate-limit | 0 | 12 | Runtime / infrastructure |
| reverse-engineer-stack-vm | 0 | 5 | Runtime / infrastructure |
| rsa-jwt-token-api-redis-blacklist | 0 | 16 | Runtime / infrastructure |
| scan-linux-persistence-artifacts | 0 | 12 | Runtime / infrastructure |
| task-xxe-exploit | 0 | 6 | Runtime / infrastructure |
| vimscript-vim-quine | 0 | 10 | Runtime / infrastructure |
| word-derangement-mapping | n/a | 48 | Runtime / infrastructure |
| application-debug | 0 | 18 | Unresolved (no clean loop) |
| bandit-delayed-feedback | 0 | 8 | Unresolved (no clean loop) |
| battery-charging-optimization | 0 | 4 | Unresolved (no clean loop) |
| bloom-filter-cache-penetration-prevention | 0 | 6 | Unresolved (no clean loop) |
| build-merkle-tree-cli-sha512 | 0 | 5 | Unresolved (no clean loop) |
| competitive-programming-solver | 0 | 15 | Unresolved (no clean loop) |
| cryptographic-protocol-verifier | 0 | 6 | Unresolved (no clean loop) |
| distributed-test-execution-scheduler | 0 | 17 | Unresolved (no clean loop) |
| fix_async_worker_queue | 0 | 13 | Unresolved (no clean loop) |
| html-index-analysis | 0 | 5 | Unresolved (no clean loop) |
| industrial-kiln-controller | 0 | 21 | Unresolved (no clean loop) |
| iot-device-registration-server | 0 | 25 | Unresolved (no clean loop) |
| jq-data-processing | 0 | 15 | Unresolved (no clean loop) |
| malicious-package-forensics | 0 | 14 | Unresolved (no clean loop) |
| neural-architecture-search-final | 0 | 6 | Unresolved (no clean loop) |
| neutron-submission | 0 | 6 | Unresolved (no clean loop) |
| pandas-etl | 0 | 12 | Unresolved (no clean loop) |
| sakila-sqlite-queries | 0 | 20 | Unresolved (no clean loop) |
| sales-data-csv-analysis | 0 | 4 | Unresolved (no clean loop) |
| server-log-analysis | 0 | 4 | Unresolved (no clean loop) |
| service-deployment-wave-planner | 0 | 7 | Unresolved (no clean loop) |
| supply-chain-fulfillment | 0 | 11 | Unresolved (no clean loop) |
Inspect, build, verify, clean stop. These runs show the latent skills are present.
Solved in 19 turns with 17 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.
Solved in 29 turns with 28 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.
Solved in 5 turns with 3 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.
Solved in 35 turns with 31 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.
Solved in 15 turns with 14 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.
Solved in 26 turns with 25 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.
Solved in 23 turns with 0 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.
Solved in 13 turns with 12 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.
Solved in 10 turns with 9 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.
Solved in 12 turns with 10 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.
Solved in 6 turns with 0 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.
Solved in 5 turns with 4 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.
Solved in 9 turns with 8 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.
Solved in 27 turns with 23 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.
Solved in 12 turns with 11 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.
Solved in 12 turns with 0 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.
Solved in 5 turns with 4 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.
Solved in 21 turns with 0 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.
Solved in 9 turns with 7 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.
Solved in 3 turns with 0 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.
Solved in 30 turns with 25 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.
Solved in 13 turns with 12 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.
Solved in 18 turns with 0 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.
Solved in 28 turns with 26 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.
Solved in 13 turns with 12 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.
Solved in 8 turns with 0 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.
Solved in 23 turns with 21 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.
Solved in 40 turns with 40 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.
Verifier gives partial credit: useful curriculum material if the missing checks are explicit.
Partial verifier reward 0.780 in 40 turns. This is high-value curriculum material because the trace likely contains real progress plus a missing final verification, edge-case check, or completion criterion.
Partial verifier reward 0.747 in 33 turns. This is high-value curriculum material because the trace likely contains real progress plus a missing final verification, edge-case check, or completion criterion.
Partial verifier reward 0.692 in 11 turns. This is high-value curriculum material because the trace likely contains real progress plus a missing final verification, edge-case check, or completion criterion.
Partial verifier reward 0.683 in 17 turns. This is high-value curriculum material because the trace likely contains real progress plus a missing final verification, edge-case check, or completion criterion.
Burns the turn cap with repeated or near-repeated actions; the recognition-to-action coupling target.
Zero reward after 40 turns. The dominant action accounts for 50% of tool calls with 0 adjacent exact repeats, so the failure is action inertia under stale evidence rather than missing atomic terminal skill.
Zero reward after 40 turns. The dominant action accounts for 75% of tool calls with 29 adjacent exact repeats, so the failure is action inertia under stale evidence rather than missing atomic terminal skill.
Zero reward after 40 turns. The dominant action accounts for 75% of tool calls with 29 adjacent exact repeats, so the failure is action inertia under stale evidence rather than missing atomic terminal skill.
Uses the whole budget without enough repetition to be a pure loop; progress-ledger and search-discipline target.
Zero reward at the turn cap with 40 unique commands. This looks less like a byte-identical loop and more like unbounded search without a maintained progress ledger.
Zero reward at the turn cap with 26 unique commands. This looks less like a byte-identical loop and more like unbounded search without a maintained progress ledger.
Zero reward at the turn cap with 13 unique commands. This looks less like a byte-identical loop and more like unbounded search without a maintained progress ledger.
Zero reward at the turn cap with 0 unique commands. This looks less like a byte-identical loop and more like unbounded search without a maintained progress ledger.
Zero reward at the turn cap with 23 unique commands. This looks less like a byte-identical loop and more like unbounded search without a maintained progress ledger.
Zero reward at the turn cap with 31 unique commands. This looks less like a byte-identical loop and more like unbounded search without a maintained progress ledger.
Zero reward at the turn cap with 0 unique commands. This looks less like a byte-identical loop and more like unbounded search without a maintained progress ledger.
Zero reward at the turn cap with 0 unique commands. This looks less like a byte-identical loop and more like unbounded search without a maintained progress ledger.
Zero reward at the turn cap with 39 unique commands. This looks less like a byte-identical loop and more like unbounded search without a maintained progress ledger.
Zero reward at the turn cap with 38 unique commands. This looks less like a byte-identical loop and more like unbounded search without a maintained progress ledger.
Zero reward at the turn cap with 40 unique commands. This looks less like a byte-identical loop and more like unbounded search without a maintained progress ledger.
Zero reward at the turn cap with 0 unique commands. This looks less like a byte-identical loop and more like unbounded search without a maintained progress ledger.
Zero reward at the turn cap with 0 unique commands. This looks less like a byte-identical loop and more like unbounded search without a maintained progress ledger.
Zero reward at the turn cap with 39 unique commands. This looks less like a byte-identical loop and more like unbounded search without a maintained progress ledger.
Zero reward at the turn cap with 34 unique commands. This looks less like a byte-identical loop and more like unbounded search without a maintained progress ledger.
Zero reward at the turn cap with 37 unique commands. This looks less like a byte-identical loop and more like unbounded search without a maintained progress ledger.
Zero reward at the turn cap with 0 unique commands. This looks less like a byte-identical loop and more like unbounded search without a maintained progress ledger.
Stops or aborts too early; stop-quality and tool-call robustness target.
Zero reward after only 2 turns. Treat as stop-quality or protocol-abort evidence until the raw trace proves it was a genuine task-level decision.
Zero reward after only 1 turns. Treat as stop-quality or protocol-abort evidence until the raw trace proves it was a genuine task-level decision.
Zero reward after only 3 turns. Treat as stop-quality or protocol-abort evidence until the raw trace proves it was a genuine task-level decision.
Zero reward after only 3 turns. Treat as stop-quality or protocol-abort evidence until the raw trace proves it was a genuine task-level decision.
Trial or runtime exceptions; useful for harness triage, not model-behavior claims.
Trial produced no clean behavioral verdict after 15 turns. Exception head: Traceback (most recent call last): Keep out of training and evaluation claims unless a rerun succeeds.
Trial produced no clean behavioral verdict after 0 turns. Exception head: Traceback (most recent call last): Keep out of training and evaluation claims unless a rerun succeeds.
Trial produced no clean behavioral verdict after 17 turns. Exception head: Traceback (most recent call last): Keep out of training and evaluation claims unless a rerun succeeds.
Trial produced no clean behavioral verdict after 38 turns. Exception head: Traceback (most recent call last): Keep out of training and evaluation claims unless a rerun succeeds.
Trial produced no clean behavioral verdict after 0 turns. Exception head: Traceback (most recent call last): Keep out of training and evaluation claims unless a rerun succeeds.
Trial produced no clean behavioral verdict after 5 turns. Exception head: Traceback (most recent call last): Keep out of training and evaluation claims unless a rerun succeeds.
Trial produced no clean behavioral verdict after 15 turns. Exception head: Traceback (most recent call last): Keep out of training and evaluation claims unless a rerun succeeds.
Trial produced no clean behavioral verdict after 22 turns. Exception head: Traceback (most recent call last): Keep out of training and evaluation claims unless a rerun succeeds.
Trial produced no clean behavioral verdict after 94 turns. Exception head: Traceback (most recent call last): Keep out of training and evaluation claims unless a rerun succeeds.
Trial produced no clean behavioral verdict after 10 turns. Exception head: Traceback (most recent call last): Keep out of training and evaluation claims unless a rerun succeeds.
Trial produced no clean behavioral verdict after 11 turns. Exception head: Traceback (most recent call last): Keep out of training and evaluation claims unless a rerun succeeds.
Trial produced no clean behavioral verdict after 0 turns. Exception head: Traceback (most recent call last): Keep out of training and evaluation claims unless a rerun succeeds.
Trial produced no clean behavioral verdict after 13 turns. Exception head: Traceback (most recent call last): Keep out of training and evaluation claims unless a rerun succeeds.
Trial produced no clean behavioral verdict after 12 turns. Exception head: Traceback (most recent call last): Keep out of training and evaluation claims unless a rerun succeeds.
Trial produced no clean behavioral verdict after 64 turns. Exception head: Traceback (most recent call last): Keep out of training and evaluation claims unless a rerun succeeds.
Trial produced no clean behavioral verdict after 12 turns. Exception head: Traceback (most recent call last): Keep out of training and evaluation claims unless a rerun succeeds.
Trial produced no clean behavioral verdict after 5 turns. Exception head: Traceback (most recent call last): Keep out of training and evaluation claims unless a rerun succeeds.
Trial produced no clean behavioral verdict after 16 turns. Exception head: Traceback (most recent call last): Keep out of training and evaluation claims unless a rerun succeeds.
Trial produced no clean behavioral verdict after 12 turns. Exception head: Traceback (most recent call last): Keep out of training and evaluation claims unless a rerun succeeds.
Trial produced no clean behavioral verdict after 6 turns. Exception head: Traceback (most recent call last): Keep out of training and evaluation claims unless a rerun succeeds.
Trial produced no clean behavioral verdict after 10 turns. Exception head: Traceback (most recent call last): Keep out of training and evaluation claims unless a rerun succeeds.
Trial produced no clean behavioral verdict after 48 turns. Exception head: Traceback (most recent call last): Keep out of training and evaluation claims unless a rerun succeeds.
Zero reward before the cap without a clean loop signature; adjacent capability expression target.
Zero reward after 18 turns with 15 unique commands. The failure is adjacent to capability expression: enough action diversity to avoid pure-loop labeling, but no verified deliverable.
Zero reward after 8 turns with 7 unique commands. The failure is adjacent to capability expression: enough action diversity to avoid pure-loop labeling, but no verified deliverable.
Zero reward after 4 turns with 3 unique commands. The failure is adjacent to capability expression: enough action diversity to avoid pure-loop labeling, but no verified deliverable.
Zero reward after 6 turns with 5 unique commands. The failure is adjacent to capability expression: enough action diversity to avoid pure-loop labeling, but no verified deliverable.
Zero reward after 5 turns with 4 unique commands. The failure is adjacent to capability expression: enough action diversity to avoid pure-loop labeling, but no verified deliverable.
Zero reward after 15 turns with 14 unique commands. The failure is adjacent to capability expression: enough action diversity to avoid pure-loop labeling, but no verified deliverable.
Zero reward after 6 turns with 4 unique commands. The failure is adjacent to capability expression: enough action diversity to avoid pure-loop labeling, but no verified deliverable.
Zero reward after 17 turns with 16 unique commands. The failure is adjacent to capability expression: enough action diversity to avoid pure-loop labeling, but no verified deliverable.
Zero reward after 13 turns with 0 unique commands. The failure is adjacent to capability expression: enough action diversity to avoid pure-loop labeling, but no verified deliverable.
Zero reward after 5 turns with 4 unique commands. The failure is adjacent to capability expression: enough action diversity to avoid pure-loop labeling, but no verified deliverable.
Zero reward after 21 turns with 20 unique commands. The failure is adjacent to capability expression: enough action diversity to avoid pure-loop labeling, but no verified deliverable.
Zero reward after 25 turns with 22 unique commands. The failure is adjacent to capability expression: enough action diversity to avoid pure-loop labeling, but no verified deliverable.
Zero reward after 15 turns with 14 unique commands. The failure is adjacent to capability expression: enough action diversity to avoid pure-loop labeling, but no verified deliverable.
Zero reward after 14 turns with 13 unique commands. The failure is adjacent to capability expression: enough action diversity to avoid pure-loop labeling, but no verified deliverable.
Zero reward after 6 turns with 5 unique commands. The failure is adjacent to capability expression: enough action diversity to avoid pure-loop labeling, but no verified deliverable.
Zero reward after 6 turns with 5 unique commands. The failure is adjacent to capability expression: enough action diversity to avoid pure-loop labeling, but no verified deliverable.
Zero reward after 12 turns with 8 unique commands. The failure is adjacent to capability expression: enough action diversity to avoid pure-loop labeling, but no verified deliverable.
Zero reward after 20 turns with 17 unique commands. The failure is adjacent to capability expression: enough action diversity to avoid pure-loop labeling, but no verified deliverable.
Zero reward after 4 turns with 3 unique commands. The failure is adjacent to capability expression: enough action diversity to avoid pure-loop labeling, but no verified deliverable.
Zero reward after 4 turns with 3 unique commands. The failure is adjacent to capability expression: enough action diversity to avoid pure-loop labeling, but no verified deliverable.
Zero reward after 7 turns with 6 unique commands. The failure is adjacent to capability expression: enough action diversity to avoid pure-loop labeling, but no verified deliverable.
Zero reward after 11 turns with 10 unique commands. The failure is adjacent to capability expression: enough action diversity to avoid pure-loop labeling, but no verified deliverable.