Process RL · Poolside Laguna Hackathon · May 2026

Behavior-conditioned environments for terminal-agent meta-control.

Can we improve agent behavior by training the control loop, not just the task skill layer? Process RL turns observed terminal-agent failures into executable environments designed for training the control loop over skills the model already partially expresses. This release covers the baseline analysis, generated environments, executable gates, and reward design.

Marius Kiskemper · Henrike Kone · Mert Acar · Jarrod Barnes

Motivation

Terminal agents can often perform the local operations a task requires, but they do not reliably bind those operations into task-level control. Meta-control is the behavioral layer that binds atomic skills into task completion: tracking progress, pivoting after failed approaches, verifying semantic state, and stopping at the right time. Process RL targets recurring meta-control failures by converting observed traces into executable environments where the shortest reliable solution path forces the missing control behavior. The problem is concrete: Laguna can name the right local move and then fail to carry it through to a terminal state. The trace below shows one instance, in which the model names the right control move but the action never changes.

book-portfolio-analysis last 6 of 40 turns · score 0
turn | reasoning (verbatim) | action
T35 | “Let me try a different approach - I'll manually fix the JSON files:” | if ':' in stripped and not stripped.endswith(',') a…
T36 | “Let me try a different approach - I'll manually fix the JSON files by adding the missing commas:” | if ':' in stripped and not stripped.endswith(',') a…
T37 | “Let me try a different approach - I'll manually fix the JSON files by adding the missing commas:” | if ':' in stripped and not stripped.endswith(',') a…
T38 | “I see the issue - the JSON has missing commas between objects. Let me fix this properly:” | if ':' in stripped and not stripped.endswith(',') a…
T39 | “Let me try a different approach - I'll manually fix the JSON files by adding the missing commas:” | if ':' in stripped and not stripped.endswith(',') a…
T40 | “Let me try a different approach - I'll manually fix the JSON files by adding the missing commas:” | if ':' in stripped and not stripped.endswith(',') a…
Across these turns the action is the same 1,420-character fixer. The line shown is byte-identical every turn, while the reasoning keeps announcing a different approach. The run reaches the 40-turn cap with score 0.
Meta-control is the model’s ability to track task progress, decide when its current strategy is failing, change actions accordingly, verify the deliverable, and stop at the right time.

Baseline

Execution skill is not task control

Across the 100-task Laguna-XS.2 baseline, failures cluster around meta-control transitions the model can often describe but does not reliably execute.

0.336 · mean reward across 92 reward-bearing trials
28/100 · clean solves
21 · zero-reward runs at the 40-turn cap

The baseline separates two capabilities. Latent, atomic capability is whether the model can write a parser, inspect a file, run tests, edit code, and read an error. Realized terminal-agent capability is whether it maintains a progress ledger, changes action class after a failed approach, verifies the actual deliverable, and stops at the right time. The 28 clean solves are the positive control. Basic terminal operations are present. The failures sit in the binding between those operations and task-level control.

At the command level Laguna runs a closed loop and often fixes an immediate error on the next turn. At the task level the loop is open. No durable progress ledger spans the whole task, and the same six failure shapes recur. The repeated pivot above is the most common. A second is premature completion: in bash-log-processor-fix, the model writes “the script is complete and working” and stops without running the check that would show the work is incomplete.

Reads exit_code=0 as success

Treats a clean command exit as task completion without inspecting the requested deliverable.

Loops near-identical commands

Re-issues a byte-identical or near-identical action after the observation has stopped changing, until the turn cap.

Says “different approach,” repeats the action

Verbalizes a pivot while emitting the same action class, so state never advances.

Stops prematurely under sparse feedback

Ends after one or two turns when the evidence so far cannot justify completion.

Wanders without convergence

Issues many plausible commands with no maintained progress ledger to converge them.

Confuses old state with new

After a migration or rewrite, verifies the stale source of truth instead of the new one.
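
The third shape, a verbalized pivot with an unchanged action, can be checked mechanically once a trace is reduced to (reasoning, action) pairs. The sketch below shows one way to flag it. The `Turn` type, the `PIVOT_PHRASES` list, and the exact-match comparison are illustrative assumptions, not the analysis tooling used for this report.

```python
from dataclasses import dataclass

# Hypothetical marker phrases; the report's labels came from trace reading,
# not from string matching.
PIVOT_PHRASES = ("different approach", "fix this properly")


@dataclass
class Turn:
    reasoning: str  # what the model says it will do
    action: str     # the command or script it actually emits


def unbacked_pivots(turns: list[Turn]) -> list[int]:
    """Indices of turns that announce a pivot but repeat the previous action."""
    flagged = []
    for i in range(1, len(turns)):
        says_pivot = any(p in turns[i].reasoning.lower() for p in PIVOT_PHRASES)
        same_action = turns[i].action == turns[i - 1].action  # byte-identical
        if says_pivot and same_action:
            flagged.append(i)
    return flagged


fixer = "if ':' in stripped and not stripped.endswith(','): ..."
trace = [
    Turn("Let me try a different approach", fixer),
    Turn("Let me try a different approach", fixer),
    Turn("Let me fix this properly", fixer),
    Turn("Let me try a different approach", "python3 -m json.tool data.json"),
]
flagged = unbacked_pivots(trace)
```

In this toy trace the second and third turns are flagged, while the last is not because its action actually changes. A near-duplicate comparison (for example, on normalized commands) would be needed to catch near-identical actions.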

Horizontal bar chart of Laguna-XS.2 behavior modes across 100 TBLite tasks: clean-solve 28, unresolved capability miss 22, runtime/infrastructure 22, unbounded search at the 40-turn cap 17, early stop 4, partial progress 4, genuine loop 3.
Every zero-reward run is labeled by behavior, computed from the reconciled latest-per-task result JSONs and tool-call logs. The control-failure modes (loops, unbounded search at the cap, and early stops) are the training targets. Runtime and infrastructure cases are excluded from model-behavior claims.
Stacked histogram of agent turns before a run ended, split by outcome. Solves resolve across low and middle turn counts; zero-reward runs concentrate in the 36-40 turn bin where 21 runs burn the 40-turn cap.
Solves resolve early and in the middle of the budget. 21 zero-reward runs burn all the way to the 40-turn cap without converging. Together with the trace analysis, this points to a missing stop-and-pivot signal rather than a missing terminal skill.

These shapes appear across parsing, search, servers, migrations, estimation, and code repair. Because the same control failures recur across task categories, the deficit reads as a missing control behavior rather than a single missing task skill, which is what makes it a candidate for training.
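
The per-trace notes further down report four counters for each run: tool calls, unique commands, dominant-action share, and adjacent exact repeats. A minimal sketch of how such counters could be computed from an ordered command list follows. The function name and the definition of dominant-action share (the most frequent command's count divided by total calls) are assumptions, though that definition is consistent with the reported values: 18 calls with 17 unique commands gives 2/18, or 11%.

```python
from collections import Counter


def trace_metrics(commands: list[str]) -> dict:
    """Summarize an ordered list of tool-call commands from one run."""
    total = len(commands)
    counts = Counter(commands)
    dominant = max(counts.values()) if counts else 0
    # Count positions where a command exactly repeats its predecessor.
    adjacent = sum(1 for a, b in zip(commands, commands[1:]) if a == b)
    return {
        "tool_calls": total,
        "unique_commands": len(counts),
        "dominant_action_share": dominant / total if total else 0.0,
        "adjacent_exact_repeats": adjacent,
    }


metrics = trace_metrics(["ls", "cat a.json", "ls", "ls"])
```

A run that loops shows a high dominant-action share and many adjacent repeats; a run that wanders shows many unique commands with no convergence, so these counters separate the two shapes only in combination with the outcome.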

Behavior axes

Six meta-control targets

The failure taxonomy becomes the axes a generated environment must exercise. Each axis is a control behavior the model already expresses sometimes. The training hypothesis is that targeted, gated environments can make those behaviors more reliable.

Verification-directed action

Inspect the semantic deliverable; never treat a clean command exit as success.

Recognition-to-action coupling

After a dead end, change the action class in a way that advances state.

Bounded search discipline

Convert high action diversity into convergence rather than wandering.

Stop / continue calibration

Continue while feedback is insufficient; stop only after verified completion.

Stateful migration verification

Verify the new source of truth, not stale state.

Long-horizon progress ledger

Track satisfied and unsatisfied constraints across turns.

Method

Trace failures become executable tasks

Process RL uses observed terminal-agent failure traces to generate behavior-conditioned executable environments for meta-control training. The pipeline conditions the Endless Terminals generation funnel on cards abstracted from Laguna's own failures, not on hand-written tasks, and admits only environments that actually run.

Process RL pipeline: behavior analysis, environment generation, executable admission, calibration, and released environments.
Behavioral analysis becomes generated terminal environments; only executable tasks are admitted, difficulty is calibrated against Laguna pass@k, and the trainable band is packaged for RL training and held-out evaluation.

The conditioning unit is a behavior card. Each of the six capability axes is one card that records the observed Laguna behavior, the TBLite traces it was drawn from, the affordance a generated task must contain, the required pivot, and the reward observable. The card content comes from the LLM trace analysis above. The card is then frozen, and the funnel samples cards by a fixed seed, so a given seed reproduces the same conditioning. No model invents a card at generation time.
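
A behavior card as described above can be pictured as an immutable record sampled with a seeded generator. The sketch below is illustrative only: the field names follow the paragraph, but the class, the `sample_cards` helper, and the example field values (including which trace a card cites) are assumptions rather than the repository's actual schema.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a card cannot be edited at generation time
class BehaviorCard:
    axis: str
    observed_behavior: str
    source_traces: tuple[str, ...]
    task_affordance: str
    required_pivot: str
    reward_observable: str


# Illustrative values; only the axis names and behavior text come from this page.
CARDS = [
    BehaviorCard(
        axis="Recognition-to-action coupling",
        observed_behavior="Verbalizes a pivot while emitting the same action class",
        source_traces=("book-portfolio-analysis",),
        task_affordance="a first approach that cannot succeed",  # hypothetical
        required_pivot="change the action class in a way that advances state",
        reward_observable="state change after the failed approach",  # hypothetical
    ),
    BehaviorCard(
        axis="Verification-directed action",
        observed_behavior="Treats a clean command exit as task completion",
        source_traces=("example-trace-id",),  # hypothetical placeholder
        task_affordance="a command that exits 0 with a wrong deliverable",  # hypothetical
        required_pivot="inspect the semantic deliverable",
        reward_observable="final verifier on the deliverable",  # hypothetical
    ),
]


def sample_cards(cards: list[BehaviorCard], n: int, seed: int) -> list[BehaviorCard]:
    """Draw n cards with replacement; the same seed reproduces the same draw."""
    rng = random.Random(seed)
    return [rng.choice(cards) for _ in range(n)]


batch = sample_cards(CARDS, n=5, seed=7)
```

Using a private `random.Random(seed)` instance rather than the module-level generator keeps the draw reproducible even if other code consumes random numbers.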

The generation model conditions on the card rather than the raw rollouts. It writes a realistic task that a capable terminal agent can solve, where the shortest reliable path has to exercise the target capability. The full pipeline lives in the endless-terminals repository, and each stage gates the next.

Trace analysis

Read Laguna-XS.2 TBLite rollouts turn by turn and extract quote-grounded meta-control failures.

Behavior cards

Distill each failure axis into a card with observed behavior, source traces, required pivot, task affordance, and reward observable.

Generator conditioning

Condition on the card, not raw traces, so each task isolates one recurring control behavior.

Terminal package

Emit task instructions, container setup, initial-state checks, final-state verifiers, and privileged checkpoint truth.

Executable admission

Build the container, start it under Apptainer, pass initial state, run a benign command, and exercise the final verifier.

Laguna calibration

Run repeated pass@k and keep the trainable band: neither always solved nor always impossible.

Reference check

Use a stronger reference only for Laguna-zero tasks to separate too-hard from broken.

Reward variance

Admit training tasks only when grouped rollouts produce live reward variance.

RL wrappers

Export Harbor / Prime-RL wrappers only after executable admission and calibration pass.
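
The calibration and reward-variance stages above amount to two filters over repeated rollouts. The following sketch shows one plausible reading. The bucket names, the function names, and the `eps` threshold are assumptions; the source states only that the trainable band is neither always solved nor always impossible and that grouped rollouts must show live reward variance.

```python
from statistics import pvariance


def pass_rate(outcomes: list[bool]) -> float:
    """Fraction of repeated Laguna attempts that passed the final verifier."""
    return sum(outcomes) / len(outcomes)


def calibration_bucket(outcomes: list[bool]) -> str:
    rate = pass_rate(outcomes)
    if rate == 0.0:
        return "laguna_zero"    # candidate for the stronger reference check
    if rate == 1.0:
        return "always_solved"  # no learning signal left
    return "trainable"


def has_live_reward_variance(group_rewards: list[float], eps: float = 1e-8) -> bool:
    """True when one task's rollout group does not collapse to a single reward."""
    return len(group_rewards) > 1 and pvariance(group_rewards) > eps


bucket = calibration_bucket([True, False, False, True])
live = has_live_reward_variance([0.0, 1.0, 0.0, 1.0])
```

Grouped-rollout methods compare rewards within a group, so a task whose rollouts all score the same contributes no gradient signal; that is the motivation for the variance filter in this sketch.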

Reward

The reward is a conservative spine. Final success dominates; shaping is gated so a partial-progress harvester that never finishes scores well below a completed task that stops correctly.

R = final_success + gated_progress + stop_quality − nonprogress_penalty − malformed_tool_penalty − turn_cost
final_success
Binary outcome of the hidden final verifier. Its weight dominates the spine.
gated_progress
Ordered longest-prefix checkpoint progress, not unordered independent partial credit.
stop_quality
Reward stopping only when the final verifier passes; penalize early stop and run-past-complete.
− nonprogress_penalty
Penalize unchanged state plus repeated or near-repeated actions after the same observation.
− malformed_tool_penalty
Small penalty for bad tool-call formatting.
− turn_cost
Mild capped cost to discourage waste without inducing premature stopping.
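
To make the shape of R concrete, here is a minimal sketch of the sum with placeholder weights. Every numeric weight, the turn-cost cap, and the encoding of the stop signal as +1, 0, or -1 are hypothetical; the source specifies only the terms, their signs, and that final success dominates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical values, chosen only so that final_success dominates.
    final_success: float = 1.0
    gated_progress: float = 0.2
    stop_quality: float = 0.1
    nonprogress_penalty: float = 0.02
    malformed_tool_penalty: float = 0.01
    turn_cost: float = 0.002
    turn_cost_cap: float = 0.05


def episode_reward(final_passed: bool, prefix_fraction: float, stop_signal: int,
                   nonprogress_steps: int, malformed_calls: int, turns: int,
                   w: RewardWeights = RewardWeights()) -> float:
    """stop_signal (assumed encoding): +1 clean stop after verified completion,
    -1 early stop or running past completion, 0 otherwise."""
    reward = w.final_success * float(final_passed)
    reward += w.gated_progress * prefix_fraction  # ordered-prefix fraction in [0, 1]
    reward += w.stop_quality * stop_signal
    reward -= w.nonprogress_penalty * nonprogress_steps
    reward -= w.malformed_tool_penalty * malformed_calls
    reward -= min(w.turn_cost * turns, w.turn_cost_cap)  # mild, capped
    return reward


# A partial-progress harvester that never finishes vs. a clean finisher.
harvester = episode_reward(False, 0.8, 0, nonprogress_steps=10, malformed_calls=0, turns=40)
finisher = episode_reward(True, 1.0, 1, nonprogress_steps=0, malformed_calls=0, turns=12)
```

Under these placeholder weights the finisher scores well above the harvester, which is the ordering the reward design asks for.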

Anti-hack rule. Positive checkpoint reward requires ordered prefix advancement, not independent checkpoint farming, and final success must remain reachable. Final-state assertions are ordered and prefix-gated:

Environment validity → Semantic repair or state creation → Downstream query or probe over repaired state → Final artifact → Verified done-state
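
Ordered longest-prefix credit can be sketched as follows: progress counts only the unbroken run of passed checkpoints from the start of the chain, so a later checkpoint that passes in isolation earns nothing. The checkpoint identifiers are paraphrased from the chain above, and the function itself is an illustrative assumption rather than the released verifier code.

```python
from itertools import takewhile

# Order matters: each checkpoint gates the ones after it.
CHECKPOINTS = [
    "environment_validity",
    "semantic_repair_or_state_creation",
    "downstream_query_over_repaired_state",
    "final_artifact",
    "verified_done_state",
]


def gated_progress(passed: dict[str, bool]) -> float:
    """Fraction of checkpoints in the longest passing prefix."""
    prefix = list(takewhile(lambda name: passed.get(name, False), CHECKPOINTS))
    return len(prefix) / len(CHECKPOINTS)


# The final artifact exists, but the downstream query fails: credit stops at 2 of 5.
farmed = gated_progress({
    "environment_validity": True,
    "semantic_repair_or_state_creation": True,
    "downstream_query_over_repaired_state": False,
    "final_artifact": True,
})
```

With unordered partial credit the same run would score 3 of 5; prefix gating removes the incentive to farm independent checkpoints.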

Training stack

The released training configuration uses Prime-RL for async grouped rollouts, Verifiers for terminal-harness scoring, the Laguna renderer for native tool formatting, a vLLM backend with weight sync, and W&B plus Weave for metrics and rollout lineage. The training environment is environments/meta_control. The configuration varies the reward shape, not the optimizer, across three variants: a gated baseline, a stronger anti-inertia penalty, and a stop-verification emphasis, all at a conservative learning rate near 1e-6.

Gate zero is the XML execution contract. Laguna emits XML-like native tool calls in assistant content. The harness must prove execution, not merely renderer availability:

# raw Laguna XML must drive real state change, end to end
raw Laguna XML content
   → parsed tool-call metadata
   → sandbox command execution
   → trajectory records action and observation
   → verifier / reward signal changes after state mutation

If XML is treated as plain text, rollouts become “no tools,” state never changes, and reward variance collapses. The trajectory invariant follows: the assistant's emitted XML is preserved byte-for-byte in assistant content, and parsed tool metadata is added alongside it. The metadata never replaces the content in a way that mutates the conversation history Prime-RL relies on for exact-prefix trajectories.
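
The invariant can be illustrated with a small parser that attaches tool-call metadata without touching the raw content. This is a sketch under loud assumptions: the `<tool_call name="...">` tag shape, the `AssistantMessage` type, and `attach_tool_calls` are all hypothetical, since Laguna's actual tool-call syntax and the harness API are not shown on this page.

```python
import re
from dataclasses import dataclass, field

# Hypothetical tag shape, for illustration only; not Laguna's real syntax.
TOOL_CALL = re.compile(
    r'<tool_call name="(?P<name>[\w-]+)">(?P<body>.*?)</tool_call>', re.DOTALL
)


@dataclass
class AssistantMessage:
    content: str                       # raw model output, never rewritten
    tool_calls: list[dict] = field(default_factory=list)  # added alongside


def attach_tool_calls(raw: str) -> AssistantMessage:
    """Parse tool calls out of raw content while preserving it byte-for-byte."""
    calls = [
        {"name": m["name"], "arguments": m["body"].strip()}
        for m in TOOL_CALL.finditer(raw)
    ]
    return AssistantMessage(content=raw, tool_calls=calls)


raw = 'I will list the files.\n<tool_call name="bash">ls -la</tool_call>'
msg = attach_tool_calls(raw)
plain = attach_tool_calls("I will list the files.")
```

Here `msg.content` is still exactly `raw`, so any later exact-prefix comparison over the conversation still matches, while `plain.tool_calls` is empty, which is the “no tools” failure the gate is meant to detect.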

Training signal

Early reward signal, not transfer

We did not complete the full training run inside the hackathon window. The early curve below is included to show the first proof point: Process RL produces a non-flat reward signal under Laguna native tool execution.

Read this narrowly. The curve tests whether the released environments, wrappers, and verifier rewards can produce usable RL signal for Laguna. It is not a held-out transfer result, and it is not a claim that a trained policy has improved.

Early reward-per-step curve under Laguna native tool execution, showing reward variation across orchestrator steps.
Early reward-per-step trace from the hackathon run. The important signal is not the absolute value of the curve; it is that the execution path produces measurable reward variation rather than a flat or dead reward channel.

Artifacts

Released artifacts

The released split is a candidate RL corpus for calibration and reward-design work. Each task packages a terminal environment, public instructions, hidden final-state checks, and verifier-facing state, with privileged truth kept out of the policy-visible prompt.

Dataset · Hugging Face

processrl-terminal-environments

53 training and 14 held-out behavior-conditioned terminal environments generated from the trace analysis on this page.

released

Code · GitHub

endless-terminals

Generator, behavior cards, executable-admission and reward-variance filters, calibration scripts, export code, and the environments/meta_control training environment.

released

Runbook

Reproduction runbook

End-to-end path from traces to behavior cards to generated environments, executable gates, pass@k calibration, and the trainability filter.

released

This report · Vercel

Baseline behavior report

The 100-task baseline, behavior taxonomy, and per-trace analyst reports, regenerated from jobs/ by reports/build_index.py.

live

Training configs ship with the environments/meta_control training environment in the repo.

Full traces

Trace table

The headline charts above summarize the baseline. The complete per-task table and per-trace notes are behind the toggles below.

Outcome split across 100 TBLite tasks: 28 solved, 4 partial, 60 zero reward, 8 runtime or infrastructure; mean reward 0.336 across 92 reward-bearing trials.

100 reconciled task result files; 92 reward-bearing trials; mean reward 0.336; 28 clean solves, 4 partials, 60 zero-reward, 8 runtime / infrastructure. Native tool calling, Harbor concurrency 8, 40-turn cap.
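
The headline mean can be re-derived from the outcome split and the four partial scores listed in the per-task table. The short check below is only a consistency calculation on the published numbers; the variable names are illustrative.

```python
# 92 reward-bearing trials: the 8 runs with score n/a are excluded.
clean_solves = [1.0] * 28
partials = [0.747, 0.683, 0.692, 0.780]  # from the per-task table
zero_reward = [0.0] * 60

scores = clean_solves + partials + zero_reward
mean_reward = sum(scores) / len(scores)

print(len(scores))            # 92
print(round(mean_reward, 3))  # 0.336
```

The sum is 28 + 2.902 = 30.902, and 30.902 / 92 rounds to 0.336, matching the reported figure.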

Per-task table · 100 rows · task, score, turns, mode
task | score | turns | mode
amuse-install | 1 | 19 | Clean solve
anomaly-detection-ranking | 1 | 29 | Clean solve
auth_token_race_condition | 1 | 5 | Clean solve
basic-message-queue | 1 | 35 | Clean solve
broken-python | 1 | 15 | Clean solve
california-housing-api | 1 | 26 | Clean solve
chained-forensic-extraction_20260101_011957 | 1 | 23 | Clean solve
convolutional-layers | 1 | 13 | Clean solve
csv-json-jsonl-merger | 1 | 10 | Clean solve
grpc-plant-position-server | 1 | 12 | Clean solve
iris-dataset-classification | 1 | 6 | Clean solve
jsonl-aggregator | 1 | 5 | Clean solve
log-summary | 1 | 9 | Clean solve
maven-slf4j-conflict | 1 | 27 | Clean solve
mlflow-register | 1 | 12 | Clean solve
mtls-cert-rotation | 1 | 12 | Clean solve
okhttp-trailers-crash | 1 | 5 | Clean solve
prediction-model-evaluation | 1 | 21 | Clean solve
protein-sequence | 1 | 9 | Clean solve
raft-log-repair-concurrent-access | 1 | 3 | Clean solve
reproducibility-and-envsetup | 1 | 30 | Clean solve
schedule-vacation | 1 | 13 | Clean solve
sign-vector-game | 1 | 18 | Clean solve
simple-database-query-tool | 1 | 28 | Clean solve
smiles-data-lab | 1 | 13 | Clean solve
sql-injection-forensics | 1 | 8 | Clean solve
submission_a63937a5_20251224_152124 | 1 | 23 | Clean solve
sympy-bug-fix | 1 | 40 | Clean solve
bash-log-processor-fix | 0.747 | 33 | Partial progress
security-breach-incident-response | 0.683 | 17 | Partial progress
security-incident-log-analysis | 0.692 | 11 | Partial progress
tsl-test-case-generation | 0.780 | 40 | Partial progress
bracket-sequence-restoration | 0 | 40 | Genuine loop (40-turn cap)
pdf-table-parsing | 0 | 40 | Genuine loop (40-turn cap)
pgn-chess-repair-puzzles | 0 | 40 | Genuine loop (40-turn cap)
acl-permissions-inheritance | 0 | 40 | Unbounded search (40-turn cap)
api-endpoint-permission-canonicalizer | 0 | 40 | Unbounded search (40-turn cap)
book-portfolio-analysis | 0 | 40 | Unbounded search (40-turn cap)
corrupted-filesystem-recovery | 0 | 40 | Unbounded search (40-turn cap)
cpp-daemon-sighup-segfault | 0 | 40 | Unbounded search (40-turn cap)
db-migration-local-storage | 0 | 40 | Unbounded search (40-turn cap)
floor-plan-geometry | 0 | 40 | Unbounded search (40-turn cap)
game-of-stones | 0 | 40 | Unbounded search (40-turn cap)
git-repo-forensics | 0 | 40 | Unbounded search (40-turn cap)
multi-server-configuration | 0 | 40 | Unbounded search (40-turn cap)
parking-lot-pathfinding | 0 | 40 | Unbounded search (40-turn cap)
playing-card-recognition | 0 | 40 | Unbounded search (40-turn cap)
publisher-market-analysis | 0 | 40 | Unbounded search (40-turn cap)
react-typescript-debugg | 0 | 40 | Unbounded search (40-turn cap)
symlink-chain-traversal | 0 | 40 | Unbounded search (40-turn cap)
systemd-log-monitoring | 0 | 40 | Unbounded search (40-turn cap)
token-auth-websocket | 0 | 40 | Unbounded search (40-turn cap)
build-system-task-ordering | 0 | 2 | Early stop / protocol abort
hydra-debug-slurm-mode | 0 | 1 | Early stop / protocol abort
multi-labeller | 0 | 3 | Early stop / protocol abort
todos-api | 0 | 3 | Early stop / protocol abort
breast-cancer-mlflow | 0 | 15 | Runtime / infrastructure
container-registry-optimization | n/a | 0 | Runtime / infrastructure
cosign-keyless-signing | 0 | 17 | Runtime / infrastructure
ekf-localization | 0 | 38 | Runtime / infrastructure
etl_checkpoint_resume_bug | n/a | 0 | Runtime / infrastructure
fix-js-network-controller | 0 | 5 | Runtime / infrastructure
grid-pathfinding | 0 | 15 | Runtime / infrastructure
image-tile-identification | n/a | 22 | Runtime / infrastructure
legal-summary-extraction | n/a | 94 | Runtime / infrastructure
mech-system | 0 | 10 | Runtime / infrastructure
monorepo-changelog-cli | 0 | 11 | Runtime / infrastructure
network-log-normalization | n/a | 0 | Runtime / infrastructure
pandas-numpy-data-analysis | n/a | 13 | Runtime / infrastructure
permutation-construction-100k | 0 | 12 | Runtime / infrastructure
publisher-market-analysis-v2 | n/a | 64 | Runtime / infrastructure
python-api-rate-limit | 0 | 12 | Runtime / infrastructure
reverse-engineer-stack-vm | 0 | 5 | Runtime / infrastructure
rsa-jwt-token-api-redis-blacklist | 0 | 16 | Runtime / infrastructure
scan-linux-persistence-artifacts | 0 | 12 | Runtime / infrastructure
task-xxe-exploit | 0 | 6 | Runtime / infrastructure
vimscript-vim-quine | 0 | 10 | Runtime / infrastructure
word-derangement-mapping | n/a | 48 | Runtime / infrastructure
application-debug | 0 | 18 | Unresolved (no clean loop)
bandit-delayed-feedback | 0 | 8 | Unresolved (no clean loop)
battery-charging-optimization | 0 | 4 | Unresolved (no clean loop)
bloom-filter-cache-penetration-prevention | 0 | 6 | Unresolved (no clean loop)
build-merkle-tree-cli-sha512 | 0 | 5 | Unresolved (no clean loop)
competitive-programming-solver | 0 | 15 | Unresolved (no clean loop)
cryptographic-protocol-verifier | 0 | 6 | Unresolved (no clean loop)
distributed-test-execution-scheduler | 0 | 17 | Unresolved (no clean loop)
fix_async_worker_queue | 0 | 13 | Unresolved (no clean loop)
html-index-analysis | 0 | 5 | Unresolved (no clean loop)
industrial-kiln-controller | 0 | 21 | Unresolved (no clean loop)
iot-device-registration-server | 0 | 25 | Unresolved (no clean loop)
jq-data-processing | 0 | 15 | Unresolved (no clean loop)
malicious-package-forensics | 0 | 14 | Unresolved (no clean loop)
neural-architecture-search-final | 0 | 6 | Unresolved (no clean loop)
neutron-submission | 0 | 6 | Unresolved (no clean loop)
pandas-etl | 0 | 12 | Unresolved (no clean loop)
sakila-sqlite-queries | 0 | 20 | Unresolved (no clean loop)
sales-data-csv-analysis | 0 | 4 | Unresolved (no clean loop)
server-log-analysis | 0 | 4 | Unresolved (no clean loop)
service-deployment-wave-planner | 0 | 7 | Unresolved (no clean loop)
supply-chain-fulfillment | 0 | 11 | Unresolved (no clean loop)
Per-trace notes by behavior mode · links to analyst reports where present
Clean solve 28 tasks

Inspect, build, verify, clean stop. These runs show the latent skills are present.

amuse-install score 1 19 turns

Solved in 19 turns with 17 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.

  • tool calls: 18; unique commands: 17; dominant-action share: 11%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 1.
  • Positive sample for capability expression: inspect / execute / verify / stop can succeed under native tool calling.
anomaly-detection-ranking score 1 29 turns

Solved in 29 turns with 28 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.

  • tool calls: 28; unique commands: 28; dominant-action share: 4%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 1.
  • Positive sample for capability expression: inspect / execute / verify / stop can succeed under native tool calling.
auth_token_race_condition score 1 5 turns

Solved in 5 turns with 3 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.

  • tool calls: 4; unique commands: 3; dominant-action share: 50%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 1.
  • Positive sample for capability expression: inspect / execute / verify / stop can succeed under native tool calling.
basic-message-queue score 1 35 turns

Solved in 35 turns with 31 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.

  • tool calls: 34; unique commands: 31; dominant-action share: 9%.
  • adjacent exact repeats: 1; assistant turns without tool calls: 1.
  • Positive sample for capability expression: inspect / execute / verify / stop can succeed under native tool calling.
broken-python score 1 15 turns

Solved in 15 turns with 14 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.

  • tool calls: 14; unique commands: 14; dominant-action share: 7%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 1.
  • Positive sample for capability expression: inspect / execute / verify / stop can succeed under native tool calling.
california-housing-api score 1 26 turns

Solved in 26 turns with 25 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.

  • tool calls: 25; unique commands: 25; dominant-action share: 4%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 1.
  • Positive sample for capability expression: inspect / execute / verify / stop can succeed under native tool calling.
chained-forensic-extraction_20260101_011957 score 1 23 turns

Solved in 23 turns with 0 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.

  • tool calls: 0; unique commands: 0; dominant-action share: 0%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Positive sample for capability expression: inspect / execute / verify / stop can succeed under native tool calling.
convolutional-layers score 1 13 turns

Solved in 13 turns with 12 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.

  • tool calls: 12; unique commands: 12; dominant-action share: 8%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 1.
  • Positive sample for capability expression: inspect / execute / verify / stop can succeed under native tool calling.
csv-json-jsonl-merger score 1 10 turns

Solved in 10 turns with 9 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.

  • tool calls: 9; unique commands: 9; dominant-action share: 11%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 1.
  • Positive sample for capability expression: inspect / execute / verify / stop can succeed under native tool calling.
grpc-plant-position-server score 1 12 turns

Solved in 12 turns with 10 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.

  • tool calls: 11; unique commands: 10; dominant-action share: 18%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 1.
  • Positive sample for capability expression: inspect / execute / verify / stop can succeed under native tool calling.
iris-dataset-classification score 1 6 turns

Solved in 6 turns with 0 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.

  • tool calls: 0; unique commands: 0; dominant-action share: 0%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Positive sample for capability expression: inspect / execute / verify / stop can succeed under native tool calling.
jsonl-aggregator score 1 5 turns

Solved in 5 turns with 4 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.

  • tool calls: 4; unique commands: 4; dominant-action share: 25%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 1.
  • Positive sample for capability expression: inspect / execute / verify / stop can succeed under native tool calling.
log-summary score 1 9 turns

Solved in 9 turns with 8 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.

  • tool calls: 8; unique commands: 8; dominant-action share: 12%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 1.
  • Positive sample for capability expression: inspect / execute / verify / stop can succeed under native tool calling.
maven-slf4j-conflict score 1 27 turns

Solved in 27 turns with 23 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.

  • tool calls: 26; unique commands: 23; dominant-action share: 12%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 1.
  • Positive sample for capability expression: inspect / execute / verify / stop can succeed under native tool calling.
mlflow-register score 1 12 turns · trace report →

Solved in 12 turns with 11 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.

  • tool calls: 11; unique commands: 11; dominant-action share: 9%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 1.
  • Positive sample for capability expression: inspect / execute / verify / stop can succeed under native tool calling.
mtls-cert-rotation score 1 12 turns

Solved in 12 turns with 0 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.

  • tool calls: 0; unique commands: 0; dominant-action share: 0%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Positive sample for capability expression: inspect / execute / verify / stop can succeed under native tool calling.
okhttp-trailers-crash score 1 5 turns · trace report →

Solved in 5 turns with 4 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.

  • tool calls: 4; unique commands: 4; dominant-action share: 25%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 1.
  • Positive sample for capability expression: inspect / execute / verify / stop can succeed under native tool calling.
prediction-model-evaluation score 1 21 turns

Solved in 21 turns with 0 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.

  • tool calls: 0; unique commands: 0; dominant-action share: 0%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Positive sample for capability expression: inspect / execute / verify / stop can succeed under native tool calling.
protein-sequence score 1 9 turns

Solved in 9 turns with 7 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.

  • tool calls: 8; unique commands: 7; dominant-action share: 25%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 1.
  • Positive sample for capability expression: inspect / execute / verify / stop can succeed under native tool calling.
raft-log-repair-concurrent-access score 1 3 turns

Solved in 3 turns with 0 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.

  • tool calls: 0; unique commands: 0; dominant-action share: 0%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Positive sample for capability expression: inspect / execute / verify / stop can succeed under native tool calling.
reproducibility-and-envsetup score 1 30 turns

Solved in 30 turns with 25 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.

  • tool calls: 29; unique commands: 25; dominant-action share: 7%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 1.
  • Positive sample for capability expression: inspect / execute / verify / stop can succeed under native tool calling.
schedule-vacation score 1 13 turns · trace report →

Solved in 13 turns with 12 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.

  • tool calls: 12; unique commands: 12; dominant-action share: 8%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 1.
  • Positive sample for capability expression: inspect / execute / verify / stop can succeed under native tool calling.
sign-vector-game score 1 18 turns

Solved in 18 turns with 0 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.

  • tool calls: 0; unique commands: 0; dominant-action share: 0%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Positive sample for capability expression: inspect / execute / verify / stop can succeed under native tool calling.
simple-database-query-tool score 1 28 turns · trace report →

Solved in 28 turns with 26 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.

  • tool calls: 27; unique commands: 26; dominant-action share: 7%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 1.
  • Positive sample for capability expression: inspect / execute / verify / stop can succeed under native tool calling.
smiles-data-lab score 1 13 turns

Solved in 13 turns with 12 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.

  • tool calls: 12; unique commands: 12; dominant-action share: 8%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 1.
  • Positive sample for capability expression: inspect / execute / verify / stop can succeed under native tool calling.
sql-injection-forensics score 1 8 turns

Solved in 8 turns with 0 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.

  • tool calls: 0; unique commands: 0; dominant-action share: 0%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Positive sample for capability expression: inspect / execute / verify / stop can succeed under native tool calling.
submission_a63937a5_20251224_152124 score 1 23 turns

Solved in 23 turns with 21 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.

  • tool calls: 22; unique commands: 21; dominant-action share: 9%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 1.
  • Positive sample for capability expression: inspect / execute / verify / stop can succeed under native tool calling.
sympy-bug-fix score 1 40 turns

Solved in 40 turns with 40 unique commands. This is the positive-control behavior: execution skill is present when the task stays inside Laguna's closed-loop control envelope.

  • tool calls: 40; unique commands: 40; dominant-action share: 2%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Positive sample for capability expression: inspect / execute / verify / stop can succeed under native tool calling.
Partial progress 4 tasks

Verifier gives partial credit: useful curriculum material if the missing checks are explicit.

tsl-test-case-generation score 0.780 40 turns

Partial verifier reward 0.780 in 40 turns. This is high-value curriculum material because the trace likely contains real progress plus a missing final verification, edge-case check, or completion criterion.

  • tool calls: 0; unique commands: 0; dominant-action share: 0%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Likely useful for process reward design: partial state is observable and verifier-scored.
bash-log-processor-fix score 0.747 33 turns

Partial verifier reward 0.747 in 33 turns. This is high-value curriculum material because the trace likely contains real progress plus a missing final verification, edge-case check, or completion criterion.

  • tool calls: 32; unique commands: 28; dominant-action share: 9%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 1.
  • Likely useful for process reward design: partial state is observable and verifier-scored.
security-incident-log-analysis score 0.692 11 turns

Partial verifier reward 0.692 in 11 turns. This is high-value curriculum material because the trace likely contains real progress plus a missing final verification, edge-case check, or completion criterion.

  • tool calls: 10; unique commands: 10; dominant-action share: 10%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 1.
  • Likely useful for process reward design: partial state is observable and verifier-scored.
security-breach-incident-response score 0.683 17 turns

Partial verifier reward 0.683 in 17 turns. This is high-value curriculum material because the trace likely contains real progress plus a missing final verification, edge-case check, or completion criterion.

  • tool calls: 16; unique commands: 16; dominant-action share: 6%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 1.
  • Likely useful for process reward design: partial state is observable and verifier-scored.
Genuine loop (40-turn cap) 3 tasks

Burns the turn cap with repeated or near-repeated actions; the recognition-to-action coupling target.

bracket-sequence-restoration score 0 40 turns

Zero reward after 40 turns. The dominant action accounts for 50% of tool calls with 0 adjacent exact repeats, so the failure is action inertia under stale evidence rather than missing atomic terminal skill.

  • tool calls: 40; unique commands: 4; dominant-action share: 50%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Primary RL target: couple stuck recognition to a changed action or an explicit stop decision.
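The "stuck recognition without a changed action" pattern can be made measurable with a simple check over (reasoning, action) pairs. This is a rough sketch, assuming each turn is available as such a pair and that a small hand-picked list of cue phrases is enough to spot an announced pivot. The cue list, the function name `uncoupled_pivots`, and the toy commands are all hypothetical and are not the release's detector.

```python
from typing import List, Tuple

# Hypothetical cue phrases; a real detector would need a broader list.
PIVOT_CUES = ("different approach", "try another", "instead")


def uncoupled_pivots(turns: List[Tuple[str, str]]) -> List[int]:
    """Return indices of turns whose reasoning announces a pivot while
    the action is identical to the previous turn's action."""
    flagged = []
    for i in range(1, len(turns)):
        reasoning, action = turns[i]
        prev_action = turns[i - 1][1]
        announces_pivot = any(cue in reasoning.lower() for cue in PIVOT_CUES)
        if announces_pivot and action == prev_action:
            flagged.append(i)
    return flagged


# Toy turns; the commands are placeholders, not taken from a real trace.
toy_turns = [
    ("Let me inspect the JSON files.", "cat data.json"),
    ("Let me try a different approach.", "python fix.py"),
    ("Let me try a different approach.", "python fix.py"),
    ("I see the issue. Let me fix this properly.", "python fix.py"),
    ("Let me try a different approach.", "python edit_by_hand.py"),
]
flagged = uncoupled_pivots(toy_turns)
```

Only the turn at index 2 is flagged here: it announces a pivot but repeats the previous action. A count of such turns could serve as a per-trace signal for this RL target, with the caveat that phrase matching will miss paraphrased pivots.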
pdf-table-parsing score 0 40 turns

Zero reward after 40 turns. The dominant action accounts for 75% of tool calls with 29 adjacent exact repeats, so the failure is action inertia under stale evidence rather than missing atomic terminal skill.

  • tool calls: 40; unique commands: 11; dominant-action share: 75%.
  • adjacent exact repeats: 29; assistant turns without tool calls: 0.
  • Primary RL target: couple stuck recognition to a changed action or an explicit stop decision.
pgn-chess-repair-puzzles score 0 40 turns

Zero reward after 40 turns. The dominant action accounts for 75% of tool calls with 29 adjacent exact repeats, so the failure is action inertia under stale evidence rather than missing atomic terminal skill.

  • tool calls: 40; unique commands: 11; dominant-action share: 75%.
  • adjacent exact repeats: 29; assistant turns without tool calls: 0.
  • Primary RL target: couple stuck recognition to a changed action or an explicit stop decision.
Unbounded search (40-turn cap) 17 tasks

Uses the whole budget without enough repetition to be a pure loop; progress-ledger and search-discipline target.
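The split between a genuine loop and unbounded search at the cap can be written as a rule on dominant-action share. The sketch below assumes a threshold of 0.5, chosen only because it is consistent with two cards in this report: bracket-sequence-restoration (50% share, filed under genuine loop) and api-endpoint-permission-canonicalizer (35% share, filed under unbounded search). The release does not state its criterion, so the threshold, the function name `cap_bucket`, and the label strings are assumptions.

```python
TURN_CAP = 40      # the 40-turn cap named in the bucket headers
LOOP_SHARE = 0.5   # assumed threshold, not stated in the source


def cap_bucket(turns: int, score: float, dominant_share: float) -> str:
    """Label a zero-reward trace that used the whole turn budget."""
    if score > 0 or turns < TURN_CAP:
        return "not-a-cap-failure"
    if dominant_share >= LOOP_SHARE:
        return "genuine-loop"
    return "unbounded-search"
```

For example, `cap_bucket(40, 0, 0.50)` yields `"genuine-loop"` and `cap_bucket(40, 0, 0.35)` yields `"unbounded-search"`. A production rule would probably also use adjacent exact repeats and near-duplicate matching rather than share alone.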

acl-permissions-inheritance score 0 40 turns

Zero reward at the turn cap with 40 unique commands. This looks less like a byte-identical loop and more like unbounded search without a maintained progress ledger.

  • tool calls: 40; unique commands: 40; dominant-action share: 2%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Primary RL target: bounded search with progress accounting, not just more exploration.
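One way to read "bounded search with progress accounting" is a ledger that records each command alongside a task-level progress signal and flags when the search has gone stale. This is a sketch under strong assumptions: that a numeric progress signal exists for the task (for instance, a count of passing checks) and that a fixed `patience` budget is a sensible bound. The class and method names are hypothetical.

```python
class ProgressLedger:
    """Tracks whether recent actions improved a task-level progress signal.

    `patience` and the numeric progress signal are illustrative assumptions.
    """

    def __init__(self, patience: int = 5):
        self.patience = patience
        self.best = float("-inf")
        self.stale_steps = 0
        self.entries = []  # (command, progress) pairs, in order

    def record(self, command: str, progress: float) -> None:
        self.entries.append((command, progress))
        if progress > self.best:
            self.best = progress
            self.stale_steps = 0
        else:
            # No improvement over the best state seen so far.
            self.stale_steps += 1

    def should_pivot_or_stop(self) -> bool:
        return self.stale_steps >= self.patience


ledger = ProgressLedger(patience=3)
for cmd, progress in [("a", 1), ("b", 1), ("c", 0)]:
    ledger.record(cmd, progress)
stale_after_three = ledger.should_pivot_or_stop()
ledger.record("d", 1)
stale_after_four = ledger.should_pivot_or_stop()
```

In this usage the ledger stays quiet after two non-improving steps and signals a pivot or stop after the third. Forty unique commands with no recorded improvement, as on the card above, would trip such a check long before the turn cap.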
api-endpoint-permission-canonicalizer score 0 40 turns

Zero reward at the turn cap with 26 unique commands. This looks less like a byte-identical loop and more like unbounded search without a maintained progress ledger.

  • tool calls: 40; unique commands: 26; dominant-action share: 35%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Primary RL target: bounded search with progress accounting, not just more exploration.
book-portfolio-analysis score 0 40 turns

Zero reward at the turn cap with 13 unique commands. This looks less like a byte-identical loop and more like unbounded search without a maintained progress ledger.

  • tool calls: 40; unique commands: 13; dominant-action share: 28%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Primary RL target: bounded search with progress accounting, not just more exploration.
corrupted-filesystem-recovery score 0 40 turns

Zero reward at the turn cap with 0 recorded commands. Because the summary shows no tool calls, the unbounded-search label here rests on the turn count alone, not on observed command diversity.

  • tool calls: 0; unique commands: 0; dominant-action share: 0%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Primary RL target: bounded search with progress accounting, not just more exploration.
cpp-daemon-sighup-segfault score 0 40 turns

Zero reward at the turn cap with 23 unique commands. This looks less like a byte-identical loop and more like unbounded search without a maintained progress ledger.

  • tool calls: 40; unique commands: 23; dominant-action share: 28%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Primary RL target: bounded search with progress accounting, not just more exploration.
db-migration-local-storage score 0 40 turns

Zero reward at the turn cap with 31 unique commands. This looks less like a byte-identical loop and more like unbounded search without a maintained progress ledger.

  • tool calls: 40; unique commands: 31; dominant-action share: 18%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Primary RL target: bounded search with progress accounting, not just more exploration.
floor-plan-geometry score 0 40 turns

Zero reward at the turn cap with 0 recorded commands. Because the summary shows no tool calls, the unbounded-search label here rests on the turn count alone, not on observed command diversity.

  • tool calls: 0; unique commands: 0; dominant-action share: 0%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Primary RL target: bounded search with progress accounting, not just more exploration.
game-of-stones score 0 40 turns

Zero reward at the turn cap with 0 recorded commands. Because the summary shows no tool calls, the unbounded-search label here rests on the turn count alone, not on observed command diversity.

  • tool calls: 0; unique commands: 0; dominant-action share: 0%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Primary RL target: bounded search with progress accounting, not just more exploration.
git-repo-forensics score 0 40 turns

Zero reward at the turn cap with 39 unique commands. This looks less like a byte-identical loop and more like unbounded search without a maintained progress ledger.

  • tool calls: 40; unique commands: 39; dominant-action share: 5%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Primary RL target: bounded search with progress accounting, not just more exploration.
multi-server-configuration score 0 40 turns

Zero reward at the turn cap with 38 unique commands. This looks less like a byte-identical loop and more like unbounded search without a maintained progress ledger.

  • tool calls: 40; unique commands: 38; dominant-action share: 5%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Primary RL target: bounded search with progress accounting, not just more exploration.
parking-lot-pathfinding score 0 40 turns

Zero reward at the turn cap with 40 unique commands. This looks less like a byte-identical loop and more like unbounded search without a maintained progress ledger.

  • tool calls: 40; unique commands: 40; dominant-action share: 2%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Primary RL target: bounded search with progress accounting, not just more exploration.
playing-card-recognition score 0 40 turns

Zero reward at the turn cap with 0 recorded commands. Because the summary shows no tool calls, the unbounded-search label here rests on the turn count alone, not on observed command diversity.

  • tool calls: 0; unique commands: 0; dominant-action share: 0%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Primary RL target: bounded search with progress accounting, not just more exploration.
publisher-market-analysis score 0 40 turns

Zero reward at the turn cap with 0 recorded commands. Because the summary shows no tool calls, the unbounded-search label here rests on the turn count alone, not on observed command diversity.

  • tool calls: 0; unique commands: 0; dominant-action share: 0%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Primary RL target: bounded search with progress accounting, not just more exploration.
react-typescript-debugg score 0 40 turns

Zero reward at the turn cap with 39 unique commands. This looks less like a byte-identical loop and more like unbounded search without a maintained progress ledger.

  • tool calls: 40; unique commands: 39; dominant-action share: 5%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Primary RL target: bounded search with progress accounting, not just more exploration.
symlink-chain-traversal score 0 40 turns

Zero reward at the turn cap with 34 unique commands. This looks less like a byte-identical loop and more like unbounded search without a maintained progress ledger.

  • tool calls: 40; unique commands: 34; dominant-action share: 8%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Primary RL target: bounded search with progress accounting, not just more exploration.
systemd-log-monitoring score 0 40 turns

Zero reward at the turn cap with 37 unique commands. This looks less like a byte-identical loop and more like unbounded search without a maintained progress ledger.

  • tool calls: 40; unique commands: 37; dominant-action share: 10%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Primary RL target: bounded search with progress accounting, not just more exploration.
token-auth-websocket score 0 40 turns

Zero reward at the turn cap with 0 recorded commands. Because the summary shows no tool calls, the unbounded-search label here rests on the turn count alone, not on observed command diversity.

  • tool calls: 0; unique commands: 0; dominant-action share: 0%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Primary RL target: bounded search with progress accounting, not just more exploration.
Early stop / protocol abort 4 tasks

Stops or aborts too early; stop-quality and tool-call robustness target.

build-system-task-ordering score 0 2 turns

Zero reward after only 2 turns. Treat as stop-quality or protocol-abort evidence until the raw trace proves it was a genuine task-level decision.

  • tool calls: 1; unique commands: 1; dominant-action share: 100%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 1.
  • Primary RL target: stop only after deliverable verification; separate parser failures from policy stops.
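Separating parser failures from policy stops requires a small amount of bookkeeping about how each trace ended. The sketch below assumes three boolean facts can be extracted per trace; the `StopEvent` fields, the function name, and the label strings are hypothetical, since the release does not describe its harness schema.

```python
from dataclasses import dataclass


@dataclass
class StopEvent:
    """Hypothetical summary of how a trace ended; field names are assumptions."""
    tool_call_parse_error: bool  # harness could not parse the last tool call
    model_declared_done: bool    # model emitted a final answer or stop
    verified_deliverable: bool   # a verification step passed before the stop


def classify_stop(ev: StopEvent) -> str:
    if ev.tool_call_parse_error:
        # Harness or protocol issue, not evidence about the stop policy.
        return "protocol-abort"
    if ev.model_declared_done and not ev.verified_deliverable:
        # The stop-quality target: stopping without checking the deliverable.
        return "premature-stop"
    if ev.model_declared_done:
        return "verified-stop"
    return "unknown"
```

Checking the parse error first reflects the card's guidance: a trace that ended on a malformed tool call should not be counted as a policy decision, whatever else happened in it.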
hydra-debug-slurm-mode score 0 1 turn

Zero reward after only 1 turn. Treat as stop-quality or protocol-abort evidence until the raw trace proves it was a genuine task-level decision.

  • tool calls: 0; unique commands: 0; dominant-action share: 0%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Primary RL target: stop only after deliverable verification; separate parser failures from policy stops.
multi-labeller score 0 3 turns

Zero reward after only 3 turns. Treat as stop-quality or protocol-abort evidence until the raw trace proves it was a genuine task-level decision.

  • tool calls: 0; unique commands: 0; dominant-action share: 0%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Primary RL target: stop only after deliverable verification; separate parser failures from policy stops.
todos-api score 0 3 turns

Zero reward after only 3 turns. Treat as stop-quality or protocol-abort evidence until the raw trace proves it was a genuine task-level decision.

  • tool calls: 2; unique commands: 2; dominant-action share: 50%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 1.
  • Primary RL target: stop only after deliverable verification; separate parser failures from policy stops.
Runtime / infrastructure 22 tasks

Trial or runtime exceptions; useful for harness triage, not model-behavior claims.

breast-cancer-mlflow score 0 15 turns

Trial produced no clean behavioral verdict after 15 turns. Exception head: “Traceback (most recent call last):”. Keep out of training and evaluation claims unless a rerun succeeds.

  • tool calls: 0; unique commands: 0; dominant-action share: 0%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Exclude from reward-variance estimates until rerun; this is not clean model-behavior evidence.
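Excluding runtime trials from reward-variance estimates amounts to a filter before the variance call. The sketch below uses scores copied from cards in this report, with `None` standing in for "n/a". The bucket label strings and the choice of population variance (`statistics.pvariance`) are assumptions for illustration; the release does not say which estimator it uses.

```python
from statistics import pvariance

# (task, score, bucket); None marks a score reported as "n/a".
trials = [
    ("simple-database-query-tool", 1.0, "solved"),
    ("bash-log-processor-fix", 0.747, "partial"),
    ("pdf-table-parsing", 0.0, "loop"),
    ("breast-cancer-mlflow", 0.0, "runtime"),
    ("container-registry-optimization", None, "runtime"),
]

# Drop runtime/infrastructure trials and any trial without a score.
clean = [score for _, score, bucket in trials
         if bucket != "runtime" and score is not None]
reward_variance = pvariance(clean)
```

Note that breast-cancer-mlflow carries a numeric score of 0 but is still dropped: filtering on the bucket rather than on the score keeps harness failures from being counted as model failures.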
container-registry-optimization score n/a 0 turns

Trial produced no clean behavioral verdict after 0 turns. Exception head: “Traceback (most recent call last):”. Keep out of training and evaluation claims unless a rerun succeeds.

  • tool calls: 0; unique commands: 0; dominant-action share: 0%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Exclude from reward-variance estimates until rerun; this is not clean model-behavior evidence.
cosign-keyless-signing score 0 17 turns

Trial produced no clean behavioral verdict after 17 turns. Exception head: “Traceback (most recent call last):”. Keep out of training and evaluation claims unless a rerun succeeds.

  • tool calls: 0; unique commands: 0; dominant-action share: 0%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Exclude from reward-variance estimates until rerun; this is not clean model-behavior evidence.
ekf-localization score 0 38 turns

Trial produced no clean behavioral verdict after 38 turns. Exception head: “Traceback (most recent call last):”. Keep out of training and evaluation claims unless a rerun succeeds.

  • tool calls: 38; unique commands: 11; dominant-action share: 74%.
  • adjacent exact repeats: 27; assistant turns without tool calls: 0.
  • Exclude from reward-variance estimates until rerun; this is not clean model-behavior evidence.
etl_checkpoint_resume_bug score n/a 0 turns

Trial produced no clean behavioral verdict after 0 turns. Exception head: “Traceback (most recent call last):”. Keep out of training and evaluation claims unless a rerun succeeds.

  • tool calls: 0; unique commands: 0; dominant-action share: 0%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Exclude from reward-variance estimates until rerun; this is not clean model-behavior evidence.
fix-js-network-controller score 0 5 turns

Trial produced no clean behavioral verdict after 5 turns. Exception head: “Traceback (most recent call last):”. Keep out of training and evaluation claims unless a rerun succeeds.

  • tool calls: 0; unique commands: 0; dominant-action share: 0%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Exclude from reward-variance estimates until rerun; this is not clean model-behavior evidence.
grid-pathfinding score 0 15 turns

Trial produced no clean behavioral verdict after 15 turns. Exception head: “Traceback (most recent call last):”. Keep out of training and evaluation claims unless a rerun succeeds.

  • tool calls: 15; unique commands: 14; dominant-action share: 13%.
  • adjacent exact repeats: 1; assistant turns without tool calls: 0.
  • Exclude from reward-variance estimates until rerun; this is not clean model-behavior evidence.
image-tile-identification score n/a 22 turns

Trial produced no clean behavioral verdict after 22 turns. Exception head: “Traceback (most recent call last):”. Keep out of training and evaluation claims unless a rerun succeeds.

  • tool calls: 0; unique commands: 0; dominant-action share: 0%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Exclude from reward-variance estimates until rerun; this is not clean model-behavior evidence.
legal-summary-extraction score n/a 94 turns

Trial produced no clean behavioral verdict after 94 turns. Exception head: “Traceback (most recent call last):”. Keep out of training and evaluation claims unless a rerun succeeds.

  • tool calls: 0; unique commands: 0; dominant-action share: 0%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Exclude from reward-variance estimates until rerun; this is not clean model-behavior evidence.
mech-system score 0 10 turns

Trial produced no clean behavioral verdict after 10 turns. Exception head: “Traceback (most recent call last):”. Keep out of training and evaluation claims unless a rerun succeeds.

  • tool calls: 0; unique commands: 0; dominant-action share: 0%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Exclude from reward-variance estimates until rerun; this is not clean model-behavior evidence.
monorepo-changelog-cli score 0 11 turns

Trial produced no clean behavioral verdict after 11 turns. Exception head: “Traceback (most recent call last):”. Keep out of training and evaluation claims unless a rerun succeeds.

  • tool calls: 0; unique commands: 0; dominant-action share: 0%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Exclude from reward-variance estimates until rerun; this is not clean model-behavior evidence.
network-log-normalization score n/a 0 turns

Trial produced no clean behavioral verdict after 0 turns. Exception head: “Traceback (most recent call last):”. Keep out of training and evaluation claims unless a rerun succeeds.

  • tool calls: 0; unique commands: 0; dominant-action share: 0%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Exclude from reward-variance estimates until rerun; this is not clean model-behavior evidence.
pandas-numpy-data-analysis score n/a 13 turns

Trial produced no clean behavioral verdict after 13 turns. Exception head: “Traceback (most recent call last):”. Keep out of training and evaluation claims unless a rerun succeeds.

  • tool calls: 13; unique commands: 13; dominant-action share: 8%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Exclude from reward-variance estimates until rerun; this is not clean model-behavior evidence.
permutation-construction-100k score 0 12 turns

Trial produced no clean behavioral verdict after 12 turns. Exception head: “Traceback (most recent call last):”. Keep out of training and evaluation claims unless a rerun succeeds.

  • tool calls: 12; unique commands: 12; dominant-action share: 8%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Exclude from reward-variance estimates until rerun; this is not clean model-behavior evidence.
publisher-market-analysis-v2 score n/a 64 turns

Trial produced no clean behavioral verdict after 64 turns. Exception head: “Traceback (most recent call last):”. Keep out of training and evaluation claims unless a rerun succeeds.

  • tool calls: 0; unique commands: 0; dominant-action share: 0%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Exclude from reward-variance estimates until rerun; this is not clean model-behavior evidence.
python-api-rate-limit score 0 12 turns

Trial produced no clean behavioral verdict after 12 turns. Exception head: “Traceback (most recent call last):”. Keep out of training and evaluation claims unless a rerun succeeds.

  • tool calls: 0; unique commands: 0; dominant-action share: 0%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Exclude from reward-variance estimates until rerun; this is not clean model-behavior evidence.
reverse-engineer-stack-vm score 0 5 turns

Trial produced no clean behavioral verdict after 5 turns. Exception head: “Traceback (most recent call last):”. Keep out of training and evaluation claims unless a rerun succeeds.

  • tool calls: 0; unique commands: 0; dominant-action share: 0%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Exclude from reward-variance estimates until rerun; this is not clean model-behavior evidence.
rsa-jwt-token-api-redis-blacklist score 0 16 turns

Trial produced no clean behavioral verdict after 16 turns. Exception head: “Traceback (most recent call last):”. Keep out of training and evaluation claims unless a rerun succeeds.

  • tool calls: 0; unique commands: 0; dominant-action share: 0%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Exclude from reward-variance estimates until rerun; this is not clean model-behavior evidence.
scan-linux-persistence-artifacts score 0 12 turns

Trial produced no clean behavioral verdict after 12 turns. Exception head: “Traceback (most recent call last):”. Keep out of training and evaluation claims unless a rerun succeeds.

  • tool calls: 0; unique commands: 0; dominant-action share: 0%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Exclude from reward-variance estimates until rerun; this is not clean model-behavior evidence.
task-xxe-exploit score 0 6 turns

Trial produced no clean behavioral verdict after 6 turns. Exception head: “Traceback (most recent call last):”. Keep out of training and evaluation claims unless a rerun succeeds.

  • tool calls: 0; unique commands: 0; dominant-action share: 0%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Exclude from reward-variance estimates until rerun; this is not clean model-behavior evidence.
vimscript-vim-quine score 0 10 turns

Trial produced no clean behavioral verdict after 10 turns. Exception head: “Traceback (most recent call last):”. Keep out of training and evaluation claims unless a rerun succeeds.

  • tool calls: 10; unique commands: 10; dominant-action share: 10%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Exclude from reward-variance estimates until rerun; this is not clean model-behavior evidence.
word-derangement-mapping score n/a 48 turns

Trial produced no clean behavioral verdict after 48 turns. Exception head: “Traceback (most recent call last):”. Keep out of training and evaluation claims unless a rerun succeeds.

  • tool calls: 0; unique commands: 0; dominant-action share: 0%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Exclude from reward-variance estimates until rerun; this is not clean model-behavior evidence.
Unresolved (no clean loop) 22 tasks

Zero reward before the cap without a clean loop signature; adjacent capability expression target.

application-debug score 0 18 turns

Zero reward after 18 turns with 15 unique commands. The failure is adjacent to capability expression: enough action diversity to avoid pure-loop labeling, but no verified deliverable.

  • tool calls: 17; unique commands: 15; dominant-action share: 12%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 1.
  • Use raw trace review to decide whether this is trainable-band weakness or true missing skill.
bandit-delayed-feedback score 0 8 turns

Zero reward after 8 turns with 7 unique commands. The failure is adjacent to capability expression: enough action diversity to avoid pure-loop labeling, but no verified deliverable.

  • tool calls: 7; unique commands: 7; dominant-action share: 14%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 1.
  • Use raw trace review to decide whether this is trainable-band weakness or true missing skill.
battery-charging-optimization score 0 4 turns

Zero reward after 4 turns with 3 unique commands. The failure is adjacent to capability expression: enough action diversity to avoid pure-loop labeling, but no verified deliverable.

  • tool calls: 3; unique commands: 3; dominant-action share: 33%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 1.
  • Use raw trace review to decide whether this is trainable-band weakness or true missing skill.
bloom-filter-cache-penetration-prevention score 0 6 turns

Zero reward after 6 turns with 5 unique commands. The failure is adjacent to capability expression: enough action diversity to avoid pure-loop labeling, but no verified deliverable.

  • tool calls: 5; unique commands: 5; dominant-action share: 20%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 1.
  • Use raw trace review to decide whether this is trainable-band weakness or true missing skill.
build-merkle-tree-cli-sha512 score 0 5 turns

Zero reward after 5 turns with 4 unique commands. The failure is adjacent to capability expression: enough action diversity to avoid pure-loop labeling, but no verified deliverable.

  • tool calls: 4; unique commands: 4; dominant-action share: 25%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 1.
  • Use raw trace review to decide whether this is trainable-band weakness or true missing skill.
competitive-programming-solver score 0 15 turns

Zero reward after 15 turns with 14 unique commands. The failure is adjacent to capability expression: enough action diversity to avoid pure-loop labeling, but no verified deliverable.

  • tool calls: 14; unique commands: 14; dominant-action share: 7%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 1.
  • Use raw trace review to decide whether this is trainable-band weakness or true missing skill.
cryptographic-protocol-verifier score 0 6 turns

Zero reward after 6 turns with 4 unique commands. The failure is adjacent to capability expression: enough action diversity to avoid pure-loop labeling, but no verified deliverable.

  • tool calls: 5; unique commands: 4; dominant-action share: 40%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 1.
  • Use raw trace review to decide whether this is trainable-band weakness or true missing skill.
distributed-test-execution-scheduler score 0 17 turns

Zero reward after 17 turns with 16 unique commands. The failure is adjacent to capability expression: enough action diversity to avoid pure-loop labeling, but no verified deliverable.

  • tool calls: 16; unique commands: 16; dominant-action share: 6%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 1.
  • Use raw trace review to decide whether this is trainable-band weakness or true missing skill.
fix_async_worker_queue score 0 13 turns

Zero reward after 13 turns with 0 recorded commands. No tool calls appear in the summary, so the action-diversity reading used for this bucket does not apply here; the trace produced no verified deliverable, and the failure mode has to be assigned from the raw trace.

  • tool calls: 0; unique commands: 0; dominant-action share: 0%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 0.
  • Use raw trace review to decide whether this is trainable-band weakness or true missing skill.
html-index-analysis score 0 5 turns

Zero reward after 5 turns with 4 unique commands. The failure is adjacent to capability expression: enough action diversity to avoid pure-loop labeling, but no verified deliverable.

  • tool calls: 4; unique commands: 4; dominant-action share: 25%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 1.
  • Use raw trace review to decide whether this is trainable-band weakness or true missing skill.
industrial-kiln-controller score 0 21 turns

Zero reward after 21 turns with 20 unique commands. The failure is adjacent to capability expression: enough action diversity to avoid pure-loop labeling, but no verified deliverable.

  • tool calls: 20; unique commands: 20; dominant-action share: 5%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 1.
  • Use raw trace review to decide whether this is trainable-band weakness or true missing skill.
iot-device-registration-server score 0 25 turns

Zero reward after 25 turns with 22 unique commands. The failure is adjacent to capability expression: enough action diversity to avoid pure-loop labeling, but no verified deliverable.

  • tool calls: 24; unique commands: 22; dominant-action share: 8%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 1.
  • Use raw trace review to decide whether this is trainable-band weakness or true missing skill.
jq-data-processing score 0 15 turns

Zero reward after 15 turns with 14 unique commands. The failure is adjacent to capability expression: enough action diversity to avoid pure-loop labeling, but no verified deliverable.

  • tool calls: 14; unique commands: 14; dominant-action share: 7%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 1.
  • Use raw trace review to decide whether this is trainable-band weakness or true missing skill.
malicious-package-forensics score 0 14 turns

Zero reward after 14 turns with 13 unique commands. The failure is adjacent to capability expression: enough action diversity to avoid pure-loop labeling, but no verified deliverable.

  • tool calls: 13; unique commands: 13; dominant-action share: 8%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 1.
  • Use raw trace review to decide whether this is trainable-band weakness or true missing skill.
neural-architecture-search-final score 0 6 turns

Zero reward after 6 turns with 5 unique commands. The failure is adjacent to capability expression: enough action diversity to avoid pure-loop labeling, but no verified deliverable.

  • tool calls: 5; unique commands: 5; dominant-action share: 20%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 1.
  • Use raw trace review to decide whether this is trainable-band weakness or true missing skill.
neutron-submission score 0 6 turns

Zero reward after 6 turns with 5 unique commands. The failure is adjacent to capability expression: enough action diversity to avoid pure-loop labeling, but no verified deliverable.

  • tool calls: 5; unique commands: 5; dominant-action share: 20%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 1.
  • Use raw trace review to decide whether this is trainable-band weakness or true missing skill.
pandas-etl score 0 12 turns

Zero reward after 12 turns with 8 unique commands. The failure is adjacent to capability expression: enough action diversity to avoid pure-loop labeling, but no verified deliverable.

  • tool calls: 11; unique commands: 8; dominant-action share: 36%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 1.
  • Use raw trace review to decide whether this is a trainable-band weakness or a true missing skill.
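The counters on these cards can be recomputed from a trace. The sketch below is a minimal illustration, not the project's actual tooling: `card_metrics` and `CardMetrics` are hypothetical names, and the definitions are assumptions chosen to be consistent with the figures shown on the cards (dominant-action share as the most frequent command's fraction of all tool calls, adjacent exact repeats as identical consecutive tool calls). The command list is synthetic, built only to match the pandas-etl counts; it is not the real trace.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class CardMetrics:
    tool_calls: int
    unique_commands: int
    dominant_action_share: float  # fraction in [0, 1]
    adjacent_exact_repeats: int
    turns_without_tool_calls: int


def card_metrics(turns):
    """Compute per-trace counters.

    `turns` holds one entry per assistant turn: the command string,
    or None when the turn made no tool call (assumed representation).
    """
    commands = [c for c in turns if c is not None]
    counts = Counter(commands)
    # Share of tool calls taken by the single most frequent command.
    dominant = max(counts.values()) / len(commands) if commands else 0.0
    # Identical commands issued back to back in the tool-call sequence.
    repeats = sum(1 for prev, cur in zip(commands, commands[1:]) if prev == cur)
    return CardMetrics(
        tool_calls=len(commands),
        unique_commands=len(counts),
        dominant_action_share=dominant,
        adjacent_exact_repeats=repeats,
        turns_without_tool_calls=len(turns) - len(commands),
    )


# Synthetic 12-turn trace: 11 tool calls, 8 distinct commands, one turn
# without a tool call. Placeholder command names, not real trace data.
synthetic_turns = ["a", "b", "a", "c", "a", "d", "a", "e", "f", "g", "h", None]
m = card_metrics(synthetic_turns)
print(m.tool_calls, m.unique_commands, round(100 * m.dominant_action_share))
```

Under these assumed definitions, a trace can revisit one command four times (a 36% share) and still register zero adjacent exact repeats, which is why such a card is not labeled a pure loop.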
sakila-sqlite-queries score 0 20 turns

Zero reward after 20 turns with 17 unique commands. The failure is adjacent to capability expression: enough action diversity to avoid pure-loop labeling, but no verified deliverable.

  • tool calls: 19; unique commands: 17; dominant-action share: 11%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 1.
  • Use raw trace review to decide whether this is a trainable-band weakness or a true missing skill.
sales-data-csv-analysis score 0 4 turns

Zero reward after 4 turns with 3 unique commands. The failure is adjacent to capability expression: enough action diversity to avoid pure-loop labeling, but no verified deliverable.

  • tool calls: 3; unique commands: 3; dominant-action share: 33%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 1.
  • Use raw trace review to decide whether this is a trainable-band weakness or a true missing skill.
server-log-analysis score 0 4 turns

Zero reward after 4 turns with 3 unique commands. The failure is adjacent to capability expression: enough action diversity to avoid pure-loop labeling, but no verified deliverable.

  • tool calls: 3; unique commands: 3; dominant-action share: 33%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 1.
  • Use raw trace review to decide whether this is a trainable-band weakness or a true missing skill.
service-deployment-wave-planner score 0 7 turns

Zero reward after 7 turns with 6 unique commands. The failure is adjacent to capability expression: enough action diversity to avoid pure-loop labeling, but no verified deliverable.

  • tool calls: 6; unique commands: 6; dominant-action share: 17%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 1.
  • Use raw trace review to decide whether this is a trainable-band weakness or a true missing skill.
supply-chain-fulfillment score 0 11 turns

Zero reward after 11 turns with 10 unique commands. The failure is adjacent to capability expression: enough action diversity to avoid pure-loop labeling, but no verified deliverable.

  • tool calls: 10; unique commands: 10; dominant-action share: 10%.
  • adjacent exact repeats: 0; assistant turns without tool calls: 1.
  • Use raw trace review to decide whether this is a trainable-band weakness or a true missing skill.