BENCH.A — XBOW-style campaign

104 cases. Two modes. No no-win cases.

SecHive completed a paired black-box and white-box best-of campaign across the 104-case XBOW-style set. The goal was not just to rerun cases until the score improved. It was to test whether a proof-first workbench could discover, validate, promote, and learn from controlled web exploitation tasks without contaminating reusable skills with benchmark-specific answers.

§ Result

Best-of campaign result.

The best-of rollup keeps, for each case, only the strongest retained black-box evidence and the strongest retained white-box evidence. The two modes are reported separately.

Metric | Best-of result | Notes
Recorded cases | 104 | Full represented universe: XBEN-001-24 through XBEN-104-24.
Any-win cases | 104 / 104 · 100.0% | Every case has at least one retained black-box or white-box win.
Full black-box + white-box wins | 99 / 104 · 95.19% | Both modes solved at least once.
Black-box wins | 99 / 104 · 95.19% | Five remaining black-box gaps are retained as negative evidence.
White-box wins | 104 / 104 · 100.0% | Source-enabled/runtime-scored best-of rollup.
No-win misses | 0 / 104 · 0.0% | No case lacks a retained win.
Infra unresolved | 0 / 104 · 0.0% | Earlier infra cases were rerun and resolved in the final aggregate.
Full manifest. The public-safe case manifest is available as JSON. It lists all 104 cases, black-box status, white-box status, and best-of classification without including flags, payloads, or private run logs.
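
As a rough illustration of how the scorecard rows relate to per-case statuses, the following minimal sketch recomputes the rollup from a list of case records. The record shape (`case_id`, `black_box_win`, `white_box_win`) is an assumption made for this example and is not the published manifest schema. The five black-box misses are placed on arbitrary placeholder cases, since this page does not say which cases they are.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    # Hypothetical record shape; the real manifest fields may differ.
    case_id: str
    black_box_win: bool
    white_box_win: bool


def scorecard(cases):
    """Return {metric: (count, percent)} for a best-of rollup."""
    total = len(cases)
    counts = {
        "any_win": sum(c.black_box_win or c.white_box_win for c in cases),
        "full": sum(c.black_box_win and c.white_box_win for c in cases),
        "black_box": sum(c.black_box_win for c in cases),
        "white_box": sum(c.white_box_win for c in cases),
        "no_win": sum(not (c.black_box_win or c.white_box_win) for c in cases),
    }
    return {name: (n, round(100 * n / total, 2)) for name, n in counts.items()}


# Synthetic data: IDs follow the XBEN-001-24 .. XBEN-104-24 pattern.
# PLACEHOLDER: the last five cases are marked as black-box misses only to
# reproduce the totals; they are not the actual gap cases.
cases = [
    CaseResult(f"XBEN-{i:03d}-24", black_box_win=(i <= 99), white_box_win=True)
    for i in range(1, 105)
]

card = scorecard(cases)
for metric, (count, pct) in card.items():
    print(f"{metric}: {count} / {len(cases)} ({pct}%)")
```

Because every case has a white-box win in this data, the "full" count equals the black-box count, which is why both rows in the table read 99 / 104.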

Where SecHive sits.

Public XBOW-style results are not all measured the same way. We list comparable systems as published, with their reported mode, so the reader can decide which line to compare against.

System / source | Reported result | Mode
XBOW launch announcement | 85.0% on 104 | novel benchmark set
Xfenser public page | 88.5% (92 / 104) | black-box
SQUR public blog | 87.5% (91 / 104) | CTF-style signal
MAPTA paper | 76.9% on 104 | multi-agent
Shannon reports | 96.15% source-aware | source-aware (not apples-to-apples with black-box)
SecHive (this campaign) | 95.19% / 100% | black-box / white-box best-of
Claim boundary. SecHive does not claim an uncontested overall leaderboard win. Public results vary by mode, inputs, time budget, and case coverage. We claim membership in the strongest published class and a methodology spine that makes our number defensible.

What the campaign tested for.

  1. α Skill contamination: Could the loop solve cases without absorbing case-specific answers into reusable skills? Result: no contamination.
  2. β Proof retention: Were per-case artifacts retained at win-rate? Could the run be replayed? Result: retained.
  3. γ Negative evidence: Were refutations preserved alongside wins? Result: preserved.
  4. δ Mode separation: Did source-enabled runs respect the proof-first split between candidate and validated? Result: enforced.
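
To make the skill-contamination question concrete, here is a minimal sketch of one kind of check a reviewer could run over reusable skill text. It is not SecHive's actual check. The `contamination_hits` helper is hypothetical, the case-ID pattern is inferred from the `XBEN-001-24` naming used above, and the `FLAG{...}` pattern is an assumed flag format that may not match the real benchmark.

```python
import re

# Case identifiers as they appear on this page, e.g. XBEN-001-24.
CASE_ID = re.compile(r"XBEN-\d{3}-24")
# ASSUMPTION: flags look like FLAG{...}; adjust to the real format.
FLAG_LIKE = re.compile(r"flag\{[^}]*\}", re.IGNORECASE)


def contamination_hits(skill_text):
    """Return case IDs and flag-like strings found in a skill's text."""
    return CASE_ID.findall(skill_text) + FLAG_LIKE.findall(skill_text)


clean_skill = "Probe numeric object IDs for missing authorization checks."
dirty_skill = "For XBEN-005-24 submit FLAG{example}."

print(contamination_hits(clean_skill))
print(contamination_hits(dirty_skill))
```

A pattern scan like this only catches literal leaks. A skill that paraphrases a case-specific answer would pass it, so it can complement a manual review but not replace one.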
[Figure: XBOW benchmark scorecard]