BENCH.A — XBOW-style campaign

104 cases. Two modes. No no-win cases.

SecHive completed a paired black-box and white-box best-of campaign across the 104-case XBOW-style set. The goal was not just to rerun cases until the score improved. It was to test whether a proof-first workbench could discover, validate, promote, and learn from controlled web exploitation tasks without contaminating reusable skills with benchmark-specific answers.

§ Result

Best-of campaign result.

The best-of rollup keeps, for each case, the strongest retained black-box evidence and the strongest retained white-box evidence. The two modes are reported separately and are never merged into a single score.
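The per-case, per-mode rollup described above can be sketched roughly as follows. This is an illustrative sketch, not SecHive's implementation: the `RunRecord` fields, the mode labels, and the `OUTCOME_RANK` ordering (a win outranks a miss, which outranks an infra error) are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical ranking: a higher value means stronger retained evidence.
OUTCOME_RANK = {"infra_error": 0, "miss": 1, "win": 2}


@dataclass(frozen=True)
class RunRecord:
    case_id: str  # e.g. "XBEN-001-24"
    mode: str     # assumed labels: "black_box" or "white_box"
    outcome: str  # one of the OUTCOME_RANK keys


def best_of(records):
    """Keep the strongest record per (case, mode); modes are never merged."""
    best = {}
    for rec in records:
        key = (rec.case_id, rec.mode)
        current = best.get(key)
        if current is None or OUTCOME_RANK[rec.outcome] > OUTCOME_RANK[current.outcome]:
            best[key] = rec
    return best


# Made-up run history, for illustration only.
runs = [
    RunRecord("XBEN-001-24", "black_box", "miss"),
    RunRecord("XBEN-001-24", "black_box", "win"),
    RunRecord("XBEN-001-24", "white_box", "win"),
    RunRecord("XBEN-002-24", "black_box", "infra_error"),
    RunRecord("XBEN-002-24", "black_box", "miss"),
]
rollup = best_of(runs)
```

Keying on the `(case_id, mode)` pair is what keeps a white-box win from masking a black-box gap in the same case.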

Metric	Best-of result	Notes
Recorded cases	104	Full represented universe: XBEN-001-24 through XBEN-104-24.
Any-win cases	104 / 104 · 100.0%	Every case has at least one retained black-box or white-box win.
Full black-box + white-box wins	99 / 104 · 95.19%	Both modes solved at least once.
Black-box wins	99 / 104 · 95.19%	Five remaining black-box gaps are retained as negative evidence.
White-box wins	104 / 104 · 100.0%	Source-enabled/runtime-scored best-of rollup.
No-win misses	0 / 104 · 0.0%	No case lacks a retained win.
Infra unresolved	0 / 104 · 0.0%	Earlier infra cases were rerun and resolved in the final aggregate.
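
The rows of the table are arithmetically linked, which a short check makes explicit. The counts below come from the table; the variable names and the `pct` helper are only illustrative.

```python
TOTAL_CASES = 104

black_box_wins = 99   # from the table
white_box_wins = 104  # from the table
both_mode_wins = 99   # cases won in both modes, from the table

# Inclusion-exclusion: cases won in at least one mode.
any_win = black_box_wins + white_box_wins - both_mode_wins
no_win = TOTAL_CASES - any_win


def pct(count, total=TOTAL_CASES):
    """Format a count as a percentage of the case universe, two decimals."""
    return f"{100 * count / total:.2f}%"


summary = {
    "any_win": (any_win, pct(any_win)),
    "black_box": (black_box_wins, pct(black_box_wins)),
    "white_box": (white_box_wins, pct(white_box_wins)),
    "no_win": (no_win, pct(no_win)),
}
```

Because every case has a white-box win, the both-mode count necessarily equals the black-box count, which is why those two rows report the same 99 / 104.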

Full manifest. The public-safe case manifest is available as JSON. It lists all 104 cases, black-box status, white-box status, and best-of classification without including flags, payloads, or private run logs.
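
A reader who wants to sanity-check such a manifest could do something like the following. The manifest schema is not given here, so the field names (`case`, `black_box`, `white_box`, `best_of`), the status values, and the classification labels are all hypothetical; the sample entries use placeholder case IDs rather than real results.

```python
import json

# Hypothetical schema: the real manifest's field names may differ.
ALLOWED_KEYS = {"case", "black_box", "white_box", "best_of"}


def classify(entry):
    """Derive an assumed best-of label from the two per-mode statuses."""
    wins = (entry["black_box"] == "win") + (entry["white_box"] == "win")
    return {2: "full", 1: "any-win", 0: "no-win"}[wins]


def check_manifest(entries):
    """Return a list of problems: unexpected keys, missing keys, or label mismatches."""
    problems = []
    for entry in entries:
        case = entry.get("case", "<unknown>")
        extra = set(entry) - ALLOWED_KEYS
        missing = ALLOWED_KEYS - set(entry)
        if extra:
            # Anything beyond the allowed keys could leak flags or payloads.
            problems.append(f"{case}: unexpected keys {sorted(extra)}")
        elif missing:
            problems.append(f"{case}: missing keys {sorted(missing)}")
        elif classify(entry) != entry["best_of"]:
            problems.append(f"{case}: best_of label does not match statuses")
    return problems


# Placeholder entries, not taken from the real manifest.
sample = json.loads("""
[
  {"case": "CASE-A", "black_box": "win",  "white_box": "win", "best_of": "full"},
  {"case": "CASE-B", "black_box": "miss", "white_box": "win", "best_of": "any-win"},
  {"case": "CASE-C", "black_box": "win",  "white_box": "win", "best_of": "full", "flag": "x"}
]
""")
problems = check_manifest(sample)
```

Checking against an allow-list of keys, rather than a deny-list of sensitive ones, is one way to make the "no flags, payloads, or private run logs" property mechanically testable.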

Where SecHive sits.

Public XBOW-style results are not all measured the same way. We list comparable systems as published, with their reported mode, so the reader can decide which line to compare against.

System / source	Reported result	Mode
XBOW launch announcement	85.0% on 104	novel benchmark set
Xfenser public page	88.5% (92 / 104)	black-box
SQUR public blog	87.5% (91 / 104)	CTF-style signal
MAPTA paper	76.9% on 104	multi-agent
Shannon reports	96.15% source-aware	source-aware (not apples-to-apples with black-box)
SecHive (this campaign)	95.19% / 100%	black-box / white-box best-of

Claim boundary. SecHive does not claim an uncontested overall leaderboard win. Public results vary by mode, inputs, time budget, and case coverage. We claim membership in the strongest published class and a methodology spine that makes our number defensible.

What the campaign tested for.

Check	Question	Outcome
α Skill contamination	Could the loop solve cases without absorbing case-specific answers into reusable skills?	no contamination
β Proof retention	Were per-case artifacts retained at win-rate? Could the run be replayed?	retained
γ Negative evidence	Were refutations preserved alongside wins?	preserved
δ Mode separation	Did source-enabled runs respect the proof-first split between candidate and validated?	enforced