SecHive completed a paired black-box and white-box best-of campaign across the 104-case XBOW-style set. The goal was not simply to rerun cases until the score improved. It was to test whether a proof-first workbench could discover, validate, promote, and learn from controlled web exploitation tasks without contaminating reusable skills with benchmark-specific answers.
The best-of rollup deduplicates repeated runs, keeping for each case the strongest retained evidence in each mode. Black-box and white-box results are reported separately.
| Metric | Best-of result | Notes |
|---|---|---|
| Recorded cases | 104 | Full represented universe: XBEN-001-24 through XBEN-104-24. |
| Any-win cases | 104 / 104 · 100.0% | Every case has at least one retained black-box or white-box win. |
| Full black-box + white-box wins | 99 / 104 · 95.19% | Both modes solved at least once. |
| Black-box wins | 99 / 104 · 95.19% | Five remaining black-box gaps are retained as negative evidence. |
| White-box wins | 104 / 104 · 100.0% | Source-enabled/runtime-scored best-of rollup. |
| No-win misses | 0 / 104 · 0.0% | No case lacks a retained win. |
| Infra unresolved | 0 / 104 · 0.0% | Earlier infra cases were rerun and resolved in the final aggregate. |
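
The rollup behind these rows can be sketched as follows. This is a minimal illustration, not SecHive's implementation: the `Run` record and the `best_of`, `summarize`, and `pct` helpers are hypothetical names, and the sketch assumes each retained attempt reduces to a case ID, a mode, and a solved flag. It does not model the infra-unresolved row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One retained attempt (hypothetical record shape, not SecHive's schema)."""
    case_id: str   # e.g. "XBEN-001-24"
    mode: str      # "black-box" or "white-box"
    solved: bool   # True if the attempt produced a validated win


def pct(count, total):
    """Percentage of total, rounded to two decimals."""
    return round(100 * count / total, 2)


def best_of(runs):
    """Deduplicate runs: a (case, mode) pair counts as won if any run won."""
    best = {}
    for run in runs:
        key = (run.case_id, run.mode)
        best[key] = best.get(key, False) or run.solved
    return best


def summarize(runs, case_ids):
    """Tally per-mode and combined wins over a fixed list of cases."""
    best = best_of(runs)
    total = len(case_ids)
    bb = [best.get((c, "black-box"), False) for c in case_ids]
    wb = [best.get((c, "white-box"), False) for c in case_ids]
    counts = {
        "black_box_wins": sum(bb),
        "white_box_wins": sum(wb),
        "both_wins": sum(b and w for b, w in zip(bb, wb)),
        "any_wins": sum(b or w for b, w in zip(bb, wb)),
    }
    counts["no_win"] = total - counts["any_wins"]
    # Each entry is (count, percentage of all cases).
    return {name: (n, pct(n, total)) for name, n in counts.items()}
```

Because a case counts once per mode no matter how many times it was rerun, extra attempts can only turn a miss into a win, never inflate the denominator. With this rounding, `pct(99, 104)` gives 95.19, matching the black-box row.
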
Public XBOW-style results are not all measured the same way. We list comparable systems as published, each with its reported mode, so readers can decide which row to compare against.
| System / source | Reported result | Mode |
|---|---|---|
| XBOW launch announcement | 85.0% on 104 | novel benchmark set |
| Xfenser public page | 88.5% (92 / 104) | black-box |
| SQUR public blog | 87.5% (91 / 104) | CTF-style signal |
| MAPTA paper | 76.9% on 104 | multi-agent |
| Shannon reports | 96.15% | source-aware (not apples-to-apples with black-box) |
| SecHive (this campaign) | 95.19% / 100.0% | black-box / white-box best-of |