§ Benchmarks

Benchmarks that label their mode.

SecHive publishes benchmark results with explicit methodology: which run mode, what target version, how long, what counted as a win, what artifacts back the claim.

BENCH.A · Web · 104 cases

XBOW-style campaign

Paired black-box and source-enabled best-of campaign across 104 cases. Black-box runs won 95.19% of cases, white-box runs won 100%, and no case was left without a win.
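The percentages above can be cross-checked with a little arithmetic. The sketch below assumes each rate is wins divided by the 104 cases, reported to two decimals. The helper name `wins_matching_rate` and that rounding convention are illustrative assumptions, not part of the published methodology.

```python
def wins_matching_rate(total_cases: int, reported_pct: str) -> list[int]:
    """Return every whole win count whose rate, at two decimals, equals the reported figure."""
    return [
        wins
        for wins in range(total_cases + 1)
        # Compare as two-decimal strings to sidestep float equality issues.
        if f"{100 * wins / total_cases:.2f}" == reported_pct
    ]


TOTAL_CASES = 104  # case count stated in the campaign description

black_box = wins_matching_rate(TOTAL_CASES, "95.19")
white_box = wins_matching_rate(TOTAL_CASES, "100.00")

print(black_box)  # [99]
print(white_box)  # [104]
```

Under these assumptions, 99 of 104 is the only whole count consistent with 95.19%. Because the white-box pass won all 104 cases, a best-of tally cannot contain a case with no win.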

BENCH.B · Web · 111 challenges

OWASP Juice Shop

Full unredacted black-box and white-box reports against the current 111-challenge tree. Reproducible.

Live runtime findings: 35

Source-aware items: 58

CVE references: 20

Read report →

BENCH.C · Active Directory · framework

GOAD // COMING

Game of Active Directory. Lab framework loaded, scoring schema published. First public results scheduled for 2026.Q3.

Lab status: Loaded

Scoring: Published

Public results: 2026.Q3

See framework →

BENCH.D · Secrets · framework

OWASP WrongSecrets // COMING

OWASP WrongSecrets challenge tree. Aggregator and runner integrated; first public scorecard scheduled for 2026.Q3.

Runner: Wired

Aggregator: Wired

Public results: 2026.Q3

See framework →

§ Methodology

How a benchmark counts.

Every result published on this site uses the same methodology spine: scope, run mode, target version, time budget, scoring rule, and proof retention.
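One way to picture that spine is as a record attached to every result, plus a check that refuses to compare results whose methodology differs. This is a minimal sketch. The class name, field names, and helper are hypothetical and do not describe SecHive's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MethodologySpine:
    """Hypothetical record holding the six spine fields for one published result."""
    scope: str                      # what was in bounds for the run
    run_mode: str                   # e.g. "black-box" or "source-enabled"
    target_version: str             # commit hash, image digest, or lab tag
    wall_clock_cap_s: int           # overall time budget, in seconds
    per_case_cap_s: int             # per-case time budget, in seconds
    scoring_rule: str               # e.g. "any-win" or "full-win"
    artifacts: tuple[str, ...] = () # proof retained for audit


# Fields that must match before two results can be compared fairly (an assumption).
COMPARISON_KEYS = (
    "scope", "run_mode", "target_version",
    "wall_clock_cap_s", "per_case_cap_s", "scoring_rule",
)


def differing_fields(a: MethodologySpine, b: MethodologySpine) -> list[str]:
    """Return the spine fields on which two results disagree."""
    return [key for key in COMPARISON_KEYS if getattr(a, key) != getattr(b, key)]


# Placeholder values for illustration only; not real targets or budgets.
black_box_run = MethodologySpine(
    scope="web app", run_mode="black-box", target_version="example-tag",
    wall_clock_cap_s=3600, per_case_cap_s=300, scoring_rule="any-win",
)
source_run = MethodologySpine(
    scope="web app", run_mode="source-enabled", target_version="example-tag",
    wall_clock_cap_s=3600, per_case_cap_s=300, scoring_rule="any-win",
)

print(differing_fields(black_box_run, source_run))  # ['run_mode']
```

A non-empty list signals that the two numbers should be reported side by side with their labels, not ranked against each other.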

Field	Definition	Why it matters
Run mode	black-box, source-enabled, fully source-aware, CTF	Comparing across modes is the most common public-data error.
Target version	commit hash, image digest, lab tag	Targets evolve; "Juice Shop" alone is not a benchmark.
Time budget	wall-clock and per-case caps	Speed claims without a budget are not comparable.
Scoring rule	any-win, full-win, source-aware-objective	The same campaign can show different rates under different rules.
Proof retention	artifact set retained per case	Without retention there is no audit trail to inspect.
Negative evidence	refutations kept with the case	Failed attempts inform the next run; they are not noise.
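The scoring-rule row can be illustrated with a small example. The sketch below assumes "any-win" counts a case when at least one run mode wins and "full-win" counts it only when every run mode wins. Those definitions are assumptions made for illustration, and the outcomes are invented data, not SecHive results.

```python
# Hypothetical per-case outcomes: case id -> {run mode: did it win?}
outcomes = {
    "case-01": {"black-box": True,  "source-enabled": True},
    "case-02": {"black-box": False, "source-enabled": True},
    "case-03": {"black-box": True,  "source-enabled": True},
    "case-04": {"black-box": False, "source-enabled": False},
}


def any_win(modes: dict[str, bool]) -> bool:
    """Assumed rule: the case counts if at least one mode won."""
    return any(modes.values())


def full_win(modes: dict[str, bool]) -> bool:
    """Assumed rule: the case counts only if every mode won."""
    return all(modes.values())


def win_rate(cases: dict[str, dict[str, bool]], rule) -> float:
    """Fraction of cases that count as a win under the given rule."""
    return sum(rule(modes) for modes in cases.values()) / len(cases)


print(win_rate(outcomes, any_win))   # 0.75
print(win_rate(outcomes, full_win))  # 0.5
```

The same four cases yield 75% under one rule and 50% under the other, which is why a rate published without its scoring rule cannot be compared with anything.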