Controls
Pick a preset, swap axes, and narrow the point cloud.
This explorer reads the public operational metrics directly from combined/leaderboard.json, including balanced_ots, critical_miss, and false_review. The baseline rows always-fp, always-inc, and always-tp are exposed as references instead of being hidden.
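As a rough illustration of how the baseline rows can be kept as references rather than dropped, here is a minimal Python sketch. The row layout (a list of objects with a "model" key plus metric keys) is an assumption, not the documented schema of combined/leaderboard.json, and the sample values are made up for the example.

```python
import json

# Baseline row names are taken from the page text above.
BASELINE_NAMES = {"always-fp", "always-inc", "always-tp"}

def split_baselines(rows):
    """Split rows into (models, baselines) so baselines stay visible as references."""
    models = [r for r in rows if r.get("model") not in BASELINE_NAMES]
    baselines = [r for r in rows if r.get("model") in BASELINE_NAMES]
    return models, baselines

# Hypothetical stand-in for the file contents; the real schema may differ.
SAMPLE = json.loads("""
[
  {"model": "example-model", "balanced_ots": 0.61, "critical_miss": 0.08, "false_review": 0.22},
  {"model": "always-fp",     "balanced_ots": 0.33, "critical_miss": 1.00, "false_review": 0.00},
  {"model": "always-tp",     "balanced_ots": 0.33, "critical_miss": 0.00, "false_review": 1.00}
]
""")

models, baselines = split_baselines(SAMPLE)
for row in baselines:
    # Baselines are plotted as reference points, not ranked alongside models.
    print("reference:", row["model"])
```

Keeping the baselines in a separate list lets a chart draw them with a distinct marker while leaving them out of any ranking.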
Scatter Explorer
Click a chart symbol to pin its details here. Click it again to unpin it.
On this page, Quality Score is the plain-language label for the benchmark's CW% metric from the README and blog posts.
- Quality Score (CW%): Confidence-Weighted Score. The classic benchmark quality score. It rewards confident correct answers and penalizes confident wrong answers.
- OTS: Operational Triage Score. A score that weights decisions by their operational consequences.
- Balanced OTS: An operational score balanced across FP, Inc and TP, so the majority class does not dominate the result.
- Critical Miss Rate: The share of true positives wrongly suppressed as false positives. Lower is safer.
- False Review Load: The share of false positives that still reach analysts as review-worthy or escalated findings. Lower is better.
- MAE / RMSE: Priority-score error metrics. MAE is the average absolute difference from the expert score, while RMSE penalizes larger deviations more strongly.
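The rate and error metrics above can be sketched in a few lines of Python. This is an illustrative reading of the definitions, not the benchmark's own implementation: the label strings "FP", "Inc" and "TP", and the choice to count any non-FP verdict on a false positive as reaching analysts, are assumptions. MAE and RMSE use their standard formulas.

```python
import math

def critical_miss_rate(truth, predicted):
    """Share of expert-labelled TP items that the model suppressed as FP."""
    tp_preds = [p for t, p in zip(truth, predicted) if t == "TP"]
    if not tp_preds:
        return 0.0
    return sum(p == "FP" for p in tp_preds) / len(tp_preds)

def false_review_load(truth, predicted):
    """Share of expert-labelled FP items the model did not dismiss as FP (assumed reading)."""
    fp_preds = [p for t, p in zip(truth, predicted) if t == "FP"]
    if not fp_preds:
        return 0.0
    return sum(p != "FP" for p in fp_preds) / len(fp_preds)

def mae(expert, model):
    """Mean absolute difference between expert and model priority scores."""
    return sum(abs(e - m) for e, m in zip(expert, model)) / len(expert)

def rmse(expert, model):
    """Root mean squared difference; large deviations weigh more than in MAE."""
    return math.sqrt(sum((e - m) ** 2 for e, m in zip(expert, model)) / len(expert))

# Made-up verdicts and scores, for illustration only.
truth = ["TP", "TP", "FP", "FP", "Inc"]
predicted = ["TP", "FP", "FP", "Inc", "Inc"]
print(critical_miss_rate(truth, predicted))  # 0.5: one of two TPs was suppressed
print(false_review_load(truth, predicted))   # 0.5: one of two FPs still reached review

expert_scores = [1, 2, 3, 4]
model_scores = [1, 2, 3, 8]
print(mae(expert_scores, model_scores))   # 1.0
print(rmse(expert_scores, model_scores))  # 2.0
```

In the last example a single score that is off by 4 gives an MAE of 1.0 but an RMSE of 2.0, which is the sense in which RMSE penalizes larger deviations more strongly.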
Leader Tables
Switch between the overall recommendations and the tier-specific recommendation tables derived from the same operational profile artifacts as the README.
Filtered Models
| Rank | Model | Tier | Quality Score (CW%) | Balanced OTS | Critical Miss % | False Review % | Threat Capture % |
|---|---|---|---|---|---|---|---|
Chart Gallery
Static benchmark charts from the repo, available alongside the interactive explorer for comparison and documentation parity.