THOR Finding Triage Benchmark

Controls

Pick a preset, swap axes, and narrow the point cloud.

Available controls: X Axis, Y Axis, Search Models, Show labels on chart, Hide incomplete runs.

Operational note

This explorer reads the public operational metrics directly from combined/leaderboard.json, including balanced_ots, critical_miss, and false_review. The baseline rows always-fp, always-inc, and always-tp are exposed as references instead of being hidden.
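
As a rough sketch of how the same file could be read outside the explorer, the snippet below loads the leaderboard rows and separates the baseline references from the real models. Only the file path, the three metric names, and the three baseline names come from this page. The row layout (a top-level JSON list of objects with a `model` key) and the helper names `split_rows`, `load_leaderboard`, and `is_complete` are assumptions for illustration, not a documented schema.

```python
import json
from pathlib import Path

# Names taken from this page.
BASELINES = {"always-fp", "always-inc", "always-tp"}
METRICS = ("balanced_ots", "critical_miss", "false_review")


def split_rows(rows):
    """Separate baseline reference rows from model rows.

    Assumption: each row is a dict with a hypothetical "model" key.
    """
    baselines = [r for r in rows if r.get("model") in BASELINES]
    models = [r for r in rows if r.get("model") not in BASELINES]
    return models, baselines


def is_complete(row):
    """Treat a row as complete when all three metrics are present."""
    return all(row.get(m) is not None for m in METRICS)


def load_leaderboard(path="combined/leaderboard.json"):
    # Assumption: the file holds a top-level JSON list of row objects.
    rows = json.loads(Path(path).read_text(encoding="utf-8"))
    return split_rows(rows)
```

Keeping the baselines in a separate list mirrors the page's choice to show them as references rather than drop them, so they can still be plotted next to the models.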

Scatter Explorer

Desktop view recommended

The interactive scatter plots are too dense for a phone-sized screen. Open this page on a desktop browser for hover details, zooming, and pinned model comparisons.

The leader tables, model table, metric glossary, and static chart gallery remain available on mobile.

Pinned Model Details

Click a chart symbol to pin its details here. Click it again to remove it.

Metric Glossary

On this page, Quality Score is the plain-language label for the benchmark's CW% metric from the README and blog posts.

Quality Score (CW%): Confidence-Weighted Score. The classic benchmark quality score. It rewards confident correct answers and penalizes confident wrong answers.
OTS: Operational Triage Score. A score that weights decisions by their operational consequences.
Balanced OTS: An operational score balanced across FP, Inc and TP, so the majority class does not dominate the result.
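
To illustrate what "balanced" could mean here, the sketch below averages a per-finding score within each class first and then averages the class means, so a large FP majority cannot dominate. This is an assumption about the balancing step only: the per-finding scores, the function name `balanced_mean`, and the use of a plain macro-average are hypothetical, and the benchmark's actual OTS weighting is not specified on this page.

```python
from collections import defaultdict


def balanced_mean(labels, scores):
    """Macro-average: mean of the per-class mean scores.

    labels: true class of each finding (assumed to be "FP", "Inc", or "TP").
    scores: hypothetical per-finding operational scores.
    """
    by_class = defaultdict(list)
    for label, score in zip(labels, scores):
        by_class[label].append(score)
    class_means = [sum(v) / len(v) for v in by_class.values()]
    return sum(class_means) / len(class_means)


# Eight FP findings scored perfectly, one Inc and one TP scored zero.
labels = ["FP"] * 8 + ["Inc", "TP"]
scores = [1.0] * 8 + [0.0, 0.0]

plain = sum(scores) / len(scores)        # dominated by the FP majority
balanced = balanced_mean(labels, scores) # each class counts equally
```

In this toy case the plain mean looks strong because most findings are FP, while the balanced value is much lower because the Inc and TP classes each carry the same weight as FP.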

Critical Miss Rate: The share of true positives wrongly suppressed as false positives. Lower is safer.
False Review Load: The share of false positives that still reach analysts as review-worthy or escalated findings. Lower is better.
MAE / RMSE: Priority-score error metrics. MAE is the average absolute difference from the expert score, while RMSE penalizes larger deviations more strongly.
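
The definitions above can be sketched in a few lines. This is a minimal illustration that assumes string labels "FP", "Inc", and "TP" and treats any non-FP prediction on a true false positive as reaching analysts. The function names and that reading of "review-worthy or escalated" are assumptions, not the benchmark's own implementation.

```python
import math


def critical_miss_rate(truth, pred):
    """Share of true positives that were predicted as false positives."""
    tp = [p for t, p in zip(truth, pred) if t == "TP"]
    if not tp:
        return float("nan")  # undefined without any true positives
    return sum(p == "FP" for p in tp) / len(tp)


def false_review_load(truth, pred):
    """Share of true false positives that were not predicted as FP."""
    fp = [p for t, p in zip(truth, pred) if t == "FP"]
    if not fp:
        return float("nan")  # undefined without any false positives
    return sum(p != "FP" for p in fp) / len(fp)


def mae(expert, predicted):
    """Mean absolute difference from the expert priority score."""
    return sum(abs(e - p) for e, p in zip(expert, predicted)) / len(expert)


def rmse(expert, predicted):
    """Root mean squared error; large deviations weigh more than in MAE."""
    return math.sqrt(
        sum((e - p) ** 2 for e, p in zip(expert, predicted)) / len(expert)
    )
```

For the same set of scores, RMSE is never smaller than MAE, and the gap between them widens when a few predictions are far from the expert score.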

Leader Tables

Switch between the overall recommendations and the tier-specific recommendation tables derived from the same operational profile artifacts as the README.

Filtered Models

Rank	Model	Tier	Quality Score (CW%)	Balanced OTS	Critical Miss %	False Review %	Threat Capture %

Chart Gallery

Static benchmark charts from the repo, available alongside the interactive explorer for comparison and documentation parity.