THOR AI Benchmarks

Benchmark Control Deck

The static README remains the reference narrative. This control deck presents the same public benchmark artifacts as a navigable dashboard, with an interactive explorer, recommendation tables, filtered model views, and a chart gallery.

Controls

Pick a preset, swap axes, and narrow the point cloud.

Operational note

This explorer reads the public operational metrics directly from combined/leaderboard.json, including balanced_ots, critical_miss, and false_review. The baseline rows always-fp, always-inc, and always-tp are shown as reference points rather than hidden.
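
As a rough illustration of the note above, the sketch below separates the baseline rows from the model rows after parsing leaderboard data. The file layout is an assumption: the page does not document the schema of combined/leaderboard.json, so the top-level list, the "model" key, and every value in the sample are hypothetical. Only the three metric names and the three baseline names come from this page.

```python
import json

# Baseline row names as listed in the operational note.
BASELINE_NAMES = {"always-fp", "always-inc", "always-tp"}

# Metric field names mentioned in the operational note.
METRIC_FIELDS = ("balanced_ots", "critical_miss", "false_review")


def split_baselines(rows):
    """Return (models, baselines).

    Assumes each row is a dict with a "model" key (hypothetical schema).
    """
    models = [r for r in rows if r["model"] not in BASELINE_NAMES]
    baselines = [r for r in rows if r["model"] in BASELINE_NAMES]
    return models, baselines


# Made-up stand-in for the contents of combined/leaderboard.json.
# The numbers are placeholders, not benchmark results.
SAMPLE_JSON = """
[
  {"model": "example-model", "balanced_ots": 0.5,
   "critical_miss": 0.1, "false_review": 0.2},
  {"model": "always-fp", "balanced_ots": 0.0,
   "critical_miss": 1.0, "false_review": 0.0}
]
"""

rows = json.loads(SAMPLE_JSON)
models, baselines = split_baselines(rows)

print([r["model"] for r in models])     # model rows only
print([r["model"] for r in baselines])  # reference rows, kept visible
```

Keeping the baselines in a separate list, rather than dropping them, mirrors the page's choice to show them as references.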

Scatter Explorer

Pinned Model Details

Click a chart symbol to pin its details here. Click it again to remove it.

Metric Glossary

On this page, Quality Score is the plain-language label for the benchmark's CW% metric from the README and blog posts.

Quality Score (CW%)
Confidence-Weighted Score. The classic benchmark quality score. It rewards confident correct answers and penalizes confident wrong answers.
OTS
Operational Triage Score. A score that weights decisions by their operational consequences.
Balanced OTS
An operational score balanced across the FP, Inc, and TP classes, so the majority class does not dominate the result.
Critical Miss Rate
The share of true positives wrongly suppressed as false positives. Lower is safer.
False Review Load
The share of false positives that still reach analysts as review-worthy or escalated findings. Lower is better.
MAE / RMSE
Priority-score error metrics. MAE is the average absolute difference from the expert score, while RMSE penalizes larger deviations more strongly.
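
The rate and error definitions in the glossary can be sketched as follows. This is an illustrative reading, not the benchmark's own implementation. It assumes verdicts are the string labels "FP", "Inc", and "TP", and that a true FP "reaches analysts" whenever the model labels it anything other than "FP". Both are assumptions, and the function names are hypothetical. MAE and RMSE use their standard definitions.

```python
import math


def critical_miss_rate(truth, predicted):
    """Share of true TP items that the model labeled FP."""
    tp_preds = [p for t, p in zip(truth, predicted) if t == "TP"]
    if not tp_preds:
        return 0.0
    return sum(p == "FP" for p in tp_preds) / len(tp_preds)


def false_review_load(truth, predicted):
    """Share of true FP items not labeled FP (assumed to reach analysts)."""
    fp_preds = [p for t, p in zip(truth, predicted) if t == "FP"]
    if not fp_preds:
        return 0.0
    return sum(p != "FP" for p in fp_preds) / len(fp_preds)


def mae(expert, model):
    """Mean absolute error between expert and model priority scores."""
    return sum(abs(e - m) for e, m in zip(expert, model)) / len(expert)


def rmse(expert, model):
    """Root mean squared error; large deviations weigh more than in MAE."""
    return math.sqrt(sum((e - m) ** 2 for e, m in zip(expert, model)) / len(expert))


# Toy inputs for illustration only, not benchmark data.
truth = ["TP", "TP", "FP", "FP", "Inc"]
predicted = ["TP", "FP", "FP", "TP", "Inc"]
print(critical_miss_rate(truth, predicted))  # 0.5: one of two TPs suppressed
print(false_review_load(truth, predicted))   # 0.5: one of two FPs escalated

expert_scores = [1, 2, 3]
model_scores = [1, 2, 6]
print(mae(expert_scores, model_scores))   # 1.0
print(rmse(expert_scores, model_scores))  # sqrt(3), about 1.732
```

In the toy scores, a single error of 3 gives an MAE of 1.0 but an RMSE of about 1.73, which shows how RMSE penalizes one large deviation more than MAE does.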

Leader Tables

Switch between the overall recommendations and the tier-specific recommendation tables derived from the same operational profile artifacts as the README.

Filtered Models

Rank | Model | Tier | Quality Score (CW%) | Balanced OTS | Critical Miss % | False Review % | Threat Capture %

Chart Gallery

Static benchmark charts from the repo, available alongside the interactive explorer for comparison and documentation parity.