Tracking · 47 LLMs · 22 agents · 9 harnesses
Issue 18 · Updated 2026-05-08 GMT
VOL. 18 · METHODOLOGY-FIRST · CROSS-SOURCE

The standing record of who is fastest, cheapest, and most correct in AI.

Triad is a three-layer benchmark intelligence service. We track every meaningful LLM, every meaningful agent, and the harness each one runs inside. Every score traces to a published source row. The cross-source aggregate is published weekly and audited continuously.

Read this week's digest · View the scoreboard
Tracking 47 LLMs · 22 agents · 9 harnesses · 11 sources · Updated hourly · No vendor sponsorship
FIG. 01 · CROSS-SOURCE SCOREBOARD · WEEK 18

The cross-source scoreboard, three layers, footnoted.

Δ vs last week · Sparkline = last 12 weeks
Layer | Model · harness | Score | $ / 1M | Speed | Δ wk
L1 | Claude Sonnet 4.6 · Anthropic[1] | 76.4 | $3.0 | 112 t/s | +1.4
L1 | GPT-5 mini · OpenAI[2] | 73.1 | $1.8 | 138 t/s | +0.6
L1 | Gemini 2.5 Pro · Google[3] | 71.8 | $1.5 | 164 t/s | −0.3
L2 | Sonnet 4.6 + Claude Code · Anthropic / Claude Code[4] | 62.1 | $3.0 | n/a | +2.1
L2 | GPT-5 mini + Codex CLI · OpenAI / Codex[5] | 58.6 | $1.8 | n/a | +1.4
L2 | Sonnet 4.6 + Cursor agent · Anthropic / Cursor[6] | 57.4 | $3.0 | n/a | +0.9
L3 | Claude Code (Sonnet 4.6) · Anthropic[7] | Δ +14.0 | $0 | n/a | +1.6
L3 | Codex CLI (GPT-5 mini) · OpenAI[8] | Δ +10.6 | $0 | n/a | +1.0
L3 | Cursor (Sonnet 4.6) · Cursor[9] | Δ +8.7 | $0 | n/a | +0.4

Score = cross-source mean across the upstream sources cited per row. $ / 1M = USD per million output tokens. Speed = output tokens/second when reported by the upstream. Δ wk = week-over-week change in cross-source mean. Layer-3 rows report Δ vs plain SDK, not absolute score. Source list at § Methodology · Sources.

FIG. 02 · WHAT WE COVER · THE THREE LAYERS

We measure each layer separately, because the same model in two scaffolds is two different products.

Layer 01 · LLMs

Frontier and open-weights language models, scored across quality, speed, and price.

  • Anthropic — Claude Sonnet 4.6, Claude Opus 4.5
  • OpenAI — GPT-5, GPT-5 mini, GPT-4.1
  • Google — Gemini 2.5 Pro, Gemini 2.5 Flash
  • Meta — Llama 4 405B, Llama 4 70B
  • xAI — Grok 4 Heavy
  • DeepSeek — V3, R2
  • Mistral / Qwen / open-weights

47 models tracked across 11 vendors

Layer 02 · Agents

Agent benchmarks across coding, browsing, OS, and tool-use tasks. Includes product-mode agents.

  • SWE-bench Verified + Lite
  • AgentBench v3 — 8 environments
  • GAIA — browsing + tool-use QA
  • terminal-bench — coding-task harnesses
  • OSWorld — real-OS desktop tasks
  • Product-mode agents — Claude Code, Codex CLI, Cursor agent, Aider, Devin-class

22 agents across 6 benchmark suites

Layer 03 · Harnesses

Same model, same benchmark, different scaffold. The layer no public site measures.

  • Aider — git-aware coding scaffold
  • Claude Code — Anthropic's CLI agent
  • Cursor agent — IDE-embedded scaffold
  • Codex CLI — OpenAI's CLI agent
  • Plain SDK — vanilla provider client
  • Custom — your own internal scaffold (Enterprise)

9 harnesses, scored as Δ vs plain SDK

FIG. 03 · METHODOLOGY · CROSS-SOURCE NORMALISATION

We ingest, we normalise, we ground, we verify, we publish.

Every cell on the scoreboard footnotes the upstream row and the retrieval timestamp. We never accept benchmark sponsorship — funded by subscription only.

Step 01 · Ingest

We pull from a fixed list of upstream sources on a per-source cadence.

Hourly for ArtificialAnalysis. Six-hourly for lmsys arena and SWE-bench. Daily for AgentBench, GAIA, terminal-bench, OSWorld. Weekly for HELM. Vendor blogs and paper PDFs are ingested manually on detection. Every fetch is logged with the source URL, the response hash, and the retrieval timestamp.
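
For illustration only, a fetch-audit record of that shape might be written as below. The function name, file name, and client library are hypothetical; only the three logged fields (source URL, response hash, retrieval timestamp) come from the text above.

```python
import hashlib
import json
from datetime import datetime, timezone

import requests  # assumed HTTP client; any client exposing the raw body works


def log_fetch(source_url: str, log_path: str = "fetch_log.jsonl") -> dict:
    """Fetch one upstream source and append an audit record.

    Records exactly the three facts named above: the source URL,
    a hash of the response body, and the retrieval timestamp.
    """
    resp = requests.get(source_url, timeout=30)
    resp.raise_for_status()
    entry = {
        "source_url": source_url,
        "response_sha256": hashlib.sha256(resp.content).hexdigest(),
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```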

Step 02 · Normalise

We cast every raw row into the canonical Triad schema.

Canonical row: { model_id, harness_id, benchmark_id, score, score_kind, n, ci, source_url, retrieved_at }. Score kinds are not mixed across rows — pass-at-1 patch correctness, Elo, and quality-index points are stored separately. The aggregator reads the canonical table; the upstream is never consumed directly by a public surface.
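
A minimal sketch of that canonical row in Python, assuming concrete types for illustration; the enum values and the plain-SDK sentinel reflect our reading of the text, not a published spec.

```python
from dataclasses import dataclass
from enum import Enum


class ScoreKind(Enum):
    # Score kinds named in the text; kept distinct so they are never averaged together.
    PASS_AT_1 = "pass_at_1"          # patch correctness (e.g. SWE-bench)
    ELO = "elo"                      # pairwise preference (e.g. lmsys arena)
    QUALITY_INDEX = "quality_index"  # index points (e.g. ArtificialAnalysis)


@dataclass(frozen=True)
class CanonicalRow:
    """One normalised benchmark observation; field names follow the schema above."""
    model_id: str
    harness_id: str        # assumed sentinel "plain_sdk" for Layer-1 rows
    benchmark_id: str
    score: float
    score_kind: ScoreKind
    n: int                 # number of tasks or votes behind the score
    ci: float | None       # confidence-interval half-width, when the upstream reports one
    source_url: str
    retrieved_at: str      # ISO-8601 UTC timestamp
```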

Step 03 · Ground

We pin a source_version on every row.

When an upstream changes its scoring rubric, we re-ingest the affected history and surface a 'methodology bumped' footnote on every cell affected for 30 days. We never silently restate a previous week's number; if it changes, the change is footnoted.
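
A sketch of the 30-day footnote rule under those assumptions; the lookup structure is hypothetical, and only the source_version pin and the 30-day window come from the text.

```python
from datetime import date, timedelta

FOOTNOTE_WINDOW = timedelta(days=30)  # the 30-day window named above


def needs_bump_footnote(cell_source_version: str,
                        bumped_versions: dict[str, date],
                        today: date) -> bool:
    """True if the cell's pinned source_version was methodology-bumped
    within the last 30 days, so the cell must carry the footnote."""
    bumped_on = bumped_versions.get(cell_source_version)
    return bumped_on is not None and today - bumped_on <= FOOTNOTE_WINDOW
```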

Step 04 · Verify

We require independent verification before a vendor-quoted row appears.

Vendor blog posts are treated as sources, not as results. A vendor claim must be matched on at least one independent benchmark before it is allowed onto the cross-source scoreboard. The methodology page lists the verification rule and every cell links to the row that satisfied it.
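
Continuing the same illustrative schema, the verification gate could be encoded roughly as follows. The matching criterion and the tolerance are assumptions; the text only requires a match on at least one independent benchmark.

```python
def vendor_row_verified(vendor_row: CanonicalRow,
                        canonical_rows: list[CanonicalRow],
                        tolerance: float = 1.0) -> bool:
    """Allow a vendor-quoted row onto the scoreboard only if an independent
    upstream reports a matching row: same model, harness, and benchmark,
    from a different source, within an assumed score tolerance."""
    return any(
        row.model_id == vendor_row.model_id
        and row.harness_id == vendor_row.harness_id
        and row.benchmark_id == vendor_row.benchmark_id
        and row.source_url != vendor_row.source_url   # must come from a different upstream
        and abs(row.score - vendor_row.score) <= tolerance
        for row in canonical_rows
    )
```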

Step 05 · Publish

We publish weekly with a permanent archive.

The digest goes out Mondays 09:00 GMT. Every digest item is a single-sentence delta with a citation. Custom-slice subscribers can request on-demand cuts of the canonical table; turnaround is 24h. Every published number, in every channel, footnotes its upstream row and the retrieval timestamp.

Methodology · Sources

Code | Source | Coverage | Cadence
AA | ArtificialAnalysis | Quality, speed, and price index for frontier LLMs | Hourly
ARENA | lmsys / chatbot-arena | Pairwise human preference Elo across LLMs | 6h
HELM | Stanford HELM | Holistic evaluation across multiple academic suites | Weekly
SWE | SWE-bench (Verified + Lite) | Real-world software engineering tasks; pass-at-1 patch correctness | 6h
AGB | AgentBench v3 | Multi-environment agent benchmark across 8 environments | Daily
GAIA | GAIA | General AI Assistant evaluation; question answering with browsing | Daily
TBN | terminal-bench | Terminal-style task completion across coding agents | Daily
OSW | OSWorld | Real-OS desktop tasks; agent operates a screen | Daily
VEND | Vendor blogs / paper PDFs | Manual ingest; never published as cross-source rows without independent verification | On detection
  1. Cross-source mean: arithmetic mean of normalised score values across the upstream sources cited per row.
  2. Δ vs plain SDK (Layer 3): the harness gain on SWE-bench Verified, computed as (model + harness) − (model + plain SDK); both computations are restated in the sketch after these notes.
  3. We pin a source_version on every ingested row; methodology bumps surface a 30-day footnote on affected cells.
  4. Triad accepts zero compensation from any model lab, agent vendor, harness vendor, or benchmark organisation. Funded by subscription only.
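
Notes 1 and 2 transcribe directly into code; the sketch below is a literal restatement of those two definitions, not the production aggregator.

```python
from statistics import mean


def cross_source_mean(scores: list[float]) -> float:
    """Note 1: arithmetic mean of the normalised scores from the cited upstreams."""
    return mean(scores)


def harness_delta(model_plus_harness: float, model_plus_plain_sdk: float) -> float:
    """Note 2: Layer-3 harness gain on SWE-bench Verified,
    (model + harness) minus (model + plain SDK)."""
    return model_plus_harness - model_plus_plain_sdk
```
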
FIG. 04 · WEEKLY DIGEST · MONDAYS 09:00 GMT

The digest is the weekly delta. One sentence per item, every claim footnoted.

Subscribe — $29 / month
FROM · TRIAD <digest@benchmark-intel.prin7r.com> · ISSUE 18 · WK 18 · 2026
Sonnet 4.6 widens its lead, but only inside Claude Code — three layer-aware deltas this week.
LLM · Layer 1
Sonnet 4.6 widens its cross-source lead over GPT-5 mini to 3.3 points on the quality index, driven by a +1.4 movement on lmsys week 18 and a +0.9 on SWE-bench Verified.
+1.4
Agent · Layer 2
Sonnet 4.6 + Claude Code is now 14 points ahead of GPT-5 mini + Codex CLI on SWE-bench Verified, but the gap closes to 3 points when both are run inside plain SDK.
+2.1
Harness · Layer 3
Cursor's harness gain over plain SDK on Sonnet 4.6 is now 8.7 points, smaller than Claude Code's +14.0 and Codex CLI's +10.6; the gap is widening, not narrowing.
+0.4
Methodology
AgentBench v3 bumped its task split this week. We have re-ingested the affected history; rows show a 'methodology bumped' footnote for the next 30 days.
Note
Permalink: benchmark-intel.prin7r.com/archive/wk18 · Sources cited inline · No vendor sponsorship.
FIG. 05 · PRICING · CRYPTO RAILS · NAMED-PAYER

Three tiers. NOWPayments crypto rails for individual buyers; wire or USDT for enterprise.

All amounts USD
Tier 01

Reader

The weekly digest, the cross-source scoreboard, and the archive. For the eval lead replacing a private spreadsheet.

$29 / month
  • Weekly digest, Mondays 09:00 GMT
  • Read-only access to the three-layer scoreboard
  • 12-month digest archive
  • Footnote-level provenance on every score
  • Cancel any time, pro-rated refund on unused days

Pay in USDT, USDC, or with a credit card on the NOWPayments hosted page.

Tier 03

Enterprise

Single-tenant scoreboard URL, custom report cadence, named-payer invoice, methodology audit pack. Wire or USDT direct.

$1,499 / month
  • Everything in Custom-slices
  • Single-tenant scoreboard URL on your subdomain
  • Custom report cadence (weekly / monthly / quarterly)
  • Methodology audit pack — defensible against legal/risk review
  • Quarterly methodology review call
  • Wire transfer or USDT direct invoice
Talk to founder — Enterprise pilot

You will receive a reply within 48h with a single-tenant pilot URL.

FIG. 06 · FAQ · METHODOLOGY-FIRST

Common questions about sources, refunds, and audits.

If your question isn't here, the methodology page covers everything in long form, or email founder@prin7r.com.

How is this different from ArtificialAnalysis?

ArtificialAnalysis covers Layer 1 (LLMs) extremely well. Triad covers Layer 1 plus Layer 2 (Agents) and Layer 3 (Harnesses), and we cross-source against multiple upstreams including ArtificialAnalysis itself. We cite them when their data feeds a row.

How is this different from Stanford HELM?

HELM is rigorous but cycles every 4–8 weeks. We surface deltas hourly. We use HELM as one of our weekly sources rather than as the only source.

Where do you get the numbers?

We aggregate from a fixed list of upstreams: lmsys arena, ArtificialAnalysis, HELM, SWE-bench, AgentBench, GAIA, terminal-bench, OSWorld, plus vendor blogs and paper PDFs. The methodology page lists every source. Every cell on the scoreboard footnotes the upstream row and the retrieval timestamp.

What happens when a source changes its methodology?

We pin a source_version on every ingested row. When an upstream bumps its methodology, we re-ingest the affected history and surface a 'methodology bumped' footnote on every cell affected for 30 days.

Do you run your own benchmarks?

Not in Wave 2. We aggregate published benchmarks. If we ever decide to run our own — to fill a gap — we will publish the methodology and the conflict-of-interest implications first.

What is your conflict-of-interest policy?

We accept zero compensation from any model lab, agent vendor, harness vendor, or benchmark organisation. Funded by subscription only. Published as a permanent page on the site.

How do I cancel?

Reader and Custom-slices: cancel any time, pro-rated refund on unused days within the same billing month. Enterprise: per signed contract.

Why is checkout crypto-only on the landing?

The NOWPayments hosted invoice converts at the merchant level, so readers can pay with USDT, USDC, or a credit card on that page. The Enterprise tier accepts wire transfer or a USDT direct invoice via founder@prin7r.com.