Tracking · 47 LLMs · 22 agents · 9 harnesses
Issue 18 · Updated 2026-05-08 GMT
VOL. 18 · METHODOLOGY-FIRST · CROSS-SOURCE

The standing record of who is fastest, cheapest, and most correct in AI.

Triad is a three-layer benchmark intelligence service. We track every meaningful LLM, every meaningful agent, and the harness each one runs inside. Every score traces to a published source row. The cross-source aggregate is published weekly and audited continuously.

Read this week's digest · View the scoreboard
Tracking 47 LLMs · 22 agents · 9 harnesses · 11 sources · Updated hourly · No vendor sponsorship
FIG. 01 · CROSS-SOURCE SCOREBOARD · WEEK 18

The cross-source scoreboard, three layers, footnoted.

Δ vs last week · Sparkline = last 12 weeks
Layer | Model · harness | Score | $ / 1M | Speed | Δ wk
L1 | Claude Sonnet 4.6 · Anthropic[1] | 76.4 | $3.0 | 112 t/s | +1.4
L1 | GPT-5 mini · OpenAI[2] | 73.1 | $1.8 | 138 t/s | +0.6
L1 | Gemini 2.5 Pro · Google[3] | 71.8 | $1.5 | 164 t/s | −0.3
L2 | Sonnet 4.6 + Claude Code · Anthropic / Claude Code[4] | 62.1 | $3.0 | n/a | +2.1
L2 | GPT-5 mini + Codex CLI · OpenAI / Codex[5] | 58.6 | $1.8 | n/a | +1.4
L2 | Sonnet 4.6 + Cursor agent · Anthropic / Cursor[6] | 57.4 | $3.0 | n/a | +0.9
L3 | Claude Code (Sonnet 4.6) · Anthropic[7] | Δ +14.0 | $0 | n/a | +1.6
L3 | Codex CLI (GPT-5 mini) · OpenAI[8] | Δ +10.6 | $0 | n/a | +1.0
L3 | Cursor (Sonnet 4.6) · Cursor[9] | Δ +8.7 | $0 | n/a | +0.4

Score = cross-source mean across the upstream sources cited per row. $ / 1M = USD per million output tokens. Speed = output tokens/second when reported by the upstream. Δ wk = week-over-week change in cross-source mean. Layer-3 rows report Δ vs plain SDK, not absolute score. Source list at § Methodology · Sources.

FIG. 02 · WHAT WE COVER · THE THREE LAYERS

We measure each layer separately, because the same model in two scaffolds is two different products.

Layer 01 · LLMs

Frontier and open-weights language models, scored across quality, speed, and price.

  • Anthropic — Claude Sonnet 4.6, Claude Opus 4.5
  • OpenAI — GPT-5, GPT-5 mini, GPT-4.1
  • Google — Gemini 2.5 Pro, Gemini 2.5 Flash
  • Meta — Llama 4 405B, Llama 4 70B
  • xAI — Grok 4 Heavy
  • DeepSeek — V3, R2
  • Mistral / Qwen / open-weights

47 models tracked across 11 vendors

Layer 02 · Agents

Agent benchmarks across coding, browsing, OS, and tool-use tasks. Includes product-mode agents.

  • SWE-bench Verified + Lite
  • AgentBench v3 — 8 environments
  • GAIA — browsing + tool-use QA
  • terminal-bench — coding-task harnesses
  • OSWorld — real-OS desktop tasks
  • Product-mode agents — Claude Code, Codex CLI, Cursor agent, Aider, Devin-class

22 agents across 6 benchmark suites

Layer 03 · Harnesses

Same model, same benchmark, different scaffold. The layer no public site measures.

  • Aider — git-aware coding scaffold
  • Claude Code — Anthropic's CLI agent
  • Cursor agent — IDE-embedded scaffold
  • Codex CLI — OpenAI's CLI agent
  • Plain SDK — vanilla provider client
  • Custom — your own internal scaffold (Enterprise)

9 harnesses, scored as Δ vs plain SDK

FIG. 03 · METHODOLOGY · CROSS-SOURCE NORMALISATION

We ingest, we normalise, we ground, we verify, we publish.

Every cell on the scoreboard footnotes the upstream row and the retrieval timestamp. We never accept benchmark sponsorship — funded by subscription only.

Step 01 · Ingest

We pull from a fixed list of upstream sources on a per-source cadence.

Hourly for ArtificialAnalysis. Six-hourly for lmsys arena and SWE-bench. Daily for AgentBench, GAIA, terminal-bench, OSWorld. Weekly for HELM. Vendor blogs and paper PDFs are ingested manually on detection. Every fetch is logged with the source URL, the response hash, and the retrieval timestamp.
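
For illustration only, a fetch-audit record of that shape might be written as below. The function name, file name, and client library are hypothetical; only the three logged fields (source URL, response hash, retrieval timestamp) come from the text above.

```python
import hashlib
import json
from datetime import datetime, timezone

import requests  # assumed HTTP client; any client exposing the raw body works


def log_fetch(source_url: str, log_path: str = "fetch_log.jsonl") -> dict:
    """Fetch one upstream source and append an audit record.

    Records exactly the three facts named above: the source URL,
    a hash of the response body, and the retrieval timestamp.
    """
    resp = requests.get(source_url, timeout=30)
    resp.raise_for_status()
    entry = {
        "source_url": source_url,
        "response_sha256": hashlib.sha256(resp.content).hexdigest(),
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```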

Step 02 · Normalise

We cast every raw row into the canonical Triad schema.

Canonical row: { model_id, harness_id, benchmark_id, score, score_kind, n, ci, source_url, retrieved_at }. Score kinds are not mixed across rows — pass-at-1 patch correctness, Elo, and quality-index points are stored separately. The aggregator reads the canonical table; the upstream is never consumed directly by a public surface.
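
A minimal sketch of that canonical row in Python, assuming concrete types for illustration; the enum values and the plain-SDK sentinel reflect our reading of the text, not a published spec.

```python
from dataclasses import dataclass
from enum import Enum


class ScoreKind(Enum):
    # Score kinds named in the text; kept distinct so they are never averaged together.
    PASS_AT_1 = "pass_at_1"          # patch correctness (e.g. SWE-bench)
    ELO = "elo"                      # pairwise preference (e.g. lmsys arena)
    QUALITY_INDEX = "quality_index"  # index points (e.g. ArtificialAnalysis)


@dataclass(frozen=True)
class CanonicalRow:
    """One normalised benchmark observation; field names follow the schema above."""
    model_id: str
    harness_id: str        # assumed sentinel "plain_sdk" for Layer-1 rows
    benchmark_id: str
    score: float
    score_kind: ScoreKind
    n: int                 # number of tasks or votes behind the score
    ci: float | None       # confidence-interval half-width, when the upstream reports one
    source_url: str
    retrieved_at: str      # ISO-8601 UTC timestamp
```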

Step 03 · Ground

We pin a source_version on every row.

When an upstream changes its scoring rubric, we re-ingest the affected history and surface a 'methodology bumped' footnote on every cell affected for 30 days. We never silently restate a previous week's number; if it changes, the change is footnoted.
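
A sketch of the 30-day footnote rule under those assumptions; the lookup structure is hypothetical, and only the source_version pin and the 30-day window come from the text.

```python
from datetime import date, timedelta

FOOTNOTE_WINDOW = timedelta(days=30)  # the 30-day window named above


def needs_bump_footnote(cell_source_version: str,
                        bumped_versions: dict[str, date],
                        today: date) -> bool:
    """True if the cell's pinned source_version was methodology-bumped
    within the last 30 days, so the cell must carry the footnote."""
    bumped_on = bumped_versions.get(cell_source_version)
    return bumped_on is not None and today - bumped_on <= FOOTNOTE_WINDOW
```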

Step 04 · Verify

We require independent verification before a vendor-quoted row appears.

Vendor blog posts are treated as sources, not as results. A vendor claim must be matched on at least one independent benchmark before it is allowed onto the cross-source scoreboard. The methodology page lists the verification rule and every cell links to the row that satisfied it.
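
Continuing the same illustrative schema, the verification gate could be encoded roughly as follows. The matching criterion and the tolerance are assumptions; the text only requires a match on at least one independent benchmark.

```python
def vendor_row_verified(vendor_row: CanonicalRow,
                        canonical_rows: list[CanonicalRow],
                        tolerance: float = 1.0) -> bool:
    """Allow a vendor-quoted row onto the scoreboard only if an independent
    upstream reports a matching row: same model, harness, and benchmark,
    from a different source, within an assumed score tolerance."""
    return any(
        row.model_id == vendor_row.model_id
        and row.harness_id == vendor_row.harness_id
        and row.benchmark_id == vendor_row.benchmark_id
        and row.source_url != vendor_row.source_url   # must come from a different upstream
        and abs(row.score - vendor_row.score) <= tolerance
        for row in canonical_rows
    )
```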

Step 05 · Publish

We publish weekly with a permanent archive.

The digest goes out Mondays 09:00 GMT. Every digest item is a single-sentence delta with a citation. Custom-slice subscribers can request on-demand cuts of the canonical table; turnaround is 24h. Every published number, in every channel, footnotes its upstream row and the retrieval timestamp.

Methodology · Sources

Code | Source | Coverage | Cadence
AA | ArtificialAnalysis | Quality, speed, and price index for frontier LLMs | Hourly
ARENA | lmsys / chatbot-arena | Pairwise human preference Elo across LLMs | 6h
HELM | Stanford HELM | Holistic evaluation across multiple academic suites | Weekly
SWE | SWE-bench (Verified + Lite) | Real-world software engineering tasks; pass-at-1 patch correctness | 6h
AGB | AgentBench v3 | Multi-environment agent benchmark across 8 environments | Daily
GAIA | GAIA | General AI Assistant evaluation; question answering with browsing | Daily
TBN | terminal-bench | Terminal-style task completion across coding agents | Daily
OSW | OSWorld | Real-OS desktop tasks; agent operates a screen | Daily
VEND | Vendor blogs / paper PDFs | Manual ingest; never published as cross-source rows without independent verification | On detection
  1. Cross-source mean: arithmetic mean of normalised score values across the upstream sources cited per row.
  2. Δ vs plain SDK (Layer 3): the harness gain on SWE-bench Verified, computed as (model + harness) − (model + plain SDK); both computations are restated in the sketch after these notes.
  3. We pin a source_version on every ingested row; methodology bumps surface a 30-day footnote on affected cells.
  4. Triad accepts zero compensation from any model lab, agent vendor, harness vendor, or benchmark organisation. Funded by subscription only.
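
Notes 1 and 2 transcribe directly into code; the sketch below is a literal restatement of those two definitions, not the production aggregator.

```python
from statistics import mean


def cross_source_mean(scores: list[float]) -> float:
    """Note 1: arithmetic mean of the normalised scores from the cited upstreams."""
    return mean(scores)


def harness_delta(model_plus_harness: float, model_plus_plain_sdk: float) -> float:
    """Note 2: Layer-3 harness gain on SWE-bench Verified,
    (model + harness) minus (model + plain SDK)."""
    return model_plus_harness - model_plus_plain_sdk
```
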
FIG. 04 · WEEKLY DIGEST · MONDAYS 09:00 GMT

The digest is the weekly delta. One sentence per item, every claim footnoted.

Subscribe — $29 / month
FROM · TRIAD <digest@benchmark-intel.prin7r.com> · ISSUE 18 · WK 18 · 2026
Sonnet 4.6 widens its lead, but only inside Claude Code — three layer-aware deltas this week.
LLM · Layer 1
Sonnet 4.6 widens its cross-source lead over GPT-5 mini to 3.3 points on the quality index, driven by a +1.4 movement on lmsys week 18 and a +0.9 on SWE-bench Verified.
+1.4
Agent · Layer 2
Sonnet 4.6 + Claude Code is now 14 points ahead of GPT-5 mini + Codex CLI on SWE-bench Verified, but the gap closes to 3 points when both are run inside plain SDK.
+2.1
Harness · Layer 3
Cursor's harness gain over plain SDK on Sonnet 4.6 is now 8.7 points, smaller than Claude Code's +14.0 and Codex CLI's +10.6; the gap is widening, not narrowing.
+0.4
Methodology
AgentBench v3 bumped its task split this week. We have re-ingested the affected history; rows show a 'methodology bumped' footnote for the next 30 days.
Note
Permalink: benchmark-intel.prin7r.com/archive/wk18 · Sources cited inline · No vendor sponsorship.
FIG. 05 · PRICING · CRYPTO RAILS · NAMED-PAYER

Three tiers. NOWPayments crypto rails for individual buyers; wire or USDT for enterprise.

All amounts USD
Tier 01

Reader

The weekly digest, the cross-source scoreboard, and the archive. For the eval lead replacing a private spreadsheet.

$29 / month
  • Weekly digest, Mondays 09:00 GMT
  • Read-only access to the three-layer scoreboard
  • 12-month digest archive
  • Footnote-level provenance on every score
  • Cancel any time, pro-rated refund on unused days

Pay in USDT, USDC, or with a credit card on the NOWPayments hosted page.

Tier 03

Enterprise

Single-tenant scoreboard URL, custom report cadence, named-payer invoice, methodology audit pack. Wire or USDT direct.

$1,499 / month
  • Everything in Custom-slices
  • Single-tenant scoreboard URL on your subdomain
  • Custom report cadence (weekly / monthly / quarterly)
  • Methodology audit pack — defensible against legal/risk review
  • Quarterly methodology review call
  • Wire transfer or USDT direct invoice
Talk to founder — Enterprise pilot

You will receive a reply within 48h with a single-tenant pilot URL.

FIG. 06 · FAQ · METHODOLOGY-FIRST

Common questions about sources, refunds, and audits.

If your question isn't here, the methodology page covers everything in long form, or email founder@prin7r.com.

How is this different from ArtificialAnalysis?

ArtificialAnalysis covers Layer 1 (LLMs) extremely well. Triad covers Layer 1 plus Layer 2 (Agents) and Layer 3 (Harnesses), and we cross-source against multiple upstreams including ArtificialAnalysis itself. We cite them when their data feeds a row.

How is this different from Stanford HELM?

HELM is rigorous but cycles every 4–8 weeks. We surface deltas hourly. We use HELM as one of our weekly sources rather than as the only source.

Where do you get the numbers?

We aggregate from a fixed list of upstreams: lmsys arena, ArtificialAnalysis, HELM, SWE-bench, AgentBench, GAIA, terminal-bench, OSWorld, plus vendor blogs and paper PDFs. The methodology page lists every source. Every cell on the scoreboard footnotes the upstream row and the retrieval timestamp.

What happens when a source changes its methodology?

We pin a source_version on every ingested row. When an upstream bumps its methodology, we re-ingest the affected history and surface a 'methodology bumped' footnote on every cell affected for 30 days.

Do you run your own benchmarks?

Not in Wave 2. We aggregate published benchmarks. If we ever decide to run our own — to fill a gap — we will publish the methodology and the conflict-of-interest implications first.

What is your conflict-of-interest policy?

We accept zero compensation from any model lab, agent vendor, harness vendor, or benchmark organisation. Funded by subscription only. Published as a permanent page on the site.

How do I cancel?

Reader and Custom-slices: cancel any time, pro-rated refund on unused days within the same billing month. Enterprise: per signed contract.

Why is checkout crypto-only on the landing?

The NOWPayments hosted invoice converts at the merchant level, so readers can pay with USDT, USDC, or a credit card on that page. The Enterprise tier accepts wire transfer or a USDT direct invoice via founder@prin7r.com.