About the data

This data is aggregated from real AI coding sessions across Cadence customers' engineering teams. Sessions are recorded by the cadence-cli tool installed in each customer organization and anonymized before being aggregated for this site. Only model and harness combinations with sufficient recent activity are shown. The site updates daily.

Output tokens per second

How fast each model generates output during active session time. Grouped by model. Higher = better.

How is this measured?

This metric divides top-level output tokens by active session time for each session, then averages those session-level values by day. It focuses on model output during work time and excludes sessions with no active-time measurement. For v1, harnesses are averaged away so the chart reads as model-level speed rather than tool-level speed. Differences in task mix, coding language, and session length can still influence the result, so this should be read as a field signal rather than a lab benchmark.
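
The sketch below shows one way the aggregation described above could be computed. It is a minimal illustration, not Cadence's actual pipeline; the record structure and field names (output_tokens, active_seconds) are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical session records; field names are illustrative, not the real schema.
sessions = [
    {"day": "2024-06-01", "model": "model-a", "output_tokens": 12000, "active_seconds": 300},
    {"day": "2024-06-01", "model": "model-a", "output_tokens": 8000, "active_seconds": 250},
    {"day": "2024-06-01", "model": "model-b", "output_tokens": 9000, "active_seconds": 400},
]

# Per-session tokens/sec, grouped by (day, model); sessions with no active-time
# measurement are excluded, and harness is ignored, mirroring the v1 behaviour above.
per_group = defaultdict(list)
for s in sessions:
    if s["active_seconds"] > 0:
        per_group[(s["day"], s["model"])].append(s["output_tokens"] / s["active_seconds"])

# Daily chart value per model = mean of the session-level speeds.
daily_speed = {key: mean(values) for key, values in per_group.items()}
print(daily_speed)
```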

Frustration rate

Share of user messages where the developer expressed frustration. Grouped by model and harness. Lower = better.

How is this measured?

This metric divides the number of messages flagged as frustrated by the total number of user messages in a session, then averages those session-level rates by day. It captures explicit signs of irritation in developer messages and does not attempt to infer hidden sentiment. The signal can be affected by team communication style and task difficulty, so it is most useful as a comparative friction indicator across sufficiently active model and harness combinations.
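
As a rough sketch of the same averaging pattern, assuming a hypothetical frustrated flag already attached to each user message (the actual flagging method is not described here):

```python
from collections import defaultdict
from statistics import mean

# Illustrative sessions, each with a hypothetical list of flagged user messages.
sessions = [
    {"day": "2024-06-01", "model": "model-a", "harness": "harness-x",
     "messages": [{"frustrated": False}, {"frustrated": True}, {"frustrated": False}]},
    {"day": "2024-06-01", "model": "model-a", "harness": "harness-x",
     "messages": [{"frustrated": False}, {"frustrated": False}]},
]

per_group = defaultdict(list)
for s in sessions:
    if s["messages"]:  # sessions with no user messages contribute nothing
        rate = sum(m["frustrated"] for m in s["messages"]) / len(s["messages"])
        per_group[(s["day"], s["model"], s["harness"])].append(rate)

# Daily frustration rate per model/harness = mean of session-level rates.
daily_rate = {key: mean(values) for key, values in per_group.items()}
print(daily_rate)
```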

Context-hunting vs implementation ratio

Time spent finding context relative to time spent implementing. Grouped by model and harness. Lower = better.

How is this measured?

This metric divides LLM-classified context-hunting time by LLM-classified implementation time for each session, then averages the ratio by day. A value of 1.0 means equal context-hunting and implementation time. Sessions without implementation time are excluded from this calculation. The classifier is designed to summarize broad workflow phases, so read small day-to-day differences cautiously and focus on persistent separation between series.
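
A minimal sketch of the ratio calculation, assuming hypothetical per-session totals of classified context-hunting and implementation seconds:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-session totals of LLM-classified time, in seconds.
sessions = [
    {"day": "2024-06-01", "model": "model-a", "harness": "harness-x",
     "context_seconds": 600, "implementation_seconds": 1200},
    {"day": "2024-06-01", "model": "model-a", "harness": "harness-x",
     "context_seconds": 900, "implementation_seconds": 0},  # excluded: no implementation time
]

per_group = defaultdict(list)
for s in sessions:
    if s["implementation_seconds"] > 0:
        per_group[(s["day"], s["model"], s["harness"])].append(
            s["context_seconds"] / s["implementation_seconds"]
        )

# 0.5 here means half as much context-hunting as implementation; 1.0 means equal time.
daily_ratio = {key: mean(values) for key, values in per_group.items()}
print(daily_ratio)
```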

Meaningful outcome rate

Share of sessions that produced a meaningful outcome. Grouped by model and harness. Higher = better.

How is this measured?

This metric counts sessions classified as producing a meaningful outcome and divides by all classified sessions for the same model, harness, and day. It is intended to separate sessions that actually moved work forward from sessions that stalled, wandered, or ended without a useful result. It does not rank the size or business value of the outcome, and task difficulty can vary across teams and tools.
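
Unlike the session-averaged metrics above, this one is a simple count over all classified sessions in a day. A sketch, assuming a hypothetical meaningful flag per session:

```python
from collections import defaultdict

# Hypothetical classified sessions; `meaningful` stands in for the classifier's verdict.
sessions = [
    {"day": "2024-06-01", "model": "model-a", "harness": "harness-x", "meaningful": True},
    {"day": "2024-06-01", "model": "model-a", "harness": "harness-x", "meaningful": False},
    {"day": "2024-06-01", "model": "model-a", "harness": "harness-x", "meaningful": True},
]

# Count (meaningful, total) per (day, model, harness) and divide.
counts = defaultdict(lambda: [0, 0])
for s in sessions:
    key = (s["day"], s["model"], s["harness"])
    counts[key][0] += s["meaningful"]
    counts[key][1] += 1

daily_outcome_rate = {key: good / total for key, (good, total) in counts.items()}
print(daily_outcome_rate)  # about 0.67 for this example
```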

Failed tool call rate

Share of tool calls that failed. Grouped by model and harness. Lower = better.

How is this measured?

This metric divides failed tool calls by total tool calls for each session, then averages those session-level rates by day. It includes tool failures visible in the recorded session stream and excludes sessions with no tool calls. A lower rate generally indicates smoother execution, but some harnesses expose more granular tool events than others, so compare sustained patterns rather than isolated spikes.
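
The same session-averaging pattern applies here; a brief sketch with hypothetical per-session tool call counts:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-session tool call counts.
sessions = [
    {"day": "2024-06-01", "model": "model-a", "harness": "harness-x",
     "failed_tool_calls": 3, "total_tool_calls": 40},
    {"day": "2024-06-01", "model": "model-a", "harness": "harness-x",
     "failed_tool_calls": 0, "total_tool_calls": 0},  # excluded: no tool calls
]

per_group = defaultdict(list)
for s in sessions:
    if s["total_tool_calls"] > 0:
        per_group[(s["day"], s["model"], s["harness"])].append(
            s["failed_tool_calls"] / s["total_tool_calls"]
        )

# Daily failure rate per model/harness = mean of session-level rates.
daily_failure_rate = {key: mean(values) for key, values in per_group.items()}
print(daily_failure_rate)
```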

About Cadence

Cadence helps engineering organizations measure the friction in their AI-assisted development. We aggregate session telemetry from our customer base to surface where AI tools succeed and where they break down in real engineering work.
