About the data

This data is aggregated from real AI coding sessions across Cadence customers' engineering teams. Sessions are recorded by the cadence-cli tool installed in each customer organization and anonymized before being aggregated for this site. Only model and harness combinations with sufficient recent activity are shown. The site updates daily.

Output tokens per second

How fast each model generates output during active session time. Grouped by model. Higher = better.

How is this measured?

This metric divides top-level output tokens by active session time for each session, then averages those session-level values by day. It focuses on model output during work time and excludes sessions with no active-time measurement. For v1, harnesses are averaged away so the chart reads as model-level speed rather than tool-level speed. Differences in task mix, coding language, and session length can still influence the result, so this should be read as a field signal rather than a lab benchmark.
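
The sketch below shows one way the aggregation described above could be computed. It is a minimal illustration, not Cadence's actual pipeline; the record structure and field names (output_tokens, active_seconds) are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical session records; field names are illustrative, not the real schema.
sessions = [
    {"day": "2024-06-01", "model": "model-a", "output_tokens": 12000, "active_seconds": 300},
    {"day": "2024-06-01", "model": "model-a", "output_tokens": 8000, "active_seconds": 250},
    {"day": "2024-06-01", "model": "model-b", "output_tokens": 9000, "active_seconds": 400},
]

# Per-session tokens/sec, grouped by (day, model); sessions with no active-time
# measurement are excluded, and harness is ignored, mirroring the v1 behaviour above.
per_group = defaultdict(list)
for s in sessions:
    if s["active_seconds"] > 0:
        per_group[(s["day"], s["model"])].append(s["output_tokens"] / s["active_seconds"])

# Daily chart value per model = mean of the session-level speeds.
daily_speed = {key: mean(values) for key, values in per_group.items()}
print(daily_speed)
```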

Frustration rate

Share of user messages where the developer expressed frustration. Grouped by model and harness. Lower = better.

How is this measured?

This metric divides the number of messages flagged as frustrated by the total number of user messages in a session, then averages those session-level rates by day. It captures explicit signs of irritation in developer messages and does not attempt to infer hidden sentiment. The signal can be affected by team communication style and task difficulty, so it is most useful as a comparative friction indicator across sufficiently active model and harness combinations.
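
As a rough sketch of the same averaging pattern, assuming a hypothetical frustrated flag already attached to each user message (the actual flagging method is not described here):

```python
from collections import defaultdict
from statistics import mean

# Illustrative sessions, each with a hypothetical list of flagged user messages.
sessions = [
    {"day": "2024-06-01", "model": "model-a", "harness": "harness-x",
     "messages": [{"frustrated": False}, {"frustrated": True}, {"frustrated": False}]},
    {"day": "2024-06-01", "model": "model-a", "harness": "harness-x",
     "messages": [{"frustrated": False}, {"frustrated": False}]},
]

per_group = defaultdict(list)
for s in sessions:
    if s["messages"]:  # sessions with no user messages contribute nothing
        rate = sum(m["frustrated"] for m in s["messages"]) / len(s["messages"])
        per_group[(s["day"], s["model"], s["harness"])].append(rate)

# Daily frustration rate per model/harness = mean of session-level rates.
daily_rate = {key: mean(values) for key, values in per_group.items()}
print(daily_rate)
```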

Context-hunting vs implementation ratio

Time spent finding context relative to time spent implementing. Grouped by model and harness. Lower = better.

How is this measured?

This metric divides LLM-classified context-hunting time by LLM-classified implementation time for each session, then averages the ratio by day. A value of 1.0 means equal context-hunting and implementation time. Sessions without implementation time are excluded from this calculation. The classifier is designed to summarize broad workflow phases, so read small day-to-day differences cautiously and focus on persistent separation between series.
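
A minimal sketch of the ratio calculation, assuming hypothetical per-session totals of classified context-hunting and implementation seconds:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-session totals of LLM-classified time, in seconds.
sessions = [
    {"day": "2024-06-01", "model": "model-a", "harness": "harness-x",
     "context_seconds": 600, "implementation_seconds": 1200},
    {"day": "2024-06-01", "model": "model-a", "harness": "harness-x",
     "context_seconds": 900, "implementation_seconds": 0},  # excluded: no implementation time
]

per_group = defaultdict(list)
for s in sessions:
    if s["implementation_seconds"] > 0:
        per_group[(s["day"], s["model"], s["harness"])].append(
            s["context_seconds"] / s["implementation_seconds"]
        )

# 0.5 here means half as much context-hunting as implementation; 1.0 means equal time.
daily_ratio = {key: mean(values) for key, values in per_group.items()}
print(daily_ratio)
```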

Meaningful outcome rate

Share of sessions that produced a meaningful outcome. Grouped by model and harness. Higher = better.

How is this measured?

This metric counts sessions classified as producing a meaningful outcome and divides by all classified sessions for the same model, harness, and day. It is intended to separate sessions that actually moved work forward from sessions that stalled, wandered, or ended without a useful result. It does not rank the size or business value of the outcome, and task difficulty can vary across teams and tools.
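
Unlike the session-averaged metrics above, this one is a simple count over all classified sessions in a day. A sketch, assuming a hypothetical meaningful flag per session:

```python
from collections import defaultdict

# Hypothetical classified sessions; `meaningful` stands in for the classifier's verdict.
sessions = [
    {"day": "2024-06-01", "model": "model-a", "harness": "harness-x", "meaningful": True},
    {"day": "2024-06-01", "model": "model-a", "harness": "harness-x", "meaningful": False},
    {"day": "2024-06-01", "model": "model-a", "harness": "harness-x", "meaningful": True},
]

# Count (meaningful, total) per (day, model, harness) and divide.
counts = defaultdict(lambda: [0, 0])
for s in sessions:
    key = (s["day"], s["model"], s["harness"])
    counts[key][0] += s["meaningful"]
    counts[key][1] += 1

daily_outcome_rate = {key: good / total for key, (good, total) in counts.items()}
print(daily_outcome_rate)  # about 0.67 for this example
```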

Failed tool call rate

Share of tool calls that failed. Grouped by model and harness. Lower = better.

How is this measured?

This metric divides failed tool calls by total tool calls for each session, then averages those session-level rates by day. It includes tool failures visible in the recorded session stream and excludes sessions with no tool calls. A lower rate generally indicates smoother execution, but some harnesses expose more granular tool events than others, so compare sustained patterns rather than isolated spikes.
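
The same session-averaging pattern applies here; a brief sketch with hypothetical per-session tool call counts:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-session tool call counts.
sessions = [
    {"day": "2024-06-01", "model": "model-a", "harness": "harness-x",
     "failed_tool_calls": 3, "total_tool_calls": 40},
    {"day": "2024-06-01", "model": "model-a", "harness": "harness-x",
     "failed_tool_calls": 0, "total_tool_calls": 0},  # excluded: no tool calls
]

per_group = defaultdict(list)
for s in sessions:
    if s["total_tool_calls"] > 0:
        per_group[(s["day"], s["model"], s["harness"])].append(
            s["failed_tool_calls"] / s["total_tool_calls"]
        )

# Daily failure rate per model/harness = mean of session-level rates.
daily_failure_rate = {key: mean(values) for key, values in per_group.items()}
print(daily_failure_rate)
```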

About Cadence

Cadence helps engineering organizations measure the friction in their AI-assisted development. We aggregate session telemetry from our customer base to surface where AI tools succeed and where they break down in real engineering work.
