Open-source LLM eval tool adds blind comparisons and cognitive posture maps

A new open-source LLM evaluation tool uses blind side-by-side comparisons and cognitive posture heat maps to reduce bias and expose response patterns like sycophancy or hallucination cascades [devto].

sources[devto]

Most LLM evaluations rely on LLM-as-judge setups, where a second model scores outputs against a rubric. That method is expensive, slow, and prone to variance [devto].

A new open-source tool pairs two independent quality signals to improve reliability. First, it runs a blind side-by-side comparison: two agents respond to the same prompt, and a judge model evaluates both responses without knowing which agent produced which. This reduces brand bias and leads to less score inflation than single-response grading [devto].
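To make the blind setup concrete, here is a minimal Python sketch. It is not the tool's code: the function `blind_compare`, the `judge` callable, and the stand-in `toy_judge` are all hypothetical names chosen for illustration. The idea it shows is only that the judge receives two anonymous responses in randomized order, and the verdict is mapped back to an agent afterwards.

```python
import random
from typing import Callable, Dict

# Hypothetical judge signature: (prompt, response_a, response_b) -> "A" or "B".
Judge = Callable[[str, str, str], str]


def blind_compare(prompt: str, responses: Dict[str, str],
                  judge: Judge, rng: random.Random) -> str:
    """Return the name of the agent whose response the judge preferred.

    The judge never sees agent names, and the A/B order is shuffled so
    that position cannot act as a proxy for identity.
    """
    names = list(responses)
    if len(names) != 2:
        raise ValueError("blind_compare expects exactly two agents")
    rng.shuffle(names)
    labels = {"A": names[0], "B": names[1]}
    verdict = judge(prompt, responses[labels["A"]], responses[labels["B"]])
    if verdict not in labels:
        raise ValueError(f"unexpected verdict: {verdict!r}")
    return labels[verdict]


def toy_judge(prompt: str, response_a: str, response_b: str) -> str:
    """Stand-in for a judge model (assumption): prefers the shorter text."""
    return "A" if len(response_a) <= len(response_b) else "B"


if __name__ == "__main__":
    answers = {
        "agent_x": "Paris.",
        "agent_y": "What a wonderful question! The capital is Paris.",
    }
    winner = blind_compare("Capital of France?", answers, toy_judge,
                           random.Random(0))
    print(winner)
```

In a real setup `toy_judge` would be replaced by a call to a judge model. Passing an explicit `random.Random` keeps the shuffling reproducible across runs.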

Second, the tool generates cognitive posture heat maps: deterministic text analysis that visualizes how a model constructs its response. Rather than only checking alignment with the user, the maps flag patterns such as sycophancy compounding, where models amplify flattery, or hallucination cascades, where one false claim triggers more.
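The article does not describe how the tool computes its maps, so the following Python sketch only illustrates what "deterministic text analysis" can mean in practice. The marker phrases, category names, and the `posture_matrix` function are assumptions invented for this example. It counts marker phrases per sentence, giving a sentence-by-category grid that could be rendered as a heat map.

```python
import re
from typing import Dict, List

# Hypothetical marker lists (assumption, not the tool's actual rules).
MARKERS: Dict[str, List[str]] = {
    "flattery": ["great question", "you're absolutely right", "excellent point"],
    "hedging": ["might", "perhaps", "possibly"],
}


def posture_matrix(text: str,
                   markers: Dict[str, List[str]] = MARKERS) -> List[Dict[str, int]]:
    """Return one row per sentence with a marker count per category.

    The same input always yields the same output: no model call and no
    sampling, only substring counting on lowercased sentences.
    """
    # Naive split on whitespace that follows sentence-ending punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    rows = []
    for sentence in sentences:
        lowered = sentence.lower()
        rows.append({
            category: sum(lowered.count(phrase) for phrase in phrases)
            for category, phrases in markers.items()
        })
    return rows


if __name__ == "__main__":
    sample = "Great question! You're absolutely right about that. It might rain, perhaps."
    for row in posture_matrix(sample):
        print(row)
```

Substring counting is crude (for example, "might" also matches inside "mighty"), but it shows the property the article highlights: the grid is reproducible, and every cell can be traced back to specific words in the response.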

Why it matters:

Dual signals make evaluations harder to game. A model can’t score well through flattery or verbosity alone if the cognitive posture reveals manipulative framing.
Heat maps are reproducible and inspectable, unlike opaque judge scores. Teams can audit why a response scored poorly, not just that it did.
The tool is MIT-licensed, allowing developers to adapt it for benchmarks, red teaming, or internal model validation.

The blind comparison setup already shows divergences from standard benchmarks—models that lead on leaderboards don’t always win head-to-head when judged on clarity and neutrality.
