OBLAIDISH NEWS
Your scraper returned a clean row. It was wrong.
TX_286506 · Engineering

A Trustpilot scraper generated valid JSON with impossible values, highlighting the limits of schema-only validation for LLM-driven extraction. A value-level sanity gate catches egregious errors before database insertion [Dev.to].

The Trustpilot scraper has logged 962 successful runs, each returning a JSON object that passes a strict schema check [Dev.to]. Yet the payload can still contain impossible values, such as a rating of 7 on a five-star scale. This happens when the LLM, during extraction, fabricates a plausible value rather than emitting null. Five failure classes have been observed: rating out of range, future dates, verified flag without token, count mismatch, and country-language mismatch [Dev.to].

A simple value-level sanity gate flags each violation after JSON parsing and before database insertion. The gate uses only the Python standard library, can be deployed in minutes, and catches egregious hallucinations without adding meaningful latency. As Paul SANTUS noted, structured output helps with syntax but does not solve the semantic problem [Paul SANTUS]. The schema guarantees shape, not truth.
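A gate of this kind might look like the sketch below. It is not the original author's code. The field names (`rating`, `date`, `verified`, `country`, `language`, `review_count`, `reviews`), the idea that "token" means the word "verified" appearing in the scraped source text, and the tiny `EXPECTED_LANGUAGES` table are all assumptions made for illustration.

```python
from datetime import date

# Hypothetical country -> plausible review languages table (illustration only).
EXPECTED_LANGUAGES = {"US": {"en", "es"}, "FR": {"fr"}, "DE": {"de"}}


def check_review(row, raw_text, today=None):
    """Return a list of violation labels for one extracted review row."""
    today = today or date.today()
    problems = []

    # 1. Rating must be an integer from 1 to 5 (bool is rejected explicitly).
    rating = row.get("rating")
    if rating is not None:
        is_int = isinstance(rating, int) and not isinstance(rating, bool)
        if not (is_int and 1 <= rating <= 5):
            problems.append("rating_out_of_range")

    # 2. A review cannot be dated in the future (assumes ISO YYYY-MM-DD).
    published = row.get("date")
    if published is not None:
        try:
            if date.fromisoformat(published) > today:
                problems.append("future_date")
        except (TypeError, ValueError):
            problems.append("unparseable_date")

    # 3. A verified flag needs supporting evidence in the source text.
    if row.get("verified") and "verified" not in raw_text.lower():
        problems.append("verified_without_token")

    # 4. Country and language should be a plausible pair, when the country is known.
    expected = EXPECTED_LANGUAGES.get(row.get("country"))
    if expected is not None and row.get("language") not in expected:
        problems.append("country_language_mismatch")

    return problems


def check_payload(payload, raw_texts, today=None):
    """Check the declared count, then every row against its source text."""
    problems = []
    reviews = payload.get("reviews", [])
    # 5. The declared count must match the number of rows actually returned.
    if payload.get("review_count") != len(reviews):
        problems.append("count_mismatch")
    for i, (row, text) in enumerate(zip(reviews, raw_texts)):
        problems += [f"review[{i}]:{label}" for label in check_review(row, text, today)]
    return problems
```

In this sketch, a non-empty result would be logged and the row quarantined rather than inserted. Whether to reject, null out the offending field, or retry the extraction is a policy choice the source does not specify.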

Without value checks, corruption is silent: wrong rows become permanent facts and propagate through downstream analytics, recommendation engines, and reporting dashboards. A value-level sanity gate is a cheap first defense that catches most egregious errors. In-range lies, such as a true rating of 2 returned as 4, still evade simple range checks; catching them requires reference data, a second extraction pass, or human review, all of which incur additional cost.
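One way to use a second extraction pass is to compare its output field by field against the first and send any disagreement to review. The sketch below assumes both passes return dictionaries with the same hypothetical field names. Agreement between passes does not prove a value is correct, since both can make the same mistake; it only narrows what a human needs to look at.

```python
def disagreements(first_pass, second_pass, fields=("rating", "date", "verified")):
    """Return {field: (first_value, second_value)} for fields where the passes differ."""
    return {
        field: (first_pass.get(field), second_pass.get(field))
        for field in fields
        if first_pass.get(field) != second_pass.get(field)
    }


# Example: both ratings are in range, so a range check alone would pass either one.
first = {"rating": 4, "date": "2024-05-01", "verified": False}
second = {"rating": 2, "date": "2024-05-01", "verified": False}
flagged = disagreements(first, second)  # only "rating" differs between the passes
```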
