MedGemma model shows hardware-dependent nondeterminism

A 4-bit MedGemma model produced different triage levels for the same patient case on a CPU and a GPU, revealing hardware-dependent nondeterminism in on-device medical triage [Dev.to] [Thinking Machines].

Sources: [Dev.to] [Thinking Machines]

A 4-bit MedGemma 1.5 4B model returned ATS-2 on a 4-vCPU Cloud Run instance and ATS-3 on a laptop RTX 5070 Ti Mobile GPU (12 GB) for the same renal-colic case, exposing hardware-dependent nondeterminism in on-device triage [Dev.to][Thinking Machines]. The model is quantized to Q4_K_XL (≈3.4 GB) to fit consumer hardware. Both runs used identical model files, prompts, greedy decoding (temperature 0), and rule-based fallback logic, leaving the execution hardware and its kernels as the only difference [Dev.to].

The divergence stems from the arithmetic layer: GPU kernels dequantize Q4_K weights to fp16 and accumulate in fp16, while the CPU path expands them to fp32 and accumulates in fp32. The two paths therefore produce slightly different logits. When two candidate tokens are nearly tied, greedy argmax selects a different token on each device, and that single-token change cascades into a different ATS category [Thinking Machines].
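The precision effect can be reproduced in miniature. The sketch below is a toy illustration, not MedGemma's kernels: the helper names `dot_accumulate` and `pick_token`, and every weight and activation value, are assumptions chosen to create a near-tie. It computes the same two logits with fp32 and with fp16 accumulation using NumPy scalar types.

```python
import numpy as np

def dot_accumulate(weights, activations, dtype):
    """Dot product whose products and running sum are both rounded to `dtype`."""
    acc = dtype(0.0)
    for w, x in zip(weights, activations):
        # Round after every multiply-add, as a low-precision accumulator would.
        acc = dtype(acc + dtype(w) * dtype(x))
    return acc

# Toy numbers chosen to create a near-tie; they are not MedGemma values.
ACTIVATIONS = [1.0, 1.0, 1.0, 1.0, 1.0]
TOKEN_WEIGHTS = [
    [1.0, 0.0004, 0.0004, 0.0004, 0.0004],  # token 0: many small contributions
    [1.001, 0.0, 0.0, 0.0, 0.0],            # token 1: one slightly larger term
]

def pick_token(dtype):
    """Greedy choice: index of the largest logit computed at the given precision."""
    logits = [dot_accumulate(w, ACTIVATIONS, dtype) for w in TOKEN_WEIGHTS]
    return int(np.argmax(logits))

print(pick_token(np.float32))  # 0
print(pick_token(np.float16))  # 1
```

In fp32 the small terms add up and token 0 wins with roughly 1.0016 against 1.001. In fp16 each 0.0004 is less than half the spacing between representable values near 1.0 (about 0.00098), so it is rounded away, the sum stays at 1.0, and token 1 wins.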

Regulatory validation assumes deterministic behavior, so clinical reproducibility collapses when the same model gives different answers on different hardware. A model whose urgency rating depends on the processor cannot be reliably audited or certified. Quantization amplifies numeric noise: a logit shift of less than one ten-thousandth can flip the argmax and turn a safe over-triage into an unsafe under-triage. The problem extends beyond medicine to any LLM that drives a discrete, high-stakes decision, such as loan tiering, content moderation, or routing [Dev.to].

To address this, safety-critical LLM deployments should adopt versioned, deterministic inference pipelines, such as vLLM's deterministic kernels or custom fp32-only backends, and lock the entire stack (model, quant format, backend version, target device) before validation, so that the configuration that was validated is the one that runs in production [Thinking Machines].
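One way to picture "locking the stack" is a fingerprint over the four components named above, checked before the model is allowed to serve. This is a minimal sketch under stated assumptions: the `InferenceStack` class, the `assert_validated` helper, and all field values are hypothetical and do not come from either source.

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class InferenceStack:
    """Hypothetical record of everything that was fixed during validation."""
    model_sha256: str     # hash of the exact model file
    quant_format: str     # e.g. "Q4_K_XL"
    backend_version: str  # placeholder version string, not a real release
    target_device: str    # placeholder device label

    def fingerprint(self) -> str:
        # Serialize with sorted keys so the hash does not depend on field order.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

def assert_validated(stack: InferenceStack, validated_fingerprint: str) -> None:
    """Refuse to run if any locked component differs from the validated one."""
    if stack.fingerprint() != validated_fingerprint:
        raise RuntimeError("inference stack differs from the validated configuration")

# Placeholder model bytes stand in for the real file contents.
model_hash = hashlib.sha256(b"placeholder model bytes").hexdigest()
validated = InferenceStack(model_hash, "Q4_K_XL", "backend-x.y.z", "cpu-4vcpu")
LOCKED = validated.fingerprint()

assert_validated(validated, LOCKED)  # same stack: passes silently

# Same model file and quant format, different device: rejected.
gpu_stack = InferenceStack(model_hash, "Q4_K_XL", "backend-x.y.z", "laptop-gpu")
try:
    assert_validated(gpu_stack, LOCKED)
except RuntimeError as err:
    print(err)
```

A check like this does not make the arithmetic deterministic on its own. It only ensures that a deployment which drifts from the validated device or backend fails loudly instead of silently returning a different triage level.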
