Did you know that setting a large language model's temperature to 0, which forces it to always pick the single most likely next token (roughly, the next word), still does not guarantee that you get the same answer twice? A 2025 analysis from
Thinking Machines Lab found that the main culprit is not randomness in the hardware's parallel arithmetic, as is often assumed, but a lack of
batch invariance: production servers bundle many users' requests into one batch, and the batch's size changes the order in which the underlying floating-point operations are carried out. Because floating-point addition is not associative, a different order can round differently, so your output depends on how many strangers happened to hit the server at the same moment. Running one prompt 1,000 times at temperature 0 produced dozens of distinct completions, and the fix was rewriting the underlying computation so that it returns identical results no matter the batch size.
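
The arithmetic effect behind this can be reproduced in a toy Python sketch. It is not the kernels from the analysis, only an illustration under stated assumptions: the helper names `sequential_sum` and `chunked_sum` are hypothetical, and the `chunk_size` parameter stands in for a reduction strategy that changes with batch size. The same four numbers are added in two different groupings and give two different totals.

```python
def sequential_sum(values):
    """Add the values strictly left to right with ordinary float addition."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, chunk_size):
    """Sum each chunk left to right, then add the per-chunk partial sums.

    Hypothetical stand-in for a reduction whose grouping depends on
    batch size: a different chunk_size means a different addition order.
    """
    partials = [
        sequential_sum(values[i:i + chunk_size])
        for i in range(0, len(values), chunk_size)
    ]
    return sequential_sum(partials)


# Mathematically these four numbers add up to exactly 2.
values = [1e16, 1.0, -1e16, 1.0]

one_chunk = chunked_sum(values, chunk_size=4)   # ((1e16 + 1) - 1e16) + 1
two_chunks = chunked_sum(values, chunk_size=2)  # (1e16 + 1) + (-1e16 + 1)

print(one_chunk)   # 1.0
print(two_chunks)  # 0.0
```

Neither result equals the exact answer of 2, and the two disagree with each other, purely because of grouping. A batch-invariant reduction, in this toy picture, would be one that uses the same grouping for a given request regardless of how many other requests share the batch.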