GPT-5.5 Codex responses cluster at exact 516-token threshold
- GPT-5.5 Codex responses disproportionately land at exactly 516 reasoning_output_tokens, with additional fixed-boundary spikes at 1034 and 1552, values the author characterizes as repeated threshold boundaries rather than a naturally varying distribution
- GPT-5.5 accounts for only 19.3% of Codex responses but 82.0% of exact-516 events, with its exact-516 / ≥516 ratio running roughly 33.6× higher than the non-GPT-5.5 baseline
- Monthly exact-516 clustering rose sharply from February through June 2026 while mean and P90 reasoning-token intensity fell in the same window, suggesting GPT-5.5 responses are terminating earlier and at fixed points
- The anomaly is linked to issue #29353, which reproduced a gpt-5.5 xhigh Codex Desktop task that returned a wrong final_answer after ending at exactly 516 reasoning tokens
- The author explicitly stops short of claiming hidden chain-of-thought truncation, instead asking the Codex team to investigate whether a reasoning-budget cap, routing decision, or scheduler fallback is forcing responses to terminate around the 516/1034/1552 thresholds
- The analysis draws on Codex token_count metadata spanning February 1–June 27, 2026 UTC, and proposes internal validation steps including replaying matched complex tasks across GPT-5.2 and GPT-5.5 with quality evals that separate exact-516 responses from longer-reasoning ones
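The proposed validation step, separating exact-516 responses from longer-reasoning ones in quality evals, can be sketched roughly as follows. This is a minimal illustration, not the author's tooling: the `EvalRecord` type, its field names, and the sample records are all hypothetical, and the only value taken from the analysis is the 516-token boundary.

```python
from dataclasses import dataclass

THRESHOLD = 516  # boundary reported in the analysis


@dataclass
class EvalRecord:
    """Hypothetical eval result for one replayed task (not a real Codex schema)."""
    model: str
    reasoning_tokens: int
    passed: bool


def pass_rates(records, threshold=THRESHOLD):
    """Return pass rate per (model, bucket), where bucket is 'exact' or 'longer'.

    Responses that ended below the threshold are skipped, since the comparison
    of interest is exact-boundary responses versus longer-reasoning ones.
    """
    totals = {}  # (model, bucket) -> [passed_count, total_count]
    for r in records:
        if r.reasoning_tokens == threshold:
            bucket = "exact"
        elif r.reasoning_tokens > threshold:
            bucket = "longer"
        else:
            continue
        tally = totals.setdefault((r.model, bucket), [0, 0])
        tally[0] += int(r.passed)
        tally[1] += 1
    return {key: passed / total for key, (passed, total) in totals.items()}


# Made-up records for illustration only; they are not results from the analysis.
records = [
    EvalRecord("gpt-5.5", 516, False),
    EvalRecord("gpt-5.5", 516, True),
    EvalRecord("gpt-5.5", 2048, True),
    EvalRecord("gpt-5.5", 120, True),   # below threshold, ignored
    EvalRecord("gpt-5.2", 1900, True),
    EvalRecord("gpt-5.2", 2300, False),
]

rates = pass_rates(records)
for (model, bucket), rate in sorted(rates.items()):
    print(f"{model:8s} {bucket:6s} pass rate = {rate:.2f}")
```

A gap between the "exact" and "longer" pass rates on matched tasks would be the kind of evidence the proposal is looking for, though a real comparison would need far more samples and controls for task difficulty.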
Why it matters: If GPT-5.5 is silently capping reasoning at fixed thresholds, the Codex team has a concrete, reproducible signal to investigate. The 33.6× concentration of exact-516 responses in a single model version, paired with a documented case (#29353) of a wrong answer at that exact boundary, suggests that degraded complex-task performance may be tied to a specific reasoning budget rather than to user error or task selection.
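For readers who want to see how a concentration figure like the 33.6× multiple could be derived, here is a small sketch. It assumes the exact-516 / ≥516 ratio means the share of responses with at least 516 reasoning tokens that land exactly on 516, which is an interpretation rather than something the summary spells out. The function names and the sample lists are hypothetical and synthetic.

```python
THRESHOLD = 516  # boundary reported in the analysis


def exact_ratio(token_counts, threshold=THRESHOLD):
    """Share of responses at or above `threshold` that land exactly on it."""
    at_or_above = [t for t in token_counts if t >= threshold]
    if not at_or_above:
        return 0.0
    exact = sum(1 for t in at_or_above if t == threshold)
    return exact / len(at_or_above)


def concentration_multiple(target_counts, baseline_counts, threshold=THRESHOLD):
    """How many times higher the target's exact ratio is than the baseline's."""
    baseline = exact_ratio(baseline_counts, threshold)
    if baseline == 0:
        # Baseline never lands on the threshold, so the multiple is undefined;
        # it is reported here as infinity.
        return float("inf")
    return exact_ratio(target_counts, threshold) / baseline


# Synthetic reasoning-token counts for illustration only (not the real data).
gpt55 = [516, 516, 516, 516, 700, 1034, 1552, 300, 516, 900]
others = [516, 520, 610, 700, 845, 990, 1200, 1400, 250, 1800]

print(f"GPT-5.5 exact ratio:  {exact_ratio(gpt55):.3f}")
print(f"baseline exact ratio: {exact_ratio(others):.3f}")
print(f"multiple:             {concentration_multiple(gpt55, others):.1f}x")
```

Normalizing by the ≥516 population, rather than by all responses, keeps the comparison from being skewed by models that simply produce more short responses.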




