Title: Panic in retryDelay due to integer overflow on x86 architecture
Bug Report
Description
When using the openai-go client with a high number of retries, a panic can occur due to an integer overflow in the retryDelay function. This issue appears to be specific to the x86 architecture; it does not occur on ARM (e.g., Apple M-series chips).
The panic occurs with the message: panic: invalid argument to Int63n
Here is the stack trace from our server logs:
panic: invalid argument to Int63n
goroutine 74428 [running]:
math/rand.(*Rand).Int63n(0x6a80?, 0x3ff0000000000000?)
    /root/go/pkg/mod/golang.org/[email protected]/src/math/rand/rand.go:122 +0xcb
math/rand.Int63n(0xe000000000000000)
    /root/go/pkg/mod/golang.org/[email protected]/src/math/rand/rand.go:454 +0x25
github.com/openai/openai-go/internal/requestconfig.retryDelay(0x1cca420?, 0x23)
    /root/go/pkg/mod/github.com/openai/openai-go@v1.12.0/internal/requestconfig/requestconfig.go:373 +0x91
github.com/openai/openai-go/internal/requestconfig.(*RequestConfig).Execute(0xc004e2d040)
    /root/go/pkg/mod/github.com/openai/openai-go@v1.12.0/internal/requestconfig/requestconfig.go:466 +0x5a5
github.com/openai/openai-go/internal/requestconfig.ExecuteNewRequest({0x21855b0?, 0xc0046b2fc0?}, {0x1e7df52?, 0x521518?}, {0x1e98cb3?, 0x5?}, {0x1e55720?, 0xc004964c08?}, {0x19c6780, 0xc00020a0a8}, ...)
    /root/go/pkg/mod/github.com/openai/openai-go@v1.12.0/internal/requestconfig/requestconfig.go:562 +0x9b
github.com/openai/openai-go.(*ChatCompletionService).New(_, {_, _}, {{0xc006281400, 0x2, 0x2}, {0xc000813590, 0xe}, {0x0, 0x0, ...}, ...}, ...)
    /root/go/pkg/mod/github.com/openai/openai-go@v1.12.0/chatcompletion.go:66 +0x16a
...
Root Cause Analysis
The panic originates in the retryDelay function in internal/requestconfig/requestconfig.go:
func retryDelay(res *http.Response, retryCount int) time.Duration {
    // ...
    maxDelay := 8 * time.Second
    delay := time.Duration(0.5 * float64(time.Second) * math.Pow(2, float64(retryCount)))
    if delay > maxDelay {
        delay = maxDelay
    }
    jitter := rand.Int63n(int64(delay / 4)) // Panics here
    delay -= time.Duration(jitter)
    return delay
}
The math/rand.Int63n function panics if its argument is less than or equal to 0. The issue arises from this line:
delay := time.Duration(0.5 * float64(time.Second) * math.Pow(2, float64(retryCount)))
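When retryCount is large enough that the product exceeds math.MaxInt64 nanoseconds (35 or higher, matching the 0x23 argument to retryDelay in the stack trace above), the float64 result no longer fits in a time.Duration (which is an int64). The Go specification leaves the result of such an out-of-range float-to-integer conversion implementation-dependent, so the behavior differs by architecture: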
- On x86 (amd64): The conversion overflows, resulting in a large negative int64 value for delay (math.MinInt64 in the fuzz output below). Because that value is not greater than maxDelay, the cap is skipped; delay / 4 is also negative, causing rand.Int63n to panic.
- On ARM (arm64): The conversion from a large float to int64 "saturates" at math.MaxInt64 instead of overflowing. This prevents delay from becoming negative, the maxDelay cap applies, and the code does not panic.
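The threshold can be checked without relying on any platform-specific conversion by comparing the product against the int64 range while it is still a float64. The sketch below is illustrative only; firstOverflowingRetryCount is a hypothetical helper, not part of openai-go.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// firstOverflowingRetryCount is a hypothetical helper (not part of openai-go).
// It returns the smallest retryCount for which 0.5s * 2^retryCount, expressed
// in nanoseconds, no longer fits in an int64. The comparison happens in
// float64, so the result does not depend on the CPU architecture.
func firstOverflowingRetryCount() int {
	for n := 0; n < 1024; n++ {
		ns := 0.5 * float64(time.Second) * math.Pow(2, float64(n))
		if ns >= math.MaxInt64 {
			return n
		}
	}
	return -1 // not reached: the product exceeds the int64 range long before n = 1024
}

func main() {
	fmt.Println(firstOverflowingRetryCount()) // prints 35
}
```

The result, 35, is consistent with the 0x23 argument to retryDelay in the stack trace: 0.5e9 * 2^34 is about 8.6e18 and still fits, while 0.5e9 * 2^35 is about 1.7e19 and does not.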
We encountered this in a long-running offline processing service where we set MaxRetries to a high value (e.g., 100) to ensure completion despite potential rate limiting.
Steps to Reproduce
This panic can be reliably reproduced on an x86 machine using a fuzz test.
- Create a test file (e.g., retry_fuzz_test.go):

    package requestconfig_test

    import (
        "math"
        "math/rand"
        "testing"
        "time"
    )

    // Simplified version of the internal retryDelay for testing
    func retryDelay(t *testing.T, retryCount uint) time.Duration {
        maxDelay := 8 * time.Second
        // This is the problematic line
        delay := time.Duration(0.5 * float64(time.Second) * math.Pow(2, float64(retryCount)))
        if delay > maxDelay {
            delay = maxDelay
        }
        if delay/4 <= 0 {
            // This demonstrates the overflow on x86.
            // On x86, for retryCount=48, delay becomes a large negative number.
            t.Logf("retryCount=%d, delay=%v, delay/4=%v", retryCount, delay, delay/4)
        }
        jitter := rand.Int63n(int64(delay / 4))
        delay -= time.Duration(jitter)
        return delay
    }

    func FuzzRetryDelay(f *testing.F) {
        f.Fuzz(func(t *testing.T, a uint) {
            // Limit 'a' to a reasonable range to find the issue faster.
            retryCount := a % 100
            retryDelay(t, retryCount)
        })
    }
- Run the fuzz test on an x86/amd64 machine. It will quickly fail.

    go test -fuzz=Fuzz -fuzztime=10s .
- Failing Output (on x86):

    --- FAIL: FuzzRetryDelay (0.00s)
        --- FAIL: FuzzRetryDelay (0.00s)
            rand_fuzz_test.go:21: retryCount=48, delay=-2562047h47m16.854775808s, delay/4=-2305843009213693952
            testing.go:1693: panic: invalid argument to Int63n
                goroutine 24 [running]:
                ...
                math/rand.Int63n(0xe000000000000000)
                ...
    FAIL
    exit status 1
Suggested Fix
The exponential backoff calculation should guard against this overflow. A simple fix would be to cap the retryCount used in the math.Pow calculation to a safe value that won't overflow int64 when converted to nanoseconds.
For example, capping retryCount at 30 would prevent the overflow:
func retryDelay(res *http.Response, retryCount int) time.Duration {
    // ...
    effectiveRetryCount := retryCount
    // Cap retryCount to prevent int64 overflow from math.Pow
    if effectiveRetryCount > 30 {
        effectiveRetryCount = 30
    }
    delay := time.Duration(0.5 * float64(time.Second) * math.Pow(2, float64(effectiveRetryCount)))
    // ...
}
Alternatively, check if delay is negative after the cast and clamp it to maxDelay.
Environment
- openai-go version: v1.12.0
- Go version: go1.24.1
- Failing Architecture: linux/amd64
- Passing Architecture: darwin/arm64 (Apple M1/M2/M3)