Panic in retryDelay due to integer overflow on x86 architecture #489

@eager7

Title: Panic in retryDelay due to integer overflow on x86 architecture

Bug Report

Description

When using the openai-go client with a high number of retries, a panic can occur due to an integer overflow in the retryDelay function. This issue appears to be specific to the x86 architecture; it does not occur on ARM (e.g., Apple M-series chips).

The panic occurs with the message: panic: invalid argument to Int63n

Here is the stack trace from our server logs:

panic: invalid argument to Int63n

goroutine 74428 [running]:
math/rand.(*Rand).Int63n(0x6a80?, 0x3ff0000000000000?)
/root/go/pkg/mod/golang.org/[email protected]/src/math/rand/rand.go:122 +0xcb
math/rand.Int63n(0xe000000000000000)
/root/go/pkg/mod/golang.org/[email protected]/src/math/rand/rand.go:454 +0x25
github.com/openai/openai-go/internal/requestconfig.retryDelay(0x1cca420?, 0x23)
/root/go/pkg/mod/github.com/openai/[email protected]/internal/requestconfig/requestconfig.go:373 +0x91
github.com/openai/openai-go/internal/requestconfig.(*RequestConfig).Execute(0xc004e2d040)
/root/go/pkg/mod/github.com/openai/[email protected]/internal/requestconfig/requestconfig.go:466 +0x5a5
github.com/openai/openai-go/internal/requestconfig.ExecuteNewRequest({0x21855b0?, 0xc0046b2fc0?}, {0x1e7df52?, 0x521518?}, {0x1e98cb3?, 0x5?}, {0x1e55720?, 0xc004964c08?}, {0x19c6780, 0xc00020a0a8}, ...)
/root/go/pkg/mod/github.com/openai/[email protected]/internal/requestconfig/requestconfig.go:562 +0x9b
github.com/openai/openai-go.(*ChatCompletionService).New(_, {_, _}, {{0xc006281400, 0x2, 0x2}, {0xc000813590, 0xe}, {0x0, 0x0, ...}, ...}, ...)
/root/go/pkg/mod/github.com/openai/[email protected]/chatcompletion.go:66 +0x16a
...

Root Cause Analysis

The panic originates in the retryDelay function in internal/requestconfig/requestconfig.go:

func retryDelay(res *http.Response, retryCount int) time.Duration {
	// ...
	maxDelay := 8 * time.Second
	delay := time.Duration(0.5 * float64(time.Second) * math.Pow(2, float64(retryCount)))
	if delay > maxDelay {
		delay = maxDelay
	}

	jitter := rand.Int63n(int64(delay / 4)) // Panics here
	delay -= time.Duration(jitter)
	return delay
}

The math/rand.Int63n function panics if its argument is less than or equal to 0. The issue arises from this line:
delay := time.Duration(0.5 * float64(time.Second) * math.Pow(2, float64(retryCount)))

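When retryCount is large (35 or higher; the stack trace above shows retryDelay being called with 0x23 = 35), the expression 0.5 * float64(time.Second) * math.Pow(2, float64(retryCount)) exceeds math.MaxInt64. Converting an out-of-range float64 to time.Duration (which is an int64) is implementation-dependent in Go, and the result differs by architecture: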

  • On x86 (amd64): The conversion overflows, resulting in a large negative int64 value for delay. Because delay is negative, the delay > maxDelay check never clamps it. Consequently, delay / 4 is also negative, causing rand.Int63n to panic.
  • On ARM (arm64): The conversion from a large float to int64 "saturates" at math.MaxInt64 instead of overflowing. This prevents delay from becoming negative, and the code does not panic.
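
The threshold can be checked without performing the out-of-range conversion at all. The sketch below evaluates the same float64 expression and compares it against 2^63, the first value an int64 cannot hold. The helper names rawDelayNanos and exceedsInt64 are hypothetical and do not exist in openai-go:

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// int64Limit is 2^63, the smallest float64 value that does not fit in an int64.
const int64Limit = float64(1 << 63)

// rawDelayNanos mirrors the float64 expression in retryDelay, before the
// conversion to time.Duration. (Hypothetical helper, for illustration only.)
func rawDelayNanos(retryCount int) float64 {
	return 0.5 * float64(time.Second) * math.Pow(2, float64(retryCount))
}

// exceedsInt64 reports whether converting the raw delay to int64 would be
// out of range, and therefore implementation-dependent.
func exceedsInt64(retryCount int) bool {
	return rawDelayNanos(retryCount) >= int64Limit
}

func main() {
	for _, n := range []int{30, 34, 35, 48} {
		fmt.Printf("retryCount=%d raw=%.4g ns outOfRange=%v\n",
			n, rawDelayNanos(n), exceedsInt64(n))
	}
}
```

Because the comparison happens in float64, this program behaves the same on amd64 and arm64; only the subsequent conversion differs between them.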

We encountered this in a long-running offline processing service where we set MaxRetries to a high value (e.g., 100) to ensure completion despite potential rate limiting.

Steps to Reproduce

This panic can be reliably reproduced on an x86 machine using a fuzz test.

  1. Create a test file (e.g., retry_fuzz_test.go):

    package requestconfig_test
    
    import (
    	"math"
    	"math/rand"
    	"testing"
    	"time"
    )
    
    // Simplified version of the internal retryDelay for testing
    func retryDelay(t *testing.T, retryCount uint) time.Duration {
    	maxDelay := 8 * time.Second
    	// This is the problematic line
    	delay := time.Duration(0.5 * float64(time.Second) * math.Pow(2, float64(retryCount)))
    	if delay > maxDelay {
    		delay = maxDelay
    	}
    
    	if delay/4 <= 0 {
    		// This demonstrates the overflow on x86.
    		// On x86, for retryCount=48, delay becomes a large negative number.
    		t.Logf("retryCount=%d, delay=%v, delay/4=%v", retryCount, delay, delay/4)
    	}
    
    	jitter := rand.Int63n(int64(delay / 4))
    	delay -= time.Duration(jitter)
    	return delay
    }
    
    func FuzzRetryDelay(f *testing.F) {
    	f.Fuzz(func(t *testing.T, a uint) {
    		// Limit 'a' to a reasonable range to find the issue faster.
    		retryCount := a % 100
    		retryDelay(t, retryCount)
    	})
    }
  2. Run the fuzz test on an x86/amd64 machine. It will quickly fail.

    go test -fuzz=Fuzz -fuzztime=10s .
  3. Failing Output (on x86):

    --- FAIL: FuzzRetryDelay (0.00s)
        --- FAIL: FuzzRetryDelay (0.00s)
            rand_fuzz_test.go:21: retryCount=48, delay=-2562047h47m16.854775808s, delay/4=-2305843009213693952
            testing.go:1693: panic: invalid argument to Int63n
                goroutine 24 [running]:
                ...
                math/rand.Int63n(0xe000000000000000)
                ...
    FAIL
    exit status 1
    

Suggested Fix

The exponential backoff calculation should guard against this overflow. A simple fix would be to cap the retryCount used in the math.Pow calculation to a safe value that won't overflow int64 when converted to nanoseconds.

For example, capping retryCount at 30 would prevent the overflow:

func retryDelay(res *http.Response, retryCount int) time.Duration {
	// ...
	effectiveRetryCount := retryCount
	// Cap retryCount to prevent int64 overflow from math.Pow
	if effectiveRetryCount > 30 {
		effectiveRetryCount = 30
	}
	delay := time.Duration(0.5 * float64(time.Second) * math.Pow(2, float64(effectiveRetryCount)))
	// ...
}

Alternatively, check if delay is negative after the cast and clamp it to maxDelay.

Environment

  • openai-go version: v1.12.0
  • Go version: go1.24.1
  • Failing Architecture: linux/amd64
  • Passing Architecture: darwin/arm64 (Apple M1/M2/M3)
