@gen1321 gen1321 commented Feb 9, 2025

closes prometheus/prometheus#15186

Why this issue happened:

  • Tokens can expire mid-request. This results in a 403 Forbidden with "ExpiredTokenException", which leads to Prometheus dropping samples.
  • The AWS SDK should refresh credentials automatically, but that does not protect us from expiration mid-request. We can explore setting ExpiryWindow on AssumeRoleProvider to make mid-flight expiration less likely, although I still think we should handle expiration as in this PR.
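
To illustrate what an expiry window buys us, here is a minimal stdlib-only sketch. It is not the SDK's code: cachedToken, needsRefresh and nearExpiryExample are hypothetical names, and the 1-minute window is an arbitrary value picked for the example. The idea is that credentials are treated as expired slightly before they really are, so a request started near the deadline gets a fresh token first.

```go
package main

import "time"

// cachedToken is a hypothetical stand-in for cached temporary credentials.
type cachedToken struct {
	expiresAt time.Time     // when the token actually stops being valid
	window    time.Duration // how long before expiresAt to start refreshing
}

// needsRefresh reports whether the token should be replaced before signing
// a request at time now. A zero window refreshes only after real expiry.
func (c cachedToken) needsRefresh(now time.Time) bool {
	return now.After(c.expiresAt.Add(-c.window))
}

// nearExpiryExample evaluates a request that starts 10 seconds before a
// 15-minute token expires, first with no window and then with a 1-minute one.
func nearExpiryExample() (noWindow, withWindow bool) {
	issued := time.Date(2025, 2, 9, 12, 0, 0, 0, time.UTC)
	expires := issued.Add(15 * time.Minute)
	requestAt := expires.Add(-10 * time.Second)

	noWindow = cachedToken{expiresAt: expires}.needsRefresh(requestAt)
	withWindow = cachedToken{expiresAt: expires, window: time.Minute}.needsRefresh(requestAt)
	return noWindow, withWindow
}
```

Without a window the request is signed with a token that dies 10 seconds later; with the window it is refreshed first. A window only narrows the race, though: a request that takes longer than the window can still hit an expired token, which is why the retry is still needed.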

Changes:

  • On ExpiredTokenException, refresh credentials, re-sign, and retry automatically.
  • Refactor signRequest to ensure consistent signing in both initial and retry flows.
  • Add tests to verify retry on ExpiredTokenException and to confirm no retry on other 403 errors.
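
The retry flow described above can be sketched roughly as follows. This is a simplified, stdlib-only illustration and not the code in this PR: the signer interface, retryRoundTripper, fakeSigner, the X-Token header and the substring check on the response body are all assumptions made for the example. The real round tripper signs with SigV4 and refreshes AWS credentials.

```go
package main

import (
	"bytes"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync/atomic"
)

// signer is a hypothetical stand-in for the SigV4 signing step.
type signer interface {
	Sign(req *http.Request) error // add auth headers using current credentials
	Refresh()                     // drop cached credentials so the next Sign uses new ones
}

// retryRoundTripper signs each request and retries once on an expired token.
type retryRoundTripper struct {
	next http.RoundTripper
	s    signer
}

func (rt *retryRoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
	// Buffer the body so the same payload can be re-signed and re-sent.
	var payload []byte
	if req.Body != nil {
		var err error
		payload, err = io.ReadAll(req.Body)
		req.Body.Close()
		if err != nil {
			return nil, err
		}
	}
	resp, err := rt.attempt(req, payload)
	if err != nil || resp.StatusCode != http.StatusForbidden {
		return resp, err
	}
	body, err := io.ReadAll(resp.Body)
	resp.Body.Close()
	if err != nil {
		return nil, err
	}
	if !bytes.Contains(body, []byte("ExpiredTokenException")) {
		// Some other 403: hand the response back untouched, with no retry.
		resp.Body = io.NopCloser(bytes.NewReader(body))
		return resp, nil
	}
	rt.s.Refresh()
	return rt.attempt(req, payload)
}

// attempt signs a clone of req (the original must not be mutated) and sends it.
func (rt *retryRoundTripper) attempt(req *http.Request, payload []byte) (*http.Response, error) {
	clone := req.Clone(req.Context())
	if payload != nil {
		clone.Body = io.NopCloser(bytes.NewReader(payload))
	}
	if err := rt.s.Sign(clone); err != nil {
		return nil, err
	}
	return rt.next.RoundTrip(clone)
}

// fakeSigner "signs" with a plain header; Refresh swaps in a valid token.
type fakeSigner struct{ token string }

func (f *fakeSigner) Sign(req *http.Request) error { req.Header.Set("X-Token", f.token); return nil }
func (f *fakeSigner) Refresh()                     { f.token = "fresh" }

// demo sends one POST through the round tripper to a test server that answers
// stale tokens with a 403 carrying forbiddenBody. It returns the final status
// code and the number of requests the server received.
func demo(forbiddenBody string) (int, int32, error) {
	var n int32
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		atomic.AddInt32(&n, 1)
		if r.Header.Get("X-Token") != "fresh" {
			w.WriteHeader(http.StatusForbidden)
			io.WriteString(w, forbiddenBody)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	client := &http.Client{Transport: &retryRoundTripper{
		next: http.DefaultTransport,
		s:    &fakeSigner{token: "stale"},
	}}
	resp, err := client.Post(srv.URL, "text/plain", strings.NewReader("samples"))
	if err != nil {
		return 0, atomic.LoadInt32(&n), err
	}
	defer resp.Body.Close()
	return resp.StatusCode, atomic.LoadInt32(&n), nil
}
```

Calling demo with a body that contains "ExpiredTokenException" should reach the server twice and end with a 200; any other 403 body should be returned after a single attempt. Buffering the payload up front is what makes the second, re-signed attempt possible.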

Testing:

  • Unit tests cover both scenarios.
  • For manual real-world testing, you can use this snippet to make a request with expired credentials:
func NewNoRefreshCredentials(creds *credentials.Credentials) *credentials.Credentials {
	return credentials.NewCredentials(&noRefreshProvider{creds})
}

type noRefreshProvider struct {
	creds *credentials.Credentials
}

func (p *noRefreshProvider) Retrieve() (credentials.Value, error) {
	fmt.Printf("\n=== noRefreshProvider.Retrieve() called at %s ===\n", time.Now().UTC())
	val, err := p.creds.Get()
	fmt.Printf("Retrieved credentials: Provider=%s, HasKeys=%v\n", val.ProviderName, val.HasKeys())
	return val, err
}

var (
	staticCounter struct {
		sync.Mutex
		count int
	}
)

func (p *noRefreshProvider) IsExpired() bool {
	staticCounter.Lock()
	staticCounter.count++
	count := staticCounter.count
	staticCounter.Unlock()

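	fmt.Printf("IsExpired: count=%d\n", count)
	// count is always >= 1 at this point, so this always returns false:
	// the wrapper never reports expiry and therefore never refreshes.
	return count < 1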
}

You can swap out stscreds.NewCredentials for NewNoRefreshCredentials, so you can reproduce expiration mid-flight :)

	cfg := &SigV4Config{
		Region:  "eu-north-1",
		RoleARN: "ROLE-ARN",
	}
	fmt.Println("Creating SigV4RoundTripper")
	rt, err := NewSigV4RoundTripper(cfg, nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("Creating http.Client")
	client := &http.Client{Transport: rt}

	// Make initial request
	fmt.Println("\n=== Making initial request ===")
	makeRequest(client, t)

	// Sleep for 16 minutes
	fmt.Printf("\n=== Sleeping for 16 minutes at %s ===\n", time.Now().UTC())
	time.Sleep(16 * time.Minute)

	// Make request after sleep
	fmt.Printf("\n=== Making request after sleep at %s ===\n", time.Now().UTC())
	makeRequest(client, t)

func makeRequest(client *http.Client, t *testing.T) {
	fmt.Println("Making request")
	req, err := http.NewRequest("GET", "https://aps.eu-north-1.amazonaws.com/workspaces", nil)
	if err != nil {
		t.Fatal(err)
	}
	resp, err := client.Do(req)
	if err != nil {
		t.Fatal(err)
	}
	defer resp.Body.Close()

	// io.ReadAll replaces the deprecated ioutil.ReadAll.
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		t.Fatal(err)
	}
	fmt.Printf("Response - Status: %d\nHeaders: %+v\nBody: %s\n",
		resp.StatusCode, resp.Header, string(body))
}

- Implement automatic credential refresh for ExpiredTokenException
- Refactor signRequest method to improve code reusability
- Add test cases for expired token and different 403 error scenarios

Signed-off-by: Boris Beginin <[email protected]>
Development

Successfully merging this pull request may close these issues.

Remote write doesn't handle SigV4 expiration gracefully