Xayc73 commented Dec 24, 2025

Problem

When a NATS server restarts, the KV watcher's heartbeat task (hb_task) detects that the consumer is gone and attempts to recreate it. However, the current implementation unconditionally sets:

ccreq[:deliver_policy] = "by_start_sequence"
ccreq[:opt_start_seq] = watcher._sseq

Issue: If the KV bucket is empty or no KV updates have been observed yet, watcher._sseq remains 0. JetStream rejects consumer creation with opt_start_seq=0, returning err_code=10094 ("optional start sequence is not set").

This causes the watcher to enter an infinite loop of ConsumerNotFound -> BadRequest (10094) errors, never recovering automatically.
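The failure mode can be modelled without a running server. The sketch below is an illustration only: `FakeJetStream`, `BadRequest`, and `unconditional_request` are hypothetical stand-ins, not classes or methods from the client library. The stub assumes only what is described above, namely that a `by_start_sequence` request whose `opt_start_seq` is 0 is rejected with err_code 10094.

```ruby
# Hypothetical error type standing in for the client's BadRequest error.
class BadRequest < StandardError
  attr_reader :err_code

  def initialize(msg, err_code)
    super(msg)
    @err_code = err_code
  end
end

# Hypothetical stub for the server-side validation described in this PR.
class FakeJetStream
  # Rejects by_start_sequence requests whose start sequence is missing or 0.
  def create_consumer(ccreq)
    if ccreq[:deliver_policy] == "by_start_sequence" && ccreq[:opt_start_seq].to_i <= 0
      raise BadRequest.new("optional start sequence is not set", 10094)
    end
    :ok
  end
end

# Recreation request as built by the unconditional code path.
def unconditional_request(original, sseq)
  original.merge(deliver_policy: "by_start_sequence", opt_start_seq: sseq)
end

js = FakeJetStream.new
original = { deliver_policy: "last_per_subject" }

# Each retry rebuilds the same request from _sseq = 0, so each retry
# is rejected in the same way and the watcher never recovers.
errors = 3.times.map do
  begin
    js.create_consumer(unconditional_request(original, 0))
    nil
  rescue BadRequest => e
    e.err_code
  end
end
p errors
```

Because nothing changes `_sseq` between attempts on an empty bucket, retrying alone cannot break the loop; the request itself has to change.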

Solution

Force the by_start_sequence deliver policy only when watcher._sseq is greater than 0:

if watcher._sseq.to_i > 0
  ccreq[:deliver_policy] = "by_start_sequence"
  ccreq[:opt_start_seq] = watcher._sseq
else
  ccreq.delete(:opt_start_seq)
end

When _sseq is 0, deliver_policy is left untouched and any opt_start_seq is removed from the request. The original deliver_policy from the initial subscription (typically last_per_subject) is therefore preserved, and the consumer can be recreated successfully.
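To make the two branches easy to check in isolation, the guard can be expressed as a pure function over a request hash. This is a minimal sketch, not the code in the PR: `recreate_request` is a hypothetical helper name, and the sample `filter_subject` value is an assumption for illustration.

```ruby
# Hypothetical helper: builds the consumer recreation request from the
# original consumer config and the last observed stream sequence.
def recreate_request(original, sseq)
  ccreq = original.dup # leave the caller's hash untouched
  if sseq.to_i > 0
    # At least one KV update was seen: resume from that sequence.
    ccreq[:deliver_policy] = "by_start_sequence"
    ccreq[:opt_start_seq] = sseq
  else
    # Nothing seen yet: keep the original deliver policy and make sure
    # no zero start sequence is sent along with it.
    ccreq.delete(:opt_start_seq)
  end
  ccreq
end

# Sample config; the subject value is assumed for illustration.
original = { deliver_policy: "last_per_subject", filter_subject: '$KV.TEST.>' }

p recreate_request(original, 0)  # empty bucket: original policy kept
p recreate_request(original, 42) # updates seen: resume from sequence 42
```

Using `to_i` also covers a `nil` sequence, which is treated the same as 0.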

Testing

Added a new test case "should reconnect watches on server restart with empty bucket" that:

  1. Creates an empty KV bucket
  2. Starts a watcher (ensuring _sseq stays at 0)
  3. Simulates server restart
  4. Verifies no BadRequest (10094) errors occur
  5. Confirms the watcher receives new entries after reconnection

Reproduction

Without this fix, the following scenario fails:

  1. Create empty KV bucket
  2. Start watcher with kv.watchall
  3. Kill NATS server
  4. Restart NATS server
  5. Observe repeated errors in error_cb:
    • ConsumerNotFound
    • BadRequest err_code=10094

Impact

  • Low risk: The change only affects the edge case where no KV messages have been received
  • Backward compatible: Existing behavior for non-empty buckets is preserved
  • No breaking changes: Only internal consumer recreation logic is modified

Xayc73 commented Jan 20, 2026

Hi @wallyqs 👋

Friendly ping on this PR. It's been about a month, so just checking in.

Note: The CI failures are unrelated to my changes; the same flaky tests have been failing on the main branch for months (Build #501, #500, etc. all show identical failures).

Happy to address any feedback or make changes if needed!
