
Consumer Last Delivered Message is way ahead of stream Last Sequence [v2.10.20, v2.10.22] #6124

Open
ewirch opened this issue Nov 13, 2024 · 7 comments
Labels
defect Suspected defect such as a bug or regression


ewirch commented Nov 13, 2024

Observed behavior

Consumer Last Delivered Message is way ahead of the stream Last Sequence. This makes the consumer stop delivering messages.

Expected behavior

The consumer's last delivered message should never exceed the stream's last sequence.

Server and client version

Server: 2.10.20
Client: io.nats:jnats:2.16.10

Host environment

No response

Steps to reproduce

Stream info:

Information for Stream delivery-storage-email_promotional created 2022-04-27 13:59:57

              Subjects: delivery.storage.email_promotional.*
              Replicas: 3
               Storage: File

Options:

             Retention: WorkQueue
       Acknowledgments: true
        Discard Policy: Old
      Duplicate Window: 2m0s
     Allows Msg Delete: true
          Allows Purge: true
        Allows Rollups: false

Limits:

      Maximum Messages: unlimited
   Maximum Per Subject: unlimited
         Maximum Bytes: unlimited
           Maximum Age: unlimited
  Maximum Message Size: unlimited
     Maximum Consumers: unlimited

Cluster Information:

                  Name: nats-prod-cluster
                Leader: nats-2
               Replica: nats-0, current, seen 8ms ago
               Replica: nats-1, current, seen 7ms ago

State:

              Messages: 86,600
                 Bytes: 3.8 GiB
        First Sequence: 134,909 @ 2024-10-02 16:40:44
         Last Sequence: 221,508 @ 2024-11-13 08:05:55
      Active Consumers: 1
    Number of Subjects: 164

Consumer info:

Information for Consumer delivery-storage-email_promotional > storage-service-1 created 2024-08-01T11:01:48Z

Configuration:

                    Name: storage-service-1
               Pull Mode: true
          Filter Subject: delivery.storage.email_promotional.*
          Deliver Policy: All
              Ack Policy: Explicit
                Ack Wait: 30.00s
           Replay Policy: Instant
         Max Ack Pending: 200
       Max Waiting Pulls: 512

Cluster Information:

                    Name: nats-prod-cluster
                  Leader: nats-2
                 Replica: nats-0, current, seen 418ms ago
                 Replica: nats-1, current, seen 418ms ago

State:

  Last Delivered Message: Consumer sequence: 134,907 Stream sequence: 595,118
    Acknowledgment Floor: Consumer sequence: 134,907 Stream sequence: 134,907
        Outstanding Acks: 0 out of maximum 200
    Redelivered Messages: 0
    Unprocessed Messages: 0
           Waiting Pulls: 1 of maximum 512

Note the stream last sequence (221,508) versus the consumer's last delivered stream sequence (595,118). The acknowledgment floor at the same time is 134,907.

How is this even possible?
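
For reference, here is a minimal jnats sketch (assuming the stream and consumer names from this report and the io.nats:jnats JetStream management API; adjust the URL and names for your setup) that surfaces the inconsistency by comparing the consumer's delivered stream sequence with the stream's last sequence:

```java
import io.nats.client.Connection;
import io.nats.client.JetStreamManagement;
import io.nats.client.Nats;
import io.nats.client.api.ConsumerInfo;
import io.nats.client.api.StreamInfo;

public class ConsumerAheadCheck {
    public static void main(String[] args) throws Exception {
        try (Connection nc = Nats.connect("nats://localhost:4222")) {
            JetStreamManagement jsm = nc.jetStreamManagement();

            // Stream and consumer names taken from this report; adjust as needed.
            StreamInfo si = jsm.getStreamInfo("delivery-storage-email_promotional");
            ConsumerInfo ci = jsm.getConsumerInfo("delivery-storage-email_promotional", "storage-service-1");

            long streamLast = si.getStreamState().getLastSequence();
            long deliveredStreamSeq = ci.getDelivered().getStreamSequence();
            long ackFloorStreamSeq = ci.getAckFloor().getStreamSequence();

            System.out.printf("stream last seq: %d, delivered stream seq: %d, ack floor stream seq: %d%n",
                    streamLast, deliveredStreamSeq, ackFloorStreamSeq);

            // This should never hold; in this report it does (595,118 > 221,508).
            if (deliveredStreamSeq > streamLast) {
                System.out.println("Consumer delivered sequence is ahead of the stream's last sequence!");
            }
        }
    }
}
```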

ewirch added the defect label Nov 13, 2024

kylemcc commented Nov 13, 2024

I've been observing this as well (as recently as today), but haven't been able to come up with a way to reproduce it reliably so that I can file a useful bug report. We're running 2.10.22. Typically I have seen it after NATS runs out of memory. When this happens, in many cases we also observe what appears to be stream corruption on the node that OOMed, which cascades and results in data loss. The loss of data causes the sequence numbers to reset, and the affected stream's consumers end up with offsets far in the future.

Currently, we just fix this by deleting and recreating the consumer, since I don't believe there's a way to reset the consumer position.
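
In case it helps anyone hitting the same state, a rough sketch of that workaround with jnats (the configuration below mirrors the consumer from this report and is illustrative only; recreating the consumer with DeliverPolicy.All will redeliver whatever is still in the work queue):

```java
import java.time.Duration;

import io.nats.client.Connection;
import io.nats.client.JetStreamManagement;
import io.nats.client.Nats;
import io.nats.client.api.AckPolicy;
import io.nats.client.api.ConsumerConfiguration;
import io.nats.client.api.DeliverPolicy;

public class RecreateConsumer {
    public static void main(String[] args) throws Exception {
        try (Connection nc = Nats.connect("nats://localhost:4222")) {
            JetStreamManagement jsm = nc.jetStreamManagement();

            String stream = "delivery-storage-email_promotional";
            String consumer = "storage-service-1";

            // Drop the consumer whose delivered sequence ran ahead of the stream...
            jsm.deleteConsumer(stream, consumer);

            // ...and recreate it as a durable pull consumer with the same settings.
            ConsumerConfiguration cc = ConsumerConfiguration.builder()
                    .durable(consumer)
                    .filterSubject("delivery.storage.email_promotional.*")
                    .deliverPolicy(DeliverPolicy.All)
                    .ackPolicy(AckPolicy.Explicit)
                    .ackWait(Duration.ofSeconds(30))
                    .maxAckPending(200)
                    .build();
            jsm.addOrUpdateConsumer(stream, cc);
        }
    }
}
```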


wallyqs commented Nov 14, 2024

Thanks for the report @kylemcc. We've been able to reproduce some conditions which cause this and are now investigating. By any chance do you recreate the streams using the same name or use purge operations on the stream?


wallyqs commented Nov 14, 2024

@kylemcc in the issue report you mentioned v2.10.20; is that the version where you first ran into the issue, or was it on v2.10.22?


kylemcc commented Nov 14, 2024

By any chance do you recreate the streams using the same name or use purge operations on the stream?

We don't typically recreate or purge streams. We've done it once or twice, but that was accompanied by also deleting/recreating the consumers.

in the issue report you mentioned v2.10.20; is that the version where you first ran into the issue, or was it on v2.10.22?

I don't remember exactly when I first encountered it, but I found a thread in our Slack from late last year where it was mentioned (so, early 2.10?). That said, we don't see it terribly often, but it has occurred a couple of times on the current version. I wish I could offer something more helpful. Happy to share logs, profiles, or anything else you would find useful next time it happens.


wallyqs commented Nov 14, 2024

@kylemcc when you run into it, what would be helpful to share (if you can) so we can analyze it is the full output of nats server request jsz N --all, where N is the number of servers in the cluster. You can send that to [email protected]

wallyqs changed the title from "Consumer Last Delivered Message is way ahead of stream Last Sequence [2.10.20]" to "Consumer Last Delivered Message is way ahead of stream Last Sequence [v2.10.20, v2.10.22]" on Nov 14, 2024
@sourabhaggrawal

I have faced this issue too, twice in a week with nats-server 2.10.11. Had to delete and recreate the consumer to solve it.
No stream deletion or purge took place. If it helps, we configured a TTL on the stream (initially no TTL was set), and after a couple of days we started seeing this issue.

@sourabhaggrawal

Hi @wallyqs, any idea about this?
