Followers fall out of sync but report they are perfectly in sync if the leader dies and comes back empty #688
Comments
I'm able to replicate it on AWS ECS (any container environment should do).
So what really happened here is that the leader was not cleanly killed, and when it came back online it was empty. The followers didn't notice this change and continued along thinking everything was fine. The leader's IP address also changed during the reboot, and the followers didn't re-verify that they were still connected to a leader whose AOF matched theirs (or even the same server id). In my case the leader was dying because AOFSHRINK was not running properly, so it ran out of drive space. That part is a solvable problem, but it still reveals that there is an issue.
I can confirm this on my side. Normally, immediately after connecting to a leader, a follower will issue some md5 checks against the leader to determine whether they share the same AOF, and if not, the follower's AOF should be reset to match the leader's.
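For illustration, here is a rough Go sketch of the kind of post-reconnect check being described, using Tile38's AOFMD5 command (which checksums a byte range of the AOF) through the go-redis client. This is not Tile38's actual follower code: the wrapper function, the address, the example values, and how the reply is parsed are all assumptions.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/redis/go-redis/v9"
)

// verifyLeaderAOF is an illustrative sketch (not Tile38's own implementation)
// of the follower-side check described above: after (re)connecting, ask the
// leader for an MD5 of the AOF byte range the follower has already applied
// (AOFMD5 <pos> <size>) and compare it with the follower's local checksum.
// A mismatch means the histories have diverged -- for example the leader came
// back empty after an unclean restart -- and the follower should reset its
// AOF and resync instead of continuing to report itself as caught up.
func verifyLeaderAOF(ctx context.Context, leader *redis.Client,
	localAOFSize int64, localPrefixMD5 string) (bool, error) {

	// Reply parsing is simplified here; the exact shape depends on the
	// connection's output mode (RESP vs JSON).
	md5, err := leader.Do(ctx, "AOFMD5", 0, localAOFSize).Text()
	if err != nil {
		return false, err
	}
	return md5 == localPrefixMD5, nil
}

func main() {
	ctx := context.Background()
	leader := redis.NewClient(&redis.Options{Addr: "10.0.0.1:9851"}) // hypothetical leader address

	// localAOFSize / localPrefixMD5 would come from the follower's own AOF file.
	ok, err := verifyLeaderAOF(ctx, leader, 1048576, "d41d8cd98f00b204e9800998ecf8427e")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("AOF prefix matches leader:", ok)
}
```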
Describe the bug
I have had 1 leader and 2 followers running for the past week. At some point during that week one of the followers lost changes spontaneously, yet it reports that it is healthy, fully connected, and all caught up with the leader. I have no idea when these changes were lost, or whether it happened in real time or later. On top of the lost changes, I also cannot get the server back in sync without disconnecting the follower and reconnecting it.
I should note that I have seen this behavior before, but it is exceedingly rare. I'm hosting all of these instances on AWS ECS.
Here is what's going on (copied from the Slack group):
I have records in a key that exist on 1 of the followers but not on the leader or the other follower. Even when I issue a DROP on the key, it still doesn't clear the record from the follower that appears broken. On the broken follower I'm also able to GET and SCAN the key, and the bad ID shows up.
The SERVER command on the broken follower still says it's all caught up. HEALTHZ also shows it's all caught up.
The other follower shows identical results except for num_strings and num_collections, which correspond to the broken key/id that was cleared on the leader and 1 follower.
The leader shows:
The ROLE command on both followers shows an incrementing offset, which is expected since writes are happening constantly (device movement).
The INFO command on the bad follower is:
The INFO on the good follower is:
The INFO on the leader shows that there are two followers connected:
Looking at other keys (device positions), it appears that the leader and the good follower are very close in terms of object count, but the broken follower is way off. Here's the crazy thing... the bad follower is still getting updates for new changes.
I turned off the system injecting new data and watched the servers stabilize, and they still remain way off.
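As a diagnostic aid, here is a hedged sketch of how the counters mentioned above (num_strings, num_collections, etc. from SERVER) could be compared across the three nodes once writes are paused, since HEALTHZ and the "caught up" flags clearly aren't enough. This is not the author's tooling: it assumes the go-redis client, Tile38's OUTPUT json and SERVER commands, and placeholder node addresses.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"

	"github.com/redis/go-redis/v9"
)

// serverStats pulls the "stats" section of Tile38's SERVER reply. Switching
// the connection to JSON output (OUTPUT json) keeps parsing simple; the
// addresses and the exact set of fields diffed below are illustrative
// assumptions, not part of the original report.
func serverStats(ctx context.Context, addr string) (map[string]any, error) {
	c := redis.NewClient(&redis.Options{Addr: addr})
	defer c.Close()

	if err := c.Do(ctx, "OUTPUT", "json").Err(); err != nil {
		return nil, err
	}
	raw, err := c.Do(ctx, "SERVER").Text()
	if err != nil {
		return nil, err
	}
	var reply struct {
		Stats map[string]any `json:"stats"`
	}
	if err := json.Unmarshal([]byte(raw), &reply); err != nil {
		return nil, err
	}
	return reply.Stats, nil
}

func main() {
	ctx := context.Background()
	nodes := map[string]string{ // hypothetical addresses
		"leader":    "10.0.0.1:9851",
		"follower1": "10.0.0.2:9851",
		"follower2": "10.0.0.3:9851",
	}
	for name, addr := range nodes {
		stats, err := serverStats(ctx, addr)
		if err != nil {
			log.Fatalf("%s: %v", name, err)
		}
		// With writes paused these counters should agree; drift here is a
		// stronger divergence signal than the "caught up" status flags.
		fmt.Printf("%-10s num_strings=%v num_collections=%v\n",
			name, stats["num_strings"], stats["num_collections"])
	}
}
```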
After looking into the logs it appears that ECS rebooted the servers sometime last night. I have no idea if this caused the problem, but I don't think it should have, since the one STRING record that is bad is one I manually added in the middle of the day a few days earlier and manually removed during the daytime as well.
So what it looks like happened is that the follower missed a chunk of updates, and because the leader doesn't ship updates it doesn't think the followers need, it stays out of sync. Or they rebooted and somehow failed to properly load the AOF file back in.
I should note that I do perform AOFSHRINK on an interval automatically.
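The report doesn't say how that interval AOFSHRINK is scheduled, so here is only a minimal sketch of such a job using go-redis; the address and interval are placeholders. The one design point worth stressing (given the comment above about AOFSHRINK not running and the leader filling its disk) is that a failed shrink should be surfaced loudly rather than silently skipped.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/redis/go-redis/v9"
)

// Minimal sketch of an interval AOFSHRINK job against the Tile38 leader.
// Address and interval are placeholder assumptions.
func main() {
	ctx := context.Background()
	leader := redis.NewClient(&redis.Options{Addr: "10.0.0.1:9851"}) // hypothetical leader address

	ticker := time.NewTicker(6 * time.Hour) // placeholder interval
	defer ticker.Stop()

	for range ticker.C {
		if err := leader.Do(ctx, "AOFSHRINK").Err(); err != nil {
			// Surface this somewhere that alerts a human: a silently failing
			// shrink eventually exhausts disk and kills the leader uncleanly.
			log.Printf("AOFSHRINK failed: %v", err)
			continue
		}
		log.Printf("AOFSHRINK completed")
	}
}
```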
To Reproduce
I've only been able to get it into this state (and notice it) 3 times in the last few months. I am working on figuring out how to cause it to happen. I will stand up a few more stacks to see if I can repeat the behavior.
Expected behavior
I mean, obviously it shouldn't happen.
Logs
I would love to provide these. The servers in question are up and running and I'm not touching them for fear of losing the ability to debug / investigate.
Operating System (please complete the following information):
Additional context
Lots of context above. I have the servers up and running and have the ability to SSH into them. I can pull logs or whatever is necessary to debug the issue. I am not going to touch them for now.