-
Notifications
You must be signed in to change notification settings - Fork 131
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Bug] pg_rewind: fatal: could not find previous WAL record #803
Comments
FYI: technically, this is a pre-Apache Cloudberry (Incubating) release. The project has not had their initial Apache release. |
@antoniopetrole - the error messages indicate there were segfaults. Are your systems configured to capture core files? |
@edespino Yes we have specific directories where core files are dumped and I believe we have the core files for all of the times segments have gone down since we deployed the cluster. I'm going to include more information about that in my next issue regarding the segfault itself |
@antoniopetrole I am preparing a tool to assist with the core file review - extracting relevant information. I should have a version to share shortly. |
To assist with core file analysis, I created the following repository containing apache_cloudberry_core_analyzer.py: |
Please wait, then there will be engineers to help support. |
@antoniopetrole Thanks for reporting this complicated issue, we are going to investigate this soon. |
Although we still haven't figured out the root cause of issue #806, I am also working on constructing a workload that may make pg_rewind to fail, in an attempt to reproduce this issue independently. |
Have a potential fix for issue #806, and still finding the root cause. At the same time, we are also trying to reproduce this issue. |
@foreyes Thank you! I understand that this is a complex issue with a lot of moving parts |
Apache Cloudberry version
1.6.0
What happened
Hey all, so our cluster that we have has had a number of segments go down over the past month. Typically this isn't a major issue since we can just run gprecoverseg in an incremental fashion and things are back up within the hour. However almost every time we've had segments go down, the incremental recovery fails with the error
I'm not sure if the default WAL settings are perhaps insufficient in the sense that the WAL records are being rotated out too quickly or if this is a bug. To provide some additional context, here's a timeline of events for segments that crashed on 12/16
12/16/2024 22:08:13: Segment 171 crashes due to a segfault (I'll be opening another issue for this today) from a query
12/16/2024 22:09:33: Segment 51 crashes due to a segfault from the same query
12/16/2024 22:18:38: We run gprecoverseg (no args) and it shows it's only going to recover segment51 and it fails with the following error
12/16/2024 22:19:29: We try running gprecoverseg (no args) again and receive the same exact error as above. It also is saying it's only recovering segment 51
12/16/2024 22:28:01: We try running gprecoverseg -F -a and get the following error
12/16/2024 22:40:04: We try running gprecoverseg (no args) again and now it's showing it's going to try and recover both segment 51 and segment 171 but ends up with the same exact error as before
12/16/2024 22:48:33: 6 more segments go down due to a segfault from the same query (segments 187,19,138,108,143,57)
12/16/2024 22:56:33: 23 more segments go down due to a segfault from the same query (segments 60,129,53,144,131,62,63,175,54,107,71,105,45,160,12,47,177,161,180,70,178,61,147)
12/16/2024 23:42:40: We successfully recover segment 70 with incremental (the duration was 34 minutes)
12/17/2024 1:20:45: We successfully recover 8 more segments incrementally (segments 60,53,62,54,71,107,12,47)
12/17/2024 1:30:12: We try recovering 19 more segments incrementally (19,57,61,63,105,108,129,131,138,143,144,147,160,161,171,175,177,178,180) and received the following error
12/17/2024 3:02:46: We end up recovering 13 specified segments incrementally according to the logs (61,63,105,131,144,147,160,161,171,175,177,178,180)
12/17/2024 4:09:44: On our next run we try to recover more segments and we end up recovering duplicates of our last run (it seems like they didn't recover successfully although gprecoverseg says they did? I need to double check this)
(12,47,53,54,60,61,62,63,70,71,105,107,131,144,147,160,161,171,175,177,178,180)
12/17/2024 4:18:41: We kick off a FULL recovery for the remaining segments which works after running for 10 hours (segments 19,45,51,57,108,129,138,143,187)
12/17/2024 19:02:03: We run a rebalance and the cluster is back in a healthy state
So I know this is a lot of information to digest, I have a spreadsheet link below that shows the timeline of events along with error messages etc. The main thing I want to call out is that we received this WAL error on segments 51, 187, 45, 108, 57, and 19. However a bunch of the other segments that crashed at the same time as those were able to do an incremental with no issues. It seems like we effectively had to go through the process of eliminating potential problem segments from our gprecoverseg input file one at a time to figure out what's going wrong. What's also strange is that our first incremental recovery attempt was only 9 minutes after the segment went down and we hit this issue. However for the final incremental that worked it was almost 5 hours after those segments went down, so I'm not sure how valid the WAL record rotation theory is.
I'm happy to provide any clarifications/logs that might be needed for this, here is a more detailed timeline for review
https://docs.google.com/spreadsheets/d/1qJNC88rFAD6OzF2b0fEeNgAX57QdnKY8SZhnhbg9q6g/edit?usp=sharing
What you think should happen instead
The expected behavior here would be that there aren't inconsistencies in WAL files being available amongst a group of segments that all failed at the same time
How to reproduce
Unfortunately I'm not sure how to even reproduce this
Operating System
Rocky Linux 8.10 (Green Obsidian)
Anything else
This has happened 3 times in the past month requiring a full recovery for either all downed segments or a subset of them
Are you willing to submit PR?
Code of Conduct
The text was updated successfully, but these errors were encountered: