"Failed to find previous kopia snapshot manifests" for velero backup using csi snapshot data mover #8222
Comments
Looks like the repo content with ID … (take the example for volume d…):
Therefore, it looks like the content was not found at one time but found at the other time.
"Therefore, looks like the content was not found at one time but found at the other time." -> True, I have triggered a backup almost immediately after the backup is failed at "2024-09-17T02:09:58Z". We use AWS S3 as object store. backup repository example:
I also think the object should exist all the time, but the object store just returned …
Corresponding log example from the velero pod:

level=error msg="data path backup failed: Failed to run kopia backup: Failed to find previous kopia snapshot manifests for si default@default:snapshot-data-upload-download/kopia/feature-flag-proxy/redis-data-feature-flag-proxy-redis-replicas-0: unable to find manifest entries: failed to get manifests with labels map[hostname:default path:snapshot-data-upload-download/kopia/feature-flag-proxy/redis-data-feature-flag-proxy-redis-replicas-0 type:snapshot username:default]: error to find manifests: unable to load manifest contents: error loading manifest content: error getting cached content from blob "qf139466a8e99d0a9065f927a84f0aee4-sd5191d7c73ebc3bb12c": failed to get blob with ID qf139466a8e99d0a9065f927a84f0aee4-sd5191d7c73ebc3bb12c: BLOB not found, plugin: velero.io/csi-pvc-backupper" backup=velero/daily-bkp-20240918033004 logSource="pkg/controller/backup_controller.go:663"

Is there any way to tell the plugin to retry in case of 404?
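For anyone who wants to rule out a genuinely missing object, here is a rough sketch of how to check for the blob directly in S3 with the AWS CLI. The bucket and prefix are placeholders, and the key layout assumes Velero's default of one kopia repository per namespace under `<prefix>/kopia/<namespace>/`:

```sh
# Placeholders: replace <bucket> and <prefix> with the values from your
# BackupStorageLocation. The blob ID comes from the error message above.
aws s3 ls "s3://<bucket>/<prefix>/kopia/feature-flag-proxy/" --recursive \
  | grep qf139466a8e99d0a9065f927a84f0aee4-sd5191d7c73ebc3bb12c
```

If the object shows up here but kopia still reports "BLOB not found", that points to an intermittent read/auth problem rather than actually missing data.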
This seems to happen quite frequently in multiple clusters. Tried with velero versions 1.12 and 1.14, and also with different aws-plugin versions.
Another noteworthy thing: please check whether the snapshots are shared across clusters.
The VolumeSnapshots will be deleted after the DataUpload is successful, right?
The VolumeSnapshots are reset to retain the snapshot before deletion, so that is not the problem. I think we can use …
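For reference, a minimal way to eyeball the deletion policy on the snapshot contents while a backup is in flight (a sketch that assumes the external-snapshotter CRDs are installed; the field paths are the standard VolumeSnapshotContent spec fields):

```sh
# List VolumeSnapshotContents with their deletionPolicy and the bound
# VolumeSnapshot, to confirm the policy is Retain before deletion.
kubectl get volumesnapshotcontents \
  -o custom-columns=NAME:.metadata.name,POLICY:.spec.deletionPolicy,SNAPSHOT:.spec.volumeSnapshotRef.name
```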
@blackpiglet / @Lyndon-Li, do we have any workaround for this at the moment? I found the following warnings on the node-agent; are they related?
The warning message is not relevant, so we need more info to get to the root cause. Please try to answer the questions below:
Hi @Lyndon-Li
Hi @Lyndon-Li, we had a cluster where we did not do any manual deletion of backups, but we still saw this error there. We also increased the maintenance frequency to 24h.
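For context, a sketch of how a 24h maintenance interval can be set. The maintenanceFrequency field and the --default-repo-maintain-frequency server flag are my assumptions about how this is usually configured, so double-check against your Velero version:

```sh
# Patch an existing BackupRepository CR (the CR name is a placeholder):
kubectl -n velero patch backuprepository <repo-name> \
  --type merge -p '{"spec":{"maintenanceFrequency":"24h"}}'

# Or set the default for newly created repos on the velero server:
#   velero server --default-repo-maintain-frequency=24h
```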
For this question:
If the answer is …
My bad, we do restore single namespaces whose DataUpload completed. The one with the failed DataUpload cannot be restored.
Is this somehow related to this kopia issue: https://kopia.discourse.group/t/can-not-access-repository-failed-to-get-blob/1274/17
Hi @Lyndon-Li, one more thing: if I delete all the BackupRepository CRs and trigger a backup, it completes successfully. Before deleting the backup repositories, a few DataUploads had been failing for about a week; then, after a few days, they start failing again.
It further indicates that the repo snapshot is not missing.
But if this is an intermittent connection issue, how are other DataUploads of the same repo (other volumes in the same namespace) completing at the same time?
Is it some issue with IRSA? Because I can confirm that for the last few days I have been deleting the BackupRepository CRs frequently and I don't see this issue. But this is for sure not the way we would like to go forward.
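For anyone else landing here, the workaround being described is roughly the following (a sketch with placeholder names, and explicitly not a long-term fix, since the repos just get re-initialized on the next backup):

```sh
# List the BackupRepository CRs and delete the one(s) for the affected
# namespace; Velero/node-agent will recreate and reconnect them on the
# next backup that touches that namespace.
kubectl -n velero get backuprepositories.velero.io
kubectl -n velero delete backuprepository <namespace>-<bsl>-kopia-xxxxx
```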
What steps did you take and what happened:
velero version 1.12
also tried upgrading to 1.14.1
aws plugin 1.10.1
scheduled backup runs every day with the CSI data mover.
backups are intermittently in PartiallyFailed status, with a few DataUploads in "Failed" state.
on describing the failed DataUploads we find the following error (see the commands sketched after the error below):
data path backup failed: Failed to run kopia backup: Failed to find previous kopia snapshot manifests for si default@default:snapshot-data-upload-download/kopia/elastic-system/elasticsearch-data-logging-elasticsearch-es-master-nodeset-2: unable to find manifest entries: failed to get manifests with labels map[hostname:default path:snapshot-data-upload-download/kopia/elastic-system/elasticsearch-data-logging-elasticsearch-es-master-nodeset-2 type:snapshot username:default]: error to find manifests: unable to load manifest contents: error loading manifest content: error getting cached content from blob "q78b42bb7bf6a17c91da3fdd28388d193-s8b43049f3444df3912c": failed to get blob with ID q78b42bb7bf6a17c91da3fdd28388d193-s8b43049f3444df3912c: BLOB not found
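For completeness, the commands used to find and describe the failed DataUploads were roughly the following; the velero.io/backup-name label selector is an assumption about how the CRs are labeled, so adjust if it does not match:

```sh
# Find the DataUploads created for a given backup (label is assumed):
kubectl -n velero get datauploads.velero.io -l velero.io/backup-name=<backup-name>
# Inspect a failed one to see the error shown above:
kubectl -n velero describe dataupload <failed-dataupload-name>
```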
What did you expect to happen:
All DataUploads are in Completed state and backups are successful every day.
The following information will help us better understand what's going on:
If you are using velero v1.7.0+:
Please use
velero debug --backup <backupname> --restore <restorename>
to generate the support bundle and attach it to this issue. For more options, please refer to velero debug --help
bundle-2024-09-17-11-32-20.tar.gz
If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)
kubectl logs deployment/velero -n velero
velero backup describe <backupname>
or kubectl get backup/<backupname> -n velero -o yaml
velero backup logs <backupname>
velero restore describe <restorename>
or kubectl get restore/<restorename> -n velero -o yaml
velero restore logs <restorename>
Anything else you would like to add:
Environment:
Production environment
Velero version (use velero version):
Velero features (use velero client config get features):
Kubernetes version (use kubectl version):
OS (e.g. from /etc/os-release):
Vote on this issue!
This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.