Fix flaky backup test: Handle 409 'version not available' error #719
Draft
nammn
wants to merge
19
commits into
master
from
fix/backup-409-version-not-available
+183
−7
MCK 1.7.0 Release Notes
New Features
Bug Fixes
When re-deploying a MongoDB resource, the monitoring agent needs time to re-register and send version information to Ops Manager. If the operator attempts to enable backup before this information is available, Ops Manager returns a 409 Conflict error with the message 'MongoDB version information is not yet available for one of your hosts'.

Previously, the operator treated this error as a hard failure, which caused the MongoDB resource to enter a Failed state.

This fix adds special handling for this specific 409 error to return workflow.Pending instead, allowing the operator to retry until the monitoring agent reports version information.

Changes:
- Add BackupVersionNotAvailable constant in apierror package
- Add ErrorBackupVersionNotAvailable() helper method to check for this error
- Update ensureBackupConfigStatuses() to return Pending for this error

This fixes the flaky test:
- Test: TestBackupForMongodb::test_deploy_same_mdb_again_with_orphaned_backup
- Task: https://spruce.mongodb.com/task/mongodb_kubernetes_e2e_om80_kind_ubi_e2e_om_ops_manager_backup_294d005d333d64faff0d753ad57bb25b5941917f_26_01_23_12_36_41/tests
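The mapping described in this commit message can be sketched as follows. This is a minimal, self-contained illustration, not the operator's code: the `apiError` type, the `status` values, and the choice to match on the message text are all assumptions made for the example. The real change uses a `BackupVersionNotAvailable` constant in the `apierror` package whose value is not shown in this PR.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"strings"
)

// apiError is a hypothetical stand-in for an Ops Manager API error.
type apiError struct {
	Status int
	Detail string
}

func (e *apiError) Error() string {
	return fmt.Sprintf("%d: %s", e.Status, e.Detail)
}

// versionNotAvailableMsg is taken from the message quoted in the commit;
// matching on message text is an assumption made for this sketch.
const versionNotAvailableMsg = "version information is not yet available"

// isBackupVersionNotAvailable reports whether err is the transient 409.
func isBackupVersionNotAvailable(err error) bool {
	var apiErr *apiError
	if !errors.As(err, &apiErr) {
		return false
	}
	return apiErr.Status == http.StatusConflict &&
		strings.Contains(apiErr.Detail, versionNotAvailableMsg)
}

// status is a hypothetical simplification of the operator's workflow status.
type status string

const (
	statusOK      status = "OK"
	statusPending status = "Pending"
	statusFailed  status = "Failed"
)

// statusForBackupError maps the transient 409 to Pending and any other
// error to Failed.
func statusForBackupError(err error) status {
	switch {
	case err == nil:
		return statusOK
	case isBackupVersionNotAvailable(err):
		return statusPending
	default:
		return statusFailed
	}
}

func main() {
	transient := &apiError{
		Status: http.StatusConflict,
		Detail: "MongoDB version information is not yet available for one of your hosts",
	}
	fmt.Println(statusForBackupError(transient))
	fmt.Println(statusForBackupError(errors.New("unrelated failure")))
}
```

Only the one specific 409 becomes Pending here; other conflicts still surface as failures.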
1984257 to
0f3523b
Compare
When authentication is disabled, explicitly clear the username and password from the monitoring and backup agent configs. This prevents the agents from attempting to authenticate with stale credentials when MongoDB has auth disabled.

This fixes the root cause of the flaky backup test where:
1. First MongoDB deployment has auth enabled, credentials are set
2. Deployment is deleted with autoTerminateOnDeletion=False (orphaned backup)
3. Second deployment is created with auth disabled
4. Monitoring agent still has old credentials and tries to authenticate
5. Authentication fails, no metrics collected, no version info reported
6. OM returns 409 'version info not available' indefinitely
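A minimal sketch of the credential clearing described above, under stated assumptions: `agentConfig` and `clearCredentialsIfAuthDisabled` are hypothetical names invented for illustration, and `mergoDelete` stands in for `util.MergoDelete` with an assumed literal value.

```go
package main

import "fmt"

// mergoDelete stands in for util.MergoDelete, a sentinel that marks a field
// for deletion when configs are merged. The literal value is an assumption.
const mergoDelete = "MERGO_DELETE"

// agentConfig is a hypothetical, reduced view of a monitoring or backup
// agent config: only the two credential fields matter here.
type agentConfig struct {
	Username string
	Password string
}

// clearCredentialsIfAuthDisabled marks the credentials of every given agent
// config for deletion when authentication is disabled, and leaves them
// untouched otherwise.
func clearCredentialsIfAuthDisabled(authEnabled bool, configs ...*agentConfig) {
	if authEnabled {
		return
	}
	for _, c := range configs {
		c.Username = mergoDelete
		c.Password = mergoDelete
	}
}

func main() {
	monitoring := &agentConfig{Username: "mms-automation", Password: "stale-password-from-previous-deployment"}
	backup := &agentConfig{Username: "mms-automation", Password: "stale-password-from-previous-deployment"}

	clearCredentialsIfAuthDisabled(false, monitoring, backup)
	fmt.Printf("monitoring=%+v backup=%+v\n", *monitoring, *backup)
}
```

Using a sentinel rather than an empty string matters when the config is merged into an existing one, since an empty value would typically be treated as "not set" and leave the stale credential in place.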
nammn
commented
Jan 26, 2026
// Step 3: Verify that monitoring agent credentials are cleared
monitoringConfig, err = conn.ReadMonitoringAgentConfig()
require.NoError(t, err)
assert.Equal(t, util.MergoDelete, monitoringConfig.MonitoringAgentTemplate.Username,
Without the fix, the agent template username and password still hold the stale values:
config.SetAgentUserName("mms-automation")
config.SetAgentPassword("stale-password-from-previous-deployment")
The 409 error handling only changed visibility (Pending vs Failed status) but both states retry with the same mechanism. The actual fix is clearing stale agent credentials when auth is disabled, which prevents the monitoring agent from failing to authenticate against MongoDB instances with auth disabled. This simplifies the code by relying on the existing transient error handling.
When SCRAM authentication is disabled for an agent, the monitoring and backup agent configs must have their username/password cleared. This prevents the agents from attempting SCRAM authentication against deployments that don't have auth enabled.

This follows the same pattern used by LDAP and X509 authentication mechanisms, which already clear their respective credentials in DisableAgentAuthentication.

Root cause: In a project with multiple deployments where some have SCRAM enabled and others have auth disabled, the monitoring agent would use project-level SCRAM credentials for ALL hosts. This caused authentication failures on auth-disabled deployments, preventing metrics collection and causing 409 'version info not available' errors.

Fixes: mongodb-kubernetes-jdz
…ntials when disabling auth
- Restore EnsurePassword() call which is needed for safe auth transitions
- Clear AutoPwd after EnsurePassword() to prevent stale credentials
- Keep AutoUser = util.AutomationAgentName (original behavior)
- Remove diagnostic logging
- Update test to expect AutoPwd = MergoDelete when auth is disabled
…bled
- Track wasEnabled state to determine if we're actually transitioning
- EnsurePassword() is only needed during the transition (for agents to authenticate)
- AutoPwd = MergoDelete is always set to clear stale credentials
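The transition logic in the last two commits can be sketched roughly as below. The `auth` struct, `disableAuth`, and the `ensurePassword` callback are hypothetical simplifications; the real `EnsurePassword()` and its side effects are not shown in this PR. The `automationAgentName` value is copied from the automation config excerpt in the PR description, and `mergoDelete` again stands in for `util.MergoDelete` with an assumed value.

```go
package main

import "fmt"

const (
	// mergoDelete stands in for util.MergoDelete; the value is an assumption.
	mergoDelete = "MERGO_DELETE"
	// automationAgentName stands in for util.AutomationAgentName; the value
	// is taken from the automation config excerpt in the PR description.
	automationAgentName = "mms-automation-agent"
)

// auth is a hypothetical, reduced view of the automation config auth section.
type auth struct {
	Disabled bool
	AutoUser string
	AutoPwd  string
}

// disableAuth disables authentication. ensurePassword is a placeholder for
// the real EnsurePassword(); it runs only when auth was enabled before this
// call, i.e. during an actual enabled -> disabled transition.
func disableAuth(a *auth, ensurePassword func(*auth)) {
	wasEnabled := !a.Disabled
	a.Disabled = true
	a.AutoUser = automationAgentName
	if wasEnabled {
		ensurePassword(a)
	}
	// Always mark the password for deletion so no stale credential survives.
	a.AutoPwd = mergoDelete
}

func main() {
	calls := 0
	ensure := func(a *auth) { calls++ }

	a := &auth{Disabled: false, AutoPwd: "stale-password-from-previous-deployment"}
	disableAuth(a, ensure) // transition: ensurePassword runs
	disableAuth(a, ensure) // already disabled: ensurePassword is skipped
	fmt.Printf("auth=%+v ensurePassword calls=%d\n", *a, calls)
}
```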
Summary
Fixes flaky test TestBackupForMongodb::test_deploy_same_mdb_again_with_orphaned_backup by clearing monitoring/backup agent credentials when authentication is disabled.
Why It's Flaky
The test creates mdb-four-zero (auth enabled), deletes it with autoTerminateOnDeletion=False (orphaned backup), then creates mdb-four-two (auth disabled).
Race condition:
- mdb-four-zero hosts before backup request → only healthy mdb-four-two hosts checked
- mdb-four-zero hosts still registered with stale credentials → monitoring auth fails → no version info → 409
The 409 occurs because OM checks version info across all hosts in the project, not just the target deployment.
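To make the project-wide check concrete, here is a toy model of the behaviour described above. It is not Ops Manager's implementation: the `host` type, `hostsMissingVersion`, and `enableBackupStatus` are hypothetical, and the only behaviour modelled is "any host without version info in the project yields a 409".

```go
package main

import "fmt"

// host is a hypothetical, reduced view of a host registered in a project.
type host struct {
	Deployment      string
	VersionReported bool // false until the monitoring agent reports a version
}

// hostsMissingVersion returns every host in the project that has not
// reported version info, regardless of which deployment it belongs to.
func hostsMissingVersion(project []host) []host {
	var missing []host
	for _, h := range project {
		if !h.VersionReported {
			missing = append(missing, h)
		}
	}
	return missing
}

// enableBackupStatus models the response to a backup request: 409 if any
// host in the project lacks version info, 200 otherwise.
func enableBackupStatus(project []host) int {
	if len(hostsMissingVersion(project)) > 0 {
		return 409
	}
	return 200
}

func main() {
	// Old hosts already gone: only the new deployment's hosts are checked.
	clean := []host{{Deployment: "mdb-four-two", VersionReported: true}}
	// Old hosts still registered and unable to report a version.
	stale := []host{
		{Deployment: "mdb-four-zero", VersionReported: false},
		{Deployment: "mdb-four-two", VersionReported: true},
	}
	fmt.Println(enableBackupStatus(clean), enableBackupStatus(stale))
}
```

In this model the target deployment is healthy in both cases; the outcome depends only on whether hosts from the earlier deployment are still registered when the request arrives.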
Root Cause Analysis
The Bug
When re-deploying MongoDB with auth disabled after a previous deployment with auth enabled:
Evidence from Failure
Task ID:
https://spruce.mongodb.com/task/mongodb_kubernetes_e2e_om80_kind_ubi_e2e_om_ops_manager_backup_294d005d333d64faff0d753ad57bb25b5941917f_26_01_23_12_36_41/tests?execution=1&sorts=STATUS%3AASC
1. Monitoring agent log shows authentication failures despite globalAuthUsername = <unset>:
2. Automation config shows disabled: true but credentials still present:
{
  "auth": {
    "disabled": true,
    "autoUser": "mms-automation-agent",
    "autoPwd": "vDMC07Aj7t0QzU+xOEJrSiuz1/C/2Kmwq2tw7FLoHVuNre4C..."
  }
}
-> In OM we don't care whether auth is disabled or not; we only set the user and password.
-> Automation uses those values as it receives them; it doesn't check whether auth is disabled.
3. Test failure after 600s timeout:
Causal Chain
Validation
Unit Test Added
TestDisableAuthenticationClearsAgentCredentials in controllers/operator/authentication/authentication_test.go:
MERGO_DELETE
Evergreen Patches (6/6 passed)