@nammn nammn commented Jan 23, 2026

Summary

Fixes flaky test TestBackupForMongodb::test_deploy_same_mdb_again_with_orphaned_backup by clearing monitoring/backup agent credentials when authentication is disabled.

Why It's Flaky

The test creates mdb-four-zero (auth enabled), deletes it with autoTerminateOnDeletion=False (orphaned backup), then creates mdb-four-two (auth disabled).

Race condition:

  • Passes: OM cleans up mdb-four-zero hosts before backup request → only healthy mdb-four-two hosts checked
  • Fails: mdb-four-zero hosts still registered with stale credentials → monitoring auth fails → no version info → 409

The 409 occurs because OM checks version info across all hosts in the project, not just the target deployment.

Root Cause Analysis

The Bug

When re-deploying MongoDB with auth disabled after a previous deployment with auth enabled:

  1. Stale credentials remain in the monitoring agent template
  2. Ops Manager copies these credentials to per-host monitoring config
  3. Monitoring agent attempts SCRAM-SHA-1 authentication against MongoDB (which has auth disabled)
  4. Authentication fails → no metrics collected → no version info → 409 error

Evidence from Failure

Task ID:
https://spruce.mongodb.com/task/mongodb_kubernetes_e2e_om80_kind_ubi_e2e_om_ops_manager_backup_294d005d333d64faff0d753ad57bb25b5941917f_26_01_23_12_36_41/tests?execution=1&sorts=STATUS%3AASC

1. Monitoring agent log shows authentication failures despite globalAuthUsername = <unset>:

[2026-01-23T14:25:55.748+0000] [header.info] [::0] globalAuthUsername = <unset>

[2026-01-23T14:26:05.181+0000] [metrics.dbstats.collector-mdb-four-zero-0...error] 
Failed to get connectionStatus. Err: `auth error: sasl conversation error: 
unable to authenticate using mechanism "SCRAM-SHA-1": (AuthenticationFailed) Authentication failed.`

2. Automation config shows disabled: true but credentials still present:

{
  "auth": {
    "disabled": true,
    "autoUser": "mms-automation-agent",
    "autoPwd": "vDMC07Aj7t0QzU+xOEJrSiuz1/C/2Kmwq2tw7FLoHVuNre4C..."
  }
}

-> In OM we don't check whether auth is disabled; we only set the user and password.
-> Automation uses the credentials as it receives them; it doesn't check the disabled flag either.
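A quick way to spot this state in a dumped automation config is sketched below. The struct models only the three fields shown in the excerpt; the type and function names are hypothetical, and the password is a placeholder. The check flags a leftover `autoPwd` only, on the assumption that a username without a password is harmless.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// authSection models only the fields visible in the excerpt above.
type authSection struct {
	Disabled bool   `json:"disabled"`
	AutoUser string `json:"autoUser"`
	AutoPwd  string `json:"autoPwd"`
}

// automationConfig is a hypothetical, trimmed-down view of the automation config.
type automationConfig struct {
	Auth authSection `json:"auth"`
}

// hasStalePassword reports whether auth is disabled while a password is still set.
func hasStalePassword(raw []byte) (bool, error) {
	var cfg automationConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return false, err
	}
	return cfg.Auth.Disabled && cfg.Auth.AutoPwd != "", nil
}

func main() {
	raw := []byte(`{"auth":{"disabled":true,"autoUser":"mms-automation-agent","autoPwd":"placeholder"}}`)
	stale, err := hasStalePassword(raw)
	fmt.Println(stale, err) // true <nil>
}
```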

3. Test failure after 600s timeout:

Status: 409 (Conflict), Detail: Backup failed to start: MongoDB version information 
is not yet available for one of your hosts.

Causal Chain

Stale credentials in config (auth.disabled=true but autoUser/autoPwd set)
    ↓
Ops Manager copies credentials to per-host monitoring config
    ↓
Monitoring agent attempts SCRAM-SHA-1 authentication
    ↓
MongoDB rejects (auth is disabled)
    ↓
No metrics collected → no version info → backup request returns 409

Validation

Unit Test Added

TestDisableAuthenticationClearsAgentCredentials in controllers/operator/authentication/authentication_test.go:

  • Passes with fix - credentials cleared via MERGO_DELETE
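Why a sentinel is needed at all can be shown with a toy merge. This is a sketch under assumptions: `mergeFields` and the map-based config are hypothetical, and the sentinel string stands in for the operator's `util.MergoDelete`, whose value and merge code are not shown in this conversation.

```go
package main

import "fmt"

// mergoDelete is a hypothetical sentinel standing in for util.MergoDelete.
const mergoDelete = "MERGO_DELETE"

// mergeFields overlays desired onto existing. An empty desired value means
// "no opinion" and keeps the existing value, so clearing a field has to be
// expressed explicitly with the sentinel.
func mergeFields(existing, desired map[string]string) map[string]string {
	out := make(map[string]string, len(existing))
	for k, v := range existing {
		out[k] = v
	}
	for k, v := range desired {
		switch v {
		case mergoDelete:
			delete(out, k) // explicit removal
		case "":
			// keep whatever was already there
		default:
			out[k] = v
		}
	}
	return out
}

func main() {
	existing := map[string]string{"username": "mms-automation", "password": "stale"}

	// Sending empty strings does not clear anything in this model.
	kept := mergeFields(existing, map[string]string{"username": "", "password": ""})
	fmt.Println(len(kept)) // 2

	// Sending the sentinel removes both fields.
	cleared := mergeFields(existing, map[string]string{"username": mergoDelete, "password": mergoDelete})
	fmt.Println(len(cleared)) // 0
}
```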

Evergreen Patches (6/6 passed)

Patch  Status  URL
1      passed  https://evergreen.mongodb.com/version/697727c8a39dc400072e5a81
2      passed  https://evergreen.mongodb.com/version/697727d9a39dc400072e5a9f
3      passed  https://evergreen.mongodb.com/version/697727e201c5ae0007a50dbb
4      passed  https://evergreen.mongodb.com/version/6977408884379900078dd545
5      passed  https://evergreen.mongodb.com/version/6977409184379900078dd54e
6      passed  https://evergreen.mongodb.com/version/6977409adfe34400070ea867

github-actions bot commented Jan 23, 2026

⚠️ (this preview might not be accurate if the PR is not rebased on current master branch)

MCK 1.7.0 Release Notes

New Features

  • Allows users to override any Ops Manager emptyDir mount with their own PVCs via statefulSet.spec.volumeClaimTemplates overrides.
  • Added support for auto embeddings in MongoDB Community to automatically generate vector embeddings for the vector search data. See the linked document for detailed documentation.
  • MongoDBSearch: Updated the default mongodb/mongodb-search image version to 0.60.1. This is the version MCK uses if .spec.version is not specified.
  • Added support for configurable ValidatingWebhookConfiguration name via operator.webhook.name helm value.

Bug Fixes

  • Fixed an issue so that hosts are consistently removed from Ops Manager monitoring during AppDB scale-down events.
  • Fixed an issue where monitoring agents would fail after disabling TLS on a MongoDB deployment.
  • Persistent Volume Claim resize fix: Fixed an issue where the Operator ignored namespaces when listing PVCs, causing conflicts with resizing PVCs of the same name. Now, PVCs are filtered by both name and namespace for accurate resizing.
  • Fixed a panic that occurred when the domain names for a horizon were empty. Now, if the domain names are not valid (RFC 1123), the validation will fail before reconciling.
  • MongoDBMultiCluster, MongoDB: Fixed an issue where the operator skipped host removal when an external domain was used, leaving monitoring hosts in Ops Manager even after workloads were correctly removed from the cluster.
  • Fixed an issue where the Operator could crash when TLS certificates are configured using the certificatesSecretsPrefix field without additional TLS settings.
  • MongoDBOpsManager, AppDB: Block removing a member cluster while it still has non-zero members. This prevents scaling down without the preserved configuration and avoids unexpected issues.
  • Fixed an issue where the monitoring agent failed to report version information to Ops Manager when a MongoDB deployment with authentication disabled was created after a previous deployment with authentication enabled had been deleted.
  • The operator now clears stale agent credentials from monitoring and backup agent configs when authentication is disabled, preventing authentication failures against MongoDB instances that have auth disabled.

When re-deploying a MongoDB resource, the monitoring agent needs time to
re-register and send version information to Ops Manager. If the operator
attempts to enable backup before this information is available, Ops Manager
returns a 409 Conflict error with the message 'MongoDB version information
is not yet available for one of your hosts'.

Previously, the operator treated this error as a hard failure, which caused
the MongoDB resource to enter a Failed state. This fix adds special handling
for this specific 409 error to return workflow.Pending instead, allowing
the operator to retry until the monitoring agent reports version information.

Changes:
- Add BackupVersionNotAvailable constant in apierror package
- Add ErrorBackupVersionNotAvailable() helper method to check for this error
- Update ensureBackupConfigStatuses() to return Pending for this error

This fixes the flaky test:
- Test: TestBackupForMongodb::test_deploy_same_mdb_again_with_orphaned_backup
- Task: https://spruce.mongodb.com/task/mongodb_kubernetes_e2e_om80_kind_ubi_e2e_om_ops_manager_backup_294d005d333d64faff0d753ad57bb25b5941917f_26_01_23_12_36_41/tests
@nammn force-pushed the fix/backup-409-version-not-available branch from 1984257 to 0f3523b on January 23, 2026 15:17
nammn added 5 commits January 23, 2026 19:18
When authentication is disabled, explicitly clear the username and password
from the monitoring and backup agent configs. This prevents the agents from
attempting to authenticate with stale credentials when MongoDB has auth
disabled.

This fixes the root cause of the flaky backup test where:
1. First MongoDB deployment has auth enabled, credentials are set
2. Deployment is deleted with autoTerminateOnDeletion=False (orphaned backup)
3. Second deployment is created with auth disabled
4. Monitoring agent still has old credentials and tries to authenticate
5. Authentication fails, no metrics collected, no version info reported
6. OM returns 409 'version info not available' indefinitely
// Step 3: Verify that monitoring agent credentials are cleared
monitoringConfig, err = conn.ReadMonitoringAgentConfig()
require.NoError(t, err)
assert.Equal(t, util.MergoDelete, monitoringConfig.MonitoringAgentTemplate.Username,

Without the fix, the agent template username and password are still set to the stale values:

		config.SetAgentUserName("mms-automation")
		config.SetAgentPassword("stale-password-from-previous-deployment")

nammn added 2 commits January 26, 2026 14:30
The 409 error handling only changed visibility (Pending vs Failed status) but both
states retry with the same mechanism. The actual fix is clearing stale agent
credentials when auth is disabled, which prevents the monitoring agent from
failing to authenticate against MongoDB instances with auth disabled.

This simplifies the code by relying on the existing transient error handling.
@nammn nammn marked this pull request as ready for review January 27, 2026 09:12
@nammn nammn requested review from a team and vinilage as code owners January 27, 2026 09:12
@nammn nammn marked this pull request as draft January 27, 2026 11:43
nammn added 11 commits January 27, 2026 13:43
When SCRAM authentication is disabled for an agent, the monitoring and backup
agent configs must have their username/password cleared. This prevents the
agents from attempting SCRAM authentication against deployments that don't
have auth enabled.

This follows the same pattern used by LDAP and X509 authentication mechanisms,
which already clear their respective credentials in DisableAgentAuthentication.

Root cause: In a project with multiple deployments where some have SCRAM enabled
and others have auth disabled, the monitoring agent would use project-level SCRAM
credentials for ALL hosts. This caused authentication failures on auth-disabled
deployments, preventing metrics collection and causing 409 'version info not
available' errors.

Fixes: mongodb-kubernetes-jdz
- Restore EnsurePassword() call which is needed for safe auth transitions
- Clear AutoPwd after EnsurePassword() to prevent stale credentials
- Keep AutoUser = util.AutomationAgentName (original behavior)
- Remove diagnostic logging
- Update test to expect AutoPwd = MergoDelete when auth is disabled
…bled

- Track wasEnabled state to determine if we're actually transitioning
- EnsurePassword() is only needed during the transition (for agents to authenticate)
- AutoPwd = MergoDelete is always set to clear stale credentials
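The ordering described in these bullets can be sketched as follows. Everything here is a simplified model under assumptions: the `auth` struct, `ensurePassword`, and `disable` are hypothetical, `mergoDelete` stands in for `util.MergoDelete`, and the agent name is taken from the automation config excerpt rather than from `util.AutomationAgentName` itself. What the real `EnsurePassword()` does beyond making sure a password exists is not visible in this conversation.

```go
package main

import "fmt"

const (
	mergoDelete         = "MERGO_DELETE"         // hypothetical sentinel for util.MergoDelete
	automationAgentName = "mms-automation-agent" // assumed value, from the config excerpt
)

// auth is a hypothetical, trimmed-down model of the automation config auth section.
type auth struct {
	Disabled      bool
	AutoUser      string
	AutoPwd       string
	ensurePwCalls int // counts ensurePassword calls, for illustration only
}

// ensurePassword is a placeholder for the real EnsurePassword().
func (a *auth) ensurePassword() {
	a.ensurePwCalls++
	if a.AutoPwd == "" || a.AutoPwd == mergoDelete {
		a.AutoPwd = "generated-placeholder"
	}
}

// disable models the three bullets: only call ensurePassword on a real
// enabled-to-disabled transition, but always mark the password for deletion.
func disable(a *auth) {
	wasEnabled := !a.Disabled
	if wasEnabled {
		a.ensurePassword()
	}
	a.Disabled = true
	a.AutoUser = automationAgentName // unchanged from the original behavior
	a.AutoPwd = mergoDelete          // always clear stale credentials
}

func main() {
	a := &auth{AutoUser: automationAgentName, AutoPwd: "stale"}
	disable(a) // real transition: ensurePassword runs once
	disable(a) // already disabled: ensurePassword is skipped
	fmt.Println(a.Disabled, a.AutoPwd, a.ensurePwCalls) // true MERGO_DELETE 1
}
```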
2 participants