fix: various reader failover fixes #1227

Open
wants to merge 10 commits into
base: main

@aaron-congo (Contributor) commented Dec 18, 2024

Description

  • failover1 fixes:
    • use query to verify host role instead of topology, since server-side topology may not be updated yet.
    • include the original writer as a candidate when failoverMode=strict-reader, since it is likely now a reader. The host role will be verified after connecting.
    • the ClusterAwareReaderFailoverHandler receives a flag indicating whether failoverMode=strict-reader. Before these changes the flag was passed before failoverMode had been initialized, so the handler received the wrong value.
  • failover2 fixes:
    • fixed a bug where reader failover was failing because the plugin assumed that the topology returned by forceRefreshHostList was not stale. In practice the topology is likely stale, because we do not wait for it to be updated before starting reader failover. To solve this, reader failover now tries to connect to all nodes in the topology. If failoverMode=strict-reader, the plugin will query the server to verify the host's role.
    • also fixed a bug where the original writer was only attempted once, which was causing failover failures for 2-instance clusters with failoverMode=strict-reader because it takes some time for the original writer to recover. All nodes will now be attempted multiple times until we hit the failover timeout.
  • telemetry fixes (failover1 and failover2):
    • fixed a bug where telemetryContext.setSuccess and telemetryContext.setException were not called when failover succeeded
    • fixed a bug where failoverReaderFailedCounter was increased twice instead of once when failover failed
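
The failover2 behaviour described above (try every node in the topology, verify the role by query when failoverMode=strict-reader, and keep retrying until the failover timeout) can be sketched roughly as follows. This is a simplified illustration, not the plugin's code: `ReaderFailoverSketch`, `failoverToReader`, and the `connect` and `isReader` callbacks are hypothetical names, and the callbacks stand in for opening a connection and running a role query against the server.

```java
import java.time.Duration;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical sketch: none of these names are taken from the driver.
final class ReaderFailoverSketch {

  /**
   * Tries every host in the (possibly stale) topology, round after round,
   * until one accepts a connection and passes the role check, or until the
   * timeout elapses.
   *
   * @param hosts        all hosts in the topology, including the original writer
   * @param connect      returns a connection handle, or null if the host is unreachable
   * @param isReader     asks the connected server for its actual role
   * @param strictReader mirrors failoverMode=strict-reader
   */
  static <C> Optional<C> failoverToReader(
      List<String> hosts,
      Function<String, C> connect,
      Predicate<C> isReader,
      boolean strictReader,
      Duration timeout,
      Duration retryDelay) {
    final long deadline = System.nanoTime() + timeout.toNanos();
    do {
      for (String host : hosts) {
        C conn = connect.apply(host);
        if (conn == null) {
          continue; // unreachable for now; it is attempted again in the next round
        }
        // Only strict-reader mode needs the role confirmed by the server.
        if (!strictReader || isReader.test(conn)) {
          return Optional.of(conn);
        }
        // Wrong role: real code would close the connection here before moving on.
      }
      try {
        Thread.sleep(retryDelay.toMillis());
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return Optional.empty();
      }
    } while (System.nanoTime() - deadline < 0); // still before the deadline
    return Optional.empty();
  }
}
```

Because the original writer stays in the candidate list and every round revisits it, a 2-instance cluster can succeed once the old writer has recovered as a reader, provided that happens within the timeout.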

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

github-actions bot commented Dec 18, 2024

Qodana Community for JVM

It seems all right 👌

No new problems were found according to the checks applied

💡 Qodana analysis was run in the pull request mode: only the changed files were checked

View the detailed Qodana report

To be able to view the detailed Qodana report, you can either:

  1. Register at Qodana Cloud and configure the action
  2. Use GitHub Code Scanning with Qodana
  3. Host Qodana report at GitHub Pages
  4. Inspect and use qodana.sarif.json (see the Qodana SARIF format for details)

To get *.log files or any other Qodana artifacts, run the action with upload-result option set to true,
so that the action will upload the files as the job artifacts:

      - name: 'Qodana Scan'
        uses: JetBrains/[email protected]
        with:
          upload-result: true

@aaron-congo changed the title from "fix: reader failover logic" to "fix: various reader failover fixes" on Dec 19, 2024
initHostProviderFunc.call();
}

@Override
public OldConnectionSuggestedAction notifyConnectionChanged(final EnumSet<NodeChangeOptions> changes) {
Contributor Author

this function has the same implementation in the superclass so I removed it

@@ -578,22 +567,6 @@ protected void failover(final HostSpec failedHost) throws SQLException {
} else {
failoverReader(failedHost);
}

if (isInTransaction || this.pluginService.isInTransaction()) {
Contributor Author

This was moved to throwFailoverSuccessException so that the exception is thrown inside failoverReader/failoverWriter. Those methods catch the exception and then update the telemetry context, but previously the exception was thrown here, after they had returned, so the telemetry context was never updated on success.
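
To illustrate why the throw site matters, here is a minimal sketch under simplifying assumptions. The classes below are hypothetical stand-ins rather than the driver's own types: the exception is unchecked to keep the example self-contained, and `TelemetryContext` only records what was set on it.

```java
// Hypothetical sketch: simplified stand-ins, not the driver's classes.
final class FailoverTelemetrySketch {

  static final class FailoverSuccessException extends RuntimeException {
    FailoverSuccessException(String message) {
      super(message);
    }
  }

  static final class TelemetryContext {
    Boolean success; // null until setSuccess is called
    Exception exception;

    void setSuccess(boolean success) {
      this.success = success;
    }

    void setException(Exception exception) {
      this.exception = exception;
    }
  }

  static void throwFailoverSuccessException() {
    throw new FailoverSuccessException("connection changed after failover");
  }

  static void failoverReader(TelemetryContext telemetryContext) {
    try {
      // ... connect to a reader and switch the current connection ...
      // Throwing inside the try block lets the catch below observe the outcome.
      throwFailoverSuccessException();
    } catch (FailoverSuccessException e) {
      telemetryContext.setSuccess(true);
      telemetryContext.setException(e);
      throw e; // the caller still sees the failover-success signal
    }
  }
}
```

If the throw were placed in the caller after `failoverReader` returned, the catch block would never run and `success` would stay unset, which matches the telemetry bug described in the PR description.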

@@ -621,28 +593,24 @@ protected void failoverReader(final HostSpec failedHostSpec) throws SQLException
}

if (result == null || !result.isConnected()) {
// "Unable to establish SQL connection to reader instance"
processFailoverFailure(Messages.get("Failover.unableToConnectToReader"));
this.failoverReaderFailedCounter.inc();
Contributor Author

There was a bug here and in other places where the failover failed counter was incorrectly incremented twice: once here and once in the catch section for this block.
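
A minimal sketch of the single-increment pattern, assuming a plain `AtomicLong` in place of the plugin's telemetry counter. `FailedCounterSketch` and its `failoverReader` signature are hypothetical and exist only for illustration.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: an AtomicLong stands in for the telemetry counter.
final class FailedCounterSketch {

  final AtomicLong failoverReaderFailedCounter = new AtomicLong();

  void failoverReader(boolean connected) {
    try {
      if (!connected) {
        // No increment here: the catch block below is the only place
        // that counts the failure, so each failure is counted once.
        throw new IllegalStateException(
            "Unable to establish SQL connection to reader instance");
      }
    } catch (IllegalStateException e) {
      failoverReaderFailedCounter.incrementAndGet();
      throw e;
    }
  }
}
```

Keeping the increment in one place means any failure path added to the try block later is counted without extra bookkeeping.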

@@ -442,43 +430,7 @@ void test_execute_withDirectExecute() throws SQLException {

private void initializePlugin() {
plugin = new FailoverConnectionPlugin(mockPluginService, properties);
}

private static class FooHostListProvider implements HostListProvider, DynamicHostListProvider {
Contributor Author

This class was not being used so I removed it
