This repository has been archived by the owner on Sep 30, 2024. It is now read-only.

Report IntermediateMaster errors under CoMaster deployment #1481

Open
ZhangJiaQiao opened this issue Mar 23, 2023 · 0 comments

ZhangJiaQiao commented Mar 23, 2023

I got two failure detections on two co-master clusters: UnreachableIntermediateMasterWithLaggingReplicas and DeadIntermediateMasterAndReplicas were reported even though the clusters were in a co-master setup.

image

Under this architecture, the reported failure should be UnreachableMaster or one of the co-master failures.

I then checked the analysis code:

} else if a.IsCoMaster && !a.LastCheckValid && !a.LastCheckPartialSuccess && a.CountValidReplicas > 0 && a.CountValidReplicatingReplicas > 0 {
	a.Analysis = UnreachableCoMaster
	a.Description = "Co-master cannot be reached by orchestrator but it has replicating replicas; possibly a network/host issue"
	//

} else if !a.IsMaster && !a.LastCheckValid && a.CountReplicas > 0 && a.CountValidReplicas == 0 {
	a.Analysis = DeadIntermediateMasterAndReplicas
	a.Description = "Intermediate master cannot be reached by orchestrator and all of its replicas are unreachable"
	//
} else if !a.IsMaster && !a.LastCheckValid && a.CountLaggingReplicas == a.CountReplicas && a.CountDelayedReplicas < a.CountReplicas && a.CountValidReplicatingReplicas > 0 {
	a.Analysis = UnreachableIntermediateMasterWithLaggingReplicas
	a.Description = "Intermediate master cannot be reached by orchestrator and all of its replicas are lagging"
	//

If LastCheckPartialSuccess is true and replication between the two co-masters is healthy, the UnreachableCoMaster branch is skipped and these IntermediateMaster failures are reported instead of the co-master ones.
With replication healthy, we get DeadIntermediateMasterAndReplicas if both co-masters are unreachable, and UnreachableIntermediateMasterWithLaggingReplicas if the primary co-master is unreachable and its replicas are lagging.
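
The fall-through described above can be reproduced with a minimal sketch. It copies only the three branches quoted in this issue into a standalone function. The `replicationAnalysis` struct, the `analyze` helper, and the `"NoProblem"` fallback are hypothetical simplifications, not orchestrator's actual types, and the sketch assumes that a co-master has IsMaster == false because it replicates from its peer.

```go
package main

import "fmt"

// replicationAnalysis is a hypothetical, reduced stand-in that holds only the
// fields referenced by the branches quoted in this issue.
type replicationAnalysis struct {
	IsMaster                      bool
	IsCoMaster                    bool
	LastCheckValid                bool
	LastCheckPartialSuccess       bool
	CountReplicas                 uint
	CountValidReplicas            uint
	CountValidReplicatingReplicas uint
	CountLaggingReplicas          uint
	CountDelayedReplicas          uint
}

// analyze evaluates the three quoted conditions in the quoted order.
// The real if-else chain has many more branches; they are omitted here.
func analyze(a replicationAnalysis) string {
	if a.IsCoMaster && !a.LastCheckValid && !a.LastCheckPartialSuccess && a.CountValidReplicas > 0 && a.CountValidReplicatingReplicas > 0 {
		return "UnreachableCoMaster"
	} else if !a.IsMaster && !a.LastCheckValid && a.CountReplicas > 0 && a.CountValidReplicas == 0 {
		return "DeadIntermediateMasterAndReplicas"
	} else if !a.IsMaster && !a.LastCheckValid && a.CountLaggingReplicas == a.CountReplicas && a.CountDelayedReplicas < a.CountReplicas && a.CountValidReplicatingReplicas > 0 {
		return "UnreachableIntermediateMasterWithLaggingReplicas"
	}
	return "NoProblem" // placeholder for "none of the quoted branches matched"
}

func main() {
	// Assumption: IsMaster is false for a co-master.
	// Unreachable co-master, partial success recorded, all replicas valid but lagging.
	lagging := replicationAnalysis{
		IsCoMaster:                    true,
		LastCheckPartialSuccess:       true,
		CountReplicas:                 2,
		CountValidReplicas:            2,
		CountValidReplicatingReplicas: 2,
		CountLaggingReplicas:          2,
	}
	fmt.Println(analyze(lagging))

	// Same topology, but no partial success: the co-master branch matches.
	noPartial := lagging
	noPartial.LastCheckPartialSuccess = false
	fmt.Println(analyze(noPartial))

	// Both co-masters unreachable: the only replica (the peer) is invalid.
	bothDown := replicationAnalysis{
		IsCoMaster:              true,
		LastCheckPartialSuccess: true,
		CountReplicas:           1,
	}
	fmt.Println(analyze(bothDown))
}
```

Under these assumptions the program prints UnreachableIntermediateMasterWithLaggingReplicas, then UnreachableCoMaster, then DeadIntermediateMasterAndReplicas, which matches the two misreported failures.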

LastCheckPartialSuccess is set to true during discovery, as soon as the first SQL query succeeds:

err = db.QueryRow("select @@global.hostname, ifnull(@@global.report_host, ''), @@global.server_id, @@global.version, @@global.version_comment, @@global.read_only, @@global.binlog_format, @@global.log_bin, @@global.log_slave_updates").Scan(
	&mysqlHostname, &mysqlReportHost, &instance.ServerID, &instance.Version, &instance.VersionComment, &instance.ReadOnly, &instance.Binlog_format, &instance.LogBinEnabled, &instance.LogReplicationUpdatesEnabled)
if err != nil {
	goto Cleanup
}
partialSuccess = true // We at least managed to read something from the server.

This looks like a bug in how co-master and intermediate-master failures are analyzed. The fault may lie in the conditions or ordering of the if-else chain.
