SEP: Master cluster #72
Conversation
I can't find a preview with the images inline in the right order, but why are "Salt Master" and "Master" different things? It also seems like another drawback will be increased master load, if events are being repeated to all masters.
No, the load increase on a single master will be minor because it's just shuffling the event bus back and forth with the other masters. That load is negligible compared to the increased capacity from having all masters share the load of the jobs being executed.
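For context, here is a minimal sketch of what a clustered master configuration might look like. `cluster_peers` is discussed later in this thread; `cluster_id` and `cluster_pki_dir` are assumed option names based on PR 64936, and all hostnames/paths are placeholders, so treat this as an illustration rather than the definitive config:

```yaml
# /etc/salt/master on one cluster master -- hypothetical example values.
# cluster_id and cluster_pki_dir are assumed option names from PR 64936;
# cluster_peers entries must match each peer's master ID and be resolvable
# as hostnames (via DNS or equivalent).
cluster_id: cluster1
cluster_peers:
  - master2.example.com
  - master3.example.com
# Shared directory (e.g. a GlusterFS mount) holding the cluster keys.
cluster_pki_dir: /srv/cluster/pki
```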
So every
Can you go into more detail on the future of the current HA methods with this in place, as well as the future of Syndic? Also, are there any potential pitfalls to look at with things such as network latency? What kind of throughput will this require? What about split-brain handling?
This work is not deprecating any of the current HA functionality, nor is it deprecating Syndic.
The network will need to be reliable, and this is called out in the docs. If there is a split-brain problem, the network is not reliable.
By that definition, no network is reliable. That's why we need HA solutions in the first place.
We at least need to know which way it's going to fail during a network partition and not do something unsafe.
As far as consistency and reliability go, there is a huge difference between local networks and WAN networks. With this design, if a master goes offline for some reason, there is no failure: any minion connections will be routed to a different master by the load balancer. The other masters will still try to forward events, and you will see timeouts in the logs.
It isn't just about consistency and reliability. If the communication between masters CAN be broken without them showing as offline, it will happen. At the very least there needs to be documentation of what it looks like when it happens. I honestly don't think it will break much, as we don't do total bidirectional control, but it needs to be documented.
I can see this happening with the kind of engineer that loves segregating network traffic onto separate LANs: one network for minion communication, one network for storage, one network for master communication. Then all of a sudden the network admin has a spanning tree go haywire in the master communication network; both masters will appear up to the minions and storage still works.
Both masters would not appear up to a minion because minions connect to the load balancer. I have not been able to break anything by taking masters offline. If you'd like to take the work for a spin and try to cause breakage, please feel free.
Both masters would appear up to the load balancer too. The only connection that is broken in this scenario is master-master.
In the scenario described here, your salt CLI would fail to receive events because they are not being forwarded from the disconnected master. There will be errors in the logs on the disconnected master saying that it is not able to forward its events to the other master. The job would still finish correctly and the job cache would contain the correct results of the job.
General

Completely agree that the current options for HA have serious drawbacks, and this looks like a good start towards providing an alternative that allows scaling out of masters. Looking at the SEP and the current implementation, there are four main parts to the suggested design:

For part 1, relying on an external load balancer seems like a sensible approach and means that end users can use whatever solution for load balancing they choose (or even use some sort of shared IP/failover if that works for their requirements). That does rely on sharing some keys (cluster keypair and probably session key), although there may be some alternatives to that (see later).

For part 2, it would be great if the design included an explanation of key handling, the additional keypairs introduced, and how those are generated and shared/updated. I've read the code from PR 64936 and I think I have a reasonable understanding, but it'd be useful to be able to compare the implementation to the intent rather than trying to reverse engineer the intent from the implementation.

For part 3, I can't say I love the idea of having to use a shared filesystem for this. I think it's reasonable for a first pass, but I'd like to think about some alternative ways of sharing key data between cluster masters.

For part 4, again I have read the current implementation from PR 64936 and tried it out, but it would be good to have more detail about the intended design. It's changing some pretty core bits of how the master functions, so it'd be good to really understand the intent. Specifically, it would be good to see an explanation of the additional processes, MasterPubServerChannel and the additional events, their format, and how they are intended to be propagated/filtered.

It would also be good to understand how job returns are expected to be handled. Currently the design doesn't address that and the implementation has some issues. It might be valid to say that for this to work, there needs to be a shared master job cache (i.e. an external DB), but that should be stated (see the sketch after this comment). The current implementation doesn't work in a reasonable way with the default local master job cache - using the

I've also not tested what happens if a master that a minion is connected to fails between job publish and job return - this would seem like a reasonably likely scenario.

One more minor note from the testing I have done - cluster_peers - these need to match the master ID of the peer and be resolvable as a hostname (via DNS or equivalent). This isn't clear from the docs and will probably cause issues.

Alternatives

The alternatives aren't only the current Salt "HA" implementations - there are also alternative designs or parts of the design that could/should be considered. For example:

I guess just some more detail about what has been considered as alternatives for the implementation would be good.

SEP and PR handling

There are also some issues that should be addressed around the handling of this work, the motivation, the PRs, and the SEP process. This is a pretty significant change to how the Salt master functions. It's also going to add some important (and very much welcome) functionality, which is going to have to be supported by Salt for a long time.

Looking at the issues and PRs that have defined this feature, it feels like it has come from a requirement within VMware. To be clear, I don't have a problem with that at all - VMware funds the project and pays the salaries of the core team, so it's more than reasonable for them to have a significant impact on the direction of the project. That said, I think it would be good to understand if specific features/SEPs were being driven by an internal/customer VMware requirement. Without that knowledge it's difficult to understand if all the decisions on the design of a feature are purely being made on the basis of "this is the best technical solution" or "this is the best technical solution within the constraints of what we require for our commercial customers". It's fine if it's the latter, that's entirely reasonable, but as contributors to an open source project, I think it's fair for us to want to understand that there are additional factors affecting the design decisions.

A final point on the SEP process: PR 64936 has already been merged to master. I don't have a problem with that as such; I think the SEP process can block forward progress in its current form. However, there probably needs to be some agreement that if a PR is merged before the SEP is approved and the SEP isn't approved, it gets reverted or modified to meet the agreed design in the SEP. Otherwise we just have to admit that the SEP process isn't actually being adhered to and talk about an alternative.
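As a point of reference for the shared job cache suggestion above, this is a minimal sketch of pointing every cluster master at an external job cache using the MySQL returner options from the Salt docs. The hostname and credentials are placeholders, and whether this is the intended approach for the cluster design is exactly the open question raised here:

```yaml
# /etc/salt/master -- hypothetical example: use an external DB as the
# master job cache so every master in the cluster sees the same returns.
# Requires the MySQL returner's schema to already exist in the database.
master_job_cache: mysql
mysql.host: db.example.com   # placeholder hostname
mysql.user: salt
mysql.pass: salt             # placeholder credentials
mysql.db: salt
mysql.port: 3306
```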
Yes, we have a shared cluster key (masters retain their own keys). Gluster can leverage TLS. Outside of that, if someone chooses to use NFS, that's their choice and there's nothing wrong with it if they trust the network environment the master is in. We have some pretty serious messaging in the docs about not sharing keys between masters in an HA setup. This functionality will be a departure from that, but I think with proper documentation we'll be okay, i.e. document Gluster with TLS and using Linux ACLs to secure the important bits.
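To make the ACL part concrete, a hedged sketch of what such documentation might show as a Salt state. The /srv/cluster/pki path and the salt user are assumptions (adjust to the actual shared mount and the user the master runs as), not anything specified by the SEP:

```yaml
# Hypothetical state: lock down the shared cluster key directory
# (assumed to be a GlusterFS mount at /srv/cluster/pki) and grant the
# master's user access via a POSIX ACL. acl.present comes from the
# linux_acl state module and requires the acl package on the masters.
cluster_pki_dir_perms:
  file.directory:
    - name: /srv/cluster/pki
    - user: root
    - group: root
    - mode: '0700'

cluster_pki_dir_acl:
  acl.present:
    - name: /srv/cluster/pki
    - acl_type: user
    - acl_name: salt        # placeholder: the user the master runs as
    - perms: rwx
    - recurse: True
    - require:
      - file: cluster_pki_dir_perms
```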
I will put some effort into expanding on this.
I agree the shared filesystem is a bit of a limitation. The intent is to get a solid working implementation going and then we can expand on it in future iterations. Having said that, if a shared filesystem is required, GlusterFS is (imo) a good place to start.
Thank you for trying it out and for this feedback. Again, I'll put some work in over the next few days to expand on this SEP.
If HAProxy is used for the load balancer (this is what I've been using for my testing), the minion's connection will be routed to another master and the return will be handled correctly.
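For anyone wanting to reproduce that setup, the minion side is just the standard `master` option pointed at the load balancer's address; the hostname below is a placeholder, and the HAProxy frontend is assumed to forward TCP ports 4505/4506 to the cluster masters:

```yaml
# /etc/salt/minion -- minions connect to the load balancer, not to an
# individual master. The hostname is a placeholder for the HAProxy VIP.
master: salt-lb.example.com
```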
True, this needs a more thorough explanation in the docs.
I explored this option but it was not feasible.
Yes, in the future. For now that existing open source solution is GlusterFS. In the future we'll need to add an abstraction layer around how we get and store the keys.
The RabbitMQ work would not solve this issue; a single master would still be responsible for handling the load of a large job. Alternative transports like RabbitMQ or Kafka would be useful for sharing the master event bus, hence the work done to consolidate IPC communications into the other transports. This was done to enable alternative transports for the master event bus in the future.
This was not a mandate from VMware. This work started in an innovation sprint, a two-week sprint where the core team was allowed to work on what we thought could constitute innovation. I have been wanting to tackle this for years. I saw a window of opportunity and put together a POC which was demo'd for the first time in a Salt Open Hour. That demo gained enough traction that we prioritized the feature for 3007.
The SEP process needs an overhaul, no doubt.
Cool - I think initially it's reasonable to do key sharing via a shared dir, and as long as the docs are clear about the security implications and push people towards making sensible choices like using Gluster over TLS, then that's fine. I would love to see a solution at a later stage that doesn't require this, but as you say, getting a usable, solid working implementation out there would be great. Also agree that GlusterFS is probably the best choice for this.
Great - thank you.
👍
A note in the SEP that this would be a future aim would be great.
Ok - so it looks like you have explored and considered various alternatives and your explanations of why they aren't suitable make sense - it would be good to have those (and any other alternatives you explored) in the SEP. That's what that section is for and it means people can review it and understand what you've considered and why it wasn't suitable.
Ok, thanks for explaining where this came from. It's great that you had an innovation sprint and this is what came from it. Thanks for jumping on the opportunity to work on this. It wasn't mentioned in the SEP or the PR that this was where the feature had come from - the PR and related issues were just a set of minimally filled out templates with no context, and it looked very much like a set of requirements coming from somewhere else. In future it'd be great if the PR just said "this is a POC from an innovation sprint, something I've been wanting to work on for years" - that would give the community context to understand the motivation and the decisions on design etc.
👍