This repository has been archived by the owner on Jan 24, 2024. It is now read-only.

SEP: Master cluster #72

Open · wants to merge 4 commits into master
Conversation

@dwoz (Contributor) commented Aug 9, 2023

No description provided.

@dwoz dwoz requested a review from a team as a code owner August 9, 2023 21:54
@dwoz dwoz requested review from twangboy and removed request for a team August 9, 2023 21:54
@dwoz dwoz mentioned this pull request Aug 9, 2023
@OrangeDog

I can't find a preview with the images inline in the right order, but why are "Salt Master" and "Master" different things?

It also seems like another drawback will be increased master load, if events are being repeated to all masters.

@dwoz dwoz changed the title Master cluster SEP SEP: Master cluster Aug 11, 2023
@dwoz (Contributor, Author) commented Aug 28, 2023

I can't find a preview with the images inline in the right order, but why are "Salt Master" and "Master" different things?

No, Salt Master and Master are the same.

It also seems like another drawback will be increased master load, if events are being repeated to all masters.

The load increase on a single master will be minor because it is just forwarding event bus traffic to and from the other masters. That load is negligible compared to the increased capacity gained from having all masters share the load of the jobs being executed.

@OrangeDog

No, Salt Master and Master are the same.

So every Master in the diagram has the same internal detail as Salt Master? Can you do one showing them equally and the arrows in both directions?


Can you go into more detail on the future of the current HA methods with this in place, as well as the future of Syndic? Also, what potential pitfalls should be looked at with things such as network latency? What kind of throughput will this require? What about split-brain handling?

@dwoz (Contributor, Author)

This work is not deprecating any of the current HA functionality, nor is it deprecating Syndic.

The network between masters will need to be reliable, and this is called out in the docs. If there is a split-brain problem, the network is not reliable.

@OrangeDog commented Sep 11, 2023


By that definition, no network is reliable. That's why we need HA solutions in the first place.
We at least need to know which way it's going to fail during a network partition and not do something unsafe.

@dwoz (Contributor, Author)

As far as consistency and reliability go, there is a huge difference between local networks and WAN networks. With this design, if a master goes offline for some reason, there is no failure. Any minion connections will be routed to a different master by the load balancer. The other masters will still try to forward events, and you will see timeouts in the logs.


It isn't just about consistency and reliability. If the communication between masters CAN be broken while they do not show as offline, it will happen. At the very least, what it looks like when that happens needs to be documented. I honestly don't think it will break much, as we don't do total bidirectional control, but it needs to be documented.

I can see this happening with the kind of engineer that loves segregating network traffic onto separate LANs: one network for minion communication, one network for storage, one network for master communication. Then all of a sudden the network admin has a spanning tree go haywire in the master communication network. Both masters will appear up to the minion, and storage still works.

@dwoz (Contributor, Author) commented Sep 13, 2023

Both masters would not appear up to a minion because minions connect to the load balancer. I have not been able to break anything by taking masters offline. If you'd like to take the work for a spin and try to cause breakage, please feel free.


Both masters would appear up to the load balancer too. The only connection that is broken in this scenario is master-master.

@dwoz (Contributor, Author)

In the scenario described here, your salt CLI would fail to receive events because they are not being forwarded from the disconnected master. There will be errors in the logs on the disconnected master saying that it is not able to forward its events to the other master. The job would still finish correctly, and the job cache would contain the correct results of the job.

@barneysowood (Contributor) commented Sep 11, 2023

General

Completely agree that the current options for HA have serious drawbacks and this looks like a good start towards providing an alternative that allows scaling out of masters.

Looking at the SEP and the current implementation, there are four main parts to the suggested design:

  1. Minion to master connection - making it possible for a minion to connect to any master and using a load balancer to manage connections across available masters.

  2. Changes to key handling/addition of new key pairs (cluster pub/priv and sharing AES session key).

  3. Sharing of keys (minion pub keys, cluster pub/priv and AES session key) using a shared filesystem between cluster masters.

  4. Addition of a MasterPubServerChannel to forward events between cluster masters and changes to forward publish events to the Publisher.

For part 1, relying on an external loadbalancer seems like a sensible approach and means that end users can use whatever solution for load balancing they choose (or even use some sort of shared IP/failover if that works for their requirements). That does rely on sharing some keys (cluster keypair and probably session key), although there may be some alternatives to that (see later).
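For illustration, the minion side of part 1 could be as simple as pointing the minion at the load balancer rather than at any individual master - a minimal sketch, where the load-balancer hostname is hypothetical:

```yaml
# /etc/salt/minion - minimal sketch, assuming a TCP load balancer
# (HAProxy or similar) sits in front of the cluster masters.
# "salt-lb.example.com" is a hypothetical address: the minion only ever
# sees this one endpoint, so moving a connection between masters is
# entirely the load balancer's job.
master: salt-lb.example.com
publish_port: 4505   # default publish port, forwarded by the balancer
master_port: 4506    # default request/return port, also forwarded
```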

For part 2, it would be great if the design included an explanation of key handling, the additional keypairs introduced and how those are generated and shared/updated. I've read the code from PR 64936 and I think I have a reasonable understanding, but it'd be useful to be able to compare the implementation to the intent rather than trying to reverse engineer the intent from the implementation.

For part 3, I can't say I love the idea of having to use a shared filesystem for this. I think it's reasonable for a first pass, but I'd like to think about some alternative ways of sharing key data between cluster masters.
If we just point users at the GlusterFS quickstart (as PR 65138 does), we'll end up with people sharing PKI data in the clear. I also worry that people will instead try to use NFS or some other method as a shared filesystem that will compromise the security/availability of the solution.
Adding the ability to share key data between cluster masters in a secure and consistent manner may be tricky if implemented from scratch, but it may be worth looking at other solutions (etcd? would have said consul or vault until recently...)

For part 4, again I have read the current implementation from PR 64936 and tried it out, but it would be good to have more detail about the intended design. It's changing some pretty core bits of how the master functions, so it'd be good to really understand the intent.

Specifically, it would be good to see an explanation of the additional processes, MasterPubServerChannel and the additional events, their format and how they are intended to be propagated/filtered.

It would also be good to understand how job returns are expected to be handled. Currently the design doesn't address that and the implementation has some issues. It might be valid to say that for this to work, there needs to be a shared master job cache (i.e. an external DB), but that should be stated. The current implementation doesn't work in a reasonable way with the default local master job cache - using the jobs runner will only return data for minions directly connected to that master. It may be possible to add some additional events that query job/return data across cluster masters, but currently you're going to get a broken view of that depending on which master you query.
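As a sketch of what "a shared master job cache (i.e. an external DB)" could mean in config terms, every cluster master might point master_job_cache at the same external returner - the MySQL details below are placeholders, and whether this should be the documented approach is exactly the open question:

```yaml
# /etc/salt/master - sketch only: point every cluster master at one
# shared external job cache so the jobs runner sees returns no matter
# which master a minion was attached to. All connection values are
# placeholders.
master_job_cache: mysql
mysql.host: db.example.com
mysql.user: salt
mysql.pass: 'changeme'
mysql.db: salt
mysql.port: 3306
```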

I've also not tested what happens if a master that a minion is connected to fails between job publish and job return - this would seem like a reasonably likely scenario..

One more minor note from the testing I have done - cluster_peers - these need to match the master ID of the peer and be resolvable as a hostname (via DNS or equiv). This isn't clear from the docs and will probably cause issues.
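To make that constraint concrete, a minimal two-peer sketch based on my testing - the hostnames are hypothetical, and option names other than cluster_peers are my reading of the implementation PR:

```yaml
# /etc/salt/master on salt1.example.com - each cluster_peers entry must
# match the peer's master ID and resolve as a hostname (DNS or /etc/hosts).
# Hostnames are hypothetical.
id: salt1.example.com
cluster_id: example_cluster          # assumed option name from PR 64936
cluster_peers:
  - salt2.example.com
cluster_pki_dir: /srv/cluster/pki    # assumed option name; shared filesystem

# salt2.example.com mirrors this with id: salt2.example.com and
# cluster_peers listing salt1.example.com.
```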

Alternatives

The alternatives aren't only the current Salt "HA" implementations.. there are also alternative designs or parts of design that could/should be considered. For example:

  • For part 2, it could be possible to re-use the existing master PKI (master_sign_pubkey) functionality (see the sketch after this list)
  • For part 3, it could be possible to use an existing opensource solution or library to share key data
  • It might be worth considering alternative transports/message queuing to solve this issue (e.g. the PoC work on RabbitMQ)
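For reference, the existing master key signing mentioned in the first bullet looks roughly like this today - a sketch of current, documented Salt behaviour rather than anything this SEP proposes:

```yaml
# /etc/salt/master - existing master public key signing, for reference.
# The master signs its public key with a separate signing key pair.
master_sign_pubkey: True
master_sign_key_name: master_sign    # name of the signing key pair

# /etc/salt/minion - the minion verifies that signature instead of
# pinning a single master public key.
verify_master_pubkey_sign: True
```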

I guess just some more detail about what has been considered as alternatives for implementation would be good..

SEP and PR handling

There are also some issues that should be addressed around the handling of this work, the motivation, PRs and the SEP process.

This is a pretty significant change to how the salt master functions. It's also going to add some important (and very much welcome) functionality, which is going to have to be supported by Salt for a long time.

Looking at the issues and PRs that have defined this feature, it feels like it has come from a requirement within VMware. To be clear, I don't have a problem with that at all - VMware funds the project and pays the salaries of the core team, so it's more than reasonable for them to have a significant impact on the direction of the project.

That said, I think it would be good to understand if specific features/SEPs were being driven by an internal/customer VMware requirement. Without that knowledge it's difficult to understand if all the decisions on the design of a feature are purely being made on the basis of "this is the best technical solution" or "this is the best technical solution within the constraints of what we require for our commercial customers".

It's fine if it's the latter, that's entirely reasonable, but as contributors to an opensource project, I think it's fair for us to want to understand that there are additional factors affecting the design decisions.

A final point on the SEP process.. PR 64936 has already been merged to master. I don't have a problem with that as such, I think the SEP process can block forward progress in its current form. However there probably needs to be some agreement that if a PR is merged before the SEP is approved and the SEP then isn't approved, the PR gets reverted or modified to meet the agreed design in the SEP. Otherwise we just have to admit that the SEP process isn't actually being adhered to and talk about an alternative.

@dwoz dwoz requested review from twangboy and whytewolf September 11, 2023 20:51
@dwoz (Contributor, Author) commented Sep 11, 2023

General

Completely agree that the current options for HA have serious drawbacks and this looks like a good start towards providing an alternative that allows scaling out of masters.

Looking at the SEP and the current implementation, there are four main parts to suggested design:

  1. Minion to master connection - making it possible for a minion to connect to any master and using a load balancer to manage connections across available masters.
  2. Changes to key handling/addition of new key pairs (cluster pub/priv and sharing AES session key).
  3. Sharing of keys (minion pub keys, cluster pub/priv and AES session key) using a shared filesystem between cluster masters.
  4. Addition of a MasterPubServerChannel to forward events between cluster masters and changes to forward publish events to the Publisher.

For part 1, relying on an external loadbalancer seems like a sensible approach and means that end users can use whatever solution for load balancing they choose (or even use some sort of shared IP/failover if that works for their requirements). That does rely on sharing some keys (cluster keypair and probably session key), although there may be some alternatives to that (see later).

Yes, we have a shared cluster key (masters retain their own keys). Gluster can leverage TLS. Outside of that, if someone chooses to use NFS, that's their choice and there's nothing wrong with it if they trust the network environment the master is in. We have some pretty serious messaging in the docs about not sharing keys between masters in an HA setup. This functionality will be a departure from that, but I think with proper documentation we'll be okay, i.e. document Gluster with TLS and using Linux ACLs to secure the important bits.
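Roughly along these lines - a minimal sketch, where the paths are hypothetical and cluster_pki_dir reflects the implementation PR as I recall it:

```yaml
# /etc/salt/master - sketch only: keep the shared cluster PKI on a
# GlusterFS volume that is mounted with TLS enabled, then lock the
# directory down to the user the master runs as. The path is hypothetical.
cluster_pki_dir: /mnt/gluster/salt/pki

# After mounting, restrict access on each master, for example:
#   chown -R salt:salt /mnt/gluster/salt/pki
#   chmod -R 0700 /mnt/gluster/salt/pki
#   setfacl -R -m u:salt:rwX /mnt/gluster/salt/pki   # if ACLs are preferred
```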

For part 2, it would be great if the design included an explanation of key handling, the additional keypairs introduced and how those are generated and shared/updated. I've read the code from PR 64936 and I think I have a reasonable understanding, but it'd be useful to be able to compare the implementation to the intent rather than trying to reverse engineer the intent from the implementation.

I will put some effort into expanding on this.

For part 3, I can't say I love the idea of having to use a shared filesystem for this. I think it's reasonable for a first pass, but I'd like to think about some alternative ways of sharing key data between cluster masters. If we just point users at the GlusterFS quickstart (as PR 65138 does), we'll end up with people sharing PKI data in the clear. I also worry that people will instead try to use NFS or some other method as a shared filesystem that will compromise the security/availability of the solution. Adding the ability to share key data between cluster masters in a secure and consistent manner may be tricky if implemented from scratch, but it may be worth looking at other solutions (etcd? would have said consul or vault until recently...)

I agree the shared filesystem is a bit of a limitation. The intent is to get a solid working implementation going and then we can expand on it in future iterations. Having said that, if a shared file system is required, (imo) GlusterFS is a good place to start.

For part 4, again I have read the current implementation from PR 64936 and tried it out, but it would be good to have more detail about the intended design. It's changing some pretty core bits of how the master functions, so it'd be good to really understand the intent.

Thank you for trying it out and for this feedback. Again, I'll put some work in over the next few days to expand on this SEP.

Specifically, it would be good to see an explanation of the additional processes, MasterPubServerChannel and the additional events, their format and how they are intended to be propagated/filtered.

It would also be good to understand how job returns are expected to be handled. Currently the design doesn't address that and the implementation has some issues. It might be valid to say that for this to work, there needs to be a shared master job cache (i.e. an external DB), but that should be stated. The current implementation doesn't work in a reasonable way with the default local master job cache - using the jobs runner will only return data for minions directly connected to that master. It may be possible to add some additional events that query job/return data across cluster masters, but currently you're going to get a broken view of that depending on which master you query.

I've also not tested what happens if a master that a minion is connected to fails between job publish and job return - this would seem like a reasonably likely scenario..

If HAProxy is used for the load balancer (this is what I've been using for my testing), the minion's connection will be routed to another master and the return will be handled correctly.

One more minor note from the testing I have done - cluster_peers - these need to match the master ID of the peer and be resolvable as a hostname (via DNS or equiv). This isn't clear from docs and will probably cause issues

True, this needs a more thorough explanation in the docs.

Alternatives

The alternatives aren't only the current Salt "HA" implementations.. there are also alternative designs or parts of design that could/should be considered. For example:

  • For part 2, it could be possible to re-use the existing master PKI (master_sign_pubkey) functionality

I explored this option but it was not feasible.

  • For part 3, it could be possible to use an existing opensource solution or library to share key data

Yes, in the future. For now that existing open source solution is GlusterFS. In the future we'll need to add an abstraction layer around how we get and store the keys.

  • It might be worth considering alternative transports/message queuing to solve this issue (eg the PoC work on Rabbit MQ)

The RabbitMQ work would not solve this issue. A single master would still be responsible for handling the load of a large job. Alternative transports RabbitMQ or Kafka would be useful for the master event bus sharing. Hence the work done to consolidate IPC communications into the other transports. This was done to enable alternative transports for the master event bus in the future.

I guess just some more detail about what has been considered as alternatives for implementation would be good..

SEP and PR handling

There are also some issues that should be addressed around the handling of this work, the motivation, PRs and the SEP process.

This is a pretty significant change to how the salt master functions. It's also going to add some important (and very much welcome) functionality, which is going to have to be supported by Salt for a long time.

Looking at the issues and PRs that have defined this feature, it feels like it has come from a requirement within VMware. To be clear, I don't have a problem with that at all - VMware funds the project and pays the salaries of the core team, so it's more than reasonable for them to have a significant impact on the direction of the project.

That said, I think it would be good to understand if specific features/SEPs were being driven by an internal/customer VMware requirement. Without that knowledge it's difficult to understand if all the decisions on the design of a feature are purely being made on the basis of "this is the best technical solution" or "this is the best technical solution within the constraints of what we require for our commercial customers".

This was not a mandate from VMware. This work started in an innovation sprint, a two-week sprint where the core team was allowed to work on what we thought could constitute innovation. I have been wanting to tackle this for years. I saw a window of opportunity and put together a POC which was demoed for the first time in a Salt Open Hour. That demo gained enough traction that we prioritized the feature for 3007.

It's fine if it's the latter, that's entirely reasonable, but as contributors to an opensource project, I think it's fair for us to want to understand that there are additional factors affecting the design decisions.

A final point in the SEP process.. PR 64936 has already been merged to master. I don't have a problem with that as such, I think the SEP process can block forward progress in its current form. However there probably needs to be some agreement that if a PR is merged before the SEP is approved, if the SEP isn't approved, it gets reverted or modified to meet the agreed design in the SEP. Otherwise we just have to admit that the SEP process isn't actually being adhered to and talk about an alternative.

The SEP process needs an overhaul, no doubt.

@barneysowood (Contributor)

For part 1, relying on an external loadbalancer seems like a sensible approach and means that end users can use whatever solution for load balancing they choose (or even use some sort of shared IP/failover if that works for their requirements). That does rely on sharing some keys (cluster keypair and probably session key), although there may be some alternatives to that (see later).

Yes, we have a shared cluster key (masters retain their own keys). Gluster can leverage TLS. Outside of that, if someone chooses to use NFS, that's their choice and there's nothing wrong with it if they trust the network environment the master is in. We have some pretty serious messaging in the docs about not sharing keys between masters in an HA setup. This functionality will be a departure from that, but I think with proper documentation we'll be okay, i.e. document Gluster with TLS and using Linux ACLs to secure the important bits.

Cool - I think initially it's reasonable to do key sharing via a shared dir and as long as the docs are clear about security implications and push people towards making sensible choices like using Gluster over TLS, then that's fine.

I would love to see a solution at a later stage that doesn't require this, but as you say, getting a usable solid working implementation out there would be great. Also agree that GlusterFS is probably the best choice for this.

For part 2, it would be great if the design included an explanation of key handling, the additional keypairs introduced and how those are generated and shared/updated. I've read the code from PR 64936 and I think I have a reasonable understanding, but it'd be useful to be able to compare the implementation to the intent rather than trying to reverse engineer the intent from the implementation.

I will put some effort into expanding on this.

For part 4, again I have read the current implementation from PR 64936 and tried it out, but it would be good to have more detail about the intended design. It's changing some pretty core bits of how the master functions, so it'd be good to really understand the intent.

Thank you for trying it out and for this feedback. Again, I'll put some work in over the next few days to expand on this SEP.

Great - thank you.

One more minor note from the testing I have done - cluster_peers - these need to match the master ID of the peer and be resolvable as a hostname (via DNS or equiv). This isn't clear from docs and will probably cause issues

True, this needs a more thorough explanation in the docs.

👍

Alternatives

The alternatives aren't only the current Salt "HA" implementations.. there are also alternative designs or parts of design that could/should be considered. For example:

  • For part 2, it could be possible to re-use the existing master PKI (master_sign_pubkey) functionality

I explored this option but it was not feasible.

  • For part 3, it could be possible to use an existing opensource solution or library to share key data

Yes, in the future. For now that existing open source solution is GlusterFS. In the future we'll need to add an abstraction layer around how we get and store the keys.

A note in the SEP that this would be a future aim would be great.

  • It might be worth considering alternative transports/message queuing to solve this issue (eg the PoC work on Rabbit MQ)

The RabbitMQ work would not solve this issue. A single master would still be responsible for handling the load of a large job. Alternative transports RabbitMQ or Kafka would be useful for the master event bus sharing. Hence the work done to consolidate IPC communications into the other transports. This was done to enable alternative transports for the master event bus in the future.

Ok - so it looks like you have explored and considered various alternatives and your explanations of why they aren't suitable make sense - it would be good to have those (and any other alternatives you explored) in the SEP. That's what that section is for and it means people can review it and understand what you've considered and why it wasn't suitable.

SEP and PR handling

That said, I think it would be good to understand if specific features/SEPs were being driven by an internal/customer VMware requirement. Without that knowledge it's difficult to understand if all the decisions on the design of a feature are purely being made on the basis of "this is the best technical solution" or "this is the best technical solution within the constraints of what we require for our commercial customers".

This was not a mandate from VMware. This work started in an innovation sprint, a two-week sprint where the core team was allowed to work on what we thought could constitute innovation. I have been wanting to tackle this for years. I saw a window of opportunity and put together a POC which was demoed for the first time in a Salt Open Hour. That demo gained enough traction that we prioritized the feature for 3007.

Ok, thanks for explaining where this came from. It's great that you had an innovation sprint and that this is what came from it. Thanks for jumping on the opportunity to work on this.

It wasn't mentioned in the SEP or the PR that this was where the feature had come from.. the PR and related issues were just a set of minimally filled out templates with no context - it looked very much like a set of requirements coming from somewhere else.. in future it'd be great if the PR just said "this is a POC from an innovation sprint, something I've been wanting to work on for years".. that would give the community context to understand the motivation and decisions on design etc.

The SEP process needs an overhaul, no doubt.

👍

@dwoz dwoz requested a review from OrangeDog September 12, 2023 21:40
@s0undt3ch s0undt3ch removed the request for review from OrangeDog December 2, 2023 17:34