Connect to multiple instances (clusters) at the same time #314

Open
sagikazarmark opened this issue Mar 25, 2021 · 11 comments

Comments

@sagikazarmark

From time to time I see feature requests in software consuming LDAP to add cluster support. I assume it's useful in environments when there is no load balancer in front of the different LDAP servers?

Anyway, I was wondering if it would be doable at the LDAP client level, or if there is an existing implementation somewhere. A couple of searches in the repo and issues yielded no results.

Here is an example of those requests: dexidp/dex#1904

I'm not particularly a fan of the implementation in this PR and I'd like to find a better/lower-level solution if possible.

Any ideas? Thanks in advance!

@stefanmcshane
Contributor

Hi @sagikazarmark,
To my knowledge, we don't currently support anything like this, as it would effectively mean connecting to multiple LDAP servers.
Whilst it doesn't follow the LDAP spec, for ease of use of the library going forward I would be open to an extra function for connecting to multiple servers; however, this would also need to include a plan for failover for each connection, including retries and timeouts.

Have you any ideas on what you would like to see in this @sagikazarmark @johnweldon?

On an initial pass, I would think that we have a Cluster struct, which could contain either a slice of *Conn structs or a map that allows us to refer to a specific server in the cluster.
We would then need an internal retryOnNextConn method which could pass the request on to the next connection.
I'm open to suggestions, and working with you to get this implemented.
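A minimal sketch of that shape, assuming the hypothetical names from this comment (Cluster, retryOnNextConn) and a toy stand-in for the library's *ldap.Conn, since none of this exists in go-ldap today:

```go
package main

import (
	"errors"
	"fmt"
)

// Conn is a toy stand-in for *ldap.Conn; the alive flag simulates
// whether the underlying server is reachable.
type Conn struct {
	addr  string
	alive bool
}

func (c *Conn) search(query string) (string, error) {
	if !c.alive {
		return "", errors.New("connection down: " + c.addr)
	}
	return "result from " + c.addr + " for " + query, nil
}

// Cluster keeps an ordered slice of connections, as suggested above;
// a slice (rather than a map) preserves the failover order.
type Cluster struct {
	conns []*Conn
}

// retryOnNextConn tries each connection in order until one succeeds,
// returning the last error if every connection fails.
func (cl *Cluster) retryOnNextConn(query string) (string, error) {
	var lastErr error
	for _, c := range cl.conns {
		res, err := c.search(query)
		if err == nil {
			return res, nil
		}
		lastErr = err
	}
	return "", fmt.Errorf("all connections failed: %w", lastErr)
}

func main() {
	cl := &Cluster{conns: []*Conn{
		{addr: "ldap1.example.com:389", alive: false},
		{addr: "ldap2.example.com:389", alive: true},
	}}
	res, err := cl.retryOnNextConn("(uid=alice)")
	fmt.Println(res, err)
}
```

A real version would also need per-attempt timeouts and a decision about which errors are retryable, as noted above.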

@sagikazarmark
Author

@stefanmcshane glad to hear it. I think it would help consumers of this library facing similar use cases.

I'm not intimately familiar with the library, but here are a couple thoughts:

  • This is essentially client side load balancing, so I would treat it as such (load balancing algorithm, etc)
  • I probably wouldn't use a map, because it's unordered
  • when picking a connection (Conn, which is a slightly confusing name in my opinion, because it's more like a client than a connection), the algorithm should check whether the connection is actually alive (if that's even possible in this case) so that it always picks an available server; this is to make sure that a failover event doesn't cause cascading failures in applications

Those are my initial thoughts about the problem. We can probably discuss retries and timeouts in more detail (for example, when and how should requests be retried? Should the implementation move on to the next connection until it runs out of connections?)

@stefanmcshane
Contributor

Thanks @sagikazarmark for the suggestions.

The way I like to think of Conn vs Client is akin to how Postgres does it, in that there can be many connections for a given client. When implementing against this library, I tend to call the implementing package the client, as that's usually where I set up retries or the various connections. Granted, this is a personal opinion and wasn't necessarily the consideration when @johnweldon started the package.

  • This is essentially client side load balancing, so I would treat it as such (load balancing algorithm, etc)

I agree with this. Do you know of any libraries that implement something similar/implement this in a way that you think is seamless? If you want to suggest a draft PR on what you believe would be a desirable user-facing experience, that would also be helpful in the design decisions.

  • I probably wouldn't use a map, because it's unordered

Whilst this is true and would hurt retries (try-next, for example), it could be useful on the assumption that the user will want to change the primary connection for a given request type. An example here could be a globally deployed platform: they might want to add a new user, which is seen in the US first, without waiting on their replication strategy kicking in. I suspect that in a naive implementation we would have to set up a map of the given hosts to connect to, as well as a slice over the map keys to order them.

  • the algorithm should check if the connection is actually alive (if that's even possible in this case) to always pick an available server (that is to make sure that a failover event doesn't cause cascading failures in applications)

My initial thought would be to implement a basic round robin, but expose a way for the user to implement their own retry/timeout behaviour.

Let me know what you think. If you're up for collaborating on this one, I'd appreciate that also.

@sagikazarmark
Author

@stefanmcshane

I agree with this. Do you know of any libraries that implement something similar/implement this in a way that you think is
seamless?

I think a naive round robin can be implemented as just a counter (i.e. a uint32) that you keep increasing atomically; the counter modulo the number of connections chooses the next connection.
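The counter-modulo idea is a few lines of Go with sync/atomic (the function name nextIndex is just for illustration):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// nextIndex implements the atomic counter round robin described above:
// each call advances a shared uint32 counter and maps it onto one of
// n connections. Safe to call from multiple goroutines.
func nextIndex(ctr *uint32, n int) int {
	// AddUint32 returns the new value, so subtract 1 to start at index 0.
	return int((atomic.AddUint32(ctr, 1) - 1) % uint32(n))
}

func main() {
	var ctr uint32
	for i := 0; i < 5; i++ {
		fmt.Println(nextIndex(&ctr, 3)) // cycles 0, 1, 2, 0, 1
	}
}
```

Note the counter eventually wraps around at the uint32 boundary, which briefly skews the distribution unless n is a power of two; for a handful of LDAP servers that is harmless.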

Whilst this is true and would hurt retries as an example (try next), it could be useful on the assumption that the user will want to change the primary connection on a given request type. An example here could be on a globally deployed platform, they might want to add a new user, which is seen in US first, without waiting on their replication strategy kicking in.

As long as the load balancing algorithm is explicit and doesn't depend on the randomness of map iteration order, I think the internal data structure is less important.

BTW, that scenario sounds like a different type of retry. For example, instead of choosing a specific connection, I'd implement a weighted list, prioritizing the closest server first and the primary later; in that case, retrying the query happens when the closest server returns an empty result, not when an error is returned. I'm not sure that makes sense for LDAP, but it does make sense for tackling replication issues. But again, different story.
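The fall-through-on-empty-result idea could be sketched like this (server and searchPreferClosest are hypothetical names; a real version would issue LDAP searches instead of reading a field):

```go
package main

import "fmt"

// server is a toy stand-in for an LDAP replica; results simulates what
// a search against that replica would return.
type server struct {
	name    string
	results []string
}

// searchPreferClosest walks a closest-first ordered list and falls
// through to the next replica when the result set is empty (not only on
// error), covering the lagging-replication case described above.
func searchPreferClosest(ordered []*server) []string {
	for _, s := range ordered {
		if len(s.results) > 0 {
			return s.results
		}
	}
	return nil
}

func main() {
	ordered := []*server{
		{name: "eu-replica"}, // entry not replicated here yet
		{name: "us-primary", results: []string{"cn=alice"}},
	}
	fmt.Println(searchPreferClosest(ordered))
}
```

As the comment says, whether an empty LDAP result should trigger a fall-through is debatable; an empty result is often the correct answer.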

My initial thought at this would be to implement a basic round-robin, but expose a way that the user can implement their own retry/timeout.

I'd start with something stupid simple and add configuration later when some feedback arrives.

If you're up for collaborating on this one, I'd appreciate that also.

TBH I'm not very familiar with LDAP or this library, so I'm happy to discuss design, test the implementation in Dex or review code even, but I'd leave the implementation to someone more familiar with the library and LDAP.

@johnweldon
Member

I like the collaboration and discussion here; it sounds like it's heading in the right direction.

I'd like to clarify that I didn't actually start this project; I just moved it to a more canonical name and have tried to support it a little over the years. I believe the original author is @mmitton, and https://github.com/mmitton/ldap was the original repo.

@scaranoj

Hi All! 👋 Here's a use case: a large US furniture manufacturer is using Dex for Kubernetes authentication, but they're only able to connect to a single LDAP/Active Directory backend and would like to avoid a single fault domain by allowing two or more LDAP backends (adding this on their behalf).

@jarrettprosser

I can speak to our use case a bit - we are also using Dex for authentication in corporate environments. It's common for IT to provide us with several URLs for LDAP servers in different on-premise data centres. As @scaranoj mentioned it's a way of preventing a single fault domain for the directory, usually not using an (on-premise) load balancer because that would have to be hosted in one of the data centres and would reintroduce a single point of failure.

In some cases we can use Keycloak, which does support LDAP failover through the Java JNDI LDAP provider. I think the implementation there is effectively a round robin that attempts to connect to each server in turn until a connection is successful, then continues to use that server. From the docs:

If the list contains more than one URL, the provider should attempt to use each URL in turn until it is able to create a successful connection, and after creation, set the property to the successful URL.
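That quoted JNDI behaviour (try each URL in turn, then stick with the first one that connects) is straightforward to mimic. A sketch, with dial standing in for go-ldap's ldap.DialURL and a simulated outage so the example runs without a network:

```go
package main

import (
	"errors"
	"fmt"
)

// down simulates unreachable servers; a real client would just let the
// dial attempt fail.
var down = map[string]bool{"ldap://dc1.example.com": true}

// dial is a stand-in for ldap.DialURL; it returns the URL it
// "connected" to so the choice is visible.
func dial(url string) (string, error) {
	if down[url] {
		return "", errors.New("cannot reach " + url)
	}
	return url, nil
}

// firstReachable tries each URL in turn and returns the first
// successful connection; callers would then keep using that server,
// as the JNDI documentation describes.
func firstReachable(urls []string) (string, error) {
	for _, u := range urls {
		if conn, err := dial(u); err == nil {
			return conn, nil
		}
	}
	return "", errors.New("no server reachable")
}

func main() {
	u, err := firstReachable([]string{
		"ldap://dc1.example.com",
		"ldap://dc2.example.com",
	})
	fmt.Println(u, err)
}
```

The sticky part (caching the winning URL for later requests) is then just storing the returned value, with re-running the loop as the failover path when that connection later breaks.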

@mayrstefan

Just iterating through a list seems to be a very common pattern and could be the starting point or the default algorithm. It could be implemented first, and more sophisticated algorithms could follow (round-robin, weighted-connection, least-response-time). Those algorithms could then simply reorder the list before iterating through it.

The simple list approach can be found in

  • Java JNDI LDAP URLs (as already mentioned): space-separated list of hostnames
  • Apache httpd mod_authnz_ldap: space-separated list of hostnames
  • PostgreSQL JDBC Driver Connection Fail-Over: comma-separated list of hostnames
  • Almost all DNS based solutions:
    • CDNs return multiple records for the queried domain. E.g. cloudflare.com
    • DELL EMC Isilon (now PowerScale), a NAS/file server, is also a DNS server. It returns multiple entries and rotates the list on each query for load balancing, so there is no need to mess with the order on the client side
  • and many other products

Requests for service discovery like #329 need the same groundwork: you get a list of servers that has to be sorted by priority and weight, then you iterate through that list until you find a server you can connect to.

@mayrstefan

Another advantage of a simple list: it also works with only one element, which is what we use today without load balancing or HA.

@alexei-matveev

alexei-matveev commented Mar 14, 2023

At least in the AD context, it seems to be the responsibility of the client to choose the correct DC(s) and fail over as necessary:

https://serverfault.com/questions/734101/active-directory-multi-site-choose-nearest-dcs-into-linux-not-microsoft-applica

@mayrstefan
Copy link

@alexei-matveev for AD this boils down to the following:

  1. query DNS for SRV records, which will give you a weighted list of LDAP servers
  2. iterate through that list

This means the ability to go through a list of servers until you find a working one is essential core functionality.
