Improve Handling of Blacklisted Actors during Registry Unavailability #72

Open
aratz-lasa opened this issue May 17, 2023 · 1 comment

Comments

@aratz-lasa
Collaborator

Description

This issue covers the case where an actor is blacklisted while the registry is unavailable. Today there is no mechanism for handling a blacklisting event when the registry is down, so requests can continue to be routed to the blacklisted actor and fail.

Proposed Solution

  1. Evaluate the impact of blacklisting on actor availability and system stability during registry unavailability.
  2. Design a strategy to handle the blacklisting of actors even when the registry is unavailable.
  3. Implement a mechanism to mark blacklisted actors and prevent requests from being routed to them, regardless of the registry's availability.
  4. Explore options for locally storing blacklisted actors and their corresponding server IDs during the registry downtime.
  5. Enhance the routing logic to check the local cache of blacklisted actors and prevent requests from being forwarded to them.
  6. Implement a mechanism to periodically synchronize the local cache with the registry once it becomes available again.
  7. Consider the potential overhead and performance implications of maintaining a local cache for blacklisted actors.
  8. Write tests to validate the behavior of blacklisted actors during registry unavailability and ensure the correctness of the implemented solution.
  9. Evaluate the system's behavior and performance under various scenarios, including blacklisting during registry downtime and subsequent cache synchronization.
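
A rough sketch of steps 4 to 6, to make the discussion concrete. Everything here is an assumption for illustration and not existing project code: the names `LocalBlacklist`, `BlacklistKey`, `ErrRegistryUnavailable`, and the `report` callback are hypothetical. The idea is to keep blacklisted (actor, server) pairs in memory, filter routing candidates against them, and report unreported entries once the registry answers again.

```go
package main

import (
	"errors"
	"sync"
)

// ErrRegistryUnavailable is a hypothetical error a registry client might return.
var ErrRegistryUnavailable = errors.New("registry unavailable")

// BlacklistKey identifies one placement that must not receive traffic.
// Both field names are assumptions made for this sketch.
type BlacklistKey struct {
	ActorID  string
	ServerID string
}

// LocalBlacklist is an in-memory record of blacklisted placements.
// The map value is true once the entry has been reported to the registry.
type LocalBlacklist struct {
	mu      sync.Mutex
	entries map[BlacklistKey]bool
}

func NewLocalBlacklist() *LocalBlacklist {
	return &LocalBlacklist{entries: make(map[BlacklistKey]bool)}
}

// Mark records a blacklisted placement without needing the registry.
func (b *LocalBlacklist) Mark(key BlacklistKey) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if _, ok := b.entries[key]; !ok {
		b.entries[key] = false // not yet reported
	}
}

// FilterServers returns the candidate servers that are not blacklisted for actorID.
func (b *LocalBlacklist) FilterServers(actorID string, servers []string) []string {
	b.mu.Lock()
	defer b.mu.Unlock()
	allowed := make([]string, 0, len(servers))
	for _, s := range servers {
		if _, bad := b.entries[BlacklistKey{ActorID: actorID, ServerID: s}]; !bad {
			allowed = append(allowed, s)
		}
	}
	return allowed
}

// Sync reports every unreported entry through report and returns how many
// succeeded. Entries whose report fails stay pending for the next call.
func (b *LocalBlacklist) Sync(report func(BlacklistKey) error) int {
	b.mu.Lock()
	var pending []BlacklistKey
	for k, reported := range b.entries {
		if !reported {
			pending = append(pending, k)
		}
	}
	b.mu.Unlock()

	n := 0
	for _, k := range pending {
		if err := report(k); err != nil {
			continue // registry still unavailable; retry on the next Sync
		}
		b.mu.Lock()
		if _, ok := b.entries[k]; ok {
			b.entries[k] = true
		}
		b.mu.Unlock()
		n++
	}
	return n
}

// Forget drops an entry, e.g. after a registry refresh shows a new placement.
func (b *LocalBlacklist) Forget(key BlacklistKey) {
	b.mu.Lock()
	defer b.mu.Unlock()
	delete(b.entries, key)
}
```

In this sketch, `Sync` would be called from whatever periodic loop already refreshes the cache, and `Forget` once the registry has moved the actor. Whether entries should also expire on their own is left open; that is part of the overhead question in step 7.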

Additional information

The goal is to stop requests from being routed to blacklisted actors even when the registry is down, which should reduce failures and make the system more stable and reliable.

This issue is a reminder to investigate and implement those changes, and to evaluate how well the proposed solution mitigates failures caused by blacklisting during registry downtime.

@richardartoul
Owner

@aratz-lasa maybe I’m missing something, but I think it may be as simple as:

replicas == 1 and we get a blacklisted error: try to refresh the registry synchronously. If the registry is down, there is nothing we can do because the actor was blacklisted anyway. It’s “expected” that blacklisting will cause temporary unavailability with RF=1.

replicas > 1 and we get a blacklisted error: remove the blacklisted reference from the cache, asynchronously notify the registry that the actor is blacklisted (some new method we add to the registry), and then we’re pretty much done. Subsequent requests will use only the non-blacklisted replicas. The registry will place the actor on a new server when it’s notified of the blacklisting, and the servers will all eventually pick up the new placement as they asynchronously refresh their caches.

I think a lot of the complexity we’re struggling with in the existing implementation comes from the fact that I hijacked the ensureActivation method as a way to notify the registry of actors that have been blacklisted. If we just had a discrete pathway for that, it would be much simpler and cleaner, I think.
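
To make the two cases above concrete, here is a minimal sketch of what a discrete pathway could look like. All names are hypothetical and not taken from the codebase: the `Registry` interface, its `NotifyBlacklisted` method (standing in for the "new method"), `Router`, and `stubRegistry`, which exists only to exercise the sketch.

```go
package main

import (
	"errors"
	"sync"
)

// Registry is a hypothetical interface. NotifyBlacklisted stands in for the
// new, dedicated method proposed above.
type Registry interface {
	Placements(actorID string) ([]string, error)
	NotifyBlacklisted(actorID, serverID string) error
}

// stubRegistry is an in-memory stand-in used only to exercise this sketch.
type stubRegistry struct {
	mu         sync.Mutex
	down       bool
	placements map[string][]string
	notified   []string
}

func (s *stubRegistry) Placements(actorID string) ([]string, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.down {
		return nil, errors.New("registry unavailable")
	}
	return append([]string(nil), s.placements[actorID]...), nil
}

func (s *stubRegistry) NotifyBlacklisted(actorID, serverID string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.down {
		return errors.New("registry unavailable")
	}
	s.notified = append(s.notified, actorID+"@"+serverID)
	return nil
}

// Router holds the local placement cache (actor ID -> server IDs).
type Router struct {
	mu    sync.Mutex
	cache map[string][]string
	reg   Registry
	wg    sync.WaitGroup // tracks in-flight async notifications
}

// OnBlacklisted handles a blacklisted error for actorID on serverID and
// returns the servers that later requests should use.
func (r *Router) OnBlacklisted(actorID, serverID string, replicas int) ([]string, error) {
	if replicas <= 1 {
		// RF=1: refresh synchronously. If the registry is down the error is
		// returned and the actor is temporarily unavailable, as expected.
		fresh, err := r.reg.Placements(actorID)
		if err != nil {
			return nil, err
		}
		r.mu.Lock()
		r.cache[actorID] = fresh
		r.mu.Unlock()
		return append([]string(nil), fresh...), nil
	}

	// RF>1: drop the blacklisted reference from the local cache.
	r.mu.Lock()
	remaining := make([]string, 0, len(r.cache[actorID]))
	for _, s := range r.cache[actorID] {
		if s != serverID {
			remaining = append(remaining, s)
		}
	}
	r.cache[actorID] = remaining
	r.mu.Unlock()

	// Notify the registry in the background; the result is ignored here.
	r.wg.Add(1)
	go func() {
		defer r.wg.Done()
		_ = r.reg.NotifyBlacklisted(actorID, serverID)
	}()
	return append([]string(nil), remaining...), nil
}
```

Note that this sketch drops a failed notification on the floor. If the registry is down at that moment, something would still need to retry later, which is where a small local record of pending blacklist events could fit.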
