Replies: 16 comments
-
Yes, that is true. I see it as a tradeoff. Indeed, it is the difference between a "Pub/Sub" system vs a "Message Queue" system. The benefit of pubsub is that it's fast and light-weight (i.e. it can scale to HUGE groups and MASSIVE throughput; that's the appeal of Redis Pub/Sub generally). The overhead of pubsub is not much more than the networking stack itself! The drawback of pubsub is that it does not queue messages in-memory (well, not beyond the OS-level network buffers), so if the network connection is lost then you lose messages (as @qeternity points out above). A "Message Queue" solves this. Maybe we should have another ChannelLayer based on RabbitMQ? Then folks can choose the one that makes the most sense for their application.
My opinion here is that HAVING reconnect logic is better than NOT having it. It at least restores normal operation of your app server within 1 second of restarting your Redis server [1]. It cannot guarantee message delivery, but nothing can guarantee that when using Redis Pub/Sub. For those who want such a guarantee, they should instead use a RabbitMQ channel layer (or the original impl, if #1683 is fixed).

[1] You previously asked about my production setup. I'm using AWS ElastiCache for Redis for my Redis server. I have my app servers on the same subnet and have never seen them lose the network connection to Redis "randomly". The only times I've seen the connection lost is when doing a planned Redis upgrade. Not saying it can't happen... but I've never noticed it happening (based on the logs I keep). So it's not a big concern for me. Dropped messages do screw up my front-end code (which is why I spent so much time looking at #1683), but so far the PubSub impl has been excellent for us in production. I would actually consider a RabbitMQ alternative for my app (we don't actually need the crazy throughput PubSub offers) but I'm more familiar with Redis, so that's where I focused my solution, TBH.
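For anyone weighing that choice, switching between the two layers that ship with channels_redis is just a settings change. A minimal sketch — the host and port are placeholders for your own Redis endpoint:

```python
# settings.py -- pick ONE backend; both ship with channels_redis.
CHANNEL_LAYERS = {
    "default": {
        # Fast, fire-and-forget Pub/Sub: at-most-once delivery.
        "BACKEND": "channels_redis.pubsub.RedisPubSubChannelLayer",
        # Or the original queue-backed layer, which buffers messages in Redis lists:
        # "BACKEND": "channels_redis.core.RedisChannelLayer",
        "CONFIG": {
            "hosts": [("redis-host", 6379)],  # placeholder endpoint
        },
    },
}
```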
-
Redis Streams seem to handle the _reconnect to the same key_ desire, but that would be another implementation. Thinking we just need to document the trade-offs maybe… 🤔
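To make the _reconnect to the same key_ idea concrete, a rough redis-py sketch (the stream name and fields are invented for illustration; this is not a channel layer implementation):

```python
import redis

r = redis.Redis()

# Producer side: XADD appends to a log that survives consumer disconnects.
r.xadd("group:chat", {"body": "hello"})

# Consumer side: remember the last-seen ID; after a reconnect, resume from it
# and Redis replays everything published in the meantime.
last_id = "0"  # "0" replays the whole stream; in practice you'd persist the last ID processed
while True:
    for _stream, messages in r.xread({"group:chat": last_id}, block=1000):
        for msg_id, fields in messages:
            last_id = msg_id
            print(fields)
```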
-
@acu192 Many thanks for your thoughts, and even more so for the pubsub implementation. I agree with everything that you say, but the issue is not so much queueing messages but rather the silent nature of the reconnect. In fact, I would not want to queue messages because I also don't want to serve stale data. In our world, if there were to be a hiccup, then we want to know about it so that we can act appropriately. We have actually experimented with the rabbitmq based layer written by the Workbench team, but we are heavy celery users and lightweight redis connections actually work better for us (hence django/channels_redis#258). I think we're in agreement, I would just like to see a mechanism by which the channel layer bubbles up reconnection events so that consumers can act accordingly if needed.

@carltongibson Not suggesting another implementation using redis streams, just thinking out loud. I also think that would be terrible and, as I said above, I actually would not want to serve stale data in the event of a network partition. We should definitely document tradeoffs but I'm not sure there needs to be much of a tradeoff. We either make the reconnect logic optional, or better, we notify consumers of disconnect/reconnect events.
-
@acu192 Btw - you may have something similar in-house, but to stress our infra for consistency at scale we built this little tool: https://github.com/zumalabs/sockbasher. It's in dire need of some external docs but might be useful in its current form for you.
-
@carltongibson @acu192 so after reviewing this morning for a bit, the shards should emit disconnect/reconnect events to the channel layer. Then, I think there are two approaches that we can take:

1. Make the reconnect logic optional.
2. Notify consumers of the disconnect/reconnect events so they can act accordingly.

I am in favor of 2 as I think it's the better approach, with the caveat that it won't be fully backwards compatible.
-
Ah, I see now. Yes that makes sense. So, now I assume your idea is: If a Channel Consumer (server-side) knew that a message was dropped, it could close its corresponding websocket. The front-end would notice the closed websocket and reconnect itself. I could go for that on my site, I think it would work well. Let's call this "Idea A".
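A sketch of what Idea A could look like from the consumer side — note that the `layer_connection_lost` hook and the 4000 close code are invented here for illustration; no such hook exists in Channels today:

```python
from channels.generic.websocket import AsyncJsonWebsocketConsumer

class IdeaAConsumer(AsyncJsonWebsocketConsumer):
    async def layer_connection_lost(self, event):
        # Hypothetical hook: if the channel layer could tell us it may have
        # dropped messages, we would close the socket with an app-specific
        # code; the frontend sees the close, reconnects, and re-syncs state.
        await self.close(code=4000)
```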
To achieve Idea A, I don't think your idea of bubbling-up the Redis-reconnection event is enough. It would help (and for that reason maybe we do it), but we will still need more. Two reasons I can see (bear with me... I'm kinda thinking out loud here):
With all this in mind... let's say Redis dies, a consumer realizes it, tries to notify every consumer of this event... but Redis is still down, so it can't notify everyone... or worse, Redis is down but the "send" function seems to work for a few milliseconds (so the consumer doing the notifying is tricked)... also, there might be a lot of consumers all noticing this at the same time and all suddenly deciding to notify everyone else! My gut says this will be a mess.
I like the name... "sockbasher" 🤓
-
A simpler way to achieve Idea A is for each producer to include sequence numbers in its messages. Then the consumers check for gaps in the sequence numbers. But that has to be implemented at the producer layer and consumer layer. I don't think the logic works at the channel layer... but need to think more.
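A rough sketch of that producer/consumer sequence-number idea — the group name, message type, and helper names are invented for illustration:

```python
import itertools

# Producer side: stamp each outgoing group message with a monotonically
# increasing counter (one counter per producer process).
_seq = itertools.count()

async def group_send_with_seq(channel_layer, group, payload):
    await channel_layer.group_send(group, {
        "type": "chat.message",
        "seq": next(_seq),
        "payload": payload,
    })

# Consumer side: any gap in the sequence means a message was dropped
# somewhere, so the consumer can close its websocket and force a re-sync.
class GapDetector:
    def __init__(self):
        self.expected = None

    def saw_gap(self, seq):
        gap = self.expected is not None and seq != self.expected
        self.expected = seq + 1
        return gap
```

(With multiple producers you'd also need a producer ID in each message and one `GapDetector` per producer, which hints at why this fits the producer/consumer layers rather than the channel layer.)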
-
@acu192 have a look at the last few PRs I've opened. One of them implements the disconnect events. All of this is handled at the layer level and delivered directly to the consumers, bypassing any message broker.
-
I've only been able to follow this discussion from a high level, but given our implementation drives 1000's of point-of-sale devices, a missed message is a nightmare to code for. I think my current set of options is:
Just confirming I am correct on these counts. Thanks.
-
"All Redis data resides in-memory, in contrast to databases that store data on disk or SSDs." ref Therefore you are playing with fire if you need guaranteed message delivery. If your Redis server dies/reboots/whatever (or the network connection to Redis has a hiccup) it is very likely messages will be lost, no matter which channel layer you use ( |
-
The issue is slightly more complicated than that. Yes, absolutely, Redis does not provide strong durability guarantees. But it is commonplace to run with at least RDB, or better AOF, which provides some persistence. Additionally, running replicas as we do (hence my focus on Sentinel) lowers the window for data loss tremendously. And ultimately, as long as we can engineer around lost messages (i.e. the disconnect logic and forcing a client re-sync), these issues can be hugely minimized.

@LiteWait RabbitMQ and the Workbench layer are a much better approach if you can use them. We make extensive use of Celery, and they have taken an approach that is relatively incompatible with the Celery pre-fork model. This is why our focus is on highly available Redis AND detection of any network partitions.
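The "forcing a client re-sync" piece can be as simple as pushing a full snapshot on every (re)connect. A minimal sketch, where `get_current_state()` is a placeholder for querying your real source of truth:

```python
from channels.generic.websocket import AsyncJsonWebsocketConsumer

class ResyncingConsumer(AsyncJsonWebsocketConsumer):
    async def connect(self):
        await self.accept()
        # Every connect -- including reconnects after a network partition --
        # starts with a full snapshot, so anything dropped while the socket
        # (or Redis) was down is covered.
        await self.send_json({
            "type": "snapshot",
            "state": await self.get_current_state(),
        })

    async def get_current_state(self):
        raise NotImplementedError  # app-specific: query Postgres, etc.
```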
-
@acu192 @qeternity thanks. What worries me isn't Elasticache/Redis dropping out, it's disconnects between Daphne and the JS client, and lost messages. Granted you can't do anything about the Internet, but as long as I don't have to worry about messages that haven't actually hit the wire yet going missing, that would be fault tolerant enough for our application.
-
Same with my app actually. I'm not worried about this because it's very rare and it will not kill our business in the event it does happen. We store all the things we actually care to keep in a real database. Redis is just a nice way to sync state, but it is not the authority of the state. With that in mind you should try the new pubsub impl. It should not [1] drop messages unless the Redis server (or the connection to it) dies. I've been happy with it for this reason.

[1] "should not" meaning I don't know of any reason it would... and I've been testing it and running it in production for only a few months, so take that for what it's worth. We're considering it beta for now.
I've been using Uvicorn (instead of Daphne) and found it's generally better (faster, more reliable, whatever). Maybe give that a try if you think Daphne is causing issues for you.
-
@LiteWait @acu192 Absolutely you should not be using Daphne in prod (we also use Uvicorn). In terms of dropping messages between webserver and client, websockets run over TCP, so you should have the same guarantees there as you do with TCP. That said, you should expect network issues everywhere. If you are using websockets as a source of truth, I think that's a mistake, as you'd need to implement some sort of 2PC on top. Distributed systems are difficult, which is why we treat redis/channels as a nice-to-have real-time sync, which we expect to break, falling back to sync'ing via API, which is backed by our Postgres cluster and Postgres' decades of battle testing to overcome these exact issues.
-
This seems the key point. If the layer drops the connection to Redis, you're going to lose messages (independently of which ASGI server you happen to be using). Short of a much more robust system on top of ASGI, you either need to accept that the occasional message will go missing, or else periodically fall back to a more reliable method (HTTP polling the source of truth, as the first approach). I'm thinking this is a documentation issue? There's a few hits for "at most once" in the Channels docs already. Perhaps pulling those together into a single discussion would be worthwhile. 🤔 (See discussion on django/channels_redis#259)
-
I'm going to move this over to discussions on the Channels repo. If we pin down something addressable, happy to move it back.
-
Continuing from django/channels_redis#251 - I have a few concerns around the architecture for pubsub. While the reconnect logic is admirable, I'm not sure that it's appropriate in pubsub because it allows for silent missed messages.
Let's say between our `_do_keepalive()` loops the connection is lost, and a publisher elsewhere sends a message. `RedisSingleShardConnection` will silently reconnect having missed the message. In the current blocking-list architecture, this would be fine because we could reconnect to the same key and continue popping items... they would simply queue up like a log (Redis Streams would work well). But in pubsub, those messages will be lost.

This is very relevant to our usage. In the event of a network blip or Sentinel failover, the websocket consumers disconnect and the frontend attempts to gracefully recover by reconnecting websockets and performing a full state refresh to ensure no data has been missed.
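For readers who haven't looked at the code, a distilled sketch of the shape of the problem — this is not the actual channels_redis implementation; `is_healthy()` and `reconnect()` are stand-ins:

```python
import asyncio

async def _do_keepalive(conn):
    # Runs forever, checking the subscriber connection once per second.
    while True:
        await asyncio.sleep(1)
        if not await conn.is_healthy():   # stand-in health check
            await conn.reconnect()        # silently re-SUBSCRIBEs...
            # ...but anything PUBLISHed between the drop and this point went
            # to no one: Redis Pub/Sub keeps no backlog to replay on resubscribe.
```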