Replies: 16 comments
-
Yes, that is true. I see it as a tradeoff. Indeed, it is the difference between a "Pub/Sub" system vs a "Message Queue" system. The benefit of pubsub is that it's fast and light-weight (i.e. it can scale to HUGE groups and MASSIVE throughput; that's the appeal of Redis Pub/Sub generally). The overhead of pubsub is not much more than the networking stack itself! The drawback of pubsub is that it does not queue messages in-memory (well, not beyond the OS-level network buffers), so if the network connection is lost then you lose messages (as @qeternity points out above). A "Message Queue" solves this. Maybe we should have another ChannelLayer based on RabbitMQ? Then folks can choose the one that makes the most sense for their application.
My opinion here is that HAVING reconnect logic is better than NOT having it. It at least restores normal operation of your app server within 1 second of restarting your Redis server [1]. It cannot guarantee message delivery, but nothing can guarantee that when using Redis Pub/Sub. For those who want such a guarantee, they should instead use a RabbitMQ channel layer (or the original impl, if #1683 is fixed).

[1] You previously asked about my production setup. I'm using AWS ElastiCache for Redis for my Redis server. I have my app servers on the same subnet and have never seen them lose the network connection to Redis "randomly". The only times I've seen the connection lost is when doing a planned Redis upgrade. Not saying it can't happen... but I've never noticed it happening (based on the logs I keep). So it's not a big concern for me. Dropped messages do screw up my front-end code (which is why I spent so much time looking at #1683), but so far the PubSub impl has been excellent for us in production. I would actually consider a RabbitMQ alternative for my app (we don't actually need the crazy throughput PubSub offers) but I'm more familiar with Redis, so that's where I focused my solution, TBH.
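For anyone weighing that choice, switching between the two layers that ship with channels_redis is just a settings change. A minimal sketch — the host and port are placeholders for your own Redis endpoint:

```python
# settings.py -- pick ONE backend; both ship with channels_redis.
CHANNEL_LAYERS = {
    "default": {
        # Fast, fire-and-forget Pub/Sub: at-most-once delivery.
        "BACKEND": "channels_redis.pubsub.RedisPubSubChannelLayer",
        # Or the original queue-backed layer, which buffers messages in Redis lists:
        # "BACKEND": "channels_redis.core.RedisChannelLayer",
        "CONFIG": {
            "hosts": [("redis-host", 6379)],  # placeholder endpoint
        },
    },
}
```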
-
Redis Streams seem to handle the _reconnect to the same key_ desire, but that would be another implementation. Thinking we just need to document the trade-offs maybe… 🤔
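To make the _reconnect to the same key_ idea concrete, a rough redis-py sketch (the stream name and fields are invented for illustration; this is not a channel layer implementation):

```python
import redis

r = redis.Redis()

# Producer side: XADD appends to a log that survives consumer disconnects.
r.xadd("group:chat", {"body": "hello"})

# Consumer side: remember the last-seen ID; after a reconnect, resume from it
# and Redis replays everything published in the meantime.
last_id = "0"  # "0" replays the whole stream; in practice you'd persist the last ID processed
while True:
    for _stream, messages in r.xread({"group:chat": last_id}, block=1000):
        for msg_id, fields in messages:
            last_id = msg_id
            print(fields)
```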
-
@acu192 Many thanks for your thoughts, and even more so for the pubsub implementation. I agree with everything that you say, but the issue is not so much queueing messages but rather the silent nature of the reconnect. In fact, I would not want to queue messages because I also don't want to serve stale data. In our world, if there were to be a hiccup, then we want to know about it so that we can act appropriately. We have actually experimented with the rabbitmq based layer written by the Workbench team, but we are heavy celery users and lightweight redis connections actually work better for us (hence django/channels_redis#258). I think we're in agreement, I would just like to see a mechanism by which the channel layer bubbles up reconnection events so that consumers can act accordingly if needed.

@carltongibson Not suggesting another implementation using redis streams, just thinking out loud. I also think that would be terrible and, as I said above, I actually would not want to serve stale data in the event of a network partition. We should definitely document tradeoffs but I'm not sure there needs to be much of a tradeoff. We either make the reconnect logic optional, or better, we notify consumers of disconnect/reconnect events.
-
@acu192 Btw - you may have something similar in-house, but to stress our infra for consistency at scale we built this little tool: https://github.com/zumalabs/sockbasher. It's in dire need of some external docs but might be useful in its current form for you.
-
@carltongibson @acu192 so after reviewing this morning for a bit, the shards should emit disconnect/reconnect events to the channel layer. Then, I think there are two approaches that we can take:

1. Make the reconnect logic optional.
2. Notify consumers of the disconnect/reconnect events so they can act accordingly.

I am in favor of 2 as I think it's the better approach, with the caveat that it won't be fully backwards compatible.
-
Ah, I see now. Yes that makes sense. So, now I assume your idea is: If a Channel Consumer (server-side) knew that a message was dropped, it could close its corresponding websocket. The front-end would notice the closed websocket and reconnect itself. I could go for that on my site, I think it would work well. Let's call this "Idea A".
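A sketch of what Idea A could look like from the consumer side — note that the `layer_connection_lost` hook and the 4000 close code are invented here for illustration; no such hook exists in Channels today:

```python
from channels.generic.websocket import AsyncJsonWebsocketConsumer

class IdeaAConsumer(AsyncJsonWebsocketConsumer):
    async def layer_connection_lost(self, event):
        # Hypothetical hook: if the channel layer could tell us it may have
        # dropped messages, we would close the socket with an app-specific
        # code; the frontend sees the close, reconnects, and re-syncs state.
        await self.close(code=4000)
```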
To achieve Idea A, I don't think your idea of bubbling-up the Redis-reconnection event is enough. It would help (and for that reason maybe we do it), but we will still need more. Two reasons I can see (bear with me... I'm kinda thinking out loud here):
With all this in mind... let's say Redis dies, a consumer realizes it, tries to notify every consumer of this event... but Redis is still down, so it can't notify everyone... or worse, Redis is down but the "send" function seems to work for a few milliseconds (so the consumer doing the notifying is tricked)... also, there might be a lot of consumers all noticing this at the same time and all suddenly deciding to notify everyone else! My gut says this will be a mess.
I like the name... "sockbasher" 🤓
-
A simpler way to achieve Idea A is for each producer to include sequence numbers in its messages. Then the consumers check for gaps in the sequence numbers. But that has to be implemented at the producer layer and consumer layer. I don't think the logic works at the channel layer... but need to think more.
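A rough sketch of that producer/consumer sequence-number idea — the group name, message type, and helper names are invented for illustration:

```python
import itertools

# Producer side: stamp each outgoing group message with a monotonically
# increasing counter (one counter per producer process).
_seq = itertools.count()

async def group_send_with_seq(channel_layer, group, payload):
    await channel_layer.group_send(group, {
        "type": "chat.message",
        "seq": next(_seq),
        "payload": payload,
    })

# Consumer side: any gap in the sequence means a message was dropped
# somewhere, so the consumer can close its websocket and force a re-sync.
class GapDetector:
    def __init__(self):
        self.expected = None

    def saw_gap(self, seq):
        gap = self.expected is not None and seq != self.expected
        self.expected = seq + 1
        return gap
```

(With multiple producers you'd also need a producer ID in each message and one `GapDetector` per producer, which hints at why this fits the producer/consumer layers rather than the channel layer.)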
-
@acu192 have a look at the last few PRs I've opened. One of them implements the disconnect events. All of this is handled at the layer level and delivered directly to the consumers, bypassing any message broker.
-
I've only been able to follow this discussion from a high level, but given our implementation drives 1000's of point-of-sale devices, a missed message is a nightmare to code for. I think my current set of options is:
Just confirming I am correct on these counts. Thanks.
-
"All Redis data resides in-memory, in contrast to databases that store data on disk or SSDs." ref Therefore you are playing with fire if you need guaranteed message delivery. If your Redis server dies/reboots/whatever (or the network connection to Redis has a hiccup) it is very likely messages will be lost, no matter which channel layer you use ( |
-
The issue is slightly more complicated than that. Yes, absolutely, Redis does not provide strong durability guarantees. But it is commonplace to run with at least RDB, or better AOF, which provides some persistence. Additionally, running replicas as we do (hence my focus on Sentinel) lowers the window for data loss tremendously. And ultimately, as long as we can engineer around lost messages (i.e. the disconnect logic and forcing a client re-sync), these issues can be hugely minimized.

@LiteWait RabbitMQ and the Workbench layer are a much better approach if you can use them. We make extensive use of Celery, and they have taken an approach that is relatively incompatible with the Celery pre-fork model. This is why our focus is on highly available Redis AND detection of any network partitions.
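The "forcing a client re-sync" piece can be as simple as pushing a full snapshot on every (re)connect. A minimal sketch, where `get_current_state()` is a placeholder for querying your real source of truth:

```python
from channels.generic.websocket import AsyncJsonWebsocketConsumer

class ResyncingConsumer(AsyncJsonWebsocketConsumer):
    async def connect(self):
        await self.accept()
        # Every connect -- including reconnects after a network partition --
        # starts with a full snapshot, so anything dropped while the socket
        # (or Redis) was down is covered.
        await self.send_json({
            "type": "snapshot",
            "state": await self.get_current_state(),
        })

    async def get_current_state(self):
        raise NotImplementedError  # app-specific: query Postgres, etc.
```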
-
@acu192 @qeternity thanks. What worries me isn't Elasticache/Redis dropping out, it's disconnects between Daphne and the JS client, and lost messages. Granted you can't do anything about the Internet, but as long as I don't have to worry about messages that haven't actually hit the wire yet going missing, that would be fault tolerant enough for our application.
-
Same with my app actually. I'm not worried about this because it's very rare and it will not kill our business in the event it does happen. We store all the things we actually care to keep in a real database. Redis is just a nice way to sync state, but it is not the authority of the state. With that in mind you should try the new pubsub impl. It should not [1] drop messages unless the Redis server (or the connection to it) dies. I've been happy with it for this reason.

[1] "should not" meaning I don't know of any reason it would... and I've been testing it and running it in production for only a few months, so take that for what it's worth. We're considering it beta for now.
I've been using Uvicorn (instead of Daphne) and found it's generally better (faster, more reliable, whatever). Maybe give that a try if you think Daphne is causing issues for you.
-
@LiteWait @acu192 Absolutely you should not be using Daphne in prod (we also use Uvicorn). In terms of dropping messages between webserver and client, websockets run over TCP, so you should have the same guarantees there as you do with TCP. That said, you should expect network issues everywhere. If you are using websockets as a source of truth, I think that's a mistake, as you'd need to implement some sort of 2PC on top. Distributed systems are difficult, which is why we treat redis/channels as a nice-to-have real-time sync, which we expect to break, falling back to sync'ing via API, which is backed by our Postgres cluster and Postgres' decades of battle testing to overcome these exact issues.
-
This seems the key point. If the layer drops the connection to Redis, you're going to lose messages (independently of which ASGI server you happen to be using). Short of a much more robust system on top of ASGI, you either need to accept that the occasional message will go missing, or else periodically fall back to a more reliable method (HTTP polling the source of truth, as the first approach). I'm thinking this is a documentation issue? There's a few hits for "at most once" in the Channels docs already. Perhaps pulling those together into a single discussion would be worthwhile. 🤔 (See discussion on django/channels_redis#259)
-
I'm going to move this over to discussions on the Channels repo. If we pin down something addressable, happy to move it back.
-
Continuing from django/channels_redis#251 - I have a few concerns around the architecture for pubsub. While the reconnect logic is admirable, I'm not sure that it's appropriate in pubsub because it allows for silent missed messages.
Let's say between our `_do_keepalive()` loops the connection is lost, and a publisher elsewhere sends a message. `RedisSingleShardConnection` will silently reconnect having missed the message. In the current blocking-list architecture, this would be fine because we could reconnect to the same key and continue popping items... they would simply queue up like a log (Redis Streams would work well). But in pubsub, those messages will be lost.

This is very relevant to our usage. In the event of a network blip or Sentinel failover, the websocket consumers disconnect and the frontend attempts to gracefully recover by reconnecting websockets and performing a full state refresh to ensure no data has been missed.
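For readers who haven't looked at the code, a distilled sketch of the shape of the problem — this is not the actual channels_redis implementation; `is_healthy()` and `reconnect()` are stand-ins:

```python
import asyncio

async def _do_keepalive(conn):
    # Runs forever, checking the subscriber connection once per second.
    while True:
        await asyncio.sleep(1)
        if not await conn.is_healthy():   # stand-in health check
            await conn.reconnect()        # silently re-SUBSCRIBEs...
            # ...but anything PUBLISHed between the drop and this point went
            # to no one: Redis Pub/Sub keeps no backlog to replay on resubscribe.
```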