
[BUG] Federation HA - Passive node not getting Presence syncing broadcast #3564

xhuang-sangoma opened this issue Jan 22, 2025 · 13 comments
xhuang-sangoma commented Jan 22, 2025

OpenSIPS version you are running

[root@sip-b97b69845-fgbn6 /]# opensips -V
version: opensips 3.5.3 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, HP_MALLOC, DBG_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: 051e1c4cc
main.c compiled on 02:28:56 Dec 19 2024 with cc 8

Describe the bug

This is a follow-up to a previous issue reported at #2960.

We have two OpenSIPS instances configured as an active-backup HA pair in federation cluster mode.

The active node has the following settings:

modparam("clusterer","db_url","mysql://xxxx")
modparam("clusterer", "my_node_id", 1)
modparam("clusterer", "sharing_tag" ,"69.108.214.70/1=active")

modparam("presence","db_url","mysql://xxxx")
modparam("presence", "db_update_period", 0)
modparam("presence", "fallback2db", 0)
modparam("presence", "cluster_id", 1)
modparam("presence", "cluster_federation_mode", "full-sharing")
modparam("presence", "cluster_be_active_shtag" ,"69.168.214.70")

The backup node has the following settings:

modparam("clusterer","db_url","mysql://xxxx")
modparam("clusterer", "my_node_id", 2) 
modparam("clusterer", "sharing_tag" ,"69.108.214.70/1=backup")

modparam("presence","db_url","mysql://xxxx")
modparam("presence", "db_update_period", 0)
modparam("presence", "fallback2db", 0)
modparam("presence", "cluster_id", 1)
modparam("presence", "cluster_federation_mode", "full-sharing")
modparam("presence", "cluster_be_active_shtag" ,"69.168.214.70")

These are the entries in the clusterer table:

mysql> select * from clusterer;
+----+------------+---------+-------------------------+-------+-----------------+----------+---------------+-------+-------------+
| id | cluster_id | node_id | url                     | state | no_ping_retries | priority | sip_addr      | flags | description |
+----+------------+---------+-------------------------+-------+-----------------+----------+---------------+-------+-------------+
|  2 |          1 |       1 | bin:69.108.214.99:5566  |     1 |               3 |       50 | 69.108.214.70 | seed  | NULL        |
|  4 |          1 |       2 | bin:69.108.214.97:5566  |     1 |               3 |       50 | 69.108.214.70 | NULL  | NULL        |
+----+------------+---------+-------------------------+-------+-----------------+----------+---------------+-------+-------------+

The VIP 69.168.214.70 is configured on the active node (node_id 1).

We have phones sending REGISTER and SUBSCRIBE (for BLF) requests to the VIP on the active node. The subscriptions are processed by the active node and stored in memory.

Note that we set db_update_period and fallback2db to 0 and cluster_federation_mode to full-sharing because we don't want to use the DB to share the subscriptions. Instead, we want the backup node to get subscriptions synced by receiving the cluster broadcast from the active node.

We expect that subscriptions processed by the active node will also be synced to the backup node, so running "opensips-cli -x mi subs_phtable_list" on both the active and backup nodes should show the same list.

The result is that only the active node prints out the subscription list; the backup node prints an empty list.

This causes a problem: if we switch the VIP and make the backup node active, it will fail to handle any PUBLISH or to deliver NOTIFYs to subscribers, since it doesn't have the subscriptions.

We think the issue is related to this commit: 8b96b70

In this code, the backup node checks cluster_be_active_shtag; since the tag is not active, the node stops accepting any presence cluster traffic. We think this is incorrect.

The expected behaviour is that the backup node should continue to accept the presence cluster broadcast and store the subscriptions in memory; it just does not need to act on any of them.

@bogdan-iancu

@xhuang-sangoma , there might be a bit of confusion here. The cluster_be_active_shtag is to be used in scenarios where you have federated clustering, but you want some nodes within the cluster to be inactive from the clustering perspective (not sending or receiving anything via the clustering layer) - the idea is to allow such cluster-idle nodes to act as standby backups. Such nodes are DB-updated, not cluster-updated.
But in your case you do not want to use the DB at all, so the cluster-idle nodes will be completely disconnected (data-wise) from the rest of the nodes.
The right configuration is to drop cluster_be_active_shtag and let all nodes in the cluster receive and send data over the cluster - so all nodes will have the published data replicated. To control the active-backup setup of your nodes, use the sharing tag attached to the subscription via the handle_subscribe() function - this sh-tag controls which server is responsible for the actions related to the subscription - and, of course, this sh-tag must be active on the OpenSIPS node handling the data and backup on the node in standby.
With this setting, both nodes will share the full presentity data set, but only one will perform subscription-related actions (expiring, notifications, etc.).

@xhuang-sangoma

@bogdan-iancu Thanks for looking into this.

Following your suggestion, I updated the configuration as below:

Active node (note that I've commented out the cluster_be_active_shtag option):

modparam("presence", "db_update_period", 0)
modparam("presence", "fallback2db", 0)
modparam("presence", "cluster_id", 1) 
modparam("presence", "cluster_federation_mode", "full-sharing")
#modparam("presence", "cluster_be_active_shtag" ,"69.168.214.70")

route[handle_presence]
{
	t_newtran();
	if (is_method("PUBLISH")) {
		 handle_publish();
	}
	if (is_method("SUBSCRIBE")) {
		handle_subscribe(,"69.168.214.70");
	}
	exit;
}

Backup Node:

modparam("presence", "db_update_period", 0)
modparam("presence", "fallback2db", 0)
modparam("presence", "cluster_id", 1) 
modparam("presence", "cluster_federation_mode", "full-sharing")
#modparam("presence", "cluster_be_active_shtag" ,"69.168.214.70")

route[handle_presence]
{
	t_newtran();
	if (is_method("PUBLISH")) {
		 handle_publish();
	}
	if (is_method("SUBSCRIBE")) {
		handle_subscribe(,"69.168.214.70");
	}
	exit;
}

The results:

  1. There are error logs like the below on the backup node:
Jan 25 01:54:23 [326554] INFO:clusterer:handle_sync_packet: Received all sync packets for capability 'presence' in cluster 1
Jan 25 01:54:23 [326544] CRITICAL:db_mysql:wrapper_single_mysql_real_query: driver error (1062): Duplicate entry 'HP-000FD3D085A7-18300027-sandbox2-sip.nxf-test.fonality.com-pres' for key 'presentity.presentity_idx'
Jan 25 01:54:23 [326544] ERROR:core:db_do_insert: error while submitting query
Jan 25 01:54:23 [326544] ERROR:presence:update_presentity: inserting new record in database
Jan 25 01:54:23 [326544] ERROR:presence:handle_replicated_publish: failed to update presentity based on replicated Publish
Jan 25 01:54:23 [326544] ERROR:presence:handle_replicated_publish: failed to handle bin packet 101 from node 1
Jan 25 01:54:23 [326544] WARNING:presence:bin_packet_handler: failed to process sync chunk!

It seems the backup node is still trying to insert entries into the presentity table; as a backup node, it's not supposed to do so.

  2. The backup node is not receiving any subscription sync from the active node.

The active node prints out an entry in subs_phtable_list:

[root@sip-b97b69845-fgbn6 /]# opensips-cli -x mi subs_phtable_list
[
    {
        "pres_uri": "sip:[email protected]",
        "event": "message-summary",
        "expires": "2025-01-25 02:09:40",
        "db_flag": 2,
        "version": 16,
        "sharing_tag": "69.168.214.70",
        "to_user": "HM-0000000012862-18300027",
        "to_domain": "69.168.214.70",
        "to_tag": "c36a-521bf6280b9673752315e00b3a6378f9",
        "from_user": "HM-0000000012862-18300027",
        "from_domain": "69.168.214.70",
        "from_tag": "720cdb28",
        "contact": "sip:[email protected]:60909;transport=UDP",
        "callid": "WR1r0xWGrL48bcfGSfC4rg..",
        "local_cseq": 16,
        "remote_cseq": 16
    }
]


[root@sip-b97b69845-fgbn6 /]# opensips-cli -x mi clusterer_list_shtags
[
    {
        "Tag": "69.168.214.70",
        "Cluster": 1,
        "State": "active"
    }
]

The backup node has an empty list:

[root@sip-b97b69845-bzbr6 /]# opensips-cli -x mi subs_phtable_list
[]

[root@sip-b97b69845-bzbr6 /]# opensips-cli -x mi clusterer_list_shtags
[
    {
        "Tag": "69.168.214.70",
        "Cluster": 1,
        "State": "backup"
    }
]

@bogdan-iancu

@xhuang-sangoma , I'm a bit confused about what you want to achieve here. Do you want to set up a pure active-backup configuration, or a federation (with multiple active nodes sharing parts of the data)?

@xhuang-sangoma

xhuang-sangoma commented Jan 28, 2025 via email

@bogdan-iancu

Federation = a collection of nodes, all active, partitioning the presence data among themselves (no node has the full presentity and subscription data set).

HA setup = an active-backup setup where only one node is active; the backup nodes do have access to the full presentity and subscription data set.

So, I would say you want an HA setup, not a federation.

@xhuang-sangoma

We're following this tutorial: https://www.opensips.org/Documentation/Tutorials-Distributed-User-Location-Federation

Specifically this one that says "Federated User Location (with HA)":

[Image: "Federated User Location (with HA)" diagram from the tutorial]

Right now we're only testing with a single pair of active/passive nodes, expecting the two nodes sharing a VIP to have the presentity and subscription data fully replicated; we'll then expand the test to multiple pairs as a federation cluster.

So, are Federation and HA mutually exclusive, or can they be combined as shown in the above diagram?

@bogdan-iancu

@xhuang-sangoma , the tutorial you mentioned is 100% about User Location, so about registrations, while the initial discussion started about presence sharing... so, what are we talking about here :) ?

@xhuang-sangoma

xhuang-sangoma commented Feb 6, 2025

@bogdan-iancu Sorry for the confusion. We use OpenSIPS to handle both registrations and presence, so we expect presence to work the same way as user location with regard to clustering.

Actually, your tutorial at https://blog.opensips.org/2018/03/27/clustering-presence-services-with-opensips-2-4/
shows, in the "Federating scenario with redundancy" section, that it's possible for presence to work that way.

[Image: "Federating scenario with redundancy" diagram from the blog post]

The only difference in our setup is that we don't want to use a shared database (due to MySQL performance issues when there is too much writing to the active_watchers table). So we set cluster_federation_mode to full-sharing, expecting the presentities and subscriptions to get synced to the backup node, but the result is that the backup node has an empty subscription list.

Is this achievable, or must we use a shared DB to store and share subscriptions?

@bogdan-iancu

A couple of issues:

  1. the blog you mentioned is for 2.4; things have changed over the last 6 years and presence clustering now works a bit differently. This is why I kept asking what exactly you want to achieve here.
  2. Presence and registration clustering work totally differently!
  3. for presence, to achieve HA (a fully synced backup), there is no way to do it without the DB. Clustering (in presence) helps with the PUBLISH broadcasting, but not with the sharing/distribution of the subscriptions. Subscriptions may be distributed only via federated clustering.

@xhuang-sangoma

@bogdan-iancu Thanks for clarification.

The 3.5.x presence module documentation still points to the blog post: https://opensips.org/docs/modules/3.5.x/presence.html#presence_clustering
I can't find any newer documentation explaining how the latest presence clustering works, so I assumed it still works the way 2.4 does.

Could you please explain in what scenario the 'full-sharing' option of the presence module's cluster_federation_mode parameter can be used? The documentation says the following:

If you don't want to use a shared database (via [fallback2db](https://opensips.org/docs/modules/3.5.x/presence.html#param_fallback2db)), but still want a complete data set everywhere, you may choose mode full-sharing. This mode allows you to switch PUBLISH endpoints, even for already published Event States, thus allowing you to add and remove presence servers without losing state.

which made us think it's possible to have HA without a shared DB.

@bogdan-iancu

Yes, as the doc says, in full-sharing the sharing is about the presentities, i.e. the PUBLISHed data; it is not related to the subscription data:

full-sharing - published state is kept on all presence nodes even when there aren't any local subscribers. 

Again, the only way to share subscriptions is via DB.
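
For readers landing here, a minimal sketch of the presence settings such a shared-DB HA pair might use on both nodes. This is an illustration, not a tested configuration: the DB URL is a placeholder and the update period is an arbitrary example value.

```
# Sketch: presence HA over a shared DB (same settings on both nodes).
# The db_url below is a placeholder - point it at the DB shared by both nodes.
modparam("presence", "db_url", "mysql://opensips:password@shared-db/opensips")

# Periodically flush subscription state to the shared DB (value is illustrative)
modparam("presence", "db_update_period", 10)

# Fall back to the shared DB when a subscription is not found in memory
modparam("presence", "fallback2db", 1)

# Clustering still broadcasts the PUBLISHed (presentity) state between nodes
modparam("presence", "cluster_id", 1)
modparam("presence", "cluster_federation_mode", "full-sharing")
```

The key difference from the configs earlier in the thread is that db_update_period and fallback2db are non-zero, so subscriptions are persisted and read via the shared DB, while the cluster layer remains responsible only for presentity replication.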

@xhuang-sangoma

Thanks for confirming. We'll set up a shared DB to test again and report back.

@xhuang-sangoma

We tested the presence cluster using a shared MySQL DB. Here are some of the issues we found:

  1. Sometimes a subscription entry gets deleted from the DB (we're not sure whether by the active or the backup node) for no apparent reason, but the entry remains in the active node's memory.

  2. In that case, when a PUBLISH comes in for the subscription, the active node doesn't send out a NOTIFY even though the subscription is in memory; instead it ALWAYS queries the DB for the subscription, finds no record, and prints the following without sending the NOTIFY.

Feb 10 06:40:17 [2025373] DBG:db_mysql:db_mysql_convert_rows: no rows returned from the query
Feb 10 06:40:17 [2025373] DBG:presence:get_subs_db: The query for subscribtion for [uri]= sip:*710*[email protected] for [event]= presence returned no result

  3. During startup, the backup node loads presentities and subscriptions from the DB, but they never get updated by any cluster syncing message.
