[orchagent/syncd] Adding Loopback0 IP causes orchagent/syncd crash/dump on Accton-AS9716-32D Tomahawk 3 #20837

Open
wdoekes opened this issue Nov 18, 2024 · 8 comments · May be fixed by sonic-net/sonic-swss#3377
Labels
Triaged this issue has been triaged

Comments

@wdoekes
Contributor

wdoekes commented Nov 18, 2024

Description

After adding an IP to Loopback0 using config interface ip add Loopback0 10.1.0.1/32 and doing a cold reboot, swss/syncd fail on the Accton-AS9716-32D (Broadcom Tomahawk 3).

Before rebooting, adding the IP works fine:

# config interface ip add Loopback0 10.1.0.1/32

# ping 10.1.0.1 -c1
PING 10.1.0.1 (10.1.0.1) 56(84) bytes of data.
64 bytes from 10.1.0.1: icmp_seq=1 ttl=64 time=0.053 ms

After reboot, we see this:

  • doDecapTunnelTask: Tunnel IPINIP_TUNNEL added to ASIC_DB
  • processQuadEvent: attr: SAI_TUNNEL_TERM_TABLE_ENTRY_ATTR_TYPE: SAI_TUNNEL_TERM_TABLE_ENTRY_TYPE_P2MP
  • create: create status: SAI_STATUS_NOT_SUPPORTED

This eventually results in:

2024 Nov 18 08:54:48.200497 sonic NOTICE swss#orchagent: :- notifySyncd: sending syncd: SYNCD_INVOKE_DUMP
2024 Nov 18 08:54:48.200684 spine2 NOTICE syncd#syncd: :- processNotifySyncd: Invoking SAI failure dump

Both syncd and orchagent (swss) then stop.

Steps to reproduce the issue:

  1. Take an Accton-AS9716-32D Tomahawk 3.
  2. Load master or 202405 branch (BRCM SAI ver: [11.2.13.1], OCP SAI ver: [1.14.0], SDK ver: [sdk-6.5.30-SP4], RedisRemoteSaiInterface: sairedis git revision 17e893c5, SAI git revision: e0e9787, sonic-swss e99f5a914 (probably))
  3. Add IP to loopback interface:
# config interface ip add Loopback0 10.1.0.1/32
2024 Nov 18 08:46:05.856763 spine2 NOTICE swss#orchagent: :- addIp2MeRoute: Create IP2me route ip:10.1.0.1
  4. Reboot
  5. Observe how orchagent stops and takes syncd with it:
2024 Nov 18 08:54:48.198405 sonic NOTICE swss#orchagent: :- addDecapTunnel: Create overlay loopback router interface oid:60000000005b2
2024 Nov 18 08:54:48.199444 sonic NOTICE swss#orchagent: :- doDecapTunnelTask: Tunnel IPINIP_TUNNEL added to ASIC_DB.
2024 Nov 18 08:54:48.200064 spine2 ERR syncd#syncd: :- sendApiResponse: api SAI_COMMON_API_CREATE failed in syncd mode: SAI_STATUS_NOT_SUPPORTED
2024 Nov 18 08:54:48.200187 spine2 ERR syncd#syncd: :- processQuadEvent: attr: SAI_TUNNEL_TERM_TABLE_ENTRY_ATTR_VR_ID: oid:0x3000000000021
2024 Nov 18 08:54:48.200187 spine2 ERR syncd#syncd: :- processQuadEvent: attr: SAI_TUNNEL_TERM_TABLE_ENTRY_ATTR_TYPE: SAI_TUNNEL_TERM_TABLE_ENTRY_TYPE_P2MP
2024 Nov 18 08:54:48.200187 spine2 ERR syncd#syncd: :- processQuadEvent: attr: SAI_TUNNEL_TERM_TABLE_ENTRY_ATTR_TUNNEL_TYPE: SAI_TUNNEL_TYPE_IPINIP
2024 Nov 18 08:54:48.200187 spine2 ERR syncd#syncd: :- processQuadEvent: attr: SAI_TUNNEL_TERM_TABLE_ENTRY_ATTR_ACTION_TUNNEL_ID: oid:0x2a0000000005b3
2024 Nov 18 08:54:48.200187 spine2 ERR syncd#syncd: :- processQuadEvent: attr: SAI_TUNNEL_TERM_TABLE_ENTRY_ATTR_DST_IP: 10.1.0.1
2024 Nov 18 08:54:48.200418 sonic ERR swss#orchagent: :- create: create status: SAI_STATUS_NOT_SUPPORTED
2024 Nov 18 08:54:48.200418 sonic ERR swss#orchagent: :- addDecapTunnelTermEntry: Failed to create tunnel decap term entry 10.1.0.1/32.
2024 Nov 18 08:54:48.200418 sonic ERR swss#orchagent: :- handleSaiCreateStatus: Encountered failure in create operation, exiting orchagent, SAI API: SAI_API_TUNNEL, status: SAI_STATUS_NOT_SUPPORTED
2024 Nov 18 08:54:48.200497 sonic NOTICE swss#orchagent: :- notifySyncd: sending syncd: SYNCD_INVOKE_DUMP
2024 Nov 18 08:54:48.200684 spine2 NOTICE syncd#syncd: :- processNotifySyncd: Invoking SAI failure dump
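
For anyone checking whether their own box hits the same sequence, here is a minimal sketch of a syslog line classifier. It is not part of SONiC: the helper name classifyLine and the TunnelLogEvent enum are hypothetical, and the match strings are simply copied from the log excerpt above, so they may need adjusting for other builds.

```cpp
#include <string_view>

// Hypothetical helper, not part of SONiC. The substrings are taken
// verbatim from the log excerpt in this issue.
enum class TunnelLogEvent { None, TermEntryRejected, OrchagentExit, SyncdDump };

TunnelLogEvent classifyLine(std::string_view line)
{
    // True when the log line contains the given substring.
    auto has = [&](std::string_view needle) {
        return line.find(needle) != std::string_view::npos;
    };

    // orchagent treating the tunnel create failure as fatal.
    if (has("handleSaiCreateStatus") && has("exiting orchagent") && has("SAI_API_TUNNEL"))
        return TunnelLogEvent::OrchagentExit;

    // The decap term entry that the SAI refused to create.
    if (has("addDecapTunnelTermEntry") && has("Failed to create tunnel decap term entry"))
        return TunnelLogEvent::TermEntryRejected;

    // orchagent asking syncd for a dump, or syncd acting on that request.
    if (has("SYNCD_INVOKE_DUMP") || has("Invoking SAI failure dump"))
        return TunnelLogEvent::SyncdDump;

    return TunnelLogEvent::None;
}
```

Feeding each line of /var/log/syslog through classifyLine and looking for TermEntryRejected followed by OrchagentExit and SyncdDump would correspond to the failure described here; lines that match none of the patterns come back as None.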

Describe the results you received:

Crashing orchagent/syncd.

Describe the results you expected:

No crashing.

Output of show version:

SONiC Software Version: SONiC.osso202405.0-49b1c0f39
SONiC OS Version: 12
Distribution: Debian 12.8
Kernel: 6.1.0-22-2-amd64
Build commit: 49b1c0f39
Build date: Wed Nov 13 23:13:12 UTC 2024
Built by: [email protected]

Platform: x86_64-accton_as9716_32d-r0
HwSKU: Accton-AS9716-32D
ASIC: broadcom
ASIC Count: 1

Build: https://github.com/ossobv/sonic-buildimage/tree/osso202405-20241113-0-49b1c0f39-dbg

Based on 202405 branch.

Additional information you deem important (e.g. issue happens only occasionally):

I think the culprit might be: sonic-net/sonic-swss@353ab92

The same image doesn't work on my Accton-AS7326-56X Trident 3 because of a different issue; but with a master build with libsai reverted to 10.1.42.0 it does work, adding the Loopback0 IP as appropriate, with the following logs:

Nov 16 09:29:40.756925 leaf1.dostno.systems NOTICE swss#orchagent: :- addDecapTunnel: Create overlay loopback router interface oid:6000000000a1f
Nov 16 09:29:40.757552 leaf1.dostno.systems INFO syncd#syncd: [none] brcm_sai_create_tunnel:7698 tunnel 1 id: 180388626433
Nov 16 09:29:40.761297 leaf1.dostno.systems NOTICE swss#orchagent: :- addDecapTunnelTermEntries: Created tunnel entry for ip: 10.1.0.1
Nov 16 09:29:42.039038 leaf1.dostno.systems NOTICE swss#tunnelmgrd: :- doLpbkIntfTask: Loopback intf Loopback0 saved 10.1.0.1/32
Nov 16 09:29:43.770966 leaf1.dostno.systems NOTICE swss#intfmgrd: :- doIntfAddrTask: Delete route with ip_prefix:10.1.0.1
Nov 16 09:29:43.827672 leaf1.dostno.systems NOTICE swss#orchagent: :- addIp2MeRoute: Create IP2me route ip:10.1.0.1
@wdoekes
Contributor Author

wdoekes commented Nov 18, 2024

Yea. Confirmed.

After applying this patch to sonic-swss:

--- a/orchagent/tunneldecaporch.cpp
+++ b/orchagent/tunneldecaporch.cpp
@@ -968,6 +968,11 @@ bool TunnelDecapOrch::addDecapTunnelTermEntry(
     sai_status_t status = sai_tunnel_api->create_tunnel_term_table_entry(&tunnel_term_table_entry_id, gSwitchId, (uint32_t)tunnel_table_entry_attrs.size(), tunnel_table_entry_attrs.data());
     if (status != SAI_STATUS_SUCCESS)
     {
+        if (status == SAI_STATUS_NOT_SUPPORTED)
+        {
+            SWSS_LOG_WARN("Creating SAI_API_TUNNEL returned SAI_STATUS_NOT_SUPPORTED for tunnel %s IP %s.", tunnel_name.c_str(), dst_ip.to_string().c_str());
+            return false;
+        }
         SWSS_LOG_ERROR("Failed to create tunnel decap term entry %s.", dst_ip.to_string().c_str());
         task_process_status handle_status = handleSaiCreateStatus(SAI_API_TUNNEL, status);
         if (handle_status != task_success)
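
The branch order the patch ends up with can be sketched in isolation. This is only an illustration under stated assumptions: Status, Outcome and onTermEntryCreate are stand-ins invented here, not the real sai_status_t or swss types, and the real function does more than this.

```cpp
// Stand-ins for the real SAI/swss types; the names are illustrative only.
enum class Status { Success, NotSupported, Failure };
enum class Outcome { Created, SkippedWithWarning, HandledAsError };

// Mirrors the order of checks after the patch: success first, then the
// new NOT_SUPPORTED early return, then the pre-existing error path.
Outcome onTermEntryCreate(Status status)
{
    if (status == Status::Success)
        return Outcome::Created;

    if (status == Status::NotSupported)
        return Outcome::SkippedWithWarning; // patch: SWSS_LOG_WARN, return false

    return Outcome::HandledAsError;         // unchanged: handleSaiCreateStatus(...)
}
```

The point of returning early is that SAI_STATUS_NOT_SUPPORTED never reaches handleSaiCreateStatus, which is the call that logs "exiting orchagent" in the original report. Any other failure status still takes the old path.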

Then the logs look a lot happier:

2024 Nov 18 20:19:06.931352 spine2 WARNING swss#orchagent: :- trapGroupUpdatePolicer: Creating policer for existing Trap group: 1100000000051f (name:queue4_group3).
2024 Nov 18 20:19:06.931859 spine2 NOTICE swss#orchagent: :- createPolicer: Create policer for trap group queue4_group3
2024 Nov 18 20:19:06.932281 spine2 NOTICE swss#orchagent: :- createPolicer: Bind policer to trap group queue4_group3:
2024 Nov 18 20:19:06.937624 spine2 NOTICE swss#orchagent: :- addDecapTunnel: Create overlay loopback router interface oid:6000000000523
2024 Nov 18 20:19:06.938214 spine2 NOTICE swss#orchagent: :- doDecapTunnelTask: Tunnel IPINIP_TUNNEL added to ASIC_DB.
2024 Nov 18 20:19:06.938581 spine2 ERR syncd#syncd: :- sendApiResponse: api SAI_COMMON_API_CREATE failed in syncd mode: SAI_STATUS_NOT_SUPPORTED
2024 Nov 18 20:19:06.938581 spine2 ERR syncd#syncd: :- processQuadEvent: attr: SAI_TUNNEL_TERM_TABLE_ENTRY_ATTR_VR_ID: oid:0x3000000000021
2024 Nov 18 20:19:06.938581 spine2 ERR syncd#syncd: :- processQuadEvent: attr: SAI_TUNNEL_TERM_TABLE_ENTRY_ATTR_TYPE: SAI_TUNNEL_TERM_TABLE_ENTRY_TYPE_P2MP
2024 Nov 18 20:19:06.938603 spine2 ERR syncd#syncd: :- processQuadEvent: attr: SAI_TUNNEL_TERM_TABLE_ENTRY_ATTR_TUNNEL_TYPE: SAI_TUNNEL_TYPE_IPINIP
2024 Nov 18 20:19:06.938603 spine2 ERR syncd#syncd: :- processQuadEvent: attr: SAI_TUNNEL_TERM_TABLE_ENTRY_ATTR_ACTION_TUNNEL_ID: oid:0x2a000000000524
2024 Nov 18 20:19:06.938621 spine2 ERR syncd#syncd: :- processQuadEvent: attr: SAI_TUNNEL_TERM_TABLE_ENTRY_ATTR_DST_IP: 10.1.0.1
2024 Nov 18 20:19:06.938794 spine2 ERR swss#orchagent: :- create: create status: SAI_STATUS_NOT_SUPPORTED
2024 Nov 18 20:19:06.938794 spine2 WARNING swss#orchagent: :- addDecapTunnelTermEntry: Creating SAI_API_TUNNEL returned SAI_STATUS_NOT_SUPPORTED for tunnel IPINIP_TUNNEL IP 10.1.0.1/32.
2024 Nov 18 20:19:06.938794 spine2 ERR swss#orchagent: :- doDecapTunnelTermTask: IPINIP_TUNNEL:10.1.0.1: failed to add tunnel decap term to ASIC_DB.
2024 Nov 18 20:19:06.939217 spine2 ERR syncd#syncd: :- sendApiResponse: api SAI_COMMON_API_CREATE failed in syncd mode: SAI_STATUS_NOT_SUPPORTED
2024 Nov 18 20:19:06.939313 spine2 ERR syncd#syncd: :- processQuadEvent: attr: SAI_TUNNEL_TERM_TABLE_ENTRY_ATTR_VR_ID: oid:0x3000000000021
2024 Nov 18 20:19:06.939313 spine2 ERR syncd#syncd: :- processQuadEvent: attr: SAI_TUNNEL_TERM_TABLE_ENTRY_ATTR_TYPE: SAI_TUNNEL_TERM_TABLE_ENTRY_TYPE_P2MP
2024 Nov 18 20:19:06.939313 spine2 ERR syncd#syncd: :- processQuadEvent: attr: SAI_TUNNEL_TERM_TABLE_ENTRY_ATTR_TUNNEL_TYPE: SAI_TUNNEL_TYPE_IPINIP
2024 Nov 18 20:19:06.939397 spine2 ERR syncd#syncd: :- processQuadEvent: attr: SAI_TUNNEL_TERM_TABLE_ENTRY_ATTR_ACTION_TUNNEL_ID: oid:0x2a000000000524
2024 Nov 18 20:19:06.939397 spine2 ERR syncd#syncd: :- processQuadEvent: attr: SAI_TUNNEL_TERM_TABLE_ENTRY_ATTR_DST_IP: 192.168.0.2
2024 Nov 18 20:19:06.939397 spine2 ERR swss#orchagent: :- create: create status: SAI_STATUS_NOT_SUPPORTED
2024 Nov 18 20:19:06.939426 spine2 WARNING swss#orchagent: :- addDecapTunnelTermEntry: Creating SAI_API_TUNNEL returned SAI_STATUS_NOT_SUPPORTED for tunnel IPINIP_TUNNEL IP 192.168.0.2/32.
2024 Nov 18 20:19:06.939426 spine2 ERR swss#orchagent: :- doDecapTunnelTermTask: IPINIP_TUNNEL:192.168.0.2: failed to add tunnel decap term to ASIC_DB.
2024 Nov 18 20:19:06.939875 spine2 ERR syncd#syncd: :- sendApiResponse: api SAI_COMMON_API_CREATE failed in syncd mode: SAI_STATUS_NOT_SUPPORTED
2024 Nov 18 20:19:06.939983 spine2 ERR syncd#syncd: :- processQuadEvent: attr: SAI_TUNNEL_TERM_TABLE_ENTRY_ATTR_VR_ID: oid:0x3000000000021
2024 Nov 18 20:19:06.939983 spine2 ERR syncd#syncd: :- processQuadEvent: attr: SAI_TUNNEL_TERM_TABLE_ENTRY_ATTR_TYPE: SAI_TUNNEL_TERM_TABLE_ENTRY_TYPE_P2MP
2024 Nov 18 20:19:06.939983 spine2 ERR syncd#syncd: :- processQuadEvent: attr: SAI_TUNNEL_TERM_TABLE_ENTRY_ATTR_TUNNEL_TYPE: SAI_TUNNEL_TYPE_IPINIP
2024 Nov 18 20:19:06.940018 spine2 ERR syncd#syncd: :- processQuadEvent: attr: SAI_TUNNEL_TERM_TABLE_ENTRY_ATTR_ACTION_TUNNEL_ID: oid:0x2a000000000524
2024 Nov 18 20:19:06.940076 spine2 ERR syncd#syncd: :- processQuadEvent: attr: SAI_TUNNEL_TERM_TABLE_ENTRY_ATTR_DST_IP: 192.168.8.2
2024 Nov 18 20:19:06.940076 spine2 ERR swss#orchagent: :- create: create status: SAI_STATUS_NOT_SUPPORTED
2024 Nov 18 20:19:06.940076 spine2 WARNING swss#orchagent: :- addDecapTunnelTermEntry: Creating SAI_API_TUNNEL returned SAI_STATUS_NOT_SUPPORTED for tunnel IPINIP_TUNNEL IP 192.168.8.2/32.
2024 Nov 18 20:19:06.940101 spine2 ERR swss#orchagent: :- doDecapTunnelTermTask: IPINIP_TUNNEL:192.168.8.2: failed to add tunnel decap term to ASIC_DB.
2024 Nov 18 20:19:06.959060 spine2 WARNING syncd#syncd: [none] SAI_API_UNSPECIFIED:sai_bulk_object_get_stats:788 Bulk Object Stats get not supported on this device 
2024 Nov 18 20:19:07.021181 spine2 WARNING syncd#syncd: message repeated 150 times: [ [none] SAI_API_UNSPECIFIED:sai_bulk_object_get_stats:788 Bulk Object Stats get not supported on this device ]
2024 Nov 18 20:19:07.021181 spine2 NOTICE syncd#syncd: :- threadFunction: time span 0 ms for 'start_poll:PG_DROP_STAT_COUNTER:oid:0x1a00000000048d'
2024 Nov 18 20:19:07.021181 spine2 WARNING syncd#syncd: [none] SAI_API_UNSPECIFIED:sai_bulk_object_get_stats:788 Bulk Object Stats get not supported on this device 

I'll go file a PR there.

wdoekes added a commit to ossobv/sonic-swss that referenced this issue Nov 19, 2024
…unnels

The Tomahawk 3 chipset on an Accton AS9716-32D does not do
SAI_TUNNEL_TERM_TABLE_ENTRY_TYPE_P2MP tunnels and returns
SAI_STATUS_NOT_SUPPORTED.

Since sonic-net#3117, such a tunnel is created unconditionally whenever an IP
address is added to Loopback0. The SAI_STATUS_NOT_SUPPORTED is handled
as an unrecoverable error. That takes the orchagent down and syncd along
with it.

    root@spine2:0:~# dmidecode 2>/dev/null | grep -A1 'Manufacturer:' | head -n2
    Manufacturer: Accton
    Product Name: AS9716-32D

    root@spine2:0:~# printf '%s\n' 'bsv' 'show unit' 'ver' |
        xargs -d\\n -n1 bcmcmd -t 1 | grep '^[A-Z]'
    BRCM SAI ver: [11.2.13.1], OCP SAI ver: [1.14.0], SDK ver: [sdk-6.5.30-SP4]
    Unit 0 chip BCM56980_B0 (current)
    Broadcom Command Monitor: Copyright (c) 1998-2024 Broadcom
    Release: sdk-6.5.30-SP4 built 20241016 (Wed Oct 16 13:33:02 2024)
    From root@7ec56acb7b44:/__w/1/s/output/x86-xgsall-deb/xgs-sdk-src/hsdk-all-6.5.30
    Platform: X86
    OS: Unix (Posix)

Logs output:

    NOTICE swss#orchagent: :- addDecapTunnel: Create overlay loopback router
      interface oid:60000000005b2
    NOTICE swss#orchagent: :- doDecapTunnelTask: Tunnel IPINIP_TUNNEL added
      to ASIC_DB.
    ERR syncd#syncd: :- sendApiResponse: api SAI_COMMON_API_CREATE failed in
      syncd mode: SAI_STATUS_NOT_SUPPORTED
    ERR syncd#syncd: :- processQuadEvent: attr:
      SAI_TUNNEL_TERM_TABLE_ENTRY_ATTR_VR_ID: oid:0x3000000000021a
    ...
    ERR swss#orchagent: :- create: create status: SAI_STATUS_NOT_SUPPORTED
    ERR swss#orchagent: :- addDecapTunnelTermEntry: Failed to create tunnel
      decap term entry 10.1.0.1/32.
    ERR swss#orchagent: :- handleSaiCreateStatus: Encountered failure in
      create operation, exiting orchagent, SAI API: SAI_API_TUNNEL, status:
      SAI_STATUS_NOT_SUPPORTED
    NOTICE swss#orchagent: :- notifySyncd: sending syncd: SYNCD_INVOKE_DUMP
    NOTICE syncd#syncd: :- processNotifySyncd: Invoking SAI failure dump

This changeset adds a check for SAI_STATUS_NOT_SUPPORTED, turning the
error into a warning and going back to behaviour before subnet_decap was
added:

    ERR swss#orchagent: :- create: create status: SAI_STATUS_NOT_SUPPORTED
    WARNING swss#orchagent: :- addDecapTunnelTermEntry: Creating SAI_API_TUNNEL
      returned SAI_STATUS_NOT_SUPPORTED for tunnel IPINIP_TUNNEL IP 10.1.0.1/32.
    ERR swss#orchagent: :- doDecapTunnelTermTask: IPINIP_TUNNEL:10.1.0.1: failed
      to add tunnel decap term to ASIC_DB.

Resolves: sonic-net/sonic-buildimage#20837
prgeor added the Triaged label Nov 20, 2024
@prgeor
Contributor

prgeor commented Nov 20, 2024

@wdoekes looks like you already fixed the issue

@wdoekes
Contributor Author

wdoekes commented Nov 20, 2024

It ain't fixed until it's merged, yes? =)

Can you get my PRs looked at, @prgeor ? ❤️

If changes are needed, I'll gladly oblige. If tests are needed, I'm going to need some hand holding / examples.

@bradh352

bradh352 commented Nov 22, 2024

@wdoekes I see you mentioned something about TD3; you said that's not related to your PR #3377, right? What was your resolution for TD3 on master? You said you reverted to an older libsai; did you revert this commit? ffb9bc0
I do see they just updated it again today, so maybe it's finally fixed. I'll have to test at some point: #20839

I've been having a heck of a time on TD3 finding a release that actually works properly on the Dell S5248F (with VXLAN support), and have started trying to pull patches into older branches to get it working.

@wdoekes
Contributor Author

wdoekes commented Nov 22, 2024

@bradh352: For TD3 on master, I needed two reverts:

Specifically the latest two on this chain of commits:
https://github.com/ossobv/sonic-buildimage/commits/ossomain-20241112-0-41ea968fc-dbg/
( tag ossomain-20241112-0-41ea968fc-dbg )

I cannot tell whether VXLAN support works with libsaibcm reverted to 10.1.42. I did not get around to trying.

Another thing I tried was running master and then replacing libsaibcm.deb inside the docker-swss with a newer version. That also appeared to work. Fetching a newer .deb is a bit tricky though. I repacked one from a Broadcom or Edgecore distro (don't remember off-hand) and simply dpkg -i'd it inside the running docker container. Things didn't break immediately, but I wouldn't trust it to work long term.

In my case, the master build with libsaibcm 11.2.13 doesn't work, but the 202405 build (also 11.2.13) does. So trying branch 202405 would be my best bet, unless you know that VXLAN is unavailable there.

@bradh352

@wdoekes thanks for that. I know VXLAN on 202405 definitely doesn't work, even with the required patches merged. It seems to at first, but there are major issues.

@wdoekes
Contributor Author

wdoekes commented Nov 22, 2024

For the TD3, did you check syslog for "objectTypeQuery returned NULL": #20725 ? (If I'm not mistaken, both my TH3 and TD3 suffer from that.)

Not that it helps if you do, but then you'll know where it is tracked. And there is activity going on.

@bradh352

Ah thanks. No, on master I just gave up early since nothing works. I figured someone probably had that covered, and with 202411 branching soon I was hopeful it would be fixed. At this point I'm trying to track down where the VXLAN regressions occurred, as I know a lot of people are running 202211 with VXLAN on TD3.
