Core dump in uct_rdmacm_cm_handle_error_event() when handling RDMA_CM_EVENT_DEVICE_REMOVAL #9740
huzhijiang
started this conversation in
General
Replies: 0 comments
Hi,
I am testing device hot-unplug, and the process always crashes while handling the RDMA_CM_EVENT_DEVICE_REMOVAL event.
It seems that event->id->context, which is expected to be a uct_rdmacm_cm_ep_t *cep, is no longer a valid cep pointer.
After adding debug prints, event->id->context appears to point to a uct_rdmacm_listener_t (the id is the same one created in uct_rdmacm_listener_t_init()). Does that mean uct_rdmacm_cm_handle_error_event() should not simply cast every event->id->context to cep? I do not know whether this is the actual problem or whether it is related to my core dump.
The debug prints also confirm that no uct_rdmacm_cm_ep_t or uct_rdmacm_listener_t object is freed before the crash, so event->id->context should point to valid memory, yet the core dump says it points to a memory area that has already been freed...
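To make the casting question concrete, here is a minimal sketch of one way a handler could tell a listener context apart from an endpoint context before casting. It is not UCX code: every type and function name below (demo_listener_t, demo_ep_t, demo_cm_id_t, ctx_header_t, demo_handle_device_removal) is hypothetical, and it assumes, as the debug prints suggest, that the removal event can arrive on the listener's id as well as on an endpoint's id.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the real UCX structures. */
typedef enum {
    CTX_KIND_LISTENER,
    CTX_KIND_EP
} ctx_kind_t;

/* Common header stored first in every object that is placed in id->context. */
typedef struct {
    ctx_kind_t kind;
} ctx_header_t;

typedef struct {
    ctx_header_t hdr;        /* must stay the first member */
    int          backlog;
} demo_listener_t;

typedef struct {
    ctx_header_t hdr;        /* must stay the first member */
    int          error_seen;
} demo_ep_t;

/* Stand-in for struct rdma_cm_id; only the context pointer matters here. */
typedef struct {
    void *context;
} demo_cm_id_t;

/*
 * Returns 1 if the event was handled as an endpoint error,
 * 0 if the id belongs to a listener and the endpoint path was skipped.
 */
static int demo_handle_device_removal(demo_cm_id_t *id)
{
    ctx_header_t *hdr = id->context;

    assert(hdr != NULL);

    switch (hdr->kind) {
    case CTX_KIND_EP: {
        /* Safe: hdr is the first member of demo_ep_t. */
        demo_ep_t *ep = (demo_ep_t *)hdr;
        ep->error_seen = 1;
        return 1;
    }
    case CTX_KIND_LISTENER:
        /* Do not touch endpoint fields; the object is a listener. */
        return 0;
    }

    return -1; /* unknown kind */
}
```

With a blind cast, the listener case would have its memory interpreted as an endpoint, which could look like a dangling or freed pointer in a core dump. Whether this is what actually happens in UCX 1.12.1 is exactly what I am unsure about.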
Any ideas?
BTW, I am using UCX version 1.12.1.