prov/verbs: after one side calls fi_connect and fi_eq_sread, it takes about 18.5 s before the other side's fi_eq_sread returns the connect message. Why? #10566

Open
hsh258 opened this issue Nov 21, 2024 · 5 comments

Comments

@hsh258

hsh258 commented Nov 21, 2024

Describe the bug
prov/verbs: one side calls fi_connect and then fi_eq_sread, but it takes about 18.5 s before fi_eq_sread on the other side returns the connect message. Why?
By the way, how can I set the reconnect interval? Thanks.

To Reproduce
Steps to reproduce the behavior:
1. prov/verbs: one side calls fi_connect and then waits in fi_eq_sread.
2. About 18.5 s later, fi_eq_sread on the other side returns the connect message.
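
To pin down where the 18.5 s is spent, it can help to timestamp the blocking wait on each side with a monotonic clock. The following is a minimal sketch, not taken from libfabric: `elapsed_ms`, `time_blocking_call`, and the `blocking_fn` callback type are hypothetical helpers introduced only for illustration, and the callback is assumed to wrap the application's own `fi_eq_sread` call.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Milliseconds between two CLOCK_MONOTONIC timestamps (end >= start). */
static int64_t elapsed_ms(const struct timespec *start,
                          const struct timespec *end)
{
    int64_t ns = ((int64_t)end->tv_sec - (int64_t)start->tv_sec) * 1000000000LL
               + ((int64_t)end->tv_nsec - (int64_t)start->tv_nsec);
    return ns / 1000000;
}

/* Hypothetical callback type: wraps the blocking call under test.
 * Assumption: in a libfabric program it would wrap something like
 *   fi_eq_sread(eq, &event, &entry, sizeof entry, -1, 0);
 * and return that call's result. */
typedef int (*blocking_fn)(void *arg);

/* Runs fn(arg), stores its return value in *ret (if ret is non-NULL),
 * and returns how long the call blocked, in milliseconds. */
static int64_t time_blocking_call(blocking_fn fn, void *arg, int *ret)
{
    struct timespec t0, t1;

    assert(fn != NULL);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    int rc = fn(arg);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    if (ret != NULL)
        *ret = rc;
    return elapsed_ms(&t0, &t1);
}

/* Stand-in for the real wait, used only to exercise the helper:
 * sleeps for *(int *)arg milliseconds and returns 0. */
static int fake_wait(void *arg)
{
    int ms = *(int *)arg;
    struct timespec d = { .tv_sec = ms / 1000,
                          .tv_nsec = (long)(ms % 1000) * 1000000L };
    nanosleep(&d, NULL);
    return 0;
}
```

Logging the value returned by `time_blocking_call` on both the connecting and the listening side, together with a packet capture, shows whether the delay sits before the request leaves the sender or before it is delivered to the receiver.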

@hsh258 hsh258 added the bug label Nov 21, 2024
@sydidelot
Member

@hsh258 It's clearly an issue with your environment. Have you checked whether the problem reproduces with standalone tools like rdma_{server,client} or qperf?

@hsh258
Author

hsh258 commented Nov 21, 2024

> @hsh258 It's clearly an issue with your environment. Have you tried with standalone tools like rdma_{server,client} or qperf if the problem reproduces?

@sydidelot
Hi,
I would like to change the interval at which the ConnectRequest is retried. How can I do that? Thanks.

@sydidelot
Member

@hsh258 The only timeout I am aware of in the verbs provider is this one:
#define VERBS_RESOLVE_TIMEOUT 2000 // ms

But honestly, it would be a better option for you to fix your network than trying to hack libfabric :)

@hsh258
Author

hsh258 commented Nov 28, 2024

> @hsh258 The only timeout I am aware of in the verbs provider is this one:
> #define VERBS_RESOLVE_TIMEOUT 2000 // ms
>
> But honestly, it would be a better option for you to fix your network than trying to hack libfabric :)

@sydidelot
Hi,
When the problem reproduces, I see that the first RRoCE ConnectRequest is valid (322 bytes) on the sending side, but it arrives as an invalid packet (120 bytes) on the receiving side: the MAD header, the CM ConnectRequest, and the InfiniBand Invariant CRC are dropped.
Only when the second RRoCE ConnectRequest is sent does the receiving side receive the ConnectRequest normally.
Could you tell me the cause? Thanks.

@sydidelot
Member

sydidelot commented Dec 2, 2024

I'm sorry, I don't do consulting on networking problems.
As I told you earlier, you should verify your network infrastructure with standalone tools like rdma_server + rdma_client or qperf. If the problem reproduces with these tools, then it's not specific to libfabric.
At DDN, we've been using libfabric with the verbs provider on large scale systems: issues during connection establishment are usually due to a problem in the network fabric itself.
