prov/verbs: memory consistency with MOFED 5.8-3.0.7.0 #9621

Open
thomasgillis opened this issue Dec 3, 2023 · 3 comments

Comments

@thomasgillis
Contributor

Hi all,

I am reaching out because I have an issue with verbs following an update of MOFED.
I am not sure how to approach this; any advice or ideas are welcome.

Describe the bug
I have a client built on top of OFI using verbs;ofi_rxm, which works very well on MLNX_OFED_LINUX-5.8-2.0.3.0 (OFED-5.8-2.0.3).
On a similar system, the OFED version has been bumped to MLNX_OFED_LINUX-5.8-3.0.7.0 (OFED-5.8-3.0.7), and since then I have had issues. When using fi_write with FI_REMOTE_CQ_DATA, I observe that even after the corresponding entry has been read from the target CQ, the target memory is not in the expected state.

Would you have any idea how to move forward with this? (I don't think that downgrading is an option for the moment.)

@ghost

ghost commented Dec 4, 2023

Please reach out to Nvidia support. The problem sounds severe so most likely it is already fixed.

@thomasgillis
Contributor Author

thomasgillis commented Dec 8, 2023

I was able to modify a fabtests test to get the following reproducer:

Waiting for CQ data from client
data[0] = 1
data[1] = 2
data[2] = 3
data[3] = 4
Posting write with CQ data: 0x89abcdef
sending 1
sending 2
len of RMA = 8, offset = 0x240b040, data = 0
sending 3
sending 4
len of RMA = 8, offset = 0x240b048, data = 1
Done
received 1 cq-data: 1/2: len = 8, data=0, buf=(nil)
received 1 cq-data: 2/2: len = 8, data=1, buf=(nil)
fi_cq_data_entry.len verify: success
error, value of 0x240b040 + 2 =0x240b048 data[2] should be 3 instead of 8064

Here is the patch that produces it: thomasgillis@c0fad63

@shefty could you take a look at the modified test and confirm that it is a valid test for CQ data? I plan to submit it with the issue to NVIDIA.
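
For readers who do not want to open the patch, the check that fails in the output above can be sketched in plain C. This is only an illustration and not the code from the patch: the names `write_offset`, `first_mismatch` and `report` are hypothetical, and the 32-bit element size is an assumption inferred from the log (each write has `len = 8` and carries two values, and `0x240b040 + 2` lands on `0x240b048`). The sketch does not call libfabric; it only shows how the buffer is split into writes and how it is verified once all completions have been read.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed layout, inferred from the log: four 32-bit values sent as
 * two RMA writes of 8 bytes (two values) each. */
#define N_VALUES      4
#define VALUES_PER_WR 2

/* Byte offset of write number `wr` (0-based) from the start of the buffer. */
size_t write_offset(size_t wr)
{
    return wr * VALUES_PER_WR * sizeof(uint32_t);
}

/* Return the index of the first element that differs from the expected
 * pattern 1, 2, ..., n, or -1 if the whole buffer matches.
 * Intended to be called only after every expected completion has been
 * read from the target CQ. */
int first_mismatch(const uint32_t *data, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (data[i] != (uint32_t)(i + 1))
            return (int)i;
    }
    return -1;
}

/* Print a short report; returns 0 on success and 1 on mismatch. */
int report(const uint32_t *data, size_t n)
{
    int bad = first_mismatch(data, n);

    if (bad < 0) {
        printf("buffer verify: success\n");
        return 0;
    }
    printf("error: data[%d] at %p should be %u instead of %u\n",
           bad, (const void *)&data[bad],
           (unsigned)(bad + 1), (unsigned)data[bad]);
    return 1;
}
```

Under these assumptions, `data[2]` is the first value of the second write (byte offset 8), so the failure in the log would mean that the payload of the second write was not yet visible when its CQ entry had already been read.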

@shefty
Member

shefty commented Dec 12, 2023

@thomasgillis, I recommend creating a reproducer that calls libibverbs directly. It's difficult to read the fabtests code and determine if it's correct because of the internal fabtests abstractions.

@j-xiong removed the bug label May 16, 2024