Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC - Expose send params driver bypass #75

Open
wants to merge 5 commits into
base: expose_send_params
Choose a base branch
from

Conversation

yanivbl6
Copy link

This code should be compared with the upstream/expose_send_params branch

The goal of those commits is to allow the changes in the "Expose send params" branch to work without the required changes to the mlx5 driver. This is done by using direct verbs to extract driver info.

The method to bypass the driver is:

  1. Use Direct verbs to create the mlx5dv_qp object after qp creation, from which a pointer to the send queue and the send_queue size can be extracted.
  2. extend the gds qp context to hold additional fields: the send queue pointer and size extracted in the procedure above, and a counter that tracks the wqes being used in the send queue.
  3. The counter is to updated in each call to gds_post_send, by calling gds_report_post.
  4. gds_query_last_info() is implemented using the counter, pointer and size of send queue kept in the gds qp context. Only the last work-request can be probed.
  5. number of sges is determined by probing the opcode and ds from the last wqe.Number of BB consumed in the send queue can be checked in the same way but currently is fixed to 1.

I haven't yet managed to run all the tests in gda successfully (with or without this patch), but I was able to run:
Running gds_kernel_latency, peersync, descriptors, RC
Running gds_kernel_latency, peersync, descriptors, GMEM buffers, RC

without errors/hangs, after changing the hard coded number of batches in the test to 1.

@haggaie @bureddy @drossetti @e-ago

configure.ac Outdated Show resolved Hide resolved
@@ -39,6 +39,8 @@
( ((((v) & 0xffff0000U) >> 16) == GDS_API_MAJOR_VERSION) && \
((((v) & 0x0000ffffU) >> 0 ) >= GDS_API_MINOR_VERSION) )

#define IBV_EXP_SEND_GET_INFO (1 << 28)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the code still keep this flag in send_flags? I think it would be better to use a separate field so that future send flags in libibverbs won't conflict with this definition.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the problem here is that gds_send_wr is simply ibv_exp_send_wr.
maybe we want to have a new flags arg for gds_prepare_send().

@@ -159,8 +166,23 @@ typedef enum gds_update_send_info_type {
* Represents a posted send operation on a particular QP
*/

#define GDS_SEND_MAX_SGE 16
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to enforce this limitation? Maybe check it in gds_create_qp?

* Notes:
* - TODO.
*/
int gds_report_post(struct gds_qp *gqp /*, struct gds_send_wr* wr*/);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function documentation is still missing. I guess this function advances the tracking in the gds_qp struct of the current producer index with the given send wr size?

{
gds_dbg("[%s] wr_id=%lx, num_sge=%d\n", func_name, swr_info.wr_id, swr_info.num_sge);

for(int j=0; j < swr_info.num_sge; j++)
{
gds_dbg("[%s] SGE=%d, Size ptr=0x%08x, Size=%d (0x%08x), +offset=%d\n",
gds_dbg("[%s] SGE=%d, Size ptr=00x%lx, Size=%d (0x%08x), +offset=%d\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a typo here (00x), and you might want to put debugging print changes in a separate patch, to make the review easier.

gds_info->sge_list[i].ptr_to_size = (uintptr_t) &(sge->byte_count);
gds_info->sge_list[i].ptr_to_lkey = (uintptr_t) &(sge->key);
gds_info->sge_list[i].ptr_to_addr = (uintptr_t) &(sge->addr);
gds_info->sge_list[i].offset = 0; //why is that here?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does the offset field stand for?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@e-ago do you remember why you had offset in the first place?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where did you find this? Considering file https://github.com/gpudirect/libmlx5/blob/expose_send_params/src/qp.c the offset is modified here
qp->swr_info[qp->cur_swr].sge[qp->swr_info[qp->cur_swr].cur_sge].offset = offset;
and here
swr_info->sge_list[j].offset = qp->swr_info[i].sge[j].offset

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the offset field affecting the data written into the wqe in some way?

src/gdsync.cpp Show resolved Hide resolved
Co-Authored-By: yanivbl6 <[email protected]>
Copy link
Contributor

@drossetti drossetti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as a general preexisting remark, using mlx5dv is narrowing the scope of the whole library both to a single vendor and family of adapters.
Either in this change or in a later change in the expose_send_params branch, I think we need to:

  1. detect mlx5dv and define a macro
  2. implement the new APIs if the macro is defined, and return a clean error otherwise.

Makefile.am Outdated

endif

SUFFIXES= .cu

.cu.o:
$(NVCC) $(CPPFLAGS) $(AM_CPPFLAGS) $(NVCCFLAGS) $(GENCODE_FLAGS) -c -o $@ $<
$(NVCC) $(CPPFLAGS) $(AM_LDFLAGS) $(AM_CPPFLAGS) $(NVCCFLAGS) $(GENCODE_FLAGS) -c -o $@ $<
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-lmlx5 should not be needed here

@@ -39,6 +39,8 @@
( ((((v) & 0xffff0000U) >> 16) == GDS_API_MAJOR_VERSION) && \
((((v) & 0x0000ffffU) >> 0 ) >= GDS_API_MINOR_VERSION) )

#define IBV_EXP_SEND_GET_INFO (1 << 28)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the problem here is that gds_send_wr is simply ibv_exp_send_wr.
maybe we want to have a new flags arg for gds_prepare_send().

gds_info->sge_list[i].ptr_to_size = (uintptr_t) &(sge->byte_count);
gds_info->sge_list[i].ptr_to_lkey = (uintptr_t) &(sge->key);
gds_info->sge_list[i].ptr_to_addr = (uintptr_t) &(sge->addr);
gds_info->sge_list[i].offset = 0; //why is that here?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@e-ago do you remember why you had offset in the first place?

@e-ago
Copy link
Collaborator

e-ago commented Dec 5, 2018

@yanivbl6 May I ask you to run again the gds_kernel_latency, peersync, descriptors, RC and gds_kernel_latency, peersync, descriptors, GMEM buffers, RC tests with the -v (validate) option? The test doesn't complete correctly.
Also, to double check that validation is not working, you can run the libmp/example/mp_sendrecv_stream_exp.cu with -v option (use the gdasync repository to get the correct command line as for libgdsync).

@yanivbl6
Copy link
Author

yanivbl6 commented Dec 6, 2018

I have reproduced the error, and will be looking into it.

Edit:
I am still not sure what the job of the offset is- is it ok to leave it with zero value?

@yanivbl6
Copy link
Author

yanivbl6 commented Dec 9, 2018

I fixed a critical bug (using wrong structure for send wqe), but I still get a validity error on the next iterations.

@e-ago
Copy link
Collaborator

e-ago commented Dec 10, 2018

I ran again the tests. When running gds_kernel_latency -v -U -I -k 2 -E I get this error

validation check failed index: 0 expected: 7 actual: 8 iteration
[5761] ERR:   main [0] post_work error (-1) rcnt=20 n_post=20 routs=40

while I there is no error when running gds_kernel_latency -v -U -I -k 2 . Can you confirm?

@yanivbl6
Copy link
Author

I experienced errors with GPU-Memory as well.

@yanivbl6
Copy link
Author

I've managed to pass the tests successfully by adding a MPI_Barrier(MPI_COMM_WORLD) in the validate section.

I guess there is some race condition but not sure where it is coming from.

@e-ago
Copy link
Collaborator

e-ago commented Dec 19, 2018

Which version of CUDA and NVIDIA driver are you using?
At that point, the cudaDeviceSychronize should be enough to guarantee that all send/recv have been correctly posted and executed (everything happens on the CUDA stream).
Maybe, it would be better to have an MPI_Barrier before the loop for (i = 0; i < posted_recv; ++i).

@yanivbl6
Copy link
Author

I was using CUDA9.0. I think the driver was 384.90, but not sure.
I tried moving the barrier around, but it didn't work on other places. I never tried before the loop though.

I don't have access to GPU servers at the moment, I will try it when it is available again.

I tried understanding why the MPI-Barrier may help but it coudln't. I didn't check the kernel though- is there a way to the dump the cqes polled the GPU?

@e-ago
Copy link
Collaborator

e-ago commented Dec 30, 2018

I deleted the previous comment. I noticed two errors, one related to the original gds_kernel_latency code and one related to my validation piece of code. @yanivbl6 please run again the tests:

The validation error may be related to the fact that all the receive requests posted by pp_post_recv refer to the same rxbuf. The mp_sendrecv_stream_exp seems to correctly work.

As a reminder, I've tested everything using:

  • 2 DGX-1V
  • Ubuntu 16.04
  • CUDA 10.0 with official driver 410.48
  • OFED 4.3 with ConnectX-4 NICs
  • OpenMPI 3.1.3
  • gcc 5.4

@e-ago
Copy link
Collaborator

e-ago commented Jan 28, 2019

@yanivbl6 Please take a look at this PR #78 . Here I've introduced multiple send/recv buffers and reworked the code a bit. I've tested this version of gds_kernel_latency and everything works fine with and without validation, with and without my exp send implementation, host and device memory

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants