RFC - Expose send params driver bypass #75

yanivbl6 · 2018-11-13T14:09:12Z

This code should be compared with the upstream/expose_send_params branch.

The goal of these commits is to allow the changes in the "Expose send params" branch to work without the required changes to the mlx5 driver. This is done by using direct verbs to extract the driver info.

The method to bypass the driver is:

Use direct verbs to create the mlx5dv_qp object after QP creation, from which a pointer to the send queue and the send queue size can be extracted.
Extend the gds qp context to hold additional fields: the send queue pointer and size extracted in the step above, and a counter that tracks the WQEs being used in the send queue.
The counter is updated on each call to gds_post_send, by calling gds_report_post.
gds_query_last_info() is implemented using the counter, pointer and size of the send queue kept in the gds qp context. Only the last work request can be probed.
The number of SGEs is determined by probing the opcode and ds fields of the last WQE. The number of basic blocks (BBs) consumed in the send queue could be checked in the same way, but is currently fixed to 1.
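
To make the probing step concrete, here is a minimal sketch that works on a plain byte buffer rather than a real send queue. The struct and function names are hypothetical and are not the ones used in the patch. It assumes 64-byte basic blocks, a control segment whose first big-endian word carries the opcode in its low 8 bits and whose second word carries ds in its low 6 bits (as I understand the mlx5 WQE layout), no inline data, and one BB per WQE as stated above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed constants for this sketch: 64-byte basic blocks and the mlx5
 * opcode values for RDMA write and send. */
#define SQ_BB_SIZE        64u
#define OPCODE_RDMA_WRITE 0x08u
#define OPCODE_SEND       0x0au

/* Hypothetical stand-in for the fields added to the gds qp context. */
struct sq_track {
    uint8_t  *sq_buf;     /* start of the send queue buffer */
    uint32_t  sq_wqe_cnt; /* number of basic blocks, power of two */
    uint32_t  sq_cnt;     /* total basic blocks posted so far */
};

/* Read a big-endian 32-bit word without relying on host byte order. */
static uint32_t load_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Pointer to the last posted WQE, assuming it occupies exactly one BB. */
static uint8_t *last_wqe(const struct sq_track *t)
{
    assert(t->sq_cnt > 0);
    uint32_t idx = (t->sq_cnt - 1) & (t->sq_wqe_cnt - 1);
    return t->sq_buf + (size_t)idx * SQ_BB_SIZE;
}

/* Number of data segments in the WQE, or -1 if the opcode is not handled.
 * ds counts 16-byte segments including the control segment; an RDMA write
 * carries one extra remote-address segment before the data segments. */
static int wqe_num_sge(const uint8_t *wqe)
{
    uint32_t opcode = load_be32(wqe) & 0xffu;
    uint32_t ds = load_be32(wqe + 4) & 0x3fu;
    uint32_t hdr;

    switch (opcode) {
    case OPCODE_SEND:       hdr = 1; break;
    case OPCODE_RDMA_WRITE: hdr = 2; break;
    default:                return -1;
    }
    return ds >= hdr ? (int)(ds - hdr) : -1;
}
```

Masking with `sq_wqe_cnt - 1` only works because the queue size is a power of two; a WQE that spans more than one BB would need the counter to record where the WQE started rather than where it ended.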

I haven't yet managed to run all the tests in gda successfully (with or without this patch), but I was able to run:
Running gds_kernel_latency, peersync, descriptors, RC
Running gds_kernel_latency, peersync, descriptors, GMEM buffers, RC

without errors/hangs, after changing the hard coded number of batches in the test to 1.

@haggaie @bureddy @drossetti @e-ago

configure.ac

haggaie · 2018-11-15T08:05:21Z

include/gdsync/core.h

@@ -39,6 +39,8 @@
    ( ((((v) & 0xffff0000U) >> 16) == GDS_API_MAJOR_VERSION) &&   \
      ((((v) & 0x0000ffffU) >> 0 ) >= GDS_API_MINOR_VERSION) )

+#define IBV_EXP_SEND_GET_INFO (1 << 28)


Does the code still keep this flag in send_flags? I think it would be better to use a separate field so that future send flags in libibverbs won't conflict with this definition.

the problem here is that gds_send_wr is simply ibv_exp_send_wr.
maybe we want to have a new flags arg for gds_prepare_send().
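
One possible shape for the separate flags argument, as a sketch only: the enum, the struct names and `prepare_send_sketch` are hypothetical and do not reflect the actual gds_prepare_send() signature. The point is that the library-private bit lives in its own word, so the verbs send_flags are passed through untouched.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical library-private flags, kept out of the verbs send_flags. */
enum gds_send_extra_flags {
    GDS_SEND_EXTRA_GET_INFO = 1 << 0,
};
#define GDS_SEND_EXTRA_ALL (GDS_SEND_EXTRA_GET_INFO)

/* Stand-in for the verbs work request; only the fields used here. */
struct fake_send_wr {
    uint64_t wr_id;
    uint64_t send_flags;
};

/* Stand-in for the prepared request. */
struct fake_send_request {
    int      want_info;   /* caller asked for send info */
    uint64_t verbs_flags; /* copied verbatim from the work request */
};

static int prepare_send_sketch(const struct fake_send_wr *wr,
                               unsigned extra_flags,
                               struct fake_send_request *req)
{
    assert(wr != NULL && req != NULL);

    /* Reject unknown private flags instead of silently ignoring them. */
    if (extra_flags & ~(unsigned)GDS_SEND_EXTRA_ALL)
        return EINVAL;

    req->want_info = (extra_flags & GDS_SEND_EXTRA_GET_INFO) != 0;
    req->verbs_flags = wr->send_flags;
    return 0;
}
```

With this split, a future libibverbs flag that happens to use bit 28 would not change the meaning of an existing call.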

haggaie · 2018-11-15T08:06:54Z

include/gdsync/core.h

@@ -159,8 +166,23 @@ typedef enum gds_update_send_info_type {
 * Represents a posted send operation on a particular QP
 */

+#define GDS_SEND_MAX_SGE 16


Is there a way to enforce this limitation? Maybe check it in gds_create_qp?
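
A check along these lines could run at QP creation time. This is a sketch under assumptions: `qp_caps_sketch` and `check_send_sge_cap` are hypothetical names, and the real gds_create_qp takes a different attribute structure.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define GDS_SEND_MAX_SGE 16

/* Hypothetical subset of the QP creation capabilities. */
struct qp_caps_sketch {
    uint32_t max_send_wr;
    uint32_t max_send_sge;
};

/* Returns 0 if the requested SGE count fits the fixed-size sge_list,
 * EINVAL otherwise. */
static int check_send_sge_cap(const struct qp_caps_sketch *caps)
{
    assert(caps != NULL);
    if (caps->max_send_sge > GDS_SEND_MAX_SGE)
        return EINVAL;
    return 0;
}
```

Failing early here means the per-send path never has to handle a WQE with more segments than the info array can hold.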

haggaie · 2018-11-15T08:11:45Z

include/gdsync/core.h

+ * Notes:
+ * - TODO.
+ */
+int gds_report_post(struct gds_qp *gqp  /*, struct gds_send_wr* wr*/);


The function documentation is still missing. I guess this function advances the tracking in the gds_qp struct of the current producer index with the given send wr size?
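
If that reading is right, the tracking could look roughly like the sketch below. The names are hypothetical, and it assumes 16-byte segments, 64-byte basic blocks and a power-of-two queue size; the patch itself currently advances by a fixed 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SEG_SIZE 16u /* bytes per WQE segment (assumed) */
#define BB_SIZE  64u /* bytes per send queue basic block (assumed) */

/* Hypothetical producer-side tracking state. */
struct sq_counter_sketch {
    uint32_t wqe_cnt;   /* queue size in basic blocks, power of two */
    uint32_t posted_bb; /* total basic blocks posted so far */
};

/* Basic blocks needed for a WQE made of ds 16-byte segments. */
static uint32_t bb_for_ds(uint32_t ds)
{
    return (ds * SEG_SIZE + BB_SIZE - 1) / BB_SIZE;
}

/* Record one posted WQE and return the slot index where it started. */
static uint32_t report_post_sketch(struct sq_counter_sketch *c, uint32_t ds)
{
    assert(c != NULL && c->wqe_cnt != 0 &&
           (c->wqe_cnt & (c->wqe_cnt - 1)) == 0);
    uint32_t slot = c->posted_bb & (c->wqe_cnt - 1);
    c->posted_bb += bb_for_ds(ds);
    return slot;
}
```

Returning the starting slot keeps the "where is the last WQE" question answerable even when a WQE spans several basic blocks.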

haggaie · 2018-11-15T08:14:23Z

src/apis.cpp

 {
    gds_dbg("[%s] wr_id=%lx, num_sge=%d\n", func_name, swr_info.wr_id, swr_info.num_sge);

    for(int j=0; j < swr_info.num_sge; j++)
    {
-        gds_dbg("[%s]    SGE=%d, Size ptr=0x%08x, Size=%d (0x%08x), +offset=%d\n", 
+        gds_dbg("[%s]    SGE=%d, Size ptr=00x%lx, Size=%d (0x%08x), +offset=%d\n",


There's a typo here (00x), and you might want to put debugging print changes in a separate patch, to make the review easier.
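
For reference, a portable way to print a pointer-sized value is the PRIxPTR macro from inttypes.h, which avoids depending on the width of long. The helper name below is hypothetical, and the sketch assumes the field is a uintptr_t, as the casts elsewhere in the patch suggest.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Format one SGE debug line; returns the snprintf() result. */
static int format_sge_ptr(char *out, size_t len, int sge,
                          uintptr_t ptr_to_size)
{
    assert(out != NULL && len > 0);
    return snprintf(out, len, "SGE=%d, Size ptr=0x%" PRIxPTR,
                    sge, ptr_to_size);
}
```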

haggaie · 2018-11-15T08:16:50Z

src/apis.cpp

+        gds_info->sge_list[i].ptr_to_size = (uintptr_t) &(sge->byte_count);
+        gds_info->sge_list[i].ptr_to_lkey = (uintptr_t) &(sge->key);
+        gds_info->sge_list[i].ptr_to_addr = (uintptr_t) &(sge->addr);
+        gds_info->sge_list[i].offset = 0; //why is that here?


what does the offset field stand for?

@e-ago do you remember why you had offset in the first place?

Where did you find this? Considering file https://github.com/gpudirect/libmlx5/blob/expose_send_params/src/qp.c the offset is modified here
qp->swr_info[qp->cur_swr].sge[qp->swr_info[qp->cur_swr].cur_sge].offset = offset;
and here
swr_info->sge_list[j].offset = qp->swr_info[i].sge[j].offset

Is the offset field affecting the data written into the wqe in some way?

src/gdsync.cpp

Co-Authored-By: yanivbl6 <[email protected]>

drossetti

as a general preexisting remark, using mlx5dv is narrowing the scope of the whole library both to a single vendor and family of adapters.
Either in this change or in a later change in the expose_send_params branch, I think we need to:

detect mlx5dv and define a macro
implement the new APIs if the macro is defined, and return a clean error otherwise.
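
A rough sketch of the second point, assuming configure defines a macro when the mlx5dv header is found. `HAVE_MLX5DV_SKETCH`, `gds_qp_sketch` and `query_last_info_sketch` are hypothetical names for illustration, not what the branch uses.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical subset of the gds qp context. */
struct gds_qp_sketch {
    void    *sq_buf;
    unsigned sq_wqe_cnt;
};

/* HAVE_MLX5DV_SKETCH is a hypothetical macro that configure would define
 * when direct verbs support is detected. */
static int query_last_info_sketch(struct gds_qp_sketch *gqp)
{
    assert(gqp != NULL);
#ifdef HAVE_MLX5DV_SKETCH
    /* The direct-verbs path would go here; this placeholder only checks
     * that the send queue pointer was extracted. */
    return gqp->sq_buf ? 0 : EINVAL;
#else
    /* Built without direct verbs: fail cleanly instead of failing to link. */
    (void)gqp;
    return ENOTSUP;
#endif
}
```

Callers can then treat ENOTSUP as "feature not compiled in" and fall back or report it, and the library still builds on systems without the mlx5 direct verbs header.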

drossetti · 2018-11-15T22:48:32Z

Makefile.am


 endif

 SUFFIXES= .cu

 .cu.o:
-	$(NVCC) $(CPPFLAGS) $(AM_CPPFLAGS) $(NVCCFLAGS) $(GENCODE_FLAGS) -c -o $@ $<
+	$(NVCC) $(CPPFLAGS) $(AM_LDFLAGS)  $(AM_CPPFLAGS) $(NVCCFLAGS) $(GENCODE_FLAGS) -c -o $@ $<


-lmlx5 should not be needed here

drossetti · 2018-11-15T23:00:50Z

include/gdsync/core.h

@@ -39,6 +39,8 @@
    ( ((((v) & 0xffff0000U) >> 16) == GDS_API_MAJOR_VERSION) &&   \
      ((((v) & 0x0000ffffU) >> 0 ) >= GDS_API_MINOR_VERSION) )

+#define IBV_EXP_SEND_GET_INFO (1 << 28)


the problem here is that gds_send_wr is simply ibv_exp_send_wr.
maybe we want to have a new flags arg for gds_prepare_send().

drossetti · 2018-11-19T18:26:40Z

src/apis.cpp

+        gds_info->sge_list[i].ptr_to_size = (uintptr_t) &(sge->byte_count);
+        gds_info->sge_list[i].ptr_to_lkey = (uintptr_t) &(sge->key);
+        gds_info->sge_list[i].ptr_to_addr = (uintptr_t) &(sge->addr);
+        gds_info->sge_list[i].offset = 0; //why is that here?


@e-ago do you remember why you had offset in the first place?

e-ago · 2018-12-05T14:49:32Z

@yanivbl6 May I ask you to run again the gds_kernel_latency, peersync, descriptors, RC and gds_kernel_latency, peersync, descriptors, GMEM buffers, RC tests with the -v (validate) option? The test doesn't complete correctly.
Also, to double check that validation is not working, you can run the libmp/example/mp_sendrecv_stream_exp.cu with -v option (use the gdasync repository to get the correct command line as for libgdsync).

yanivbl6 · 2018-12-06T12:31:31Z

I have reproduced the error, and will be looking into it.

Edit:
I am still not sure what the offset is for. Is it OK to leave it at zero?

yanivbl6 · 2018-12-09T11:41:37Z

I fixed a critical bug (using wrong structure for send wqe), but I still get a validity error on the next iterations.

e-ago · 2018-12-10T16:26:19Z

I ran again the tests. When running gds_kernel_latency -v -U -I -k 2 -E I get this error

validation check failed index: 0 expected: 7 actual: 8 iteration
[5761] ERR:   main [0] post_work error (-1) rcnt=20 n_post=20 routs=40

while there is no error when running gds_kernel_latency -v -U -I -k 2. Can you confirm?

yanivbl6 · 2018-12-10T19:11:20Z

I experienced errors with GPU-Memory as well.

yanivbl6 · 2018-12-16T14:51:38Z

I've managed to pass the tests successfully by adding a MPI_Barrier(MPI_COMM_WORLD) in the validate section.

I guess there is some race condition but not sure where it is coming from.

e-ago · 2018-12-19T17:07:07Z

Which version of CUDA and NVIDIA driver are you using?
At that point, the cudaDeviceSynchronize should be enough to guarantee that all send/recv have been correctly posted and executed (everything happens on the CUDA stream).
Maybe, it would be better to have an MPI_Barrier before the loop for (i = 0; i < posted_recv; ++i).

yanivbl6 · 2018-12-19T17:38:08Z

I was using CUDA 9.0. I think the driver was 384.90, but I'm not sure.
I tried moving the barrier around, but it didn't work in other places. I never tried it before the loop, though.

I don't have access to GPU servers at the moment; I will try it when they are available again.

I tried to understand why the MPI_Barrier might help, but I couldn't. I didn't check the kernel, though. Is there a way to dump the CQEs polled by the GPU?

e-ago · 2018-12-30T19:56:53Z

I deleted the previous comment. I noticed two errors, one related to the original gds_kernel_latency code and one related to my validation piece of code. @yanivbl6 please run again the tests:

Removing your MPI_Barrier()
Adding an MPI_Barrier here https://github.com/yanivbl6/libgdsync/blob/18c817796d7f222fad714c15fd9b000daf9f694e/tests/gds_kernel_latency.c#L827
Adding an MPI_Barrier here after the cudaMemcpy https://github.com/yanivbl6/libgdsync/blob/18c817796d7f222fad714c15fd9b000daf9f694e/tests/gds_kernel_latency.c#L1108
With and without the -I option, using host and device memory

The validation error may be related to the fact that all the receive requests posted by pp_post_recv refer to the same rxbuf. The mp_sendrecv_stream_exp seems to correctly work.

As a reminder, I've tested everything using:

2 DGX-1V
Ubuntu 16.04
CUDA 10.0 with official driver 410.48
OFED 4.3 with ConnectX-4 NICs
OpenMPI 3.1.3
gcc 5.4

e-ago · 2019-01-28T15:22:52Z

@yanivbl6 Please take a look at PR #78. There I've introduced multiple send/recv buffers and reworked the code a bit. I've tested this version of gds_kernel_latency and everything works fine with and without validation, with and without my exp send implementation, and with host and device memory.

Yaniv Blumenfeld added 2 commits November 11, 2018 08:30

reformatted changes in direct verbs branch to one compact commit

5870739

Fixed the sge count

4f47911

haggaie reviewed Nov 15, 2018

Update configure.ac

263a587

Co-Authored-By: yanivbl6 <[email protected]>

drossetti reviewed Nov 19, 2018

Fixed info extraction, memory leak on error, and build-related comment

612813e

Added MPI Barrier in validation, which allows test to pass

18c8177
