Introducing multiple send/recv memory buffers in gds_kernel_latency #78

e-ago · 2019-01-28T15:20:37Z

What's the purpose of rx_flag? Can I remove it?
I removed the pp_post_recv at line 1541. The initial receives are now posted inside pp_post_work.

drossetti

@e-ago can you add a brief bullet list of changes included in this PR?

drossetti · 2019-01-28T20:04:11Z

tests/gds_kernel_latency.c

-        int                      exp_send_info;
-        int                      validate;
-        char                     *validate_buf;
+        int                     buf_size;


Whitespace change: can you avoid changing the alignment of the whole struct?

I am still seeing a lot of whitespace noise, which makes it hard to understand which fields are new.

drossetti · 2019-01-28T20:09:08Z

tests/gds_kernel_latency.c

-        int                      validate;
-        char                     *validate_buf;
+        int                     buf_size;
+        int                     buf_align;


I think this variable name is incorrect: it suggests an alignment requirement, but it is actually an aligned size.

Did you change the name of this member variable?

drossetti · 2019-01-28T20:10:36Z

tests/gds_kernel_latency.c

@@ -218,7 +221,7 @@ static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev, int size,
        ctx->txbufexp_lkey      = NULL;
        ctx->txbufexp_addr      = NULL;

-        size_t alloc_size = max_batch_len * align_to(size + 40, page_size);
+        size_t alloc_size = max_batch_len * ctx->buf_align;


since you are using memalign/posix_memalign below, why do you need buf_align in the first place?
it should not be needed, as those allocators already provide buffers with the right size.

This is left over from a previous version of the code. I'll remove it.

Should I keep using size + 40?

The +40 is the padding required by the UD protocol.
You can remove it if we make sure UD cannot be selected in this test.
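To make the size arithmetic in this thread concrete, here is a minimal sketch of how a per-buffer stride and the total allocation size could be computed. The `align_to` body, `UD_PADDING`, `slot_stride` and `total_alloc_size` are illustrative assumptions, not the code in tests/gds_kernel_latency.c; the 40 bytes are assumed to be the room a UD receive needs in front of the payload (to my understanding, the Global Routing Header).

```c
#include <stddef.h>

/* Assumption: extra bytes reserved per buffer for the UD protocol. */
enum { UD_PADDING = 40 };

/* Hypothetical helper: round value up to the next multiple of alignment.
 * alignment must be nonzero. */
static size_t align_to(size_t value, size_t alignment)
{
    return ((value + alignment - 1) / alignment) * alignment;
}

/* Distance in bytes between two consecutive per-message buffers. */
static size_t slot_stride(size_t payload, size_t page_size)
{
    return align_to(payload + UD_PADDING, page_size);
}

/* Total bytes needed for batch_len buffers laid out back to back. */
static size_t total_alloc_size(size_t batch_len, size_t payload, size_t page_size)
{
    return batch_len * slot_stride(payload, page_size);
}
```

Note that a payload of exactly one page spills into a second page once the padding is added, which is why dropping the +40 is only safe if UD really cannot be selected.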

drossetti · 2019-01-28T20:11:04Z

tests/gds_kernel_latency.c

@@ -234,15 +237,15 @@ static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev, int size,
                        ctx->txbufexp_addr = (uintptr_t*)gpu_malloc(page_size, sizeof(uintptr_t)*max_batch_len);
                }
        } else {
-                ctx->txbuf = memalign(page_size, ctx->txtot_size); //posix_memalign
-                ctx->rxbuf = memalign(page_size, ctx->rxtot_size);
+                assert(0 == posix_memalign((void **)&(ctx->txbuf), page_size, ctx->txtot_size));


Why switch from memalign to posix_memalign?

Because memalign is obsolete: https://linux.die.net/man/3/memalign

FYI, judging from https://github.com/linux-rdma/rdma-core/blob/1cf909a14b3d07c8a301e3de03bfb91e62aaeff5/libibverbs/examples/ud_pingpong.c#L309 it looks like memalign is still being used.
We forked that code here, so it is up to us.
Note that switching to posix_memalign() brings slightly different requirements with respect to memalign():
"The function posix_memalign() allocates size bytes and places the address of the allocated memory in *memptr. The address of the allocated memory will be a multiple of alignment, which must be a power of two and a multiple of sizeof(void *)."
Also note that the buffer size is not checked to be a multiple of the alignment.
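The alignment constraints quoted above can be checked up front. The following is a minimal sketch, not code from this PR: `is_valid_posix_alignment`, `is_aligned` and `alloc_aligned` are hypothetical helper names, and only `posix_memalign()` itself is a real API.

```c
#define _POSIX_C_SOURCE 200112L
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns 1 if alignment is a power of two and a multiple of sizeof(void *),
 * which is what posix_memalign() requires; 0 otherwise. */
static int is_valid_posix_alignment(size_t alignment)
{
    return alignment != 0
        && (alignment & (alignment - 1)) == 0
        && alignment % sizeof(void *) == 0;
}

/* Returns 1 if ptr is a multiple of alignment (alignment must be nonzero). */
static int is_aligned(const void *ptr, size_t alignment)
{
    return ((uintptr_t)ptr % alignment) == 0;
}

/* Hypothetical wrapper: returns an aligned buffer, or NULL on failure.
 * posix_memalign() reports errors through its return value, not errno. */
static void *alloc_aligned(size_t alignment, size_t size)
{
    void *buf = NULL;

    if (!is_valid_posix_alignment(alignment))
        return NULL;
    if (posix_memalign(&buf, alignment, size) != 0)
        return NULL;
    return buf;
}
```

The size argument does not have to be a multiple of the alignment, so a request such as `alloc_aligned(4096, 100)` is accepted; any rounding of the size has to be done by the caller.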

drossetti · 2019-01-30T00:21:25Z

tests/gds_kernel_latency.c

-                ctx->txbuf = memalign(page_size, ctx->txtot_size); //posix_memalign
-                ctx->rxbuf = memalign(page_size, ctx->rxtot_size);
+                assert(0 == posix_memalign((void **)&(ctx->txbuf), page_size, ctx->txtot_size));
+                assert(0 == posix_memalign((void **)&(ctx->rxbuf), page_size, ctx->rxtot_size));


Careful with assert(): it is a no-op if NDEBUG is defined, so the posix_memalign call placed inside it would not be executed at all.

I would rather not make this change in this PR, as it is not essential and is unrelated to the prototype.

OK, so I'll go back to memalign and remove the posix_memalign calls.
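The NDEBUG pitfall raised in this thread can be reproduced in isolation. This is a sketch under assumptions: `alloc_inside_assert` and `alloc_checked` are hypothetical names, and NDEBUG is defined by hand here to simulate a release build.

```c
#define _POSIX_C_SOURCE 200112L
#define NDEBUG                 /* simulate a build with assertions disabled */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Pitfall: the allocation lives inside assert(), so with NDEBUG the whole
 * expression is discarded and buf is never set. */
static void *alloc_inside_assert(size_t alignment, size_t size)
{
    void *buf = NULL;

    (void)alignment;           /* unused once assert() expands to nothing */
    (void)size;
    assert(0 == posix_memalign(&buf, alignment, size));
    return buf;                /* still NULL when NDEBUG is defined */
}

/* Safer pattern: always perform the call, then check its return value. */
static void *alloc_checked(size_t alignment, size_t size)
{
    void *buf = NULL;
    int rc = posix_memalign(&buf, alignment, size);

    if (rc != 0) {
        fprintf(stderr, "posix_memalign failed: %s\n", strerror(rc));
        return NULL;
    }
    return buf;
}

/* Re-enable assert() for any code that follows this sketch. */
#undef NDEBUG
#include <assert.h>
```

With NDEBUG in effect, `alloc_inside_assert(4096, 8192)` returns NULL even though memory is available, while `alloc_checked(4096, 8192)` returns a usable buffer that the caller must free().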

drossetti · 2019-01-30T00:45:31Z

tests/gds_kernel_latency.c

@@ -517,34 +515,32 @@ int pp_close_ctx(struct pingpong_context *ctx)

        if(ctx->exp_send_info == 1)
        {
+


Please remove this whitespace change and the others below, as they simply add noise.

drossetti · 2019-01-30T00:45:55Z

tests/gds_kernel_latency.c

@@ -559,12 +555,16 @@ int pp_close_ctx(struct pingpong_context *ctx)
        {
                if( ctx->exp_send_info == 1 )
                {
-                        free(ctx->txbufexp);
+                        free(ctx->txbufexp);                        


unneeded whitespace change

drossetti · 2019-01-30T00:46:45Z

tests/gds_kernel_latency.c

                        free(ctx->txbufexp_size);
                        free(ctx->txbufexp_lkey);
                        free(ctx->txbufexp_addr);
                }
        }
+


Why did you move ibv_close_device here?

It's an oversight; I was debugging a free error.

drossetti · 2019-01-30T00:54:18Z

tests/gds_kernel_latency.c

-                gpu_warn("[%d] Could not post all receive, requested %d, actually posted %d\n", my_rank, max_batch_len, nrecv);
-                return 1;
-        }
+//        int nrecv = pp_post_recv(ctx, max_batch_len);


Where is pp_post_recv() being called now?

Inside pp_post_work, before starting the main loop: for (i = 0; i < posted_recv; ++i). My initial question was: why is there an additional pp_post_recv outside and before pp_post_work?

e-ago · 2019-01-30T08:41:57Z

@drossetti Changes in this PR:

- Use posix_memalign instead of memalign (obsolete) [to be removed]
- Initial pp_post_recv (before the first pp_post_work) removed. Should I restore it?
- Instead of using the same buffer for every send/recv, every send/recv in pp_post_work now has its own memory buffer
- This implies that txbufexp_addr is an array of addresses (when the exp send feature is enabled)

Also: what's the purpose of rx_flag? Can I remove it?
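The per-operation buffer layout described in this comment can be sketched as one contiguous allocation carved into equally sized slots, with an array holding the start address of each slot. `struct tx_slots`, `tx_slots_init` and `tx_slots_free` are hypothetical, simplified stand-ins for the corresponding fields of the pingpong context; host memory from calloc() is used here, whereas the test may allocate GPU memory instead.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for the tx-side context fields. */
struct tx_slots {
    char      *base;    /* one contiguous allocation holding every slot */
    uintptr_t *addr;    /* addr[i] is the start address of slot i */
    size_t     stride;  /* bytes between consecutive slots */
    int        count;   /* number of slots (the batch length) */
};

/* Allocate count slots of stride bytes each and record every slot address.
 * Returns 0 on success, -1 on invalid arguments or allocation failure. */
static int tx_slots_init(struct tx_slots *s, int count, size_t stride)
{
    if (count <= 0 || stride == 0)
        return -1;

    s->base = calloc((size_t)count, stride);
    s->addr = malloc((size_t)count * sizeof(*s->addr));
    if (s->base == NULL || s->addr == NULL) {
        free(s->base);
        free(s->addr);
        s->base = NULL;
        s->addr = NULL;
        return -1;
    }

    for (int i = 0; i < count; i++)
        s->addr[i] = (uintptr_t)(s->base + (size_t)i * stride);

    s->stride = stride;
    s->count = count;
    return 0;
}

/* Release both allocations and reset the pointers. */
static void tx_slots_free(struct tx_slots *s)
{
    free(s->base);
    free(s->addr);
    s->base = NULL;
    s->addr = NULL;
}
```

Each posted send or receive i would then reference `addr[i]` rather than a single shared buffer, so a message still in flight is not overwritten by the next one.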

e-ago · 2019-02-11T09:46:10Z

@drossetti ping

drossetti · 2019-02-12T00:45:12Z

tests/gds_kernel_latency.c

-                        ctx->txbufexp_size = (uint32_t*)gpu_malloc(page_size, sizeof(uint32_t)*max_batch_len);
-                        ctx->txbufexp_lkey = (uint32_t*)gpu_malloc(page_size, sizeof(uint32_t)*max_batch_len);
-                        ctx->txbufexp_addr = (uintptr_t*)gpu_malloc(page_size, sizeof(uintptr_t)*max_batch_len);
+                        ctx->txbufexp           = gpu_malloc(page_size, ctx->txtot_size);


For the UD requirement, don't you need the +40 here as well?

I suppose that ctx->txtot_size now includes the additional 40 B, right?

drossetti · 2019-02-12T00:48:16Z

tests/gds_kernel_latency.c

+                                ctx->txbufexp_size[i] = ctx->buf_sizeexp;
+                                ctx->txbufexp_lkey[i] = ctx->mrexp->lkey;
+                                ctx->txbufexp_addr[i]=(uintptr_t)(ctx->txbufexp+(i*ctx->size_align));
+                                gpu_info("exp_send_info - hi=%d, ostmem: new tx size: %d instead of %d. New tx addr: %lx instead of %lx\n", 


"ostmem" is probably missing an 'h'.

drossetti · 2019-02-12T00:54:09Z

tests/gds_kernel_latency.c

-
+                for(i=0; i < max_batch_len; i++)
+                {
+                        if (ctx->gpumem) {


I find it hard to follow the logic here.
Could you explain why this for loop contains that if (ctx->gpumem) check?

Use multiple send/recv memory buffers. Minor fixes.

d21ad81

e-ago requested a review from drossetti January 28, 2019 15:20

e-ago mentioned this pull request Jan 28, 2019

RFC - Expose send params driver bypass #75

Open

drossetti reviewed Jan 28, 2019

drossetti requested changes Jan 30, 2019

Back to memalign, minor fixes

b7e7f7f

drossetti requested changes Feb 12, 2019
