-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introducing multiple send/recv memory buffers in gds_kernel_latency #78
base: expose_send_params
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@e-ago can you add a brief bullet list of changes included in this PR?
int exp_send_info; | ||
int validate; | ||
char *validate_buf; | ||
int buf_size; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
white space change: can you not change the alignment of the whole struct ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am still seeing a lot of white space noise, which makes hard to understand which fields are new
tests/gds_kernel_latency.c
Outdated
int validate; | ||
char *validate_buf; | ||
int buf_size; | ||
int buf_align; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this var name is incorrect, it suggests alignment requirements, but instead it is an aligned size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did you change the name of this member var ?
tests/gds_kernel_latency.c
Outdated
@@ -218,7 +221,7 @@ static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev, int size, | |||
ctx->txbufexp_lkey = NULL; | |||
ctx->txbufexp_addr = NULL; | |||
|
|||
size_t alloc_size = max_batch_len * align_to(size + 40, page_size); | |||
size_t alloc_size = max_batch_len * ctx->buf_align; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since you are using memalign/posix_memalign below, why do you need buf_align in the first place?
it should not be needed, as those allocators already provide buffers with the right size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is from previous version of the code. I'll remove it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I keep using size + 40?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+40 is the padding required for the UD protocol.
You can removed it if we make sure UD cannot be selected in this test.
tests/gds_kernel_latency.c
Outdated
@@ -234,15 +237,15 @@ static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev, int size, | |||
ctx->txbufexp_addr = (uintptr_t*)gpu_malloc(page_size, sizeof(uintptr_t)*max_batch_len); | |||
} | |||
} else { | |||
ctx->txbuf = memalign(page_size, ctx->txtot_size); //posix_memalign | |||
ctx->rxbuf = memalign(page_size, ctx->rxtot_size); | |||
assert(0 == posix_memalign((void **)&(ctx->txbuf), page_size, ctx->txtot_size)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why switching from memalign to posix_memalign ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because memalign is obsolete https://linux.die.net/man/3/memalign
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI judging from https://github.com/linux-rdma/rdma-core/blob/1cf909a14b3d07c8a301e3de03bfb91e62aaeff5/libibverbs/examples/ud_pingpong.c#L309 it looks like memalign is still being used.
Here we forked that code, so it is up to us.
Note that switching to posix_memalign() brings a bit different requirements wrt memalign():
"The function posix_memalign() allocates size bytes and places the address of the allocated memory in *memptr. The address of the allocated memory will be a multiple of alignment, which must be a power of two and a multiple of sizeof(void *)."
Also note that buffer size is not checked to be a multiple of alignment.
tests/gds_kernel_latency.c
Outdated
@@ -234,15 +237,15 @@ static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev, int size, | |||
ctx->txbufexp_addr = (uintptr_t*)gpu_malloc(page_size, sizeof(uintptr_t)*max_batch_len); | |||
} | |||
} else { | |||
ctx->txbuf = memalign(page_size, ctx->txtot_size); //posix_memalign | |||
ctx->rxbuf = memalign(page_size, ctx->rxtot_size); | |||
assert(0 == posix_memalign((void **)&(ctx->txbuf), page_size, ctx->txtot_size)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI judging from https://github.com/linux-rdma/rdma-core/blob/1cf909a14b3d07c8a301e3de03bfb91e62aaeff5/libibverbs/examples/ud_pingpong.c#L309 it looks like memalign is still being used.
Here we forked that code, so it is up to us.
Note that switching to posix_memalign() brings a bit different requirements wrt memalign():
"The function posix_memalign() allocates size bytes and places the address of the allocated memory in *memptr. The address of the allocated memory will be a multiple of alignment, which must be a power of two and a multiple of sizeof(void *)."
Also note that buffer size is not checked to be a multiple of alignment.
tests/gds_kernel_latency.c
Outdated
ctx->txbuf = memalign(page_size, ctx->txtot_size); //posix_memalign | ||
ctx->rxbuf = memalign(page_size, ctx->rxtot_size); | ||
assert(0 == posix_memalign((void **)&(ctx->txbuf), page_size, ctx->txtot_size)); | ||
assert(0 == posix_memalign((void **)&(ctx->rxbuf), page_size, ctx->rxtot_size)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
careful with assert() as it is a no-op if NDEBUG is defined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather not do this change in this PR, as it is not essential and unrelated to the prototype
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok so I'll come back to memalign
removing the posix_memalign
tests/gds_kernel_latency.c
Outdated
@@ -517,34 +515,32 @@ int pp_close_ctx(struct pingpong_context *ctx) | |||
|
|||
if(ctx->exp_send_info == 1) | |||
{ | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please remove this whitespace change and the others below as they simply add noise
tests/gds_kernel_latency.c
Outdated
@@ -559,12 +555,16 @@ int pp_close_ctx(struct pingpong_context *ctx) | |||
{ | |||
if( ctx->exp_send_info == 1 ) | |||
{ | |||
free(ctx->txbufexp); | |||
free(ctx->txbufexp); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unneeded whitespace change
tests/gds_kernel_latency.c
Outdated
free(ctx->txbufexp_size); | ||
free(ctx->txbufexp_lkey); | ||
free(ctx->txbufexp_addr); | ||
} | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why did you move ibv_close_device here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's an oversight, I was debugging a free error
gpu_warn("[%d] Could not post all receive, requested %d, actually posted %d\n", my_rank, max_batch_len, nrecv); | ||
return 1; | ||
} | ||
// int nrecv = pp_post_recv(ctx, max_batch_len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where is pp_post_recv() being called now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inside pp_post_work
before starting the main loop for (i = 0; i < posted_recv; ++i)
. My initial question was: why there is this additional pp_post_recv
outside and before the pp_post_work
?
@drossetti Changes in this PR:
Also: what's the purpose of |
@drossetti ping |
tests/gds_kernel_latency.c
Outdated
int validate; | ||
char *validate_buf; | ||
int buf_size; | ||
int buf_align; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did you change the name of this member var ?
int exp_send_info; | ||
int validate; | ||
char *validate_buf; | ||
int buf_size; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am still seeing a lot of white space noise, which makes hard to understand which fields are new
ctx->txbufexp_size = (uint32_t*)gpu_malloc(page_size, sizeof(uint32_t)*max_batch_len); | ||
ctx->txbufexp_lkey = (uint32_t*)gpu_malloc(page_size, sizeof(uint32_t)*max_batch_len); | ||
ctx->txbufexp_addr = (uintptr_t*)gpu_malloc(page_size, sizeof(uintptr_t)*max_batch_len); | ||
ctx->txbufexp = gpu_malloc(page_size, ctx->txtot_size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for the UD requirement, don't you need +40 even here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose that ctx->txtot_size now includes the additional 40B, right?
ctx->txbufexp_size[i] = ctx->buf_sizeexp; | ||
ctx->txbufexp_lkey[i] = ctx->mrexp->lkey; | ||
ctx->txbufexp_addr[i]=(uintptr_t)(ctx->txbufexp+(i*ctx->size_align)); | ||
gpu_info("exp_send_info - hi=%d, ostmem: new tx size: %d instead of %d. New tx addr: %lx instead of %lx\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"ostmem" probably missing an 'h'
|
||
for(i=0; i < max_batch_len; i++) | ||
{ | ||
if (ctx->gpumem) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find it hard to follow the logic here.
Could you explain why this for loop has that if (ctx->gpumem) ?
@drossetti
rx_flag
? Can I remove it?pp_post_recv
at line 1541. The firstpp_post_recv
are re-posted inpp_post_work