Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introducing multiple send/recv memory buffers in gds_kernel_latency #78

Open
wants to merge 2 commits into
base: expose_send_params
Choose a base branch
from

Conversation

e-ago
Copy link
Collaborator

@e-ago e-ago commented Jan 28, 2019

@drossetti

  • what's the purpose of rx_flag ? Can I remove it?
  • I removed the pp_post_recv at line 1541. The first pp_post_recv are re-posted in pp_post_work

Copy link
Contributor

@drossetti drossetti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@e-ago can you add a brief bullet list of changes included in this PR?

int exp_send_info;
int validate;
char *validate_buf;
int buf_size;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

white space change: can you not change the alignment of the whole struct ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am still seeing a lot of white space noise, which makes hard to understand which fields are new

int validate;
char *validate_buf;
int buf_size;
int buf_align;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this var name is incorrect, it suggests alignment requirements, but instead it is an aligned size.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you change the name of this member var ?

@@ -218,7 +221,7 @@ static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev, int size,
ctx->txbufexp_lkey = NULL;
ctx->txbufexp_addr = NULL;

size_t alloc_size = max_batch_len * align_to(size + 40, page_size);
size_t alloc_size = max_batch_len * ctx->buf_align;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since you are using memalign/posix_memalign below, why do you need buf_align in the first place?
it should not be needed, as those allocators already provide buffers with the right size.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is from previous version of the code. I'll remove it

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I keep using size + 40?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+40 is the padding required for the UD protocol.
You can removed it if we make sure UD cannot be selected in this test.

@@ -234,15 +237,15 @@ static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev, int size,
ctx->txbufexp_addr = (uintptr_t*)gpu_malloc(page_size, sizeof(uintptr_t)*max_batch_len);
}
} else {
ctx->txbuf = memalign(page_size, ctx->txtot_size); //posix_memalign
ctx->rxbuf = memalign(page_size, ctx->rxtot_size);
assert(0 == posix_memalign((void **)&(ctx->txbuf), page_size, ctx->txtot_size));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why switching from memalign to posix_memalign ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because memalign is obsolete https://linux.die.net/man/3/memalign

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI judging from https://github.com/linux-rdma/rdma-core/blob/1cf909a14b3d07c8a301e3de03bfb91e62aaeff5/libibverbs/examples/ud_pingpong.c#L309 it looks like memalign is still being used.
Here we forked that code, so it is up to us.
Note that switching to posix_memalign() brings a bit different requirements wrt memalign():
"The function posix_memalign() allocates size bytes and places the address of the allocated memory in *memptr. The address of the allocated memory will be a multiple of alignment, which must be a power of two and a multiple of sizeof(void *)."
Also note that buffer size is not checked to be a multiple of alignment.

@@ -234,15 +237,15 @@ static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev, int size,
ctx->txbufexp_addr = (uintptr_t*)gpu_malloc(page_size, sizeof(uintptr_t)*max_batch_len);
}
} else {
ctx->txbuf = memalign(page_size, ctx->txtot_size); //posix_memalign
ctx->rxbuf = memalign(page_size, ctx->rxtot_size);
assert(0 == posix_memalign((void **)&(ctx->txbuf), page_size, ctx->txtot_size));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI judging from https://github.com/linux-rdma/rdma-core/blob/1cf909a14b3d07c8a301e3de03bfb91e62aaeff5/libibverbs/examples/ud_pingpong.c#L309 it looks like memalign is still being used.
Here we forked that code, so it is up to us.
Note that switching to posix_memalign() brings a bit different requirements wrt memalign():
"The function posix_memalign() allocates size bytes and places the address of the allocated memory in *memptr. The address of the allocated memory will be a multiple of alignment, which must be a power of two and a multiple of sizeof(void *)."
Also note that buffer size is not checked to be a multiple of alignment.

ctx->txbuf = memalign(page_size, ctx->txtot_size); //posix_memalign
ctx->rxbuf = memalign(page_size, ctx->rxtot_size);
assert(0 == posix_memalign((void **)&(ctx->txbuf), page_size, ctx->txtot_size));
assert(0 == posix_memalign((void **)&(ctx->rxbuf), page_size, ctx->rxtot_size));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

careful with assert() as it is a no-op if NDEBUG is defined.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather not do this change in this PR, as it is not essential and unrelated to the prototype

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok so I'll come back to memalign removing the posix_memalign

@@ -517,34 +515,32 @@ int pp_close_ctx(struct pingpong_context *ctx)

if(ctx->exp_send_info == 1)
{

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please remove this whitespace change and the others below as they simply add noise

@@ -559,12 +555,16 @@ int pp_close_ctx(struct pingpong_context *ctx)
{
if( ctx->exp_send_info == 1 )
{
free(ctx->txbufexp);
free(ctx->txbufexp);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unneeded whitespace change

free(ctx->txbufexp_size);
free(ctx->txbufexp_lkey);
free(ctx->txbufexp_addr);
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did you move ibv_close_device here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's an oversight, I was debugging a free error

gpu_warn("[%d] Could not post all receive, requested %d, actually posted %d\n", my_rank, max_batch_len, nrecv);
return 1;
}
// int nrecv = pp_post_recv(ctx, max_batch_len);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is pp_post_recv() being called now?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inside pp_post_work before starting the main loop for (i = 0; i < posted_recv; ++i). My initial question was: why there is this additional pp_post_recv outside and before the pp_post_work ?

@e-ago
Copy link
Collaborator Author

e-ago commented Jan 30, 2019

@drossetti Changes in this PR:

  • Use posix_memalign instead of memalign (obsolete) [to be removed]
  • Initial pp_post_recv (before the first pp_post_work) removed. Should I restore it?
  • Instead of using the same buffer for every send/recv, now every send/recv in pp_post_work has its own memory buffer
  • This implies that txbufexp_addris an array of addresses (when enabling exp send feature)

Also: what's the purpose of rx_flag ? Can I remove it?

@e-ago
Copy link
Collaborator Author

e-ago commented Feb 11, 2019

@drossetti ping

int validate;
char *validate_buf;
int buf_size;
int buf_align;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you change the name of this member var ?

int exp_send_info;
int validate;
char *validate_buf;
int buf_size;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am still seeing a lot of white space noise, which makes hard to understand which fields are new

ctx->txbufexp_size = (uint32_t*)gpu_malloc(page_size, sizeof(uint32_t)*max_batch_len);
ctx->txbufexp_lkey = (uint32_t*)gpu_malloc(page_size, sizeof(uint32_t)*max_batch_len);
ctx->txbufexp_addr = (uintptr_t*)gpu_malloc(page_size, sizeof(uintptr_t)*max_batch_len);
ctx->txbufexp = gpu_malloc(page_size, ctx->txtot_size);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for the UD requirement, don't you need +40 even here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose that ctx->txtot_size now includes the additional 40B, right?

ctx->txbufexp_size[i] = ctx->buf_sizeexp;
ctx->txbufexp_lkey[i] = ctx->mrexp->lkey;
ctx->txbufexp_addr[i]=(uintptr_t)(ctx->txbufexp+(i*ctx->size_align));
gpu_info("exp_send_info - hi=%d, ostmem: new tx size: %d instead of %d. New tx addr: %lx instead of %lx\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"ostmem" probably missing an 'h'


for(i=0; i < max_batch_len; i++)
{
if (ctx->gpumem) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it hard to follow the logic here.
Could you explain why this for loop has that if (ctx->gpumem) ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants