UCP/RNDV: Adjust max_frag to be at least of minimal RNDV chunk size #10407

Open
wants to merge 14 commits into base: master

iyastreb (Contributor) commented Jan 7, 2025

What?

This PR addresses an assertion failure described in https://jirasw.nvidia.com/browse/UCX-1054
The assertion fires when running io_demo with CUDA support; currently it is reproducible only on rock05.
Steps to reproduce:

UCX_TLS=rc,cuda UCX_IB_NUM_PATHS=1 UCX_RC_MAX_GET_ZCOPY=32k ./bin/io_demo -d 32768 -p 20000

UCX_TLS=rc,cuda UCX_IB_NUM_PATHS=2 UCX_RC_MAX_GET_ZCOPY=32k ./bin/io_demo -m cuda -i 1 -d 32768 -o read 1.1.60.5:20000

This results in an assertion failure on the client side (if a read operation is requested) or on the server side (with a write operation):

[rock05:2421956:0:2421956] proto_rndv.inl:286  Assertion `max_payload <= lpriv->max_frag' failed: req=0x147ad40 max_payload=16384 max_frag=9728                                                                                    

/labhome/iyastrebov/ws/ucx4/bld-devel/src/ucp/../../../src/ucp/rndv/proto_rndv.inl: [ ucp_proto_rndv_bulk_max_payload() ]                                                                                                          
      ...                                               
      283                   total_length, lpriv->max_frag_sum, max_frag_sum, max_payload);                                                                                                                                         
      284                                               
      285     /* Check that send length is not greater than maximal fragment size */                                                                                                                                               
==>   286     ucs_assertv(max_payload <= lpriv->max_frag,                                                        
      287                 "req=%p max_payload=%zu max_frag=%zu", req, max_payload,                               
      288                 lpriv->max_frag);                                                                      
      289     return max_payload;                       

==== backtrace (tid:2421956) ====                       
 0 0x000000000010574a ucp_proto_rndv_bulk_max_payload()  /labhome/iyastrebov/ws/ucx4/bld-devel/src/ucp/../../../src/ucp/rndv/proto_rndv.inl:286                                                                                    
 1 0x000000000010574a ucp_proto_rndv_bulk_max_payload_align()  /labhome/iyastrebov/ws/ucx4/bld-devel/src/ucp/../../../src/ucp/rndv/proto_rndv.inl:315                                                                              
 2 0x000000000010574a ucp_proto_rndv_get_zcopy_send_func()  /labhome/iyastrebov/ws/ucx4/bld-devel/src/ucp/../../../src/ucp/rndv/rndv_get.c:153                                                                                     
 3 0x00000000001072a4 ucp_proto_multi_progress()  /labhome/iyastrebov/ws/ucx4/bld-devel/../src/ucp/proto/proto_multi.inl:182                                                                                                       
 4 0x00000000001072a4 ucp_proto_multi_zcopy_progress()  /labhome/iyastrebov/ws/ucx4/bld-devel/../src/ucp/proto/proto_multi.inl:251                                                                                                 

Why?

The issue boils down to a corner case in which lpriv->max_frag is smaller than min_rndv_chunk. This happens because of the low max_zcopy limit set by the UCX_RC_MAX_GET_ZCOPY=32k config.

How?

Adjust max_frag to be at least min_rndv_chunk.
Tested with io_demo.
Reproduced the issue with test_ucp_proto_mock_rcx, but I cannot add this unit test here because it requires #10369.
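
A minimal sketch of the adjustment, not the UCX code itself. The helper names are hypothetical, 9728 is the max_frag value from the assertion message above, and treating 16384 as the minimal chunk is an assumption inferred from max_payload=16384 in the same message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: raise max_frag to at least the minimal RNDV chunk */
static size_t example_adjust_max_frag(size_t max_frag, size_t min_rndv_chunk)
{
    return (max_frag > min_rndv_chunk) ? max_frag : min_rndv_chunk;
}

/* Same condition as the failing assertion: the payload must fit a fragment */
static int example_payload_fits(size_t max_payload, size_t max_frag)
{
    return max_payload <= max_frag;
}

/* Walks through the numbers from the log */
static void example_check_log_values(void)
{
    size_t max_frag       = 9728;  /* max_frag from the assertion message */
    size_t min_rndv_chunk = 16384; /* assumed: equals max_payload in the log */

    /* Without the adjustment the payload does not fit */
    assert(!example_payload_fits(min_rndv_chunk, max_frag));

    /* With the adjustment it does */
    max_frag = example_adjust_max_frag(max_frag, min_rndv_chunk);
    assert(example_payload_fits(min_rndv_chunk, max_frag));
}
```

Calling example_check_log_values() passes silently; a max_frag that is already larger than the minimal chunk is returned unchanged.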

@iyastreb iyastreb requested review from tvegas1 and yosefe January 7, 2025 10:49
yosefe (Contributor) commented Jan 7, 2025

Can we reproduce it in gtest?

iyastreb (Contributor Author) commented Jan 7, 2025

> Can we reproduce it in gtest?

Yes. As I wrote in the description, I reproduced it with a mock test,
but I cannot commit this test yet because it depends on #10369, which is not merged.

@@ -382,7 +382,9 @@ ucp_proto_common_get_lane_perf(const ucp_proto_common_init_params_t *params,
&perf_attr.latency) +
params->latency;
tl_perf->sys_latency = 0;
tl_perf->min_length = ucs_max(params->min_length, tl_min_frag);
/* min_length must be within [tl_min_frag, tl_max_frag] range */
Contributor

i guess comment can be removed

Contributor Author

ok

@@ -382,7 +382,9 @@ ucp_proto_common_get_lane_perf(const ucp_proto_common_init_params_t *params,
&perf_attr.latency) +
params->latency;
tl_perf->sys_latency = 0;
tl_perf->min_length = ucs_max(params->min_length, tl_min_frag);
/* min_length must be within [tl_min_frag, tl_max_frag] range */
tl_perf->min_length = ucs_max(ucs_min(params->min_length, tl_max_frag),
Contributor

is this specific range forcing covered by some parameter combination/gtest?

Contributor Author

No, it's not covered. It is just common sense to keep min_length within the HW limit.

Contributor

i think that if params->min_length > tl_max_frag, the protocol should be disabled: it requires a fragment length that is not supported.

Contributor Author

Ok, done

min_rndv_chunk = lane_perf->bandwidth *
context->config.ext.min_rndv_chunk_size /
min_bandwidth;
/* Minimal RNDV chunk must be within [min_length, tl_max_frag] range */
Contributor

maybe something more like: "we still need to operate within iface/hw limits" or something, else it repeats the code a bit

Contributor Author

I'd better remove that comment

/* For RNDV only: max scaled fragment must be at least min_rndv_chunk */
if ((params->super.send_op == UCT_EP_OP_PUT_ZCOPY) ||
(params->super.send_op == UCT_EP_OP_GET_ZCOPY)) {
max_frag = ucs_max(max_frag, min_rndv_chunk);
Contributor

this is the actual fix, right? do we also have a test for that, maybe in mock?

Contributor Author

Yes, this line is the real fix; the other changes are just boundary checks.
Yes, I have a test, but as written in the description I cannot commit it until #10369 is merged.

Contributor

  1. seems weird that proto_multi.c has a specific case for rndv.
  2. we should not increase the size beyond the transport max frag, or the transport may not be able to send it. Maybe in this case it works because UCX_RC_MAX_GET_ZCOPY is not a "real" HW limitation but just a SW config.

Contributor Author

  1. Agree, it looks a bit weird, but it does the right thing: we adjust max_frag only for RNDV, according to the minimal RNDV chunk size.

  2. No, it shouldn't increase max_frag above tl_max_frag, here is the reasoning:
    Initial value of max_frag is capped by tl_max_frag:

        max_frag = ucs_double_to_sizet(lane_perf->bandwidth / max_frag_ratio,
                                       lane_perf->max_frag);

lane_perf->min_length is guaranteed to be within [tl_min_frag, tl_max_frag]
=> min_rndv_chunk is guaranteed to be within [min_length, tl_max_frag]
=> max_frag = ucs_max(max_frag, min_rndv_chunk) can never exceed tl_max_frag
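
The chain of bounds above can be sketched as follows. This is an illustration of the argument as stated in this comment, with hypothetical helper names rather than the UCX functions; it assumes tl_min_frag <= tl_max_frag and that both min_length and the minimal chunk are clamped into range.

```c
#include <assert.h>
#include <stddef.h>

static size_t example_min(size_t a, size_t b) { return (a < b) ? a : b; }
static size_t example_max(size_t a, size_t b) { return (a > b) ? a : b; }

/* Clamp value into [lo, hi]; assumes lo <= hi */
static size_t example_clamp(size_t value, size_t lo, size_t hi)
{
    return example_max(example_min(value, hi), lo);
}

/* Hypothetical model of the bound chain from the comment above */
static size_t example_bounded_max_frag(size_t scaled_frag, size_t raw_min_chunk,
                                       size_t min_length, size_t tl_min_frag,
                                       size_t tl_max_frag)
{
    /* Initial max_frag is capped by tl_max_frag */
    size_t max_frag  = example_min(scaled_frag, tl_max_frag);
    /* min_length is kept within [tl_min_frag, tl_max_frag] */
    size_t length    = example_clamp(min_length, tl_min_frag, tl_max_frag);
    /* The minimal chunk is kept within [min_length, tl_max_frag] */
    size_t min_chunk = example_clamp(raw_min_chunk, length, tl_max_frag);

    /* Taking the maximum of two values <= tl_max_frag cannot exceed it */
    max_frag = example_max(max_frag, min_chunk);
    assert(max_frag <= tl_max_frag);
    return max_frag;
}
```

For example, example_bounded_max_frag(9728, 16384, 0, 1, 32768) raises the fragment to 16384, while a minimal chunk of 1 MiB with the same limits is held at 32768.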

Contributor

  1. Can we move this code to rndv protocol or make it more generic? maybe using a flag in params?

Contributor Author

I'm afraid moving this code elsewhere would be very hard, because other calculations are tightly coupled with max_frag. Maybe in the future we can refactor this function, as it does a lot of things.

Using a flag in params seems a viable option to me, and by the way there are already suitable flags indicating that RNDV is used: UCP_PROTO_COMMON_INIT_FLAG_SEND_ZCOPY and UCP_PROTO_COMMON_INIT_FLAG_RECV_ZCOPY. In the end it will be the same as checking send_op.

Contributor Author

I have refactored the code around min_chunk:

  • A separate function, ucp_proto_multi_get_min_chunk, returns min_rndv_chunk_size for RNDV protocols and 0 otherwise
  • ucp_proto_multi_init is purely generic with respect to min_chunk
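
A rough sketch of that split, under stated assumptions: the flag bits and function names below are stand-ins, not the UCX definitions, and using the zcopy flags to detect RNDV follows the suggestion made earlier in this thread rather than the final code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in flag bits for this sketch; the real values are defined by UCX */
#define EXAMPLE_FLAG_SEND_ZCOPY (1u << 0)
#define EXAMPLE_FLAG_RECV_ZCOPY (1u << 1)

/* Hypothetical helper: minimal chunk for RNDV-style protocols, 0 otherwise */
static size_t example_get_min_chunk(uint32_t flags, size_t min_rndv_chunk_size)
{
    if (flags & (EXAMPLE_FLAG_SEND_ZCOPY | EXAMPLE_FLAG_RECV_ZCOPY)) {
        return min_rndv_chunk_size;
    }
    return 0;
}

/* Generic caller: no RNDV-specific branch, a zero min_chunk changes nothing */
static size_t example_apply_min_chunk(size_t max_frag, size_t min_chunk)
{
    return (max_frag > min_chunk) ? max_frag : min_chunk;
}

static void example_check_min_chunk(void)
{
    /* RNDV-style protocol: the fragment limit is raised to the minimal chunk */
    assert(example_apply_min_chunk(
               9728, example_get_min_chunk(EXAMPLE_FLAG_SEND_ZCOPY, 16384)) == 16384);
    /* Any other protocol: the fragment limit is left as computed */
    assert(example_apply_min_chunk(9728, example_get_min_chunk(0, 16384)) == 9728);
}
```

Returning 0 for non-RNDV protocols keeps the caller free of protocol checks, since taking the maximum with 0 leaves max_frag untouched.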

ucs_debug("protocol %s min_length %zu is larger than lane[%d] max_frag "
"%zu", ucp_proto_id_field(params->super.proto_id, name),
params->min_length, lane, tl_max_frag);
return UCS_ERR_OUT_OF_RANGE;
Contributor

UCS_ERR_INVALID_PARAM - params->min_length is invalid

Contributor Author

ok


@@ -355,6 +355,10 @@ ucp_proto_common_get_lane_perf(const ucp_proto_common_init_params_t *params,

ucp_proto_common_get_frag_size(params, &wiface->attr, lane, &tl_min_frag,
&tl_max_frag);
if (params->min_length > tl_max_frag) {
ucs_debug("params->min_length=%zu is invalid", params->min_length);
Contributor

pls print more details

Contributor Author

ok, I misinterpreted your last comment #10407 (comment)
I thought you were proposing to shorten the error message

{
ucp_context_h context = params->super.super.worker->context;

if (params->super.flags & (UCP_PROTO_COMMON_INIT_FLAG_SEND_ZCOPY |
Contributor

can we take it from params->min_length?

Contributor Author

I'm not sure I understand this comment correctly.

  1. If you are asking whether we can return params->min_length instead of 0 as the default for non-rendezvous protocols, then yes, we can do that; it should not change anything.

  2. If you are asking whether we can reuse the params->min_length field to store min_rndv_chunk for RNDV protocols, I'm not sure we can.
    params->min_length means the minimal message length, and there is a bunch of related calculations that depend on it.
    min_rndv_chunk means something quite different: the minimal allowed chunk size. We don't want to split a message into smaller chunks, but we do allow smaller messages. So if we store the min_rndv_chunk size in params->min_length, we break the meaning of min_length.

Contributor

maybe add param to ucp_proto_multi_init_params_t

Contributor Author

I analysed min_length a bit more, and I'm sure we cannot reuse it for the purpose of min_rndv_chunk.
The reason is that the field already has an established meaning (the minimal message length supported by a protocol) and a corresponding implementation: ucp_proto_init_perf uses this field to define the payload range:

range_start = ucs_max(params->min_length, tl_perf->min_length);

I'm sure we don't want to change this logic.

Regarding the idea of having a single-lane protocol: implementation-wise it might be non-trivial. Currently there is only one probe per protocol (e.g. "rndv/get/zcopy"), and it supports multiple lanes. So a single-lane version of that protocol would require introducing yet another protocol ("rndv/single/get/zcopy"?). IMO that looks like overkill for this task.

I think the best approach would be to extend ucp_proto_multi_init_params_t with one extra field, e.g. min_chunk_size.
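
To illustrate why the two fields should stay separate, here is a sketch with a hypothetical parameter struct (not the real ucp_proto_multi_init_params_t). It assumes min_length only moves the start of the payload range, as in the line quoted above, while the proposed extra field only floors the fragment size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical parameter struct for this sketch only */
typedef struct {
    size_t min_length; /* minimal message length supported by the protocol */
    size_t min_chunk;  /* proposed: do not create fragments smaller than this */
} example_multi_params_t;

/* min_length keeps its meaning: it defines where the payload range starts */
static size_t example_range_start(const example_multi_params_t *params,
                                  size_t tl_min_length)
{
    return (params->min_length > tl_min_length) ? params->min_length :
                                                  tl_min_length;
}

/* min_chunk only affects how large a fragment is allowed to be */
static size_t example_max_frag(const example_multi_params_t *params,
                               size_t scaled_frag)
{
    return (scaled_frag > params->min_chunk) ? scaled_frag : params->min_chunk;
}

static void example_check_params(void)
{
    example_multi_params_t params = {0, 16384};

    /* Messages shorter than min_chunk are still accepted */
    assert(example_range_start(&params, 1) == 1);
    /* A fragment limit below min_chunk is raised to it */
    assert(example_max_frag(&params, 9728) == 16384);
}
```

Storing 16384 in min_length instead would move the range start to 16384 and exclude shorter messages, which is the side effect described above.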

@@ -26,6 +26,7 @@ static void ucp_rndv_am_probe_common(ucp_proto_multi_init_params_t *params)
params->super.exclude_map = 0;
params->super.min_length = 0;
params->super.max_length = SIZE_MAX;
params->super.min_chunk = context->config.ext.min_rndv_chunk_size,
Contributor

we can set it to 0

Contributor Author

ok

test/gtest/ucp/test_ucp_proto_mock.cc (resolved)
src/ucp/proto/proto_common.c (resolved)
Comment on lines 93 to 94
/* Minimal chunk size */
size_t min_chunk;
Contributor

  1. maybe min_chunk should be only for proto_multi? what is the meaning if it is a single protocol and there is no fragmentation into chunks?
  2. maybe rename to min_frag?
  3. i'd extend the documentation to explain exactly what it's doing, for example "do not create fragments smaller than this size"? something that explains the difference between this and the min_length field

Contributor Author

  1. Look, I added the min_chunk field because I was asked to provide a generic implementation of ucp_proto_multi_init that does not check whether the protocol is RNDV or not.
    Moving this field to ucp_proto_multi_priv_t does not make much sense, because:
  • It gets initialized in ucp_proto_multi_init, so we would have to check again whether the protocol is RNDV or not
  • It's not used anywhere apart from ucp_proto_multi_init, so in this case we don't need it
  • There is already a min_frag field over there, with a slightly different meaning though
  2. Then it can easily be confused with the existing min_frag, I think.

  3. Sure, I'll update the documentation

Contributor Author

Ok, I misunderstood your comment, you proposed to add it to ucp_proto_multi_init_params_t which totally makes sense to me, will be fixed

@iyastreb
Contributor Author

/azp run UCX PR


Azure Pipelines successfully started running 1 pipeline(s).

brminich
brminich previously approved these changes Jan 24, 2025
src/ucp/proto/proto_multi.h (outdated, resolved)