powerman: support error diagnostics with setresult #172
I haven't looked in too much detail, but if we're providing a way for the powerman server to send back textual errors (good!) then maybe we should always present them (as received, not summarized) to the user on stderr and not have an option for it? |
I guess my logic was that when doing
The option is simply to avoid changing default behavior compared to what it has been for years. There's also the question of it only working with There really isn't a right or wrong answer on any of this. Just gotta decide the path with more pros than cons. Lemme ping some folks on the admin team, see what their 2 cents are. |
I was hoping with the chassis hierarchy support you added we might have less of those errors on a constant basis! Maybe some example output would help me understand? It seems like sending diags to stderr is pretty typical of unix commands, and if something is really badly broken then lots of output may be appropriate? |
well, some new errors become normalized like "ancestor off".
Good point. As an aside, when sending over this diag output, it is done at the very end, right before the "power status" (error vs non-error) is sent. So that's why it's summary hostrange output. We could alter that of course. |
just one other thought, it is so common for some set of nodes to be down for maintenance, a If we want to make this default to on ... that probably indicates that we should not support with this |
so just summarizing, these are the debate points
A) should this be on or off by default
B) should errors be "collapsed" / "summarized", given they are not streamed as they occur, I think summarized makes more sense.
C) should this work with
EDIT: D) I just realized, it seems that telemetry output is output to stdout by default, so the diagnostic errors were output to stdout too. Do we want to alter this? Have diag errors only go to stderr? If we default this output on, then I guess it should definitely go to stderr. |
I guess one advantage to receiving errors as they occur is that otherwise you're left in the dark when a command that has actually failed is taking a long time and you might want to abort it at the first sign of trouble. Definitely stderr. Edit: I seem to recall being annoyed that powerman is not consistent about its use of stderr so we may need to do an audit at some point and make sure errors go there when appropriate. |
thinking this through via this conversation ... I think we'll go with the following approach.
|
Are you thinking like chassis that are turned off so we can't query them? Yeah, that makes sense. Individual slots and nodes ought to be skipped because we'll know their parent is off. |
nah. In that case everything in the chassis is Like maybe some chunk of hardware is removed / replaced, so it's missing for a bit of time. But no one bothers to update the powerman.conf, so suddenly you get A better example might be some of the test cluster, where the powerman.conf is sort of a collection of the random hardware that could be installed at any point in time in that cluster, but a non-trivial percentage of the time a good chunk of it is missing, getting replaced, getting tweaked, etc. etc. |
ok, so per discussion above I re-pushed with a new implementation no more
the errors at the top are to stderr, the "Command completed with errors" is to stdout. |
That seems good. Any way to improve the errors since it may not be super clear what the parent and child are? Like
or maybe that's a different pr :-) |
I experimented with what you’re talking about, but hit issues … which I figure is for another day, see #153 |
LGTM! I had a bunch of nitpicky changes but you can decide which ones are useful and which ones not. By and large this looks good!
static void _diag_printf(int client_id, const char *fmt, ...)
{
Is there a way this printf-like function could be protected with __attribute__ printf? Maybe it can be added to the typedef? (I don't recall ever doing it with a pointer to a function)
it wasn't there in the telemetry function so just didn't add it :-) I'll add to the telemetry one in an extra commit.
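As a rough illustration of the question above: GCC and Clang accept the printf format attribute on a function-pointer typedef as well as on an ordinary declaration. The sketch below is not powerman's code; `diag_printf_f`, `diag_to_buffer`, `report`, and `last_diag` are made-up names, and only the `(int client_id, const char *fmt, ...)` shape is taken from the diff.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical callback type shaped like _diag_printf: the format string
 * is argument 2 and the variadic arguments start at argument 3. */
typedef void (*diag_printf_f)(int client_id, const char *fmt, ...)
    __attribute__((format(printf, 2, 3)));

/* Last formatted message, kept in a buffer so it can be inspected. */
static char last_diag[128];

/* The same attribute on an ordinary function declaration. */
static void diag_to_buffer(int client_id, const char *fmt, ...)
    __attribute__((format(printf, 2, 3)));

static void diag_to_buffer(int client_id, const char *fmt, ...)
{
    va_list ap;
    int off = snprintf(last_diag, sizeof(last_diag), "[%d] ", client_id);

    va_start(ap, fmt);
    vsnprintf(last_diag + off, sizeof(last_diag) - (size_t)off, fmt, ap);
    va_end(ap);
}

/* Calls made through the typedef'd pointer are format-checked too. */
static void report(diag_printf_f fn, int client_id, const char *plug)
{
    fn(client_id, "%s: %s", plug, "ancestor off");
    /* fn(client_id, "%d", plug);   would warn under -Wformat */
}
```

With the attribute on the typedef, a mismatched call through any pointer of that type is flagged at compile time, not only calls to the named function.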
client_id, arglist);
dpf_fun, client_id, arglist);
whitespace issue there?
in this case intentional. Powerman uses both
function(a, b, c, d
e, f, g);
and
function(a, b, c, d
e, f, g);
styles. I think the latter is the more common approach today. So in the few circumstances where a change half-forced a switch from the former style to the latter, I made the change.
src/powerman/device.c
Outdated
char strbuf[1024 + 1] = {0};
snprintf(strbuf, 1024, "%s", arg->val);
This should do here
char strbuf[1024]; // or 1025 whatever
snprintf(strbuf, sizeof (strbuf), "%s", arg->val);
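snprintf always null-terminates (as long as the size passed is nonzero), so neither the extra byte nor the zero-initializer is needed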
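A minimal sketch of the point being made, using a made-up helper name (`copy_value`): passing `sizeof` of the destination lets `snprintf` both bound the write and terminate the string, and its return value reveals whether truncation happened.

```c
#include <stdio.h>

/* Copy src into a fixed-size buffer; returns 1 if the text was truncated.
 * copy_value and its parameters are illustrative names only. */
static int copy_value(char *dst, size_t dstsize, const char *src)
{
    /* snprintf writes at most dstsize - 1 characters plus a NUL. */
    int n = snprintf(dst, dstsize, "%s", src);

    /* n is the length that would have been written without the limit. */
    return n < 0 || (size_t)n >= dstsize;
}
```

Called with an 8-byte buffer and a longer string, the buffer ends up holding the first 7 characters and a terminating NUL.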
/* remove trailing carriage return or newline */
strbuf[strcspn(strbuf, "\r\n")] = '\0';
Oh, that seems like a clever way to do that.
courtesy of stackoverflow :-)
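For readers unfamiliar with the idiom, here is a small standalone sketch (the `chomp` name is made up). `strcspn` returns the length of the leading run that contains no `'\r'` or `'\n'`, so the assignment is safe even when the string has no line ending at all. Note that it cuts at the first such character, not only at a trailing one.

```c
#include <string.h>

/* Cut the string at the first '\r' or '\n'; a string with neither is
 * left unchanged because strcspn then returns strlen(s). */
static void chomp(char *s)
{
    s[strcspn(s, "\r\n")] = '\0';
}
```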
src/powerman/powerman.c
Outdated
static bool _output_to_stderr(int num)
{
    /* diag output goes to stderr */
    if (num == 309)
        return true;
    return false;
}
Suggestion:
static FILE *getstream (int num)
{
    return (num == 309) ? stderr : stdout;
}
fprintf (getstream (num), "%s\n", buf + 4);
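A self-contained version of that suggestion might look like the sketch below. It assumes each response line has the form of a three-digit code, a space, then text (which is what `buf + 4` implies); `RSP_DIAG`, `stream_for_code`, and `print_response` are hypothetical names, and only the value 309 comes from the diff.

```c
#include <ctype.h>
#include <stdio.h>

#define RSP_DIAG 309  /* hypothetical name; only the number is from the diff */

/* Diagnostic responses go to stderr, everything else to stdout. */
static FILE *stream_for_code(int num)
{
    return (num == RSP_DIAG) ? stderr : stdout;
}

/* Print one response line of the assumed form "NNN text".
 * Returns the numeric code, or -1 if the line is malformed. */
static int print_response(const char *line)
{
    int num;

    if (!isdigit((unsigned char)line[0]) || !isdigit((unsigned char)line[1])
        || !isdigit((unsigned char)line[2]) || line[3] != ' ')
        return -1;
    num = (line[0] - '0') * 100 + (line[1] - '0') * 10 + (line[2] - '0');
    fprintf(stream_for_code(num), "%s\n", line + 4);
    return num;
}
```

Returning the stream directly removes the boolean helper and the branch at the call site, which is the point of the suggestion.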
static int is_bad_plug(int n)
{
    if (bad_plugs_count) {
        int i;
        for (i = 0; i < bad_plugs_count; i++) {
            if (bad_plugs[i] == n)
                return 1;
        }
    }
    return 0;
}
Would an array of flags be better here? Then instead of a function call do
if (bad_plugs[i % NUM_PLUGS]...)
it's a good idea, tried it out, but ended up keeping what I had. I think adding the flags makes the code a tad more confusing for a future developer. I think storing the actual plug makes things more obvious.
After seeing your comment, I even contemplated making the bad_plugs a bitmask. But figured that was overkill.
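To make the trade-off concrete, here is a sketch of the two alternatives side by side. It is not the vpcd code: `NUM_PLUGS` is given an arbitrary value, and `add_bad_plug`, `flag_bad_plug`, `is_flagged`, and `bad_plug_flags` are made-up names; only `bad_plugs`, `bad_plugs_count`, and the lookup loop mirror the diff.

```c
#include <stdbool.h>

#define NUM_PLUGS 16  /* assumed plug count, for illustration only */

/* Variant A (as in the diff): store the bad plug numbers themselves. */
static int bad_plugs[NUM_PLUGS];
static int bad_plugs_count;

/* Hypothetical helper: record one more bad plug, rejecting bad input. */
static bool add_bad_plug(int n)
{
    if (n < 0 || n >= NUM_PLUGS || bad_plugs_count >= NUM_PLUGS)
        return false;
    bad_plugs[bad_plugs_count++] = n;
    return true;
}

static bool is_bad_plug(int n)
{
    for (int i = 0; i < bad_plugs_count; i++) {
        if (bad_plugs[i] == n)
            return true;
    }
    return false;
}

/* Variant B (review suggestion): one flag per plug, indexed by number. */
static bool bad_plug_flags[NUM_PLUGS];

static bool flag_bad_plug(int n)
{
    if (n < 0 || n >= NUM_PLUGS)
        return false;
    bad_plug_flags[n] = true;
    return true;
}

static bool is_flagged(int n)
{
    return n >= 0 && n < NUM_PLUGS && bad_plug_flags[n];
}
```

The flag array makes the lookup constant-time, while the list keeps the configured plug numbers visible as data. With a handful of plugs either is cheap, so readability is a reasonable tiebreaker.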
6b56501 to b35b4ef
re-pushed, with tweaks per comments above. Only notable addition is new commit adding |
Problem: Several printf-like function types do not set the printf format attribute. This prevents the compiler from type checking the arguments passed to those functions. Add the printf format attribute to several printf-like function types.
Problem: When power control to a target fails, there is no way for a user to know why it failed except through the very verbose --telemetry output. When `setresult` is used in a device script, send text indicating why the power operation failed for a specific plug. On the client side, output this text to stderr.
Problem: The --bad-plug option in vpcd cannot be called multiple times to specify multiple bad plugs. Support calling --bad-plug multiple times by putting the bad plugs into an array.
Problem: There is no coverage for the new result diagnostics that can be sent back to the user over stderr. Add tests in new t0036-diagnostics.t file.
rebased, setting MWP ... |
Problem: When power control/query to a target fails, there is no way for a user to know why it failed except through the very verbose --telemetry output.
Add a new --diag option to powerman that tells powermand to send diagnostic information about why a power operation failed. Common errors from the same host will be collapsed into a hostrange. This only works with setplugstate and the new setresult statement.