Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

powerman: support error diagnostics with setresult #172

Merged
merged 4 commits into from
Apr 11, 2024

Conversation

chu11
Copy link
Member

@chu11 chu11 commented Apr 4, 2024

Problem: When power control/query to a target fails, there is no way for a user to know why it failed except through the very verbose --telemetry output.

Add a new --diag to powerman that will inform powermand to send diagnostic information about why a power operation failed. Common errors from the same host will be collapsed into a hostrange. This only works with setplugstate and the new setresult statement.

> pm --diag -1 t[0-15]
t[7,15]: error
Command completed with errors

@garlick
Copy link
Member

garlick commented Apr 8, 2024

I haven't looked in too much detail, but if we're providing a way for the powerman server to send back textual errors (good!) then maybe we should always present them (as received, not summarized) to the user on stderr and not have an option for it?

@chu11
Copy link
Member Author

chu11 commented Apr 8, 2024

(as received, not summarized) to the user on stderr

I guess my logic was that when doing pm --on elcap[0-12000], it could up being many many lines. So we don't want to do that?

and not have an option for it?

The option is simply to avoid changing default behavior compared to what it has been for years.

There's also the question of it only working with -q in most legacy devices (working w/ setplugstate) while only working with on/off/query in the newest devices.

There really isn't a right or wrong answer on any of this. Just gotta decide the path with more pros than cons. Lemme ping some folks on the admin team, see what their 2 cents are.

@garlick
Copy link
Member

garlick commented Apr 8, 2024

I was hoping with the chassis hierarchy support you added we might have less of those errors on a constant basis!

Maybe some example output would help me understand? It seems like sending diags to stderr is pretty typical of unix commands, and if something is really badly broken then lots of output may be appropriate?

@chu11
Copy link
Member Author

chu11 commented Apr 8, 2024

I was hoping with the chassis hierarchy support you added we might have less of those errors on a constant basis!

well, some new errors become normalized like "ancestor off".

if something is really badly broken then lots of output may be appropriate?

Good point.

As an aside, when sending over this diag output, it is done during the very end right before the "power status" (error vs non-error) is send. So that's why it's summary hostrange output. We could alter that of course.

@chu11
Copy link
Member Author

chu11 commented Apr 8, 2024

just one other thought, it is so common for some set of nodes to be down for maintenace, a pm -q would suddenly start spewing to stderr why the power status of a bunch of stuff is always unknown. That's probably not desirable.

If we want to make this default to on ... that probably indicates that we should not support with this --query, it should only work with the new setresult directive. So it really only works with --on and --off.

@chu11
Copy link
Member Author

chu11 commented Apr 8, 2024

so just summarizing, these are the debate points

A) should this be on or off by default

B) should errors be "collapsed" / "summarized", given they are not streamed as they occur, I think summarized makes more sense.

C) should this work with --query (i.e. setplugstate) or only with the new setresult (mostly --on and --off). if we default it to on, I think the answer is only with setresult, b/c default output of all many devices would change.

EDIT:

D) I just realized, it seems that telemetry output is output to stdout by default, so the diagnostic errors were output to stdout too. Do we want to alter this? Have diag errors only go to stderr? if we default this output on, then I guess it should definitely go to stderr.

@garlick
Copy link
Member

garlick commented Apr 8, 2024

I guess one advantage to receiving errors as they occur is that you're left in the dark when a command that has has actually failed is taking a long time and you might want to abort it at the first sign of trouble.

Definitely stderr.

Edit: I seem to recall being annoyed that powerman is not consistent about its use of stderr so we may need to do an audit at some point and make sure errors go there when appropriate.

@chu11
Copy link
Member Author

chu11 commented Apr 8, 2024

thinking this through via this conversation ... I think we'll go with the following approach.

  • diag output only works with setresult, thus --on and --off... it will be too jarring if it shows up a ton in --query, especially with older devices

  • since it only works with --on and --off, it can be enabled by default

  • we'll stream it line by line ... although not sure how to do that at the moment. I assume gotta see how the telemetry data is streamed over, i assume it'll be similar.

@garlick
Copy link
Member

garlick commented Apr 8, 2024

just one other thought, it is so common for some set of nodes to be down for maintenace, a pm -q would suddenly start spewing to stderr why the power status of a bunch of stuff is always unknown. That's probably not desirable.

Are you thinking like chassis that are turned off so we can't query them? Yeah, that makes sense. Indiivdual slots, and nodes ought to be skipped because we'll know their parent is off.

@chu11
Copy link
Member Author

chu11 commented Apr 8, 2024

Are you thinking like chassis that are turned off so we can't query them?

nah. In that case everything in the chassis is off, so query would work.

Like maybe some some chunk of hardware is removed / replaced, so it's missing for a bit of time. But no one bothers to update the powerman.conf, so suddenly you get network error for all the hardware and status of unknown for that hardware.

A better example might be some of the test cluster, were the powerman.conf is sort of a collection of the random hardware that could be installed at any point in time in that cluster, but a non-trivial percentage of the time a good chunk of it is missing, getting replaced, getting tweaked, etc. etc.

@chu11
Copy link
Member Author

chu11 commented Apr 9, 2024

ok, so per discussion above I re-pushed with a new implementation

no more --diag option, diagnostic output is one per line, to stderr, by default. example from a test.

> powerman -h $testaddr -1 blade[0-7],cmm0,perif[0-7],t[0-15]
cmm0: cannot turn on parent and child
blade0: cannot turn on parent and child
blade1: cannot turn on parent and child
blade2: cannot turn on parent and child
blade3: cannot turn on parent and child
blade4: cannot turn on parent and child
blade5: cannot turn on parent and child
blade6: cannot turn on parent and child
blade7: cannot turn on parent and child
....
Command completed with errors

the errors at the top are to stderr, the "Command completed with errors" is to stdout.

@chu11 chu11 changed the title powerman: support new --diag option powerman: support error diagnostics with setresult Apr 9, 2024
@garlick
Copy link
Member

garlick commented Apr 9, 2024

That seems good. Any way to improve the errors since it may not be super clear what the parent and child are? Like

blade0: turn on chassis0 first
blade1: turn on cahssis1 first
...

or maybe that's a different pr :-)

@chu11
Copy link
Member Author

chu11 commented Apr 9, 2024

I experimented with what you’re talking about, but hit issues … which I figure is for another day, see #153

Copy link
Member

@garlick garlick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I had a bunch of nitpicky changes but you can decide which ones are useful and which ones not. By and large this looks good!

Comment on lines +676 to +677
static void _diag_printf(int client_id, const char *fmt, ...)
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way this printf like function could be protected with __attribute__ printf? Maybe it can be added to the typedef? (I don't recall ever doing it with a pointer to a function)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it wasn't there in the telemetry function so just didn't add it :-) I'll add to the telemetry one in an extra commit.

Comment on lines -683 to +688
client_id, arglist);
dpf_fun, client_id, arglist);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whitespace issue there?

Copy link
Member Author

@chu11 chu11 Apr 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this case intentional. Powerman uses both

function(a, b, c, d
  e, f, g);

and

function(a, b, c, d
         e, f, g);

styles. I think the latter is the more common approach today. So in the few circumstances that a change half-forced a change from the former style to the latter, i made the change.

Comment on lines 1182 to 1183
char strbuf[1024 + 1] = {0};
snprintf(strbuf, 1024, "%s", arg->val);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should do here

char strbuf[1024]; // or 1025 whatever
snprintf(strbuf, sizeof (strbuf), "%s", arg->val);

snprintf always null terminates

Comment on lines +1184 to +1185
/* remove trailing carriage return or newline */
strbuf[strcspn(strbuf, "\r\n")] = '\0';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, that seems like a clever way to do that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

courtesy of stackoverflow :-)

Comment on lines 409 to 414
static bool _output_to_stderr(int num)
{
/* diag output goes to stderr */
if (num == 309)
return true;
return false;
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion:

static FILE *getstream (int num)
{
    return (num == 309) ? stderr : stdout
}
fprintf (getstream (num), "%s\n", buf + 4)

Comment on lines +257 to +268
static int is_bad_plug(int n)
{
if (bad_plugs_count) {
int i;
for (i = 0; i < bad_plugs_count; i++) {
if (bad_plugs[i] == n)
return 1;
}
}
return 0;
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would an array of flags be better here? Then instead of a function call do

if (bad_plugs[i % NUM_PLUGS]...)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a good idea, tried it out, but ended up keeping what I had. I think adding the flags makes the code a tad more confusing for a future developer. I think storing the actual plug makes things more obvious.

After seeing your comment, I even contemplated making the bad_plugs a bitmask. But figured that was overkill.

@chu11 chu11 force-pushed the powerman_diag branch 2 times, most recently from 6b56501 to b35b4ef Compare April 10, 2024 22:30
@chu11
Copy link
Member Author

chu11 commented Apr 10, 2024

re-pushed, with tweaks per comments above. Only notable addition is new commit adding __attribute__ ((format (printf ...))) to several typedefs.

chu11 added 4 commits April 10, 2024 18:40
Problem: Several printf-like function types do set the printf
format attribute.  This doesn't allow modern type checking to be
done on those functions.

Add print attribute format to several printf-like function types.
Problem: When power control to a target fails, there is no way for
a user to know why it failed except through the very verbose
--telemetry output.

When `setresult` is used in a device script, send text indicating
why the power operation failed to a specific plug.  From the client side,
have this be output to stderr.
Problem: The --bad-plug option in vpcd cannot be called multiple
times to specify multiple bad plugs.

Support calling --bad-plug multiple times by putting the bad plugs
into an array.
Problem: There is no coverage for the new result diagnostics
that can be sent back to the user over stderr.

Add tests in new t0036-diagnostics.t file.
@chu11
Copy link
Member Author

chu11 commented Apr 11, 2024

rebased, setting MWP ...

@mergify mergify bot merged commit 598d70d into chaos:master Apr 11, 2024
8 checks passed
@chu11 chu11 deleted the powerman_diag branch April 11, 2024 02:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants