redfishpower: single device specification for a chassis #126

Closed
chu11 opened this issue Feb 7, 2024 · 25 comments
Comments

@chu11
Member

chu11 commented Feb 7, 2024

per conversation in #81,

I hadn't realized that the redfish "device" specification only defines one plug. Could we solve the above by defining a device spec for one chassis and then do plug substitution in the URIs?

OH I just realized the hostnames are the plugs. Well, then the hostname's index in the hostlist for that chassis?

@chu11
Member Author

chu11 commented Feb 7, 2024

OH I just realized the hostnames are the plugs. Well, then the hostname's index in the hostlist for that chassis?

Hmmmm, I suppose this is possible. Although ... we get into some hairy stuff b/c I think I've seen some systems where they begin to index at 1 instead of 0. So now we need a special config for that.

I'm wondering if a giant config of "these URIs for these nodes" is what is needed.

@garlick
Member

garlick commented Feb 7, 2024

For a single chassis, I think this is fairly trivial - define plug names that are the index (0 or 1 origin, whatever), then do like you did in the httppower example and put the URI in the on/off script, but substitute the plug name using %s.
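As a rough illustration of the %s idea, here is a minimal Python sketch. It is not redfishpower code: the helper names are hypothetical, and the path template is modeled on the Blade0/Blade1 URIs quoted elsewhere in this thread.

```python
# Hypothetical illustration only; redfishpower itself is not written this way.
# Template modeled on the "Chassis/BladeN" URIs quoted in this thread.
ON_PATH_TEMPLATE = "redfish/v1/Chassis/Blade%s/Actions/Chassis.Reset"


def plug_names(count, origin=0):
    """Plug names are plain indices, starting at 0 or 1 as the hardware requires."""
    return [str(i) for i in range(origin, origin + count)]


def uri_for_plug(template, plug):
    """Substitute the plug name for the single %s in the template."""
    return template % (plug,)


for plug in plug_names(8):
    print(plug, uri_for_plug(ON_PATH_TEMPLATE, plug))
```

A chassis that numbers its slots from 1 would only need `plug_names(8, origin=1)`; the template stays the same.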

@garlick
Member

garlick commented Feb 7, 2024

Since that fits so naturally, I have to wonder how many redfishpower instances we could run concurrently if it were one per chassis in a really big system. Example: 8-slot chassis scaled out to 8K nodes would be 1024. Maybe I'll do a quick experiment just for fun.
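The instance count is simple division; a small sketch, assuming one redfishpower coprocess per chassis and rounding up for a partly populated last chassis (the function name is made up for illustration):

```python
def instances_needed(nodes, slots_per_chassis):
    """One coprocess per chassis; ceiling division covers a partly filled chassis."""
    return -(-nodes // slots_per_chassis)


print(instances_needed(8 * 1024, 8))  # 1024
```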

@chu11
Member Author

chu11 commented Feb 7, 2024

Since that fits so naturally, I have to wonder how many redfishpower instances we could run concurrently if it were one per chassis in a really big system. Example: 8-slot chassis scaled out to 8K nodes would be 1024. Maybe I'll do a quick experiment just for fun.

I was a little confused, until it occurred to me: I think you're recommending 1 chassis per redfishpower co-process? B/c I don't think we can specify a hostname and a plug on one powerman.conf line? i.e. something like

node "node1" "redfish1" "pnode1" "1"

can't be done? where the "1" is the "plug suffix" and "pnode1" is the hostname to power control.

@garlick
Member

garlick commented Feb 7, 2024

Yes, but that is OK. You can specify a hostlist as before and just map the plugs in order. iow just specify

device "chassis0" "redfish" "redfishpower -h t[0-7] |&"
node "t[0-7]" "chassis0"

@chu11
Member Author

chu11 commented Feb 7, 2024

device "chassis0" "redfish" "redfishpower -h t[0-7] |&"
node "t[0-7]" "chassis0"

Wait a second, are you specifying the chassis parent here? For the actual URI wouldn't we want something more like

device "chassis0" "redfish" "redfishpower -h t[8-15] |&"
node "t[8-15]" "[0-7]"

where 0-7 are the "plugs"?

@garlick
Member

garlick commented Feb 7, 2024

No, chassis0 is the device name, and the plugs are unspecified in the node line. They are implicitly "[0-7]".
So you can say

node "t[0-7]" "chassis0" "[0-7]"
node "t[8-15]" "chassis1" "[0-7]"

or equivalently

node "t[0-7]" "chassis0"
node "t[8-15]" "chassis1"
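To make the equivalence concrete, here is a toy Python sketch of the positional mapping described above. It is not powerman's parser: `expand` handles only a single `prefix[lo-hi]` range, and the default plug list of "0" through n-1 is an assumption taken from this example.

```python
import re


def expand(hostlist):
    """Expand one simple 'prefix[lo-hi]' range; anything else is a single name.

    Real hostlists support far more (commas, several ranges, zero padding).
    """
    m = re.fullmatch(r"(.*)\[(\d+)-(\d+)\]", hostlist)
    if not m:
        return [hostlist]
    prefix, lo, hi = m.group(1), int(m.group(2)), int(m.group(3))
    return [f"{prefix}{i}" for i in range(lo, hi + 1)]


def map_nodes(nodes, device, plugs=None):
    """Pair each node name with (device, plug) in order, like a 'node' line."""
    names = expand(nodes)
    if plugs is None:
        # Assumption for this sketch: the device defines plugs "0".."n-1".
        plug_list = [str(i) for i in range(len(names))]
    else:
        plug_list = expand(plugs)
    if len(names) != len(plug_list):
        raise ValueError("node and plug counts differ")
    return {name: (device, plug) for name, plug in zip(names, plug_list)}


explicit = map_nodes("t[8-15]", "chassis1", "[0-7]")
implicit = map_nodes("t[8-15]", "chassis1")
print(explicit == implicit)
print(implicit["t8"], implicit["t15"])
```

Either way t8 lands on plug 0 of chassis1 and t15 on plug 7, which is why the plug list can be left off the node line.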

@garlick
Member

garlick commented Feb 7, 2024

I had another idea about the chassis address but wanted to get this point across first.

@chu11
Member Author

chu11 commented Feb 7, 2024

ohhh got it got it ... i was getting confused, yeah, all the URIs go to the same chassis.

Ugh ... maybe my prototype for #81 is a waste now ... maybe this has to be solved first.

@garlick
Member

garlick commented Feb 7, 2024

Since the URI for the chassis power control is probably different from the slots, my thought was to have a special plug name c or something that is just mapped to a different URI than the rest of the plugs in redfishpower. If it's the last plug, e.g. "0", "1", "2", ... "7", "c" then

node "t[0-7],chassis0" "chassis0"
node "t[8-15],chassis1" "chassis1"

or equivalently

node "t[0-7],chassis0" "chassis0" "[0-7,c]"
node "t[8-15],chassis1" "chassis1" "[0-7,c]"

Maybe the "setconfig" stuff at the beginning of the device script could set the config for that special plug, including the hierarchical semantics.
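A minimal Python sketch of the special-plug idea, under stated assumptions: the blade template is modeled on URIs quoted elsewhere in this thread, while the chassis path and the function name are placeholders invented for illustration, not a real endpoint or a redfishpower command.

```python
# Template modeled on the blade URIs quoted in this thread.
BLADE_ON_TEMPLATE = "redfish/v1/Chassis/Blade%s/Actions/Chassis.Reset"

# Placeholder path: the thread does not say what the real chassis URI is.
SPECIAL_PLUG_PATHS = {"c": "redfish/v1/Chassis/EXAMPLE-CHASSIS/Actions/Chassis.Reset"}


def on_path(plug, template=BLADE_ON_TEMPLATE, overrides=SPECIAL_PLUG_PATHS):
    """Use the per-plug override when one exists, else fill in the template."""
    if plug in overrides:
        return overrides[plug]
    return template % (plug,)


for plug in ["0", "7", "c"]:
    print(plug, on_path(plug))
```

The ordinary plugs keep sharing one template, and only the plug named c needs its own entry.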

@garlick
Member

garlick commented Feb 7, 2024

If the URI is different for each blade, are we only talking to the chassis (one IP)?

Do we have an El Cap chassis to poke at? Because if we're only talking to the chassis, we don't care what nodes are in there!

@chu11
Member Author

chu11 commented Feb 7, 2024

If the URI is different for each blade, are we only talking to the chassis (one IP)?

Of the one example I've seen yeah, the host is the same for each of the blades, just the suffix "path" is different (0, 1, 2, .., etc. different in each path).

@garlick
Member

garlick commented Feb 7, 2024

For that type of a chassis I wouldn't think the hierarchical semantics we discussed would be required... The chassis probably remains responsive to queries about the nodes even when off (if it can even be turned off).

@chu11
Member Author

chu11 commented Feb 7, 2024

as we go around in circles on some of this stuff, I'm beginning to think "mega-config file" is the right idea, because there's so many oddball cases with redfish.

  • non-bladed vs bladed
  • no-parents vs parents
  • different URI configuration for parents vs children
  • configuring "set" paths vs using "plugs" for the paths
  • different hardware in same cluster with different schemes
  • different vendors with different schemes in same cluster

i can't help but look at the proliferation of device files as evidence for the need.

@garlick
Member

garlick commented Feb 7, 2024

On the first three items - I think we are zeroing in on how to do this simply without a separate config file. It seems like we have identified two cases that we may care about (but we should verify they really exist):

  • where there is a redfish chassis that you talk to to control the blades
  • where there is a redfish chassis and redfish blades, and when you turn off the chassis, the blades go off and potentially can no longer be contacted

Set vs plugs isn't an either or thing. You can set a URI template and then still substitute plugs.

On the last two items - this is what powerman does best. You can mix and match different schemes in one config. The device scripts provide the abstraction, and then you map "plugs" in each device to hostnames in the main config and powerman provides one interface to the admins.

It would feel like a design failure if we have to introduce a second config file so I think we should keep trying. Let's start by finding out exactly what we're dealing with in El Cap.

@chu11
Member Author

chu11 commented Feb 7, 2024

On the last two items - this is what powerman does best.

The point on the last two items was the potential explosion of device specifications. Unlike previous device files in powerman, it seems that copy & modify the device files is going to be a common pattern with redfish and some of these REST interfaces, as there are quirks in every system. And with blades and parents, we might be introducing additional quirks too. So perhaps a mega config just might be easier overall?

Bullet 3 above is the one that made me go "ugh" the most ... where we are crossing the line into different URI configs for different hosts within a single redfishpower process, so there was this ... "ugh ..."

@garlick
Member

garlick commented Feb 7, 2024

I'm not convinced a new config file is the answer, particularly to this issue. If we could stay focused on this issue, let's look at what the admins had to do on hetchy with the following device script:

redfishpower-cray-olympus-blades.dev

This is apparently for an 8-blade chassis. They cut and pasted the same specification with all its scripts 8 times within the same .dev file and gave each spec's name a suffix like -blade0, -blade1, etc. and they (only) alter the URIs in each one, e.g.

send "setonpath redfish/v1/Chassis/Blade0/Actions/Chassis.Reset {\"ResetType\":\"On\"}\n"
send "setonpath redfish/v1/Chassis/Blade1/Actions/Chassis.Reset {\"ResetType\":\"On\"}\n"
send "setonpath redfish/v1/Chassis/Blade2/Actions/Chassis.Reset {\"ResetType\":\"On\"}\n"
...

Then their config looks like this:

device "redfishpower-blade0" "redfishpower-cray-olympus-blade0" "/usr/sbin/redfishpower -h hetchy-cmm[1-2] |&"
device "redfishpower-blade1" "redfishpower-cray-olympus-blade1" "/usr/sbin/redfishpower -h hetchy-cmm[1-2] |&"
device "redfishpower-blade2" "redfishpower-cray-olympus-blade2" "/usr/sbin/redfishpower -h hetchy-cmm1 |&"
device "redfishpower-blade3" "redfishpower-cray-olympus-blade3" "/usr/sbin/redfishpower -h hetchy-cmm1 |&"

### Login/Compute Blades
node "hetchy-blade1" "redfishpower-blade0" "hetchy-cmm1"
node "hetchy-blade2" "redfishpower-blade1" "hetchy-cmm1"
node "hetchy-blade3" "redfishpower-blade2" "hetchy-cmm1"
node "hetchy-blade4" "redfishpower-blade3" "hetchy-cmm1"
node "hetchy-blade5" "redfishpower-blade0" "hetchy-cmm2"
node "hetchy-blade6" "redfishpower-blade1" "hetchy-cmm2"

So I guess they have two blade chassis, one with 4 blades installed and one with 2. They really had to stand on their heads to get this set up.

IMHO there should have been one device spec for this particular chassis with 8 plugs defined. Then their config would be more intuitive, like this

device "cmm1" "redfishpower-cray-olympus-cmm" "/usr/sbin/redfishpower -h hetchy-cmm1 |&"
device "cmm2" "redfishpower-cray-olympus-cmm" "/usr/sbin/redfishpower -h hetchy-cmm2 |&"

### Login/Compute Blades
node "hetchy-blade[1-4]" "cmm1" "[0-3]"
node "hetchy-blade[5-6]" "cmm2" "[0-1]"

@garlick
Member

garlick commented Feb 7, 2024

Incidentally they have a separate dev specification in another .dev file for the chassis itself:

redfishpower-cray-olympus-cmm.dev

It's another cut & paste, identical to the blades except for the URIs e.g.

send "setonpath redfish/v1/Chassis/Blade0/Actions/Chassis.Reset {\"ResetType\":\"On\"}\n"

Their config is:

device "redfishpower-cmm" "redfishpower-cray-olympus-cmm" "/usr/sbin/redfishpower -h hetchy-cmm[1-2] |&"

### CMMs
node "hetchy-cmm1" "redfishpower-cmm" "hetchy-cmm1"
node "hetchy-cmm2" "redfishpower-cmm" "hetchy-cmm2"

Ideally we would figure out a way to represent the chassis as another plug like c in the single .dev spec proposed above. Then they would not have any new devices, just a node config for the chassis, e.g.

### CMMs
node "hetchy-cmm1" "cmm1" "c"
node "hetchy-cmm2" "cmm2" "c"

Or even combined with the blades, e.g.

node "hetchy-blade[1-4],hetchy-cmm1" "cmm1" "[0-3,c]"
node "hetchy-blade[5-6],hetchy-cmm2" "cmm2" "[0-1,c]"

@garlick
Member

garlick commented Feb 7, 2024

And the entire blade config, with all its internal cut & paste, is cut and pasted into another .dev script for the switches

redfishpower-cray-olympus-switches.dev

In this one the URIs are like

send "setonpath redfish/v1/Chassis/Perif0/Actions/Chassis.Reset {\"ResetType\":\"On\"}\n"
send "setonpath redfish/v1/Chassis/Perif1/Actions/Chassis.Reset {\"ResetType\":\"On\"}\n"
send "setonpath redfish/v1/Chassis/Perif2/Actions/Chassis.Reset {\"ResetType\":\"On\"}\n"
...

There doesn't seem to be chassis control for this one - not sure if that was just an omission or if there really isn't a capability. Anyway, 8 specs could be reduced to 1 with plugs.

@garlick
Member

garlick commented Feb 7, 2024

So in summary I think the path forward is:

  • provide a way to optionally do %s substitution in the URIs within redfishpower
  • provide a way, like an alternate setonpath type command, to associate a special plug with a different URI for chassis support
  • optionally implement power control hierarchy support in redfish (maybe another set command to establish a parent plug) but first check and see if that actually helps with the El Cap stuff and defer if not.
  • build and test new dev scripts for the El Cap hardware
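For the optional hierarchy item, one possible shape is sketched below in Python. Everything here is an assumption made for illustration: the function name, the idea of reporting a child as "off" when its parent is off, and the fake query used to exercise it. It only shows the lookup order, not how redfishpower would implement it.

```python
def effective_status(plug, parent_of, query):
    """Return "on" or "off" for a plug, checking its parent plug first.

    Assumption: if the parent is off, the child is reported off and is not
    contacted, since it may be unreachable while the chassis is down.
    """
    parent = parent_of.get(plug)
    if parent is not None and effective_status(parent, parent_of, query) == "off":
        return "off"
    return query(plug)


# Plugs "0".."7" all hang off the special chassis plug "c".
parent_of = {str(i): "c" for i in range(8)}
state = {"c": "off", "0": "on"}
contacted = []


def fake_query(plug):
    """Stand-in for a real status request; records which plugs were asked."""
    contacted.append(plug)
    return state[plug]


print(effective_status("0", parent_of, fake_query))
print(contacted)
```

With the chassis off, only plug c is queried and blade 0 is reported off without a request being sent to it.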

Edit: look at all the cut & paste this fixes! Does it go a little ways to address your concern

it seems that copy & modify the device files is going to be a common pattern with redfish and some of these REST interfaces, as there are quirks in every system. And with blades and parents, we might be introducing additional quirks too

@chu11
Member Author

chu11 commented Feb 7, 2024

hmmmm, I guess it's just a difference of opinion. In my mind, writing out something like the following would be easier? Now everything is in one place, vs multiple .dev files?

# not blades
[login]
login.hosts = nodes[0-7]
login.auth = ...
login.statpath = ...
login.onpath = ...

[blade]
blade.hosts = nodes[8-1024]
blade.auth = ...
blade.parent = chassis[0-63]
blade.statpath = ...%s...
blade.onpath = ...%s...
blade.chassisstatpath = ...

[chassis]
chassis.hosts = chassis[0-63]
chassis.auth = ...
chassis.statpath = ...
chassis.onpath = ...

[gateway]
gateway.hosts = other_node_type[0-7]
gateway.auth = ...
gateway.statpath = ...
gateway.onpath = ...
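For what it's worth, a layout like this happens to be readable with Python's standard configparser, as the sketch below shows. This is only an illustration of the proposed format, not something powerman or redfishpower supports; the elided "..." values are kept exactly as written above, and `load` is a made-up helper name.

```python
import configparser

# A fragment of the proposed layout, with the elided values left as-is.
SAMPLE = """
# not blades
[login]
login.hosts = nodes[0-7]
login.onpath = ...

[blade]
blade.hosts = nodes[8-1024]
blade.parent = chassis[0-63]
blade.onpath = ...%s...
"""


def load(text):
    """Parse the text into {section: {key: value}}.

    interpolation=None matters here: configparser's default interpolation
    treats '%' specially and would reject a raw '%s' in a value.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}


config = load(SAMPLE)
print(config["blade"]["blade.hosts"])
print(config["blade"]["blade.onpath"] % ("3",))
```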

@chu11
Member Author

chu11 commented Feb 7, 2024

look at all the cut & paste this fixes! Does it go a little ways to address your concern

Yeah. I guess here are just a few concerns:

  • would this approach lead to an unnecessary number of redfishpower co-procs on the system? In my mind, 16-64 is ok, but possibly 1000s?

  • I am also trying to think of systems that we haven't seen yet. Maybe this is me thinking too far ahead for imaginary scenarios we haven't witnessed, but I'm thinking more flexibility would be wise to engineer in now vs later. BUT ... I guess in the worst case, if there are strange systems that arrive in the future, admins could do what they are doing right now (i.e. -h node[0,8,16,32,...] is one kooky config, -h node[1,9,17,33] is another kooky config).

@garlick
Member

garlick commented Feb 7, 2024

I'm not sure there is a problem with 1-2K coprocs or why we need to invest effort or add complexity to avoid it. See #127 - 2048 coprocs (for a fictitious 16K node system) even works in the tiny ci environment.

I am also trying to think of systems that we haven't seen yet. Maybe this is me thinking too far ahead for imaginary scenarios we haven't witnessed, but I'm thinking more flexibility would be wise to engineer in now vs later. BUT ... I guess in the worst case, if there are strange systems that arrive in the future, admins could do what they are doing right now (i.e. -h node[0,8,16,32,...] is one kooky config, -h node[1,9,17,33] is another kooky config).

I'm not sure what you're referring to here. I'd say let's stay focused on use cases we have in front of us (or that we at least can find extant somewhere).

redfishpower is essentially a powerman plugin, so it really should behave like one, not go too far off in the weeds doing its own thing (only as necessary to meet specific objectives).

@chu11
Member Author

chu11 commented Feb 7, 2024

redfishpower is essentially a powerman plugin, so it really should behave like one, not go too far off in the weeds doing its own thing (only as necessary to meet specific objectives).

Good point. In my mind I might have been thinking of it like a separate utility.

@garlick
Member

garlick commented Feb 7, 2024

Specific issues now open (#128 and #129) so let's close this one.
