Job list is empty on Slurm 18.08 #17

Open
fleimgruber opened this issue Sep 28, 2023 · 15 comments · May be fixed by #20
Comments

@fleimgruber

As a user, running turm shows the TUI with the 3 main panes, but without any jobs. No key press has a visible effect except q, which quits.

I compiled turm myself and we use Slurm 18.08. Is it maybe a compatibility issue?

@kabouzeid
Owner

It just parses the output of squeue. So if squeue works, so should turm. Not sure what's going on.

@fleimgruber
Author

Can you give me a hint on how to best debug this from turm? I triple-checked that squeue without args gives me a job list, but turm does not. Since I only have a CLI available I tried rust-gdb, but its output interferes with the turm TUI.

@fleimgruber
Author

I am not intending to sound cheeky, but if there were automated tests shipped with turm, I could try running these on the SLURM system...

@kabouzeid
Owner

would love to have tests, but then you have to somehow set up a clean Slurm environment and start dummy jobs there. not sure how to best do that.

this is the part you need to debug:

turm/src/job_watcher.rs

Lines 53 to 133 in f104c7c

let jobs: Vec<Job> = Command::new("squeue")
    .args(&self.squeue_args)
    .arg("--array")
    .arg("--noheader")
    .arg("--Format")
    .arg(&output_format)
    .output()
    .expect("failed to execute process")
    .stdout
    .lines()
    .map(|l| l.unwrap().trim().to_string())
    .filter_map(|l| {
        let parts: Vec<_> = l.split(output_separator).collect();
        if parts.len() != fields.len() + 1 {
            return None;
        }
        let id = parts[0];
        let name = parts[1];
        let state = parts[2];
        let user = parts[3];
        let time = parts[4];
        let tres = parts[5];
        let partition = parts[6];
        let nodelist = parts[7];
        let stdout = parts[8];
        let stderr = parts[9];
        let command = parts[10];
        let state_compact = parts[11];
        let reason = parts[12];
        let array_job_id = parts[13];
        let array_task_id = parts[14];
        let node_list = parts[15];
        let working_dir = parts[16];
        Some(Job {
            job_id: id.to_owned(),
            array_id: array_job_id.to_owned(),
            array_step: match array_task_id {
                "N/A" => None,
                _ => Some(array_task_id.to_owned()),
            },
            name: name.to_owned(),
            state: state.to_owned(),
            state_compact: state_compact.to_owned(),
            reason: if reason == "None" {
                None
            } else {
                Some(reason.to_owned())
            },
            user: user.to_owned(),
            time: time.to_owned(),
            tres: tres.to_owned(),
            partition: partition.to_owned(),
            nodelist: nodelist.to_owned(),
            command: command.to_owned(),
            stdout: Self::resolve_path(
                stdout,
                array_job_id,
                array_task_id,
                id,
                node_list,
                user,
                name,
                working_dir,
            ),
            stderr: Self::resolve_path(
                stderr,
                array_job_id,
                array_task_id,
                id,
                node_list,
                user,
                name,
                working_dir,
            ), // TODO fill all fields
        })
    })
    .collect();
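
One way to get some test coverage without a live Slurm cluster might be to pull the line-splitting step out into a pure function and feed it canned text. The sketch below is an assumption, not code from turm: `split_squeue_line` is a hypothetical helper, and it assumes every field is followed by the separator, which is what the `fields.len() + 1` check above suggests.

```rust
/// Hypothetical helper (not part of turm): splits one line of squeue output
/// on `separator` and returns the field values only when the number of parts
/// matches the expectation.
///
/// Assumption: each field is followed by the separator, so a well-formed
/// line splits into `expected_fields + 1` parts, the last one being empty.
fn split_squeue_line<'a>(
    line: &'a str,
    separator: &str,
    expected_fields: usize,
) -> Option<Vec<&'a str>> {
    let parts: Vec<&str> = line.trim().split(separator).collect();
    if parts.len() != expected_fields + 1 {
        // Same silent rejection as in the snippet above.
        return None;
    }
    // Drop the trailing empty part and keep only the field values.
    Some(parts[..expected_fields].to_vec())
}
```

With a function like this, a unit test can pass in made-up lines such as `"42###turm###train###turm###RUNNING###turm###"` and check the result without calling squeue at all.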

@fleimgruber
Author

fleimgruber commented Oct 5, 2023

True, maybe this could provide a clean environment for testing? https://hub.docker.com/r/hpcnow/slurm_simulator

Failing that I could also see a set of test job definitions maintained here to be run against an existing production Slurm installation that could be used for very basic testing, e.g. a few sleep jobs that print to stdout so that at least parts of the UI are tested.

Regarding the part to debug: I do not yet have a CLI debugging setup for Rust. Another idea that came to mind: there is a feature of other Slurm TUIs to use SSH to connect to a Slurm host so the TUI would run locally and could then be more easily debugged, e.g. visual debugger in VS Code. Did you think about remote Slurm access? Do you have experience with SSH in Rust?

@kabouzeid
Owner

You can use the remote SSH VS Code extension for running and debugging on the slurm host.

@fleimgruber
Author

Thanks for mentioning it, good idea! I tried debugging in VS Code, which tells me to install the LLDB extension. After that, LLDB fails with version `GLIBC_2.18' not found. Slurm is running on CentOS 7, which only has glibc 2.17. I think other Rust dev tools also need at least glibc 2.18? See also rust-lang/rust-analyzer#4706.

@fleimgruber
Author

In the meantime, I would try "printf-debugging", but written to a file because stdout is already taken over by the TUI main loop. I have this template:

let path = "results.txt";
let mut output = File::create(path)?;
let job_command = ...
write!(output, "{}", job_command)

Could you provide guidance on what to insert at ... from jobs to get the full squeue command that will be tried?

@kabouzeid
Owner

Just debug print the Command with

let cmd = Command::new("squeue")
     .args(&self.squeue_args)
     .arg("--array")
     .arg("--noheader")
     .arg("--Format")
     .arg(&output_format);

println!("{:?}", cmd);

@fleimgruber
Author

fleimgruber commented Oct 18, 2023

For me it only works with

let cmd = Command::new("squeue")
      .args(&self.squeue_args)
      .arg("--array")
      .arg("--noheader")
      .arg("--Format")
      .arg(&output_format)
      .output();
println!("{:?}", cmd);

which prints a string with the expected comma-separated fields.
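
A likely reason the builder chain cannot be bound with a bare `let` is that the `Command` builder methods return `&mut Command`, so the chain borrows a temporary that is dropped at the end of the statement, whereas `.output()` yields an owned value. A minimal sketch of an alternative, assuming hypothetical helper names (`build_squeue_command`, `log_command`) and a log path of your choosing, binds the `Command` first and writes its Debug form to a file so the TUI is left alone:

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::process::Command;

/// Hypothetical helper: builds the squeue invocation as an owned `Command`.
/// `Command::new` is bound to a variable before the builder methods are
/// called, so nothing borrows a dropped temporary.
fn build_squeue_command(extra_args: &[String], output_format: &str) -> Command {
    let mut cmd = Command::new("squeue");
    cmd.args(extra_args)
        .arg("--array")
        .arg("--noheader")
        .arg("--Format")
        .arg(output_format);
    cmd
}

/// Hypothetical helper: appends the Debug form of the command to a log file
/// instead of printing to stdout, which the TUI is drawing on.
fn log_command(cmd: &Command, log_path: &str) -> io::Result<()> {
    let mut file = OpenOptions::new()
        .create(true)
        .append(true)
        .open(log_path)?;
    writeln!(file, "{:?}", cmd)
}
```

The logged line can then be compared by hand with what squeue prints when run with the same arguments in a shell.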

@fleimgruber
Author

fleimgruber commented Oct 18, 2023

Ok, I could further narrow it down to this check:

if parts.len() != fields.len() + 1 {

which always evaluates to true so it always returns None and never the Job.

@fleimgruber
Author

fleimgruber commented Oct 18, 2023

And the actual cause I think is that:

let parts: Vec<_> = l.split(output_separator).collect();

does not split at ###turm### because it is not included in the output of squeue.

It seems that the expectation with respect to Slurm output is not met, i.e.:

squeue --array --noheader --Format jobid:###turm###

prints only the job IDs to STDOUT. The man pages of the installed squeue and a newer squeue differ (the - line is from the newer version, the + line from the installed 18.08):

@@ -1 +1 @@
-The format of each field is "type[:[.][size][suffix]]"
\ No newline at end of file
+The format of each field is "type[:[.][size]]"
\ No newline at end of file
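
Since an unsupported suffix currently just produces an empty job list, one option would be to detect the situation explicitly. The following is only a sketch under assumptions: `suffix_honored` is a hypothetical function, and it takes the raw squeue stdout as a string rather than running squeue itself.

```rust
/// Hypothetical check: reports whether the separator suffix shows up in the
/// squeue output.
///
/// Returns `None` when there are no non-empty lines (nothing to judge),
/// `Some(true)` when every non-empty line contains the separator, and
/// `Some(false)` otherwise, which is what the Slurm 18.08 behaviour
/// described above would produce.
fn suffix_honored(stdout: &str, separator: &str) -> Option<bool> {
    let mut lines = stdout
        .lines()
        .map(str::trim)
        .filter(|l| !l.is_empty())
        .peekable();
    // No job lines at all: cannot tell whether the suffix is supported.
    lines.peek()?;
    Some(lines.all(|l| l.contains(separator)))
}
```

A caller could use a `Some(false)` result to show a message about the squeue version instead of an empty pane.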

@fleimgruber
Author

fleimgruber commented Oct 18, 2023

So as mentioned in OP, it actually is a compatibility issue with Slurm 18.08. Do you see another way to do the string post-processing? E.g. split on a tab or a certain amount of blanks instead of the ###turm### sentinel.

Edit: I see now that the only way to parse the output is to not use the --noheader argument and look for the header column positions to correctly infer the field offsets for the actual output lines.
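
To make the header idea concrete, here is a rough sketch of offset-based parsing. It is not the code from any pull request; `column_starts` and `slice_columns` are hypothetical names, and it assumes header names contain no whitespace, the output is ASCII, and every value starts at the same offset as its header.

```rust
/// Hypothetical helper: returns the byte offset at which each header word
/// starts. Assumes header names contain no whitespace.
fn column_starts(header: &str) -> Vec<usize> {
    let mut starts = Vec::new();
    let mut prev_was_space = true;
    for (i, ch) in header.char_indices() {
        if !ch.is_whitespace() && prev_was_space {
            starts.push(i);
        }
        prev_was_space = ch.is_whitespace();
    }
    starts
}

/// Hypothetical helper: slices `line` at the given offsets and trims each
/// value. The last column runs to the end of the line; columns that lie
/// beyond the end of a short line come back as empty strings.
fn slice_columns<'a>(line: &'a str, starts: &[usize]) -> Vec<&'a str> {
    starts
        .iter()
        .enumerate()
        .map(|(i, &start)| {
            let end = starts
                .get(i + 1)
                .copied()
                .unwrap_or(line.len())
                .min(line.len());
            line.get(start.min(end)..end).unwrap_or("").trim()
        })
        .collect()
}
```

Unlike splitting on runs of blanks, slicing by offset keeps values that themselves contain spaces (a job name like `my job`) in one column, as long as the alignment assumption holds.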

@kabouzeid
Owner

Thanks for tracking this down!

> Edit: I see now that the only way to parse the output is to not use the --noheader argument and look for the header column positions to correctly infer the field offsets for the actual output lines.

If someone implements this in a robust enough way, I would be willing to merge it. I won't have time to do this myself.

@fleimgruber changed the title from "Job list is empty" to "Job list is empty on Slurm 18.08" on Oct 25, 2023
@fleimgruber linked a pull request that will close this issue on Nov 2, 2023
@fleimgruber
Author

I went ahead and implemented my suggested approach from #17 (comment) in #20
