Global identifier for the job monitoring #12192

Open
nikodemas opened this issue Dec 3, 2024 · 7 comments

@nikodemas
Member

Impact of the new feature
WMArchive

Is your feature request related to a problem? Please describe.
It is impossible to match jobs between the HTCondor job monitoring and the WMArchive monitoring. @RHofsaess found a way to match the jobs on the EOSLogURL, but it only works for failed jobs.

Describe the solution you'd like
It would be nice to have a global identifier that would be available in WMArchive and could also be propagated to the HTCondor monitoring.

Describe alternatives you've considered
The EOSLogURL could be created for all jobs (not only the failed ones); however, this would require some string concatenation on the HTCondor monitoring side that seems prone to breaking. See the picture from the presentation of @RHofsaess (link to the full presentation); a rough sketch of the construction is shown below:
[image: slide showing the EOSLogURL construction]
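
For reference, the construction amounts to something along these lines (a minimal Python sketch; the field names follow the condor_raw documents and the exact URL layout is assumed from the slide and from the scripted-field idea further down this thread):

# Hedged sketch only: rebuild the EOSLogURL suffix from a condor_raw job record.
# Field names (WMAgent_SubTaskName, ScheddName, Args) follow the condor_raw schema;
# the exact URL layout is an assumption based on the slide.
def build_eos_log_url(job):
    sub_task = job.get("WMAgent_SubTaskName")
    schedd = job.get("ScheddName")
    args = (job.get("Args") or "").split(" ")
    if not sub_task or not schedd or len(args) < 3:
        return None
    # the second and third Args tokens identify the job in the log name
    return f"{sub_task}/{schedd}-{args[1]}-{args[2]}-log.tar.gz"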

FYI @leggerf

@vkuznet
Contributor

vkuznet commented Dec 3, 2024

This is exactly why I was proposing the usage of Data Tracers in a recent USCMS monitoring talk. If we introduce traces into WM workflows, we can easily correlate the different sources.
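
To make the idea concrete, here is a minimal, purely illustrative Python sketch; the attribute and field names are hypothetical and not part of the actual proposal:

import uuid

# Purely illustrative: a single trace id is minted per job and attached to every
# record the job produces, so documents from different monitoring sources can
# be joined on it. Attribute/field names below are hypothetical.
def new_trace_id():
    return uuid.uuid4().hex

def tag_condor_classads(classads, trace_id):
    classads["WMAgent_TraceId"] = trace_id   # hypothetical ClassAd attribute
    return classads

def tag_wmarchive_doc(doc, trace_id):
    doc["traceId"] = trace_id                # hypothetical WMArchive field
    return doc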

@RHofsaess

Hi all,
@vkuznet: This sounds good! However, it is probably more of a long-term solution. Or is there already some sort of timeline?

Just to clarify, regarding "but it only works for the failed jobs": it also works for successful jobs if I construct the EOSLogURL for both indices, as described on my slide.
I am currently doing the matching myself and pushing it to a self-hosted OpenSearch 😅

@nikodemas Concerning the string concatenation: I don't think it is too critical. We could also think of adding a scripted field that just concatenates the parts non-persistently.
This should be the least intrusive short-term solution.

For condor_raw, it could look something like:

if (doc.containsKey('data.WMAgent_SubTaskName') && doc.containsKey('data.ScheddName')
        && doc.containsKey('data.Args') && doc['data.Args'].size() != 0) {
    String subTaskName = doc['data.WMAgent_SubTaskName'].value;
    String scheddName = doc['data.ScheddName'].value;
    String args = doc['data.Args'].value;

    // Args is a single space-separated string; the log name is built from
    // its second and third tokens.
    String[] argsParts = args.split(" ");
    if (argsParts.length >= 3) {
        String arg1 = argsParts[1];  // second argument
        String arg2 = argsParts[2];  // third argument
        return subTaskName + "/" + scheddName + "-" + arg1 + "-" + arg2 + "-log.tar.gz";
    } else {
        return null;
    }
} else {
    return null;
}

I am not 100% sure if the array accessing works like this; I couldn't test it, as I do not have the permissions.

So please see it more as an idea than a proper implementation 😅

Cheers,
Robin

@vkuznet
Contributor

vkuznet commented Dec 4, 2024

@RHofsaess, I made a proposal to the WM group about data traces; whether it will be taken into consideration is up to priorities and stakeholder push. If you feel it is a desired feature, feel free to reach out to our WM L2s and express your desire for it to be included in the WM plan for the upcoming quarter(s).

@amaltaro
Contributor

amaltaro commented Dec 4, 2024

@nikodemas the actual order of monitoring information is:

  1. First, the HTCondor data gets published (as the job is pending, running and completed).
  2. Then one of the agent components parses the job information and the log/framework report, and uploads it to WMArchive.

Are you requesting to have these document ids (or a job id) common between these two monitoring systems?

Can you please remind us again what the global id defined in the condor-based job information is?

@nikodemas
Member Author

"Are you requesting to have these document ids (or a job id) common between these two monitoring systems?"

Yes, it would be nice to have that, to be able to match data from the two different data sources.

Currently we have a ClassAd called GlobalJobId for the global id; see the HTCondor docs and some example docs on the OpenSearch.
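
For context, the GlobalJobId attribute can be inspected with the HTCondor Python bindings; a rough illustration (the printed value is made up):

import htcondor

# GlobalJobId typically has the form "<schedd name>#<cluster>.<proc>#<submit time>",
# e.g. "schedd.example.com#12345.0#1733212800" (value made up for illustration).
schedd = htcondor.Schedd()
for ad in schedd.query(projection=["ClusterId", "ProcId", "GlobalJobId"]):
    print(ad.get("ClusterId"), ad.get("ProcId"), ad.get("GlobalJobId"))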

@amaltaro
Contributor

amaltaro commented Dec 5, 2024

Ok, then we would have to persist this GlobalJobId into the WMAgent job report file (pickle or XML, I don't remember which one right now).

And in ArchiveDataReporter, when we construct documents to upload to WMArchive, we would also need to add this GlobalJobId to the document.
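
Something along these lines (a sketch only; the report/document structures and keys are assumptions, not the actual WMCore/ArchiveDataReporter code):

# Hypothetical sketch: propagate GlobalJobId from the parsed job report into the
# document built for WMArchive. Structures/keys are assumptions, not WMCore code.
def build_wmarchive_doc(fwjr):
    doc = {
        "task": fwjr.get("task"),
        "steps": fwjr.get("steps", []),
    }
    if "GlobalJobId" in fwjr:          # persisted earlier by the agent
        doc["GlobalJobId"] = fwjr["GlobalJobId"]
    return doc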

@nikodemas do you know when this GlobalJobId gets populated in condor? Is it set as soon as a job is pending, or is it defined in the spider script after job execution?

@nikodemas
Member Author

@amaltaro I am not sure when exactly, but it is available for all of the statuses, such as Held and Idle. The spider script doesn't fill or modify this value at all; it is simply taken as received from HTCondor.
