Global identifier for the job monitoring #12192

Open
nikodemas opened this issue Dec 3, 2024 · 7 comments

@nikodemas
Member

Impact of the new feature
WMArchive

Is your feature request related to a problem? Please describe.
It is impossible to match jobs between the HTCondor job monitoring and the WMArchive monitoring. @RHofsaess found a way to match the jobs on the EOSLogURL, but it only works for failed jobs.

Describe the solution you'd like
It would be nice to have a global identifier that would be available in WMArchive and could also be propagated to the HTCondor monitoring.

Describe alternatives you've considered
The EOSLogURL could be created for all jobs (not only the failed ones); however, this would require some string concatenation on the HTCondor monitoring side that seems prone to breaking. See the picture from the presentation of @RHofsaess (link to the full presentation); a rough sketch of the construction is shown below:
[image: slide showing the EOSLogURL construction]
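
For reference, the construction amounts to something along these lines (a minimal Python sketch; the field names follow the condor_raw documents and the exact URL layout is assumed from the slide and from the scripted-field idea further down this thread):

# Hedged sketch only: rebuild the EOSLogURL suffix from a condor_raw job record.
# Field names (WMAgent_SubTaskName, ScheddName, Args) follow the condor_raw schema;
# the exact URL layout is an assumption based on the slide.
def build_eos_log_url(job):
    sub_task = job.get("WMAgent_SubTaskName")
    schedd = job.get("ScheddName")
    args = (job.get("Args") or "").split(" ")
    if not sub_task or not schedd or len(args) < 3:
        return None
    # the second and third Args tokens identify the job in the log name
    return f"{sub_task}/{schedd}-{args[1]}-{args[2]}-log.tar.gz"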

FYI @leggerf

@vkuznet
Contributor

vkuznet commented Dec 3, 2024

This is exactly why I was proposing the usage of Data Tracers in a recent USCMS monitoring talk. If we introduce traces into WM workflows, we can easily correlate the different sources.
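
To make the idea concrete, here is a minimal, purely illustrative Python sketch; the attribute and field names are hypothetical and not part of the actual proposal:

import uuid

# Purely illustrative: a single trace id is minted per job and attached to every
# record the job produces, so documents from different monitoring sources can
# be joined on it. Attribute/field names below are hypothetical.
def new_trace_id():
    return uuid.uuid4().hex

def tag_condor_classads(classads, trace_id):
    classads["WMAgent_TraceId"] = trace_id   # hypothetical ClassAd attribute
    return classads

def tag_wmarchive_doc(doc, trace_id):
    doc["traceId"] = trace_id                # hypothetical WMArchive field
    return doc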

@RHofsaess

Hi all,
@vkuznet: This sounds good! However, it is probably more of a long-term solution. Or is there already some sort of timeline?

Just to clarify, regarding "but it only works for the failed jobs": it also works for successful jobs if I construct the EOSLogURL for both indices, as described on my slide.
I am currently doing the matching myself and pushing it to a self-hosted OpenSearch 😅

@nikodemas Concerning the string concatenation: I don't think it is too critical. We could also think of adding a scripted field that just concatenates the parts non-persistently.
This should be the least intrusive short-term solution.

For condor_raw, it could look something like:

if (doc.containsKey('data.WMAgent_SubTaskName') && doc.containsKey('data.ScheddName')
        && doc.containsKey('data.Args') && doc['data.Args'].size() != 0) {
    String subTaskName = doc['data.WMAgent_SubTaskName'].value;
    String scheddName = doc['data.ScheddName'].value;
    String args = doc['data.Args'].value;

    // Args is a single space-separated string; the log name is built from
    // its second and third tokens.
    String[] argsParts = args.split(" ");
    if (argsParts.length >= 3) {
        String arg1 = argsParts[1];  // second argument
        String arg2 = argsParts[2];  // third argument
        return subTaskName + "/" + scheddName + "-" + arg1 + "-" + arg2 + "-log.tar.gz";
    } else {
        return null;
    }
} else {
    return null;
}

I am not 100% sure if the array accessing works like this; I couldn't test it, as I do not have the permissions.

So please see it more as an idea than a proper implementation 😅

Cheers,
Robin

@vkuznet
Contributor

vkuznet commented Dec 4, 2024

@RHofsaess, I made a proposal to the WM group about data traces; whether it will be taken into consideration is up to priorities and stakeholder push. If you feel it is a desired feature, feel free to reach out to our WM L2s and express your desire for it to be included in the WM plan for the upcoming quarter(s).

@amaltaro
Contributor

amaltaro commented Dec 4, 2024

@nikodemas the actual order of monitoring information is:

  1. First, the HTCondor data gets published (as the job is pending, running and completed).
  2. Then one of the agent components parses the job information and the log/framework report, and uploads it to WMArchive.

Are you requesting to have these document ids (or a job id) common between these two monitoring systems?

Can you please remind us again what the global id defined in the condor-based job information is?

@nikodemas
Member Author

"Are you requesting to have these document ids (or a job id) common between these two monitoring systems?"

Yes, it would be nice to have that, to be able to match data from the two different data sources.

Currently we have a ClassAd called GlobalJobId for the global id; see the HTCondor docs and some example docs on the OpenSearch.
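
For context, the GlobalJobId attribute can be inspected with the HTCondor Python bindings; a rough illustration (the printed value is made up):

import htcondor

# GlobalJobId typically has the form "<schedd name>#<cluster>.<proc>#<submit time>",
# e.g. "schedd.example.com#12345.0#1733212800" (value made up for illustration).
schedd = htcondor.Schedd()
for ad in schedd.query(projection=["ClusterId", "ProcId", "GlobalJobId"]):
    print(ad.get("ClusterId"), ad.get("ProcId"), ad.get("GlobalJobId"))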

@amaltaro
Contributor

amaltaro commented Dec 5, 2024

Ok, then we would have to persist this GlobalJobId into the WMAgent job report file (pickle or XML, I don't remember which one right now).

And in ArchiveDataReporter, when we construct documents to upload to WMArchive, we would also need to add this GlobalJobId to the document.
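
Something along these lines (a sketch only; the report/document structures and keys are assumptions, not the actual WMCore/ArchiveDataReporter code):

# Hypothetical sketch: propagate GlobalJobId from the parsed job report into the
# document built for WMArchive. Structures/keys are assumptions, not WMCore code.
def build_wmarchive_doc(fwjr):
    doc = {
        "task": fwjr.get("task"),
        "steps": fwjr.get("steps", []),
    }
    if "GlobalJobId" in fwjr:          # persisted earlier by the agent
        doc["GlobalJobId"] = fwjr["GlobalJobId"]
    return doc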

@nikodemas do you know when this GlobalJobId gets populated in condor? Is it set as soon as a job is pending, or is it defined in the spider script after job execution?

@nikodemas
Member Author

@amaltaro I am not sure when exactly, but it is available for all of the statuses, such as Held and Idle. The spider script doesn't fill or modify this value at all; it is simply taken as received from HTCondor.
