
Resubmission loses the destination's resubmit definitions, breaking subsequent resubmissions on the same destination #15208

Open
pcm32 opened this issue Dec 13, 2022 · 14 comments


@pcm32
Member

pcm32 commented Dec 13, 2022

Describe the bug

When a destination sets itself as the resubmission destination, the resubmit definition gets lost on the first resubmission, preventing any subsequent resubmissions. The information is lost in the destination assignments here:

# fetch JobDestination for the id or tag
if destination:
    new_destination = app.job_config.get_destination(destination)
else:
    new_destination = job_state.job_destination
# Resolve dynamic if necessary
new_destination = job_state.job_wrapper.job_runner_mapper.cache_job_destination(new_destination)
# Reset job state
job_state.job_wrapper.clear_working_directory()
job = job_state.job_wrapper.get_job()
if resubmit.get("handler", None):
    log.debug("%s Job reassigned to handler %s", job_log_prefix, resubmit["handler"])
    job.set_handler(resubmit["handler"])
    job_runner.sa_session.add(job)
    # Is this safe to do here?
    job_runner.sa_session.flush()
# Cache the destination to prevent rerunning dynamic after
# resubmit
job_state.job_wrapper.job_runner_mapper.cached_job_destination = new_destination

This treatment does not, of course, happen on the first submission, which explains why the first resubmission works.
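
To make the suspected failure mode easier to picture, here is a small, self-contained model of it (illustrative only; FakeDestination and resolve_dynamic are made-up names, not Galaxy code). A destination that resubmits to itself is resolved once, the resolved result is cached, and the cached copy no longer carries the resubmit definitions, so a second failure finds nothing to act on:

from dataclasses import dataclass, field

@dataclass
class FakeDestination:
    # Illustrative stand-in for a JobDestination; not the Galaxy class.
    id: str
    params: dict = field(default_factory=dict)
    resubmit: list = field(default_factory=list)

def resolve_dynamic(dest: FakeDestination) -> FakeDestination:
    # Stand-in for a dynamic rule (e.g. TPV): it builds a fresh, concrete
    # destination and, in this model, does not copy the resubmit block over.
    return FakeDestination(id=dest.id, params=dict(dest.params))

# Configured destination that resubmits to itself on memory failures.
configured = FakeDestination(
    id="extra_mem",
    resubmit=[{"condition": "memory_limit_reached", "destination": "extra_mem"}],
)

cached = resolve_dynamic(configured)  # first failure: resolve and cache
print(cached.resubmit)                # [] -> nothing drives a 2nd resubmission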

Galaxy Version and/or server at which you observed the bug
Galaxy Version: 22.05
Commit: 3b068ee

To Reproduce
Steps to reproduce the behavior:

  1. Set up TPV as done here.
  2. Send a job that will fail due to its memory requirements.
  3. Check the logs: you will see one resubmission go through, but not a second one.

Expected behavior

Subsequent resubmissions (in this case with escalating memory allowance) should occur.

Additional context

I have seen subsequent resubmissions work when the destination changes on each resubmission. Here we are attempting subsequent resubmissions on the same destination.

galaxyproject/total-perspective-vortex#65

@pcm32
Member Author

pcm32 commented Dec 14, 2022

I have tried removing various pieces of L70 - L89 quoted above, but it either breaks with exceptions or doesn't work at all. I wonder if the resubmission state handler could just do the minimal work needed so that the normal handler then resolves dynamic destinations and everything else as if it were a first submission. I think the issue is that the resubmit handler deals with destinations differently from the original handler.
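
For illustration only, here is one way that "minimal work" could be read, written as a variant of the fragment quoted above (untested; whether the rest of the machinery would pick the destination up correctly afterwards is exactly the open question):

# Variant sketch of the quoted fragment: keep the bookkeeping, but skip both
# the dynamic resolution and the cache assignment, so the normal handler
# resolves the destination again as if this were a first submission.
if destination:
    new_destination = app.job_config.get_destination(destination)
else:
    new_destination = job_state.job_destination
# (assumption: new_destination would still have to reach the job wrapper by
# whatever path the first-submission code uses)
job_state.job_wrapper.clear_working_directory()
job = job_state.job_wrapper.get_job()
if resubmit.get("handler", None):
    log.debug("%s Job reassigned to handler %s", job_log_prefix, resubmit["handler"])
    job.set_handler(resubmit["handler"])
    job_runner.sa_session.add(job)
    job_runner.sa_session.flush()
# Neither cache_job_destination() nor cached_job_destination is touched here.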

@bernt-matthias
Contributor

Xref #9747

I think the best first step would be a test case that shows how it should work. Some fragments are in the linked PR.

As far as I remember, I played around with the caching in the last line of the quoted code fragment.

My impression was that the main reason is here:

# resubmits are not persisted (it's a good thing) so they

(not sure if the code fragment is still valid)
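
A sketch of what such a test might assert, with deliberately hypothetical stubs (submit_job_to and destinations_used_by are placeholders, not Galaxy's test API):

# Hypothetical test sketch; the stubs below stand in for whatever the real
# integration-test framework provides.
def submit_job_to(destination_id, tool):
    # Stub: pretend the tool hits the memory limit twice and then succeeds.
    return {"destination_history": [destination_id] * 3}

def destinations_used_by(job):
    return job["destination_history"]

def test_resubmit_chain_on_same_destination():
    # "extra_mem" is assumed to resubmit to itself on memory_limit_reached.
    job = submit_job_to("extra_mem", tool="always_exceeds_memory")
    # The resubmit definition must survive the first resubmission, so the
    # job should have been attempted at least three times.
    assert len(destinations_used_by(job)) >= 3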

@pcm32
Member Author

pcm32 commented Dec 15, 2022

I can confirm that this works when using independent destinations to hop from resubmission to resubmission. For instance:

execution:
  default: tpv_dispatcher
  environments:
    extra_mem_1:
      container_monitor: false
      docker_default_container_id: quay.io/galaxyproject/galaxy-min:22.05
      docker_enabled: true
      limits_cpu: 1
      limits_memory: 7Gi
      requests_cpu: 1
      requests_memory: 7Gi
      resubmit:
      - condition: memory_limit_reached and attempt <= 4
        destination: extra_mem_2
      - condition: any_failure and attempt <= 2
        destination: tpv_dispatcher
      runner: k8s
    extra_mem_2:
      container_monitor: false
      docker_default_container_id: quay.io/galaxyproject/galaxy-min:22.05
      docker_enabled: true
      limits_cpu: 1
      limits_memory: 10Gi
      requests_cpu: 1
      requests_memory: 10Gi
      resubmit:
      - condition: memory_limit_reached and attempt <= 4
        destination: extra_mem_3
      - condition: any_failure and attempt <= 2
        destination: tpv_dispatcher
      runner: k8s
    extra_mem_3:
      container_monitor: false
      docker_default_container_id: quay.io/galaxyproject/galaxy-min:22.05
      docker_enabled: true
      limits_cpu: 1
      limits_memory: 12Gi
      requests_cpu: 1
      requests_memory: 12Gi
      resubmit:
      - condition: any_failure and attempt <= 2
        destination: tpv_dispatcher
      runner: k8s
    k8s:
      runner: k8s
    local:
      runner: local
    tpv_dispatcher:
      container_monitor: false
      docker_default_container_id: 'quay.io/galaxyproject/galaxy-min:22.05'
      docker_enabled: true
      function: map_tool_to_destination
      rules_module: tpv.rules
      runner: dynamic
      tpv_config_files:
      - https://raw.githubusercontent.com/galaxyproject/tpv-shared-database/main/tools.yml
      - lib/galaxy/jobs/rules/tpv_rules_local.yml
      - /galaxy/server/config/persist-seq-extra-tpv.yaml
      type: python

So there it starts with the tpv_dispatcher, then keeps moving through the extra_mem_* destinations/environments. I tried a similar approach with two TPV environments (1 and 2, so that they would resubmit to each other), but that didn't work. I wonder if the issue is whether the destinations are dynamic or not.
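
As a rough model of why the static hop works (names made up, not Galaxy code): each resubmit names the next destination by id, and looking that id up in the static config always returns a full definition that carries its own resubmit block, so the chain keeps going.

# Rough model of the working extra_mem_* chain; not Galaxy code.
JOB_CONF = {
    "extra_mem_1": {"resubmit_to": "extra_mem_2"},
    "extra_mem_2": {"resubmit_to": "extra_mem_3"},
    "extra_mem_3": {"resubmit_to": None},
}

def get_destination(dest_id):
    # Stand-in for app.job_config.get_destination(): a static lookup that
    # always returns the complete definition, resubmit block included.
    return JOB_CONF[dest_id]

dest_id = "extra_mem_1"
while dest_id is not None:
    print("attempt on", dest_id)  # pretend each attempt hits the memory limit
    dest_id = get_destination(dest_id)["resubmit_to"]

With two TPV environments resubmitting to each other, there is no equivalent static lookup to fall back on, which may be why that variant fails.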

@bernt-matthias
Contributor

So this example works, i.e. tpv -> mem1 -> mem2 -> mem3 in the case of successive memory violations? And maybe tpv -> mem1 -> tpv in the case of a single memory violation?

I wonder if the issue might be with the fact that they might be dynamic or not.

Most likely. TPV is simply a dynamic runner... I would have been surprised if it worked.

@pcm32
Member Author

pcm32 commented Dec 20, 2022

I have done dynamic_1 -> dynamic_2 -> dynamic_3 in other setups without issues. A trial of tpv_1 -> tpv_2 -> tpv_1 didn't seem to work, though maybe I messed something up there.

@bernt-matthias
Contributor

But the dynamic_i destinations all exist in the job conf, right?

@nuwang
Member

nuwang commented Dec 21, 2022

@pcm32 There was a recent change to TPV, not yet released, that will always construct a JobDestination dynamically: https://github.com/galaxyproject/total-perspective-vortex/blob/1fb4f5be15651fbe6644e4c46eb87d938793574d/tpv/core/mapper.py#L73
With that change, TPV will no longer read destinations from job_conf.yml at all.
The only thing you will need to add is the runner parameter to existing TPV destinations.

I wonder whether that might solve this issue?
You could install it from source and use the migration guide here: https://github.com/galaxyproject/total-perspective-vortex/blob/main/MIGRATION.rst

@nuwang
Member

nuwang commented Jan 12, 2023

@pcm32 Does the latest TPV release solve the resubmission issue, for TPV at least?

@pcm32
Member Author

pcm32 commented Jan 16, 2023

Not yet, will try to get back to this soon :-).

@nuwang
Member

nuwang commented Jan 16, 2023

@bernt-matthias has confirmed that it still doesn't work: galaxyproject/total-perspective-vortex#78
That makes sense, because Galaxy appears to be caching the final, resolved destination. This will need to be fixed in Galaxy. I don't know what the original reasons for caching things this way are. If it's just performance, could we probably stop caching it? ping @natefoo @jmchilton

@natefoo
Member

natefoo commented Jan 18, 2023

Yes, resolving a dynamic destination is a potentially expensive operation, and it would be performed on every loop of the job queue thread until the job is ready and dispatched, which is why dynamic destinations are cached.
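
As a generic illustration of that rationale (not Galaxy's implementation): the queue thread keeps asking for the job's destination while it waits, so without a cache the expensive resolver would run on every iteration instead of once per job.

import functools
import time

def expensive_dynamic_resolver(job_id):
    # Stand-in for evaluating a dynamic destination (rule modules, TPV, ...).
    time.sleep(0.5)
    return f"resolved-destination-for-{job_id}"

@functools.lru_cache(maxsize=None)
def cached_resolver(job_id):
    return expensive_dynamic_resolver(job_id)

# The job queue loop may ask many times before the job is dispatched;
# with the cache, the resolver actually runs only once for this job.
for _ in range(100):
    cached_resolver("job-42")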

@bernt-matthias
Contributor

Thanks for that insight - really good to know.

My guess is that the caching itself is not the main problem, but that some properties (like the resubmit destination, env variables, ...?) are not persisted in the database:

# resubmits are not persisted (it's a good thing) so they

So when the job is loaded on the 1st resubmission, the resubmit destination is lost and it won't be resubmitted a 2nd time.
Some more thoughts might be hidden in #9747. But if this explanation is correct, I do not yet understand why the 1st resubmission works.
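
A small model of that hypothesis (illustrative dicts only; Galaxy persists jobs through SQLAlchemy models, not like this): the job row keeps the destination id and params but not the resubmit definitions, so a destination rebuilt from the database after the first resubmission has nothing left to resubmit on.

# In-memory destination as built from the config / a dynamic rule:
in_memory = {
    "id": "extra_mem",
    "params": {"requests_memory": "7Gi"},
    "resubmit": [{"condition": "memory_limit_reached", "destination": "extra_mem"}],
}

# What, per this hypothesis, gets persisted with the job:
persisted = {
    "destination_id": in_memory["id"],
    "destination_params": in_memory["params"],
    # no resubmit definitions stored
}

# Destination as reconstructed when handling the resubmitted job:
reloaded = {
    "id": persisted["destination_id"],
    "params": persisted["destination_params"],
    "resubmit": [],
}

print(reloaded["resubmit"])  # [] -> no 2nd resubmission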

@nuwang
Member

nuwang commented Jan 19, 2023

Thanks, that makes sense. However, is there a reason why the cache can't be ignored in the resubmit case? In fact, looking at the code, that appears to be exactly what it's doing:

new_destination = job_state.job_wrapper.job_runner_mapper.cache_job_destination(new_destination)

The way I understood it, calling cache_job_destination should cause the existing cache entry to be ignored and the destination to be cached anew. It's not very obvious to me why that isn't working.

@bernt-matthias
Contributor

is there a reason why the cache can't be ignored in the resubmit case

The reason was given by @natefoo in #15208 (comment)... but I guess you meant to ignore it only once, i.e. re-cache: call the dynamic resolver once more and then use the cached result on subsequent accesses.

You are right, the code reads like this...

My approach would be [...] in order to understand what is going on.
