
Resubmission loses the destination's resubmit definitions, breaking subsequent resubmissions on the same destination #15208

Open
pcm32 opened this issue Dec 13, 2022 · 14 comments


@pcm32
Member

pcm32 commented Dec 13, 2022

Describe the bug

When a destination sets itself as the resubmission destination, the resubmit definition gets lost on the first resubmission, preventing any subsequent resubmissions. The information is lost in the destination assignments here:

# fetch JobDestination for the id or tag
if destination:
    new_destination = app.job_config.get_destination(destination)
else:
    new_destination = job_state.job_destination
# Resolve dynamic if necessary
new_destination = job_state.job_wrapper.job_runner_mapper.cache_job_destination(new_destination)
# Reset job state
job_state.job_wrapper.clear_working_directory()
job = job_state.job_wrapper.get_job()
if resubmit.get("handler", None):
    log.debug("%s Job reassigned to handler %s", job_log_prefix, resubmit["handler"])
    job.set_handler(resubmit["handler"])
    job_runner.sa_session.add(job)
    # Is this safe to do here?
    job_runner.sa_session.flush()
# Cache the destination to prevent rerunning dynamic after
# resubmit
job_state.job_wrapper.job_runner_mapper.cached_job_destination = new_destination

This treatment does not, of course, happen on the first submission, which explains why the first resubmission works.
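
To make the suspected failure mode easier to picture, here is a small, self-contained model of it (illustrative only; FakeDestination and resolve_dynamic are made-up names, not Galaxy code). A destination that resubmits to itself is resolved once, the resolved result is cached, and the cached copy no longer carries the resubmit definitions, so a second failure finds nothing to act on:

from dataclasses import dataclass, field

@dataclass
class FakeDestination:
    # Illustrative stand-in for a JobDestination; not the Galaxy class.
    id: str
    params: dict = field(default_factory=dict)
    resubmit: list = field(default_factory=list)

def resolve_dynamic(dest: FakeDestination) -> FakeDestination:
    # Stand-in for a dynamic rule (e.g. TPV): it builds a fresh, concrete
    # destination and, in this model, does not copy the resubmit block over.
    return FakeDestination(id=dest.id, params=dict(dest.params))

# Configured destination that resubmits to itself on memory failures.
configured = FakeDestination(
    id="extra_mem",
    resubmit=[{"condition": "memory_limit_reached", "destination": "extra_mem"}],
)

cached = resolve_dynamic(configured)  # first failure: resolve and cache
print(cached.resubmit)                # [] -> nothing drives a 2nd resubmission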

Galaxy Version and/or server at which you observed the bug
Galaxy Version: 22.05
Commit: 3b068ee

To Reproduce
Steps to reproduce the behavior:

  1. Set up TPV as done here.
  2. Send a job that will fail due to its memory requirements.
  3. Check the logs: you will see one resubmission go through, but not a second one.

Expected behavior

Subsequent resubmissions (in this case with escalating memory allowance) should occur.

Additional context

I have seen subsequent resubmissions work when the destination changes on each resubmission. Here we are attempting subsequent resubmissions on the same destination.

galaxyproject/total-perspective-vortex#65

@pcm32
Member Author

pcm32 commented Dec 14, 2022

I have tried removing various pieces of L70 - L89 quoted above, but it either breaks with exceptions or doesn't work at all. I wonder if the resubmission state handler could just do the minimal work needed so that the normal handler then resolves dynamic destinations and everything else as if it were a first submission. I think the issue is that the resubmit handler deals with destinations differently from the original handler.
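
For illustration only, here is one way that "minimal work" could be read, written as a variant of the fragment quoted above (untested; whether the rest of the machinery would pick the destination up correctly afterwards is exactly the open question):

# Variant sketch of the quoted fragment: keep the bookkeeping, but skip both
# the dynamic resolution and the cache assignment, so the normal handler
# resolves the destination again as if this were a first submission.
if destination:
    new_destination = app.job_config.get_destination(destination)
else:
    new_destination = job_state.job_destination
# (assumption: new_destination would still have to reach the job wrapper by
# whatever path the first-submission code uses)
job_state.job_wrapper.clear_working_directory()
job = job_state.job_wrapper.get_job()
if resubmit.get("handler", None):
    log.debug("%s Job reassigned to handler %s", job_log_prefix, resubmit["handler"])
    job.set_handler(resubmit["handler"])
    job_runner.sa_session.add(job)
    job_runner.sa_session.flush()
# Neither cache_job_destination() nor cached_job_destination is touched here.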

@bernt-matthias
Contributor

Xref #9747

I think the best first step would be a test case that shows how it should work. Some fragments are in the linked PR.

As far as I remember, I played around with the caching in the last line of the quoted code fragment.

My impression was that the main reason is here:

# resubmits are not persisted (it's a good thing) so they

(not sure if the code fragment is still valid)
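
A sketch of what such a test might assert, with deliberately hypothetical stubs (submit_job_to and destinations_used_by are placeholders, not Galaxy's test API):

# Hypothetical test sketch; the stubs below stand in for whatever the real
# integration-test framework provides.
def submit_job_to(destination_id, tool):
    # Stub: pretend the tool hits the memory limit twice and then succeeds.
    return {"destination_history": [destination_id] * 3}

def destinations_used_by(job):
    return job["destination_history"]

def test_resubmit_chain_on_same_destination():
    # "extra_mem" is assumed to resubmit to itself on memory_limit_reached.
    job = submit_job_to("extra_mem", tool="always_exceeds_memory")
    # The resubmit definition must survive the first resubmission, so the
    # job should have been attempted at least three times.
    assert len(destinations_used_by(job)) >= 3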

@pcm32
Member Author

pcm32 commented Dec 15, 2022

I can confirm that this works when using independent destinations to hop from resubmission to resubmission. For instance:

execution:
  default: tpv_dispatcher
  environments:
    extra_mem_1:
      container_monitor: false
      docker_default_container_id: quay.io/galaxyproject/galaxy-min:22.05
      docker_enabled: true
      limits_cpu: 1
      limits_memory: 7Gi
      requests_cpu: 1
      requests_memory: 7Gi
      resubmit:
      - condition: memory_limit_reached and attempt <= 4
        destination: extra_mem_2
      - condition: any_failure and attempt <= 2
        destination: tpv_dispatcher
      runner: k8s
    extra_mem_2:
      container_monitor: false
      docker_default_container_id: quay.io/galaxyproject/galaxy-min:22.05
      docker_enabled: true
      limits_cpu: 1
      limits_memory: 10Gi
      requests_cpu: 1
      requests_memory: 10Gi
      resubmit:
      - condition: memory_limit_reached and attempt <= 4
        destination: extra_mem_3
      - condition: any_failure and attempt <= 2
        destination: tpv_dispatcher
      runner: k8s
    extra_mem_3:
      container_monitor: false
      docker_default_container_id: quay.io/galaxyproject/galaxy-min:22.05
      docker_enabled: true
      limits_cpu: 1
      limits_memory: 12Gi
      requests_cpu: 1
      requests_memory: 12Gi
      resubmit:
      - condition: any_failure and attempt <= 2
        destination: tpv_dispatcher
      runner: k8s
    k8s:
      runner: k8s
    local:
      runner: local
    tpv_dispatcher:
      container_monitor: false
      docker_default_container_id: 'quay.io/galaxyproject/galaxy-min:22.05'
      docker_enabled: true
      function: map_tool_to_destination
      rules_module: tpv.rules
      runner: dynamic
      tpv_config_files:
      - https://raw.githubusercontent.com/galaxyproject/tpv-shared-database/main/tools.yml
      - lib/galaxy/jobs/rules/tpv_rules_local.yml
      - /galaxy/server/config/persist-seq-extra-tpv.yaml
      type: python

So there it starts with the tpv_dispatcher, then keeps moving through the extra_mem_* destinations/environments. I tried a similar approach with two TPV environments (1 and 2, so that they would resubmit to each other), but that didn't work. I wonder if the issue is whether the destinations are dynamic or not.
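
As a rough model of why the static hop works (names made up, not Galaxy code): each resubmit names the next destination by id, and looking that id up in the static config always returns a full definition that carries its own resubmit block, so the chain keeps going.

# Rough model of the working extra_mem_* chain; not Galaxy code.
JOB_CONF = {
    "extra_mem_1": {"resubmit_to": "extra_mem_2"},
    "extra_mem_2": {"resubmit_to": "extra_mem_3"},
    "extra_mem_3": {"resubmit_to": None},
}

def get_destination(dest_id):
    # Stand-in for app.job_config.get_destination(): a static lookup that
    # always returns the complete definition, resubmit block included.
    return JOB_CONF[dest_id]

dest_id = "extra_mem_1"
while dest_id is not None:
    print("attempt on", dest_id)  # pretend each attempt hits the memory limit
    dest_id = get_destination(dest_id)["resubmit_to"]

With two TPV environments resubmitting to each other, there is no equivalent static lookup to fall back on, which may be why that variant fails.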

@bernt-matthias
Contributor

So this example works, i.e. tpv -> mem1 -> mem2 -> mem3 in the case of successive memory violations? And maybe tpv -> mem1 -> tpv in the case of a single memory violation?

I wonder if the issue might be with the fact that they might be dynamic or not.

Most likely. TPV is simply a dynamic runner... I would have been surprised if it worked.

@pcm32
Member Author

pcm32 commented Dec 20, 2022

I have done dynamic_1 -> dynamic_2 -> dynamic_3 in other setups without issues. A trial of tpv_1 -> tpv_2 -> tpv_1 didn't seem to work, though maybe I messed something up there.

@bernt-matthias
Contributor

But the dynamic_i destinations all exist in the job conf, right?

@nuwang
Member

nuwang commented Dec 21, 2022

@pcm32 There was a recent change to TPV, not yet released, that will always construct a JobDestination dynamically: https://github.com/galaxyproject/total-perspective-vortex/blob/1fb4f5be15651fbe6644e4c46eb87d938793574d/tpv/core/mapper.py#L73
With that change, TPV will no longer read destinations from job_conf.yml at all.
The only thing you will need to add is the runner parameter to existing TPV destinations.

I wonder whether that might solve this issue?
You could install it from source and use the migration guide here: https://github.com/galaxyproject/total-perspective-vortex/blob/main/MIGRATION.rst

@nuwang
Member

nuwang commented Jan 12, 2023

@pcm32 Does the latest TPV release solve the resubmission issue, for TPV at least?

@pcm32
Member Author

pcm32 commented Jan 16, 2023

Not yet, will try to get back to this soon :-).

@nuwang
Member

nuwang commented Jan 16, 2023

@bernt-matthias has confirmed that it still doesn't work: galaxyproject/total-perspective-vortex#78
That makes sense, because Galaxy appears to be caching the final, resolved destination. This will need to be fixed in Galaxy. I don't know what the original reasons for caching things this way are. If it's just performance, could we probably stop caching it? ping @natefoo @jmchilton

@natefoo
Member

natefoo commented Jan 18, 2023

Yes, resolving a dynamic destination is a potentially expensive operation, and it would be performed on every loop of the job queue thread until the job is ready and dispatched, which is why dynamic destinations are cached.
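
As a generic illustration of that rationale (not Galaxy's implementation): the queue thread keeps asking for the job's destination while it waits, so without a cache the expensive resolver would run on every iteration instead of once per job.

import functools
import time

def expensive_dynamic_resolver(job_id):
    # Stand-in for evaluating a dynamic destination (rule modules, TPV, ...).
    time.sleep(0.5)
    return f"resolved-destination-for-{job_id}"

@functools.lru_cache(maxsize=None)
def cached_resolver(job_id):
    return expensive_dynamic_resolver(job_id)

# The job queue loop may ask many times before the job is dispatched;
# with the cache, the resolver actually runs only once for this job.
for _ in range(100):
    cached_resolver("job-42")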

@bernt-matthias
Contributor

Thanks for that insight - really good to know.

My guess is that the caching itself is not the main problem, but that some properties (like the resubmit destination, env variables, ...?) are not persisted in the database:

# resubmits are not persisted (it's a good thing) so they

So when the job is loaded on the 1st resubmission, the resubmit destination is lost and it won't be resubmitted a 2nd time.
Some more thoughts might be hidden in #9747. But if this explanation is correct, I do not yet understand why the 1st resubmission works.
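
A small model of that hypothesis (illustrative dicts only; Galaxy persists jobs through SQLAlchemy models, not like this): the job row keeps the destination id and params but not the resubmit definitions, so a destination rebuilt from the database after the first resubmission has nothing left to resubmit on.

# In-memory destination as built from the config / a dynamic rule:
in_memory = {
    "id": "extra_mem",
    "params": {"requests_memory": "7Gi"},
    "resubmit": [{"condition": "memory_limit_reached", "destination": "extra_mem"}],
}

# What, per this hypothesis, gets persisted with the job:
persisted = {
    "destination_id": in_memory["id"],
    "destination_params": in_memory["params"],
    # no resubmit definitions stored
}

# Destination as reconstructed when handling the resubmitted job:
reloaded = {
    "id": persisted["destination_id"],
    "params": persisted["destination_params"],
    "resubmit": [],
}

print(reloaded["resubmit"])  # [] -> no 2nd resubmission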

@nuwang
Member

nuwang commented Jan 19, 2023

Thanks, that makes sense. However, is there a reason why the cache can't be ignored in the resubmit case? In fact, looking at the code, that appears to be exactly what it's doing:

new_destination = job_state.job_wrapper.job_runner_mapper.cache_job_destination(new_destination)

The way I understood it, calling cache_job_destination should cause the existing cache entry to be ignored and the destination to be cached anew. It's not very obvious to me why that isn't working.

@bernt-matthias
Contributor

is there a reason why the cache can't be ignored in the resubmit case

The reason was given by @natefoo in #15208 (comment)... but I guess you meant to ignore it only once, i.e. re-cache: call the dynamic resolver once more and then use the cached result on subsequent accesses.

You are right, the code reads like this...

My approach would be [...] in order to understand what is going on.
