Subsequent resubmissions lose destination's resubmissions definitions, breaking subsequent resubmissions on the same destination #15208
Comments
I have tried removing various pieces of the L70 - L89 block quoted above, but it either breaks with exceptions or doesn't work at all. I wonder if the resubmission state handler could do just the minimal work, so that the normal handler then resolves dynamic destinations and everything else as if it were a first submission. I think the issue here is that the resubmit code handles destinations differently from the original handler.
Xref #9747. I think it would be best to have a test case in the first place, to show how it should work. Some fragments are in the linked PR. As far as I remember, I played around with the caching in the last line of the linked code fragment. My impression was that the main reason is here: galaxy/lib/galaxy/jobs/handler.py Line 215 in 737775f
(not sure if the code fragment is still valid)
I can confirm that when using independent destinations to hop on from resubmission to resubmission this works. For instance:
so there it starts with the TPV dispatcher, but then goes on and on through the
So this example works, i.e
Most likely. TPV is simply a dynamic runner, so I would have been surprised if it worked.
So I have done in other settings
But
@pcm32 There was a recent change to TPV, not yet released, that will always construct a JobDestination dynamically: https://github.com/galaxyproject/total-perspective-vortex/blob/1fb4f5be15651fbe6644e4c46eb87d938793574d/tpv/core/mapper.py#L73 I wonder whether that might solve this issue?
@pcm32 Does the latest TPV release solve the resubmission issue, for TPV at least?
Not yet, will try to get back to this soon :-).
@bernt-matthias has confirmed that it still doesn't work: galaxyproject/total-perspective-vortex#78
Yes, resolving a dynamic destination is a potentially expensive operation, and it would be performed on every loop of the job queue thread until the job is ready and dispatched, which is why dynamic destinations are cached.
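To make the caching argument concrete, here is a minimal sketch of the idea. It is not Galaxy's implementation: `DestinationCache`, `expensive_rule`, and the `recache` flag are hypothetical names used only for illustration. It assumes the queue loop asks for a job's destination on every iteration, and that a resubmission may want to ignore the cached entry exactly once.

```python
from typing import Callable, Dict


class DestinationCache:
    """Toy per-job cache for dynamically resolved destinations (hypothetical)."""

    def __init__(self, resolver: Callable[[int], str]) -> None:
        self._resolver = resolver
        self._cache: Dict[int, str] = {}
        self.resolver_calls = 0  # counts how often the expensive rule runs

    def get(self, job_id: int, recache: bool = False) -> str:
        # Resolve only on a cache miss, or when the caller explicitly asks
        # for the existing entry to be ignored and replaced.
        if recache or job_id not in self._cache:
            self.resolver_calls += 1
            self._cache[job_id] = self._resolver(job_id)
        return self._cache[job_id]


def expensive_rule(job_id: int) -> str:
    # Stand-in for a dynamic rule that might query a database or a cluster.
    return f"cluster_for_job_{job_id}"


cache = DestinationCache(expensive_rule)
for _ in range(5):  # five iterations of the queue loop for a waiting job
    cache.get(42)
calls_after_polling = cache.resolver_calls

cache.get(42, recache=True)  # force one fresh resolution, then cache again
calls_after_recache = cache.resolver_calls
```

In this sketch the rule runs once for the five polling iterations and once more for the forced re-cache, which is the behaviour one would hope for in the resubmit case.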
Thanks for that insight - really good to know. My guess is that the caching itself is not the main problem, but that some properties (like the resubmit destination, env variables, ...?) are not persisted in the database: galaxy/lib/galaxy/jobs/handler.py Line 215 in 737775f
so when the job is loaded on the first resubmission, the resubmit destination is lost and the job won't be resubmitted a second time.
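The guess above can be illustrated with a small round-trip model. This is only a sketch of the hypothesis, not Galaxy's code: `ToyDestination`, `persist`, and `load` are made-up names, and the assumption that only the id, runner, and params are stored is exactly the part that would need to be verified against the real handler.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ToyDestination:
    """Hypothetical stand-in for a job destination."""

    id: str
    runner: str
    params: Dict[str, Any] = field(default_factory=dict)
    resubmit: List[Dict[str, str]] = field(default_factory=list)


def persist(dest: ToyDestination) -> Dict[str, Any]:
    # ASSUMPTION: only these three fields reach the database row.
    return {"id": dest.id, "runner": dest.runner, "params": dict(dest.params)}


def load(row: Dict[str, Any]) -> ToyDestination:
    # Rebuilding from the row cannot restore what was never stored.
    return ToyDestination(id=row["id"], runner=row["runner"], params=dict(row["params"]))


original = ToyDestination(
    id="cluster",
    runner="slurm",
    params={"mem": 4},
    resubmit=[{"condition": "memory_limit_reached", "destination": "cluster"}],
)
reloaded = load(persist(original))

can_resubmit_before = bool(original.resubmit)  # first resubmission is possible
can_resubmit_after = bool(reloaded.resubmit)   # nothing left for a second one
```

If the hypothesis holds, a test case along these lines (persist, reload, check that the resubmit definitions survive) would pin down where the information is dropped.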
Thanks, that makes sense. However, is there a reason why the cache can't be ignored in the resubmit case? In fact, looking at the code, that appears to be exactly what it's doing:
cache_job_destination should cause the existing cache entry to be ignored, and the destination to be cached anew. It's not very obvious to me why that isn't working.
The reason has been given by @natefoo in #15208 (comment). But I guess you meant to ignore the cache only once, i.e. re-cache, in order to call the dynamic resolver once and then use the cached result in subsequent accesses. You are right, the code reads like this. My approach would be
in order to understand what is going on.
Describe the bug
When a destination sets itself as the resubmission destination, the resubmission definition gets lost on the first resubmission, which disables subsequent resubmissions. The information is lost in the various resubmission assignments here:
galaxy/lib/galaxy/jobs/runners/state_handlers/resubmit.py
Lines 70 to 89 in 35bf24a
Of course, the above treatment doesn't happen on first submission, which explains why the first resubmission works.
Galaxy Version and/or server at which you observed the bug
Galaxy Version: 22.05
Commit: 3b068ee
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Subsequent resubmissions (in this case with escalating memory allowance) should occur.
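As a rough model of the expected behaviour, the sketch below retries a job on the same destination with a growing memory allowance until it fits or a retry limit is hit. All names and numbers here (`run_with_resubmission`, the doubling factor, the 14 GB requirement) are illustrative assumptions, not taken from the actual configuration.

```python
from typing import List, Tuple


def run_with_resubmission(
    required_gb: int, start_gb: int, factor: int = 2, max_attempts: int = 4
) -> Tuple[List[int], bool]:
    """Return the memory allowance used on each attempt and whether the job fit."""
    attempts: List[int] = []
    mem = start_gb
    for _ in range(max_attempts):
        attempts.append(mem)
        if mem >= required_gb:
            return attempts, True  # job fits, no further resubmission needed
        mem *= factor  # escalate and resubmit to the same destination
    return attempts, False


# A job needing 14 GB, first submitted with 4 GB.
attempts, succeeded = run_with_resubmission(required_gb=14, start_gb=4)
```

Here the job would run with 4, 8, and then 16 GB. In the behaviour reported in this issue, the chain instead stops after the first resubmission.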
Additional context
I have seen subsequent resubmissions work when the destination changes on each resubmission. Here we are trying subsequent resubmissions on the same destination.
galaxyproject/total-perspective-vortex#65