Resubmission with OOM doesn't seem to be kicked off #65

Open
pcm32 opened this issue Dec 13, 2022 · 7 comments

pcm32 commented Dec 13, 2022

On my k8s setup, having fixed the OOM labeling, I use the following TPV setup:

  /galaxy/server/config/persist-seq-extra-tpv.yaml:
    applyToJob: true
    applyToWeb: true
    applyToSetupJob: false
    applyToWorkflow: true
    applyToNginx: false
    tpl: false
    content: |
      global:
        default_inherits: default
      tools:
        default:
          mem: 6
          env:
            OOM_TOOL_MEMORY: "{mem * int(job.destination_params.get('SCALING_FACTOR', 1)) if job.destination_params else 1}"
          params:
            SCALING_FACTOR: "{2 * int(job.destination_params.get('SCALING_FACTOR', 2)) if job.destination_params else 2}"
        resubmit:
          with_more_mem_on_failure:
            condition: memory_limit_reached and attempt <= 3
            destination: tpv_dispatcher

This loosely follows one of the test cases here.

It goes on top of the existing default from the Helm chart:

  job_conf.yml:
    runners:
      k8s:
        k8s_extra_job_envs:
          HDF5_USE_FILE_LOCKING: 'FALSE'
    execution:
      environments:
        tpv_dispatcher:
          tpv_config_files:
            - https://raw.githubusercontent.com/galaxyproject/tpv-shared-database/main/tools.yml
            - lib/galaxy/jobs/rules/tpv_rules_local.yml
            - /galaxy/server/config/persist-seq-extra-tpv.yaml
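
For reference, here is how I would expect those two expressions to evaluate across attempts, as a plain-Python sketch outside of TPV (expected_params is just a hypothetical helper mirroring the expressions; it assumes the previous attempt's params come back via job.destination_params):

    def expected_params(destination_params, base_mem=6):
        """Mirror of the OOM_TOOL_MEMORY and SCALING_FACTOR expressions above."""
        if destination_params:
            oom_tool_memory = base_mem * int(destination_params.get("SCALING_FACTOR", 1))
            scaling_factor = 2 * int(destination_params.get("SCALING_FACTOR", 2))
        else:
            oom_tool_memory = 1  # note: the else-branch yields 1, not base_mem
            scaling_factor = 2
        return {"OOM_TOOL_MEMORY": str(oom_tool_memory),
                "SCALING_FACTOR": str(scaling_factor)}

    first = expected_params(None)
    # -> {'OOM_TOOL_MEMORY': '1', 'SCALING_FACTOR': '2'}, matching the log below
    second = expected_params(first)
    # -> {'OOM_TOOL_MEMORY': '12', 'SCALING_FACTOR': '4'} on a resubmission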

On run, I see the following DEBUG output from TPV:

tpv.core.entities DEBUG 2022-12-13 12:56:45,941 [pN:job_handler_0,p:8,tN:JobHandlerQueue.monitor_thread] Ranking destinations: [<class 'tpv.core.entities.Destination'> id=k8s, cores=None, mem=None, gpus=None, env=None, params={'limits_cpu': '{cores}', 'limits_memory': '{mem}Gi', 'requests_cpu': '{cores}', 'requests_memory': '{mem}Gi'}, resubmit=None, tags=<class 'tpv.core.entities.TagSetManager'> tags=[<Tag: name=scheduling, value=docker, type=TagType.ACCEPT>], rank=, inherits=None, context=None, rules={}] for entity: <class 'tpv.core.entities.Tool'> id=default, cores=1, mem=6, gpus=None, env={'OOM_TOOL_MEMORY': '1'}, params={'container_monitor': False, 'docker_default_container_id': 'quay.io/galaxyproject/galaxy-min:22.05', 'docker_enabled': 'true', 'tmp_dir': 'true', 'SCALING_FACTOR': '2'}, resubmit={}, tags=<class 'tpv.core.entities.TagSetManager'> tags=[<Tag: name=scheduling, value=local, type=TagType.REJECT>, <Tag: name=scheduling, value=offline, type=TagType.REJECT>], rank=helpers.we, inherits=None, context={}, rules={} using custom function

which looks fine, I guess, besides the fact that resubmit is None or {} depending on where you look. But then I see something more worrying:

galaxy.security.object_wrapper WARNING 2022-12-13 12:56:46,202 [pN:job_handler_0,p:8,tN:KubernetesRunner.work_thread-0] Unable to create dynamic subclass SafeStringWrapper(galaxy.model.none_like.None:<class 'NoneType'>,<class 'NotImplementedType'>,<class 'bool'>,<class 'bytearray'>,<class 'ellipsis'>,<class 'galaxy.security.object_wrapper.SafeStringWrapper'>,<class 'galaxy.tools.wrappers.ToolParameterValueWrapper'>,<class 'numbers.Number'>) for <class 'galaxy.model.none_like.NoneDataset'>, None: type() doesn't support MRO entry resolution; use types.new_class()

which I suspect is related. In Galaxy, the job shows the OOM message in the UI ("Tool failed due to insufficient memory. Try with more memory."). Any idea what might be going wrong? Is the DEBUG output showing what one would expect? Thanks.

I also tried the setup from the readthedocs documentation; it didn't work for me either. I will post those results here as well.

pcm32 commented Dec 13, 2022

When I stick to the example in the readthedocs documentation (only changing some of the default integers):

  /galaxy/server/config/persist-seq-extra-tpv.yaml:
    applyToJob: true
    applyToWeb: true
    applyToSetupJob: false
    applyToWorkflow: true
    applyToNginx: false
    tpl: false
    content: |
      global:
        default_inherits: default
      tools:
        default:
          mem: 5 * int(job.destination_params.get('SCALING_FACTOR', 1)) if job.destination_params else 5
          params:
            SCALING_FACTOR: "{2 * int(job.destination_params.get('SCALING_FACTOR', 2)) if job.destination_params else 2}"
        resubmit:
          with_more_mem_on_failure:
            condition: memory_limit_reached and attempt <= 3
            destination: tpv_dispatcher

I get more or less the same messages in the logs, with slight variations:

tpv.core.entities DEBUG 2022-12-13 13:21:32,151 [pN:job_handler_0,p:8,tN:JobHandlerQueue.monitor_thread] Ranking destinations: [<class 'tpv.core.entities.Destination'> id=k8s, cores=None, mem=None, gpus=None, env=None, params={'limits_cpu': '{cores}', 'limits_memory': '{mem}Gi', 'requests_cpu': '{cores}', 'requests_memory': '{mem}Gi'}, resubmit=None, tags=<class 'tpv.core.entities.TagSetManager'> tags=[<Tag: name=scheduling, value=docker, type=TagType.ACCEPT>], rank=, inherits=None, context=None, rules={}] for entity: <class 'tpv.core.entities.Tool'> id=default, cores=1, mem=5, gpus=None, env={}, params={'container_monitor': False, 'docker_default_container_id': 'quay.io/galaxyproject/galaxy-min:22.05', 'docker_enabled': 'true', 'tmp_dir': 'true', 'SCALING_FACTOR': '2'}, resubmit={}, tags=<class 'tpv.core.entities.TagSetManager'> tags=[<Tag: name=scheduling, value=local, type=TagType.REJECT>, <Tag: name=scheduling, value=offline, type=TagType.REJECT>], rank=helpers.we, inherits=None, context={}, rules={} using custom function

again with resubmit either None or {} depending on where you look, and the same warning:

galaxy.security.object_wrapper WARNING 2022-12-13 13:21:32,445 [pN:job_handler_0,p:8,tN:KubernetesRunner.work_thread-1] Unable to create dynamic subclass SafeStringWrapper(galaxy.model.none_like.None:<class 'NoneType'>,<class 'NotImplementedType'>,<class 'bool'>,<class 'bytearray'>,<class 'ellipsis'>,<class 'galaxy.security.object_wrapper.SafeStringWrapper'>,<class 'galaxy.tools.wrappers.ToolParameterValueWrapper'>,<class 'numbers.Number'>) for <class 'galaxy.model.none_like.NoneDataset'>, None: type() doesn't support MRO entry resolution; use types.new_class()

pcm32 commented Dec 13, 2022

After hammering on this a bit more, I managed to get the resubmission working using:

      global:
        default_inherits: default
      tools:
        default:
          mem: 5 * int(job.destination_params.get('SCALING_FACTOR', 1)) if job.destination_params else 5
          params:
            SCALING_FACTOR: "{2 * int(job.destination_params.get('SCALING_FACTOR', 2)) if job.destination_params else 2}"
          rules: []
          resubmit:
            with_more_mem_on_failure:
              condition: memory_limit_reached and attempt <= 3
              destination: tpv_dispatcher
            on_failure:
              condition: any_failure and attempt <= 2
              destination: tpv_dispatcher

This filled in resubmit in the logs:

tpv.core.entities DEBUG 2022-12-13 13:55:25,490 [pN:job_handler_0,p:8,tN:JobHandlerQueue.monitor_thread] Ranking destinations: ...

resubmit={'with_more_mem_on_failure': {'condition': 'memory_limit_reached and attempt <= 3', 'destination': 'tpv_dispatcher'}, 'on_failure': {'condition': 'any_failure and attempt <= 2', 'destination': 'tpv_dispatcher'}}, tags=<class 'tpv.core.entities.TagSetManager'> 

Might be good to update the readthedocs slightly to reflect this difference (resubmit nested under the tool entry itself rather than as a sibling of it).

pcm32 commented Dec 13, 2022

OK, this is resubmitting, but only once...

pcm32 commented Dec 13, 2022

I remember someone once having issues when trying to do multiple resubmissions to the same destination every time. Does this work for you? In the past, all my resubmissions would go from destination A to destination B to destination C, never to the same one again... I think.

pcm32 commented Dec 13, 2022

It seems that this is more of a Galaxy problem than a TPV one. After the first resubmission, the resubmit_definition is empty: https://github.com/galaxyproject/galaxy/blob/release_22.05/lib/galaxy/jobs/runners/state_handlers/resubmit.py#L22

pcm32 commented Dec 13, 2022

So, on the first failure, the k8s destination does have job_state.job_destination.get("resubmit"); however, on the second go, the same destination no longer has that resubmit filled in. I suspect this must be happening in that same resubmit.py code...
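
To illustrate, here is a rough, self-contained sketch of that flow as I understand it (illustrative only, not Galaxy's actual code):

    def handle_failure(job_destination, condition_context):
        """Return the id of the destination to resubmit to, or None."""
        # First failure: the TPV-resolved destination still carries the
        # 'resubmit' block. Second failure: the destination attached to the
        # resubmitted job apparently does not, so this returns None and the
        # job simply fails.
        resubmit_definitions = job_destination.get("resubmit") or {}
        for definition in resubmit_definitions.values():
            # e.g. "memory_limit_reached and attempt <= 3"
            if eval(definition["condition"], {}, dict(condition_context)):
                return definition["destination"]
        return None

    # First failure: resubmit block present -> resubmission is triggered.
    first = {"resubmit": {"with_more_mem_on_failure": {
        "condition": "memory_limit_reached and attempt <= 3",
        "destination": "tpv_dispatcher"}}}
    print(handle_failure(first, {"memory_limit_reached": True, "attempt": 1}))  # tpv_dispatcher

    # Second failure: resubmit block gone -> no further resubmission.
    second = {"resubmit": None}
    print(handle_failure(second, {"memory_limit_reached": True, "attempt": 2}))  # None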

nuwang commented Dec 13, 2022

I'm trying to get the resubmit integration tests working again (https://github.com/galaxyproject/total-perspective-vortex/blob/main/tests/test_mapper_resubmit.py) to see whether we can test this exact scenario. However, it does look like these lines in the block you highlighted could cause issues in particular: https://github.com/galaxyproject/galaxy/blob/e67cb9b615d3f373fdf3d8534a6e5208f20e94b9/lib/galaxy/jobs/runners/state_handlers/resubmit.py#L87-L89

If I understood that comment correctly, it won't re-evaluate the dynamic code and will instead use the cached destination, which would explain this behaviour.
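
In other words, the suspicion is roughly this (a hypothetical, simplified sketch, not Galaxy's actual implementation):

    def destination_for_resubmission(cached_destinations, resolve_dynamic, destination_id):
        """Pick the destination a resubmitted job gets attached to."""
        if destination_id in cached_destinations:
            # The cached destination was resolved during a *previous* attempt.
            # Returning it skips re-running the dynamic rule (TPV here), so
            # anything the rule would attach on re-evaluation, including the
            # resubmit block, is never refreshed.
            return cached_destinations[destination_id]
        # A fresh lookup would re-evaluate the dynamic destination.
        return resolve_dynamic(destination_id)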
