No such file or directory: '/runner/artifacts/2975782/job_events' #15657

Open
5 of 11 tasks
stefjakobs opened this issue Nov 22, 2024 · 0 comments

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.
  • I am NOT reporting a (potential) security vulnerability. (These should be emailed to [email protected] instead.)

Bug Summary

Sometimes jobs fail with the following error:

<10.193.67.10> ESTABLISH SSH CONNECTION FOR USER: root
<10.193.67.10> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o Port=60813 -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="root"' -o ConnectTimeout=10 -o 'ControlPath="/runner/cp/e9c44f7b2f"' 10.193.67.10 '/bin/sh -c '"'"'echo ~root && sleep 0'"'"''
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/ansible/executor/task_queue_manager.py", line 465, in send_callback
    method(*new_args, **kwargs)
  File "/runner/artifacts/2975782/callback/awx_display.py", line 630, in v2_playbook_on_stats
  File "/usr/lib64/python3.11/contextlib.py", line 137, in __enter__
    return next(self.gen)
           ^^^^^^^^^^^^^^
  File "/runner/artifacts/2975782/callback/awx_display.py", line 374, in capture_event_data
  File "/runner/artifacts/2975782/callback/awx_display.py", line 239, in dump_begin
  File "/runner/artifacts/2975782/callback/awx_display.py", line 111, in set
FileNotFoundError: [Errno 2] No such file or directory: '/runner/artifacts/2975782/job_events'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/ansible/cli/__init__.py", line 659, in cli_executor
    exit_code = cli.run()
                ^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/ansible/cli/playbook.py", line 156, in run
    results = pbex.run()
              ^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/ansible/executor/playbook_executor.py", line 246, in run
    self._tqm.send_callback('v2_playbook_on_stats', self._tqm._stats)
  File "/usr/local/lib/python3.11/site-packages/ansible/utils/lock.py", line 41, in inner
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/ansible/executor/task_queue_manager.py", line 468, in send_callback
    display.warning(u"Failure using method (%s) in callback plugin (%s): %s" % (to_text(method_name), to_text(callback_plugin), to_text(e)))
  File "/runner/artifacts/2975782/callback/awx_display.py", line 256, in wrapper
  File "/usr/local/lib/python3.11/site-packages/ansible/utils/display.py", line 134, in proxyit
    return method(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/ansible/utils/display.py", line 539, in warning
    self.display(new_msg, color=C.COLOR_WARN, stderr=True)
  File "/runner/artifacts/2975782/callback/awx_display.py", line 304, in wrapper
  File "/runner/artifacts/2975782/callback/awx_display.py", line 239, in dump_begin
  File "/runner/artifacts/2975782/callback/awx_display.py", line 111, in set
FileNotFoundError: [Errno 2] No such file or directory: '/runner/artifacts/2975782/job_events'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/bin/ansible-playbook", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.11/site-packages/ansible/cli/playbook.py", line 240, in main
    PlaybookCLI.cli_executor(args)
  File "/usr/local/lib/python3.11/site-packages/ansible/cli/__init__.py", line 687, in cli_executor
    display.error("Unexpected Exception, this is probably a bug: %s" % to_text(e), wrap_text=False)
  File "/runner/artifacts/2975782/callback/awx_display.py", line 256, in wrapper
  File "/usr/local/lib/python3.11/site-packages/ansible/utils/display.py", line 594, in error
    self.display(new_msg, color=C.COLOR_ERROR, stderr=True)
  File "/runner/artifacts/2975782/callback/awx_display.py", line 304, in wrapper
  File "/runner/artifacts/2975782/callback/awx_display.py", line 239, in dump_begin
  File "/runner/artifacts/2975782/callback/awx_display.py", line 111, in set
FileNotFoundError: [Errno 2] No such file or directory: '/runner/artifacts/2975782/job_events'
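
The traceback suggests that the callback writes its event data into the job_events directory, and that the follow-up display.warning and display.error calls go through the same wrapper and fail the same way. Below is a minimal sketch of that failure mode, assuming the directory is removed by something outside the playbook while the job is still running. It is not AWX code: write_event and reproduce are hypothetical names, and the paths live in a temporary directory.

```python
import errno
import os
import shutil
import tempfile


def write_event(events_dir: str, name: str, payload: str) -> None:
    """Write one event file into events_dir (hypothetical stand-in for the callback's write)."""
    path = os.path.join(events_dir, name)
    with open(path, "w") as fh:
        fh.write(payload)


def reproduce() -> int:
    """Return the errno raised when the events directory disappears mid-run, or 0 if none."""
    base = tempfile.mkdtemp()
    events_dir = os.path.join(base, "artifacts", "2975782", "job_events")
    os.makedirs(events_dir)
    try:
        # Writing works while the directory exists.
        write_event(events_dir, "1.json", "{}")
        # Simulate an external cleanup of the artifacts directory.
        shutil.rmtree(os.path.join(base, "artifacts"))
        # The next write fails because the parent directory is gone.
        write_event(events_dir, "2.json", "{}")
    except FileNotFoundError as exc:
        return exc.errno
    finally:
        shutil.rmtree(base, ignore_errors=True)
    return 0


if __name__ == "__main__":
    print(reproduce() == errno.ENOENT)
```

If this matches what happens, the error is a symptom of the directory being removed, not of the callback plugin itself.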

In other cases the job simply hangs without making any progress, and the private-data-dir on the execution node is missing: the /runner directory that is mapped into the podman container is empty. In these cases ansible-playbook processes are still running inside the container:

root           1  0.0  0.0   2496   940 ?        Ss   Nov21   0:00 dumb-init ssh-agent sh -c trap 'rm -f /runner/artifacts/2984819/ssh_key_data' EXIT && ssh-add /runner/artifacts/2984819/ssh_key_data && rm -f /runner/artifacts/2984819/ssh_key_data && ansible-playbook -u root --diff --ask-vault-pass --forks=8 -l myhosts -e @/runner/env/tmp4shnbyed -i /runner/inventory/hosts -e @/runner/env/extravars playbooks/myplay.yml
root          10  0.0  0.0   4320  3512 pts/0    Ss+  Nov21   0:00 sh -c trap 'rm -f /runner/artifacts/2984819/ssh_key_data' EXIT && ssh-add /runner/artifacts/2984819/ssh_key_data && rm -f /runner/artifacts/2984819/ssh_key_data && ansible-playbook -u root --diff --ask-vault-pass --forks=8 -l myhosts -e @/runner/env/tmp4shnbyed -i /runner/inventory/hosts -e @/runner/env/extravars playbooks/myplay.yml
root          11  0.0  0.0   9732  4452 ?        Ss   Nov21   0:05  \_ ssh-agent sh -c trap 'rm -f /runner/artifacts/2984819/ssh_key_data' EXIT && ssh-add /runner/artifacts/2984819/ssh_key_data && rm -f /runner/artifacts/2984819/ssh_key_data && ansible-playbook -u root --diff --ask-vault-pass --forks=8 -l myhosts -e @/runner/env/tmp4shnbyed -i /runner/inventory/hosts -e @/runner/env/extravars playbooks/myplay.yml
root          14  0.4  1.2 474940 400912 pts/0   Sl+  Nov21   3:21  \_ /usr/bin/python3.11 /usr/local/bin/ansible-playbook -u root --diff --ask-vault-pass --forks=8 -l myhosts -e @/runner/env/tmp4shnbyed -i /runner/inventory/hosts -e @/runner/env/extravars playbooks/myplay.yml
root        5734  0.0  1.1 474940 388156 pts/0   S+   Nov21   0:00      \_ /usr/bin/python3.11 /usr/local/bin/ansible-playbook -u root --diff --ask-vault-pass --forks=8 -l myhosts -e @/runner/env/tmp4shnbyed -i /runner/inventory/hosts -e @/runner/env/extravars playbooks/myplay.yml
root        5735  0.0  0.0      0     0 pts/0    Z+   Nov21   0:00      \_ [ansible-playboo] <defunct>
root        5736  0.0  1.1 474940 390604 pts/0   S+   Nov21   0:00      \_ /usr/bin/python3.11 /usr/local/bin/ansible-playbook -u root --diff --ask-vault-pass --forks=8 -l myhosts -e @/runner/env/tmp4shnbyed -i /runner/inventory/hosts -e @/runner/env/extravars playbooks/myplay.yml
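
To spot such jobs on an execution node, a small helper could scan `ps aux`-style output for zombie processes like PID 5735 above. This is only a sketch: find_defunct is a hypothetical name, and it assumes the standard eleven-column layout (USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND).

```python
from typing import List


def find_defunct(ps_output: str) -> List[int]:
    """Return the PIDs of zombie processes (STAT starting with 'Z') in `ps aux`-style output."""
    pids = []
    for line in ps_output.splitlines():
        # Split into at most 11 fields so the COMMAND column stays in one piece.
        fields = line.split(None, 10)
        # Expected columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # skip the header line and malformed lines
        if fields[7].startswith("Z"):
            pids.append(int(fields[1]))
    return pids
```

The function only parses text, so it can be fed the captured output of `ps aux` from inside the container or a listing pasted from a log.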

If a timeout is configured on the AWX job, the hung job then fails with "Failed to JSON parse a line from worker stream".

AWX version

24.6.1

Select the relevant components

  • UI
  • UI (tech preview)
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

docker development environment

Modifications

no

Ansible version

9

Operating system

Debian 12

Web browser

Firefox

Steps to reproduce

I don't know how to reproduce this reliably. The error seems to appear more often on longer-running jobs: sometimes a job fails after 15 minutes, sometimes after 150 minutes. The number of slices and forks makes no difference; even with 1 fork and 1 slice a job might fail. It does seem, however, that the playbook has to do some file lookups or use delegate_to localhost to read files for the error to occur.

Expected results

Job should succeed.

Actual results

Job fails with "file not found" error.

Additional information

I first suspected ansible-runner, because it is executed with the --delete flag and might for some reason clean up the private-data-dir. So I tried ansible-runner 2.3.6 on the exec nodes, but that didn't change anything.
Then I downgraded receptor to 1.4.1 on the exec nodes, but that didn't make a difference either.

The "file not found" error and the defunct ansible-playbook process are the only hints I have so far. I don't see anything else in the logs.
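
One way to narrow this down would be to record the moment the directory disappears and compare that timestamp with the receptor and ansible-runner logs. A minimal polling sketch follows. wait_until_missing is a hypothetical helper, and the path passed to it has to be the location of the job's private-data-dir on the execution node, which I am not assuming here.

```python
import os
import time
from typing import Optional


def wait_until_missing(path: str, interval: float = 1.0,
                       max_checks: int = 3600) -> Optional[float]:
    """Poll path and return the Unix time at which it was first found missing.

    Returns None if the directory still exists after max_checks polls.
    """
    for _ in range(max_checks):
        if not os.path.isdir(path):
            return time.time()
        time.sleep(interval)
    return None
```

Polling only gives a resolution of one interval, but that should be enough to tell whether the removal coincides with a cleanup run or with another job starting or finishing on the same node.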
