Get thread and task backtraces before terminating a worker on timeout #157

Closed
wants to merge 13 commits into from

Collaborator

@kpamnany kpamnany commented May 6, 2024

Implements the alternative described in #105 as an option.

Alternatively, we could dump julia task backtraces and/or CPU stacktraces.

I think a similar incantation can be assembled for lldb, to support Mac. We could support FreeBSD/OpenBSD since they have gdb, but I don't have a machine that would let me test. Note that Sys.isbsd() returns true for Mac.
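
Because `Sys.isbsd()` is also true on macOS, a platform dispatch along these lines would need to test for macOS before the generic BSD case. A minimal sketch, assuming a hypothetical helper `backtrace_debugger` (not part of this PR) and treating the lldb and BSD branches as untested ideas rather than supported paths:

```julia
# Hypothetical helper, not from the PR: choose a debugger by platform.
# Only the Linux/gdb path is what the PR actually implements.
function backtrace_debugger()
    if Sys.islinux()
        return :gdb
    elseif Sys.isapple()
        # Must be checked before Sys.isbsd(), which is also true on macOS.
        return :lldb
    elseif Sys.isbsd()
        # FreeBSD/OpenBSD: gdb exists there, but this path is untested.
        return :gdb
    else
        return nothing  # no debugger-based backtraces on other platforms
    end
end

println(backtrace_debugger())
```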

Closing comment.

kpamnany added 3 commits May 6, 2024 15:42
…eout

* Introduce `timeout_backtraces` to control timeout-triggered thread+task
  backtraces
To also ignore SIGUSR2 (used by Julia to pause a thread). Also, redirect
GDB output to a file: `gdb.btall`, otherwise thread backtraces are
dumped to the master process' `stdout`.
@kpamnany kpamnany force-pushed the kp-backtrace-timeouts branch from 5d64274 to 34001f5 Compare May 6, 2024 20:42
Collaborator Author

kpamnany commented May 6, 2024

CI doesn't seem to have gdb so the tests aren't running. They pass locally, FWIW.

Collaborator

@nickrobinson251 nickrobinson251 left a comment

CI doesn't seem to have gdb so the tests aren't running

I think this is worth investigating a little before we merge this

also left a few minor suggestions

src/workers.jl Outdated
function trigger_backtraces(w::Worker, from::Symbol=:manual)
if Sys.islinux()
@debug "using GDB to get thread and task backtraces on worker $(w.pid) from $from"
gdb_cmd = `gdb -ex "handle SIGSEGV noprint nostop pass" -ex "handle SIGUSR2 noprint nostop pass" -ex "set pagination 0" -ex "set logging overwrite on" -ex "set logging file gdb.btall" -ex "set logging redirect on" -ex "set logging enabled on" -ex "thread apply all bt" -ex "set logging enabled off" -ex "call jl_print_task_backtraces(1)" --batch -p $(w.pid)`
Collaborator

there is a lot going on here... i think we need a comment (maybe a long one)

perhaps we write this out over multiple lines so we can add comments about blocks of -ex commands, like

Suggested change
gdb_cmd = `gdb -ex "handle SIGSEGV noprint nostop pass" -ex "handle SIGUSR2 noprint nostop pass" -ex "set pagination 0" -ex "set logging overwrite on" -ex "set logging file gdb.btall" -ex "set logging redirect on" -ex "set logging enabled on" -ex "thread apply all bt" -ex "set logging enabled off" -ex "call jl_print_task_backtraces(1)" --batch -p $(w.pid)`
gdb_cmd = Cmd([
    "gdb",
    # comment here about SIGSEGV and SIGUSR2
    "-ex", "handle SIGSEGV noprint nostop pass",
    "-ex", "handle SIGUSR2 noprint nostop pass",
    "-ex", "set pagination 0",
    "-ex", "set logging overwrite on",
    "-ex", "set logging file gdb.btall",
    "-ex", "set logging redirect on",
    # comment here about getting traces for all threads
    "-ex", "set logging enabled on",
    "-ex", "thread apply all bt",
    "-ex", "set logging enabled off",
    # comment here...
    "-ex", "call jl_print_task_backtraces(1)",
    "--batch", "-p", "$(w.pid)"
])

e.g. i'm thinking a block like this needs explaining (probably not to a gdb connoisseur but to some of us less familiar with this wizardry):

-ex "set logging enabled on" -ex "thread apply all bt" -ex "set logging enabled off"
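
For readers less familiar with gdb, here is one way the placeholder comments might be filled in. This is a sketch, not the comments that were eventually committed: the helper name `annotated_gdb_cmd` is hypothetical, the option descriptions follow my reading of the GDB manual, and the note on where `jl_print_task_backtraces` prints is an assumption. The `set logging enabled` spelling is only accepted by newer GDB releases.

```julia
# Hypothetical helper (name not from the PR): builds, but does not run, the
# gdb command from the suggestion above, with one comment per group of options.
function annotated_gdb_cmd(pid::Integer)
    return Cmd([
        "gdb",
        # Julia uses these signals internally (SIGUSR2 to pause a thread, per
        # the commit message), so gdb should neither stop on them nor report
        # them, and should pass them through to the worker.
        "-ex", "handle SIGSEGV noprint nostop pass",
        "-ex", "handle SIGUSR2 noprint nostop pass",
        # Never wait for a keypress between pages of output.
        "-ex", "set pagination 0",
        # Log to gdb.btall, replacing any previous file, and send output only
        # to that file instead of also echoing it to the terminal.
        "-ex", "set logging overwrite on",
        "-ex", "set logging file gdb.btall",
        "-ex", "set logging redirect on",
        # Logging is switched on only around the command whose output we
        # want: a backtrace of every OS thread in the worker.
        "-ex", "set logging enabled on",
        "-ex", "thread apply all bt",
        "-ex", "set logging enabled off",
        # Call into the worker's Julia runtime to print task backtraces.
        # Assumption: this runs inside the worker, so the output goes to the
        # worker's own streams rather than to gdb's log.
        "-ex", "call jl_print_task_backtraces(1)",
        # Exit once all commands have run; attach to the worker by PID.
        "--batch", "-p", string(pid),
    ])
end

println(annotated_gdb_cmd(12345))
```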

test/integrationtests.jl
@testset "Backtraces timeout trigger" begin
function gdb_available()
try
run(`gdb -ex "exit"`)
Collaborator

is there a way to run CI on machines with gdb available? why don't the linux CI machines have it?

Collaborator Author

Yes, we need a step in the CI workflow that has:

      - name: Install GDB
        run: sudo apt-get install -y gdb

But CI.yml is written in a platform-independent way and I haven't figured out how to add that step for Ubuntu only. If you know, that would be great.

Collaborator

@nickrobinson251 nickrobinson251 May 7, 2024

yeah, i think that just needs an if: line like

- name: Install GDB
  if: matrix.os == 'ubuntu-latest'
  run: sudo apt-get install -y gdb

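not certain on the syntax; if the bare expression doesn't work, the whole condition might need to be wrapped, like

- name: Install GDB
  if: ${{ matrix.os == 'ubuntu-latest' }}
  run: sudo apt-get install -y gdb

(wrapping only part of it, as in `if: ${{ matrix.os }} == 'ubuntu-latest'`, would not do what we want: the result is a non-empty string, which is always truthy)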

Collaborator Author

I added it, but the CI machine does not allow gdb to attach to the process:

Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Inappropriate ioctl for device.

So I've backed it out. 🤷‍♂️
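
For anyone hitting the same error, the setting named in the message can be inspected from Julia before trying to attach. A minimal sketch, assuming hypothetical helper names (`ptrace_scope`, `unprivileged_attach_allowed`) that are not part of this PR. With the Yama value 1, a process may only trace its own descendants, and here gdb and the worker are siblings, which would explain the failure:

```julia
# Path named in the gdb error message above.
const PTRACE_SCOPE_PATH = "/proc/sys/kernel/yama/ptrace_scope"

# Hypothetical helper: returns the Yama ptrace scope as an Int, or `nothing`
# if the file does not exist (e.g. non-Linux) or cannot be parsed.
function ptrace_scope(path::AbstractString=PTRACE_SCOPE_PATH)
    isfile(path) || return nothing
    return tryparse(Int, strip(read(path, String)))
end

# Hypothetical helper: rough guess at whether an unprivileged gdb can attach
# to a same-user process that is not its descendant. Only scope 0 allows it;
# a missing file is treated as "no Yama restriction" (an assumption).
unprivileged_attach_allowed(scope) = scope === nothing || scope == 0

scope = ptrace_scope()
println("ptrace_scope = ", scope, ", attach likely allowed: ",
        unprivileged_attach_allowed(scope))
```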

kpamnany added 4 commits May 8, 2024 13:43
Thread backtraces are displayed on the GDB process' standard out
rather than on the worker's standard out. Use an IOBuffer on the main
process to capture the GDB process' output (instead of using a file) and
dump its contents after the captured logs.

Also add some comments per PR review comments.
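
The capture approach described in the commit message above can be sketched as follows. This is an illustration rather than the code in the commit: `capture_stdout` is a hypothetical name, and a short-lived Julia child process stands in for gdb so that the example runs without a debugger installed.

```julia
# Hypothetical helper: run `cmd`, collecting its standard output in an
# in-memory buffer on the parent process instead of a file.
function capture_stdout(cmd::Cmd)
    buf = IOBuffer()
    # ignorestatus: a non-zero exit code should not throw, since we still
    # want whatever output was produced.
    run(pipeline(ignorestatus(cmd); stdout=buf))
    return String(take!(buf))
end

# Stand-in for the gdb invocation: a child process that prints one line.
child = `$(Base.julia_cmd()) --startup-file=no -e 'println("fake backtrace")'`
out = capture_stdout(child)
print(out)  # the caller decides when to dump it, e.g. after the captured logs
```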
This reverts commit b923167.
@kpamnany kpamnany force-pushed the kp-backtrace-timeouts branch from 0dce130 to d9bf1d3 Compare May 8, 2024 19:38
@kpamnany kpamnany force-pushed the kp-backtrace-timeouts branch from 1038ce5 to 4836727 Compare May 9, 2024 20:57
Collaborator Author

kpamnany commented May 9, 2024

After much testing, it turns out that gdb is actually pretty terrible when the number of threads in the process is large. And maybe also when you run multiple gdbs at once?

In my latest (still running) test:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
10695 kpamnany  20   0 7165224   6.0g 208860 R 100.3  20.4  87:57.87 gdb
10712 kpamnany  20   0 7259628   6.1g 208852 R 100.3  20.7  86:42.86 gdb
    1 root      20   0    2276   1416   1412 S   0.0   0.0   0:00.10 init(Ubuntu)

Whatever the cause, using gdb is not practical and we'll need another way to get thread/task backtraces from Julia. I'm closing this, won't work.

@kpamnany kpamnany closed this May 9, 2024