diff --git a/docs/overview.rst b/docs/overview.rst index 703f26b0..7f813603 100644 --- a/docs/overview.rst +++ b/docs/overview.rst @@ -54,6 +54,7 @@ Contents tutorials/deadlock tutorials/random_prime_number + tutorials/core_tutorial .. toctree:: :caption: Project Information diff --git a/docs/tutorials/core_tutorial.py b/docs/tutorials/core_tutorial.py new file mode 100644 index 00000000..2b2b6c46 --- /dev/null +++ b/docs/tutorials/core_tutorial.py @@ -0,0 +1,20 @@ +import gc + + +def types_found_in_tuples(): + elem_types = set() + + for obj in gc.get_objects(): + if isinstance(obj, tuple): + elem_types.update(map(type, obj)) + + for elem_type in elem_types: + yield elem_type + + +# printing all results with multiple calls to print +for t in types_found_in_tuples(): + print(t) + +# printing all results with one call to print +print(*types_found_in_tuples(), sep="\n") diff --git a/docs/tutorials/core_tutorial.rst b/docs/tutorials/core_tutorial.rst new file mode 100644 index 00000000..7bf87170 --- /dev/null +++ b/docs/tutorials/core_tutorial.rst @@ -0,0 +1,287 @@ +Debugging Core Files +==================== + +Intro +----- + +This lesson is meant to familiarize you with using the ``pystack core`` subcommand to analyze a core +dump file. The steps will help you enable core dumps in your Linux environment, run a script that +crashes and leaves a core dump, and then analyze the core dump file with PyStack. + +Ensuring core dumps are enabled +------------------------------- + +First, we need to ensure that crashing processes in our Linux environment leave core files behind. + +The system allows setting a limit on the size of a core dump, to avoid filling up your disk when +a process is crashing frequently. You can check the configured size limit by running: + +.. code:: shell + + ulimit -c + +If that reports ``0`` it means that core files have been disabled on your machine, and you'll need +to enable them before proceeding with this tutorial. You can enable them using the ``ulimit`` +command in your terminal: + +.. code:: shell + + # unlimited core size + ulimit -c unlimited + + # or if you want to limit the size to 100 MB instead + ulimit -c 100000 + +Note that this command only enables core files for the shell session you ran the command in. If you +want to make this change permanent, you can add that line to your ``~/.bashrc`` or ``~/.zshrc`` file +depending on your terminal preference. For instance: + +.. code:: shell + + echo "ulimit -c unlimited" >> ~/.bashrc + source ~/.bashrc + +Next, we want to check where core dump files will be written. Check what this command prints: + +.. code:: shell + + cat /proc/sys/kernel/core_pattern + +That may contain placeholders, prefixed with ``%``. You can learn more about what placeholders are +available in the "Naming of core dump files" section of `man 5 core `_. + +If you want to control the names used for generated core files, you can update the ``core_pattern`` +setting with ``sysctl``. Note that this affects all users and processes on the machine! The naming +of core dump files can be configured like so: + +.. code:: shell + + sudo sysctl -w kernel.core_pattern="/tmp/core-%e.%p.%h.%t" + +In this example, we configure the file pattern to include the executable name (``%e``), process ID +(``%p``), hostname (``%h``), and timestamp (``%t``), and to be written to the ``/tmp`` directory. If +you would like to learn more about core dumps, Red Hat has more information `here +`_. + +Generating the Core Dump +------------------------ + +You can test whether core dumps are being generated by running this command: + +.. code:: shell + + python3 -c 'import os; os.abort()' + +If everything is configured properly, you should see a message telling you that the process aborted +and a core file was dumped. If your shell is bash, that might say simply ``Aborted (core dumped)``. +If you use zsh, you'd see something like ``zsh: abort python3 -c 'import os; os.abort()'``. + +Now, we need to find the core file. Given the ``core_pattern`` that we configured above with +``sysctl``, the core dump should have been written to ``/tmp`` and its name should start with +``core-``, so you can find the core file with: + +.. code:: shell + + ls /tmp/core-* + +Analyzing the Core Dump +----------------------- + +Now that we have generated the core dump file and located it on disk, we can use the ``pystack +core`` command to analyze it like so: + +.. code:: shell + + pystack core /tmp/core-... + +The output will display the stack trace of the core dump file, which will help you identify the +source of the error. In the case of explicitly telling the process to abort itself, that stack isn't +all that interesting: + +.. code:: python + + Using executable found in the core file: /src/.venv/bin/python + + Core file information: + state: R zombie: True niceness: 0 + pid: 618 ppid: 473 sid: 473 + uid: 1000 gid: 1000 pgrp: 618 + executable: python3 arguments: python3 -c import os; os.abort() + + The process died due receiving signal SIGABRT + Traceback for thread 618 [Has the GIL] (most recent call last): + (Python) File "", line 1, in + +By default PyStack shows you the same Python stack as the interpreter would print if an exception +occurred. Since the Python stack only has one frame on it, there's not much to see here! For +debugging a crash, though, it's generally best to use the ``--native`` flag to display the +interpreter's native C frames, as well as any C or C++ or Rust frames from libraries you are using, +interleaved with the Python frames in the resulting stack trace. + +.. code:: shell + + pystack core --native /tmp/core-... + +This will produce output similar to the below: + +.. code:: python + + Using executable found in the core file: /src/.venv/bin/python + + Core file information: + state: R zombie: True niceness: 0 + pid: 618 ppid: 473 sid: 473 + uid: 1000 gid: 1000 pgrp: 618 + executable: python3 arguments: python3 -c import os; os.abort() + + The process died due receiving signal SIGABRT + Traceback for thread 618 [Has the GIL] (most recent call last): + (C) File "???", line 0, in _start (python3) + (C) File "../csu/libc-start.c", line 360, in __libc_start_main@@GLIBC_2.34 (libc.so.6) + (C) File "../sysdeps/nptl/libc_start_call_main.h", line 58, in __libc_start_call_main (libc.so.6) + (C) File "???", line 0, in Py_BytesMain (python3) + (C) File "???", line 0, in Py_RunMain (python3) + (C) File "???", line 0, in PyRun_SimpleStringFlags (python3) + (C) File "???", line 0, in PyRun_StringFlags (python3) + (Python) File "", line 1, in + (C) File "./stdlib/abort.c", line 79, in abort (libc.so.6) + (C) File "../sysdeps/posix/raise.c", line 26, in raise (libc.so.6) + (C) File "./nptl/pthread_kill.c", line 89, in pthread_kill@@GLIBC_2.34 (libc.so.6) + (C) File "./nptl/pthread_kill.c", line 78, in __pthread_kill_internal (inlined) (libc.so.6) + (C) File "./nptl/pthread_kill.c", line 44, in __pthread_kill_implementation (inlined) (libc.so.6) + +Note that this now includes not only the Python frames on the stack, but also the C frames that +led to the first Python call on the stack, as well as the C frames that the last Python call on the +stack called into. This lets us see how the interpreter implements the ``os.abort()`` function, as +well as how the C library (glibc) implements its own ``abort()`` function! + +Exercise +-------- + +Read through the code in ``core_tutorial.py`` and see if you can spot any bugs: + +.. literalinclude:: core_tutorial.py + :linenos: + +Given the context of this exercise, you won't be particularly surprised to learn that this code +crashes, and leaves a core file for you to investigate. Spend a few minutes investigating the +``pystack core --native`` report for the core file generated by running this script, and see if you +can figure out what's going wrong. If you aren't familiar with Python's internals this will be +tough, but these hints should help: + +.. raw:: html + +
+ Hint 1 + +This program uses `gc.get_objects` to find every tuple in the program's memory, and then finds the +set of distinct element types contained in tuples. The ``for`` loop on lines 16 and 17 runs +successfully to completion and produces a usable report. We know that it finishes because the +PyStack report shows that the crash happens inside the call on line 20. + +Printing each type one at a time in a loop ought to be equivalent to telling ``print()`` to print +them all at once with newlines between each. Why isn't it? + +.. raw:: html + +
+
+ Hint 2 + +Check the local variables with ``--locals`` as well. Can you spot anything wrong there? + +.. raw:: html + +
+
+ Hint 3 + +``--native`` mode shows us that the crash happens within a call to :c:func:`PySequence_Tuple`. Read +up on what that function does, and see if that gives you a clue about what's going wrong here. + +.. raw:: html + +
+
+ Hint 4 + +The documentation tells us that :c:func:`PySequence_Tuple` can be called on either a sequence or an +iterable. Can you tell from the stack what type of object it was being called on when the crash +happened? + +.. raw:: html + +
+
+ Hint 5 + +What gets printed out if you run this Python code? + +.. code:: python + + def args_type(*args): + print(type(args)) + + args_type() + +.. raw:: html + +
+ +Solution +-------- + +.. raw:: html + +
+ Toggle to see the explanation + +Just by using ``--native`` mode, we see some interesting things. First, the crash doesn't happen +when we loop over the values yielded by ``types_found_in_tuples()`` with a ``for`` loop, but does +happen when we pass all the yielded values to ``print`` using ``*`` for unpacking. Second, the crash +happens inside of a call to :c:func:`PySequence_Tuple`, which is the C function equivalent to the +Python code ``tuple(iterable)``. + +When we add ``--locals`` into the mix, we additionally see the very surprising value of ``obj``: + +.. code:: text + + obj: (, , ...) + +That's PyStack's representation of a tuple whose first two elements are `NULL pointers +`_! + +This is about as much as PyStack can tell us, and from here we need to figure things out on our own. +But all the clues are there: + +1. ``types_found_in_tuples()`` was called into by :c:func:`PySequence_Tuple`, which constructs a new + tuple from an iterable +2. ``types_found_in_tuples()`` went on to discover a tuple that contains NULL pointers +3. The problem only happens when we use ``*`` unpacking in the call to ``print``, and doesn't happen + when we use a ``for`` loop +4. Functions receive ``*args`` as a tuple! + +What's happening here is that, in order to execute ``print(*types_found_in_tuples())``, the +interpreter needs to create a new tuple to pass to ``print``, which will contain all of the values +yielded by ``types_found_in_tuples()``. The interpreter creates that tuple, and then starts +iterating over the generator produced by calling ``types_found_in_tuples()`` in order to populate +that new tuple and get it ready to pass to ``print``. But ``types_found_in_tuples()`` uses +`gc.get_objects` and discovers this new, partially initialized tuple, and chaos ensues when it tries +to use that object before the interpreter has finished assigning its elements. + +.. raw:: html + +
+ + +Conclusion +---------- + +In this tutorial, we learned how to use the ``core`` subcommand to get stack traces from core dumps. +We learned how to ensure core files are produced, and how to test that. We then analyzed the core +dump file using the ``pystack`` command to identify the source of the fault. Using the ``--native`` +flag, we were able to view what the Python interpreter itself was doing at the time of the crash, +which provided further insights into the cause of the error. By understanding how to analyze core +dump files, we can effectively debug and troubleshoot interpreter crashes. + +Thank you for following along with this tutorial. Happy coding!