bloomberg · pablogsal · Aug 6, 2024 · May 21, 2024 · Jul 31, 2024
diff --git a/docs/overview.rst b/docs/overview.rst
@@ -54,6 +54,7 @@ Contents
 
    tutorials/deadlock
    tutorials/random_prime_number
+   tutorials/core_tutorial
 
 .. toctree::
   :caption: Project Information

diff --git a/docs/tutorials/core_tutorial.py b/docs/tutorials/core_tutorial.py
@@ -0,0 +1,20 @@
+import gc
+
+
+def types_found_in_tuples():
+    elem_types = set()
+
+    for obj in gc.get_objects():
+        if isinstance(obj, tuple):
+            elem_types.update(map(type, obj))
+
+    for elem_type in elem_types:
+        yield elem_type
+
+
+# printing all results with multiple calls to print
+for t in types_found_in_tuples():
+    print(t)
+
+# printing all results with one call to print
+print(*types_found_in_tuples(), sep="\n")
diff --git a/docs/tutorials/core_tutorial.rst b/docs/tutorials/core_tutorial.rst
@@ -0,0 +1,287 @@
+Debugging Core Files
+====================
+
+Intro
+-----
+
+This lesson is meant to familiarize you with using the ``pystack core`` subcommand to analyze a core
+dump file. The steps will help you enable core dumps in your Linux environment, run a script that
+crashes and leaves a core dump, and then analyze the core dump file with PyStack.
+
+Ensuring core dumps are enabled
+-------------------------------
+
+First, we need to ensure that crashing processes in our Linux environment leave core files behind.
+
+The system allows setting a limit on the size of a core dump, to avoid filling up your disk when
+a process is crashing frequently. You can check the configured size limit by running:
+
+.. code:: shell
+
+    ulimit -c
+
+If that reports ``0`` it means that core files have been disabled on your machine, and you'll need
+to enable them before proceeding with this tutorial. You can enable them using the ``ulimit``
+command in your terminal:
+
+.. code:: shell
+
+    # unlimited core size
+    ulimit -c unlimited
+
+    # or if you want to limit the size to 100 MB instead
+    ulimit -c 100000
+
+Note that this command only enables core files for the shell session you ran the command in. If you
+want to make this change permanent, you can add that line to your ``~/.bashrc`` or ``~/.zshrc`` file
+depending on your terminal preference. For instance:
+
+.. code:: shell
+
+    echo "ulimit -c unlimited" >> ~/.bashrc
+    source ~/.bashrc
+
+Next, we want to check where core dump files will be written. Check what this command prints:
+
+.. code:: shell
+
+    cat /proc/sys/kernel/core_pattern
+
+That may contain placeholders, prefixed with ``%``. You can learn more about what placeholders are
+available in the "Naming of core dump files" section of `man 5 core <https://man7.org/linux/man-pages/man5/core.5.html>`_.
+
+If you want to control the names used for generated core files, you can update the ``core_pattern``
+setting with ``sysctl``. Note that this affects all users and processes on the machine! The naming
+of core dump files can be configured like so:
+
+.. code:: shell
+
+    sudo sysctl -w kernel.core_pattern="/tmp/core-%e.%p.%h.%t"
+
+In this example, we configure the file pattern to include the executable name (``%e``), process ID
+(``%p``), hostname (``%h``), and timestamp (``%t``), and to be written to the ``/tmp`` directory. If
+you would like to learn more about core dumps, Red Hat has more information `here
+<https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/developer_guide/debugging-crashed-application#debugging-crashed-application_understanding-core-dumps6>`_.
+
+Generating the Core Dump
+------------------------
+
+You can test whether core dumps are being generated by running this command:
+
+.. code:: shell
+
+    python3 -c 'import os; os.abort()'
+
+If everything is configured properly, you should see a message telling you that the process aborted
+and a core file was dumped. If your shell is bash, that might say simply ``Aborted (core dumped)``.
+If you use zsh, you'd see something like ``zsh: abort      python3 -c 'import os; os.abort()'``.
+
+Now, we need to find the core file. Given the ``core_pattern`` that we configured above with
+``sysctl``, the core dump should have been written to ``/tmp`` and its name should start with
+``core-``, so you can find the core file with:
+
+.. code:: shell
+
+    ls /tmp/core-*
+
+Analyzing the Core Dump
+-----------------------
+
+Now that we have generated the core dump file and located it on disk, we can use the ``pystack
+core`` command to analyze it like so:
+
+.. code:: shell
+
+    pystack core /tmp/core-<executable_name>.<process_id>.<hostname>.<timestamp>
+
+The output will display the stack trace of the core dump file, which will help you identify the
+source of the error. In the case of explicitly telling the process to abort itself, that stack isn't
+all that interesting:
+
+.. code:: python
+
+    Using executable found in the core file: /src/.venv/bin/python
+
+    Core file information:
+    state: R zombie: True niceness: 0
+    pid: 618 ppid: 473 sid: 473
+    uid: 1000 gid: 1000 pgrp: 618
+    executable: python3 arguments: python3 -c import os; os.abort()
+
+    The process died due receiving signal SIGABRT
+    Traceback for thread 618 [Has the GIL] (most recent call last):
+        (Python) File "<string>", line 1, in <module>
+
+By default PyStack shows you the same Python stack as the interpreter would print if an exception
+occurred. Since the Python stack only has one frame on it, there's not much to see here! For
+debugging a crash, though, it's generally best to use the ``--native`` flag to display the
+interpreter's native C frames, as well as any C or C++ or Rust frames from libraries you are using,
+interleaved with the Python frames in the resulting stack trace.
+
+.. code:: shell
+
+    pystack core --native /tmp/core-<executable_name>.<process_id>.<hostname>.<timestamp>
+
+This will produce output similar to the below:
+
+.. code:: python
+
+    Using executable found in the core file: /src/.venv/bin/python
+
+    Core file information:
+    state: R zombie: True niceness: 0
+    pid: 618 ppid: 473 sid: 473
+    uid: 1000 gid: 1000 pgrp: 618
+    executable: python3 arguments: python3 -c import os; os.abort()
+
+    The process died due receiving signal SIGABRT
+    Traceback for thread 618 [Has the GIL] (most recent call last):
+        (C) File "???", line 0, in _start (python3)
+        (C) File "../csu/libc-start.c", line 360, in __libc_start_main@@GLIBC_2.34 (libc.so.6)
+        (C) File "../sysdeps/nptl/libc_start_call_main.h", line 58, in __libc_start_call_main (libc.so.6)
+        (C) File "???", line 0, in Py_BytesMain (python3)
+        (C) File "???", line 0, in Py_RunMain (python3)
+        (C) File "???", line 0, in PyRun_SimpleStringFlags (python3)
+        (C) File "???", line 0, in PyRun_StringFlags (python3)
+        (Python) File "<string>", line 1, in <module>
+        (C) File "./stdlib/abort.c", line 79, in abort (libc.so.6)
+        (C) File "../sysdeps/posix/raise.c", line 26, in raise (libc.so.6)
+        (C) File "./nptl/pthread_kill.c", line 89, in pthread_kill@@GLIBC_2.34 (libc.so.6)
+        (C) File "./nptl/pthread_kill.c", line 78, in __pthread_kill_internal (inlined) (libc.so.6)
+        (C) File "./nptl/pthread_kill.c", line 44, in __pthread_kill_implementation (inlined) (libc.so.6)
+
+Note that this now includes not only the Python frames on the stack, but also the C frames that
+led to the first Python call on the stack, as well as the C frames that the last Python call on the
+stack called into. This lets us see how the interpreter implements the ``os.abort()`` function, as
+well as how the C library (glibc) implements its own ``abort()`` function!
+
+Exercise
+--------
+
+Read through the code in ``core_tutorial.py`` and see if you can spot any bugs:
+
+.. literalinclude:: core_tutorial.py
+   :linenos:
+
+Given the context of this exercise, you won't be particularly surprised to learn that this code
+crashes, and leaves a core file for you to investigate. Spend a few minutes investigating the
+``pystack core --native`` report for the core file generated by running this script, and see if you
+can figure out what's going wrong. If you aren't familiar with Python's internals this will be
+tough, but these hints should help:
+
+.. raw:: html
+
+   <details>
+   <summary><i>Hint 1</i></summary>
+
+This program uses `gc.get_objects` to find every tuple in the program's memory, and then finds the
+set of distinct element types contained in tuples. The ``for`` loop on lines 16 and 17 runs
+successfully to completion and produces a usable report. We know that it finishes because the
+PyStack report shows that the crash happens inside the call on line 20.
+
+Printing each type one at a time in a loop ought to be equivalent to telling ``print()`` to print
+them all at once with newlines between each. Why isn't it?
+
+.. raw:: html
+
+   </details>
+   <details>
+   <summary><i>Hint 2</i></summary>
+
+Check the local variables with ``--locals`` as well. Can you spot anything wrong there?
+
+.. raw:: html
+
+   </details>
+   <details>
+   <summary><i>Hint 3</i></summary>
+
+``--native`` mode shows us that the crash happens within a call to :c:func:`PySequence_Tuple`. Read
+up on what that function does, and see if that gives you a clue about what's going wrong here.
+
+.. raw:: html
+
+   </details>
+   <details>
+   <summary><i>Hint 4</i></summary>
+
+The documentation tells us that :c:func:`PySequence_Tuple` can be called on either a sequence or an
+iterable. Can you tell from the stack what type of object it was being called on when the crash
+happened?
+
+.. raw:: html
+
+   </details>
+   <details>
+   <summary><i>Hint 5</i></summary>
+
+What gets printed out if you run this Python code?
+
+.. code:: python
+
+    def args_type(*args):
+        print(type(args))
+
+    args_type()
+
+.. raw:: html
+
+   </details>
+
+Solution
+--------
+
+.. raw:: html
+
+   <details>
+   <summary><i>Toggle to see the explanation</i></summary>
+
+Just by using ``--native`` mode, we see some interesting things. First, the crash doesn't happen
+when we loop over the values yielded by ``types_found_in_tuples()`` with a ``for`` loop, but does
+happen when we pass all the yielded values to ``print`` using ``*`` for unpacking. Second, the crash
+happens inside of a call to :c:func:`PySequence_Tuple`, which is the C function equivalent to the
+Python code ``tuple(iterable)``.
+
+When we add ``--locals`` into the mix, we additionally see the very surprising value of ``obj``:
+
+.. code:: text
+
+   obj: (<invalid object at 0x0>, <invalid object at 0x0>, ...)
+
+That's PyStack's representation of a tuple whose first two elements are `NULL pointers
+<https://www.infoq.com/presentations/Null-References-The-Billion-Dollar-Mistake-Tony-Hoare/>`_!
+
+This is about as much as PyStack can tell us, and from here we need to figure things out on our own.
+But all the clues are there:
+
+1. ``types_found_in_tuples()`` was called into by :c:func:`PySequence_Tuple`, which constructs a new
+   tuple from an iterable
+2. ``types_found_in_tuples()`` went on to discover a tuple that contains NULL pointers
+3. The problem only happens when we use ``*`` unpacking in the call to ``print``, and doesn't happen
+   when we use a ``for`` loop
+4. Functions receive ``*args`` as a tuple!
+
+What's happening here is that, in order to execute ``print(*types_found_in_tuples())``, the
+interpreter needs to create a new tuple to pass to ``print``, which will contain all of the values
+yielded by ``types_found_in_tuples()``. The interpreter creates that tuple, and then starts
+iterating over the generator produced by calling ``types_found_in_tuples()`` in order to populate
+that new tuple and get it ready to pass to ``print``. But ``types_found_in_tuples()`` uses
+`gc.get_objects` and discovers this new, partially initialized tuple, and chaos ensues when it tries
+to use that object before the interpreter has finished assigning its elements.
+
+.. raw:: html
+
+   </details>
+
+
+Conclusion
+----------
+
+In this tutorial, we learned how to use the ``core`` subcommand to get stack traces from core dumps.
+We learned how to ensure core files are produced, and how to test that. We then analyzed the core
+dump file using the ``pystack`` command to identify the source of the fault. Using the ``--native``
+flag, we were able to view what the Python interpreter itself was doing at the time of the crash,
+which provided further insights into the cause of the error. By understanding how to analyze core
+dump files, we can effectively debug and troubleshoot interpreter crashes.
+
+Thank you for following along with this tutorial. Happy coding!