Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

exanic: avoid receiving corrupted frames when lapped during memcpy #44

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Commits on Jul 8, 2021

  1. exanic: avoid receiving corrupted frames when lapped during memcpy

    Under certain timing conditions, it is possible for
    `exanic_receive_frame` and `exanic_receive_chunk` (and friends) to
    return corrupt data without reporting an error.
    
    We used the experimental setup described below to quickly (within tens
    of seconds) reproduce this issue.
    
    Two UDP senders were pointed to a UDP receiver, sending checksummed UDP
    packets at close to line rate. We found using one sender did not
    reproduce the issue reliably. For posterity, we used cached PIO sends
    from a Solarflare NIC for this step.
    
    Sender A sent a 393-byte payload full of `p` (hex `70`), while Sender
    B striped a 1371-byte payload with all bytes in the range [0, 255].
    There's no particular signifiance to these numbers or payloads other
    than that they were easy to test with.
    
    On another machine, the UDP receiver process used `exanic_receive_frame`
    to receive packets. Upon reception, it verified the UDP checksum, and
    reported an error if is was invalid. (A variant using
    `exanic_receive_chunk` exhibited the same bug.)
    
    Between each packet, the receiver busy-looped for 30us to simulate
    application load.
    
    The issue does not reproduce within a reasonable timeframe if the delay
    is too large or too small. 30us was picked as being the ~smallest
    possible delay on our hardware such that the receiver lapped (i.e.,
    `exanic_receive_frame` returned `-EXANIC_RX_FRAME_SWOVFL`) around 10
    times a second.
    
    Running the experiment outlined above, the receiver quickly received
    a bad-checksum UDP packet. A example of one is shown below (redacted MAC
    and IP).
    
                  00 01 02 03 04 05 06 07  08 09 0a 0b 0c 0d 0e 0f
        00000000  XX XX XX XX XX XX XX XX  XX XX XX XX 08 00 45 00  |XXXXXXXXXXXX..E.|
        00000010  05 6d 00 00 40 00 10 11  e8 2b XX XX XX XX XX XX  |.m..@....+XXXXXX|
        00000020  XX XX af af 1b 9e 05 59  7d 14 00 01 02 03 04 05  |XX.....Y}.......|
        00000030  06 07 08 09 0a 0b 0c 0d  0e 0f 10 11 12 13 14 15  |................|
        00000040  70 70 70 70 70 70 70 70  70 70 70 70 70 70 70 70  |pppppppppppppppp|
        00000050  70 70 70 70 70 70 70 70  70 70 70 70 70 70 70 70  |pppppppppppppppp|
        00000060  70 70 70 70 70 70 70 70  70 0d 1e 66 1b 00 00 00  |ppppppppp..f....|
        00000070  9e 9f a0 a1 a2 a3 a4 a5  4e 4f 50 51 52 53 54 55  |........NOPQRSTU|
        ...
        00000560  36 37 38 39 3a 3b 3c 3d  3e 3f 40 41 42 43 44 45  |6789:;<=>?@abcde|
        00000570  46 47 48 49 4a 4b 4c 4d  4e 4f 50 e0 e3 ef 2c     |FGHIJKLMNOP...,|
    
    Some narration:
    - The first chunk of this packet begins at `0x00` and ends at `0x77`
      (120).
    - At `0x2a` we see Sender A's payload, `00`, `01`, ..., `15`.
    - At `0x40`, the packet is corrupted with part of the contents of
      another packet from Sender B: a stream of `70`.
    - At `0x69` that we see a 32-bit FCS `0d 1e 66 1b` (matching the
      expected FCS of the Sender B's `70`-only packets).
    - Presumably the hardware writes at least 32-bit blocks, so we
      furthermore see padding of `00 00 00` to round up to `0x70`.
    - Between `0x70` and `0x77` we see whatever was left over in the chunk
      that got overwritten by sender B's final chunk.
    - From `0x78` onwards (i.e, the next chunk), Sender A's packet's
      contents resume uncorrupted.
    
    In short: the first chunk of the packet is corrupted from `0x40` to
    `0x77`.
    
    Minor note: the chunk is actually corrupted *backwards*. On the system
    under test, glibc chooses `__memcpy_ssse3_back` as its `memcpy`
    implementation. This implementation copies bytes back-to-front, which is
    why the beginning of the chunk is fine but the end is not.
    
    Given that the relevant code has a `TODO` around the buggy region,
    I imagine the potential for this issue was known, but the probability of
    its occurrence thought to be very slim. Hopefully, the example above
    shows it is easy to reproduce, both in test and in production.
    
    In this commit, we check that the generation number of the preceding
    chunk has not changed after `memcpy` returns (as that would mean we may
    have gotten lapped), and return an error instead.
    
    It is necessary to check the generation of the preceding chunk as
    opposed to re-checking the generation of the current chunk since
    128-byte DMAs are not atomic (even though they fit within a single TLP
    -- [1] suggests the maximum possible atomic width to be 64 bytes; on our
    hardware we are occasionally seeing less than this).
    
    Our understanding is that this should be safe even in the case of DMA
    TLP reordering on the PCIe bus, as writes will still be made visible to
    the host in increasing address order (so, if the contents of chunk N are
    in the process of being changed, chunk N-1's generation number has
    changed even if chunk N's hasn't yet).
    
    A similar issue in the kernel driver was addressed in 2016 as part
    b3257af. Note that the fix there is to re-check the generation of the
    current chunk after copying it; this is not sufficient to entirely
    eliminate the race (though it does, in practice, make it significantly
    less likely to reproduce). A fix for the kernel driver is not included
    in this commit.
    
    [1]: https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/pcie-burst-transfer-paper.pdf
    Xyene committed Jul 8, 2021
    Configuration menu
    Copy the full SHA
    a492f26 View commit details
    Browse the repository at this point in the history