
exanic: avoid receiving corrupted frames when lapped during memcpy #44

Open

Xyene wants to merge 1 commit into master from fix-corruption-on-lap-during-memcpy

Conversation

@Xyene Xyene commented Jun 16, 2021

Under certain timing conditions, it is possible for
exanic_receive_frame and exanic_receive_chunk (and friends) to
return corrupt data without reporting an error.

We used the experimental setup described below to quickly (within tens
of seconds) reproduce this issue.

Two UDP senders were pointed to a UDP receiver, sending checksummed UDP
packets at close to line rate. We found using one sender did not
reproduce the issue reliably. For posterity, we used cached PIO sends
from a Solarflare NIC for this step.

Sender A striped a 1371-byte payload with all byte values in the range
[0, 255], while Sender B sent a 393-byte payload full of p (hex 70).
There's no particular significance to these numbers or payloads other
than that they were easy to test with.

On another machine, the UDP receiver process used exanic_receive_frame
to receive packets. Upon reception, it verified the UDP checksum and
reported an error if it was invalid. (A variant using
exanic_receive_chunk exhibited the same bug.)

Between each packet, the receiver busy-looped for 30us to simulate
application load.

The issue does not reproduce within a reasonable timeframe if the delay
is too large or too small. 30us was picked as roughly the smallest
delay on our hardware for which the receiver lapped (i.e.,
exanic_receive_frame returned -EXANIC_RX_FRAME_SWOVFL) around 10
times a second.
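
For reference, the receiver's inner loop looked roughly like the sketch below. This is a simplified illustration rather than the actual test program: udp_checksum_ok() and busy_wait_us() are hypothetical helpers, "exanic0" and port 0 are placeholders, and error handling is omitted.

```c
#include <exanic/exanic.h>
#include <exanic/fifo_rx.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helpers (not part of libexanic): verify the UDP checksum of a
 * received Ethernet frame, and spin for roughly `us` microseconds. */
extern int udp_checksum_ok(const char *frame, size_t len);
extern void busy_wait_us(unsigned int us);

int main(void)
{
    exanic_t *exanic = exanic_acquire_handle("exanic0");       /* placeholder device */
    exanic_rx_t *rx = exanic_acquire_rx_buffer(exanic, 0, 0);  /* port 0, default buffer */
    char frame[2048];
    uint32_t ts;    /* 32-bit hardware timestamp (unused here) */

    for (;;)
    {
        ssize_t len = exanic_receive_frame(rx, frame, sizeof(frame), &ts);
        if (len == -EXANIC_RX_FRAME_SWOVFL)
            continue;               /* lapped; the delay was tuned so this happens ~10x/second */
        if (len <= 0)
            continue;               /* no frame yet, or other receive error */

        if (!udp_checksum_ok(frame, len))
            fprintf(stderr, "bad UDP checksum, but no receive error reported\n");

        busy_wait_us(30);           /* ~30us of simulated application load per packet */
    }
}
```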

Running the experiment outlined above, the receiver quickly received
a UDP packet with a bad checksum. An example is shown below (MAC and IP
addresses redacted).

          00 01 02 03 04 05 06 07  08 09 0a 0b 0c 0d 0e 0f
00000000  XX XX XX XX XX XX XX XX  XX XX XX XX 08 00 45 00  |XXXXXXXXXXXX..E.|
00000010  05 6d 00 00 40 00 10 11  e8 2b XX XX XX XX XX XX  |.m..@....+XXXXXX|
00000020  XX XX af af 1b 9e 05 59  7d 14 00 01 02 03 04 05  |XX.....Y}.......|
00000030  06 07 08 09 0a 0b 0c 0d  0e 0f 10 11 12 13 14 15  |................|
00000040  70 70 70 70 70 70 70 70  70 70 70 70 70 70 70 70  |pppppppppppppppp|
00000050  70 70 70 70 70 70 70 70  70 70 70 70 70 70 70 70  |pppppppppppppppp|
00000060  70 70 70 70 70 70 70 70  70 0d 1e 66 1b 00 00 00  |ppppppppp..f....|
00000070  9e 9f a0 a1 a2 a3 a4 a5  4e 4f 50 51 52 53 54 55  |........NOPQRSTU|
...
00000560  36 37 38 39 3a 3b 3c 3d  3e 3f 40 41 42 43 44 45  |6789:;<=>?@ABCDE|
00000570  46 47 48 49 4a 4b 4c 4d  4e 4f 50 e0 e3 ef 2c     |FGHIJKLMNOP...,|

Some narration:

  • The first chunk of this packet begins at 0x00 and ends at 0x77
    (120 bytes).
  • At 0x2a we see Sender A's payload, 00, 01, ..., 15.
  • At 0x40, the packet is corrupted with part of the contents of
    another packet from Sender B: a stream of 70.
  • At 0x69 we see a 32-bit FCS 0d 1e 66 1b (matching the
    expected FCS of Sender B's 70-only packets).
  • Presumably the hardware writes in blocks of at least 32 bits, so we
    also see padding of 00 00 00 to round up to 0x70.
  • Between 0x70 and 0x77 we see whatever was left over in the chunk
    that got overwritten by Sender B's final chunk.
  • From 0x78 onwards (i.e., the next chunk), Sender A's packet's
    contents resume uncorrupted.

In short: the first chunk of the packet is corrupted from 0x40 to
0x77.

Minor note: the chunk is actually corrupted backwards. On the system
under test, glibc chooses __memcpy_ssse3_back as its memcpy
implementation. This implementation copies bytes back-to-front, which is
why the beginning of the chunk is fine but the end is not.

Given that the relevant code has a TODO around the buggy region,
I imagine the potential for this issue was known, but the probability of
its occurrence was thought to be very slim. Hopefully, the example above
shows it is easy to reproduce, both in test and in production.

In this commit, we check that the generation number of the preceding
chunk has not changed after memcpy returns (a change would mean we may
have been lapped), and return an error if it has.

It is necessary to check the generation of the preceding chunk, as
opposed to re-checking the generation of the current chunk, because
128-byte DMAs are not atomic (even though they fit within a single TLP
-- [1] suggests a maximum atomic write width of 64 bytes; on our
hardware we occasionally see less than this).

Our understanding is that this should be safe even in the case of DMA
TLP reordering on the PCIe bus, as writes will still be made visible to
the host in increasing address order (so, if the contents of chunk N are
in the process of being changed, chunk N-1's generation number has
changed even if chunk N's hasn't yet).
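
A minimal sketch of the check, assuming an RX ring of 128-byte chunks whose final byte is the generation number. The struct, field, and constant names below (rx_chunk, NUM_CHUNKS, copy_chunk_checked) are illustrative stand-ins rather than libexanic's real definitions; only EXANIC_RX_FRAME_SWOVFL is taken from the library.

```c
#include <stdint.h>
#include <string.h>
#include <exanic/fifo_rx.h>    /* EXANIC_RX_FRAME_SWOVFL */

#define NUM_CHUNKS 16384       /* illustrative ring size */

struct rx_chunk {
    char payload[120];         /* frame data */
    uint8_t info[7];           /* per-chunk metadata */
    uint8_t generation;        /* last byte of the 128-byte chunk */
};

/* Copy one chunk's payload, then detect whether the hardware may have lapped
 * us mid-copy by re-reading the generation of the chunk *preceding* the one
 * just copied.  Re-checking the current chunk is not enough: the 128-byte
 * chunk DMA is not atomic, so its final (generation) byte may not have been
 * rewritten yet even though earlier payload bytes already have been.  Since
 * writes become visible in increasing address order, a lap that is currently
 * overwriting chunk `idx` must already have updated chunk `idx - 1`,
 * including its generation byte. */
static int copy_chunk_checked(const volatile struct rx_chunk *ring,
                              uint32_t idx, uint8_t generation,
                              char *dst, size_t len)
{
    memcpy(dst, (const char *)ring[idx].payload, len);

    uint32_t prev = (idx == 0) ? NUM_CHUNKS - 1 : idx - 1;
    /* The chunk before index 0 was written on the previous pass of the ring,
     * so it is expected to carry the previous generation number. */
    uint8_t expected = (idx == 0) ? (uint8_t)(generation - 1) : generation;

    if (ring[prev].generation != expected)
        return -EXANIC_RX_FRAME_SWOVFL;   /* lapped: the copied data is suspect */
    return 0;
}
```

The ring is read through a volatile pointer here so the re-read of the previous chunk's generation byte is not optimized away; the actual change applies the same idea inside libexanic's existing RX path.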

A similar issue in the kernel driver was addressed in 2016 as part of
b3257af. Note that the fix there is to re-check the generation of the
current chunk after copying it; this is not sufficient to entirely
eliminate the race (though it does, in practice, make it significantly
less likely to reproduce). A fix for the kernel driver is not included
in this commit.

[1]: https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/pcie-burst-transfer-paper.pdf

@Xyene Xyene (Author) commented Jun 16, 2021

Assumptions made in the analysis above:

  1. The hardware DMAs up to the host in 128-byte chunks.
  2. Each chunk always fits in a PCIe TLP.
  3. The CPU is not somehow locked out of RAM while the DMA TLP is being processed.
    (Empirically, the CPU does not appear to be locked out.)
  4. Atomic RAM writes are at most 64 bytes wide, so a DMA write is not atomic.
    • Suggested by [1];
    • Empirically appears to be the case;
    • PCIe v3.0, Section 2.4.3, "Update Ordering and Granularity Provided by a Write
      Transaction", paragraph 20:

      If a single write transaction containing multiple DWORDs and the Relaxed Ordering bit
      clear is accepted by a Completer, the observed ordering of the updates to locations
      within the Completer's data buffer must be in increasing address order. [...] However,
      observed granularity of the updates to the Completer's data buffer is outside the scope
      of the specification.

Necessary assumptions for this code to be anywhere approaching correct:

  1. Within a DMA TLP, writes will be made visible to the host in increasing address order.
    • Ref 2 suggests this depends on the Relaxed Ordering bit being set to 0 (the default).
      What is it set to on ExaNICs?
    • PCIe v3.0, Section 2.4.3, "Update Ordering and Granularity Provided by a Write
      Transaction", paragraph 20 (same as above).
  2. DMA TLPs might be reordered on the PCIe bus, but they will always be made visible to the
    host in increasing address order. That is, if we see a change in the first byte of chunk N, chunk
    N-1 has already been fully overwritten.
    • PCIe v3.0, Section 2.4.1 "Transaction Ordering Rules", section A2a:

      A Posted Request must not pass another Posted Request unless A2b applies.

      (A2b is a mention of the Relaxed Ordering bit.)

I am not a hardware engineer, so verification of the points above would be much appreciated.

Update 06-17: added references to the PCIe v3.0 spec where I could find them.

@Xyene Xyene force-pushed the fix-corruption-on-lap-during-memcpy branch from 14cbe6c to 07e550f on July 8, 2021 17:10
@Xyene Xyene force-pushed the fix-corruption-on-lap-during-memcpy branch from 07e550f to a492f26 on July 8, 2021 17:14
@eigenmatt eigenmatt (Contributor) commented Jul 10, 2021

Hi Tudor,

Thanks for the very detailed analysis here and in separate emails, bits of which I've seen via various channels.

To answer your questions about PCIe - ExaNICs send 128-byte TLPs with the relaxed ordering bit set to 0. As per the PCIe spec clauses you've referenced, this should make the chunks "visible" in the expected order and chunk data "visible" in increasing address order (if we're pedantic, there could be some gray area between the PCIe spec and any particular processor architecture over what "visible" means, but experience suggests that on x86 the data becomes visible to a read on any processor in the expected order without any special barrier-type instructions required, or ExaNIC RX would be liable to break in lots of ways beyond the overflow case you have mentioned).

The generation is deliberately the last byte of each chunk so that the RX scheme should work regardless of write granularity. I'm pretty sure on all modern processors the write granularity is a cache line when a TLP hits a whole cache line - this avoids the processor that receives the PCIe TLP potentially having to fetch partial cache line data from another processor in this common case. But this is a moot point, since a chunk spans two cache lines [on x86], so the whole chunk is not written atomically in any case.
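
To make the layout point concrete, here is a sketch of what such a chunk looks like; the field names are placeholders, with only the sizes taken from the discussion in this thread (128-byte chunks, 120 bytes of frame data, generation as the very last byte):

```c
#include <assert.h>    /* static_assert (C11) */
#include <stddef.h>    /* offsetof */
#include <stdint.h>

/* Placeholder layout matching the discussion above, not libexanic's real
 * definitions. */
struct rx_chunk {
    char    payload[120];   /* frame data: offsets 0x00-0x77 in the hexdump */
    uint8_t info[7];        /* per-chunk metadata */
    uint8_t generation;     /* offset 127: becomes visible last, because TLP
                               writes land in increasing address order */
};

static_assert(sizeof(struct rx_chunk) == 128, "one chunk per 128-byte TLP");
static_assert(offsetof(struct rx_chunk, generation) == 127,
              "generation is the final byte of the chunk");

/* On x86, a 128-byte chunk straddles two 64-byte cache lines, so even
 * cache-line-granular writes cannot update a whole chunk atomically --
 * which is why the current chunk's generation byte alone cannot prove
 * the rest of that chunk is intact. */
```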

We were aware of the general issue with exanic_receive_frame not rechecking that the chunk wasn't overwritten, hence the TODO. In the past we've assumed that no-one would operate close to the software overflow regime in production (because being 2MB behind in processing data means being at least 2ms behind the market), so this was something that wasn't high priority. Of course, this may not be the case for people who use ExaNICs for capture, especially with increasing data rates over the years, during bursts, etc., so it's probably worth fixing if we can do it with minimal performance impact (thank you for sharing your data on that).

The point you make about having to recheck the previous chunk footer rather than the current one is a very good point, and one that had escaped all of us (including myself as the original designer of this architecture many years ago)!

We'll have to apply similar changes to the kernel version and exanic_chunk_recheck as well, and do some more testing and discussion internally, but thanks again, the analysis you've put into this is impressive and much appreciated.

Best regards,
Matt

@afuruhed afuruhed commented

Thumbs up!
