Jitter in RT communication under system load (Linux) #1374
4 comments · 23 replies
-
@Schrolli91 thanks for doing these tests. Could you please repeat them with … Another question: does your custom implementation also have a broker? If yes, does it run on the same core as one of the clients?
-
@Schrolli91 I played around with it over the long weekend, and at the moment I suspect our lock-free algorithms. They have the advantage that we do not require a mutex, which is perfect for a safety-critical system, but they may have the disadvantage of taking longer under high CPU load. In most lock-free structures you find some kind of loop: the work is performed, and if the structure is unchanged after the work is done, the change is committed; but if the structure was modified in the meantime, you start again. If this is the cause, we could implement a queue guarded by a mutex, which does not have this problem at all. That might solve your problem, but it would not be usable in a safety-critical context, since blocking calls are forbidden.
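For illustration, the retry pattern described above can be sketched like this (a hypothetical minimal example, not iceoryx's actual queue code): the loop computes a new state and tries to publish it with a single compare-and-swap; under heavy contention the CAS fails and the loop restarts, which is where load-dependent latency can creep in.

```cpp
#include <atomic>
#include <cstdint>

// Shared state that concurrent producers try to advance.
std::atomic<std::uint64_t> head{0};

// Claim the next slot index with a CAS retry loop. Single-threaded this
// succeeds on the first attempt; under contention, compare_exchange_weak
// fails, reloads 'expected' with the freshly observed value, and the loop
// retries -- an unbounded number of times in the worst case.
std::uint64_t reserveSlot()
{
    std::uint64_t expected = head.load(std::memory_order_relaxed);
    std::uint64_t desired;
    do
    {
        desired = expected + 1; // the "work": compute the next state
    } while (!head.compare_exchange_weak(expected, desired,
                                         std::memory_order_acq_rel,
                                         std::memory_order_relaxed));
    return desired - 1; // the index that was atomically claimed
}
```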
-
You can mitigate this problem by starting RouDi with
-
@Schrolli91 I created issue #1436, which should solve your jitter problem when you turn off the monitoring in RouDi. I would like to add you as a reviewer, if that's alright with you, so that you can test whether the jitter is really gone.
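If I remember correctly, RouDi exposes a monitoring switch on its command line; the exact option name below is an assumption, so please verify it against your build:

```shell
# Assumed: iox-roudi's monitoring can be disabled via its monitoring-mode
# option. Check `iox-roudi --help` for the exact spelling in your version.
iox-roudi --monitoring-mode off
```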
-
Hey folks,
we are currently investigating the suitability of Iceoryx for applications with hard real-time requirements under Linux. We noticed a lot of unexpected jitter, especially compared to a custom in-house library that uses a plain shared-memory mechanism. After some tests we are not sure where this extreme jitter comes from; it occurs mainly when the system is under load.
The only difference we see compared to our simpler tool is the use of atomic operations (CAS) in the lock-free queue handling and chunk management. According to the "perf" tool, these operations have the greatest overhead and therefore the highest impact on latency during pub/sub communication.
Maybe one of you has an idea why Iceoryx is so much more susceptible to jitter, or what we should investigate further.
Best regards
Operating system:
Debian 11 - Kernel 5.10.84-rt58
Iceoryx Version 2.0.1
Hardware: Industrial PC with Quadcore CPU
Compiler version:
gcc 10.2.1
Observed result or behavior:
We see a lot of jitter within the communication. This occurs especially under increased system load.
A custom lib based on a simple "shmem" approach (p2p/one-way) does not show this extreme jitter.
Expected result or behavior:
Deterministic transmission times of the packets by the usage of Iceoryx.
Conditions where it occurred / Performed steps:
Test Application:
Tests were executed with a simple Publisher/Subscriber application based on provided examples.
Both applications subscribe to each other and ping-pong messages back and forth.
One million packets were sent for warm-up, and then RTTs were measured individually for 10 million messages.
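The measurement loop described above could be sketched roughly as follows (a hypothetical harness, not the actual test code; `pingPong` stands in for one publish plus blocking receive round trip over the Iceoryx untyped API):

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <vector>

// Run 'warmup' untimed round trips, then record the RTT of 'samples'
// further round trips in nanoseconds.
template <typename PingPongFn>
std::vector<std::int64_t> measureRtt(PingPongFn pingPong,
                                     std::uint64_t warmup,
                                     std::uint64_t samples)
{
    using clock = std::chrono::steady_clock;

    for (std::uint64_t i = 0; i < warmup; ++i)
    {
        pingPong(); // warm up caches, page faults, allocator state
    }

    std::vector<std::int64_t> rtts;
    rtts.reserve(samples);
    for (std::uint64_t i = 0; i < samples; ++i)
    {
        auto start = clock::now();
        pingPong();
        auto stop = clock::now();
        rtts.push_back(
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)
                .count());
    }
    return rtts;
}
```

Jitter can then be reported from the collected samples, e.g. as max minus min, or as the spread between percentiles.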
System configuration:
RouDi
Affinity:
taskset -c 2 chrt -f 80
Prio:
FIFO 80
Pinned to Core 2
Publisher
Affinity:
taskset -c 2 chrt -f 80
Prio:
FIFO 80
Pinned to Core 2
Subscriber
Affinity:
taskset -c 3 chrt -f 80
Prio:
FIFO 80
Pinned to Core 3
Stress/Systemload is generated via:
stress-ng --cpu 4 --io 2 --vm 2 --vm-bytes 128M --fork 4 --timeout 0
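Putting the configuration above together, the setup could be reproduced roughly like this (the binary names are placeholders; the `taskset`/`chrt` combination and the stress-ng invocation are taken from the configuration listed above):

```shell
# RouDi and the publisher pinned to core 2, the subscriber to core 3,
# all with SCHED_FIFO priority 80 (requires root or CAP_SYS_NICE).
taskset -c 2 chrt -f 80 ./iox-roudi &
taskset -c 2 chrt -f 80 ./publisher &
taskset -c 3 chrt -f 80 ./subscriber &

# Generate background load (CPU, I/O, memory, fork pressure) indefinitely:
stress-ng --cpu 4 --io 2 --vm 2 --vm-bytes 128M --fork 4 --timeout 0
```

Note that RouDi and the publisher share core 2, so the publisher competes with RouDi for that core under SCHED_FIFO.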
Here is a comparison of the jitter we observed with Iceoryx versus our simpler internal library.
(The absolute values are not the problem; the enormous jitter with Iceoryx is.)
Iceoryx (untyped API) – polling – 80 bytes – with stress
Custom implementation – polling – 80 bytes – with stress