Worse performance compared to Architecture Proposal ROS1 version? #2636
-
I have been using the ROS1 version of https://github.com/tier4/AutowareArchitectureProposal.iv in our vehicle for quite some time, and I have recently started to pick up autoware.universe. After a few tests in the field, I can't help but feel that the overall performance of Autoware on my machine is worse than it used to be with AAP ROS1. For instance, with AAP I did not have any localization issues on our test field, but with universe the localization is quite unstable and is easily lost if I drive too fast or turn suddenly. Similarly, the whole object tracking pipeline seems slower than with AAP (1~2 seconds of delay from VLP cloud to tracked objects). So far, I have only been able to reach real-time performance by heavily downsampling the VLP sensor cloud (which was not necessary with AAP).

I don't doubt that universe's algorithms and features are better and more reliable than AAP's, but I somewhat expected that the change from ROS1 to ROS2, the use of an efficient DDS implementation, and intra-process communication would compensate for the extra processing. Is it only my experience/feeling? Is Autoware heavier (slower?) than before, or is it just a matter of configuration/tuning? (e.g. #204 (comment))
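For context, the downsampling mentioned above is usually done with a voxel-grid filter. The thread does not show how it was configured; the following is only a standalone sketch of the idea (no ROS or PCL dependencies, and `voxel_downsample` is a hypothetical helper name), keeping one centroid per occupied voxel:

```python
def voxel_downsample(points, voxel_size):
    """Reduce a point cloud by keeping one centroid per occupied voxel.

    points: iterable of (x, y, z) tuples; voxel_size: edge length in meters.
    """
    buckets = {}
    for x, y, z in points:
        # Map each point to the integer index of the voxel containing it.
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        buckets.setdefault(key, []).append((x, y, z))
    # Replace each voxel's points with their centroid.
    return [
        tuple(sum(coord) / len(pts) for coord in zip(*pts))
        for pts in buckets.values()
    ]
```

A larger `voxel_size` trades localization/detection accuracy for CPU time, which matches the trade-off described above.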
Replies: 11
-
Although I'm not so familiar with this performance issue, I remember that one of the biggest reasons is the overhead of the existing executors. Regarding Autoware itself, I believe it hasn't become much slower than TIER IV's proposal version. But as features have been added gradually, it requires more machine resources if you use all of them.
-
@VRichardJP
However, we have not encountered such a disastrous performance issue as you describe. Considering the behavior you mentioned, I suspect that your system has something wrong with its DDS configuration or memory bandwidth.

Inter-process communication via multicast on DDS costs CPU time. Each ROS 2 process has a thread named "recvMC" that receives topic messages; these recvMC threads account for about 20-30% of the total CPU time in Autoware. You will find some useful references in the CycloneDDS documentation.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<CycloneDDS xmlns="https://cdds.io/config" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://cdds.io/config https://raw.githubusercontent.com/eclipse-cyclonedds/cyclonedds/master/etc/cyclonedds.xsd">
  <Domain Id="any">
    <General>
      <Interfaces>
        <NetworkInterface autodetermine="true" priority="default" multicast="default" />
      </Interfaces>
      <AllowMulticast>spdp</AllowMulticast>
      <MaxMessageSize>65500B</MaxMessageSize>
    </General>
    <Discovery>
      <EnableTopicDiscoveryEndpoints>true</EnableTopicDiscoveryEndpoints>
    </Discovery>
    <Internal>
      <Watermarks>
        <WhcHigh>500kB</WhcHigh>
      </Watermarks>
    </Internal>
    <Tracing>
      <Verbosity>config</Verbosity>
      <OutputFile>cdds.log.${CYCLONEDDS_PID}</OutputFile>
    </Tracing>
  </Domain>
</CycloneDDS>
```
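A config like this is typically activated through environment variables. `RMW_IMPLEMENTATION` and `CYCLONEDDS_URI` are standard ROS 2 / CycloneDDS settings; the file path below is only a placeholder for wherever you save the XML:

```shell
# Select the CycloneDDS RMW implementation for all ROS 2 nodes in this shell.
export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
# Point CycloneDDS at the XML configuration file (example path, adjust it).
export CYCLONEDDS_URI=file:///absolute/path/to/cyclonedds.xml
```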
-
To illustrate my situation, I made a small script to track message frequency (like ).

In the following situation I use CycloneDDS with its default configuration:

As I said, I need to heavily downsample the VLP cloud to get "acceptable" performance:

Then, using :

But still, it is far from ideal. In particular 

Last but not least: I observe that despite all my efforts configuring CycloneDDS, FastRTPS always seems way faster:

I am wondering why FastRTPS is not recommended for Autoware.
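The frequency-tracking script itself is not shown in the thread. Below is a minimal standalone sketch of how such a tracker might work; the ROS 2 subscription wiring is omitted, and `FrequencyTracker` is a hypothetical name, not code from the discussion:

```python
from collections import deque


class FrequencyTracker:
    """Estimate a topic's message rate (Hz) over a rolling window."""

    def __init__(self, window_size=100):
        # Keep only the most recent `window_size` arrival timestamps.
        self._stamps = deque(maxlen=window_size)

    def tick(self, stamp):
        """Record the arrival time (in seconds) of one message."""
        self._stamps.append(stamp)

    def hz(self):
        """Average rate over the window, or None until two samples exist."""
        if len(self._stamps) < 2:
            return None
        span = self._stamps[-1] - self._stamps[0]
        if span <= 0:
            return None
        return (len(self._stamps) - 1) / span
```

In a real node, `tick()` would be called from each subscription callback; `ros2 topic hz` gives a similar measurement from the command line.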
-
I think I finally managed to reach good performance! At the end of the day, it was only a matter of a few changes: