
MessageOutput: thread-safety and dynamic memory #1212

Closed

Conversation

schlimmchen
Contributor

Very early on I found issues with the serial output of OpenDTU, i.e., garbled messages, where one message would be interrupted by another. Also, some output seemed to be missing. There has been discussion about this, see #1130. Very recently, a change was implemented in 43cba67, which is a big step forward, but it does not solve all issues:

  • The MessageOutput buffer for the web console is still limited to an arbitrary 500 bytes.
  • Lots of functions defined in the core's Print.h and implemented in Print.cpp, especially println(String&), are still "vulnerable" to scheduling. Example: Scheduling can happen after printing the contents of a line, but before printing the \r\n.
  • Not all messages are printed using println(); some are stitched together with multiple print()s. Those can still be garbled beyond recognition due to scheduling.
  • Minor issue: There is a small chance of a buffer overflow in MessageOutput::write(uint8_t) because the lock is not protecting the buffer position check (illustrated right after this list).
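
To illustrate that last point, here is a minimal sketch of the race; the names (_buffer, _buffer_pos, _lock) are assumptions for illustration, not necessarily the identifiers used in MessageOutput:

#include <Arduino.h>
#include <cstdint>
#include <mutex>

// Illustration only: the bounds check runs outside the lock, so two tasks can
// both pass it while only one free slot remains; the second increment then
// writes past the end of _buffer.
struct MessageOutputLike {
    uint8_t _buffer[500];
    size_t _buffer_pos = 0;
    std::mutex _lock;

    size_t write(uint8_t c) {
        if (_buffer_pos < sizeof(_buffer)) {        // unprotected check
            std::lock_guard<std::mutex> lock(_lock);
            _buffer[_buffer_pos++] = c;             // possible out-of-bounds write
        }
        Serial.write(c);
        return 1;
    }
};

Taking the lock before the bounds check (or re-checking under the lock) closes that window.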

My proposal with this PR is as follows (a rough sketch of the idea follows the list):

  • Every task/context writes to its own buffer (managed by MessageOutput). This prevents messages from being garbled in the first place.
  • These buffers are dynamic, so there is no issue printing lots of text per line.
  • Each time a line is completed, and only then, the task's buffered message is printed on the serial console. Additionally, the whole line is moved to a line buffer for transport to the websocket.
  • In the context of the main loop, the buffered lines are moved to the websocket. Care is taken not to overrun the websocket queue, so all messages actually appear in the web console.
  • The line buffer for the web console within MessageOutput is dynamic, so no text has to be dropped (RAM limits still apply, of course). Lines are retried once the websocket handler has processed its queue.
  • Memory that can be freed is freed as soon as possible. Buffers of deleted tasks are also freed in each loop.
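
A rough sketch of that idea, with illustrative names and FreeRTOS task handles as keys; the actual code in this PR differs in detail:

#include <Arduino.h>
#include <freertos/FreeRTOS.h>
#include <freertos/task.h>
#include <deque>
#include <map>
#include <mutex>
#include <string>

// Sketch only: one line buffer per calling task, a shared queue of complete
// lines destined for the websocket, and a single mutex guarding all of it.
class MessageOutputSketch {
public:
    size_t write(uint8_t c) {
        std::lock_guard<std::mutex> lock(_mutex);
        std::string& buf = _taskBuffers[xTaskGetCurrentTaskHandle()];
        buf.push_back(static_cast<char>(c));
        if (c == '\n') {
            // only complete lines leave the per-task buffer
            Serial.write(reinterpret_cast<const uint8_t*>(buf.data()), buf.size());
            _lines.emplace_back(std::move(buf));              // moved, not copied
            _taskBuffers.erase(xTaskGetCurrentTaskHandle());  // free the buffer
        }
        return 1;
    }

    void loop() { // called from the main loop
        std::lock_guard<std::mutex> lock(_mutex);
        while (!_lines.empty() /* && the websocket queue has room */) {
            // hand _lines.front() to the web console websocket here
            _lines.pop_front();
        }
    }

private:
    std::mutex _mutex;
    std::map<TaskHandle_t, std::string> _taskBuffers;
    std::deque<std::string> _lines;
};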

@tbnobody
Owner

tbnobody commented Aug 6, 2023

Did you run a longer check regarding memory fragmentation? Keep in mind that we only have ~110 kB of memory available, a huge portion of which is required when rendering the prometheus output and the livedata JSON output. Memory fragmentation was the main reason that no dynamic memory is used.

@schlimmchen
Contributor Author

I really don't expect this to be a problem even on our memory-constrained ESP32 devices. The buffers aren't that big (lines of log output), so they can occupy gaps in the memory layout, and more importantly, they are freed as soon as possible and are indeed all freed routinely as long as the main loop is running. Moreover, I took care that each buffer is not copied unnecessarily: it is moved to _lines and then moved into a shared_ptr for the websocket, not copied (see the short illustration below). There is a chance that a buffer must grow, in which case the existing data is copied to a new block. There is also the unordered_map, which keeps a little memory allocated for each living task. That's the theory.
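
For illustration, the no-copy handoff looks roughly like this (names are made up, and the websocket send itself is omitted since the exact AsyncWebServer call is not the point here):

#include <deque>
#include <memory>
#include <string>

// Illustration: pop the oldest finished line and wrap it in a shared_ptr for
// the websocket without ever copying the character data.
std::shared_ptr<std::string> takeNextLine(std::deque<std::string>& lines)
{
    std::string line = std::move(lines.front());
    lines.pop_front();
    return std::make_shared<std::string>(std::move(line)); // moved again, still no copy
}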

I hate to admit it, but your point is of course totally valid. This change must not break existing functionality.

Let me point out, though, that it is frustrating to be unable to have "nice things" like this working solution to the problem of mangled logs just because some other component insists on allocating an insanely large block of consecutive memory (relative to the total memory). In the long run, that data should be processed in reasonable chunks instead. I hope that's achievable. Since I want prometheus integration myself in the near future, I might look into this myself.

Having said that, I ran a test like this:

  • Uploaded a firmware including this change and started logging the serial output using minicom -C <file>.
  • Used websocat from https://github.com/vi/websocat/.
  • Started websocat ws://<hostname>/livedata.
  • Started websocat ws://<hostname>/console.
  • Used while true; do wget http://<hostname>/api/prometheus/metrics -O /tmp/metrics.txt &>/dev/null; [ $? -ne 0 ] && echo "failed $(date -Iseconds)"; sleep 0.1; done to query the prometheus metrics approximately every 100ms.
  • (Made sure that the above command would actually log a message on error by compiling a firmware that always throws std::bad_alloc in the prometheus handler.)
  • Only one inverter (HM-400) is configured to be polled (does that affect the memory size required? AFAIK the block in the prometheus handler is always 40k?).
  • Reduced the inverter poll interval to 1 second by configuration.
  • Hacked the MQTT publish interval to 1 second (firmware change). MQTT is in use and connected to the broker.
  • Threw in a while true; do mosquitto_pub -h <host> -t 'solar-staging/<serial>/cmd/limit_nonpersistent_relative' -m 90; sleep 10; mosquitto_pub -h <host> -t 'solar-staging/<serial>/cmd/limit_nonpersistent_relative' -m 100; sleep 10; done for good measure (the MQTT task logs messages in its own context).
  • The heap is 42% used at the start of the test. I can actually observe it jumping to a value of 55% to 58% used, then jump back to 42%, at /info/system of the web application, when refreshing the page often.
  • My browser window was also using the live view websocket most of the time.

I let this run for one and a half hours and observed no wget error, and the uptime of the OpenDTU actually reached more than 60 minutes, so there was no reboot due to an uncaught bad allocation. The minicom log file did not show any "out of resources" messages.

Also, this change has been running for ~20 hours on my production board with no signs of issues. However, that board runs OpenDTU-OnBattery, so it's not perfectly comparable. Still, I did not cause any issues by querying the prometheus API on that node.

What do you think? Was 1 hour long enough (with the intervals so short)? Have I missed something that I should have run in parallel? Is this setup worth documenting for future stress tests?

@schlimmchen
Contributor Author

The newly added commit fixes a problem that existed before, but that has been made worse by the thread-safety proposed with this pull request. Even before, very long lines printed through MessageOutput would be chopped off at around 256 bytes (a guess, I did not check) because of a limited buffer in HardwareSerial (or the hardware, I don't know for sure).

However, nobody probably observed this problem, since usually no single call to print something would result in a line longer than that limit. Multiple print statements give HardwareSerial and the actual hardware enough time to make room for the next characters.
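
A sketch of what the new commit does, under the assumption (stated in the commit message further down) that HardwareSerial::write(buffer, size) can accept fewer bytes than offered when its TX buffer is full; the helper name is made up:

#include <Arduino.h>

// Keep offering the remainder until HardwareSerial has accepted the whole line.
static void writeFully(HardwareSerial& serial, const uint8_t* buf, size_t size)
{
    size_t written = 0;
    while (written < size) {
        size_t n = serial.write(buf + written, size - written);
        if (n == 0) {
            delay(1); // give the UART time to drain its TX buffer
        }
        written += n;
    }
}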

@schlimmchen
Contributor Author

I noticed again that the web console is missing messages in logs provided by a user of the OpenDTU-OnBattery fork, see the last paragraph in hoylabs#393 (comment).

I understand that this is not your problem until you notice missing messages in the web console output of your own project's users as well. However, before I create a pull request against said fork, hoping that @helgeerbe would be willing to merge it, I wanted to ask you, @tbnobody, whether there is anything else I can do to convince you that this proposed change is a valuable addition?

@tbnobody
Owner

I tested the PR today. The memory usage was enormous. At times it consumed 20 kB more RAM than without this PR. The response of the system info API was also a little bit laggy. The flash consumption was at about 50 kB, if I remember correctly.
This was the reason why I didn't include it in today's release.

@schlimmchen
Contributor Author

schlimmchen commented Aug 22, 2023

Thanks for considering and testing the change!

Would you mind elaborating on how you benchmarked the memory consumption, so I can try to optimize this change in this regard? What's the URL of the system info API you said was lagging (http://<host>/info/system?)?

What device are you testing against? Mine is an ESP32-S3 with two cores; maybe I need to use something else. Are we considering single-core ESP32?

+50k flash consumption really surprises me. I did not check it, to be honest. Maybe it's simple to fix by using different containers. I will have a look at this as well.

@tbnobody
Owner

tbnobody commented Aug 22, 2023

I am using the normal ESP32 without any extension, so it's dual core.
I did not really measure the RAM usage in any professional way. I just pressed the refresh button on the system info page multiple times and saw a RAM usage of ~145 kB, whereas without this PR it was at around 124-130 kB (depending on whether the internal AP is on or off).

And I've just double-checked the flash consumption... It's 1530937 vs 1535157 bytes, so ~4.2 kB (instead of the 50 kB mentioned above).

@schlimmchen
Contributor Author

Setup

I have an ESP32-WROOM-32U that I used for more detailed testing to address your concerns. I built for the generic_esp32 environment. It polls one inverter every second. It is connected to an MQTT broker and has a valid local time (NTP). The Wi-Fi AP is stopped, i.e., the uptime is greater than 3 minutes. A web browser is connected to the web console while the tests are conducted. An instance of minicom is connected to the serial port.

This is how I inspected the heap usage:

rm /tmp/heap_used.txt; for i in $(seq 0 100); do wget http://192.168.16.155/api/system/status -O - | jq '.heap_used' >> /tmp/heap_used.txt; done
cat /tmp/heap_used.txt | sort -n | head -n1 # min used
cat /tmp/heap_used.txt | sort -n | tail -n1 # max used

This is how I tested the latency in the system info page:

wrk -t1 -c1 -d30s --latency http://192.168.16.155/api/system/status

This neglects any delays and any lagging that is caused by the client when executing the web interface. This especially includes checking for a new firmware version online, which can take more or less time depending on various factors.

Flash consumption

Thanks for checking the flash usage again. It's the same for me. Are you okay with spending ~4.2 kB of flash on this improvement? We could carve out ~1.7 kB by switching to a std::map rather than a std::unordered_map, at the cost of slightly worse lookup performance.

Interesting problem

I noticed that the web console output and the serial output contain fewer lines per unit of time while either of the two tests is running, and the inverter is not polled successfully during the tests. This probably means that processing the HTTP response of api/system/status starves the Hoymiles library of processing time. This is true both for firmware that includes the change and for firmware that does not, which is plausible since the change does not affect processing of the system info JSON response. This might be worth an investigation on its own. Maybe the Hoymiles loop() should run in a different task with higher priority or even on CPU0 (a rough sketch follows below). That will require making sure that the interlocking of the public interface is sufficiently tight, of course.
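
Purely as a sketch of that idea, not something this PR implements (task name, stack size and priority are made-up values, and the locking of the Hoymiles public interface is not shown):

#include <Arduino.h>
#include <Hoymiles.h> // assumption: the header exposing the global Hoymiles instance

// Hypothetical: run the Hoymiles polling loop in its own task pinned to CPU0,
// so that HTTP/JSON work cannot starve it.
static void hoymilesTaskLoop(void*)
{
    for (;;) {
        Hoymiles.loop();
        vTaskDelay(pdMS_TO_TICKS(10));
    }
}

void startHoymilesTask()
{
    xTaskCreatePinnedToCore(hoymilesTaskLoop, "hoymiles", 4096, nullptr,
                            tskIDLE_PRIORITY + 2, nullptr, 0 /* CPU0 */);
}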

Results

Heap usage

Heap usage without the change is between 124120 and 132708 bytes.
Heap usage with the change is between 124940 and 131032 bytes. Another run: 123256 to 130464 bytes.

By chance, I even managed to capture a moment where the firmware without the change used more heap memory than with the change included.

It seemed very odd to me that this feature would cause 20 kB of heap memory to be hogged at all, and especially for a period long enough that one would actually see it on the system info page.

I understand that memory is a scarce resource and that you have been burned before, but realistically, this change will only use on the order of a few kilobytes of memory at the very worst. That could happen if a lot of software components with their own loop() wrote a lot of lines (kilobytes of text) to MessageOutput before MessageOutput could clear out those lines and free the respective memory. That does not happen on a regular basis.

Latency

Without the change:

Running 30s test @ http://192.168.16.155/api/system/status
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   444.35ms    7.35ms 477.13ms   87.50%
    Req/Sec     1.83      0.38     2.00     82.81%
  Latency Distribution
     50%  442.18ms
     75%  447.32ms
     90%  452.91ms
     99%  477.13ms
  64 requests in 30.05s, 41.56KB read
Requests/sec:      2.13
Transfer/sec:      1.38KB

With the change:

Running 30s test @ http://192.168.16.155/api/system/status
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   444.42ms    5.72ms 466.10ms   73.85%
    Req/Sec     1.91      0.29     2.00     90.77%
  Latency Distribution
     50%  442.70ms
     75%  447.05ms
     90%  450.46ms
     99%  466.10ms
  65 requests in 30.04s, 42.21KB read
Requests/sec:      2.16
Transfer/sec:      1.41KB

I conclude that your observation regarding the lagging in the system info page might have something to do with the client-side processing of the page in the browser. It seems that this change has no influence on the time it takes to prepare the information required to render the system info page.

That being said, preparing the JSON response for that page might be in need of improvement. 444 ms of exclusive processing time is quite a lot. It's probably that much because combing through the memory to calculate the heap usage is quite expensive.

Next Steps

What else should we look at to make sure this change does not cause issues elsewhere?

these adjustments guarantee unmangled log messages and more importantly,
guarantee that all messages from a particular component are printed to
the web console, which most people use to copy messages from when
reporting issues.

* use dynamic memory to allow handling of arbitrary message lengths.
* keep a message buffer for every task so no task ever mangles the
  message of another task.
* every complete line is written to the serial console and moved to
  a line buffer for sending them through the websocket.
* the websocket is always fed complete lines.
* feed only as many lines to the websocket handler as it can accept,
  so that no lines are dropped.
* lock all MessageOutput state against concurrent access.
* respect HardwareSerial buffer size: the MessageOutput class buffers
  whole lines of output printed by any task in order to avoid mangling
  of text. that means we hand over full lines to the HardwareSerial
  instance, which might be too much in one call to write(buffer, size).
  we now check the return value of write(buffer, size) and call the
  function again with the part of the message that could not yet be
  written by HardwareSerial.
@schlimmchen
Contributor Author

I would like to maintain that this would be a useful update to the MessageOutput class. The change has been shipping for months now in the downstream project OpenDTU-OnBattery without issues.

Today I re-introduced this change to the downstream project in hoylabs#567, as its owner reverted the change when recently merging from this upstream project (TaskScheduler introduction). For that reason I am coming back here, and I have updated the change to be compatible with TaskScheduler.

@schlimmchen schlimmchen closed this Feb 4, 2024
@schlimmchen schlimmchen deleted the thread-safe-messages branch February 4, 2024 14:45

github-actions bot commented Apr 1, 2024

This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new discussion or issue for related concerns.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 1, 2024