
DRAFT: Add hardware watchdog for the RP2040 platform #4349

Draft · wants to merge 4 commits into base: master
Conversation

The3rdPlace
Contributor

The watchdog must be enabled late in the boot stage, since the RP2040's watchdog only allows an 8-second delay before resetting, and some of the initialization calls after the call to rp2040Setup(), and even after the first call to rp2040Loop(), block execution long enough to exceed this delay.

Due to this architectural problem, we still have a potential freeze on the RP2040 if it hangs during boot and initialization, but practical experience indicates that in most cases a freeze happens long into the running state (often days), so unless one builds a node to be deployed deep in the wilderness, we accept this risk.

There is no likely workaround for the short delay counter in the hardware watchdog, since it is fixed by hardware registers.
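
For reference, the RP2040 hardware watchdog is driven through the pico-sdk's hardware/watchdog.h API. A minimal sketch of the pattern described above (enable late in boot, then feed regularly) could look like the following; the function names are illustrative and this is not the PR's actual code:

// Illustrative sketch only, not the code in this PR.
#include "hardware/watchdog.h"

// Hypothetical hook, called once boot and initialization have finished.
void enableHardwareWatchdog()
{
    // 8000 ms is close to the maximum the RP2040 hardware supports.
    // pause_on_debug = true stops the countdown while halted in a debugger.
    watchdog_enable(8000, true);
}

// Hypothetical hook, called from the regular loop; it must run at least
// once every 8 seconds or the chip resets.
void feedHardwareWatchdog()
{
    watchdog_update();
}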

@GUVWAF
Member

GUVWAF commented Jul 30, 2024

The downside of adding a watchdog is that it hides (potentially serious) bugs. Especially since the maximum timeout you can set is relatively short and it requires workarounds like this to make it work, I'm hesitant about it. For nRF52 we don't have a watchdog, and for ESP32 it was increased from 45 to 90 seconds.

Have you confirmed the watchdog is properly reset when the node receives a lot of packets back-to-back, reconnects to a client app when you have a lot of nodes in the DB, or reconnects to your Wi-Fi AP (this delay might be problematic), etc.?

@The3rdPlace
Contributor Author

Sorry for the late answer, vacation time got me :-D

I agree that 8 seconds is too little time; my test devices had uptimes between 3 and 14 hours, so clearly a problem.

I have reworked the solution to provide the same 90-second timeout as for the ESP32 device. I will mark this PR as draft until my devices have run for some days (approx. 32 hours and counting in testing with my main node).
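
A minimal sketch of this layered approach (names are illustrative, not the PR's actual code), assuming an Arduino-style millis() clock and the standard pico-sdk watchdog calls: well-behaved code refreshes a software timestamp, and a periodic thread keeps feeding the ~8-second hardware watchdog only while that timestamp is younger than 90 seconds.

// Sketch only: layering a 90-second software timeout on top of the
// ~8-second RP2040 hardware watchdog. Assumes watchdog_enable(8000, true)
// was already called late in boot.
#include <Arduino.h>
#include "hardware/watchdog.h"

static uint32_t lastSoftKickMs = 0;

// Called from the main loop and other places that prove liveness.
void softWatchdogKick()
{
    lastSoftKickMs = millis();
}

// Called every few seconds from a periodic thread.
void hardwareWatchdogTick()
{
    uint32_t timeout = millis() - lastSoftKickMs;
    if (timeout < 90 * 1000) {
        watchdog_update(); // still healthy: restart the ~8 s hardware countdown
    }
    // else: stop feeding and let the hardware watchdog reboot the node
}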

@The3rdPlace The3rdPlace marked this pull request as draft August 13, 2024 20:04
@The3rdPlace The3rdPlace changed the title Add hardware watchdog for the RP2040 platform DRAFT: Add hardware watchdog for the RP2040 platform Aug 13, 2024

// Update watchdog if timeout is below 90 seconds, same as esp32 watchdog
if (timeout < 90 * 1000) {
    watchdog_update();
Member

This is not going to work, right? In the worst case you only call watchdog_update() just before the 90 seconds expires, but it already triggers after 8 seconds.

Contributor Author

timeout needs to be below 90 seconds; in most cases it's close to 0 (zero) (due to soft updates from the loop or other places), so watchdog_update() will be called approximately every 4 seconds - but I will add a trace line so that we can verify this, np.

Member

I mean, I don't see why you're checking for less than 90 seconds? It will never be more than 8 seconds, because by then it has already rebooted.

@GUVWAF
Member

GUVWAF commented Aug 15, 2024

Sorry for the late answer, vacation time got me :-D

No problem, we're all doing this for fun :)

my test devices had uptimes between 3 and 14 hours, so clearly a problem.

If you have such short uptimes, it would be relatively easy to get to the root cause of the problem. You can e.g. connect another Pico as a debugger. The call stack when it hangs will likely give us a clue about what's going wrong - it might even be something in the Arduino Pico core, like we had in e.g. #2558.

If it's now going to reboot every 3 hours, that's not a good solution either IMO, as you'll lose everything in RAM (e.g. the received packet queue to be delivered to a phone), and every time it boots it sends out its NodeInfo to everyone and asks for a response, which leads to a storm of packets if it's in a good position.

}

// Time until this thread runs again
return 4 * 1000; // 4 seconds
Member

@GUVWAF GUVWAF Aug 15, 2024

This is just what you ask it to do. If it doesn't have time to schedule this thread, it won't run this fast.

ArduinoThread (which this utilizes underneath) is not a real-time OS:

It should be noted that these are not “threads” in the real computer-science meaning of the term: tasks are implemented as functions that are run periodically. On the one hand, this means that the only way a task can yield the CPU is by returning to the caller, and it is thus inadvisable to delay() or do long waits inside any task.
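
A short sketch of what that cooperative model means for the watchdog thread (the runOnce() shape below mirrors the firmware's periodic-thread pattern but is only illustrative): the return value is a request, not a guarantee, so any task that blocks the loop also delays the watchdog feed.

// Sketch only.
#include <cstdint>

void hardwareWatchdogTick(); // feeds the hardware watchdog if still healthy (see earlier sketch)

class WatchdogFeederSketch // assumed to be scheduled like the firmware's periodic threads
{
  public:
    int32_t runOnce()
    {
        hardwareWatchdogTick();
        // Ask to run again in 4 s; if another task blocks the single
        // cooperative loop, this simply runs late and the ~8 s hardware
        // watchdog can expire even though nothing is really wrong.
        return 4 * 1000;
    }
};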

Contributor Author

Correct, so if some other thread is doing lengthy work in such a task, that is absolutely a problem and we should find out where this happens (we already know that Wi-Fi init is blocking!). If all other threads behave well and schedule lengthy work to slower loops or queues, this should work, but we can schedule it faster to leave more headroom in case calls are a bit delayed.

I will add some diagnostic output while field testing to get some indication of the precision of the scheduler.

Member

we should find out where this happens

Yes, I agree, but it can be anywhere in the firmware, under any kind of scenario. It's very hard to guarantee that if you fix one case it won't appear somewhere else, and if it leads to an endless reboot loop, that would be a nasty regression.
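
One way to make such a regression visible early: the pico-sdk can report whether the previous reset was triggered by the watchdog, so the node could at least log watchdog-caused reboots at startup. A minimal sketch, assuming the standard watchdog_caused_reboot() call (the log line is just a placeholder):

// Sketch only: detect at boot whether the previous reset came from the
// hardware watchdog, so repeated watchdog resets (a potential reboot
// loop) show up in the logs.
#include <cstdio>
#include "hardware/watchdog.h"

void reportWatchdogReboot()
{
    if (watchdog_caused_reboot()) {
        printf("Previous reset was caused by the hardware watchdog\n");
    }
}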

@The3rdPlace
Contributor Author

Just to make it clear :-D the 3-hour reboots were with only the 8-second timeout, so no need to debug that ;)

I am field testing the current code now to check what uptimes we get with the current revision; then I also need to run some tests to establish a baseline - all of this takes some time, so I'll just leave it as a draft until I am a bit wiser about the general stability.

@GUVWAF
Member

GUVWAF commented Aug 15, 2024

Just to make it clear :-D the 3-hour reboots were with only the 8-second timeout, so no need to debug that ;)

Ah, I see. Likely it wasn't returning to rp2040Loop() often enough.
But still, in your current revision there's only an 8-second timeout. It's fixed in the hardware; I believe there's no way around it.

@fifieldt
Contributor

fifieldt commented Oct 8, 2024

Hi @The3rdPlace, just checking in to see how your tests went...
