Replies: 5 comments
-
FYI @thesofproject/maintainers @LaurentiuM1234 @lyakh @abonislawski @marcinszkudlinski @tmleman @marc-hb @keqiaozhang
-
With regard to actions taken in the past to ensure approach (1) works:
These have proven very effective at catching minor hiccups, but where we run into problems is with more complex failures like those seen in Jan-Feb this year. By the time an issue observed in Zephyr had been debugged, fixed, reviewed, and merged, a new problem or interface issue had already surfaced.
-
Adding @nashif @mwasko: there is some low-hanging fruit that could be implemented to ease this.
-
Summarizing what I just posted in #8903 (comment): (1) we CAN branch as much as we want; (2) we SHOULD branch as little as possible, only on a case-by-case basis. Ideally: only for major API changes and for rare regressions.

There is not a lot we can do about API changes; AFAIK the Zephyr project is already trying to minimize them.

There is no "magic" process change or silver bullet to avoid regressions either, only continuous improvements. This is just software development "business as usual": identify coverage gaps, address them, and raise the validation bar. Example: This test does not pass yet!

Again, this is a very typical integration issue, nothing unusual. The only well-known fix to avoid "Big Bang" integration issues is more Continuous Integration: https://martinfowler.com/articles/continuousIntegration.html#build-fast etc. That page is 20+ years old and was updated last month; it's all there. In other words: to minimize branching, just test harder, better, faster, stronger - and earlier. No silver bullet, just hard validation work. When you must branch, branch as little as possible for the shortest time possible.

Test coverage will never be perfect and there will always be regressions. But we keep seeing some PRs being merged with quick reviews and not enough test coverage. For instance: risky ADSP PRs merged in Zephyr without any SOF CI run. Or even worse: PRs merged with some tests failing. Buggy work in progress can be shared outside the main branch.

To make "cascade" situations like #8818 less frequent, just raise the validation bar and demand more testing before merge. Quality is just a numbers game, no ground-breaking process change required. Some of our test gaps:
-
Driving by: the core bug here is unavoidable. SOF shares subsystems like PM with other Zephyr users doing different things, so anything not regularly tested in Zephyr upstream will be fragile. As pointed out above, some SOF-maintained hardware smoke tests in (or near) Zephyr CI would go a long way toward detecting this kind of thing before merge. Just something that boots a device and does a suspend/resume would have detected this, AFAICT. There actually is one that does exactly that in tests/boards/intel_adsp/smoke, which I wrote long ago [1]. Someone just needs to set it up to run on hardware regularly; something this small could even be run on every PR.

Regarding branching strategies, you can go a little further and base SOF on a fixed Zephyr release tag, then hand-curate a branch on top of that containing only the patches that SOF requires (see the sketch below). This is immune to regressions, since you only ever do deliberate picks from upstream for needed features (plus a regular rebase onto each Zephyr release), but it essentially doubles the effort of getting SOF-authored code into Zephyr, as it has to be maintained in two places for a full release cycle.

[1] May be somewhat bitrotten on MTL. It requires host support in the cavstool.py script and I don't know if those features went along when acetool.py was forked.
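A minimal sketch of that curated-branch flow, assuming hypothetical refs (v3.6.0, v3.7.0, sof-curated, and the SHA are placeholders, not actual SOF or Zephyr names):

```sh
# Start the curated branch from a fixed upstream release tag.
git checkout -b sof-curated v3.6.0

# Deliberately pick only the upstream patches that SOF needs.
git cherry-pick <sha-of-needed-patch>

# When the next Zephyr release ships, replay the curated patches onto it.
git rebase --onto v3.7.0 v3.6.0 sof-curated
```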
-
Let's branch off the discussion from #8903 (comment) to here.
In short, at the start of 2024 in the SOF 2.9 cycle, we've had the first case where we've had to temporarily branch Zephyr main in order to apply a revert of an upstream Zephyr commit to make the SOF CI test suite pass (see #8903).
Normally the process would be to fix the regression in Zephyr upstream, let that be reviewed and merged, and then submit a new SOF PR that takes a newer Zephyr baseline (via a west.yml modification in SOF).
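For illustration, that baseline bump is just a revision change in SOF's west.yml. This is a simplified, hypothetical fragment; the SHA is a placeholder and the real manifest has more fields:

```yaml
manifest:
  projects:
    - name: zephyr
      # Bumped to the new Zephyr baseline that contains the fix.
      revision: 0123456789abcdef0123456789abcdef01234567
```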
This time around, we got stuck in a cascade of problems, documented in #8818.
I think this is a good time to gather input on how we manage this in the future. Basic options we have:
1. Current approach: developers submit Zephyr updates directly to SOF via west.yml updates, with no possibility to carry SOF-specific fixes or reverts.
2. Maintain an SOF view of Zephyr "main" in SOF GitHub, allowing SOF-specific backports and reverts to be submitted. This takes a cue from the "sof-dev" process used by the SOF Linux kernel team (https://thesofproject.github.io/latest/contribute/linux/development_tree.html#sof-maintainers-process) and applies a similar approach to manage a mirror of Zephyr main. This would establish a more formal staging tree in the SOF project for Zephyr, allowing it to carry some local reverts and fixups that are needed for SOF (see the sketch after this list).
3. Stick to (1) but allow temporary branches in the style of (2) to handle complex interface changes.
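As a rough sketch of what option (2) could look like in practice (the remote, branch name, and SHA are hypothetical placeholders):

```sh
# Create the SOF-maintained staging branch from upstream Zephyr main.
git fetch upstream
git checkout -b sof-zephyr-staging upstream/main

# Carry a local revert of the regressing upstream commit.
git revert <sha-of-regressing-commit>

# SOF's west.yml would then point at this staging branch instead of an
# upstream Zephyr commit, until the proper fix lands upstream.
```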
We have now slipped into (3) by necessity, but the discussion is whether this is a one-off, or whether we should consider (2), or how we ensure that (1) works better in the future.
cc: