Replies: 5 comments
-
FYI @thesofproject/maintainers @LaurentiuM1234 @lyakh @abonislawski @marcinszkudlinski @tmleman @marc-hb @keqiaozhang
-
With regard to actions taken in the past to ensure approach (1) works:
These have proven very effective at catching minor hiccups, but where we run into problems is with more complex failures like those seen in Jan-Feb this year. By the time an issue observed in Zephyr had been debugged, fixed, reviewed, and merged, a new problem or interface issue had already surfaced.
-
Adding @nashif @mwasko: there is some low-hanging fruit that could be implemented to ease this.
-
Summarizing what I just posted in #8903 (comment): (1) we CAN branch as much as we want; (2) we SHOULD branch as little as possible, only on a case-by-case basis. Ideally: only for major API changes and for rare regressions.

There is not a lot we can do about API changes; AFAIK the Zephyr project is already trying to minimize them.

There is no "magic" process change or silver bullet to avoid regressions either, only continuous improvements. This is just software development "business as usual": identify coverage gaps, address them, and raise the validation bar. Example: This test does not pass yet!

Again, this is a very typical integration issue, nothing unusual. The only well-known fix to avoid "Big Bang" integration issues is more Continuous Integration: https://martinfowler.com/articles/continuousIntegration.html#build-fast etc. That page is 20+ years old and was updated last month; it's all there. In other words: to minimize branching, just test harder, better, faster, stronger - and earlier. No silver bullet, just hard validation work. When you must branch, branch as little as possible for the shortest time possible.

Test coverage will never be perfect and there will always be regressions. But we keep seeing some PRs being merged with quick reviews and not enough test coverage. For instance: risky ADSP PRs merged in Zephyr without any SOF CI run. Or even worse: PRs merged with some tests failing. Buggy work in progress can be shared outside the main branch.

To make "cascade" situations like #8818 less frequent, just raise the validation bar and demand more testing before merge. Quality is just a numbers game, no ground-breaking process change required. Some of our test gaps:
-
Driving by: the core bug here is unavoidable. SOF shares subsystems like PM with other Zephyr users doing different things, so anything not regularly tested in Zephyr upstream will be fragile. As pointed out above, some SOF-maintained hardware smoke tests in (or near) Zephyr CI would go a long way toward detecting this kind of thing before merge. Just something that boots a device and does a suspend/resume would have detected this, AFAICT. There actually is one that does exactly that in tests/boards/intel_adsp/smoke, which I wrote long ago [1]. Someone just needs to set it up to run on hardware regularly; something this small could even be run on every PR.

Regarding branching strategies, you can go a little further and base SOF on a fixed Zephyr release tag, then hand-curate a branch on top of that containing only the patches that SOF requires (see the sketch below). This is immune to regressions, since you only ever do deliberate picks from upstream for needed features (plus a regular rebase onto each Zephyr release), but it essentially doubles the effort of getting SOF-authored code into Zephyr, as it has to be maintained in two places for a full release cycle.

[1] May be somewhat bitrotten on MTL. It requires host support in the cavstool.py script and I don't know if those features went along when acetool.py was forked.
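A minimal sketch of that curated-branch flow, assuming hypothetical refs (v3.6.0, v3.7.0, sof-curated, and the SHA are placeholders, not actual SOF or Zephyr names):

```sh
# Start the curated branch from a fixed upstream release tag.
git checkout -b sof-curated v3.6.0

# Deliberately pick only the upstream patches that SOF needs.
git cherry-pick <sha-of-needed-patch>

# When the next Zephyr release ships, replay the curated patches onto it.
git rebase --onto v3.7.0 v3.6.0 sof-curated
```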
-
Let's branch off the discussion from #8903 (comment) to here.
In short, at the start of 2024 in the SOF 2.9 cycle, we've had the first case where we've had to temporarily branch Zephyr main in order to apply a revert of an upstream Zephyr commit to make the SOF CI test suite pass (see #8903).
Normally the process would be to fix the regression in Zephyr upstream, let that be reviewed and merged, and then submit a new SOF PR that takes a newer Zephyr baseline (via a west.yml modification in SOF).
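For illustration, that baseline bump is just a revision change in SOF's west.yml. This is a simplified, hypothetical fragment; the SHA is a placeholder and the real manifest has more fields:

```yaml
manifest:
  projects:
    - name: zephyr
      # Bumped to the new Zephyr baseline that contains the fix.
      revision: 0123456789abcdef0123456789abcdef01234567
```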
This time around, we got stuck in a cascade of problems, documented in #8818.
I think this is a good time to gather input on how we manage this in the future. Basic options we have:
1. Current approach: developers submit Zephyr updates directly to SOF via west.yml updates, with no possibility to carry SOF-specific fixes or reverts.
2. Maintain an SOF view of Zephyr "main" in SOF GitHub, allowing SOF-specific backports and reverts to be submitted. This takes a cue from the "sof-dev" process used by the SOF Linux kernel team (https://thesofproject.github.io/latest/contribute/linux/development_tree.html#sof-maintainers-process) and applies a similar approach to manage a mirror of Zephyr main. This would establish a more formal staging tree in the SOF project for Zephyr, allowing it to carry some local reverts and fixups that are needed for SOF (see the sketch after this list).
3. Stick to (1) but allow temporary branches in the style of (2) to handle complex interface changes.
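As a rough sketch of what option (2) could look like in practice (the remote, branch name, and SHA are hypothetical placeholders):

```sh
# Create the SOF-maintained staging branch from upstream Zephyr main.
git fetch upstream
git checkout -b sof-zephyr-staging upstream/main

# Carry a local revert of the regressing upstream commit.
git revert <sha-of-regressing-commit>

# SOF's west.yml would then point at this staging branch instead of an
# upstream Zephyr commit, until the proper fix lands upstream.
```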
We have now slipped into (3) by necessity, but the discussion is whether this is a one-off, or whether we should consider (2), or how we ensure that (1) works better in the future.
cc: