Pipeline 2.0 RFC #8645
marcinszkudlinski started this conversation in Ideas
Replies: 2 comments
- Does the implementation enforce an exact 1 ms period? If not, I would recommend replacing "1 ms" with "scheduling period", whose default is 1 ms.
  This statement conflicts with the previous one about processing data in 1 ms chunks. It may need an additional explanation of what is meant by "consume all data" (and when).
- Discussion 12/19/23
OBSOLETE - the newest version is here:
thesofproject/sof-docs#497
readable version: https://marcinszkudlinski.github.io/sof-docs/PAGES/architectures/firmware/sof-common/pipeline_2_0/pipeline2_0_discussion.html
DOC version: 0.2 12/18/2023
This document describes a set of architecture changes known as “pipeline 2.0”.
The purpose of the changes is to make the pipeline more flexible.
An initial heads-up is in #7261; some of the changes have already been implemented, but with a lot of workarounds, e.g. in DP processing.
1. Module scheduling
Low Latency (LL) scheduling type
All LL modules will be called every 1 ms, one by one, in the context of a single high-priority worker thread, starting from the first data producer and ending at the last data consumer. All LL modules working on a single core must finish processing within 1 ms and must process data in 1 ms chunks. The latency of a complete module chain (when all modules are on the same core) is 1 ms.
2.0 new: Each module should consume all available data from its input buffer(s); no data must be left in an input buffer. There are exceptions: if a module needs to keep data in its buffers between cycles (like the ASRC module), it may request that at binding time. In that case the binding code must create a buffer that fulfills this requirement.
2.0 new: Modules located on different cores may be freely bound and exchange data cross-core, even in the case of an LL-LL connection, but 1) each cross-core bind adds 1 ms of latency to processing and 2) there are certain limitations described later.
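To illustrate the LL scheduling model described above, here is a minimal sketch of what a single LL tick could look like. The names (ll_module, ll_tick, process) are illustrative assumptions, not the actual SOF scheduler code:

```c
/* Illustrative sketch only - not the actual SOF LL scheduler code. */
struct ll_module {
	struct ll_module *next;			/* next module in scheduling order */
	void (*process)(struct ll_module *m);	/* consume input, produce output */
};

/*
 * One LL tick, executed every scheduling period (1 ms by default) in a
 * single high-priority thread. Modules are walked in a fixed order, from
 * the first data producer to the last data consumer; the whole walk must
 * complete within the period, and each module drains its input completely.
 */
static void ll_tick(struct ll_module *head)
{
	for (struct ll_module *m = head; m; m = m->next)
		m->process(m);
}
```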
2. Data Processing (DP) scheduling type (partially 2.0 new)
Each module works in its own preemptible thread context, with lower priority than LL. All DP module threads have the same priority. If more than one module is ready, the scheduler will choose the module with the closest deadline, where a deadline is the last moment at which the following module will have to start processing.
Modules will be scheduled for processing when they have enough data at all inputs and enough space to store the results at all outputs.
Current limitation/simplification: a DP module must be surrounded by LL modules; there is no possibility to connect DP to DP. DP-to-DP binding and proper deadline calculation is a complex task and will be introduced later.
A DP module may be located on a different core than the modules bound to it; there is no difference in processing or latency.
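The deadline rule above is essentially earliest-deadline-first selection. A minimal sketch of the pick, with hypothetical names and a plain ready list, is shown below; the real scheduler also has to handle readiness checks, priorities and thread wake-ups:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative sketch of earliest-deadline-first selection among ready DP modules. */
struct dp_module {
	struct dp_module *next;		/* next entry on the ready list */
	uint64_t deadline;		/* last moment the following module must start */
};

/* Pick the ready DP module with the closest deadline; NULL if none is ready. */
static struct dp_module *dp_pick_next(struct dp_module *ready_list)
{
	struct dp_module *best = NULL;

	for (struct dp_module *m = ready_list; m; m = m->next)
		if (!best || m->deadline < best->deadline)
			best = m;

	return best;
}
```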
3. Sink/src interface
The sink/source API is an interface for data flow between bound modules. It was introduced some time ago.
The main job of a module is to process audio data. To do that, a typical module needs a source to get data from and a sink to send the processed data to. In the current pipeline there is only one type of data flow, using the comp_buffer/audio_stream structures and the surrounding procedures.
Pipeline 2.0 introduces an API that allows flexible data flow – the sink/source API.
There are two sides of the sink/source API:
- An API provider, typically a buffer or (2.0 new) a module like a DAI. A provider of the sink/source API is an entity implementing all required API methods and providing ready-to-use pointers to data to be processed / buffer space to store processed data.
  It is up to the provider to take care of buffer management, avoiding conflicts, cache coherency (if needed), etc.
  Sink/source providers may have their own properties and limitations – e.g. a DAI may not be able to provide data to other cores. See the following chapters for details.
- An API user, typically a processing module. A user of the sink/source API is an entity that simply calls the API methods and gets ready-to-use pointers to data for processing / buffer space to store the results.
  An API user does not need to perform any extra operations on the data, such as taking care of cache coherency; it can simply use the provided pointers. It is up to the pipeline code to pick a proper API provider. See the following chapters for details.
Sink/source naming convention: always look from the API user's (not the API provider's) point of view.
(2.0 new) In the typical case the user of the API is a processing module and the provider is a buffer, but there are other possibilities. If a module – for any reason – needs to have an internal data buffer, it may optimize the flow by exposing that buffer to others, i.e. by providing the sink/src API itself. A typical example of such a module is a DAI, which needs an internal buffer for DMA and may provide data to the next module directly, without an additional buffer in the middle.
Currently there is an optimization in the code – a DAI may use the buffer between itself and the next module for its own purposes – but this is an optimization trick/hack. The sink/src API allows doing it in a natural and flexible way.
Another example of a module providing the API may be a mixout: it accepts one stream at its input, keeps the data in an internal buffer, and sends it out to several other modules (as identical copies) by providing several instances of the source API, exposing the same data buffer to several receivers. Also, a unique mixin-mixout pair may use the sink/src API to expose their internal buffers to each other.
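To make the user side concrete, below is a minimal sketch of the processing loop of an API user. The function names mirror SOF's sink/source headers in spirit, but the prototypes shown here are deliberately simplified assumptions (suffixed _simple) rather than the real signatures:

```c
#include <stddef.h>
#include <string.h>

/*
 * Simplified model of the sink/source API, for illustration only. The real
 * SOF interface has richer signatures (circular-buffer start/size pointers,
 * error codes, etc.).
 */
struct sof_source;	/* provides ready-to-read data */
struct sof_sink;	/* provides ready-to-write space */

const void *source_get_data_simple(struct sof_source *src, size_t size);
void source_release_data_simple(struct sof_source *src, size_t size);
void *sink_get_buffer_simple(struct sof_sink *snk, size_t size);
void sink_commit_buffer_simple(struct sof_sink *snk, size_t size);

/*
 * An API user just asks for pointers and uses them; buffer management and
 * cache coherency are the provider's responsibility.
 */
static int process_one_cycle(struct sof_source *src, struct sof_sink *snk,
			     size_t bytes)
{
	const void *in = source_get_data_simple(src, bytes);
	void *out = sink_get_buffer_simple(snk, bytes);

	if (!in || !out)
		return -1;		/* not enough data or space this cycle */

	memcpy(out, in, bytes);		/* a real module processes, not just copies */

	source_release_data_simple(src, bytes);	/* mark data as consumed */
	sink_commit_buffer_simple(snk, bytes);	/* mark data as produced */
	return 0;
}
```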
4. Module binding
There may be 3 kinds of bindings:
4.1 entity using sink/source to entity using sink/source
Typically, a module to a module. This is the most natural way of binding (in the current code, the only way) and requires a buffer in between:
Using a buffer provides a lot of flexibility, allowing cross-core binds, optional data linearization and LL-to-DP connections – just the proper type of buffer needs to be used. See the following chapter for details.
4.2 (2.0 new) direct connection entity exposing sink/source to entity using sink/source
Typically a DAI providing/accepting data to/from a module. There is no buffer in between, but binding a module to a module without a buffer implies some limitations:
In the rare situation when any of the above conditions is not met (e.g. a cross-core or DP module), a proper buffer must be used together with an additional sink_to_source copier:
4.3 (2.0 new) entity exposing sink/source to entity exposing sink/source
An extremely rare connection, like DAI to DAI. Both entities expose their internal buffers via sink/source. The connection requires a sink_to_source copier.
Again, the modules must:
In the rare situation when any of the above conditions is not met, a proper buffer must be used together with 2 sink_to_source copiers:
It looks complicated, but it will probably be a very rare case, like 2 DAIs on separate cores (!!) bound together. Maybe it is not worth implementing at all.
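As a summary of the three binding kinds, here is an illustrative decision helper. The structures and conditions are simplified assumptions (only "same core" and "no DP module" are modelled as the direct-connection conditions, as hinted above); the full rules, including linearization needs, are covered in the following chapters:

```c
#include <stdbool.h>

/* Simplified description of one endpoint of a bind. */
struct bind_endpoint {
	bool exposes_api;	/* provides its own sink/source (e.g. a DAI) */
	bool is_dp;		/* scheduled as a DP module */
	int core;		/* core the endpoint is bound to */
};

/* What the bind needs between the two endpoints. */
struct bind_plan {
	bool need_buffer;	/* a buffer must sit between the endpoints */
	int copiers_needed;	/* number of sink_to_source copiers */
};

static struct bind_plan plan_bind(const struct bind_endpoint *a,
				  const struct bind_endpoint *b)
{
	struct bind_plan plan = { .need_buffer = true, .copiers_needed = 0 };
	bool direct_ok = a->core == b->core && !a->is_dp && !b->is_dp;

	/* 4.1: user to user - always through a buffer, no copiers */
	if (!a->exposes_api && !b->exposes_api)
		return plan;

	/* 4.2: exposer to user - direct if the conditions are met,
	 * otherwise a buffer plus one copier on the exposer side
	 */
	if (a->exposes_api != b->exposes_api) {
		plan.need_buffer = !direct_ok;
		plan.copiers_needed = direct_ok ? 0 : 1;
		return plan;
	}

	/* 4.3: exposer to exposer - at least one copier; a buffer and a
	 * second copier when the direct conditions are not met
	 */
	plan.need_buffer = !direct_ok;
	plan.copiers_needed = direct_ok ? 1 : 2;
	return plan;
}
```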
5. Module binding – module bind requirements
A module should declare its needs on every input and output:
Currently all buffers are circular; if a module needs linear data, it performs the linearization by itself.
This is not optimal in many ways:
Special case: DP to LL bind (2.0 new)
This type of binding requires one adjustment. As stated before, LL modules need to process all available data from their inputs. A DP module, however, may and usually will work in cycles longer than 1 ms, providing several 1 ms data chunks for the LL module at once.
Solution: regardless of the real amount of data stored in the buffer, the LL module should be able to retrieve only the data portion required for its 1 ms processing (its IBS, input buffer size). This should be done at the source API level.
Optional: there may be a "fast mode", allowing LL modules to process all data from large buffers in a single step. Details TBD – there may be a lot of problems, including sudden CPU load spikes, problems with LL internal buffers, etc.
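A tiny sketch of the proposed source-level cap for the DP-to-LL case: even if the DP producer left several periods of data in the buffer, the source reports at most the LL consumer's IBS per cycle, so the "drain the input completely" rule still holds. The wrapper type and names are hypothetical:

```c
#include <stddef.h>

/* Hypothetical view of a DP -> LL source that caps what the LL side sees. */
struct capped_source {
	size_t ibs;		/* LL consumer's input buffer size, bytes per 1 ms cycle */
	size_t available;	/* bytes currently stored by the DP producer */
};

/* Data offered to the LL consumer in one cycle: at most its IBS. */
static size_t capped_source_available(const struct capped_source *s)
{
	return s->available < s->ibs ? s->available : s->ibs;
}
```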
6. Module binding – types of buffers (2.0 new)
As stated before, the most common type of binding will be a "classic" connection of modules – users of the sink/src API with a buffer providing source/sink between them.
To fulfill all modules' requirements, several types of buffers need to be used (the current buffer implementation, "comp_buffer" and "audio_stream", will be removed):
6.1 shared buffer (2.0 new)
A connection between 2 LL modules in a chain
In the case of a typical LL pipeline, each of the modules processes a complete set of data on its input and produces a complete set of data on its output. That typically means 16–48 audio frames per LL cycle.
The requirement is that the input buffer(s) are always drained completely.
In the case of an LL chain of modules located on the same core:
A huge optimization may be made – each of the above buffers may share the same memory space.
Note that this does not apply to situations where there are several input pins:
The order of processing will be:
LL1, LL2, LL5, LL6, LL3, LL4
In the example above, 2 shared memory spaces should be used, marked above with red and green colors. A proper LL chain detector must be implemented.
The size of the memory space should be 2 * MAX(all OBSes, all IBSes).
Note that a shared buffer:
That means that a shared buffer cannot be used if any of the LL modules has special needs (like ASRC, which must keep some data in its buffers between cycles).
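The sizing rule above is simple enough to sketch directly; the chain_module structure below is an illustrative stand-in for however the LL chain detector ends up representing a chain:

```c
#include <stddef.h>

/* Illustrative representation of one module in a detected LL chain. */
struct chain_module {
	struct chain_module *next;
	size_t ibs;	/* input buffer size, bytes per cycle */
	size_t obs;	/* output buffer size, bytes per cycle */
};

/* Shared memory space for the chain: 2 * MAX over all IBS and OBS values. */
static size_t shared_space_size(const struct chain_module *chain)
{
	size_t max = 0;

	for (const struct chain_module *m = chain; m; m = m->next) {
		if (m->ibs > max)
			max = m->ibs;
		if (m->obs > max)
			max = m->obs;
	}

	return 2 * max;
}
```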
6.2 Lockless cross core data buffer
This kind of buffer is to be used everywhere a shared buffer cannot be used and the data flow does not need linearization (in the case of an LL-to-LL connection the data will be linear in a natural way).
This buffer can provide:
The buffer code is currently upstreamed as "dpQueue", as it was initially intended to work with DP modules. (2.0 new) This name should be changed.
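For illustration, the core idea behind a lockless cross-core buffer is a single-producer/single-consumer ring where each side only ever advances its own index. The sketch below shows just that idea; the real dpQueue additionally deals with cache coherency, wrap-around, IBS/OBS granularity, etc.:

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>

/*
 * Minimal single-producer/single-consumer lockless ring (conceptual sketch).
 * The producer core advances only write_idx, the consumer core advances only
 * read_idx, so no lock is needed; one byte of capacity is kept free to tell
 * "full" from "empty".
 */
struct xcore_ring {
	uint8_t *data;
	size_t size;			/* capacity in bytes */
	_Atomic size_t write_idx;	/* advanced only by the producer core */
	_Atomic size_t read_idx;	/* advanced only by the consumer core */
};

/* Bytes currently readable by the consumer. */
static size_t xcore_ring_bytes_available(struct xcore_ring *r)
{
	size_t w = atomic_load(&r->write_idx);
	size_t rd = atomic_load(&r->read_idx);

	return (w + r->size - rd) % r->size;
}
```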
6.3 Linearization cross core data buffer (2.0 new)
This buffer is the most sophisticated of all. It needs to combine all the features of the "lockless cross core data buffer" – unfortunately except the small overhead – while enforcing linear data on input/output.
This buffer should be used if the modules cannot be bound using a shared buffer, at least one of the modules is DP, and any of the modules requires linear data / linear buffer space.
Implementation details TBD; it will probably require some internal data copying/moving, etc. There is room for optimization, e.g. avoiding some data moves when only one of the modules requires linear data.
7. Binding pipelines to cores
Each module is bound to a single core at creation time and will never move to another core. Also at pipeline creation time, the driver should declare on which core it wants the pipeline to be created. All pipeline operations (mostly iterations through modules) will then be performed by the core the pipeline is bound to.
(partially 2.0 new) A module belonging to a pipeline does not need to be located on the same core as the pipeline, but in that case the pipeline will need to use time-consuming IDC calls to perform any operation on it (start/prepare/pause, etc.). The most optimal setup is therefore to locate the pipeline on the same core as most of the pipeline's modules.
8. Iteration through the modules
The most common operation on modules is iteration through all modules in the system or through a subset of modules – like all modules belonging to a pipeline, all LL modules, etc.
In the current code the order is determined by a sophisticated mechanism based on the way the modules are bound to each other. This is 1) way too complicated, 2) problematic for modules located on different cores, 3) problematic for DP modules, and 4) requires a "direction", which is not a part of IPC4.
(2.0 new) Fortunately, the modules always need to be iterated in the very same order, and this order is well known at topology design time. This order should be passed to the FW during module/pipeline creation and should not be modified later. Module iteration may then be based on a very simple list (one list per core), modified only when a module is created/deleted (the implementation should take extra care to avoid races when the list is modified). In the reference FW there is a requirement that modules are created in the order they need to be iterated (the module created first goes first), and it is up to the driver to create them in the right order. All module iterations – including the LL scheduling order – are then based on this. cSOF should use the same solution – that may require some modifications in the driver.
Backward compatibility: the legacy recursive, bind-graph-based algorithm should be kept for IPC3 and used to create the module iteration list.
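A sketch of the proposed per-core iteration list is shown below. The names are illustrative; the real implementation must also protect the list against concurrent modification:

```c
#include <stddef.h>

/* One entry per created module; creation order defines iteration order. */
struct module_entry {
	struct module_entry *next;
	void *module;		/* opaque handle to the module instance */
};

struct core_module_list {
	struct module_entry *head;	/* first created == first iterated */
	struct module_entry *tail;
};

/* Append at module creation time. */
static void core_list_append(struct core_module_list *list,
			     struct module_entry *e)
{
	e->next = NULL;
	if (list->tail)
		list->tail->next = e;
	else
		list->head = e;
	list->tail = e;
}

/* Iterate all modules bound to this core, e.g. for LL scheduling. */
static void core_list_for_each(struct core_module_list *list,
			       void (*fn)(void *module))
{
	for (struct module_entry *e = list->head; e; e = e->next)
		fn(e->module);
}
```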