Conversation

@pblaszko
Contributor

This PR implements SOF logic in the IPC4 handler to enable multicore LL pipelines.
This is a first draft of changes that enables basic multicore scenarios.

Main assumptions:

  • Pipeline and module related IPCs are forwarded to the target core. Their data is accessed only by that core, to avoid cache coherency problems.
  • To allow pipeline allocation by the target core, a core_id field is added to the CreatePipeline IPC struct for IPC4.
  • IPCs are forwarded to the secondary core by IDC in non-blocking mode. The secondary core processes the command and puts the response into the IPC uplink queue.
  • MultiPipelineSetState IPC currently supports only cases where all pipelines in the command are allocated on the same target core.

This design is consistent with the IPC3 handler implementation; a minimal sketch of the forwarding flow is included below.
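
A minimal sketch of the forwarding decision under the assumptions above. The helpers ipc4_get_target_core() and ipc4_handle_locally() are illustrative names, not the actual SOF API; only ipc_process_on_core() is referenced by the commits below, and its exact signature here is also an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical prototypes standing in for the real handler internals. */
int ipc_process_on_core(uint32_t core, bool blocking);
uint32_t ipc4_get_target_core(const void *msg);	/* pipeline/module lookup */
int ipc4_handle_locally(const void *msg);
uint32_t cpu_get_id(void);

/*
 * Primary-core dispatch: handle the IPC in place when it targets the
 * current core, otherwise forward it over IDC in non-blocking mode.
 * The target core then processes the command and queues the response
 * into the IPC uplink queue itself.
 */
static int ipc4_dispatch(const void *msg)
{
	uint32_t core = ipc4_get_target_core(msg);

	if (core == cpu_get_id())
		return ipc4_handle_locally(msg);

	return ipc_process_on_core(core, false);	/* false = non-blocking */
}
```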

Future work to be done:

  • Pipeline and module data allocated on an application heap separated per target core
  • Support for connected pipelines allocated on different cores
  • IDC queue to handle IPC responses and async messages
  • Synchronized start of IPCs via MultiPipelineSetState
  • Support for DP modules in multicore scenarios

@lgirdwood lgirdwood left a comment

@pblaszko good work ! I think we should also upstream an IPC4 multicore LL topology once the kernel part is ready so this will be part of E2E CI. @marc-hb @keqiaozhang fyi.

Member left a comment

@plbossart @ujfalusi - fyi, the driver will need a matching update, which may need to be connected to the core refcount/PM logic?

Collaborator left a comment

and @ranj063 and @bardliao as well. It looks straightforward to add as we do the same thing for IPC3 and the CORE_ID token is defined in topologies.

@kv2019i kv2019i left a comment

Thanks, looks good! And kudos for the very good git commit messages - clear explanations of why each change is made and what has been considered. This speeds up the review a lot.

Collaborator left a comment

Excellent commit message, thanks!

Add wrapper that translates POSIX error codes from generic
ipc_process_on_core function to IPC4 status.

Signed-off-by: Przemyslaw Blaszkowski <przemyslaw.blaszkowski@intel.com>
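
A rough sketch of such a wrapper. The *_SKETCH status names are placeholders (the real values come from the IPC4 headers), the ipc_process_on_core() signature is assumed, and the -EACCES mapping is an assumption rather than the exact error set handled by the patch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder status values; the real ones come from the IPC4 headers. */
enum ipc4_status_sketch {
	IPC4_SUCCESS_SKETCH = 0,
	IPC4_INVALID_CORE_ID_SKETCH,
	IPC4_FAILURE_SKETCH,
};

int ipc_process_on_core(uint32_t core, bool blocking);	/* generic helper */

/* Map the POSIX-style return code onto an IPC4 status. */
static enum ipc4_status_sketch ipc4_process_on_core_sketch(uint32_t core,
							   bool blocking)
{
	int ret = ipc_process_on_core(core, blocking);

	if (ret < 0)
		/* e.g. -EACCES when the requested core is not enabled */
		return ret == -EACCES ? IPC4_INVALID_CORE_ID_SKETCH
				      : IPC4_FAILURE_SKETCH;

	/* non-negative: forwarded; the target core owns the response */
	return IPC4_SUCCESS_SKETCH;
}
```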
When forwarding an IPC to a secondary core in non-blocking mode,
skip the IPC response step in the cmd handler on the primary core.
The response will be prepared by the secondary core.

Signed-off-by: Przemyslaw Blaszkowski <przemyslaw.blaszkowski@intel.com>
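
For illustration, the completion path could look roughly like this; ipc4_send_reply() is a hypothetical name, and the 'forwarded' flag stands in for however the real handler records that the command went to another core.

```c
#include <stdbool.h>

void ipc4_send_reply(int status);	/* hypothetical reply helper */

static void ipc4_cmd_complete(bool forwarded, int status)
{
	if (forwarded)
		return;	/* the secondary core prepares and sends the reply */

	ipc4_send_reply(status);
}
```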
Separate ipc_get_comp_by_ppl_id for IPC3 and IPC4 to allow searching
for the pipeline core in multicore scenarios.

The IPC4 design for multicore LL pipelines assumes all pipeline and
module data is accessed only by the core on which the resource is
allocated. The primary core uses ipc_get_comp_by_ppl_id to check
whether a message should be forwarded to a secondary core.
Currently ipc_get_comp_by_ppl_id calls ipc_comp_pipe_id, which
accesses pipeline data to retrieve the pipeline id. If the pipeline
is allocated on a secondary core, this causes a cache coherency
issue.

In the IPC4 implementation, the ipc_comp_dev.id field of a pipeline
holds the pipeline id coming in the configuration from the driver,
so the pipeline id can be retrieved from the component device
instead of the pipeline data. In the IPC3 implementation,
ipc_comp_dev.id is a unique component id coming in the configuration
and is not equal to the pipeline id, so the IPC3 implementation is
left unchanged.

Signed-off-by: Przemyslaw Blaszkowski <przemyslaw.blaszkowski@intel.com>
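
A sketch of what the IPC4-specific lookup could look like. The stand-in types below only mirror the relevant ipc_comp_dev fields, and the container-of macro is local to the sketch, so treat the body as illustrative rather than the actual patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal stand-ins for the SOF list and IPC component types. */
struct list_item {
	struct list_item *next, *prev;
};

struct ipc_comp_dev {
	uint16_t type;		/* pipeline vs. component vs. buffer */
	uint32_t id;		/* for IPC4 pipelines: the driver's pipeline id */
	uint32_t core;		/* core the resource is allocated on */
	struct list_item list;
};

struct ipc {
	struct list_item comp_list;	/* circular list with head node */
};

#define sketch_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/*
 * IPC4 variant: compare against icd->id directly instead of calling
 * ipc_comp_pipe_id(), which would dereference pipeline data that may
 * be owned (and cached) by another core.
 */
static struct ipc_comp_dev *ipc4_get_comp_by_ppl_id(struct ipc *ipc,
						    uint16_t type,
						    uint32_t ppl_id)
{
	struct list_item *clist;

	for (clist = ipc->comp_list.next; clist != &ipc->comp_list;
	     clist = clist->next) {
		struct ipc_comp_dev *icd =
			sketch_container_of(clist, struct ipc_comp_dev, list);

		if (icd->type == type && icd->id == ppl_id)
			return icd;
	}

	return NULL;
}
```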
Add core_id field into the create_pipeline structure.

In the multicore LL pipeline scenario, all pipeline and module data
should be allocated and accessed only by the target core.

The current IPC4 create_pipeline IPC does not carry information
about the target core, which is required to allocate the pipeline
properly.

Signed-off-by: Przemyslaw Blaszkowski <przemyslaw.blaszkowski@intel.com>
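
For illustration, the extension word of the create-pipeline request could carry the new field roughly like this; the bit positions, widths and neighbouring fields are assumptions, not the actual IPC4 layout.

```c
#include <stdint.h>

/* Illustrative layout only -- bit positions/widths are assumptions. */
union ipc4_pipeline_create_ext_sketch {
	uint32_t dat;
	struct {
		uint32_t lp         : 1;	/* low-power scheduling hint */
		uint32_t rsvd1      : 3;
		uint32_t attributes : 16;
		uint32_t core_id    : 4;	/* new: target core for allocation */
		uint32_t rsvd2      : 8;
	} r;
};
```

The primary core only needs to read core_id before any allocation happens, so the whole request can be forwarded to that core.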
Pass all pipeline and module related IPCs to the target core.
As a result, all pipeline and module data will be accessed only by
the core on which the resource is allocated.
This design helps to avoid cache coherency problems without adding
explicit invalidate/writeback operations.
This implementation of multicore scenarios is consistent with the
IPC3 handler.

Signed-off-by: Przemyslaw Blaszkowski <przemyslaw.blaszkowski@intel.com>
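
As a worked example of the "same target core" restriction from the cover letter, the set-pipeline-state path could resolve its single target core roughly like this. ipc4_pipeline_core() is a hypothetical helper standing in for the ipc_get_comp_by_ppl_id() lookup, and the error values are illustrative.

```c
#include <errno.h>
#include <stdint.h>

struct ipc;

/* Hypothetical helper: core on which the pipeline with this id lives. */
int ipc4_pipeline_core(struct ipc *ipc, uint32_t ppl_id, uint32_t *core);

/*
 * Resolve the single target core for a MultiPipelineSetState request.
 * All pipelines in the command must live on the same core; mixed-core
 * requests are rejected, matching the current limitation.
 */
static int ipc4_ppl_set_state_target_core(struct ipc *ipc,
					  const uint32_t *ppl_ids,
					  uint32_t count, uint32_t *core)
{
	uint32_t i;

	for (i = 0; i < count; i++) {
		uint32_t ppl_core;
		int ret = ipc4_pipeline_core(ipc, ppl_ids[i], &ppl_core);

		if (ret < 0)
			return ret;	/* pipeline not found */

		if (i == 0)
			*core = ppl_core;
		else if (ppl_core != *core)
			return -EINVAL;	/* pipelines span multiple cores */
	}

	return 0;
}
```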
Clean up the duplicated use of ipc_get_comp_by_ppl_id, which
searches for the pipeline component in the components list, in the
set-pipeline-state flow.
Clean up the pipeline IPC component device variable names; calling
them PCM devices is misleading.

Signed-off-by: Przemyslaw Blaszkowski <przemyslaw.blaszkowski@intel.com>
@lgirdwood lgirdwood modified the milestones: ABI-3.25, ABI-4.1 Nov 25, 2022
@lgirdwood
Member

@pblaszko can you check internal CI. Thanks !

@mwasko
Contributor

mwasko commented Nov 25, 2022

SOFCI TEST

@pblaszko
Contributor Author

@pblaszko can you check internal CI. Thanks !

Looks like a .NET exception in the CI tooling, unrelated to FW:
Unhandled Exception: System.BadImageFormatException: Could not load file or assembly 'CoreAudioAPI, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null' or one of its dependencies. An attempt was made to load a program with an incorrect format. ---> System.BadImageFormatException: Could not load file or assembly '175106 bytes loaded from VerifyQuality, Version=1.4.0.0, Culture=neutral, PublicKeyToken=null' or one of its dependencies. An attempt was made to load a program with an incorrect format. ---> System.BadImageFormatException: Bad IL format.

Passed after rebuild.

@plbossart
Member

@pblaszko @lgirdwood can someone explain to me the concept of multi-core LL pipelines?

In the traditional cAVS firmware, the LL modules are handled in a round-robin manner with the pipeline ID used as a priority.
What would be the benefit of a multi-core solution here?

@pblaszko
Contributor Author

pblaszko commented Dec 1, 2022

@pblaszko @lgirdwood can someone explain to me the concept of multi-core LL pipelines?

In the traditional cAVS firmware, the LL modules are handled in a round-robin manner with the pipeline ID used as a priority. What would be the benefit of a multi-core solution here?

@plbossart In cAVS, pipelines have priorities. If the priorities of two pipelines are equal, they are scheduled in the order in which the driver enables them. The scheduler does not use the pipeline ID as a priority; it is the driver's role to start pipelines in the appropriate order. I'm not sure how this question relates to the benefits, though.
The main benefit is performance. On ACE, we have requirements for more and more complicated topologies (WoV + ACA + Ultrasound + heavy playbacks, e.g. 24-bit/192 kHz/4-ch with a reference to AEC, etc.). According to MCPS calculations, we are not able to fit all the required LL modules on the primary core alone.

@plbossart
Member

plbossart commented Dec 1, 2022

@pblaszko WoV/ACA/heavy playback use cases do NOT make sense as LL modules, even less so when such algorithms have buffering or frame alignment requirements (e.g. 10ms for AEC or powers of 2 for time-frequency changes). The low-latency domain should be reserved for lightweight input/output related filters which can work with 1ms data (edit: and have a nearly constant activity pattern without peak MCPS requirements).

It's already been the case with existing SOF that the deep-buffer stuff does not work at all once the buffering becomes large, and using multi-core solutions to work around fundamental LL scheduling limitations is not a very good direction. I'd rather see support for DP modules....

@pblaszko
Contributor Author

pblaszko commented Dec 2, 2022

@pblaszko WoV/ACA/heavy playback use cases do NOT make sense as LL modules, even less so when such algorithms have buffering or frame alignment requirements (e.g. 10ms for AEC or powers of 2 for time-frequency changes). The low-latency domain should be reserved for lightweight input/output related filters which can work with 1ms data (edit: and have a nearly constant activity pattern without peak MCPS requirements).

It's already been the case with existing SOF that the deep-buffer stuff does not work at all once the buffering becomes large, and using multi-core solutions to work-around fundamental LL scheduling limitations is not a very good direction. I'd rather see support for DP modules....

@plbossart DP modules with deeper buffering are a must-have. We are not mixing the LL and DP domains. The point is that the driver should first fully load the primary core and only then allocate further modules on secondary cores (no matter whether they are LL or DP). It is a power optimization: one fully loaded core consumes less power than two medium-loaded ones.
We do not change the low-latency requirement for LL modules; they still need to fit into one system tick. But there are DP modules that must run on the primary core, like WoV/AEC/Ultrasound, to allow the system to enter low-power states. They limit the resources available on the primary core for the LL infrastructure, especially now that we have new requests to support heavy playbacks with quite large data sizes.

@plbossart
Member

OK, so the goal is to use multicore LL pipelines ONLY when core 0 is already loaded with too many DP+LL loads.
The problem I have is that we don't have DP just yet, so we have no real ability to test this scenario... The most urgent priority was DP support IMHO.

@lyakh
Collaborator

lyakh commented Dec 5, 2022

The thing is that driver should firstly fully allocate all modules on primary core and only then allocate next modules on secondary cores (does not matter if it is LL or DP). It is power optimization. One fully loaded core consumes less power than two medium-loaded.

While that makes sense from the power optimisation PoV, this would require dynamic core allocation or even migration. Currently pipelines are assigned to cores in the topology. So if you start just one pipeline, that is assigned to a secondary core, it will run on it, while the primary core will also be kept on for housekeeping. Also consider what happens if you first start several pipelines, which load core 0 completely, then one more pipeline that goes to core 1, then you close all pipelines on core 0. Will you want to migrate the still running pipeline from core 1 to core 0 to save power?

@pblaszko
Contributor Author

pblaszko commented Dec 5, 2022

The thing is that driver should firstly fully allocate all modules on primary core and only then allocate next modules on secondary cores (does not matter if it is LL or DP). It is power optimization. One fully loaded core consumes less power than two medium-loaded.

While that makes sense from the power optimisation PoV, this would require dynamic core allocation or even migration. Currently pipelines are assigned to cores in the topology. So if you start just one pipeline, that is assigned to a secondary core, it will run on it, while the primary core will also be kept on for housekeeping. Also consider what happens if you first start several pipelines, which load core 0 completely, then one more pipeline that goes to core 1, then you close all pipelines on core 0. Will you want to migrate the still running pipeline from core 1 to core 0 to save power?

@lyakh I was trying to explain the rationale for a multicore LL scheduler. Performance optimization is one argument and power is another.
So far I have more experience with Windows OED. Dynamic resource allocation that loads the primary core first is used there. The problems you describe had to be handled already even without a multicore LL scheduler - there could be pipelines with heavy DP modules that need similar core allocation. So far OED also does not support multicore LL, but it will; there is a new requirement.
The scenario with some pipelines running on core 1 while other pipelines are released on core 0 is rather a theoretical case. In practice, core 0 will mostly be used for phrase detection algorithms that are enabled for the whole system lifecycle. Also, I think it is still better to risk some pipeline being left on a secondary core than to always allocate it on a secondary core.
But I understand now that the SOF driver does not have dynamic allocation, so performance is the main goal. As multicore LL was requested by a customer, I assume there are customers whose topologies allocate so many LL modules that they do not fit on one core.
