idc: zephyr: set of fixes to IDC in multicore scenario #8753
```diff
@@ -73,8 +73,7 @@ static void idc_handler(struct k_p4wq_work *work)
 	int payload = -1;
 	k_spinlock_key_t key;
 
 	/* A message is received from another core, invalidate local cache */
 	sys_cache_data_invd_range(msg, sizeof(*msg));
 	__ASSERT_NO_MSG(!is_cached(msg));
 
 	if (msg->size == sizeof(int)) {
 		const int idc_handler_memcpy_err __unused =
```
```diff
@@ -121,6 +120,32 @@ int idc_send_msg(struct idc_msg *msg, uint32_t mode)
 	int ret;
 	int idc_send_memcpy_err __unused;
 
+	if (!cpu_is_core_enabled(target_cpu)) {
+		tr_err(&zephyr_idc_tr, "Core %u is down, cannot sent IDC message", target_cpu);
+		return -EACCES;
+	}
+
+	/*
+	 * Handler is NULL when work object has never been submitted.
+	 * In all other cases, we must use k_p4wq_wait() before reuse
+	 * of the object.
+	 */
+	if (work->handler) {
```
Collaborator (@lyakh): The thing is that, IIUC, this check is unprotected, so 2 cores can simultaneously check this and proceed to use the …
Author (@kv2019i): @lyakh This field is only set once during the object's lifetime. It is NULL initially (as a static object in BSS). When the first message is sent to a core, the handler is set via "handler = idc_handler". I.e. this check only prevents p4wq_wait() on an uninitialized work queue object. I guess I could just skip this and call p4wq_wait() always, but that depends a bit on p4wq internals working as expected (so that wait returns immediately if the object is fresh out of BSS).
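The guard being discussed can be sketched as below; the struct and helper names are illustrative stand-ins, not the actual `struct k_p4wq_work`/`k_p4wq_wait()` API. The point is that a static object starts life zero-filled in BSS, so a NULL handler marks an object that has never been submitted and therefore needs no completion wait:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct k_p4wq_work */
struct work_item {
	void (*handler)(struct work_item *w);
};

static void dummy_handler(struct work_item *w)
{
	(void)w;
}

/* Returns nonzero if a completion wait is needed before reusing the
 * object: handler is NULL only for a fresh object straight out of BSS,
 * i.e. one that has never been submitted to the work queue. */
static int must_wait_before_reuse(const struct work_item *w)
{
	return w->handler != NULL;
}
```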
Collaborator (@lyakh): @kv2019i Right, yes, sorry, forgot about this. But I still think there's a race possibility here: it's the "check-and-take" operation that should be atomic and safe, right? And here it isn't. You added a wait in line 141, but before we submit a new work item in line 171, it might be unsafe again? Another core might send an IDC to the same target core, and the target core might already be processing it again?
Collaborator: @andyross Is my understanding of this P4WQ use-case correct? What's the right way to handle this (see above comment)?
Contributor (@andyross): Just seeing this question now, apologies. The API promises that after p4wq_wait() returns, the item has been executed and the queue itself won't touch it anymore. It doesn't provide any synchronization otherwise; in particular, if this code itself may be executed from more than one core, then it can indeed race on the single work item. It looks to me like there's one work item instantiated for each target CPU, so on 3+ core systems that seems like a real risk unless there is more synchronization at a higher level. You could fix that by implementing a synchronized pool of such work items (c.f. k_mem_slab), or by having one for each unique sender/receiver pair (i.e. 6 on MTL, 12 on TGL).
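The per-pair scheme suggested above could look roughly like the following sketch; all names (`idc_work`, `MAX_CORES`, `idc_work_get`) are invented for illustration and are not the actual SOF/Zephyr code. With one item per ordered (sender, receiver) pair, two senders targeting the same core never contend for a single object:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CORES 4 /* illustrative core count */

/* Hypothetical stand-in for struct k_p4wq_work */
struct idc_work {
	int pending;
};

/* One work item per ordered (sender, receiver) pair with
 * sender != receiver: MAX_CORES * (MAX_CORES - 1) items in total. */
static struct idc_work idc_works[MAX_CORES][MAX_CORES];

/* Select the item owned exclusively by this sender/receiver pair */
static struct idc_work *idc_work_get(int sender, int receiver)
{
	if (sender < 0 || receiver < 0 ||
	    sender >= MAX_CORES || receiver >= MAX_CORES ||
	    sender == receiver)
		return NULL; /* out of range, or self-IDC */
	return &idc_works[sender][receiver];
}
```

With 3 usable cores this yields the 6 items mentioned for MTL, and with 4 the 12 for TGL.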
```diff
+		/*
+		 * If new request is in blocking mode, we must call
+		 * k_p4wq in blocking mode. This is workaround for
+		 * k_p4wq_wait() interface.
+		 */
+		work->sync = (mode == IDC_BLOCKING);
+
+		ret = k_p4wq_wait(work, K_USEC(IDC_TIMEOUT));
```
kv2019i marked this conversation as resolved.
Collaborator (@lyakh): don't other …
Author (@kv2019i): @lyakh This is just to influence p4wq_wait() behaviour. We want to sleep on the semaphore (if send_msg is called with the BLOCKING flag), but the semaphore is private, so we need to set the work->sync flag to get the same effect. In the V2 version, I've modified this to set the flag conditionally: if send_msg is called without the BLOCKING flag, there is no need to wait here.
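In other words, the caller's blocking mode is simply mirrored into the work item, so that the later wait sleeps only for blocking sends. A minimal sketch (the flag values and struct are illustrative, not the real SOF definitions):

```c
#include <assert.h>
#include <stdint.h>

#define IDC_BLOCKING     1 /* illustrative flag values */
#define IDC_NON_BLOCKING 0

/* Hypothetical stand-in for struct k_p4wq_work: sync selects whether
 * a wait on the item sleeps on its (private) semaphore. */
struct work_sketch {
	int sync;
};

static void set_work_sync(struct work_sketch *w, uint32_t mode)
{
	/* Mirror the caller's blocking mode into the work item */
	w->sync = (mode == IDC_BLOCKING);
}
```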
```diff
+		if (ret < 0) {
+			tr_err(&zephyr_idc_tr, "idc_send_msg error %d, target core %u",
+			       ret, target_cpu);
+			return ret;
```
Collaborator: Have we ever seen this race occurring? Between what and what? Non-blocking operations usually return an error when the action cannot be completed immediately. Wouldn't that be better?
Author (@kv2019i): I can hit this on an MTL with a modified multicore tplg. Still trying to find more info on the scenario, as I agree this should not happen normally (and it's rare!). I have now reworked the wait logic to only rely on the p4wq semaphore, so this should be safe against races (and dropped the work_status field).
```diff
+		}
+	}
 
 	idc_send_memcpy_err = memcpy_s(msg_cp, sizeof(*msg_cp), msg, sizeof(*msg));
 	assert(!idc_send_memcpy_err);
 	/* Same priority as the IPC thread which is an EDF task and under Zephyr */
```
```diff
@@ -129,10 +154,6 @@ int idc_send_msg(struct idc_msg *msg, uint32_t mode)
 	work->handler = idc_handler;
 	work->sync = mode == IDC_BLOCKING;
 
-	if (!cpu_is_core_enabled(target_cpu)) {
-		tr_err(&zephyr_idc_tr, "Core %u is down, cannot sent IDC message", target_cpu);
-		return -EACCES;
-	}
 	if (msg->payload) {
 		idc_send_memcpy_err = memcpy_s(payload->data, sizeof(payload->data),
 					       msg->payload, msg->size);
```
```diff
@@ -145,13 +166,13 @@ int idc_send_msg(struct idc_msg *msg, uint32_t mode)
 	/* Temporarily store sender core ID */
 	msg_cp->core = cpu_get_id();
 
 	/* Sending a message to another core, write back local message cache */
 	sys_cache_data_flush_range(msg_cp, sizeof(*msg_cp));
 	__ASSERT_NO_MSG(!is_cached(msg_cp));
 
 	k_p4wq_submit(q_zephyr_idc + target_cpu, work);
 
 	switch (mode) {
 	case IDC_BLOCKING:
-		ret = k_p4wq_wait(work, K_FOREVER);
+		ret = k_p4wq_wait(work, K_USEC(IDC_TIMEOUT));
```
Contributor: I would add a specific message in the log here if ret != 0.
```diff
 
 		if (!ret)
 			/* message was sent and executed successfully, get status code */
 			ret = idc_msg_status_get(msg->core);
```
Author (@kv2019i): I could just remove these asserts in the final patch. It's nice here in the PR review, but I wonder if this should just be dropped. @lyakh, any comment? You added these cache ops in the original version.
Collaborator (@lyakh): @kv2019i Looks correct.
Collaborator: Best to keep and modify any inline comments around the cache handling code, i.e. why we are checking the pointer type.
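For illustration, the kind of address-window check that `__ASSERT_NO_MSG(!is_cached(msg))` relies on could be sketched as below. The base and size values here are invented; the actual mapping of cached vs. uncached aliases is platform-specific, and SOF uses its own platform helpers for this:

```c
#include <assert.h>
#include <stdint.h>

/* Invented example window for the cached alias of shared SRAM */
#define CACHED_BASE ((uintptr_t)0x9e000000u)
#define CACHED_SIZE ((uintptr_t)0x00400000u)

/* A pointer is "cached" if it falls inside the cached alias window.
 * Shared IDC messages must be accessed through the uncached alias (or
 * with explicit flush/invalidate), hence asserting !is_cached() right
 * after the cache maintenance calls. */
static int is_cached_addr(uintptr_t addr)
{
	return addr >= CACHED_BASE && addr < CACHED_BASE + CACHED_SIZE;
}
```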