39 changes: 30 additions & 9 deletions src/idc/zephyr_idc.c
@@ -73,8 +73,7 @@ static void idc_handler(struct k_p4wq_work *work)
int payload = -1;
k_spinlock_key_t key;

/* A message is received from another core, invalidate local cache */
sys_cache_data_invd_range(msg, sizeof(*msg));
__ASSERT_NO_MSG(!is_cached(msg));
Collaborator Author

I could just remove these asserts in the final patch. They're nice to have here in the PR review, but I wonder whether this should just be dropped.

@lyakh any comment, you added these cache ops in the original version?

Collaborator

@kv2019i looks correct

Member

Best to keep and modify any inline comments around cache handling code, i.e. explain why we are checking the pointer type.
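For illustration, a kept-and-updated inline comment might look like this; the wording is a sketch only, assuming the messages now live in uncached memory as the assert implies:

```c
/*
 * IDC messages are expected to live in uncached memory, so no
 * explicit cache invalidate is needed when a message arrives from
 * another core. Assert on the pointer type to catch any cached
 * alias early.
 */
__ASSERT_NO_MSG(!is_cached(msg));
```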


if (msg->size == sizeof(int)) {
const int idc_handler_memcpy_err __unused =
@@ -121,6 +120,32 @@ int idc_send_msg(struct idc_msg *msg, uint32_t mode)
int ret;
int idc_send_memcpy_err __unused;

if (!cpu_is_core_enabled(target_cpu)) {
tr_err(&zephyr_idc_tr, "Core %u is down, cannot send IDC message", target_cpu);
return -EACCES;
}

/*
 * Handler is NULL when the work object has never been submitted.
 * In all other cases, we must use k_p4wq_wait() before reusing
 * the object.
 */
if (work->handler) {
Collaborator

The thing is that, IIUC, this check is unprotected, so two cores can check it at the same time and both proceed to use the work object simultaneously.

Collaborator Author

@lyakh This field is only set once during the object's lifetime. It is NULL initially (as a static object in BSS). When the first message is sent to a core, the handler is set to "handler = idc_handler".

I.e. this check is only to prevent p4wq_wait on an uninitialized work queue object.

I guess I could just skip this and call p4wq_wait() always, but that depends a bit on p4wq internals working as expected (so that the wait returns immediately if the object is fresh out of BSS).
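For illustration, the lifecycle described above boils down to this (a condensed sketch of the diff's logic, not new behaviour):

```c
/* The per-target work object lives in BSS, so it starts zeroed and
 * work->handler stays NULL until the first idc_send_msg() submits it.
 */
if (work->handler) {
	/* Submitted at least once: wait for any previous run to
	 * finish before reusing the object.
	 */
	ret = k_p4wq_wait(work, K_USEC(IDC_TIMEOUT));
	if (ret < 0)
		return ret;
}
```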

Collaborator

@kv2019i Right, yes, sorry, forgot about this. But I still think there's a race possibility here: it's the "check-and-take" operation that should be atomic and safe, right? And here it isn't. You added a wait in line 141, but before we submit a new work item in line 171, it might be unsafe again? Another core might send an IDC to the same target core, and the target core might already be processing it again?
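For illustration only, an atomic check-and-take could look roughly like the following; the idc_work_busy flag and helper names are hypothetical and not part of the PR, and CONFIG_MP_MAX_NUM_CPUS is assumed to be the Zephyr core-count option of this tree:

```c
#include <zephyr/sys/atomic.h>

/* Hypothetical per-target busy flags, zero-initialized in BSS. */
static atomic_t idc_work_busy[CONFIG_MP_MAX_NUM_CPUS];

static int idc_work_take(uint32_t target_cpu)
{
	/* Only one sender can move the flag from 0 to 1; any other
	 * core sees the CAS fail and backs off instead of touching
	 * a work item that is already in flight.
	 */
	if (!atomic_cas(&idc_work_busy[target_cpu], 0, 1))
		return -EBUSY;
	return 0;
}

static void idc_work_release(uint32_t target_cpu)
{
	/* Called from the completion path once the target core is
	 * done with the work item.
	 */
	atomic_clear(&idc_work_busy[target_cpu]);
}
```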

Collaborator

@andyross is my understanding of this P4WQ use-case correct? What's the right way to handle this (see above comment)?

Member

@nashif @ceolin @teburd can anyone help with the P4WQ question above?

Contributor

Just seeing this question now, apologies: the API promises that after p4wq_wait() returns, the item has been executed and the queue itself won't touch it anymore. It doesn't provide any synchronization otherwise; in particular, if this code itself may be executed from more than one core, then it can indeed race on the single work item.

It looks to me like there's one work item instantiated for each target CPU, so on 3+ core systems that seems like a real risk unless there is more synchronization at a higher level. You could fix that by implementing a synchronized pool of such work items (c.f. k_mem_slab), or by having one for each unique sender/receiver pair (i.e. 6 on MTL, 12 on TGL).
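A sketch of the second option, one work item per ordered (sender, receiver) pair; the array and helper names are illustrative, not from the PR:

```c
/* One work item per ordered (sender, receiver) pair, so two cores
 * sending to the same target never share an object. A full NxN
 * array also covers the (unused) diagonal, but keeps indexing
 * trivial.
 */
static struct k_p4wq_work idc_works[CONFIG_MP_MAX_NUM_CPUS][CONFIG_MP_MAX_NUM_CPUS];

static struct k_p4wq_work *idc_work_get(uint32_t target_cpu)
{
	return &idc_works[cpu_get_id()][target_cpu];
}
```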

/*
 * If the new request is in blocking mode, we must call
 * k_p4wq in blocking mode too. This is a workaround for
 * the k_p4wq_wait() interface.
 */
work->sync = (mode == IDC_BLOCKING);

ret = k_p4wq_wait(work, K_USEC(IDC_TIMEOUT));
Collaborator

Don't the other work fields have to be set? E.g. .handler is called unconditionally.

Collaborator Author

@lyakh This is just to influence p4wq_wait() behaviour. We want to sleep on the semaphore (if send_msg is called with BLOCKING flag), but the semaphore is private, so we need to set the work->sync flag to get the same effect.

In the V2 version, I've modified this to set the flag conditionally. If send_msg is called without the BLOCKING flag, there's no need to wait here.

if (ret < 0) {
tr_err(&zephyr_idc_tr, "idc_send_msg error %d, target core %u",
ret, target_cpu);
return ret;
Collaborator

Have we ever seen this race occurring? Between what and what? Non-blocking operations usually return an error when the action cannot be completed immediately. Wouldn't that be better?
In fact, this seems racy too. What if the handler completes between checking for WORK_SUBMITTED and calling k_p4wq_wait()? Then we'll never be woken up?

Collaborator Author

I can hit this on an MTL with a modified multicore tplg. Still trying to find more info on the scenario, as I agree this should not happen normally (and it's rare!). I've now reworked the wait logic to rely only on the p4wq semaphore, so this should be safe against races (and dropped the work_status field).

}
}

idc_send_memcpy_err = memcpy_s(msg_cp, sizeof(*msg_cp), msg, sizeof(*msg));
assert(!idc_send_memcpy_err);
/* Same priority as the IPC thread which is an EDF task and under Zephyr */
@@ -129,10 +154,6 @@ int idc_send_msg(struct idc_msg *msg, uint32_t mode)
work->handler = idc_handler;
work->sync = mode == IDC_BLOCKING;

if (!cpu_is_core_enabled(target_cpu)) {
tr_err(&zephyr_idc_tr, "Core %u is down, cannot send IDC message", target_cpu);
return -EACCES;
}
if (msg->payload) {
idc_send_memcpy_err = memcpy_s(payload->data, sizeof(payload->data),
msg->payload, msg->size);
@@ -145,13 +166,13 @@ int idc_send_msg(struct idc_msg *msg, uint32_t mode)
/* Temporarily store sender core ID */
msg_cp->core = cpu_get_id();

/* Sending a message to another core, write back local message cache */
sys_cache_data_flush_range(msg_cp, sizeof(*msg_cp));
__ASSERT_NO_MSG(!is_cached(msg_cp));

k_p4wq_submit(q_zephyr_idc + target_cpu, work);

switch (mode) {
case IDC_BLOCKING:
ret = k_p4wq_wait(work, K_FOREVER);
ret = k_p4wq_wait(work, K_USEC(IDC_TIMEOUT));
Contributor

I would add a specific message in the log here if ret != 0.
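For example, something along these lines (message wording illustrative, reusing the file's tracing context):

```c
if (ret)
	tr_err(&zephyr_idc_tr,
	       "blocking IDC wait failed: %d, target core %u",
	       ret, target_cpu);
```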

if (!ret)
/* message was sent and executed successfully, get status code */
ret = idc_msg_status_get(msg->core);