Conversation

@lyakh (Collaborator) commented Jun 18, 2021

This replaces the original LL scheduler with a simplified version.

@lyakh requested review from kv2019i and lgirdwood, June 18, 2021 15:15
@lgirdwood (Member) left a comment:

Good stuff, it's a great simplification. It's also obvious where we can improve the external and internal APIs here.
One thing we do need to look at next week: when we add an LL task, we probably also need to add LL-synchronized logic that does the triggering too.

Member:

ack, let's just say -ENOTSUP here, as DMA trace will be using a thread with Zephyr.

Member:

what does this check?

Collaborator (author):

that the task's own period has expired

Member:

why does this matter here? Can you add a comment? I thought we were just using the states to determine what to run or cancel.

Collaborator (author):

this should be removed together with .start and .next_tick in a follow-up, I'll add a comment

Member:

We should really release the spinlock before we run() and then acquire it again after the run_task() returns.

Collaborator (author):

we have to research carefully which contexts can race for which resources here. IIUC, if scheduling can be triggered from interrupt handlers, then the list of tasks has to be protected. In that case we cannot just drop the lock in the middle of this loop. One solution would be what I proposed: first extract all runnable tasks from the global list into a local one while holding the lock, then run the tasks from the local list with no locking. The disadvantage would be that tasks returning RESCHEDULE have to be re-added to the global list, but I don't have a better solution so far.

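For illustration, a rough sketch of that two-list approach (a sketch only: the function and field names are made up, not the actual zephyr_ll.c symbols, though it assumes SOF's existing list and spinlock helpers):

#include <sof/common.h>
#include <sof/list.h>
#include <sof/spinlock.h>
#include <sof/schedule/task.h>
#include <stdint.h>

static void zephyr_ll_run_unlocked(struct zephyr_ll *sch)
{
        struct list_item local_list, *item, *tmp;
        uint32_t flags;

        list_init(&local_list);

        /* move all runnable tasks to a local list while holding the lock */
        spin_lock_irq(&sch->lock, flags);
        list_for_item_safe(item, tmp, &sch->tasks) {
                struct task *task = container_of(item, struct task, list);

                if (task->state == SOF_TASK_STATE_QUEUED) {
                        list_item_del(&task->list);
                        list_item_append(&task->list, &local_list);
                }
        }
        spin_unlock_irq(&sch->lock, flags);

        /* run the extracted tasks with no locking */
        list_for_item_safe(item, tmp, &local_list) {
                struct task *task = container_of(item, struct task, list);
                enum task_state state = task->ops.run(task->data);

                /* the downside: RESCHEDULE tasks go back onto the global list */
                if (state == SOF_TASK_STATE_RESCHEDULE) {
                        spin_lock_irq(&sch->lock, flags);
                        list_item_del(&task->list);
                        list_item_append(&task->list, &sch->tasks);
                        spin_unlock_irq(&sch->lock, flags);
                }
        }
}

Only tasks returning RESCHEDULE would need the lock again, so the run path itself stays lock-free.
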
Member:

I don't think we need to have the concept of a task start time here, as the task will only start at the next LL tick iff it's in the task list.

Collaborator (author):

let's look at this once it's working reliably as is; I'll want to understand this better first.

Member:

any update here? The start time is ignored since the task will run at the next LL tick and the LL tick timing is not changeable.

Member:

I assume this is removed in the subsequent PR?

Collaborator (author):

it will be, yes, adding comments for that

Member:

I think this is probably fine.

@kv2019i (Collaborator) left a comment:

Much easier to follow. Probably the hardest parts relate to places where the "domain" abstraction is used to call into zephyr_domain.c.

Collaborator:

Could we have this on the implementation side, so the public interface would remain the same for both?

Collaborator (author):

I first implemented both keeping the "old" names, but then I thought that having proper namespace consistency in the .c file is better... But we can discuss this.

Collaborator:

ditto

Collaborator:

we could use the Zephyr timer APIs directly

Collaborator (author):

well, they have a granularity of one Zephyr tick, i.e. 50 or 48 kHz

Member:

ack, we need to use the Zephyr APIs

Collaborator (author):

have to change it in zephyr_domain.c too then

Member:

yes please - we need to change it in all Zephyr files.

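For reference, a minimal sketch of what calling the Zephyr timer API directly could look like (a sketch only: the 1 ms period and the function names are assumptions, and the tick-granularity caveat raised above still applies, since the period is rounded to whole Zephyr ticks):

#include <zephyr.h>

static struct k_timer ll_timer;

static void ll_timer_fn(struct k_timer *timer)
{
        /* wake the per-core LL scheduler thread(s) here, e.g. via a semaphore */
}

static void ll_timer_setup(void)
{
        k_timer_init(&ll_timer, ll_timer_fn, NULL);
        /* periodic 1 ms timer; resolution is limited to whole Zephyr ticks */
        k_timer_start(&ll_timer, K_USEC(1000), K_USEC(1000));
}
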
@lyakh marked this pull request as ready for review, June 22, 2021 15:25
@lyakh force-pushed the zll branch 2 times, most recently from d6489b0 to 320fef8, June 22, 2021 16:51
Member:

the comment needs more detail - why are we doing this?

Member:

I guess this loop can be removed in the next PR?

Member:

not following why we need domain->next_tick when Zephyr will manage the tick wake-ups?

Collaborator (author):

let's keep .next_tick and .start in this PR; I'll add a comment that they shouldn't be needed for a fully native Zephyr LL scheduling stack, and let's try to remove them in a follow-up PR.

Member:

I think we have to say that all tasks in this list will be run, regardless of state changes between the state check above and here. I.e. once a task is in this list it is run, but it won't be run next time.

Member:

at this point we are committed to run everything in the list, so this can be simplified.

Collaborator (author):

we know that tasks can take a relatively long time to execute. If you have multiple tasks on the (temporary) scheduler list and you start executing the first of them, then the next one... there's a rather large window in which tasks further down the list can be cancelled. Why not use the chance and avoid running them, saving some execution cycles?

Member:

This is fine, as it simplifies the execution path and we will always be faster than the IPC time window

Member:

can you add a comment on what we are checking here.

Member:

ok, "terminator" is confusing - is this really the Zephyr thread of the waiting thread?

Collaborator (author):

this is the thread that has called task_free(), yes. "Freeer" is ugly, "liberator" is pathetic :-) "freeing_thread" is verbose and clear, but long?

Member:

we should set a time limit here based on a small multiple of the LL tick and shout if we time out.

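For illustration, one possible shape for that bounded wait (a sketch only: the semaphore, the 1 ms LL period and the factor of three are assumptions, not the actual zephyr_ll.c code):

#include <zephyr.h>
#include <sys/printk.h>

#define LL_PERIOD_US	1000	/* assumed 1 ms LL tick */

static int zephyr_ll_wait_for_scheduler(struct k_sem *sem)
{
        /* wait at most a small multiple of the LL tick */
        int ret = k_sem_take(sem, K_USEC(3 * LL_PERIOD_US));

        if (ret < 0)
                /* "shout": the scheduler thread did not respond in time */
                printk("error: LL scheduler wait timed out\n");

        return ret;
}
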
@lyakh (Collaborator, author) commented Jun 25, 2021

The latest update addresses comments and also fixes the Zephyr multicore case (where supported), which got broken with the new scheduler because it didn't account for a self-terminating thread

Member:

we should probably have a check "am I in IRQ context?" and return an error if so.

Collaborator (author):

this is a static internal function, maybe it's better to add such checks to callers, if any of them is uncertain

Member:

why not just check task->state here?

Collaborator (author):

having removed PENDING it can be done, yes

Member:

if this is optional then let's remove this state change here.

Collaborator (author):

ok

Member:

Let's not panic, just complain and return. Panic stops the trace.

Collaborator (author):

ok

Member:

Why is the spinlock not automatically initialized at core boot as part of scheduler init?

Collaborator (author):

because it has to be initialised every time a core submits the first task for scheduling

Member:

This is broken and needs to be fixed.

Collaborator (author):

the reason is that when the last task on a core completes, the LL-scheduling thread on that core terminates while holding the spin-lock. And the spin-lock is a part of the per-core scheduler data, so the next time we start the scheduling thread on that core we re-initialise the spin-lock. So far I don't see a sufficiently clean way to eliminate this... Is it really that bad?

Member:

Yes, it's bad - let's at least put it in a static inline away from the main flow (for easier reading).

Collaborator (author):

you're proposing something like

static inline void zephyr_ll_init_scheduler_for_first_task(struct zephyr_ll *sch)
{
	/* re-initialise the per-core lock when the first task is submitted */
	spinlock_init(&sch->lock);
}

? Yeah, maybe that would self-document it a bit and add a scope to it too.

Member:

Should we not complain here and return an error? Under what circumstances would we schedule a task twice?

@lyakh (Collaborator, author), Jun 28, 2021:

Don't know, this is taken from the original ll-scheduler. We can try to remove it and see what breaks, but I'd rather do it later in a separate PR. Let me add a warning and a comment here.

@lgirdwood (Member) left a comment:

Looks like some of these can do with a squash

Member:

Can we state these cases in the comment.

Member:

Let's make this a macro too.

Member:

Should this be in the next PR?

Collaborator (author):

Actually this data is already used, I'll update the comment

@kv2019i (Collaborator) left a comment:

Another look, the main functions look good. No major new issues found (a few minor ones inline).

Collaborator:

Not sure if we need to fine-tune it now (if next_tick is removed later), but this looks a bit dangerous. Can next_tick be zero and this end up spinning for a longer time?

Collaborator:

The above text for CANCEL is not really describing the state, but rather the transition to cancel.

@lyakh force-pushed the zll branch 2 times, most recently from 4d332df to 5747a9c, July 2, 2021 09:43
@lyakh requested a review from lgirdwood, July 2, 2021 11:55
@lyakh changed the title from "[WiP] zephyr: switch over to a simple priority-based LL scheduler" to "zephyr: switch over to a simple priority-based LL scheduler", Jul 2, 2021
lyakh added 6 commits July 2, 2021 14:05
When registering scheduling domains the period is never used, remove it.

Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
When the .domain_unregister() method is called, .total_num_tasks is
still positive; it will only become 0 for the last task after
.domain_unregister() returns. When cleaning up, also set the user
pointer to NULL.

Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
.next_tick has to be initialised at domain registration and updated
on each scheduling domain event.

Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
zephyr_domain.c is a drop-in replacement for timer_domain.c. To avoid
modifying initialisation code we used the same timer_domain_init()
name for its initialisation function. However, the rest of the file
uses the zephyr_domain_* namespace. Rename the function to stay
within the same namespace and use a macro to redirect the call.

Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
Under Zephyr, LL scheduling is implemented by per-core threads. They
are spawned when the first task on that core is scheduled. When the
last task on that core completes, the thread is terminated, which can
happen in the context of that very thread. This patch adapts the
generic LL scheduler code and the Zephyr LL domain scheduler for that
by making sure to call thread termination in a consistent state.

Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
Currently a global semaphore is used to signal all schedulers on all
cores, waiting for the next timer period. A more reliable solution is
using a per-core semaphore. This patch switches over to that
approach.

Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
Switch SOF under Zephyr to use a simplified native low-latency
scheduler implementation.

Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
@lgirdwood (Member) left a comment:

Ok, as this provides a stable base point we can merge and address the improvements and simplifications in subsequent PRs.
@keqiaozhang will this now let us test in CI?

@lgirdwood merged commit a439ea9 into thesofproject:main, Jul 2, 2021
@lyakh deleted the zll branch, July 2, 2021 13:13
@marc-hb (Collaborator) commented Jul 11, 2021

I have to revert this PR locally because when I re-enable the trace_work() thread in Zephyr (with PR #4452), this new scheduler ignores DMA_TRACE_PERIOD and seems to re-run it as fast as it can. I bisected this to commit "zephyr: ll-schedule: switch over to a simplified implementation". The same commit adds the following error message; could it be relevant?

zephyr_ll_scheduler_init(): unsupported domain 2

This is on APL Up Squared.

@lgirdwood (Member):
@lyakh any inputs?

@lyakh (Collaborator, author) commented Jul 14, 2021

> I have to revert this PR locally because when I re-enable the trace_work() thread in Zephyr (with PR #4452) this new scheduler ignores DMA_TRACE_PERIOD and seems to re-run it as fast as it can. I bisected this to commit zephyr: ll-schedule: switch over to a simplified implementation. The same commit adds the following error message which could be relevant?
>
> zephyr_ll_scheduler_init(): unsupported domain 2
>
> This is on APL Up Squared.

@marc-hb @lgirdwood Right, the current Zephyr LL scheduler version runs all tasks with the same period (1ms by default). We can change that if needed. @lgirdwood proposed to use a counter for that. But I'm also wondering - does the DMA trace task have to be LL or should it rather be EDF?

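For illustration, the counter idea could look roughly like this (a sketch assuming a 1 ms base tick; the struct and function names are made up):

#include <stdbool.h>
#include <stdint.h>

#define LL_TICK_US	1000	/* assumed base LL tick */

struct ll_task_timing {
        uint64_t period_us;	/* e.g. DMA_TRACE_PERIOD for the trace task */
        uint64_t elapsed_us;	/* time accumulated since the last run */
};

/* called once per LL tick for every registered task */
static bool ll_task_period_elapsed(struct ll_task_timing *t)
{
        t->elapsed_us += LL_TICK_US;
        if (t->elapsed_us < t->period_us)
                return false;	/* not due yet, skip this tick */

        t->elapsed_us = 0;
        return true;		/* run the task on this tick */
}

The LL tick handler would then call this per task and skip tasks whose period has not yet elapsed, so a task with a longer period such as the DMA trace would only fire every Nth tick.
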
@marc-hb (Collaborator) commented Jul 14, 2021

> the current Zephyr LL scheduler version runs all tasks with the same period (1ms by default).

There should be a warning or an assert when the period is ignored.

> does the DMA trace task have to be LL or should it rather be EDF?

I don't know why it has to be LL.

@lgirdwood (Member):
Trace should be preemptable; it should be DP (aka EDF).

@marc-hb mentioned this pull request, Jul 16, 2021