zephyr: switch over to a simple priority-based LL scheduler #4377
Conversation
lgirdwood
left a comment
Good stuff, it's a great simplification. It's also obvious where we can improve the external and internal APIs here.
One thing we do need to look at next week: when we add an LL task, we probably also need to add LL-synchronized logic that does the triggering too.
src/schedule/zephyr_ll.c
Outdated
ack, let's just say -ENOTSUP here as DMA trace will be using a thread with Zephyr.
src/schedule/zephyr_ll.c
Outdated
what does this check ?
that the task's own period has expired
why does this matter here, can you add a comment. I thought we are just using the states to determine what to run or cancel.
this should be removed together with .start and .next_tick in a follow-up, I'll add a comment
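For illustration, the per-task period check being discussed is roughly the following sketch. It assumes the struct task layout from sof/schedule/task.h with its uint64_t start field, which, as noted above, is expected to go away together with .next_tick in the follow-up.

#include <stdbool.h>
#include <stdint.h>
#include <sof/schedule/task.h>

/*
 * Sketch: has this task's own period expired? task->start is the
 * absolute time of the task's next scheduled run, carried over from
 * the pre-Zephyr LL scheduler; callers pass the current platform time.
 */
static bool zephyr_ll_task_period_expired(const struct task *task, uint64_t now)
{
        return task->start <= now;
}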
src/schedule/zephyr_ll.c
Outdated
We should really release the spinlock before we run() and then acquire it again after the run_task() returns.
we have to research well what contexts can race for which resources here. IIUC, scheduling can be triggered from interrupt handlers, so the list of tasks has to be protected. In that case we cannot just drop the lock in the middle of this loop. One solution would be what I proposed: first extract all runnable tasks from the global list into a local one while holding the lock, then run the tasks from the local list with no locking this time. The disadvantage would be that tasks returning RESCHEDULE have to be re-added to the global list, but I don't have a better solution so far.
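A minimal sketch of that "extract under the lock, run unlocked" proposal, assuming SOF's list_item helpers and Zephyr's k_spinlock; the struct zephyr_ll fields and the way the task is invoked are illustrative, not the exact code in this PR:

#include <zephyr/kernel.h>
#include <zephyr/spinlock.h>
#include <sof/common.h>
#include <sof/list.h>
#include <sof/schedule/task.h>

/* hypothetical per-core scheduler data, for this sketch only */
struct zephyr_ll {
        struct list_item tasks;         /* list of scheduled tasks */
        struct k_spinlock lock;
};

static void zephyr_ll_run_sketch(struct zephyr_ll *sch)
{
        struct list_item local, *item, *tmp;
        k_spinlock_key_t key;

        list_init(&local);

        /* 1. under the lock, move all runnable tasks to a local list */
        key = k_spin_lock(&sch->lock);
        list_for_item_safe(item, tmp, &sch->tasks) {
                struct task *task = container_of(item, struct task, list);

                if (task->state != SOF_TASK_STATE_QUEUED)
                        continue;

                task->state = SOF_TASK_STATE_RUNNING;
                list_item_del(item);
                list_item_append(item, &local);
        }
        k_spin_unlock(&sch->lock, key);

        /* 2. run the extracted tasks without holding the lock */
        list_for_item_safe(item, tmp, &local) {
                struct task *task = container_of(item, struct task, list);
                enum task_state state = task->ops.run(task->data);

                list_item_del(item);

                /* 3. RESCHEDULE tasks go back onto the global list */
                key = k_spin_lock(&sch->lock);
                if (state == SOF_TASK_STATE_RESCHEDULE) {
                        task->state = SOF_TASK_STATE_QUEUED;
                        list_item_append(item, &sch->tasks);
                } else {
                        task->state = SOF_TASK_STATE_COMPLETED;
                }
                k_spin_unlock(&sch->lock, key);
        }
}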
src/schedule/zephyr_ll.c
Outdated
I don't think we need to have the concept of task start time here, as the task will only start at the next LL tick iff it's in the task list.
let's look at this once it's first working reliably as is, I'll want to understand this better.
any update here ? The start time is ignored since the task will run at next LL tick and LL tick timing is not changeable.
I assume this is removed in the subsequent PR ?
it will be, yes, adding comments for that
src/schedule/zephyr_ll.c
Outdated
I think this is probably fine.
kv2019i
left a comment
Much easier to follow. Probably the hardest parts relate to places where the "domain" abstraction is used to call into zephyr_domain.c.
Could we have this on the implementation side, so the public interface would remain the same for both?
I first implemented both keeping the "old" names, but then I thought that having proper namespace consistency in the .c file is better... But we can discuss this.
ditto
src/schedule/zephyr_ll.c
Outdated
we could use Zephyr timer APIs directly
well, they have 1 Zephyr tick granularity, i.e. 50 or 48 kHz
ack, we need to use the Zephyr APIs
have to change it in zephyr_domain.c too then
yes please - we need to change it in all zephyr files.
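For context, a minimal k_timer sketch showing the tick-granularity limitation mentioned above: the requested period is rounded to whole system ticks (CONFIG_SYS_CLOCK_TICKS_PER_SEC). Names are illustrative and current Zephyr include paths are used.

#include <zephyr/kernel.h>

static struct k_timer ll_tick_timer;

static void ll_tick_fn(struct k_timer *timer)
{
        /* wake the LL scheduling thread here, e.g. via a semaphore */
}

static void ll_tick_timer_start(void)
{
        k_timer_init(&ll_tick_timer, ll_tick_fn, NULL);

        /*
         * The 1 ms period is rounded to whole Zephyr ticks, so the
         * effective resolution is 1 / CONFIG_SYS_CLOCK_TICKS_PER_SEC -
         * the "1 Zephyr tick granularity" limitation discussed above.
         */
        k_timer_start(&ll_tick_timer, K_USEC(1000), K_USEC(1000));
}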
Force-pushed from d6489b0 to 320fef8
src/schedule/zephyr_domain.c
Outdated
comment needs more - why are we doing this ?
I guess this loop can be removed in the next PR ?
src/schedule/zephyr_domain.c
Outdated
not following why we need domain->next_tick when Zephyr will manage the tick wake ups ?
let's keep .next_tick and .start in this PR, I'll add a comment that they shouldn't be needed for a fully native Zephyr LL scheduling stack, and let's try to remove them in a follow up PR.
src/schedule/zephyr_ll.c
Outdated
I think we have to say that all tasks in this list will be run, regardless of state changes between the state check above and here. I.e. once in this list a task is run, but it won't be run next time.
at this point we are committed to run everything in the list, so this can be simplified.
we know that tasks can take a relatively long time to execute. If you have multiple tasks on the (temporary) scheduler list and you start executing the first of them, then the next one... There's a rather large window during which tasks further down the list can be cancelled. Why not use the chance and avoid running them, saving some execution cycles?
This is fine, as it simplifies the execution path and we will always be faster than the IPC time window
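For comparison, the rejected alternative (re-checking the state right before each run) would be a small addition to the unlocked run loop in the earlier locking sketch; field and variable names follow that sketch and are assumptions:

        /* inside the unlocked loop over the local list, before running */
        if (task->state == SOF_TASK_STATE_CANCEL) {
                /* cancelled after the initial state check: skip, don't run */
                list_item_del(item);
                continue;
        }

        state = task->ops.run(task->data);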
src/schedule/zephyr_ll.c
Outdated
can you add a comment on what we are checking here.
ok, "terminator" is confusing - this is really the Zephyr thread of the waiting thread?
this is the thread that has called task_free(), yes. Freeer is ugly, liberator is pathetic :-) freeing_thread is verbose and clear, but long?
src/schedule/zephyr_ll.c
Outdated
we should set a time limit here based on a small multiple of the LL tick and shout if we time out.
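A sketch of the suggested bounded wait; the completion semaphore, the 3x multiplier and the LL_TICK_US value are illustrative assumptions:

#include <errno.h>
#include <zephyr/kernel.h>

#define LL_TICK_US      1000    /* assumed 1 ms LL tick */

static int zephyr_ll_wait_task_completion(struct k_sem *completion)
{
        /* allow a small multiple of the LL tick, then "shout" and bail out */
        if (k_sem_take(completion, K_USEC(3 * LL_TICK_US)) < 0) {
                /* report the timeout via the scheduler's trace context here */
                return -ETIMEDOUT;
        }

        return 0;
}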
The latest update addresses comments and also fixes the Zephyr multicore case (where supported), which got broken with the new scheduler because it didn't account for a self-terminating thread.
src/schedule/zephyr_ll.c
Outdated
we should probably have a check "am I in IRQ context ?" and return an error if so.
this is a static internal function, maybe it's better to add such checks to callers, if any of them is uncertain
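Whether it ends up in this helper or in its callers, the check maps directly onto Zephyr's k_is_in_isr(); a sketch (the error code choice is an assumption):

#include <errno.h>
#include <zephyr/kernel.h>

/* Sketch: refuse to (un)schedule from interrupt context */
static int zephyr_ll_assert_thread_context(void)
{
        if (k_is_in_isr())
                return -EDEADLK;        /* pick whichever error code fits */

        return 0;
}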
src/schedule/zephyr_ll.c
Outdated
why not just check task->state here ?
having removed PENDING it can be done, yes
src/schedule/zephyr_ll.c
Outdated
if this is optional then let's remove this state change here.
ok
src/schedule/zephyr_ll.c
Outdated
Let's not panic, just complain and return. Panic stops the trace.
ok
src/schedule/zephyr_ll.c
Outdated
Why is the spinlock not automatically initialized at core boot as part of scheduler init ?
because it has to be initialised every time a core submits the first task for scheduling
This is broken and needs to be fixed.
the reason is that when the last task on a core completes, the LL-scheduling thread on that core terminates while holding the spin-lock. And the spin-lock is part of the per-core scheduler data, so the next time we start the scheduling thread on that core we re-initialise the spin-lock. So far I don't see a sufficiently clean way to eliminate this... Is it really that bad?
Yes, it's bad - let's at least put it into a static inline away from the main flow (for easier reading)
you're proposing a
static void zephyr_ll_init_scheduler_for_first_task(struct zephyr_ll *sch)
{
        spinlock_init(&sch->lock);
}
? Yeah, maybe that would self-document it a bit and add a scope to it too.
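A possible call site for such a helper, guarded by the per-core task count; the n_tasks field name is an assumption for illustration:

        /*
         * First task on this core since the scheduling thread last
         * terminated (holding the now stale lock): re-initialise the lock.
         */
        if (!sch->n_tasks)
                zephyr_ll_init_scheduler_for_first_task(sch);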
src/schedule/zephyr_ll.c
Outdated
Should we not complain here and return an error ? Under what circumstances would we schedule a task twice ?
Don't know, this is taken from the original ll-scheduler. We can try to remove it and see what breaks, but I'd rather do it later in a separate PR. Let me add a warning and a comment here.
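The agreed warning could look roughly like this; the duplicate-check helper and the ll_tr trace context are assumptions for illustration:

        /*
         * Task is already on the list: warn instead of failing silently,
         * keep the original behaviour until a separate PR revisits it.
         */
        if (zephyr_ll_task_is_scheduled(sch, task)) {
                tr_warn(&ll_tr, "task already scheduled");
                goto unlock;
        }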
lgirdwood
left a comment
Looks like some of these can do with a squash
src/schedule/zephyr_domain.c
Outdated
I guess this loop can be removed in the next PR ?
Can we state these cases in the comment.
src/schedule/zephyr_domain.c
Outdated
Let's make this a macro too.
src/schedule/zephyr_ll.c
Outdated
Should this be in the next PR ?
Actually this data is already used, I'll update the comment
kv2019i
left a comment
Another look, the main functions look good. No major new issues found (a few minor ones inline).
src/schedule/zephyr_domain.c
Outdated
Not sure if we need to fine-tune it now (if next_tick is removed later), but this looks a bit dangerous. Can next_tick be zero and this end up spinning for a longer time?
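One way to address that concern would be to rebase the wake-up time on the current cycle count instead of looping when next_tick is zero or stale; a sketch with illustrative names:

#include <stdint.h>

static uint64_t zephyr_domain_next_tick(uint64_t next_tick, uint64_t now,
                                        uint64_t tick_cycles)
{
        /* zero or stale next_tick: rebase instead of catching up in a loop */
        if (!next_tick || next_tick + tick_cycles < now)
                return now + tick_cycles;

        return next_tick + tick_cycles;
}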
src/schedule/zephyr_ll.c
Outdated
The above text for CANCEL is not really describing the state, but rather the transition to cancel.
Force-pushed from 4d332df to 5747a9c
When registering scheduling domains, the period is never used; remove it. Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
When the .domain_unregister() method is called, .total_num_tasks is still positive; it will only become 0 for the last task after .domain_unregister() returns. When cleaning up, also set the user pointer to NULL. Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
.next_tick has to be initialised at domain registration and updated on each scheduling domain event. Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
zephyr_domain.c is a drop-in replacement for timer_domain.c. To avoid modifying initialisation code we used the same timer_domain_init() name for its initialisation function. However, the rest of the file uses the zephyr_domain_* namespace. Rename the function to stay within the same namespace and use a macro to redirect the call. Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
Under Zephyr LL scheduling is implemented by per-core threads. They are spawned when the first task on that core is scheduled. When the last task on that core is completed, the thread is terminated, which can happen in context of that very thread. This patch adapts the generic LL scheduler code and the Zephyr LL domain scheduler for that by making sure to call thread termination in a consistent state. Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
Currently a global semaphore is used to signal all schedulers on all cores waiting for the next timer period. A more reliable solution is using a per-core semaphore. This patch switches over to that approach. Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
Switch SOF under Zephyr to use a simplified native low-latency scheduler implementation. Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
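A simplified sketch of the per-core semaphore approach described in the commit message above; the structure, field names and CONFIG_MP_NUM_CPUS usage are illustrative, not the exact zephyr_domain.c code:

#include <stdbool.h>
#include <zephyr/kernel.h>

#define LL_NUM_CORES    CONFIG_MP_NUM_CPUS

struct ll_core {
        struct k_sem sem;       /* one semaphore per core */
        bool enabled;           /* core has LL tasks registered */
};

static struct ll_core ll_cores[LL_NUM_CORES];

static void ll_cores_init(void)
{
        int core;

        for (core = 0; core < LL_NUM_CORES; core++)
                k_sem_init(&ll_cores[core].sem, 0, 1);
}

/* timer / domain event handler: wake only the cores that have work */
static void ll_domain_tick(void)
{
        int core;

        for (core = 0; core < LL_NUM_CORES; core++)
                if (ll_cores[core].enabled)
                        k_sem_give(&ll_cores[core].sem);
}

/* body of a per-core LL scheduling thread */
static void ll_thread_fn(void *p1, void *p2, void *p3)
{
        struct ll_core *lc = p1;

        while (true) {
                k_sem_take(&lc->sem, K_FOREVER);
                /* run this core's LL tasks here */
        }
}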
lgirdwood
left a comment
Ok, as this provides a stable base point we can merge and address the improvements and simplifications in subsequent PRs.
@keqiaozhang will this now let us test in CI ?
I have to revert this PR locally because when I re-enable the [...]. This is on APL Up Squared.
@lyakh any inputs ?
@marc-hb @lgirdwood Right, the current Zephyr LL scheduler version runs all tasks with the same period (1ms by default). We can change that if needed. @lgirdwood proposed to use a counter for that. But I'm also wondering - does the DMA trace task have to be LL or should it rather be EDF?
There should be a warning or assert when the period is ignored.
I don't know why it has to be LL.
Trace should be preemptable, it should be DP (aka EDF).
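For reference, the counter idea mentioned above could look roughly like this; the field names are assumptions, and the warning the reviewer asks for would fire when the requested period is not a multiple of the base LL tick:

#include <stdbool.h>
#include <stdint.h>

struct ll_task_timing {
        uint32_t period_ticks;  /* requested period / base LL tick (>= 1) */
        uint32_t tick_count;    /* LL ticks since the task last ran */
};

/* called on every base LL tick; true when the task is due to run */
static bool ll_task_due(struct ll_task_timing *t)
{
        if (++t->tick_count < t->period_ticks)
                return false;

        t->tick_count = 0;
        return true;
}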
This replaces the original LL scheduler with a simplified version.