Conversation

@lyakh (Collaborator) commented Jun 18, 2021

This replaces the original LL scheduler with a simplified version.

@lyakh requested review from kv2019i and lgirdwood, June 18, 2021 15:15
@lgirdwood (Member) left a comment:

Good stuff, it's a great simplification. It's also obvious where we can improve the external and internal APIs here.
One thing we do need to look at next week: when we add an LL task, we probably also need to add LL-synchronized logic that does the triggering too.

Member:

ack, let's just say -ENOTSUP here, as DMA trace will be using a thread with Zephyr.

Member:

what does this check?

Collaborator (author):

that the task's own period has expired

Member:

why does this matter here? Can you add a comment? I thought we were just using the states to determine what to run or cancel.

Collaborator (author):

this should be removed together with .start and .next_tick in a follow-up, I'll add a comment

Member:

We should really release the spinlock before we run() and then acquire it again after the run_task() returns.

Collaborator (author):

we have to research carefully which contexts can race for which resources here. IIUC, if scheduling can be triggered from interrupt handlers, then the list of tasks has to be protected. In that case we cannot just drop the lock in the middle of this loop. One solution would be what I proposed: first extract all runnable tasks from the global list into a local one while holding the lock, then run the tasks from the local list with no locking. The disadvantage would be that tasks returning RESCHEDULE have to be re-added to the global list, but I don't have a better solution so far.

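For illustration, a rough sketch of that two-list approach (a sketch only: the function and field names are made up, not the actual zephyr_ll.c symbols, though it assumes SOF's existing list and spinlock helpers):

#include <sof/common.h>
#include <sof/list.h>
#include <sof/spinlock.h>
#include <sof/schedule/task.h>
#include <stdint.h>

static void zephyr_ll_run_unlocked(struct zephyr_ll *sch)
{
        struct list_item local_list, *item, *tmp;
        uint32_t flags;

        list_init(&local_list);

        /* move all runnable tasks to a local list while holding the lock */
        spin_lock_irq(&sch->lock, flags);
        list_for_item_safe(item, tmp, &sch->tasks) {
                struct task *task = container_of(item, struct task, list);

                if (task->state == SOF_TASK_STATE_QUEUED) {
                        list_item_del(&task->list);
                        list_item_append(&task->list, &local_list);
                }
        }
        spin_unlock_irq(&sch->lock, flags);

        /* run the extracted tasks with no locking */
        list_for_item_safe(item, tmp, &local_list) {
                struct task *task = container_of(item, struct task, list);
                enum task_state state = task->ops.run(task->data);

                /* the downside: RESCHEDULE tasks go back onto the global list */
                if (state == SOF_TASK_STATE_RESCHEDULE) {
                        spin_lock_irq(&sch->lock, flags);
                        list_item_del(&task->list);
                        list_item_append(&task->list, &sch->tasks);
                        spin_unlock_irq(&sch->lock, flags);
                }
        }
}

Only tasks returning RESCHEDULE would need the lock again, so the run path itself stays lock-free.
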
Member:

I don't think we need to have the concept of a task start time here, as the task will only start at the next LL tick iff it's in the task list.

Collaborator (author):

let's look at this once it's working reliably as is; I'll want to understand this better first.

Member:

any update here? The start time is ignored since the task will run at the next LL tick and the LL tick timing is not changeable.

Member:

I assume this is removed in the subsequent PR?

Collaborator (author):

it will be, yes, adding comments for that

Member:

I think this is probably fine.

@kv2019i (Collaborator) left a comment:

Much easier to follow. Probably the hardest parts relate to places where the "domain" abstraction is used to call into zephyr_domain.c.

Collaborator:

Could we have this on the implementation side, so the public interface would remain the same for both?

Collaborator (author):

I first implemented both keeping the "old" names, but then I thought that having proper namespace consistency in the .c file is better... But we can discuss this.

Collaborator:

ditto

Collaborator:

we could use the Zephyr timer APIs directly

Collaborator (author):

well, they have a granularity of one Zephyr tick, i.e. 50 or 48 kHz

Member:

ack, we need to use the Zephyr APIs

Collaborator (author):

have to change it in zephyr_domain.c too then

Member:

yes please - we need to change it in all Zephyr files.

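For reference, a minimal sketch of what calling the Zephyr timer API directly could look like (a sketch only: the 1 ms period and the function names are assumptions, and the tick-granularity caveat raised above still applies, since the period is rounded to whole Zephyr ticks):

#include <zephyr.h>

static struct k_timer ll_timer;

static void ll_timer_fn(struct k_timer *timer)
{
        /* wake the per-core LL scheduler thread(s) here, e.g. via a semaphore */
}

static void ll_timer_setup(void)
{
        k_timer_init(&ll_timer, ll_timer_fn, NULL);
        /* periodic 1 ms timer; resolution is limited to whole Zephyr ticks */
        k_timer_start(&ll_timer, K_USEC(1000), K_USEC(1000));
}
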
@lyakh marked this pull request as ready for review, June 22, 2021 15:25
@lyakh force-pushed the zll branch 2 times, most recently from d6489b0 to 320fef8, June 22, 2021 16:51
Member:

the comment needs more detail - why are we doing this?

Member:

I guess this loop can be removed in the next PR?

Member:

not following why we need domain->next_tick when Zephyr will manage the tick wake-ups?

Collaborator (author):

let's keep .next_tick and .start in this PR; I'll add a comment that they shouldn't be needed for a fully native Zephyr LL scheduling stack, and let's try to remove them in a follow-up PR.

Member:

I think we have to say that all tasks in this list will be run, regardless of state changes between the state check above and here. I.e. once a task is in this list it is run, but it won't be run next time.

Member:

at this point we are committed to run everything in the list, so this can be simplified.

Collaborator (author):

we know that tasks can take a relatively long time to execute. If you have multiple tasks on the (temporary) scheduler list and you start executing the first of them, then the next one... there's a rather large window in which tasks further down the list can be cancelled. Why not use the chance and avoid running them, saving some execution cycles?

Member:

This is fine, as it simplifies the execution path and we will always be faster than the IPC time window

Member:

can you add a comment on what we are checking here.

Member:

ok, "terminator" is confusing - is this really the Zephyr thread of the waiting thread?

Collaborator (author):

this is the thread that has called task_free(), yes. "Freeer" is ugly, "liberator" is pathetic :-) "freeing_thread" is verbose and clear, but long?

Member:

we should set a time limit here based on a small multiple of the LL tick and shout if we time out.

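For illustration, one possible shape for that bounded wait (a sketch only: the semaphore, the 1 ms LL period and the factor of three are assumptions, not the actual zephyr_ll.c code):

#include <zephyr.h>
#include <sys/printk.h>

#define LL_PERIOD_US	1000	/* assumed 1 ms LL tick */

static int zephyr_ll_wait_for_scheduler(struct k_sem *sem)
{
        /* wait at most a small multiple of the LL tick */
        int ret = k_sem_take(sem, K_USEC(3 * LL_PERIOD_US));

        if (ret < 0)
                /* "shout": the scheduler thread did not respond in time */
                printk("error: LL scheduler wait timed out\n");

        return ret;
}
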
@lyakh (Collaborator, author) commented Jun 25, 2021

The latest update addresses comments and also fixes the Zephyr multicore case (where supported), which got broken with the new scheduler because it didn't account for a self-terminating thread

Member:

we should probably have a check "am I in IRQ context?" and return an error if so.

Collaborator (author):

this is a static internal function, maybe it's better to add such checks to callers, if any of them is uncertain

Member:

why not just check task->state here?

Collaborator (author):

having removed PENDING it can be done, yes

Member:

if this is optional then let's remove this state change here.

Collaborator (author):

ok

Member:

Let's not panic, just complain and return. Panic stops the trace.

Collaborator (author):

ok

Member:

Why is the spinlock not automatically initialized at core boot as part of scheduler init?

Collaborator (author):

because it has to be initialised every time a core submits the first task for scheduling

Member:

This is broken and needs to be fixed.

Collaborator (author):

the reason is that when the last task on a core completes, the LL-scheduling thread on that core terminates while holding the spin-lock. And the spin-lock is a part of the per-core scheduler data, so the next time we start the scheduling thread on that core we re-initialise the spin-lock. So far I don't see a sufficiently clean way to eliminate this... Is it really that bad?

Member:

Yes, it's bad - let's at least put it in a static inline away from the main flow (for easier reading).

Collaborator (author):

you're proposing something like

static inline void zephyr_ll_init_scheduler_for_first_task(struct zephyr_ll *sch)
{
	/* re-initialise the per-core lock when the first task is submitted */
	spinlock_init(&sch->lock);
}

? Yeah, maybe that would self-document it a bit and add a scope to it too.

Member:

Should we not complain here and return an error? Under what circumstances would we schedule a task twice?

@lyakh (Collaborator, author), Jun 28, 2021:

Don't know, this is taken from the original ll-scheduler. We can try to remove it and see what breaks, but I'd rather do it later in a separate PR. Let me add a warning and a comment here.

@lgirdwood (Member) left a comment:

Looks like some of these can do with a squash

Member:

Can we state these cases in the comment.

Member:

Let's make this a macro too.

Member:

Should this be in the next PR?

Collaborator (author):

Actually this data is already used, I'll update the comment

@kv2019i (Collaborator) left a comment:

Another look, the main functions look good. No major new issues found (a few minor ones inline).

Collaborator:

Not sure if we need to fine-tune it now (if next_tick is removed later), but this looks a bit dangerous. Can next_tick be zero and this end up spinning for a longer time?

Collaborator:

The above text for CANCEL is not really describing the state, but rather the transition to cancel.

@lyakh force-pushed the zll branch 2 times, most recently from 4d332df to 5747a9c, July 2, 2021 09:43
@lyakh requested a review from lgirdwood, July 2, 2021 11:55
@lyakh changed the title from "[WiP] zephyr: switch over to a simple priority-based LL scheduler" to "zephyr: switch over to a simple priority-based LL scheduler", Jul 2, 2021
lyakh added 6 commits July 2, 2021 14:05
When registering scheduling domains the period is never used, remove it.

Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
When the .domain_unregister() method is called, .total_num_tasks is
still positive; it will only become 0 for the last task after
.domain_unregister() returns. When cleaning up, also set the user
pointer to NULL.

Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
.next_tick has to be initialised at domain registration and updated
on each scheduling domain event.

Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
zephyr_domain.c is a drop-in replacement for timer_domain.c. To avoid
modifying initialisation code we used the same timer_domain_init()
name for its initialisation function. However, the rest of the file
uses the zephyr_domain_* namespace. Rename the function to stay
within the same namespace and use a macro to redirect the call.

Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
Under Zephyr, LL scheduling is implemented by per-core threads. They
are spawned when the first task on that core is scheduled. When the
last task on that core completes, the thread is terminated, which can
happen in the context of that very thread. This patch adapts the
generic LL scheduler code and the Zephyr LL domain scheduler for that
by making sure to call thread termination in a consistent state.

Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
Currently a global semaphore is used to signal all schedulers on all
cores, waiting for the next timer period. A more reliable solution is
using a per-core semaphore. This patch switches over to that
approach.

Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
Switch SOF under Zephyr to use a simplified native low-latency
scheduler implementation.

Signed-off-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
@lgirdwood (Member) left a comment:

Ok, as this provides a stable base point we can merge and address the improvements and simplifications in subsequent PRs.
@keqiaozhang will this now let us test in CI?

@lgirdwood merged commit a439ea9 into thesofproject:main, Jul 2, 2021
@lyakh deleted the zll branch, July 2, 2021 13:13
@marc-hb (Collaborator) commented Jul 11, 2021

I have to revert this PR locally because when I re-enable the trace_work() thread in Zephyr (with PR #4452), this new scheduler ignores DMA_TRACE_PERIOD and seems to re-run it as fast as it can. I bisected this to commit "zephyr: ll-schedule: switch over to a simplified implementation". The same commit adds the following error message; could it be relevant?

zephyr_ll_scheduler_init(): unsupported domain 2

This is on APL Up Squared.

@lgirdwood (Member):
@lyakh any inputs?

@lyakh (Collaborator, author) commented Jul 14, 2021

> I have to revert this PR locally because when I re-enable the trace_work() thread in Zephyr (with PR #4452) this new scheduler ignores DMA_TRACE_PERIOD and seems to re-run it as fast as it can. I bisected this to commit zephyr: ll-schedule: switch over to a simplified implementation. The same commit adds the following error message which could be relevant?
>
> zephyr_ll_scheduler_init(): unsupported domain 2
>
> This is on APL Up Squared.

@marc-hb @lgirdwood Right, the current Zephyr LL scheduler version runs all tasks with the same period (1ms by default). We can change that if needed. @lgirdwood proposed to use a counter for that. But I'm also wondering - does the DMA trace task have to be LL or should it rather be EDF?

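For illustration, the counter idea could look roughly like this (a sketch assuming a 1 ms base tick; the struct and function names are made up):

#include <stdbool.h>
#include <stdint.h>

#define LL_TICK_US	1000	/* assumed base LL tick */

struct ll_task_timing {
        uint64_t period_us;	/* e.g. DMA_TRACE_PERIOD for the trace task */
        uint64_t elapsed_us;	/* time accumulated since the last run */
};

/* called once per LL tick for every registered task */
static bool ll_task_period_elapsed(struct ll_task_timing *t)
{
        t->elapsed_us += LL_TICK_US;
        if (t->elapsed_us < t->period_us)
                return false;	/* not due yet, skip this tick */

        t->elapsed_us = 0;
        return true;		/* run the task on this tick */
}

The LL tick handler would then call this per task and skip tasks whose period has not yet elapsed, so a task with a longer period such as the DMA trace would only fire every Nth tick.
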
@marc-hb (Collaborator) commented Jul 14, 2021

> the current Zephyr LL scheduler version runs all tasks with the same period (1ms by default).

There should be a warning or an assert when the period is ignored.

> does the DMA trace task have to be LL or should it rather be EDF?

I don't know why it has to be LL.

@lgirdwood (Member):
Trace should be preemptable; it should be DP (aka EDF).

@marc-hb mentioned this pull request, Jul 16, 2021