@nyh nyh commented Mar 29, 2023

This patch adds "real-time priority" to OSv threads, which can be used to implement Posix's SCHED_FIFO and SCHED_RR scheduling policies.

Refs #386. It doesn't yet "Fixes" it because this patch does not yet add the Posix front-end to this implementation (sched_setscheduler() et al.) - this will be added in a separate patch.

We add to each thread a _realtime._priority, an unsigned integer. Normal (non-real-time) threads have _realtime._priority of 0, and are time-shared fairly as before, while realtime threads have _realtime._priority > 0, and preempt any thread with a lower realtime priority, including all normal threads.

We add a new API, t->set_realtime_priority(priority), to make a thread real-time scheduled. The "priority" is an unsigned int as explained above (0 means normal thread). The resulting scheduling policy is fully compatible with POSIX's SCHED_FIFO. This includes following the intricate rules on what happens to the running thread's place in the queue when there are several real-time threads with the same priority and the running thread yields, waits, or is preempted by a higher-priority thread. This patch also adds an exercise for these cases in tests/misc-scheduler.cc.
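
To illustrate the new API, here is a minimal usage sketch. Only set_realtime_priority() comes from this patch; the wrapper function is hypothetical, and sched::thread::current() is the usual OSv API for the calling thread.

```cpp
#include <osv/sched.hh>

// Hypothetical helper: opt the calling thread into real-time scheduling.
// priority == 0 means a normal, fairly time-shared thread; any value > 0
// makes the thread preempt every thread with a lower real-time priority,
// including all normal threads.
void promote_current_thread_to_realtime(unsigned priority)
{
    sched::thread::current()->set_realtime_priority(priority);
}
```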

This patch does not yet implement time slices as needed for supporting POSIX's SCHED_RR policy. This patch already includes a time-slice setting API, t->set_realtime_time_slice(duration), but if used it produces a warning message, and the scheduler currently ignores this setting. Adding support for real-time time slices is left for a future patch: To implement time slices, we will need to keep in _realtime the total time this thread ran since starting this slice (only zeroed when the thread is removed from the run queue), and each time we switch to a realtime thread with timeslice >= 0, we'll set an appropriate timer (for timeslice minus already used time).
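
As a rough sketch of the time-slice bookkeeping described above (illustration only; the struct and helper names are hypothetical and not part of this patch):

```cpp
#include <chrono>

// Hypothetical per-thread state for a future SCHED_RR-style implementation.
struct realtime_state {
    unsigned _priority = 0;                    // 0 = not a real-time thread
    std::chrono::nanoseconds _time_slice{0};   // requested slice; 0 = none (pure FIFO)
    std::chrono::nanoseconds _run_time{0};     // time run since the slice started,
                                               // zeroed when leaving the run queue
};

// When switching to a real-time thread that has a slice configured, the
// scheduler would arm a preemption timer for whatever remains of the slice.
std::chrono::nanoseconds remaining_slice(const realtime_state& rt)
{
    return rt._time_slice - rt._run_time;
}
```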

I tested that although this patch adds code into the scheduler, it does not noticeably slow down context switches of normal threads. E.g., tests/misc-ctxsw.so measured a colocated context switch time of 308ns before this patch, and exactly the same after it.

While developing this patch I tested it using added workloads in tests/misc-scheduler.cc, included in this patch, but unfortunately this is not a real regression test - it has no assertions and cannot automatically detect failure (and, like all the "misc-*" tests, it doesn't run automatically in "make check").

@wkozaczuk wkozaczuk left a comment

Not sure what specific changes are needed, but two nitpicks are required.

core/sched.cc Outdated
} else if (!called_from_yield) {
auto &t = *runqueue.begin();
if (p->_runtime.get_local() < t._runtime.get_local()) {
auto &t = *runqueue.begin();

@wkozaczuk (Collaborator):
Super nitpick! Please shift one space to the right :-)

@nyh (Author):
I don't know how this happened; I didn't need to change that line at all ;-) Thanks.

core/sched.cc Outdated
return;
#endif
}
} else if (p->_runtime.get_local() < t._runtime.get_local()) {

@wkozaczuk (Collaborator):
Are we handling the case when p->_realtime._priority == 0 && t._realtime._priority > 0? The code below seems to indicate we are not treating this case in any special way; should we? In other words, should _runtime override the priority of threads? What if the priority of t is greater than 0? Should it then preempt p?

@nyh (Author):
Good catch. I should add && t._realtime._priority == 0 to this if (see also thread_runtime_compare()). We should only stay on this thread (which we know is not real-time) if the next thread is also not real-time AND it has a higher runtime.
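
To make the proposed fix concrete, here is a standalone sketch of the corrected test. The field names match the patch, but this helper and its stand-in types are only an illustration; the real check is inline in the scheduler.

```cpp
// Minimal stand-ins for the relevant thread fields, for illustration only.
struct realtime_part { unsigned _priority; };
struct runtime_part  { float _local; float get_local() const { return _local; } };
struct thread_like   { realtime_part _realtime; runtime_part _runtime; };

// p = the currently running (non-real-time) thread, t = head of the run queue.
// Stay on p only if t is also not real-time AND has accumulated more runtime
// than p; otherwise a switch (or the real-time rules) must be considered.
bool keep_running_current(const thread_like& p, const thread_like& t)
{
    return t._realtime._priority == 0 &&
           p._runtime.get_local() < t._runtime.get_local();
}
```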

core/sched.cc Outdated
#endif
}
if (t._realtime._priority == p->_realtime._priority &&
!called_from_yield) {

@wkozaczuk (Collaborator):
I think the !called_from_yield is unnecessary (see line 291/292).

@nyh (Author):
You're right, and it's even worse - I think the if on lines 291/292 is not the right thing; I think I have a bug. According to https://man7.org/linux/man-pages/man2/sched_yield.2.html, "If the calling thread is the only thread in the highest priority list at that time, it will continue to run after a call to sched_yield()." So the code wrongly switches to some other thread even for non-real-time priorities. I'll need to fix this.
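
As a toy model of the sched_yield() rule quoted above (not OSv code): the yielding thread goes to the tail of the run list for its own priority, so it keeps the CPU only when no other thread is queued at that priority.

```cpp
#include <deque>

// Returns the id of the thread that runs after `current` yields, given the
// queue of runnable threads at the same priority.
int after_yield(int current, std::deque<int>& same_priority_queue)
{
    if (same_priority_queue.empty()) {
        return current;                          // only thread at this priority: keep running
    }
    same_priority_queue.push_back(current);      // caller moves to the tail of the list
    int next = same_priority_queue.front();      // the head of the list runs next
    same_priority_queue.pop_front();
    return next;
}
```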

core/sched.cc Outdated
if (p->_realtime._priority > 0) {
// Only switch to a higher-priority realtime thread, or to
// the next one with equal priority if yielding.
if (t._realtime._priority < p->_realtime._priority) {

@wkozaczuk (Collaborator) Mar 30, 2023:
This means that if somehow the current thread p has a high enough priority (for example 99) and all other threads in the run queue have a lower priority, this thread p will keep running forever; not even kernel threads will get a chance to do anything. Am I wrong in my analysis? Is this by design?
I am especially concerned about the kernel threads in this context. I am not sure if any of the thoughts raised in this article apply - https://lwn.net/Articles/818388/.

@nyh (Author):
Yes, it is by design that a real-time thread that never yields is never preempted by any thread with a lower real-time priority or a non-real-time priority.
However, the question about kernel threads is a good one: in Linux/Unix, the things that happen in the kernel weren't called "threads", so these priorities didn't apply to them. Here this concerns in particular cpu::load_balance(), which, if starved, can't even move the poor threads to other CPUs.

I think the idea is that real-time threads should be short-running - and if a real-time thread just starts an endless loop that never returns, it will cause problems on Linux too.

Maybe the right thing is to give cpu::load_balance() (and anything else?) some sort of special priority that allows it to run even while a real-time thread is running. I don't know. I think we can improve this later.

core/sched.cc Outdated
return;
#endif
}
if (t._realtime._priority == p->_realtime._priority &&

@wkozaczuk (Collaborator):
Is this condition right? If the priority of t is equal to p's, then we keep p running. Is this correct?

This patch adds "real-time priority" to OSv threads, which can be used
to implement Posix's SCHED_FIFO and SCHED_RR scheduling policies.

Refs cloudius-systems#386. It doesn't yet "Fixes" it because this patch does not yet add
the Posix front-end to this implementation (sched_setscheduler() et al.) -
this will be added in a separate patch.

We add to each thread a _realtime._priority, an unsigned integer.
Normal (non-real-time) threads have _realtime._priority of 0, and are
time-shared fairly as before, while realtime threads have
_realtime._priority > 0, and preempt any thread with a lower realtime
priority, including all normal threads.

We add a new API, t->set_realtime_priority(priority), to make a thread
real-time scheduled. The "priority" is an unsigned int as explained above
(0 means normal thread). The resulting scheduling policy is fully
compatible with POSIX's SCHED_FIFO. This includes following the intricate
rules on what happens to the running thread's place in the queue when there
are several real-time threads with the same priority and the running
thread yields, waits, or is preempted by a higher-priority thread.
This patch also adds an exercise for these cases in tests/misc-scheduler.cc.

This patch does not yet implement time slices as needed for supporting
POSIX's SCHED_RR policy. This patch already includes a time-slice setting
API, t->set_realtime_time_slice(duration), but if used it produces a
warning message, and the scheduler currently ignores this setting.
Adding support for real-time time slices is left for a future patch:
To implement time slices, we will need to keep in _realtime the total
time this thread ran since starting this slice (only zeroed when the
thread is removed from the run queue), and each time we switch to a
realtime thread with timeslice >= 0, we'll set an appropriate timer
(for timeslice minus already used time).

I tested that although this patch adds code into the scheduler it does
not noticably slow down context switches of normal threads. E.g.,
tests/misc-ctxsw.so measured colocated context switch time of 240ns
before this patch, and exactly the same after it.

While developing this patch I tested it using added workloads in
tests/misc-scheduler.cc, included in this patch, but unfortunately
this is not a real regression test - it has no assertions and cannot
automatically detect failure (and, like all the "misc-*" tests, it
doesn't run automatically in "make check").

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
nyh commented Apr 17, 2023

Pushed a new version with the bugs that @wkozaczuk noticed (thanks!) fixed.

While fixing these bugs, I decided the scheduler is written in an overly complicated way - e.g., it's ugly and confusing that we have two separate places in the context-switch code which re-queue the outgoing thread. But unfortunately, after doing this cleanup the resulting scheduler's context switches were much slower - and I couldn't figure out why even though I spent several hours trying to fix it :-( So the version I pushed now just fixes the bugs that Waldek found, without trying to improve the understandability of the code.

nyh commented Apr 17, 2023

By the way, @wkozaczuk, the Travis CI fails saying that "Error: Cannot perform an interactive login from a non TTY device" when trying to pass docker a password from a TTY. Do you have any idea where/how to fix it?

@wkozaczuk (Collaborator):
@nyh I believe I have fixed the issue with the Travis build.

@wkozaczuk (Collaborator):
Though, honestly, I am not sure if it is actually building and testing your pull request changes or what is on master.

@wkozaczuk (Collaborator):
I think your changes look good and the unit tests pass on x64. But I would like to run more tests involving various apps (see https://github.com/cloudius-systems/osv/wiki/Automated-Testing-Framework) before I merge it.

nyh commented Apr 23, 2023

Definitely, no hurry, it's been waiting for half a decade already 😂
We can also wait for the followup patches and benchmarking that started this renewed interest.

@wkozaczuk (Collaborator):
I have just committed and pushed this MR into the master branch. I meant to do it properly using the GitHub UI but somehow accidentally did it by manually applying the patch version of it (a side effect of local testing :-)). So I am closing it manually.

Anyhow, I have run quite extensive tests, including the ones driven by https://github.com/cloudius-systems/osv/wiki/Automated-Testing-Framework and others from tests/misc-*. Everything seems to work fine, including aarch64. In my case, I did see a small bump in the colocated context switch time (2-3ns with ROFS images and 5-10ns with ZFS images - more threads?). But I think it is negligible.

Closed with 894ceb2

@wkozaczuk wkozaczuk closed this May 31, 2023
nmeum added a commit to nmeum/osv that referenced this pull request Nov 11, 2024

Previously, pull request cloudius-systems#1223 added support for SCHED_FIFO but
didn't implement SCHED_RR. This PR attempts to follow up on that by
proposing an implementation of SCHED_RR.

Time slicing support, as mandated by SCHED_RR, is implemented through
the `set_realtime_time_slice(duration)` API added in the aforementioned
pull request. Within the scheduler, the amount of time the thread has run
so far is tracked; if it exceeds the set duration, the thread is
preempted (if there is a runnable thread of the same or higher priority).
The thread's run time is reset on preemption.

Signed-off-by: Sören Tempel <tempel@ibr.cs.tu-bs.de>