@nyh nyh commented Mar 29, 2023

This patch adds "real-time priority" to OSv threads, which can be used to implement Posix's SCHED_FIFO and SCHED_RR scheduling policies.

Refs #386. It doesn't yet "Fixes" it because this patch does not yet add the Posix front-end to this implementation (sched_setscheduler() et al.) - this will be added in a separate patch.

We add to each thread a _realtime._priority, an unsigned integer. Normal (non-real-time) threads have _realtime._priority of 0, and are time-shared fairly as before, while realtime threads have _realtime._priority > 0, and preempt any thread with a lower realtime priority, including all normal threads.

We add a new API, t->set_realtime_priority(priority), to make a thread real-time scheduled. The "priority" is an unsigned int as explained above (0 means normal thread). The resulting scheduling policy is fully compatible with POSIX's SCHED_FIFO. This includes following the intricate rules on what happens to the running thread's place in the queue when there are several real-time threads with the same priority and the running thread yields, waits, or is preempted by a higher-priority thread. This patch also adds an exercise for these cases in tests/misc-scheduler.cc.
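
To illustrate the new API, here is a minimal usage sketch. Only set_realtime_priority() comes from this patch; the wrapper function is hypothetical, and sched::thread::current() is the usual OSv API for the calling thread.

```cpp
#include <osv/sched.hh>

// Hypothetical helper: opt the calling thread into real-time scheduling.
// priority == 0 means a normal, fairly time-shared thread; any value > 0
// makes the thread preempt every thread with a lower real-time priority,
// including all normal threads.
void promote_current_thread_to_realtime(unsigned priority)
{
    sched::thread::current()->set_realtime_priority(priority);
}
```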

This patch does not yet implement time slices as needed for supporting POSIX's SCHED_RR policy. This patch already includes a time-slice setting API, t->set_realtime_time_slice(duration), but if used it produces a warning message, and the scheduler currently ignores this setting. Adding support for real-time time slices is left for a future patch: To implement time slices, we will need to keep in _realtime the total time this thread ran since starting this slice (only zeroed when the thread is removed from the run queue), and each time we switch to a realtime thread with timeslice >= 0, we'll set an appropriate timer (for timeslice minus already used time).
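
As a rough sketch of the time-slice bookkeeping described above (illustration only; the struct and helper names are hypothetical and not part of this patch):

```cpp
#include <chrono>

// Hypothetical per-thread state for a future SCHED_RR-style implementation.
struct realtime_state {
    unsigned _priority = 0;                    // 0 = not a real-time thread
    std::chrono::nanoseconds _time_slice{0};   // requested slice; 0 = none (pure FIFO)
    std::chrono::nanoseconds _run_time{0};     // time run since the slice started,
                                               // zeroed when leaving the run queue
};

// When switching to a real-time thread that has a slice configured, the
// scheduler would arm a preemption timer for whatever remains of the slice.
std::chrono::nanoseconds remaining_slice(const realtime_state& rt)
{
    return rt._time_slice - rt._run_time;
}
```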

I tested that although this patch adds code into the scheduler, it does not noticeably slow down context switches of normal threads. E.g., tests/misc-ctxsw.so measured a colocated context switch time of 308ns before this patch, and exactly the same after it.

While developing this patch I tested it using added workloads in tests/misc-scheduler.cc, included in this patch, but unfortunately this is not a real regression test - it has no assertions and cannot automatically detect failure (and, like all the "misc-*" tests, it doesn't run automatically in "make check").

@wkozaczuk wkozaczuk left a comment

Not sure what specific changes are needed, but two nitpicks are required.

core/sched.cc Outdated
} else if (!called_from_yield) {
auto &t = *runqueue.begin();
if (p->_runtime.get_local() < t._runtime.get_local()) {
auto &t = *runqueue.begin();

@wkozaczuk (Collaborator):
Super nitpick! Please shift one space to the right :-)

@nyh (Author):
I don't know how this happened; I didn't need to change that line at all ;-) Thanks.

core/sched.cc Outdated
return;
#endif
}
} else if (p->_runtime.get_local() < t._runtime.get_local()) {

@wkozaczuk (Collaborator):
Are we handling the case when p->_realtime._priority == 0 && t._realtime._priority > 0? The code below seems to indicate we are not treating this case in any special way; should we? In other words, should _runtime override the priority of threads? What if the priority of t is greater than 0? Should it then preempt p?

@nyh (Author):
Good catch. I should add && t._realtime._priority == 0 to this if (see also thread_runtime_compare()). We should only stay on this thread (which we know is not real-time) if the next thread is also not real-time AND it has a higher runtime.
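
To make the proposed fix concrete, here is a standalone sketch of the corrected test. The field names match the patch, but this helper and its stand-in types are only an illustration; the real check is inline in the scheduler.

```cpp
// Minimal stand-ins for the relevant thread fields, for illustration only.
struct realtime_part { unsigned _priority; };
struct runtime_part  { float _local; float get_local() const { return _local; } };
struct thread_like   { realtime_part _realtime; runtime_part _runtime; };

// p = the currently running (non-real-time) thread, t = head of the run queue.
// Stay on p only if t is also not real-time AND has accumulated more runtime
// than p; otherwise a switch (or the real-time rules) must be considered.
bool keep_running_current(const thread_like& p, const thread_like& t)
{
    return t._realtime._priority == 0 &&
           p._runtime.get_local() < t._runtime.get_local();
}
```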

core/sched.cc Outdated
#endif
}
if (t._realtime._priority == p->_realtime._priority &&
!called_from_yield) {

@wkozaczuk (Collaborator):
I think the !called_from_yield is unnecessary (see line 291/292).

@nyh (Author):
You're right, and it's even worse - I think the if on lines 291/292 is not the right thing; I think I have a bug. According to https://man7.org/linux/man-pages/man2/sched_yield.2.html, "If the calling thread is the only thread in the highest priority list at that time, it will continue to run after a call to sched_yield()." So the code wrongly switches to some other thread even for non-real-time priorities. I'll need to fix this.
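
As a toy model of the sched_yield() rule quoted above (not OSv code): the yielding thread goes to the tail of the run list for its own priority, so it keeps the CPU only when no other thread is queued at that priority.

```cpp
#include <deque>

// Returns the id of the thread that runs after `current` yields, given the
// queue of runnable threads at the same priority.
int after_yield(int current, std::deque<int>& same_priority_queue)
{
    if (same_priority_queue.empty()) {
        return current;                          // only thread at this priority: keep running
    }
    same_priority_queue.push_back(current);      // caller moves to the tail of the list
    int next = same_priority_queue.front();      // the head of the list runs next
    same_priority_queue.pop_front();
    return next;
}
```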

core/sched.cc Outdated
if (p->_realtime._priority > 0) {
// Only switch to a higher-priority realtime thread, or to
// the next one with equal priority if yielding.
if (t._realtime._priority < p->_realtime._priority) {

@wkozaczuk (Collaborator) Mar 30, 2023:
This means that if somehow the current thread p has a high enough priority (for example 99) and all other threads in the run queue have a lower priority, this thread p will keep running forever; not even kernel threads will get a chance to do anything. Am I wrong in my analysis? Is this by design?
I am especially concerned about the kernel threads in this context. I am not sure if any of the thoughts raised in this article apply - https://lwn.net/Articles/818388/.

@nyh (Author):
Yes, it is by design that a real-time thread that never yields is never preempted by any thread with a lower real-time priority or a non-real-time priority.
However, the question about kernel threads is a good one: in Linux/Unix, the things that happen in the kernel weren't called "threads", so these priorities didn't apply to them. Here this concerns in particular cpu::load_balance(), which, if starved, can't even move the poor threads to other CPUs.

I think the idea is that real-time threads should be short-running - and if a real-time thread just starts an endless loop that never returns, it will cause problems on Linux too.

Maybe the right thing is to give cpu::load_balance() (and anything else?) some sort of special priority that allows it to run even while a real-time thread is running. I don't know. I think we can improve this later.

core/sched.cc Outdated
return;
#endif
}
if (t._realtime._priority == p->_realtime._priority &&

@wkozaczuk (Collaborator):
Is this condition right? If the priority of t is equal to p's, then we keep p running. Is this correct?

This patch adds "real-time priority" to OSv threads, which can be used
to implement Posix's SCHED_FIFO and SCHED_RR scheduling policies.

Refs cloudius-systems#386. It doesn't yet "Fixes" it because this patch does not yet add
the Posix front-end to this implementation (sched_setscheduler() et al.) -
this will be added in a separate patch.

We add to each thread a _realtime._priority, an unsigned integer.
Normal (non-real-time) threads have _realtime._priority of 0, and are
time-shared fairly as before, while realtime threads have
_realtime._priority > 0, and preempt any thread with a lower realtime
priority, including all normal threads.

We add a new API, t->set_realtime_priority(priority), to make a thread
real-time scheduled. The "priority" is an unsigned int as explained above
(0 means normal thread). The resulting scheduling policy is fully
compatible with POSIX's SCHED_FIFO. This includes following the intricate
rules on what happens to the running thread's place in the queue when there
are several real-time threads with the same priority and the running
thread yields, waits, or is preempted by a higher-priority thread.
This patch also adds an exercise for these cases in tests/misc-scheduler.cc.

This patch does not yet implement time slices as needed for supporting
POSIX's SCHED_RR policy. This patch already includes a time-slice setting
API, t->set_realtime_time_slice(duration), but if used it produces a
warning message, and the scheduler currently ignores this setting.
Adding support for real-time time slices is left for a future patch:
To implement time slices, we will need to keep in _realtime the total
time this thread ran since starting this slice (only zeroed when the
thread is removed from the run queue), and each time we switch to a
realtime thread with timeslice >= 0, we'll set an appropriate timer
(for timeslice minus already used time).

I tested that although this patch adds code into the scheduler it does
not noticably slow down context switches of normal threads. E.g.,
tests/misc-ctxsw.so measured colocated context switch time of 240ns
before this patch, and exactly the same after it.

While developing this patch I tested it using added workloads in
tests/misc-scheduler.cc, included in this patch, but unfortunately
this is not a real regression test - it has no assertions and cannot
automatically detect failure (and, like all the "misc-*" tests, it
doesn't run automatically in "make check").

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
nyh commented Apr 17, 2023

Pushed a new version with the bugs that @wkozaczuk noticed (thanks!) fixed.

While fixing these bugs, I decided the scheduler is written in an overly complicated way - e.g., it's ugly and confusing that we have two separate places in the context-switch code which re-queue the outgoing thread. But unfortunately, after doing this cleanup the resulting scheduler's context switches were much slower - and I couldn't figure out why even though I spent several hours trying to fix it :-( So the version I pushed now just fixes the bugs that Waldek found, without trying to improve the understandability of the code.

nyh commented Apr 17, 2023

By the way, @wkozaczuk, the Travis CI fails saying that "Error: Cannot perform an interactive login from a non TTY device" when trying to pass docker a password from a TTY. Do you have any idea where/how to fix it?

@wkozaczuk (Collaborator):
@nyh I believe I have fixed the issue with the Travis build.

@wkozaczuk (Collaborator):
Though, honestly, I am not sure if it is actually building and testing your pull request changes or what is on master.

@wkozaczuk (Collaborator):
I think your changes look good and the unit tests pass on x64. But I would like to run more tests involving various apps (see https://github.com/cloudius-systems/osv/wiki/Automated-Testing-Framework) before I merge it.

nyh commented Apr 23, 2023

Definitely, no hurry, it's been waiting for half a decade already 😂
We can also wait for the followup patches and benchmarking that started this renewed interest.

@wkozaczuk (Collaborator):
I have just committed and pushed this MR into the master branch. I meant to do it properly using the GitHub UI but somehow accidentally did it by manually applying the patch version of it (a side effect of local testing :-)). So I am closing it manually.

Anyhow, I have run quite extensive tests, including the ones driven by https://github.com/cloudius-systems/osv/wiki/Automated-Testing-Framework and others from tests/misc-*. Everything seems to work fine, including aarch64. In my case, I did see a small bump in the colocated context switch time (2-3ns with ROFS images and 5-10ns with ZFS images - more threads?). But I think it is negligible.

Closed with 894ceb2

@wkozaczuk wkozaczuk closed this May 31, 2023
nmeum added a commit to nmeum/osv that referenced this pull request Nov 11, 2024

Previously, pull request cloudius-systems#1223 added support for SCHED_FIFO but
didn't implement SCHED_RR. This PR attempts to follow up on that by
proposing an implementation of SCHED_RR.

Time slicing support, as mandated by SCHED_RR, is implemented through
the `set_realtime_time_slice(duration)` API added in the aforementioned
pull request. Within the scheduler, the amount of time the thread has run
so far is tracked; if it exceeds the set duration, the thread is
preempted (if there is a runnable thread of the same or higher priority).
The thread's run time is reset on preemption.

Signed-off-by: Sören Tempel <tempel@ibr.cs.tu-bs.de>