Scheduler algorithm

kenna
Posts: 8
Joined: Fri Apr 13, 2007 1:26 pm

Scheduler algorithm

Post by kenna »

Which of these two algorithms would be more efficient? Or would it simply be a matter of taste?

Threads:

Code: Select all

A = 4 quanta, priority 2
B = 2 quanta, priority 1
C = 2 quanta, priority 1
Algorithm 1:

Code: Select all

A A A A B C B C
Algorithm 2:

Code: Select all

A B A C A B A C
Algorithm 1 would have to perform fewer context switches, while algorithm 2 would give the other threads a shorter wait between time slices.
os64dev
Member
Posts: 553
Joined: Sat Jan 27, 2007 3:21 pm
Location: Best, Netherlands

Post by os64dev »

It depends on the threads. Take the sequence A A A A B C B C: if process A were an idle process and B and C were very complex algorithms, it would be very bad. If, on the other hand, A were a complex algorithm holding mutexes that both B and C require, it would be good.
Author of COBOS
Dandee Yuyo
Member
Posts: 47
Joined: Fri Nov 09, 2007 6:46 pm
Location: Argentina

Post by Dandee Yuyo »

For me the first one is better. Task A has higher priority than the others, so it should get more processor time.
NaN - Not a Nerd
Working on: Physical Memory Management with a 5-lod mipmap XD
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Scheduler algorithm

Post by Brendan »

Hi,
kenna wrote:Which of these two algorithms would be more efficient? Or would it simply be a matter of taste?
The first algorithm would be more efficient (fewer context switches), although I'd use "A A A A B B C C". This is what I call "variable time slice scheduling", where all threads get one time slice each and a thread's priority determines the length of its time slice. The algorithm used in the scheduler for this can easily be O(1).
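
As a rough sketch of how that stays O(1) (simplified, with an assumed 1..256 priority range - not my real code):

Code: Select all

/* "Variable time slice": one FIFO ready queue; a thread's priority
   sets the length of its slice. Both operations are O(1). */
typedef struct thread {
    struct thread *next;   /* next thread on the ready queue */
    unsigned priority;     /* assumed range: 1 (low) .. 256 (high) */
} thread_t;

static thread_t *ready_head = NULL, *ready_tail = NULL;

void enqueue(thread_t *t)                /* add at the end */
{
    t->next = NULL;
    if (ready_tail) ready_tail->next = t; else ready_head = t;
    ready_tail = t;
}

thread_t *pick_next(unsigned base_ms, unsigned *slice_ms)
{
    thread_t *t = ready_head;            /* caller ensures non-empty */
    ready_head = t->next;
    if (!ready_head) ready_tail = NULL;
    *slice_ms = base_ms * t->priority;   /* slice scales with priority */
    return t;
}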

You are correct about delta time though - with 100 running threads and an average of 20 ms per thread, each thread would have to wait for almost 2 seconds before it gets CPU time again. This makes it very bad for higher priority threads (e.g. threads that deal with the GUI).

The second algorithm is what I call "variable frequency scheduling" - each thread gets the same length time slices (e.g. 1 ms) but higher priority threads get time slices more often. This is worse for efficiency but better for high priority threads. However, it's extremely difficult (impossible?) to implement this as an "O(1)" algorithm - you end up searching through a list of threads to find out which thread should be run next.
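
A rough sketch of why it degenerates into a search (again simplified, not my real code) - give each thread a "next run" time whose period shrinks as its priority rises, then scan for the earliest one:

Code: Select all

/* "Variable frequency": fixed-length slices, but higher priority
   means a shorter period between slices. The linear scan is what
   makes an O(1) version so hard. */
typedef struct thread {
    struct thread *next;
    unsigned priority;        /* assumed range: 1 .. 256 */
    unsigned long next_run;   /* virtual time of next eligible slice */
} thread_t;

thread_t *pick_next(thread_t *list, unsigned long now)
{
    thread_t *best = list;                    /* list must be non-empty */
    for (thread_t *t = list->next; t; t = t->next)
        if (t->next_run < best->next_run)
            best = t;                         /* earliest wins */
    best->next_run = now + 256 / best->priority;  /* period ~ 1/priority */
    return best;
}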

There are also other algorithms. For example, the "highest priority thread that can run does run" algorithm is much better for very high priority threads (e.g. IRQ handling threads in my OS, and perhaps the GUI and threads that handle user interfaces). This algorithm is also easy to implement as "O(1)".
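
For example, with one queue per priority plus a bitmap of which queues are non-empty, picking the highest priority ready thread costs a handful of instructions (a sketch of the usual trick, not necessarily exactly what my kernel does):

Code: Select all

#include <stdint.h>

/* 256 bits, one per priority; a set bit means that priority's
   queue has at least one ready thread */
static uint64_t ready_bitmap[4];

int highest_ready_priority(void)   /* returns -1 if nothing is ready */
{
    for (int w = 3; w >= 0; w--)   /* at most 4 words to test */
        if (ready_bitmap[w])
            return w * 64 + 63 - __builtin_clzll(ready_bitmap[w]);
    return -1;
}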

So, which algorithm do I think is best? Honestly, none of them (or more specifically, all of them).

What I do is split things up into 4 groups (or "scheduling policies"). Policy 0 is for very high priority threads and uses the "highest priority thread that can run does run" algorithm. Policy 1 is for high priority threads and uses the "variable frequency" algorithm. Policy 2 is for normal threads and uses the "variable time slice" algorithm. Policy 3 is for idle threads and also uses the "variable time slice" algorithm.

Each thread has a policy, and a priority within that policy. You could think of this as a 10-bit number, where the highest 2 bits determine the policy and the lowest 8 bits determine the priority within that policy.
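
In code, that packing might look like this (a sketch - the macro names are made up):

Code: Select all

/* 10-bit scheduling value: top 2 bits = policy, low 8 bits = priority */
#define MAKE_SCHED(policy, priority) ((((policy) & 0x3) << 8) | ((priority) & 0xFF))
#define SCHED_POLICY(v) (((v) >> 8) & 0x3)
#define SCHED_PRIO(v)   ((v) & 0xFF)

/* e.g. MAKE_SCHED(1, 128) - a high priority "variable frequency" thread */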

The scheduler makes sure that it's always running threads from the highest priority policy. For example, if it's running a thread that belongs to Policy 2 and a thread from Policy 1 becomes ready to run, then the currently running thread is pre-empted.

Within each policy this may not apply. If a policy 0 thread is running and a higher priority policy 0 thread becomes ready to run, then the currently running thread will be pre-empted. However, if a policy 1, 2 or 3 thread is running and a higher priority thread in the same policy becomes ready to run, then the currently running thread won't be pre-empted.

To implement this I use 259 linked lists. The first 256 linked lists are used for Policy 0, with a separate linked list for each thread priority (note: if there are 2 or more policy 0 threads with the same priority, they take turns in a "round robin" way). The 257th linked list is for all policy 1 threads, the 258th linked list is for all policy 2 threads, and the 259th linked list is for all policy 3 threads.
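
The layout is something like this (a sketch, not my exact structures):

Code: Select all

/* the 259 ready queues, per CPU */
typedef struct queue {
    struct thread *head, *tail;
} queue_t;

static queue_t run_queue[259];
/* run_queue[0..255] - policy 0, one queue per priority
   run_queue[256]    - all policy 1 threads
   run_queue[257]    - all policy 2 threads
   run_queue[258]    - all policy 3 threads */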

The other thing to consider is where threads are inserted into the "ready to run" list/s. For example, for "variable time slice" scheduling, if a thread becomes ready to run you could insert it at the end of the list so it waits for all other threads to get their time slice before it gets one, or you could insert it at the start of the list so that it's the next thread to get a time slice. I always put the new thread at the end of the list - otherwise a thread could hog almost all CPU time by doing something like sleeping for 1 ms at the end of its time slice.

Lastly, there's time slice lengths. First I work out a performance rating for the computer - for example, the number of instructions that all CPUs could execute in 1 second divided by "(number_of_CPUs + 1) / 2". Then I use the performance rating to calculate the "base time slice length" - for example, "base_time_slice = performance_rating / K", where K is selected to give 1 ms time slices on an average CPU. This means that the faster the computer is, the smoother scheduling gets (or, on slower computers, there's less context switching and overhead).

For policy 0 and policy 1, all threads get the base time slice length (e.g. 1 ms for an average computer). For policy 2 a thread gets "base_time_slice * thread_priority" (e.g. 1 ms for a low priority thread and 256 ms for a high priority thread, on an average computer), and for policy 3 a thread gets "base_time_slice * 4 * thread_priority" (e.g. 4 ms for a low priority thread and 1024 ms for a high priority thread, on an average computer).
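
Putting the sizing together (a sketch with made-up constants - K would really be tuned so an average machine comes out at 1 ms):

Code: Select all

#include <stdint.h>

/* rating = instructions/second across all CPUs, divided by
   "(number_of_CPUs + 1) / 2" (kept in integer math) */
uint64_t performance_rating(uint64_t total_ips, unsigned num_cpus)
{
    return (total_ips * 2) / (num_cpus + 1);
}

/* base_time_slice = performance_rating / K */
uint64_t base_time_slice(uint64_t rating)
{
    const uint64_t K = 1000000;   /* made-up tuning constant */
    return rating / K;
}

/* per-policy slice length, treating priority as 1..256 */
uint64_t slice_length(unsigned policy, unsigned priority, uint64_t base)
{
    switch (policy) {
    case 0:
    case 1:  return base;                /* fixed length          */
    case 2:  return base * priority;     /* e.g. 1 ms .. 256 ms   */
    default: return base * 4 * priority; /* e.g. 4 ms .. 1024 ms  */
    }
}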

It all sounds complicated, but I've implemented it and found it to be extremely good. For example, you can have normal threads doing infinite loops without affecting how responsive the GUI is (or device driver threads), or idle threads doing infinite loops without affecting any threads in other policies.

There is one other "problem" with it (other than complexity) - software must use the policy/priority scheme, and must use it correctly. Crappy software (e.g. POSIX software that almost never supports the OS's thread priorities) will not work well, and programmers need to choose the policy and priority of threads wisely.

For example, for a web browser you'd want the main "GUI handling" thread (that takes care of drawing the window, menus, etc) to be a high priority thread (e.g. policy 1, priority 128) and the code that renders HTML pages (and supplies this data to the main thread) to be a normal priority thread (e.g. policy 2, priority 32) so that the GUI remains responsive regardless of what the HTML renderer is doing. Then you might want a third thread for fetching data from the internet and web cache (e.g. policy 2, priority 196), and perhaps a fourth thread that analyzes usage patterns and prefetches web pages in the background (e.g. policy 3, priority 50) so that data is preloaded before the user wants it.


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
ByrdKernel
Posts: 2
Joined: Sun Dec 02, 2007 12:52 am

Re: Scheduler algorithm

Post by ByrdKernel »

Brendan wrote:To implement this I use 259 linked lists. The first 256 linked lists are used for Policy 0, with a separate linked list for each thread priority (note: if there are 2 or more policy 0 threads with the same priority, they take turns in a "round robin" way). The 257th linked list is for all policy 1 threads, the 258th linked list is for all policy 2 threads, and the 259th linked list is for all policy 3 threads.
Holy crap, that's a lot of linked lists! How big is each list entry? Just the PID (16- or 32-bit, I would presume, having not looked at your code), and a "next" field, or more info than that?

How much performance do you gain from doing it that way? I can see the obvious gains in flexibility, but it seems like you'd have to have an awful lot of threads competing for an awful little CPU time, and gaining gobs of it back just by using your method, before all those lists (and the memory they take up, and the CPU time it takes to walk them) suck the performance right out of it.

BTW, long time no see, dude! Wassup? :)

-Mike
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Scheduler algorithm

Post by Brendan »

Hi,
ByrdKernel wrote:
Brendan wrote:To implement this I use 259 linked lists. The first 256 linked lists are used for Policy 0, with a separate linked list for each thread priority (note: if there are 2 or more policy 0 threads with the same priority, they take turns in a "round robin" way). The 257th linked list is for all policy 1 threads, the 258th linked list is for all policy 2 threads, and the 259th linked list is for all policy 3 threads.
Holy crap, that's a lot of linked lists! How big is each list entry? Just the PID (16- or 32-bit, I would presume, having not looked at your code), and a "next" field, or more info than that?
For my previous kernel each list entry was 2 bytes (thread ID of next entry on the list, or 0xFFFF if it was the last entry), but 2 bytes per thread is nothing when each thread needs a 2 KB or larger kernel stack, a 512 byte area to save FPU/MMX/SSE state, and other information.

On top of that, there were 2 bytes per linked list to store the "thread ID" of the first entry in the list and another 2 bytes per linked list to store the "thread ID" of the last entry in the list. With 259 linked lists per CPU, for 4 CPUs it adds up to 4144 bytes.
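
In other words (a sketch of those structures):

Code: Select all

#include <stdint.h>
#define END_OF_LIST 0xFFFF

/* per-thread: the queue link is just a 16-bit thread ID */
typedef struct thread_entry {
    uint16_t next_tid;   /* next thread on the same queue, or END_OF_LIST */
    /* ... kernel stack, saved FPU/MMX/SSE state, etc. (the big part) */
} thread_entry_t;

/* per-list: 2 + 2 bytes */
typedef struct list_head {
    uint16_t first_tid;  /* thread ID of the first entry on the list */
    uint16_t last_tid;   /* thread ID of the last entry on the list  */
} list_head_t;

/* 259 lists * 4 bytes * 4 CPUs = 4144 bytes, as above */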

Then there's reentrancy locks. For my previous kernel I had one (4 byte) reentrancy lock per scheduling policy per CPU (or 16 bytes per CPU, or a total of 64 bytes for 4 CPUs). However, for performance it's better to put each lock in its own cache line - with 64 byte cache lines this would cost 256 bytes per CPU. To reduce lock contention you could also have one reentrancy lock per linked list (instead of one lock per scheduling policy). With one (64-byte) cache line per lock and 259 locks per CPU this would cost 16576 bytes per CPU (or almost 65 KB for 4 CPUs). Of course you could go the opposite way and have one lock for all linked lists, but I just don't do things like that (it increases lock contention and makes scalability suck).
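
Padding each lock out to its own cache line looks something like this (a sketch, GCC syntax):

Code: Select all

#include <stdint.h>

/* a 4-byte lock padded to a 64-byte cache line, so CPUs spinning
   on different locks never contend for the same line */
typedef struct {
    volatile uint32_t lock;
    uint8_t pad[60];
} __attribute__((aligned(64))) padded_lock_t;

/* one lock per linked list: 259 * 64 = 16576 bytes per CPU */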

I also kept track of the number of running threads in each policy for each CPU, which costs 16 bytes per CPU. This is partly for CPU load balancing.
ByrdKernel wrote:How much performance do you gain from doing it that way?
How much performance do I gain compared to what? :)
ByrdKernel wrote:I can see the obvious gains in flexibility, but it seems like you'd have to have an awful lot of threads competing for an awful little CPU time, and gaining gobs of it back just by using your method, before all those lists (and the memory they take up, and the CPU time it takes to walk them) suck the performance right out of it.
When the scheduler is looking for which thread to run next it takes the first thread on the highest priority queue/linked list, and removes that thread from the linked list. When the thread stops running it's placed back on the end of the linked list. This is all O(1) - you don't need to walk the linked lists.

Finding the highest priority queue/linked list is not O(1) though. If there are no "ready to run" policy 0 threads then (worst case) I check the policy 1 linked list, then the policy 2 linked list, then the policy 3 linked list (which is never empty, as there's always a policy 3 thread ready to run). If there are "ready to run" policy 0 threads I check up to 255 linked lists before finding a list that isn't empty; but this is unusual, as policy 0 threads are "extremely high priority" and typically pre-empt lower priority tasks as soon as they become ready to run (a direct switch from the previous thread to the policy 0 thread, with no need to check any list) and then block again.
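
As a sketch of that search (simplified - the per-policy thread counts mentioned above let the 256 policy 0 queues be skipped entirely when none are ready):

Code: Select all

extern struct queue { struct thread *head, *tail; } run_queue[259];
extern unsigned policy0_ready;   /* count of ready policy 0 threads */

struct thread *find_next(void)
{
    if (policy0_ready > 0)                /* almost always zero */
        for (int i = 255; i >= 0; i--)    /* assume 255 = highest priority */
            if (run_queue[i].head)
                return run_queue[i].head;
    if (run_queue[256].head) return run_queue[256].head;  /* policy 1 */
    if (run_queue[257].head) return run_queue[257].head;  /* policy 2 */
    return run_queue[258].head;           /* policy 3 - never empty   */
}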
ByrdKernel wrote:BTW, long time no see, dude! Wassup? :)
Hehe - same thing (another rewrite in the pursuit of perfection)... ;)


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
piranha
Member
Posts: 1391
Joined: Thu Dec 21, 2006 7:42 pm
Location: Unknown. Momentum is pretty certain, however.

Post by piranha »

What if... your kernel could... watch what a process does (based on the processor time it needs, previous freezes, allocated memory, etc.) and then switch the default kernel to another (almost identical) version that has a different scheduler algorithm?
Something like this:
1) Process A does lots of stuff at a high priority. It takes lots of processor time. It freezes a lot too, because it isn't given enough CPU time.
2) IF Process A is used often, or is needed for the OS, then the kernel tells the bootloader (config file?) to boot a different kernel whose only change is a different scheduler algorithm.
3) Nothing happens until the system restarts.
4) The kernel could then check whether other apps and processes work fine with the new scheduler, and if not, reverse the change.

-JL
SeaOS: Adding VT-x, networking, and ARM support
dbittman on IRC, @danielbittman on twitter
https://dbittman.github.io