<!--startcut  ==============================================-->
<!-- *** BEGIN HTML header *** -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML><HEAD>
<title>The Linux Scheduler LG #89</title>
</HEAD>
<BODY BGCOLOR="#FFFFFF" TEXT="#000000" LINK="#0000FF" VLINK="#0000AF"
ALINK="#FF0000">
<!-- *** END HTML header *** -->

<!-- *** BEGIN navbar *** -->
<!-- *** END navbar *** -->

<!--endcut ============================================================-->

<TABLE BORDER><TR><TD WIDTH="200">
<A HREF="http://www.linuxgazette.com/">
<IMG ALT="LINUX GAZETTE" SRC="../gx/2002/lglogo_200x41.png" 
	WIDTH="200" HEIGHT="41" border="0"></A> 
<BR CLEAR="all">
<SMALL>...<I>making Linux just a little more fun!</I></SMALL>
</TD><TD WIDTH="380">


<CENTER>
<BIG><BIG><STRONG><FONT COLOR="maroon">The Linux Scheduler</FONT></STRONG></BIG></BIG>
<BR>
<STRONG>By <A HREF="../authors/vinayak.html">Vinayak Hegde</A></STRONG>
</CENTER>

</TD></TR>
</TABLE>
<P>

<!-- END header -->






<h2> Why the Design of the Scheduler is Critical </h2>
<p>
Two of the most critical parts of a kernel are the memory subsystem and
the scheduler, because they influence the design and affect the
performance of almost every other part of the kernel and the OS. That is
also why you want to get them absolutely right and optimize their
performance. The Linux kernel runs on everything from small embedded devices
to large mainframes. <!-- Though the kernel is not used as it is but some
modifications are made to it, the core more or less remains the same.
Hence it is imperative to get the scheduling policy and the design of the
scheduler right. -->  Designing a scheduler is at best <i>a black art</i>. No matter
how good the design is, some people will always feel that some categories
of processes have got a raw deal.
</p>

<p>
In the article that follows, I have purposely avoided quoting any
reference code, because you can easily get it from the net (see the
references). The article also looks at the challenges developers faced when they
redesigned the scheduler, how those challenges were met, and the future direction the
scheduler is likely
to follow. Having said that, there's nothing like reading the source code to get
an understanding of what's going on under the hood. You will find the implementation
of the scheduler in kernel/sched.c if you have the kernel source installed.
</p>

<h2> Scheduling Objectives </h2>
<p>
<b> The Linux scheduler strives to meet several objectives: </b> <br> <br>

<ol>

<li>
<b> <u> Fairness </u> </b> : 
<p> 

The scheduler should give every process a fair share of the CPU. A fair
amount of work has been done in the new kernel to ensure fair sharing of
CPU time among processes.

</p> </li>

<li> <b> <u> Throughput and CPU utilization </u> </b> : 
<p>

The scheduler should try to maximize both
throughput and CPU utilization. The usual method of increasing CPU
utilization is to increase the degree of multiprogramming, but this is only
beneficial up to a point, after which it becomes counterproductive and thrashing
starts.

</p> </li>

<li> <b> <u> Minimal Overhead </u> </b> : 
<p>

The scheduler itself should run for as little time as possible, and
scheduler latency should be minimal. This is the tricky part: scheduling
itself is <i>not useful work</i>, in the sense that no process makes
progress while the scheduler runs. But if spending a little more time
produces a better schedule, the effort may be worth it. How do we decide
the optimal point? Most schedulers solve this problem by using heuristic
policies.
</p> </li>

<li> <b> <u> Enforce priority-based scheduling </u> </b> : 
<p>

Priority scheduling means that some processes get preference over others.
At the very least, the scheduler must differentiate between I/O-bound processes
and CPU-bound processes. Moreover, some kind of aging must be
implemented so that starvation of processes does not take place. Linux
enforces priorities as well as differentiating between categories of
processes: the kernel distinguishes batch-scheduled jobs from interactive
ones, and they get a share of the CPU according to their priorities.
You have probably used the nice command to change the priority of
a process.

</p> </li>

<li> <b> <u> Turnaround time, waiting time </u> </b>: 
<p>

Turnaround time is the sum of the service time and the time spent waiting
in the ready queue. The scheduler tries to reduce both.

</p> </li>

<li> <b> <u> Response time and Variance </u> </b> : 
<p>

The response from a program should be as fast
as possible. Another important factor, often ignored, is the variance
between response times. It is not acceptable if the average response time
is low but some user occasionally experiences, say, a 10-second delay from an interactive
program.

</p> </li>

<li> <b> <u> Miscellaneous </u> </b> : 
<p>

The scheduler also tries to meet some other goals, such as
predictability: the behavior of the scheduler should be predictable for a given
set of processes with assigned priorities. The scheduler's performance must also
degrade gracefully under heavy loads. This is particularly important because
Linux is very popular in the server market, and servers tend to get overloaded
during peak traffic hours.

</p> </li> 
</ol>
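<p>
As a toy illustration of how turnaround time and waiting time relate (this is
illustrative bookkeeping, not kernel code), consider jobs that all arrive at
once and run to completion in first-come-first-served order:
</p>

```python
# Illustrative sketch (not kernel code): turnaround and waiting times
# for jobs run to completion in first-come-first-served order.

def fcfs_metrics(burst_times):
    """Return (turnaround, waiting) lists; all jobs arrive at time 0."""
    turnaround, waiting = [], []
    clock = 0
    for burst in burst_times:
        waiting.append(clock)        # time spent in the ready queue
        clock += burst               # job runs for its full service time
        turnaround.append(clock)     # completion time == turnaround here
    return turnaround, waiting

turnaround, waiting = fcfs_metrics([3, 5, 2])
# For each job, turnaround[i] == waiting[i] + service_time[i]
```

<p>
Since turnaround is waiting time plus service time, any reduction in time spent
in the ready queue directly reduces turnaround as well.
</p>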

<h2> Improvements in the scheduler </h2>
<ul>

<li> <b> <u> The O(1) complexity Scheduling algorithm </u> </b>
<p>

What's O(1), you may ask? I am only going to skim the surface of what the O (known as Big-Oh) notation means;
you will find a full treatment in any good algorithms book. The Big-Oh
notation estimates the running time of an algorithm
independent of machine-related implementation issues. It places an upper bound on the
running time of the algorithm - that is, an upper bound on the algorithm's worst case.
It is a useful technique for comparing the efficiency of algorithms with respect to
running time.
 
</p>
<p>

Take, for example, an algorithm with two nested loops, both of whose
limits range from 1 to n-1 (where n is the number of inputs to the algorithm). The upper
bound on its running time is denoted by <b> <i> O(N<sup>2</sup>) </i></b>. Similarly,
consider an algorithm that searches for an element in an unordered linked list. In the worst case
we must traverse the list to the last element to get a match - or, worse still,
to find that the element is not in the list at all. Such an algorithm is said to have
<b> <i> O(N) </i> </b> complexity, as its running time is directly proportional to the number
of elements in the list, N.
 
</p>
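<p>
To make the contrast concrete, here is a small illustrative sketch (not taken
from the kernel): a linear search may scan all N elements, while a hash-table
lookup takes roughly constant time regardless of N:
</p>

```python
# Illustrative only: O(N) scanning versus (average-case) O(1) lookup.

def linear_search(items, target):
    """O(N): may examine every element before finding target (or failing)."""
    for index, item in enumerate(items):
        if item == target:
            return index
    return -1

def constant_lookup(table, key):
    """Average O(1): a hash table jumps straight to the right bucket."""
    return table.get(key, -1)

items = list(range(1000))
table = {value: index for index, value in enumerate(items)}
assert linear_search(items, 999) == 999    # walked the whole list
assert constant_lookup(table, 999) == 999  # a single bucket probe
```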
<p>

The Linux scheduler takes constant time to schedule the processes
in the ready queue; hence, it is said to have <b> <i> O(1) </i> </b> complexity. No matter
how many processes are active on the system, the scheduler always takes
constant time to schedule them. All the "parts" - wakeup, selection of the process to run
next, context switching, and timer interrupt overhead - of the current Linux
kernel (2.5.49 is the reference here - see Resources for details) have O(1) complexity.
Hence, the new scheduler has O(1) complexity in its entirety.

</p> </li>

<li> <b> <u> Better Support for SMP and more scalable </u> </b>
<p>

As mentioned in the introduction, the Linux kernel runs on almost anything from wristwatches
to supercomputers. The earlier schedulers had some problems with scalability.
Some of these problems were solved by modifying the kernel for the application and the
target architecture, but the core design of the scheduler was
not very scalable. The new scheduler is much more scalable
and SMP (Symmetric Multi-Processing) aware, and its performance on SMP systems
is much better. One of the goals stated by Ingo Molnar, who wrote the O(1) scheduler, is
that on SMP systems, processors should not be idle when there is work to do. Also, care should be taken
that processes do not get scheduled on different processors from time to time, to avoid
the overhead of refilling the cache with the required data every time.

</p> </li>

<li> <b> <u> Batch scheduling of jobs </u> </b>
<p>

This is not exactly a new feature, but there are some patches that can be applied to support
batch scheduling, and earlier kernels also had some support for it. As of now,
batch scheduling of tasks is done using priority levels. The Linux kernel uses about 40 nice
levels (though they are all mapped to about 5 levels of priority). Batch-scheduled processes
generally get the CPU when there are not many interactive and CPU-bound processes around,
which have higher priority. Batch-scheduled processes get bigger time-slices than
normal processes. This also minimizes the effect of swapping data in and out of the cache
frequently, thus improving the performance of batch jobs.

</p> </li>

<li> <b> <u> Better performance of Interactive Jobs </u> </b>
<p>

One of the major improvements in the new scheduler is the detection of interactive
tasks and the boosting of their performance. Tweaking the old scheduler code was a bit cumbersome. In the new scheduler,
detection of interactive jobs is decoupled from other scheduler tasks such as time-slice management.
Interactive jobs are detected with the help of usage-related statistics.
As a result, interactive tasks have good response times under heavy loads, and CPU-hogging processes
cannot monopolize CPU time. The new scheduler actively detects interactive tasks and gives them
precedence over other tasks. Even then, an interactive task is scheduled along with other interactive tasks
using round-robin scheduling. This makes a lot of sense for desktop users, as they will not see
response times increase when they start a CPU-intensive job such as encoding songs into the Ogg format.
There are plans to merge the O(1) and pre-emption code to give even better response times for interactive tasks.

</p> </li>

<li> <b> <u> Better Scalability and support for more architectures </u> </b>
<p>

Because of the redesign, the new scheduler scales more easily to other architectures such
as NUMA (Non-Uniform Memory Access) and SMT (Symmetric Multi-Threading). NUMA is used on
some high-end servers and supercomputers. SMT is better known by the more popular term
HyperThreading: SMT processors implement two (or more) virtual CPUs on the same physical
processor, so one "logical" processor can be running while the other waits for memory access.
One reason for the improved scalability is that every CPU now has its own run queue; only
the load-balancing sub-part has a "global" view of the system, so changes for a particular
architecture are to be made mainly in the load-balancing sub-part. A patch for NUMA was
released recently and has in fact been incorporated into Linux 2.5.59. SMT can be seen as a
sort of NUMA system, since the sibling processors share a cache and thus have faster access to
memory that either one has accessed recently. Work on SMT is also ongoing, and a patch for it
was released recently, but the new O(1) scheduler handles SMT pretty well even without any
changes. Though the NUMA architecture bears some resemblance to SMT, the Linux scheduler
handles them differently.

</p> </li>

<li> <b> <u> Miscellaneous </u> </b>
<p>

The scheduler gives more priority to fork()ed children than to the parent itself. This could
potentially be useful for servers which use the fork call to service requests. It could also be
useful for GUI applications. There are also some priority-based real-time scheduling policies
available in the kernel.

</p> </li> 
</ul>
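<p>
The key idea behind the O(1) pick-next step described above can be sketched as
follows. This is a simplified illustration, not the actual kernel/sched.c code:
the real scheduler keeps 140 priority levels per CPU and finds the highest
non-empty level with a find-first-set bit operation over a bitmap, which takes
constant time no matter how many tasks are runnable:
</p>

```python
from collections import deque

NUM_PRIO = 140  # the 2.5 kernel uses 140 priority levels (0 = highest)

class RunQueue:
    """Simplified sketch of O(1) task selection: one FIFO per priority
    level, plus a bitmap marking which levels currently hold tasks."""

    def __init__(self):
        self.queues = [deque() for _ in range(NUM_PRIO)]
        self.bitmap = 0                      # bit i set => queues[i] non-empty

    def enqueue(self, prio, task):
        self.queues[prio].append(task)
        self.bitmap |= 1 << prio

    def pick_next(self):
        if self.bitmap == 0:
            return None                      # no runnable tasks
        # Isolate the lowest set bit (the highest priority). Real hardware
        # does this with a single find-first-set instruction, hence O(1).
        prio = (self.bitmap & -self.bitmap).bit_length() - 1
        task = self.queues[prio].popleft()
        if not self.queues[prio]:
            self.bitmap &= ~(1 << prio)      # level drained; clear its bit
        return task

rq = RunQueue()
rq.enqueue(10, "editor")
rq.enqueue(5, "kernel thread")
assert rq.pick_next() == "kernel thread"     # numerically lower = higher priority
assert rq.pick_next() == "editor"
assert rq.pick_next() is None
```

<p>
Because selecting the next task reduces to one bitmap operation and one queue
pop, its cost does not grow with the number of runnable processes.
</p>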

<h2> Resources </h2>
<ul> 
<li> <a href="http://people.redhat.com/~mingo/O(1)-scheduler/"> Ingo Molnar's Homepage O(1) patches </a> </li>
<li> <a href="http://www.tldp.org/LDP/lki/lki-2.html#ss2.3"> Linux Kernel Internals - Tigran Aivazian</a> </li>
<li> Browse the Linux Kernel Source Code @ lxr.linux.no
	<ol> 
	<li> <a href="http://lxr.linux.no/source/Documentation/sched-design.txt?v=2.5.49"> Scheduler Design </a> </li>
	<li> <a href="http://lxr.linux.no/source/Documentation/sched-coding.txt?v=2.5.49"> Some notes about scheduler functions </a> </li>
	</ol>
</li>
</ul>








<!-- *** BEGIN author bio *** -->
<P>&nbsp;
<P>
<!-- *** BEGIN bio *** -->
<P>
<img ALIGN="LEFT" ALT="[BIO]" SRC="../gx/2002/note.png">
<em>
Vinayak is currently pursuing the APGDST course
at NCST. His areas of interest are networking, parallel
computing systems and programming languages. He
believes that Linux will do to the software industry
what the invention of printing press did to the world
of science and literature. In his non-existent free
time he likes listening to music and reading books. He
is currently working on Project LIberatioN-UX where he
makes affordable computing on Linux accessible for
academia/corporates by configuring remote boot stations
(Thin Clients).
</em>
<br CLEAR="all">
<!-- *** END bio *** -->

<!-- *** END author bio *** -->


<!-- *** BEGIN copyright *** -->
<hr>
<CENTER><SMALL><STRONG>
Copyright &copy; 2003, Vinayak Hegde.
Copying license <A HREF="../copying.html">http://www.linuxgazette.com/copying.html</A><BR> 
Published in Issue 89 of <i>Linux Gazette</i>, April 2003
</STRONG></SMALL></CENTER>
<!-- *** END copyright *** -->
<HR>

<!--startcut ==========================================================-->
<CENTER>
<!-- *** BEGIN navbar *** -->
<!-- *** END navbar *** -->
</CENTER>
</BODY></HTML>
<!--endcut ============================================================-->