<html>
<head>
<link href="../lg.css" rel="stylesheet" type="text/css" media="screen, projection"  />
<title>Experiments with Kernel 2.6 on a Hyperthreaded Pentium 4 LG #103</title>

<style type="text/css" media="screen, projection">
<!--


.articlecontent {
	position:absolute;
	top:143px;
}


-->
</style>


</head>

<body>


<img src="../gx/2003/newlogo-blank-200-gold2.jpg" id="logo" alt="Linux Gazette"/>
<p id="fun">...making Linux just a little more fun!</p>


<div class="content articlecontent">

<div id="previousnexttop">
<A HREF="okopnik.html" >&lt;-- prev</A> | <A HREF="ecol.html" >next --&gt;</A>
</div>



<h1>Experiments with Kernel 2.6 on a Hyperthreaded Pentium 4</h1>
<p id="by"><b>By <A HREF="../authors/pramode.html">Pramode C.E.</A></b></p>

<p>
Having a dual CPU Linux system is cool, if you have 
deep enough pockets to buy one. You need a costlier 
motherboard, a more expensive power supply, and two 
CPUs, which obviously cost twice as much as one. 
That was, until Intel brought out the Hyperthreaded 
(HT) Pentium 4. You get two `logical' CPUs within one 
package, and you don't pay the price for two (of course, 
you don't get the performance of a real dual CPU system 
either - if somebody said so, that must be marketing talk)! 
I recently purchased one such system and am still under 
the charm of `cat /proc/cpuinfo' telling me that I have 
two CPUs in my box. This article describes various 
experiments I conducted on my HT system running 
the shining new Linux kernel 2.6.5, mostly with a view 
towards understanding concurrency control in Symmetric 
Multi Processing (SMP) systems. Readers are expected to have
an elementary understanding of operating system concepts
and Linux kernel programming to follow the material
presented here.

<h3> What is Hyperthreading? </h3>
<p>
We visualize the microprocessor processing an 
instruction stream by fetching, decoding and executing 
instructions at a breakneck speed. There are many speed 
barriers on the way. Suppose the microprocessor 
encounters an instruction to do I/O - it will have to 
wait for data to arrive from a peripheral unit over a 
comparatively slow `bus'. Suppose during the course of 
executing an instruction, some data has to be obtained 
from main memory. If the data is not available in the 
cache closest to the execution engine, it will have to 
be obtained from a slower cache down the hierarchy, or 
worse, from main memory (main memory access times are 
very long compared to the CPU clock cycle, with the 
clock reaching about 3GHz in modern processors). A 
microprocessor with an `Out of Order' execution engine 
tries to identify instructions ahead in the stream 
which can be executed in parallel. But there are limits 
to this process - limits imposed by the inherent serial 
nature of the instruction stream. Many a time, the next 
instruction will depend on results generated by the 
current instruction.
<p>
What if the microprocessor can process two instruction 
streams in parallel? Doing this will require that it 
maintains two sets of `architectural state' (general 
purpose registers, instruction pointer etc), but the 
execution units (the ALU, FPU) need not be duplicated. 
It is said that this can be done with very little 
hardware (and thus, cost) overhead. Now we can 
visualize two instruction streams in memory - the 
microprocessor maintaining two instruction pointers and 
fetching instructions from both the streams. Because 
the execution units are not duplicated, it would be 
impossible for the microprocessor to really execute two 
instructions simultaneously - but what if an instruction 
from one stream takes a long time to complete (maybe 
it is doing I/O, or there was a cache miss and it's 
waiting for data to arrive from main memory)? During 
this period, the microprocessor can execute 
instructions from the other stream using the execution 
units which are sitting idle. This is the basis of 
Hyperthreading (or, what I understand to be the basis 
of Hyperthreading - the more inquisitive readers can 
form their own conclusions by reading some excellent 
documents, links to which I have put up in the 
concluding part of this article). The key to 
performance improvement here is that the two threads 
provide a mix of instructions which employ different 
execution units within the microprocessor. How much 
performance improvement will be seen in real 
application scenarios is still a topic of debate. 
Bottom line is - if you are purchasing an HT system for 
performance, look before you leap!

<p>
A multiprocessor-aware operating system like Linux sees 
the HT Pentium as it sees a `normal' twin CPU system, 
with logical processors 0 and 1. The scheduler can run 
tasks on any one of the CPUs. 

<h3> My Setup </h3>
<p>
I am using an HT enabled 2.8GHz Pentium 4. The Linux 
distribution on this box is PCQLinux 2004, which is 
based on the latest offering from Fedora. 

<p>
I boot the system with Linux kernel 2.6.5 anxiously 
looking for signs which proclaim that I am the owner of 
a twin CPU system. Here is part of the output from `dmesg':

<pre>

Initializing CPU#0
Intel machine check reporting enabled on CPU#0.
CPU#0: Intel P4/Xeon Extended MCE MSRs (12) available
CPU#0: Thermal monitoring enabled
enabled ExtINT on CPU#0

Initializing CPU#1
masked ExtINT on CPU#1
Intel machine check reporting enabled on CPU#1.
CPU#1: Intel P4/Xeon Extended MCE MSRs (12) available
CPU#1: Thermal monitoring enabled

</pre>

The next place to look is /proc/cpuinfo, and sure 
enough, it says that there are two CPUs. Interrupts 
are shared by both the processors - so /proc/interrupts 
is also interesting. It shows the number of interrupts 
of each category processed by each logical processor. 
Running the `top' command displays the CPU on which 
each process is running.

<h3> Playing with the scheduler </h3>

<p>
The 2.6 scheduler provides a system call to set task 
affinity. Normally, the scheduler is free to reschedule 
a process on any CPU. Using the <b>sched_setaffinity </b>
system call, we can instruct it to restrict a process 
to a given set of CPUs. The manual page of this system 
call provides the prototype as:

<pre>

int sched_setaffinity(pid_t pid,unsigned int len, 
                      unsigned long *mask);

</pre>

   
which is wrong! The prototype in the header file 
/usr/include/sched.h is:

<pre>

int sched_setaffinity(__pid_t __pid, 
                      __const cpu_set_t *__mask);

</pre>

This is what we should use. The first argument is the 
pid of the process whose scheduler affinity mask is to 
be altered (pid 0 means current process). The second 
argument is the mask. The mask should be manipulated 
only using the macros provided for the purpose. Here is 
a function which, when called with a logical CPU number 
as argument, binds the calling process to that 
particular CPU (i.e., it will not be scheduled on any 
other CPU).

<p>
[<a href="misc/pramode/1.c.txt">Listing 1, using the sched_setaffinity call</a>]

<pre>

#define _GNU_SOURCE
#include &lt;sched.h&gt;

/* Bind the calling process to the given logical CPU. */
void run_on_cpu(int cpu)
{
        cpu_set_t mask;
        CPU_ZERO(&amp;mask);     /* clear all bits in the mask  */
        CPU_SET(cpu, &amp;mask); /* set only the bit for `cpu'  */
        sched_setaffinity(0, &amp;mask);
}

</pre>

CPU_ZERO initializes all the bits in the mask to zero. 
CPU_SET sets only the bit corresponding to cpu. It 
would be interesting to run lots of copies of this 
program, keeping top running on another console all the 
time. We will see all the processes getting scheduled 
on the same CPU.
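<p>
Here is a quick way to try that out - a minimal test harness 
of my own (not one of the article's listings; note also that 
current glibc versions use a three-argument sched_setaffinity 
which takes a size parameter):

<pre>

/* pin.c - hypothetical harness around run_on_cpu() above */
#define _GNU_SOURCE
#include &lt;sched.h&gt;
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;

void run_on_cpu(int cpu);       /* from Listing 1 */

int main(int argc, char **argv)
{
        if (argc != 2) {
                fprintf(stderr, "usage: %s cpu-number\n", argv[0]);
                exit(1);
        }
        run_on_cpu(atoi(argv[1]));
        for (;;)                /* spin so `top' shows where we run */
                ;
        return 0;
}

</pre>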

<p>
Another interesting experiment might be to run a 
command like <code>yes</code>:

<pre>

yes &gt; /dev/null &amp;

</pre>

Note down the CPU on which it is being scheduled and 
then run lots of copies of the program in <b>[Listing 1]</b> on
the same CPU. 
We will see the scheduler migrating <code>yes</code> to the other 
CPU because it is relatively idle. If the affinity mask is not 
explicitly altered, a process will be rescheduled on any processor.

<h3> Measuring performance </h3>
<p>
Using toy programs for benchmarking is a stupid idea. 
Still, we wish to see whether there is some benefit to 
HT, at least in tailor-made situations. I wrote a 
program which had two functions - one reading from an 
I/O port in a loop and the other one incrementing a 
floating point number in a loop. Executed serially on 
one CPU, the program takes about 21 seconds (10 seconds 
each for the numeric and I/O functions). When the two 
tasks are executed on two different CPUs, the run time 
becomes about 12.4 seconds - a really big speedup. Is 
this behaviour expected or have I got something wrong? 
I am not really sure at this point. Readers with access 
to HT systems can surely help by commenting on the code 
and also trying to measure things on their systems. 

<p>
<a href="misc/pramode/2.c.txt">[Listing 2, trying to measure HT performance]</a>

<p>
When both tasks ran only do_io, or both ran only do_float, 
the measured speedup decreased. When do_float was changed 
to do only integer arithmetic and both tasks (running on 
separate CPUs) executed it, performance degraded severely 
compared to running both tasks on one CPU. One thing we 
can be sure of here is that the instruction mix really 
matters. As any useful program does lots of arithmetic, 
lots of I/O, and plenty of memory accesses, a combination 
of such programs gives the hyperthreaded CPU an 
opportunity to execute instructions from both threads in 
parallel, using otherwise idle execution units.

<p>
Another experiment which I did was to run two jobs - 
one accessing memory in such a way as to generate lots 
of cache misses and the other one doing some floating 
point computations. About 25% speedup was observed when 
the tasks were made to run on two CPUs. 

<p>
<a href="misc/pramode/3.c.txt">[Listing 3]</a>


<h3> Playing with the kernel </h3>
<p>
The 2.6 kernel has brought in lots of scalability 
enhancements which make Linux a platform of choice for 
heavy duty server applications. There have been radical 
changes in many kernel subsystems, and the programming 
interface which the kernel presents to driver writers 
has changed in many places. Let's first try to build a 
kernel module using the new `kbuild' mechanism.

<h3> Compiling a simple module </h3>

<p>
The module does nothing - it can be loaded and 
unloaded, that's all. 

<p>
<a href="misc/pramode/4.c.txt">[Listing 4, a simple kernel module]</a>

<p>
We need a Makefile containing the following lines (4.c 
is the name of our module source file; note that the 
command under `default' must be indented with a tab):

<pre>

obj-m := 4.o

default:
	make -C /usr/src/linux-2.6.5 SUBDIRS=`pwd` modules

</pre>

<p>
Now, just invoke make and the kernel build system will 
automagically build the module. More details can be 
found in documents under <b>Documentation/kbuild</b> of the 
kernel source tree. Note that the object file to be 
loaded is named 4.ko.


<h3> Understanding race conditions </h3>

<p>
When two CPUs try to execute code which modifies 
shared objects in parallel, the outcome will depend on 
the order in which the accesses/modifications take 
place. For the remaining part of this discussion, we 
shall assume that if two CPUs try to access memory in 
parallel, some kind of hardware arbitration circuit 
will serialize the access - but which CPU gets access 
first is essentially unpredictable.

<p>
The SMP-aware Linux kernel allows multiple CPUs to 
execute kernel as well as user level code in parallel. 
Let's think of what would happen if code running on two 
CPUs tries to execute the following sequence simultaneously.

<pre>

temp = count;
temp = temp + 1;
count = temp;

</pre>

The variable count is a global object shared by both 
the executing tasks whereas temp is local to each task. 
One execution timeline can be:

<p>
Both CPUs try to do temp=count in parallel. Say CPU0 
gets access to memory first. Code running on it will 
store the current value of count (say 0) in temp. Next, 
CPU1 gets access to memory. The task running on CPU1 
also stores the current value of count (which is 0) in 
a local variable. Now, both CPUs increment their local 
copies. Both CPUs write back the value of temp (which 
is 1) to count. Note that even though both CPUs have 
executed code to increment count, because of execution 
overlaps, the count has become 1 and not 2. This is a 
really big problem. The problem would not have occurred 
had the code running on one CPU, say CPU1, read the 
value of count after the code running on the other CPU 
had finished incrementing it. But there is no such 
guarantee on a complex multiprocessor system.

<p>
Let's try to demonstrate this race condition 
experimentally. Even though we are not using a real 
dual CPU system here, the behaviour will be identical. 
We will write a simple character driver with a 
`device_write' method which keeps updating a global 
variable `count' in a loop. We will run two tasks, 
initially, both on the same CPU. The tasks will open a 
device file and invoke the `write' system call many times 
in a loop, with the effect that the code in the kernel 
gets activated. If each task invokes `write' 100 times, 
and each device_write updates the global `count' variable 
100 times, then the total count should be, after both 
the tasks are over, 20000. Any other value is an 
indication of a race condition. We will experiment with 
both tasks running on the same CPU and on different CPUs.

<p>
<a href="misc/pramode/5.c.txt">[Listing 5, kernel module containing a race condition]</a>

<p>
The test program forks a child process. Both the parent 
and child execute `write' on the same file descriptor, 
with the effect that `foo_write' gets invoked a total of 
200000000 times. Each `foo_write' increments the count by 
1 - if there is no race condition, the count should be 
200000000. We do have a race here, so the count should 
be less than 200000000, which is what we observe when 
the parent and the child run on different CPUs. But we 
have a surprising result - even when the parent and 
child run on the same CPU, we get values less than 
200000000. Why?

<p>
<a href="misc/pramode/6.c.txt">[Listing 6, testing the kernel module]</a>

<p>
The answer is that I have compiled the 2.6 kernel with 
kernel preemption enabled. Classical uniprocessor Unix 
kernels are non-preemptive. That is, if a task enters 
the kernel, until it comes out, no other task would be 
scheduled. This design reduces problems associated with 
concurrency a lot. Kernel code which modifies a global 
data structure (say a linked list) need not be worried 
that it will be preempted during the execution of a 
critical section and some other task would enter kernel 
mode and access/modify the same data structure 
resulting in race conditions. The trouble with this 
design is that tasks which execute for a long time in 
the kernel would reduce the responsiveness of the 
system by not letting interactive tasks preempt them. 
The 2.6 kernel can be compiled with preemption enabled. 
The result is that even if both tasks are running on 
the same CPU, one task may be preempted after it has 
read and stored the value of count in the temporary 
variable and before it writes back the incremented 
value. This will result in race conditions cropping up.

<p>
The solution is to disable preemption by calling 
<b>preempt_disable</b> and enable it once again, after 
the critical section is over, by calling <b>preempt_enable</b>.
These functions basically manipulate a preemption 
count, which can be read by calling <b>preempt_count</b>.
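<p>
With that change, the critical section in the write method 
would be bracketed like this (a sketch; note that this cures 
only the preemption race on one CPU, not the race between 
two CPUs):

<pre>

preempt_disable();      /* increments the preemption count */
temp = count;
temp = temp + 1;
count = temp;
preempt_enable();       /* decrements it; rescheduling may happen now */

</pre>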


<h3>Doing an `atomic' increment</h3>

<p>
What if we change the three line sequence which 
increments count to just one line?

<pre>

count++;

</pre>
<p>
We have no guarantee that race conditions will not 
occur in this case either. The x86 compiler may 
translate count++ into three separate instructions: one 
which fetches the value from memory into a register, one 
which increments the register, and one which stores it 
back to the memory location. The GNU C Compiler version 
3.3.2 on my system does in fact translate count++ (count 
is declared as volatile) into three instructions. If we 
wish to make sure that count is incremented using 
just one assembly instruction, we will have to make use 
of inline assembly code like this:

<pre>

__asm__ __volatile__ (
   "incl %0\n\t"
   :"=m"(count)
   :"m"(count)
);

</pre>
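<p>
For contrast, the code gcc generates for a plain count++ 
looks roughly like this (AT&amp;T syntax; the exact register 
may differ):

<pre>

movl    count, %eax
incl    %eax
movl    %eax, count

</pre>

<p>
Either of the two boundaries between these three instructions 
is a point where a preemption, or an interleaved access from 
the other CPU, can slip in.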

<p>
Modifying the program this way ensures that if both 
tasks are running on one CPU, and even if preemption is 
not disabled, we get the correct results. This is 
because the current task won't be preempted until the 
assembly instruction it is executing is over. What if 
we make the tasks run on different CPUs? We note that 
again we are getting random values for count. This is 
because the `incl' instruction operates internally by 
bringing the value from memory to some place within the 
CPU, incrementing it, and then transferring it back to 
the same location. The memory system hardware ensures 
that only one CPU can read from a location at any one 
instant, but once that read is over, there is no 
guarantee that the other CPU is kept from reading the 
location until the first CPU's whole read-modify-write 
sequence is over. So if two CPUs execute this sequence 
in parallel, both may load the same value, increment it, 
and write it back, with the result that the count gets 
incremented by one and not two. This is the case even on 
Hyperthreading systems - the hardware only guarantees 
that the load-store operations of one thread happen in 
sequence; they may get intermixed with the load-store 
sequences generated by the other thread.

<p>
What is the solution to this problem? On x86 systems, 
the instruction prefix `lock' can be used to make certain 
assembly instructions atomic. The Intel manuals say 
that `lock' can be used with only a few instructions (inc 
is one of them). The hardware guarantees that 
two `lock'-prefixed instructions running on two 
different CPUs and trying to access the same memory 
location will execute mutually exclusively - 
that is, only after one instruction runs to completion 
(the full read-modify-write sequence) will the other 
instruction start executing.

<pre>

__asm__ __volatile__ (
   "lock; incl %0\n\t"
   :"=m"(count)
   :"m"(count)
);

</pre>

<p>
The modified program runs slower, but it gives us the 
correct count of 200000000. 

<h3> The atomic swap operation </h3>

<p>
The x86 XCHG instruction swaps a 32 bit register with a 
memory location. Even if it is not prefixed with a 
lock, it is guaranteed that XCHG will be atomic. It can 
be used as a building block for many synchronization 
primitives like the semaphore. Let's try to implement 
an atomic increment operation using the atomic swap.

<p>
<a href="misc/pramode/7.c.txt">[Listing 7, implementing atomic increment]</a>

                



<h3> Using the builtin atomic functions </h3>

<p>
The kernel header <b>include/asm/atomic.h</b> defines a few 
inline functions and macros which can be used for 
performing atomic operations.

<pre>

atomic_t count = ATOMIC_INIT(0);   /* declare and initialize to 0 */
atomic_read(&amp;count);             /* fetch the current value      */
atomic_set(&amp;count, 2);           /* store a new value            */
atomic_inc(&amp;count);              /* increment atomically         */

</pre>
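<p>
With this API, our racy write method becomes trivial (a 
sketch; the names are from the earlier example):

<pre>

static atomic_t count = ATOMIC_INIT(0);

static ssize_t
foo_write(struct file *filp, const char __user *buf,
          size_t len, loff_t *offp)
{
        atomic_inc(&amp;count);   /* a `lock'-prefixed incl on SMP x86 */
        return len;
}

</pre>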


<h3> The spin lock </h3>

<p>
Let's now try to understand how the <b>spin lock</b>, a
very fundamental multiprocessor synchronization mechanism, works. 
The problem is to prevent kernel code running on two 
CPUs from accessing a shared data structure 
concurrently. Had the data structure been something as 
simple as an integer (or some other basic type) and the 
operation been as simple as performing an addition or a 
subtraction or an increment, the atomic operations 
would have been sufficient. But what we wish to do is 
guarantee mutual exclusion between two threads running on 
two different CPUs, accessing an arbitrarily complex 
data structure using an arbitrary sequence of 
instructions. We need something more general than the 
atomic add, sub, or inc operations.

<h3> A flawed solution </h3>

<p>
Let the tasks running on both the CPUs look at a 
shared variable (let's call it `lock') before executing 
the section of code which has got to be mutually 
exclusive. If the value of lock is zero (lock is in the 
unlocked state), make it one and enter into the 
critical section. If not, keep on spinning until the 
lock becomes zero again. 

<p>
<a href="misc/pramode/8.c.txt">[Listing 8]</a>

<pre>

/* Flawed: the test and the set are two separate steps. */
void acquire_lock(volatile int *lockptr)
{
        while(*lockptr)
                ;               /* spin until the lock looks free  */
        *lockptr = 1;           /* another CPU may get here first! */
}

void release_lock(volatile int *lockptr)
{
        *lockptr = 0;
}

</pre>


<p>
Let's say code running on two CPUs is trying to 
increment a variable `count' by first storing it in 
`tmp'. We try to prevent race conditions 
from occurring by calling acquire_lock before the code 
sequence begins and release_lock after the code 
sequence ends.

<pre>

acquire_lock(&amp;lock);

tmp = count;
tmp++;
count = tmp;

release_lock(&amp;lock);

</pre>

<p>
Testing the code shows that it is not working as 
expected. The reason is simple. Suppose code executing 
on both the CPUs calls acquire_lock simultaneously. A 
possible sequence of events would be:

<ol>

<li>Code running on CPU0 reads *lockptr and sees that it is zero

<li>Before code running on CPU0 makes *lockptr equal to 
one, code running on CPU1 sees that *lockptr is zero.

<li>Both code segments DO NOT enter into the while loop

<li>Code on CPU0 makes *lockptr equal to 1

<li>Code on CPU0 happily enters the `critical section'

<li>Code on CPU1 also makes *lockptr equal to 1 and enters 
the critical section

</ol>

<p>
The problem arose because testing whether *lockptr is 
zero and then setting it to one was not done atomically.

<h3> Another Attempt </h3>

<p>
The kernel header file include/asm/bitops.h presents an atomic 
test_and_set_bit function which takes a bit number and an 
address, sets that bit of the location pointed to 
by the address, and returns true if the bit was 
initially set and false if it was clear. Let's use this 
function to implement another version of the spin lock.

<p>
<a href="misc/pramode/9.c.txt">[Listing 9]</a>

<pre>

/* Spin locks, another attempt */

void acquire_lock(volatile unsigned long *lockptr)
{
        while (test_and_set_bit(0, lockptr))
                ;       /* spin: someone else holds the lock */
}

void release_lock(volatile unsigned long *lockptr)
{
        clear_bit(0, lockptr);
}

</pre>

<p>
Let's try to understand the logic. Assume that all the 
bits of the lock are initially clear (we use only one 
bit in this implementation). What if two CPUs try to 
execute <b>test_and_set_bit</b> concurrently? The 
implementation guarantees that one CPU will complete 
the operation before the other one begins it. That is, 
the value returned on one CPU is going to be definitely 
false and on the other one, definitely true. The CPU which 
gets the value 0 has succeeded in acquiring the lock 
while the other CPU keeps spinning. The spin loop exits 
only when the CPU which acquired the lock releases it 
by clearing the lock bit. This code should work 
properly. It does, on my system (though the fact that you 
keep getting the correct answer in tens of thousands of 
test runs is no guarantee that your implementation is 
correct - concurrent programming is too tough for such 
simplistic assumptions). The only trouble is 
that it takes a very long time to get the output when 
it is used with our kernel module which keeps 
incrementing count. But that is to be expected. Locking 
does degrade performance.

<h3> The Linux Kernel Spinlock </h3>
<p>
It's easy to use the spin locks provided by the Linux 
kernel. The kernel defines a new type spinlock_t. A 
spin lock is a variable of this type. The lock is 
initialized by a macro SPIN_LOCK_UNLOCKED. 

<pre>

spinlock_t lock = SPIN_LOCK_UNLOCKED;
spin_lock(&amp;lock);   /* acquire the lock, spinning until it is free */
spin_unlock(&amp;lock); /* release the lock */

</pre>

<p>
The x86 kernel implementation contains some hacks to 
improve locking performance - readers can browse 
through include/linux/spinlock.h and include/asm/spinlock.h.

<p>
If a thread executing in kernel space acquires a spin 
lock, and an interrupt service routine starts running 
on the same CPU, the ISR itself trying to 
acquire the lock would create problems. The ISR would 
keep on spinning, which will result in the lock holder 
not being able to release the lock. If code running on 
some other CPU is also spinning for the lock to be 
released, nobody would be able to make progress. So 
there are IRQ-safe versions of the lock and unlock 
functions - they are called <b>spin_lock_irqsave</b> and 
<b>spin_unlock_irqrestore</b>. The first one saves the current 
state of local interrupts, disables local interrupts, 
and tries to acquire the lock. The second one releases 
the lock and restores the state of local interrupts.
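<p>
Usage is symmetric with the plain versions (a sketch; the 
lock name is mine):

<pre>

spinlock_t mylock = SPIN_LOCK_UNLOCKED;
unsigned long flags;

spin_lock_irqsave(&amp;mylock, flags);       /* local IRQs now off */
/* ... critical section shared with an interrupt handler ... */
spin_unlock_irqrestore(&amp;mylock, flags);  /* IRQ state restored */

</pre>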

<p>
On uniprocessor kernels, the spin_lock and spin_unlock 
functions are compiled away - they expand to empty 
statements. But on kernels with preemption, they 
disable and enable preemption.

<h3> Deadlocks </h3>

<p>
If not done properly, locking can create subtle 
problems which are difficult to debug. If a task 
running in kernel mode wants to access/modify two data 
structures, both protected by a lock (say to copy from 
one to the other), then the order in which the locks 
are acquired becomes significant. Say one task does:

<pre>

lock A
lock B

---critical section---

unlock B
unlock A

</pre>

and another task does:

<pre>

lock B
lock A

---critical section---

unlock A
unlock B

</pre>

There is a danger of deadlock if both CPUs complete 
the first locking operation in parallel (say the 
task running on CPU0 acquires lock A and the task running 
on CPU1 acquires lock B). Now, the task on CPU0 will 
spin trying to acquire lock B, and the task on 
CPU1 will spin trying to acquire lock A. Neither 
loop will ever terminate, and the 
system will freeze. Be careful when you try out the 
code I present here - you might have to press the reset 
switch! The moral is that all tasks should acquire 
locks only in one particular order.
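<p>
In kernel C, the broken pattern looks like this (a sketch of 
the idea behind Listing 10):

<pre>

static spinlock_t lock_a = SPIN_LOCK_UNLOCKED;
static spinlock_t lock_b = SPIN_LOCK_UNLOCKED;

/* path taken by one task */
spin_lock(&amp;lock_a);
spin_lock(&amp;lock_b);
/* ... critical section ... */
spin_unlock(&amp;lock_b);
spin_unlock(&amp;lock_a);

/* path taken by the other task - opposite order: deadlock! */
spin_lock(&amp;lock_b);
spin_lock(&amp;lock_a);
/* ... critical section ... */
spin_unlock(&amp;lock_a);
spin_unlock(&amp;lock_b);

</pre>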

<p>
<a href="misc/pramode/10.c.txt">[Listing 10, kernel module demonstrating deadlock]</a>

<p>
Here is a user space program which triggers the
deadlock:

<p>
<a href="misc/pramode/11.c.txt">[Listing 11]</a>

<p>
Simply running the user space program a few times from 
the command prompt did not result in any problem on my 
machine. The reason might be that code running on one CPU 
started only after code running on the other 
had locked both lock_a and lock_b. Running the program 
within a simple shell loop soon made the system freeze.

<pre>

while true
do
./11    # name of the test program
done

</pre>

<p>
This is one of the problems with code containing 
potential deadlocks - sometimes they do not manifest 
themselves, even if we test thoroughly.

<h3> Sequence locks </h3>

<p>
Sequence locks can be used in situations where a writer 
should get immediate access to a critical section even 
in the presence of multiple readers. Readers should 
realize that they have got inconsistent results because 
of a writer interfering, and they should be prepared to 
try reading again and again till they get consistent 
results. A writer should never interfere while another 
writer is active in the critical section. The code 
which implements this locking scheme is fairly simple 
and has been introduced as part of the 2.6 kernel.

<p>
Let's first look at a `toy' program which demonstrates 
the use of sequence locks. We have a writer process which 
keeps incrementing a 32 bit counter implemented as two 
independent 16 bit counters. The idea is to increment 
the higher 16 bits when the lower 16 bits overflow from 
0xffff to 0x0. A reader process keeps reading the two 
16 bit counters and combines them into a 32 bit count value. 
At no point of time should the reader get a count 
lower than the previous count. The maximum 
count in the two 16 bit counters taken together at any 
point of time will always be less than 0xffffffff.

<p>
Two scenarios can give inconsistent results. Say the 
current value of the count is 0x6FFFF (stored as 0xFFFF 
and 0x6 in the two independent counters). The writer 
changes 0xFFFF to 0x0, which the reader reads and 
records as the lower 16 bits of the count. Next, if the 
reader reads the higher 16 bits before the writer 
increments them, the reader would get 0x6. So the combined 
value will be 0x60000, which is less than the previous 
count 0x6FFFF. The problem here is that the reader is 
interrupting the writer.

<p>
Another scenario. Say current total is 0x2FFFF. The 
reader reads the higher 16 bits and gets 0x2. Before 
the reader can sum this up with the lower 16 bits, the 
writer starts running and changes 0x2FFFF to 0x30000, 
0x30001 etc. It is possible that the reader will run 
again only when the count becomes say 0x30006. At this 
point, the reader will get the lower 16 bits, 0x0006. 
Reader combines both parts and gets 0x20006 which is 
less than 0x2FFFF. 
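<p>
The unprotected counter looks roughly like this (a sketch of 
the idea behind Listing 12; the names are mine):

<pre>

static unsigned int low16, high16;      /* the two halves */

static void writer_increment(void)
{
        if (low16 == 0xffff) {
                low16 = 0;      /* a reader peeking between    */
                high16++;       /* these stores sees a bad sum */
        } else {
                low16++;
        }
}

static unsigned int reader_combine(void)
{
        /* may pair a stale half with a fresh one */
        return (high16 &lt;&lt; 16) | low16;
}

</pre>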

<p>
<a href="misc/pramode/12.c.txt">[Listing 12, kernel module demonstrating reader/writer problem]</a>

<p>
Here is a user space test program:

<p>
<a href="misc/pramode/13.c.txt">[Listing 13]</a>

<p>
In situations where there are only a few writers and 
many readers running concurrently, where a writer is so 
impatient that he is not prepared to wait for any of 
the readers, sequence locks can be employed 
effectively. Examples of usage can be found in 
kernel/timer.c. 

<p>
We will modify our program to use seq locks - we shall 
then look at how these locks are implemented.

<p>
<a href="misc/pramode/14.c.txt">[Listing 14, Using sequence locks to solver reader-writer problem]</a>

<p>
Now let's look at the kernel implementation of the 
sequence lock.

<p>
<a href="misc/pramode/15.c.txt">[Listing 15]</a>
<pre>

typedef struct {
        unsigned sequence;
        spinlock_t lock;

} seqlock_t;

#define SEQLOCK_UNLOCKED { 0, SPIN_LOCK_UNLOCKED }

static inline void 
write_seqlock(seqlock_t *sl)
{

        spin_lock(&amp;sl-&gt;lock);
        ++sl-&gt;sequence;
        smp_wmb();                      
}       

static inline void 
write_sequnlock(seqlock_t *sl) 
{
        smp_wmb();
        sl-&gt;sequence++;
        spin_unlock(&amp;sl-&gt;lock);
}

static inline unsigned 
read_seqbegin(const seqlock_t *sl)
{
        unsigned ret = sl-&gt;sequence;
        smp_rmb();
        return ret;
}

static inline int 
read_seqretry(const seqlock_t *sl, unsigned iv)
{
        smp_rmb();
        return (iv &amp; 1) | (sl-&gt;sequence ^ iv);
}

</pre>

<p>
We note that seqlock_t is a structure which contains a 
sequence number and a spin lock - the initial sequence 
number will be 0 and the spin lock will be in the 
unlocked state. A writer will always have to acquire 
this spinlock to enter the critical section - this 
guards against other writers. Once the writer enters 
the critical section, it increments the sequence 
number. Once the writer leaves the critical section, 
the sequence number gets incremented again and the lock 
is unlocked. The important point here is that while the 
writer is operating, the sequence number would be odd. 
No writer in the critical section means sequence number 
will be even. A reader enters the critical section 
without grabbing any lock. If it is lucky, it won't be 
interrupting any writer and no writer will interrupt 
it. How does the reader know that it has been lucky 
enough? Simple: before doing its critical section, the 
reader saves away the sequence number of the 
associated lock. Then, after the critical section is 
over, the reader checks for two things:

<ol>
<li> Whether the saved sequence number was odd. If so, it 
  means that the reader was running while a writer also 
  was there in the critical section.

<li> Whether the saved sequence number and the current 
  sequence number are the same. If not, it means that a 
  writer had interrupted the reading process.
</ol>

<p>
If there have been no interruptions, the reader is happy 
with the value it has got - it's surely a 
consistent value. Otherwise, the reader redoes the 
reading process till it gets a consistent result. The 
read_seqretry function is the heart of this technique.
Its body looks like this:

<pre>

return (iv &amp; 1) | (sl-&gt;sequence ^ iv);

</pre>
<p>
The first part checks whether iv, the old sequence 
number, is odd. The second part checks whether 
the old and the new sequence numbers are identical. The 
smp_rmb and smp_wmb functions which you see in the code 
are there to handle out of order loads and stores, which I 
don't think Intel CPUs do - so those functions need 
not hinder your understanding of the code.


<h3>Conclusion</h3>
<p>
I have tried to present a few things about hyperthreading
and concurrency control in this article. Concurrency is
tricky business - and I am not an expert. I shall be grateful to
readers who point out errors/inconsistencies, if any.

<h3> Further Reading </h3>
<p>
Hyperthreading technology is explained in depth in
this <a href="http://www.intel.com/technology/itj/2002/volume06issue01/art01_hyper/p01_abstract.htm">
Intel Document</a>.
An excellent <a href="http://arstechnica.com/paedia/h/hyperthreading/hyperthreading-1.html">
Ars Technica</a>
document looks at things in a much more graphic and
intuitive way. The <a href="http://www.extremetech.com/article2/0,1558,165010,00.asp">PC
Processor Microarchitecture Technology Primer</a> is an
excellent read for those who are interested in getting
some idea of how stuff like superscalar processing,
out of order execution, branch prediction, etc. works
without resorting to heavy-duty textbooks.

<p>
Readers who are new to inline assembly might refer to
this <a href="http://www-106.ibm.com/developerworks/library/l-ia.html">
IBM Developerworks article</a> for enlightenment. It's
impossible to read kernel code without a rudimentary
understanding of this powerful feature of the GNU C Compiler.

<p>
There is an interesting discussion archived <a href="http://www.sandpile.org/post/msgs/20003998.htm">
here</a> which deals with the necessity of the `lock' prefix on
hyperthreaded systems.

<p>
The book <b>Unix Systems for Modern Architectures</b>
by <b>Curt Schimmel</b> is a must-read for programmers
working on modern SMP capable Unix kernels. The 
implementation of atomic increment using atomic
swap has been lifted from the solution to problem 8.9
of this book. An excellent book for the kernel
newbie is <b>Linux Kernel Development</b> by
<b>Robert Love</b>, of preemptive
kernel fame.


</p>


<!-- *** BEGIN author bio *** -->
<P>&nbsp;
<P>
<!-- *** BEGIN bio *** -->
<hr>
<P>
<img ALIGN="LEFT" ALT="[BIO]" SRC="../gx/2002/note.png" class="bio">
<em>
I am an instructor working for IC Software in Kerala, India. I would have loved
becoming an organic chemist, but I do the second best thing possible, which is
play with Linux and teach programming!
</em>
<br CLEAR="all">
<!-- *** END bio *** -->

<!-- *** END author bio *** -->

<div id="articlefooter">

<p>
Copyright &copy; 2004, Pramode C.E.. Copying license 
<a href="http://linuxgazette.net/copying.html">http://linuxgazette.net/copying.html</a>
</p>

<p>
Published in Issue 103 of Linux Gazette, June 2004
</p>

</div>


<div id="previousnextbottom">
<A HREF="okopnik.html" >&lt;-- prev</A> | <A HREF="ecol.html" >next --&gt;</A>
</div>


</div>






<div id="navigation">

<a href="../index.html">Home</a>
<a href="../faq/index.html">FAQ</a>
<a href="../lg_index.html">Site Map</a>
<a href="../mirrors.html">Mirrors</a>
<a href="../mirrors.html">Translations</a>
<a href="../search.html">Search</a>
<a href="../archives.html">Archives</a>
<a href="../authors/index.html">Authors</a>
<a href="../contact.html">Contact Us</a>

</div>



<div id="breadcrumbs">

<a href="../index.html">Home</a> &gt; 
<a href="index.html">June 2004 (#103)</a> &gt; 
Article

</div>





<img src="../gx/2003/sit3-shine.7-2.gif" id="tux" alt="Tux"/>




</body>
</html>