<html>
<head>
<link href="../lg.css" rel="stylesheet" type="text/css" media="screen, projection" />
<title>Experiments with Kernel 2.6 on a Hyperthreaded Pentium 4 LG #103</title>
<style type="text/css" media="screen, projection">
<!--
.articlecontent {
position:absolute;
top:143px;
}
-->
</style>
</head>
<body>
<img src="../gx/2003/newlogo-blank-200-gold2.jpg" id="logo" alt="Linux Gazette"/>
<p id="fun">...making Linux just a little more fun!</p>
<div class="content articlecontent">
<div id="previousnexttop">
<A HREF="okopnik.html" ><-- prev</A> | <A HREF="ecol.html" >next --></A>
</div>
<h1>Experiments with Kernel 2.6 on a Hyperthreaded Pentium 4</h1>
<p id="by"><b>By <A HREF="../authors/pramode.html">Pramode C.E.</A></b></p>
<p>
Having a dual CPU Linux system is cool, if you have
deep enough pockets to buy one. You need a costlier
motherboard, a more expensive power supply, and two
CPUs, which obviously cost twice as much as one.
That was the case until Intel brought out the Hyperthreaded
(HT) Pentium 4. You get two `logical' CPUs within one
package, and you don't pay the price of two (of course,
you don't get the performance of a real dual CPU system
- if somebody says you do, that's marketing talk)!
I recently purchased one such system and am still under
the charm of `cat /proc/cpuinfo' telling me that I have
two CPUs in my box. This article explains various
experiments which I conducted on my HT system running
the shining new Linux kernel 2.6.5 mostly with a view
towards understanding concurrency control in Symmetric
Multi Processing systems. Readers are expected to have
an elementary understanding of Operating System concepts
and Linux kernel programming to understand the material
presented here.
<h3> What is Hyperthreading? </h3>
<p>
We visualize the microprocessor processing an
instruction stream by fetching, decoding and executing
instructions at a breakneck speed. There are many speed
barriers on the way. Suppose the microprocessor
encounters an instruction to do I/O - it will have to
wait for data to arrive from a peripheral unit over a
comparatively slow `bus'. Suppose during the course of
executing an instruction, some data has to be obtained
from main memory. If the data is not available in the
cache closest to the execution engine, it will have to
be obtained from a slower cache down the hierarchy, or
worse, from main memory (main memory access times are
very long compared to the CPU clock cycle, with the
clock reaching about 3GHz in modern processors). A
microprocessor with an `Out of Order' execution engine
tries to identify instructions ahead in the stream
which can be executed in parallel. But there are limits
to this process - limits imposed by the inherent serial
nature of the instruction stream. Many a time, the next
instruction depends on results generated by the
current one, and so on.
<p>
What if the microprocessor can process two instruction
streams in parallel? Doing this will require that it
maintains two sets of `architectural state' (general
purpose registers, instruction pointer etc), but the
execution units (the ALU, FPU) need not be duplicated.
It is said that this can be done with very little
hardware (and thus, cost) overhead. Now we can
visualize two instruction streams in memory - the
microprocessor maintaining two instruction pointers and
fetching instructions from both the streams. Because
the execution units are not duplicated, it would be
impossible for the microprocessor to really execute two
instructions simultaneously - but what if an instruction
from one stream takes a long time to complete (maybe
it is doing I/O - or there was a cache miss and it's
waiting for data to arrive from main memory)? During
this period, the microprocessor can execute
instructions from the other stream using the execution
units which are sitting idle. This is the basis of
Hyperthreading (or, what I understand to be the basis
of Hyperthreading - the more inquisitive readers can
form their own conclusions by reading some excellent
documents, links to which I have put up in the
concluding part of this article). The key to
performance improvement here is that the two threads
provide a mix of instructions which employ different
execution units within the microprocessor. How much
performance improvement will be seen in real
application scenarios is still a topic of debate.
Bottom line is - if you are purchasing an HT system for
performance, look before you leap!
<p>
A multiprocessor-aware operating system like Linux sees
the HT Pentium as it sees a `normal' twin CPU system,
with logical processors 0 and 1. The scheduler can run
tasks on any one of the CPUs.
<h3> My Setup </h3>
<p>
I am using an HT enabled 2.8GHz Pentium 4. The Linux
distribution on this box is PCQLinux 2004, which is
based on the latest offering from Fedora.
<p>
I boot the system with Linux kernel 2.6.5 anxiously
looking for signs which proclaim that I am the owner of
a twin CPU system. Here is part of the output from `dmesg':
<pre>
Initializing CPU#0
Intel machine check reporting enabled on CPU#0.
CPU#0: Intel P4/Xeon Extended MCE MSRs (12) available
CPU#0: Thermal monitoring enabled
enabled ExtINT on CPU#0
Initializing CPU#1
masked ExtINT on CPU#1
Intel machine check reporting enabled on CPU#1.
CPU#1: Intel P4/Xeon Extended MCE MSRs (12) available
CPU#1: Thermal monitoring enabled
</pre>
The next place to look is /proc/cpuinfo, and sure
enough, it says that there are two CPUs. Interrupts
are shared by both the processors - so /proc/interrupts
is also interesting. It shows the number of interrupts
of each category processed by each logical processor.
Running the `top' command displays the CPU on which
each process is running.
<h3> Playing with the scheduler </h3>
<p>
The 2.6 scheduler provides a system call to set task
affinity. Normally, the scheduler is free to reschedule
a process on any CPU. Using the <b>sched_setaffinity </b>
system call, we can instruct it to restrict a process
to one set of CPUs. The manual page of this system
call provides the prototype as:
<pre>
int sched_setaffinity(pid_t pid,unsigned int len,
unsigned long *mask);
</pre>
which is wrong! The prototype in the header file
/usr/include/sched.h is:
<pre>
int sched_setaffinity(__pid_t __pid,
__const cpu_set_t *__mask);
</pre>
This is what we should use. The first argument is the
pid of the process whose scheduler affinity mask is to
be altered (pid 0 means current process). The second
argument is the mask. The mask should be manipulated
only using the macros provided for the purpose. Here is
a function which, when called with a logical CPU number
as argument, binds the calling process to that
particular CPU (i.e., it will not be scheduled on any
other CPU).
<p>
[<a href="misc/pramode/1.c.txt">Listing 1, using the sched_setaffinity call</a>]
<pre>
#define _GNU_SOURCE
#include <sched.h>

void run_on_cpu(int cpu)
{
        cpu_set_t mask;

        CPU_ZERO(&mask);
        CPU_SET(cpu, &mask);
        sched_setaffinity(0, &mask);
}
</pre>
CPU_ZERO initializes all the bits in the mask to zero.
CPU_SET sets only the bit corresponding to cpu. It
would be interesting to run lots of copies of this
program, keeping top running on another console all the
time. We will see all the processes getting scheduled
on the same CPU.
Another interesting experiment might be to run a
command like yes
<pre>
yes > /dev/null &
</pre>
Note down the CPU on which it is being scheduled and
then run lots of copies of the program in <b>[Listing 1]</b> on
the same CPU.
We will see the scheduler migrating <code>yes</code> to the other
CPU, because that CPU is relatively idle. If the affinity mask is not
explicitly altered, a process may be rescheduled on any processor.
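<p>
To make the experiment concrete, here is a minimal, self-contained
test program of my own (not part of the listings) which binds itself
to the CPU given on the command line and then burns cycles so that
you can watch it under top. It uses the two-argument glibc prototype
discussed above:
</p>
<pre>
/* bind.c - run as: ./bind 0 (or 1), watch it under `top', kill with Ctrl-C */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
        cpu_set_t mask;

        if (argc != 2) {
                fprintf(stderr, "usage: %s cpu-number\n", argv[0]);
                exit(1);
        }
        CPU_ZERO(&mask);                  /* clear all bits in the mask */
        CPU_SET(atoi(argv[1]), &mask);    /* allow only the given CPU */
        if (sched_setaffinity(0, &mask) == -1) {
                perror("sched_setaffinity");
                exit(1);
        }
        while (1)
                ;                         /* spin so `top' has something to show */
        return 0;
}
</pre>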
<h3> Measuring performance </h3>
<p>
Using toy programs for benchmarking is a stupid idea.
Still, we wish to see whether there is some benefit to
HT, at least in tailor-made situations. I wrote a
program which had two functions - one reading from an
I/O port in a loop and the other one incrementing a
floating point number in a loop. Executed serially on
one CPU, the program takes about 21 seconds (10 seconds
each for the numeric and I/O functions). When the two
tasks are executed on two different CPUs, the run time
becomes about 12.4 seconds - a really big speedup. Is
this behaviour expected or have I got something wrong?
I am not really sure at this point. Readers with access
to HT systems can surely help by commenting on the code
and also trying to measure things on their systems.
<p>
<a href="misc/pramode/2.c.txt">[Listing 2, trying to measure HT performance]</a>
<p>
Making both tasks run only do_io or only do_float resulted
in a decrease in the measured speedup. When do_float was
changed to do only integer arithmetic and both tasks
(running on separate CPUs) were made to run it,
there was a severe degradation in performance (when
compared to both tasks running on one CPU). One
thing we can be sure of here is that the
instruction mix is really significant. As any useful
program would be doing lots of arithmetic, lots of I/O
as well as plenty of memory accesses, a combination of
such programs will give the hyperthreaded CPU an
opportunity to execute instructions from both the
threads in parallel using unused execution units.
<p>
Another experiment which I did was to run two jobs -
one accessing memory in such a way as to generate lots
of cache misses and the other one doing some floating
point computations. About 25% speedup was observed when
the tasks were made to run on two CPUs.
<p>
<a href="misc/pramode/3.c.txt">[Listing 3]</a>
<h3> Playing with the kernel </h3>
<p>
The 2.6 kernel has brought in lots of scalability
enhancements which makes Linux a platform of choice for
heavy duty server applications. There have been radical
changes in many kernel subsystems and the programming
interface which the kernel presents to driver writers
has changed at many places. Let's first try to build a
kernel module using the new `kbuild' mechanism.
<h3> Compiling a simple module </h3>
<p>
The module does nothing - it can be loaded and
unloaded, that's all.
<p>
<a href="misc/pramode/4.c.txt">[Listing 4, a simple kernel module]</a>
<p>
We need a Makefile containing the following lines: (4.c
is the name of our module)
<pre>
obj-m := 4.o
default:
	make -C /usr/src/linux-2.6.5 SUBDIRS=`pwd` modules
</pre>
<p>
Now, just invoke make and the kernel build system will
automagically build the module. More details can be
found in documents under <b>Documentation/kbuild</b> of the
kernel source tree. Note that the object file to be
loaded is named 4.ko.
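<p>
Building, loading and unloading is then just:
</p>
<pre>
make                   # builds 4.ko via the kernel build system
/sbin/insmod ./4.ko    # load the module
/sbin/rmmod 4          # unload it; check `dmesg' for the printk output
</pre>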
<h3> Understanding race conditions </h3>
<p>
When two CPUs try to execute code which modifies
shared objects in parallel, the outcome will depend on
the order in which the accesses/modifications took
place. For the remaining part of this discussion, we
shall assume that if two CPUs try to access memory in
parallel, some kind of hardware arbitration circuit
will serialize the access - but which CPU gets access
first is essentially unpredictable.
<p>
The SMP-aware Linux kernel allows multiple CPUs to
execute kernel as well as user level code in parallel.
Let's think of what would happen if code running on two
CPUs tries to execute the following sequence simultaneously.
<pre>
temp = count;
temp = temp + 1;
count = temp;
</pre>
The variable count is a global object shared by both
the executing tasks whereas temp is local to each task.
One execution timeline can be:
<p>
Both CPUs try to do temp=count in parallel. Say CPU0
gets access to memory first. Code running on it will
store the current value of count (say 0) in temp. Next,
CPU1 gets access to memory. The task running on CPU1
also stores the current value of count (which is 0) in
a local variable. Now, both CPUs increment their local
copies. Both CPUs write back the value of temp (which
is 1) to count. Note that even though both CPUs have
executed code to increment count, because of execution
overlaps, the count has become 1 and not 2. This is a
really big problem. The problem would not have occurred
had the code running on one CPU, say CPU1, read the
value of count after the code running on the other CPU
had finished incrementing it. But there is no such
guarantee on a complex multiprocessor system.
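<p>
Written out as a timeline, the lost update looks like this:
</p>
<pre>
      CPU0                        CPU1
      ----                        ----
      temp = count;  /* 0 */
                                  temp = count;  /* also 0 */
      temp = temp + 1;
                                  temp = temp + 1;
      count = temp;  /* 1 */
                                  count = temp;  /* 1, not 2! */
</pre>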
<p>
Let's try to demonstrate this race condition
experimentally. Even though we are not using a real
dual CPU system here, the behaviour will be identical.
We will write a simple character driver with a
`device_write' method which keeps updating a global
variable `count' in a loop. We will run two tasks,
initially, both on the same CPU. The tasks will open a
device file and invoke the `write' system call many times
in a loop, with the effect that the code in the kernel
gets activated. If each task invokes `write' 100 times,
and each device_write updates the global `count' variable
100 times, then the total count should be, after both
the tasks are over, 20000. Any other value is an
indication of a race condition. We will experiment with
both tasks running on the same CPU and on different CPUs.
<p>
<a href="misc/pramode/5.c.txt">[Listing 5, kernel module containing a race condition]</a>
<p>
The test program forks a child process. Both the parent
and child execute `write' on the same file descriptor,
with the effect that `foo_write' gets invoked a total of
200000000 times. Each `foo_write' increments the count by
1 - if there is no race condition, the count should be
200000000. We do have a race here, so the count should
be less than 200000000, which is what we observe when
the parent and the child run on different CPUs. But we
have a surprising result - even when the parent and
child run on the same CPU, we get values less than
200000000. Why?
<p>
<a href="misc/pramode/6.c.txt">[Listing 6, testing the kernel module]</a>
<p>
The answer is that I have compiled the 2.6 kernel with
kernel preemption enabled. Classical uniprocessor Unix
kernels are non-preemptive. That is, once a task enters
the kernel, no other task will be scheduled until it
leaves the kernel (or sleeps voluntarily). This design reduces problems associated with
concurrency a lot. Kernel code which modifies a global
data structure (say a linked list) need not be worried
that it will be preempted during the execution of a
critical section and some other task would enter kernel
mode and access/modify the same data structure
resulting in race conditions. The trouble with this
design is that tasks which execute for a long time in
the kernel would reduce the responsiveness of the
system by not letting interactive tasks preempt them.
The 2.6 kernel can be compiled with preemption enabled.
The result is that even if both tasks are running on
the same CPU, one task may be preempted after it has
read and stored the value of count in the temporary
variable and before it writes back the incremented
value. This will result in race conditions cropping up.
<p>
The solution is to disable preemption by calling
<b>preempt_disable</b> and enable it once again after
the critical section is over by calling <b>preempt_enable</b>.
These functions basically manipulate a preemption
count, which can be obtained by calling <b>preempt_count</b>.
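<p>
Guarding the racy sequence in `foo_write' with these calls would look
like this (a sketch):
</p>
<pre>
preempt_disable();      /* no preemption on this CPU from here on */
temp = count;
temp = temp + 1;
count = temp;
preempt_enable();       /* preemption possible again */
/* note: this protects only against preemption on the same CPU,
   not against the other CPU running the same code in parallel */
</pre>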
<h3>Doing an `atomic' increment</h3>
<p>
What if we change the three line sequence which
increments count to just one line?
<pre>
count++;
</pre>
<p>
We don't have any guarantees that race conditions will
not occur in this case either. The compiler may
translate count++ into three separate instructions: one
which fetches the value from memory into a register,
one which increments the register, and one which stores it back to
the memory location. The GNU C Compiler version 3.3.2
on my system in fact translates count++ (count is
declared as volatile) to three statements. If we wish
to make sure that count is getting incremented using
just one assembly instruction, we will have to make use
of inline assembly code like this:
<pre>
__asm__ __volatile__ (
"incl %0\n\t"
:"=m"(count)
:"m"(count)
);
</pre>
<p>
Modifying the program this way ensures that if both
tasks are running on one CPU, and even if preemption is
not disabled, we get the correct results. This is
because the current task won't be preempted until the
assembly instruction it is executing is over. What if
we make the tasks run on different CPUs? We note that
again we are getting random values for count. This is
because the `incl' statement operates internally by
bringing the value in memory to some place within the
CPU, incrementing it and then transferring it back to
the same location. It is possible that if two CPUs try
to execute this sequence in parallel, both the CPUs
might load the value in memory to some location
internal to them (the memory system hardware ensures
that only one CPU is able to read from a location at
one instant of time. But once that read is over, there
is no guarantee that the other CPU will not be able to
read from memory till the whole read-modify-write
sequence running on the first CPU is completely over),
increments the value and writes back to memory with the
result that the count gets incremented by one and not
two. This seems to be the case even with Hyperthreading
systems - the hardware only guarantees that the
load-store operations of one thread happen in sequence.
This may get intermixed with the load-store sequences
generated by the other thread.
<p>
What is the solution to this problem? On x86 systems,
the instruction prefix `lock' can be used to make certain
assembly instructions atomic. The Intel manuals say
that `lock' can be used with only a few instructions (inc
is one of them). The hardware somehow guarantees that
two instructions prefixed by a `lock' running on two
different CPUs and trying to access the same memory
location will execute mutually exclusive to each other
- that is, only after one instruction runs to completion
(the full read-modify-write sequence) will the other
instruction start executing.
<pre>
__asm__ __volatile__ (
"lock; incl %0\n\t"
:"=m"(count)
:"m"(count)
);
</pre>
<p>
The modified program runs slower, but it gives us the
correct count of 200000000.
<h3> The atomic swap operation </h3>
<p>
The x86 XCHG instruction swaps a 32 bit register with a
memory location. Even if it is not prefixed with a
lock, it is guaranteed that XCHG will be atomic. It can
be used as a building block for many synchronization
primitives like the semaphore. Let's try to implement
an atomic increment operation using the atomic swap.
<p>
<a href="misc/pramode/7.c.txt">[Listing 7, implementing atomic increment]</a>
<h3> Using the builtin atomic functions </h3>
<p>
The kernel header <b>include/asm/atomic.h</b> defines a few
inline functions and macros which can be used for
performing atomic operations.
<pre>
atomic_t count = ATOMIC_INIT(0);
atomic_read(&count);
atomic_set(&count, 2);
atomic_inc(&count);
</pre>
<h3> The spin lock </h3>
<p>
Let's now try to understand how the <b>spin lock</b>, a
very fundamental multiprocessor synchronization mechanism, works.
The problem is to prevent kernel code running on two
CPUs from accessing a shared data structure
concurrently. Had the data structure been something as
simple as an integer (or some other basic type) and the
operation been as simple as performing an addition or a
subtraction or an increment, the atomic operations
would have been sufficient. But what we wish to do is
provide mutual exclusivity to two threads running on
two different CPUs accessing an arbitrarily complex
data structure using an arbitrary sequence of
instructions. We need something more general than the
atomic add, sub, or inc operations.
<h3> A flawed solution </h3>
<p>
Let the tasks running on both the CPUs look at a
shared variable (let's call it `lock') before executing
the section of code which has got to be mutually
exclusive. If the value of lock is zero (lock is in the
unlocked state), make it one and enter into the
critical section. If not, keep on spinning until the
lock becomes zero again.
<p>
<a href="misc/pramode/8.c.txt">[Listing 8]</a>
<pre>
void acquire_lock(volatile int *lockptr)
{
        while (*lockptr)
                ;                /* spin while somebody holds the lock */
        *lockptr = 1;            /* grab it - but the test above and
                                    this store are not atomic! */
}

void release_lock(volatile int *lockptr)
{
        *lockptr = 0;
}
</pre>
<p>
Let's say code running on two CPUs is trying to
increment a variable `count' by first storing it in
`tmp'. We try to prevent race conditions
from occurring by calling acquire_lock before the code
sequence begins and release_lock after the code
sequence ends.
<pre>
acquire_lock(&lock);
tmp = count;
tmp++;
count = tmp;
release_lock(&lock);
</pre>
<p>
Testing the code shows that it is not working as
expected. The reason is simple. Suppose code executing
on both the CPUs calls acquire_lock simultaneously. A
possible sequence of events would be:
<ol>
<li>Code running on CPU0 reads *lockptr and sees that it is zero
<li>Before code running on CPU0 makes *lockptr equal to
one, code running on CPU1 sees that *lockptr is zero.
<li>Both code segments DO NOT enter into the while loop
<li>Code on CPU0 makes *lockptr equal to 1
<li>Code on CPU0 happily enters the `critical section'
<li>Code on CPU1 also makes *lockptr equal to 1 and enters
the critical section
</ol>
<p>
The problem arose because testing whether *lockptr is
zero and then making it equal to one were not executed atomically.
<h3> Another Attempt </h3>
<p>
The kernel header file include/asm/bitops.h presents an atomic
test_and_set_bit function which takes an address and a
bit number and sets that bit of the location pointed to
by address, also returning true if the bit was
initially set and false if it was clear. Let's use this
function to implement another version of the spin lock.
<p>
<a href="misc/pramode/9.c.txt">[Listing 9]</a>
<pre>
/* Spin locks, another attempt */
void acquire_lock(volatile unsigned long *lockptr)
{
        /* spin till the bit was clear when we set it -
           then we own the lock */
        while (test_and_set_bit(0, lockptr))
                ;
}

void release_lock(volatile unsigned long *lockptr)
{
        clear_bit(0, lockptr);
}
</pre>
<p>
Let's try to understand the logic. Assume that all the
bits of the lock are initially clear (we use only one
bit in this implementation). What if two CPUs try to
execute <b>test_and_set_bit</b> concurrently? The
implementation guarantees that one CPU will complete
the operation before the other one begins it. That is,
the value returned on one CPU is going to be definitely
false and on the other one, definitely true. The CPU which
gets the value 0 has succeeded in acquiring the lock
while the other CPU keeps spinning. The spin loop exits
only when the CPU which acquired the lock releases it
by clearing the lock bit. This code should work
properly. It does, on my system (the fact that you keep
on getting the correct answer in tens of thousands of
test runs is no guarantee that your implementation is
correct - concurrent programming is too tough for that
kind of simplistic assumptions). The only trouble is
that it takes a very long time to get the output when
it is used with our kernel module which keeps
incrementing count. But that is to be expected. Locking
does degrade performance.
<h3> The Linux Kernel Spinlock </h3>
<p>
It's easy to use the spin locks provided by the Linux
kernel. The kernel defines a new type spinlock_t. A
spin lock is a variable of this type. The lock is
initialized by a macro SPIN_LOCK_UNLOCKED.
<pre>
spinlock_t lock = SPIN_LOCK_UNLOCKED;
spin_lock(&lock);    /* Acquires the lock, spinning till it is free */
spin_unlock(&lock);  /* Releases the lock */
</pre>
<p>
The x86 kernel implementation contains some hacks to
improve locking performance - readers can browse
through include/linux/spinlock.h and include/asm/spinlock.h.
<p>
If a thread executing in kernel space acquires a spin
lock and if it goes on to execute an interrupt service
routine (on the same CPU), the ISR itself trying to
acquire the lock would create problems. The ISR would
keep on spinning, which will result in the lock holder
not being able to release the lock. If code running on
some other CPU is also spinning for the lock to be
released, nobody would be able to make progress. So
there are IRQ-safe versions of the lock and unlock
function - they are called <b>spin_lock_irqsave</b> and
<b>spin_unlock_irqrestore</b>. The first one saves the current
state of local interrupts, disables local interrupts
and tries to acquire the lock. The second one releases
the lock and restores state of local interrupts.
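<p>
In code:
</p>
<pre>
spinlock_t lock = SPIN_LOCK_UNLOCKED;
unsigned long flags;

spin_lock_irqsave(&lock, flags);      /* save IRQ state, disable, lock */
/* critical section - safe even against local interrupt handlers */
spin_unlock_irqrestore(&lock, flags); /* unlock, restore IRQ state */
</pre>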
<p>
On uniprocessor kernels, the spin_lock and spin_unlock
functions are compiled away - they expand to empty
statements. But on kernels with preemption, they
disable and enable preemption.
<h3> Deadlocks </h3>
<p>
If not done properly, locking can create subtle
problems which are difficult to debug. If a task
running in kernel mode wants to access/modify two data
structures, both protected by a lock (say to copy from
one to the other), then the order in which the locks
are acquired becomes significant. Say one task does:
<pre>
lock A
lock B
---critical section---
unlock B
unlock A
</pre>
and another task does:
<pre>
lock B
lock A
---critical section---
unlock A
unlock B
</pre>
There is a danger of deadlock if both CPUs complete
the first set of locking operations in parallel (say
task running on CPU0 acquires lock A and task running
on CPU1 acquires lock B). Now, task on CPU0 will try to
acquire lock B and will go in a loop and task on
CPU1 will try to acquire lock A and will get into a
spin. Both loops will never terminate and the
system will freeze. Be careful when you try out the
code I present here - you might have to press the reset
switch! The moral is that all tasks should acquire
locks only in one particular order.
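<p>
That is, the fix is simply to make the second task take the locks in
the same order as the first:
</p>
<pre>
lock A
lock B
---critical section---
unlock B
unlock A
</pre>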
<p>
<a href="misc/pramode/10.c.txt">[Listing 10, kernel module demonstrating deadlock]</a>
<p>
Here is a user space program which triggers the
deadlock:
<p>
<a href="misc/pramode/11.c.txt">[Listing 11]</a>
<p>
Simply running the user space program a few times from
the command prompt did not result in any problem on my
machine; the reason might be that code running on one CPU
started only after code running on the other
had locked both lock_a and lock_b. Running the program
within a simple shell loop soon made the system freeze.
<pre>
while true
do
./11   # Name of the test program
done
</pre>
This is one of the problems with code containing
deadlocks - sometimes they do not manifest themselves,
even if we test thoroughly.
<h3> Sequence locks </h3>
<p>
Sequence locks can be used in situations where a writer
should get immediate access to a critical section even
in the presence of multiple readers. Readers should
realize that they have got inconsistent results because
of a writer interfering, and they should be prepared to
try reading again and again till they get consistent
results. A writer should never interfere while another
writer is active in the critical section. The code
which implements this locking scheme is fairly simple
and has been introduced as part of the 2.6 kernel.
<p>
Let's first look at a `toy' program which demonstrates
the use of sequence locks. We have a writer process which
keeps incrementing a 32 bit counter implemented as two
independent 16 bit counters. The idea is to increment
the higher 16 bits when the lower 16 bits overflow from
0xffff to 0x0. A reader process keeps reading the two
16 bit counters and generates a 32 bit count value. At
no point of time should the reader process get a count
which is lower than the previous count. The maximum
count in the two 16 bit counters taken together at any
point of time will always be less than 0xffffffff.
<p>
Two problems can give inconsistent results. Say the
current value of the count is 0x6FFFF (stored as 0xFFFF
and 0x6 in the two independent counters). The writer
changes 0xFFFF to 0x0, which the reader reads and
records as the lower 16 bits of the count. Next, if the
reader reads the higher 16 bits before the writer
increments it, reader would get 0x6. So the combined
value will be 0x60000 which is less than the previous
count 0x6FFFF. The problem here is that the reader is
interrupting the writer.
<p>
Another scenario. Say current total is 0x2FFFF. The
reader reads the higher 16 bits and gets 0x2. Before
the reader can sum this up with the lower 16 bits, the
writer starts running and changes 0x2FFFF to 0x30000,
0x30001 etc. It is possible that the reader will run
again only when the count becomes say 0x30006. At this
point, the reader will get the lower 16 bits, 0x0006.
Reader combines both parts and gets 0x20006 which is
less than 0x2FFFF.
<p>
<a href="misc/pramode/12.c.txt">[Listing 12, kernel module demonstrating reader/writer problem]</a>
<p>
Here is a user space test program:
<p>
<a href="misc/pramode/13.c.txt">[Listing 13]</a>
<p>
In situations where there are only a few writers and
many readers running concurrently, where a writer is so
impatient that he is not prepared to wait for any of
the readers, sequence locks can be employed
effectively. Examples of usage can be found in
kernel/timer.c.
<p>
We will modify our program to use seq locks - we shall
then look at how these locks are implemented.
<p>
<a href="misc/pramode/14.c.txt">[Listing 14, Using sequence locks to solver reader-writer problem]</a>
<p>
Now let's look at the kernel implementation of the
sequence lock.
<p>
<a href="misc/pramode/15.c.txt">[Listing 15]</a>
<pre>
typedef struct {
        unsigned sequence;
        spinlock_t lock;
} seqlock_t;

#define SEQLOCK_UNLOCKED { 0, SPIN_LOCK_UNLOCKED }

static inline void write_seqlock(seqlock_t *sl)
{
        spin_lock(&sl->lock);
        ++sl->sequence;
        smp_wmb();
}

static inline void write_sequnlock(seqlock_t *sl)
{
        smp_wmb();
        sl->sequence++;
        spin_unlock(&sl->lock);
}

static inline unsigned read_seqbegin(const seqlock_t *sl)
{
        unsigned ret = sl->sequence;
        smp_rmb();
        return ret;
}

static inline int read_seqretry(const seqlock_t *sl, unsigned iv)
{
        smp_rmb();
        return (iv & 1) | (sl->sequence ^ iv);
}
</pre>
<p>
We note that seqlock_t is a structure which contains a
sequence number and a spin lock - the initial sequence
number will be 0 and the spin lock will be in the
unlocked state. A writer will always have to acquire
this spinlock to enter the critical section - this
guards against other writers. Once the writer enters
the critical section, it increments the sequence
number. Once the writer leaves the critical section,
the sequence number gets incremented again and the lock
is unlocked. The important point here is that while the
writer is operating, the sequence number would be odd.
No writer in the critical section means sequence number
will be even. A reader enters the critical section
without grabbing any lock. If it is lucky, it won't be
interrupting any writer and no writer will interrupt
it. How does the reader know that it has been lucky
enough? Simple: before doing its critical section, the
reader saves away the sequence number of the
associated lock. Now, after the critical section is
over, the reader will check for two things:
<ol>
<li> Whether the saved sequence number was odd. If so, it
means that the reader was running while a writer also
was there in the critical section.
<li> Whether the saved sequence number and the current
sequence number are the same. If not, it means that a
writer had interrupted the reading process.
</ol>
<p>
If there have been no interruptions, the reader is happy
with the value it has got - it's surely going to be a
consistent value. Otherwise, the reader will redo the
reading process till it gets a consistent result. The
read_seqretry function is the heart of this technique.
Its body looks like this:
<pre>
return (iv & 1) | (sl->sequence ^ iv);
</pre>
<p>
The first part checks whether iv, the old sequence
number, is odd. The second part checks whether
the old and the new sequence numbers are identical. The
smp_rmb, smp_wmb functions which you see in the code
are there to handle out-of-order loads and stores, which I
don't think Intel CPUs do - so those functions need
not hinder your understanding of the code.
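<p>
A few concrete cases make the expression clear:
</p>
<pre>
iv = 4, sl->sequence = 4 : (4 & 1) | (4 ^ 4) = 0  -> consistent, accept
iv = 5 (odd)             : (5 & 1) = 1            -> writer was active, retry
iv = 4, sl->sequence = 6 : (4 ^ 6) = 2            -> writer intervened, retry
</pre>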
<h3>Conclusion</h3>
<p>
I have tried to present a few things about hyperthreading
and concurrency control in this article. Concurrency is
tricky business - and I am not an expert. I shall be grateful to
readers who point out errors/inconsistencies, if any.
<h3> Further Reading </h3>
<p>
Hyperthreading technology is explained in depth in
this <a href="http://www.intel.com/technology/itj/2002/volume06issue01/art01_hyper/p01_abstract.htm">
Intel Document</a>.
An excellent <a href="http://arstechnica.com/paedia/h/hyperthreading/hyperthreading-1.html">
Ars Technica</a>
document looks at things in a much more graphic and
intuitive way. The <a href="http://www.extremetech.com/article2/0,1558,165010,00.asp">PC
Processor Microarchitecture Technology Primer</a> is an
excellent read for those who are interested in getting
some idea of how stuff like superscalar processing,
out of order execution, branch prediction, etc. works
without resorting to heavy-duty textbooks.
<p>
Readers who are new to inline assembly might refer to
this <a href="http://www-106.ibm.com/developerworks/library/l-ia.html">
IBM Developerworks article</a> for enlightenment. It's
impossible to read kernel code without a rudimentary
understanding of this powerful feature of the GNU C Compiler.
<p>
There is an interesting discussion archived <a href="http://www.sandpile.org/post/msgs/20003998.htm">
here</a> which deals with the necessity of the `lock' prefix on
hyperthreaded systems.
<p>
The book <b>Unix Systems for Modern Architectures</b>
by <b>Curt Schimmel</b> is a must read for programmers
working on modern SMP capable Unix kernels. The
implementation of atomic increment using atomic
swap has been lifted from solution to problem 8.9
of this book. An excellent book for the kernel
newbie is <b>Linux Kernel Development</b> by
<b>Robert Love</b> of preemptive
kernel fame.
</p>
<!-- *** BEGIN author bio *** -->
<P>
<P>
<!-- *** BEGIN bio *** -->
<hr>
<P>
<img ALIGN="LEFT" ALT="[BIO]" SRC="../gx/2002/note.png" class="bio">
<em>
I am an instructor working for IC Software in Kerala, India. I would have loved
becoming an organic chemist, but I do the second best thing possible, which is
play with Linux and teach programming!
</em>
<br CLEAR="all">
<!-- *** END bio *** -->
<!-- *** END author bio *** -->
<div id="articlefooter">
<p>
Copyright © 2004, Pramode C.E.. Copying license
<a href="http://linuxgazette.net/copying.html">http://linuxgazette.net/copying.html</a>
</p>
<p>
Published in Issue 103 of Linux Gazette, June 2004
</p>
</div>
<div id="previousnextbottom">
<A HREF="okopnik.html" ><-- prev</A> | <A HREF="ecol.html" >next --></A>
</div>
</div>
<div id="navigation">
<a href="../index.html">Home</a>
<a href="../faq/index.html">FAQ</a>
<a href="../lg_index.html">Site Map</a>
<a href="../mirrors.html">Mirrors</a>
<a href="../mirrors.html">Translations</a>
<a href="../search.html">Search</a>
<a href="../archives.html">Archives</a>
<a href="../authors/index.html">Authors</a>
<a href="../contact.html">Contact Us</a>
</div>
<div id="breadcrumbs">
<a href="../index.html">Home</a> >
<a href="index.html">June 2004 (#103)</a> >
Article
</div>
<img src="../gx/2003/sit3-shine.7-2.gif" id="tux" alt="Tux"/>
</body>
</html>