<html>
<head>
<link href="../lg.css" rel="stylesheet" type="text/css" media="screen, projection"  />
<title>Experiments with Kernel 2.6 on a Hyperthreaded Pentium 4 LG #103</title>

<style type="text/css" media="screen, projection">
<!--


.articlecontent {
	position:absolute;
	top:143px;
}


-->
</style>


</head>

<body>


<img src="../gx/2003/newlogo-blank-200-gold2.jpg" id="logo" alt="Linux Gazette"/>
<p id="fun">...making Linux just a little more fun!</p>


<div class="content articlecontent">

<div id="previousnexttop">
<A HREF="okopnik.html" >&lt;-- prev</A> | <A HREF="ecol.html" >next --&gt;</A>
</div>



<h1>Experiments with Kernel 2.6 on a Hyperthreaded Pentium 4</h1>
<p id="by"><b>By <A HREF="../authors/pramode.html">Pramode C.E.</A></b></p>

<p>
Having a dual CPU Linux system is cool, if you have 
deep enough pockets to buy one. You need a costlier 
motherboard, a more expensive power supply, and two 
CPUs, which obviously cost twice as much as one. 
That was, until Intel brought out the Hyperthreaded 
(HT) Pentium 4. You get two `logical' CPUs within one 
package, and you don't pay the price for two (of course, 
you don't get the performance of a real dual CPU system 
either - if somebody said so, that must be marketing talk)! 
I recently purchased one such system and am still under 
the charm of `cat /proc/cpuinfo' telling me that I have 
two CPUs in my box. This article describes various 
experiments I conducted on my HT system running 
the shining new Linux kernel 2.6.5, mostly with a view 
towards understanding concurrency control in Symmetric 
Multi Processing (SMP) systems. Readers are expected to have
an elementary understanding of operating system concepts
and Linux kernel programming to follow the material
presented here.

<h3> What is Hyperthreading? </h3>
<p>
We visualize the microprocessor processing an 
instruction stream by fetching, decoding and executing 
instructions at a breakneck speed. There are many speed 
barriers on the way. Suppose the microprocessor 
encounters an instruction to do I/O - it will have to 
wait for data to arrive from a peripheral unit over a 
comparatively slow `bus'. Suppose during the course of 
executing an instruction, some data has to be obtained 
from main memory. If the data is not available in the 
cache closest to the execution engine, it will have to 
be obtained from a slower cache down the hierarchy, or 
worse, from main memory (main memory access times are 
very long compared to the CPU clock cycle, with the 
clock reaching about 3GHz in modern processors). A 
microprocessor with an `Out of Order' execution engine 
tries to identify instructions ahead in the stream 
which can be executed in parallel. But there are limits 
to this process - limits imposed by the inherent serial 
nature of the instruction stream. Many a time, the next 
instruction will depend on results generated by the 
current instruction.
<p>
What if the microprocessor can process two instruction 
streams in parallel? Doing this will require that it 
maintains two sets of `architectural state' (general 
purpose registers, instruction pointer etc), but the 
execution units (the ALU, FPU) need not be duplicated. 
It is said that this can be done with very little 
hardware (and thus, cost) overhead. Now we can 
visualize two instruction streams in memory - the 
microprocessor maintaining two instruction pointers and 
fetching instructions from both the streams. Because 
the execution units are not duplicated, it would be 
impossible for the microprocessor to really execute two 
instructions simultaneously - but what if an instruction 
from one stream takes a long time to complete (maybe 
it is doing I/O, or there was a cache miss and it's 
waiting for data to arrive from main memory)? During 
this period, the microprocessor can execute 
instructions from the other stream using the execution 
units which are sitting idle. This is the basis of 
Hyperthreading (or, what I understand to be the basis 
of Hyperthreading - the more inquisitive readers can 
form their own conclusions by reading some excellent 
documents, links to which I have put up in the 
concluding part of this article). The key to 
performance improvement here is that the two threads 
provide a mix of instructions which employ different 
execution units within the microprocessor. How much 
performance improvement will be seen in real 
application scenarios is still a topic of debate. 
Bottom line is - if you are purchasing an HT system for 
performance, look before you leap!

<p>
A multiprocessor-aware operating system like Linux sees 
the HT Pentium as it sees a `normal' twin CPU system, 
with logical processors 0 and 1. The scheduler can run 
tasks on any one of the CPUs. 

<h3> My Setup </h3>
<p>
I am using an HT enabled 2.8GHz Pentium 4. The Linux 
distribution on this box is PCQLinux 2004, which is 
based on the latest offering from Fedora. 

<p>
I boot the system with Linux kernel 2.6.5 anxiously 
looking for signs which proclaim that I am the owner of 
a twin CPU system. Here is part of the output from `dmesg':

<pre>

Initializing CPU#0
Intel machine check reporting enabled on CPU#0.
CPU#0: Intel P4/Xeon Extended MCE MSRs (12) available
CPU#0: Thermal monitoring enabled
enabled ExtINT on CPU#0

Initializing CPU#1
masked ExtINT on CPU#1
Intel machine check reporting enabled on CPU#1.
CPU#1: Intel P4/Xeon Extended MCE MSRs (12) available
CPU#1: Thermal monitoring enabled

</pre>

The next place to look is /proc/cpuinfo, and sure 
enough, it says that there are two CPUs. Interrupts 
are shared by both the processors - so /proc/interrupts 
is also interesting. It shows the number of interrupts 
of each category processed by each logical processor. 
Running the `top' command displays the CPU on which 
each process is running.

<h3> Playing with the scheduler </h3>

<p>
The 2.6 scheduler provides a system call to set task 
affinity. Normally, the scheduler is free to reschedule 
a process on any CPU. Using the <b>sched_setaffinity </b>
system call, we can instruct it to restrict a process 
to a given set of CPUs. The manual page of this system 
call provides the prototype as:

<pre>

int sched_setaffinity(pid_t pid,unsigned int len, 
                      unsigned long *mask);

</pre>

   
which is wrong! The prototype in the header file 
/usr/include/sched.h is:

<pre>

int sched_setaffinity(__pid_t __pid, 
                      __const cpu_set_t *__mask);

</pre>

This is what we should use. The first argument is the 
pid of the process whose scheduler affinity mask is to 
be altered (pid 0 means current process). The second 
argument is the mask. The mask should be manipulated 
only using the macros provided for the purpose. Here is 
a function which, when called with a logical CPU number 
as argument, binds the calling process to that 
particular CPU (i.e., it will not be scheduled on any 
other CPU).

<p>
[<a href="misc/pramode/1.c.txt">Listing 1, using the sched_setaffinity call</a>]

<pre>

#define _GNU_SOURCE
#include &lt;sched.h&gt;

/* Bind the calling process to the given logical CPU. */
void run_on_cpu(int cpu)
{
        cpu_set_t mask;
        CPU_ZERO(&amp;mask);     /* clear all bits in the mask  */
        CPU_SET(cpu, &amp;mask); /* set only the bit for `cpu'  */
        sched_setaffinity(0, &amp;mask);
}

</pre>

CPU_ZERO initializes all the bits in the mask to zero. 
CPU_SET sets only the bit corresponding to cpu. It 
would be interesting to run lots of copies of this 
program, keeping top running on another console all the 
time. We will see all the processes getting scheduled 
on the same CPU.
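<p>
Here is a quick way to try that out - a minimal test harness 
of my own (not one of the article's listings; note also that 
current glibc versions use a three-argument sched_setaffinity 
which takes a size parameter):

<pre>

/* pin.c - hypothetical harness around run_on_cpu() above */
#define _GNU_SOURCE
#include &lt;sched.h&gt;
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;

void run_on_cpu(int cpu);       /* from Listing 1 */

int main(int argc, char **argv)
{
        if (argc != 2) {
                fprintf(stderr, "usage: %s cpu-number\n", argv[0]);
                exit(1);
        }
        run_on_cpu(atoi(argv[1]));
        for (;;)                /* spin so `top' shows where we run */
                ;
        return 0;
}

</pre>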

<p>
Another interesting experiment might be to run a 
command like <code>yes</code>:

<pre>

yes &gt; /dev/null &amp;

</pre>

Note down the CPU on which it is being scheduled and 
then run lots of copies of the program in <b>[Listing 1]</b> on
the same CPU. 
We will see the scheduler migrating <code>yes</code> to the other 
CPU because it is relatively idle. If the affinity mask is not 
explicitly altered, a process will be rescheduled on any processor.

<h3> Measuring performance </h3>
<p>
Using toy programs for benchmarking is a stupid idea. 
Still, we wish to see whether there is some benefit to 
HT, at least in tailor-made situations. I wrote a 
program which had two functions - one reading from an 
I/O port in a loop and the other one incrementing a 
floating point number in a loop. Executed serially on 
one CPU, the program takes about 21 seconds (10 seconds 
each for the numeric and I/O functions). When the two 
tasks are executed on two different CPUs, the run time 
becomes about 12.4 seconds - a really big speedup. Is 
this behaviour expected or have I got something wrong? 
I am not really sure at this point. Readers with access 
to HT systems can surely help by commenting on the code 
and also trying to measure things on their systems. 

<p>
<a href="misc/pramode/2.c.txt">[Listing 2, trying to measure HT performance]</a>

<p>
When both tasks ran only do_io, or both ran only do_float, 
the measured speedup decreased. When do_float was changed 
to do only integer arithmetic and both tasks (running on 
separate CPUs) executed it, performance degraded severely 
compared to running both tasks on one CPU. One thing we 
can be sure of here is that the instruction mix really 
matters. As any useful program does lots of arithmetic, 
lots of I/O, and plenty of memory accesses, a combination 
of such programs gives the hyperthreaded CPU an 
opportunity to execute instructions from both threads in 
parallel, using otherwise idle execution units.

<p>
Another experiment which I did was to run two jobs - 
one accessing memory in such a way as to generate lots 
of cache misses and the other one doing some floating 
point computations. About 25% speedup was observed when 
the tasks were made to run on two CPUs. 

<p>
<a href="misc/pramode/3.c.txt">[Listing 3]</a>


<h3> Playing with the kernel </h3>
<p>
The 2.6 kernel has brought in lots of scalability 
enhancements which make Linux a platform of choice for 
heavy duty server applications. There have been radical 
changes in many kernel subsystems, and the programming 
interface which the kernel presents to driver writers 
has changed in many places. Let's first try to build a 
kernel module using the new `kbuild' mechanism.

<h3> Compiling a simple module </h3>

<p>
The module does nothing - it can be loaded and 
unloaded, that's all. 

<p>
<a href="misc/pramode/4.c.txt">[Listing 4, a simple kernel module]</a>

<p>
We need a Makefile containing the following lines (4.c 
is the name of our module source file; note that the 
command under `default' must be indented with a tab):

<pre>

obj-m := 4.o

default:
	make -C /usr/src/linux-2.6.5 SUBDIRS=`pwd` modules

</pre>

<p>
Now, just invoke make and the kernel build system will 
automagically build the module. More details can be 
found in documents under <b>Documentation/kbuild</b> of the 
kernel source tree. Note that the object file to be 
loaded is named 4.ko.


<h3> Understanding race conditions </h3>

<p>
When two CPUs try to execute code which modifies 
shared objects in parallel, the outcome will depend on 
the order in which the accesses/modifications take 
place. For the remaining part of this discussion, we 
shall assume that if two CPUs try to access memory in 
parallel, some kind of hardware arbitration circuit 
will serialize the access - but which CPU gets access 
first is essentially unpredictable.

<p>
The SMP-aware Linux kernel allows multiple CPUs to 
execute kernel as well as user level code in parallel. 
Let's think of what would happen if code running on two 
CPUs tries to execute the following sequence simultaneously.

<pre>

temp = count;
temp = temp + 1;
count = temp;

</pre>

The variable count is a global object shared by both 
the executing tasks whereas temp is local to each task. 
One execution timeline can be:

<p>
Both CPUs try to do temp=count in parallel. Say CPU0 
gets access to memory first. Code running on it will 
store the current value of count (say 0) in temp. Next, 
CPU1 gets access to memory. The task running on CPU1 
also stores the current value of count (which is 0) in 
a local variable. Now, both CPUs increment their local 
copies. Both CPUs write back the value of temp (which 
is 1) to count. Note that even though both CPUs have 
executed code to increment count, because of execution 
overlaps, the count has become 1 and not 2. This is a 
really big problem. The problem would not have occurred 
had the code running on one CPU, say CPU1, read the 
value of count after the code running on the other CPU 
had finished incrementing it. But there is no such 
guarantee on a complex multiprocessor system.

<p>
Let's try to demonstrate this race condition 
experimentally. Even though we are not using a real 
dual CPU system here, the behaviour will be identical. 
We will write a simple character driver with a 
`device_write' method which keeps updating a global 
variable `count' in a loop. We will run two tasks, 
initially, both on the same CPU. The tasks will open a 
device file and invoke the `write' system call many times 
in a loop, with the effect that the code in the kernel 
gets activated. If each task invokes `write' 100 times, 
and each device_write updates the global `count' variable 
100 times, then the total count should be, after both 
the tasks are over, 20000. Any other value is an 
indication of a race condition. We will experiment with 
both tasks running on the same CPU and on different CPUs.

<p>
<a href="misc/pramode/5.c.txt">[Listing 5, kernel module containing a race condition]</a>

<p>
The test program forks a child process. Both the parent 
and child execute `write' on the same file descriptor, 
with the effect that `foo_write' gets invoked a total of 
200000000 times. Each `foo_write' increments the count by 
1 - if there is no race condition, the count should be 
200000000. We do have a race here, so the count should 
be less than 200000000, which is what we observe when 
the parent and the child run on different CPUs. But we 
have a surprising result - even when the parent and 
child run on the same CPU, we get values less than 
200000000. Why?

<p>
<a href="misc/pramode/6.c.txt">[Listing 6, testing the kernel module]</a>

<p>
The answer is that I have compiled the 2.6 kernel with 
kernel preemption enabled. Classical uniprocessor Unix 
kernels are non-preemptive. That is, if a task enters 
the kernel, until it comes out, no other task would be 
scheduled. This design reduces problems associated with 
concurrency a lot. Kernel code which modifies a global 
data structure (say a linked list) need not be worried 
that it will be preempted during the execution of a 
critical section and some other task would enter kernel 
mode and access/modify the same data structure 
resulting in race conditions. The trouble with this 
design is that tasks which execute for a long time in 
the kernel would reduce the responsiveness of the 
system by not letting interactive tasks preempt them. 
The 2.6 kernel can be compiled with preemption enabled. 
The result is that even if both tasks are running on 
the same CPU, one task may be preempted after it has 
read and stored the value of count in the temporary 
variable and before it writes back the incremented 
value. This will result in race conditions cropping up.

<p>
The solution is to disable preemption by calling 
<b>preempt_disable</b> and enable it once again, after 
the critical section is over, by calling <b>preempt_enable</b>.
These functions basically manipulate a preemption 
count, which can be read by calling <b>preempt_count</b>.
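<p>
With that change, the critical section in the write method 
would be bracketed like this (a sketch; note that this cures 
only the preemption race on one CPU, not the race between 
two CPUs):

<pre>

preempt_disable();      /* increments the preemption count */
temp = count;
temp = temp + 1;
count = temp;
preempt_enable();       /* decrements it; rescheduling may happen now */

</pre>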


<h3>Doing an `atomic' increment</h3>

<p>
What if we change the three line sequence which 
increments count to just one line?

<pre>

count++;

</pre>
<p>
We have no guarantee that race conditions will not 
occur in this case either. The x86 compiler may 
translate count++ into three separate instructions: one 
which fetches the value from memory into a register, one 
which increments the register, and one which stores it 
back to the memory location. The GNU C Compiler version 
3.3.2 on my system does in fact translate count++ (count 
is declared as volatile) into three instructions. If we 
wish to make sure that count is incremented using 
just one assembly instruction, we will have to make use 
of inline assembly code like this:

<pre>

__asm__ __volatile__ (
   "incl %0\n\t"
   :"=m"(count)
   :"m"(count)
);

</pre>
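<p>
For contrast, the code gcc generates for a plain count++ 
looks roughly like this (AT&amp;T syntax; the exact register 
may differ):

<pre>

movl    count, %eax
incl    %eax
movl    %eax, count

</pre>

<p>
Either of the two boundaries between these three instructions 
is a point where a preemption, or an interleaved access from 
the other CPU, can slip in.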

<p>
Modifying the program this way ensures that if both 
tasks are running on one CPU, and even if preemption is 
not disabled, we get the correct results. This is 
because the current task won't be preempted until the 
assembly instruction it is executing is over. What if 
we make the tasks run on different CPUs? We note that 
again we are getting random values for count. This is 
because the `incl' instruction operates internally by 
bringing the value from memory to some place within the 
CPU, incrementing it, and then transferring it back to 
the same location. The memory system hardware ensures 
that only one CPU can read from a location at any one 
instant, but once that read is over, there is no 
guarantee that the other CPU is kept from reading the 
location until the first CPU's whole read-modify-write 
sequence is over. So if two CPUs execute this sequence 
in parallel, both may load the same value, increment it, 
and write it back, with the result that the count gets 
incremented by one and not two. This is the case even on 
Hyperthreading systems - the hardware only guarantees 
that the load-store operations of one thread happen in 
sequence; they may get intermixed with the load-store 
sequences generated by the other thread.

<p>
What is the solution to this problem? On x86 systems, 
the instruction prefix `lock' can be used to make certain 
assembly instructions atomic. The Intel manuals say 
that `lock' can be used with only a few instructions (inc 
is one of them). The hardware guarantees that 
two `lock'-prefixed instructions running on two 
different CPUs and trying to access the same memory 
location will execute mutually exclusively - 
that is, only after one instruction runs to completion 
(the full read-modify-write sequence) will the other 
instruction start executing.

<pre>

__asm__ __volatile__ (
   "lock; incl %0\n\t"
   :"=m"(count)
   :"m"(count)
);

</pre>

<p>
The modified program runs slower, but it gives us the 
correct count of 200000000. 

<h3> The atomic swap operation </h3>

<p>
The x86 XCHG instruction swaps a 32 bit register with a 
memory location. Even if it is not prefixed with a 
lock, it is guaranteed that XCHG will be atomic. It can 
be used as a building block for many synchronization 
primitives like the semaphore. Let's try to implement 
an atomic increment operation using the atomic swap.

<p>
<a href="misc/pramode/7.c.txt">[Listing 7, implementing atomic increment]</a>

                



<h3> Using the builtin atomic functions </h3>

<p>
The kernel header <b>include/asm/atomic.h</b> defines a few 
inline functions and macros which can be used for 
performing atomic operations.

<pre>

atomic_t count = ATOMIC_INIT(0);   /* declare and initialize to 0 */
atomic_read(&amp;count);             /* fetch the current value      */
atomic_set(&amp;count, 2);           /* store a new value            */
atomic_inc(&amp;count);              /* increment atomically         */

</pre>
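<p>
With this API, our racy write method becomes trivial (a 
sketch; the names are from the earlier example):

<pre>

static atomic_t count = ATOMIC_INIT(0);

static ssize_t
foo_write(struct file *filp, const char __user *buf,
          size_t len, loff_t *offp)
{
        atomic_inc(&amp;count);   /* a `lock'-prefixed incl on SMP x86 */
        return len;
}

</pre>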


<h3> The spin lock </h3>

<p>
Let's now try to understand how the <b>spin lock</b>, a
very fundamental multiprocessor synchronization mechanism, works. 
The problem is to prevent kernel code running on two 
CPUs from accessing a shared data structure 
concurrently. Had the data structure been something as 
simple as an integer (or some other basic type) and the 
operation been as simple as performing an addition or a 
subtraction or an increment, the atomic operations 
would have been sufficient. But what we wish to do is 
guarantee mutual exclusion between two threads running on 
two different CPUs, accessing an arbitrarily complex 
data structure using an arbitrary sequence of 
instructions. We need something more general than the 
atomic add, sub, or inc operations.

<h3> A flawed solution </h3>

<p>
Let the tasks running on both the CPUs look at a 
shared variable (let's call it `lock') before executing 
the section of code which has got to be mutually 
exclusive. If the value of lock is zero (lock is in the 
unlocked state), make it one and enter into the 
critical section. If not, keep on spinning until the 
lock becomes zero again. 

<p>
<a href="misc/pramode/8.c.txt">[Listing 8]</a>

<pre>

/* Flawed: the test and the set are two separate steps. */
void acquire_lock(volatile int *lockptr)
{
        while(*lockptr)
                ;               /* spin until the lock looks free  */
        *lockptr = 1;           /* another CPU may get here first! */
}

void release_lock(volatile int *lockptr)
{
        *lockptr = 0;
}

</pre>


<p>
Let's say code running on two CPUs is trying to 
increment a variable `count' by first storing it in 
`tmp'. We try to prevent race conditions 
from occurring by calling acquire_lock before the code 
sequence begins and release_lock after the code 
sequence ends.

<pre>

acquire_lock(&amp;lock);

tmp = count;
tmp++;
count = tmp;

release_lock(&amp;lock);

</pre>

<p>
Testing the code shows that it is not working as 
expected. The reason is simple. Suppose code executing 
on both the CPUs calls acquire_lock simultaneously. A 
possible sequence of events would be:

<ol>

<li>Code running on CPU0 reads *lockptr and sees that it is zero

<li>Before code running on CPU0 makes *lockptr equal to 
one, code running on CPU1 sees that *lockptr is zero.

<li>Both code segments DO NOT enter into the while loop

<li>Code on CPU0 makes *lockptr equal to 1

<li>Code on CPU0 happily enters the `critical section'

<li>Code on CPU1 also makes *lockptr equal to 1 and enters 
the critical section

</ol>

<p>
The problem arose because testing whether *lockptr is 
zero and then setting it to one was not done atomically.

<h3> Another Attempt </h3>

<p>
The kernel header file include/asm/bitops.h presents an atomic 
test_and_set_bit function which takes a bit number and an 
address, sets that bit of the location pointed to 
by the address, and returns true if the bit was 
initially set and false if it was clear. Let's use this 
function to implement another version of the spin lock.

<p>
<a href="misc/pramode/9.c.txt">[Listing 9]</a>

<pre>

/* Spin locks, another attempt */

void acquire_lock(volatile unsigned long *lockptr)
{
        while (test_and_set_bit(0, lockptr))
                ;       /* spin: someone else holds the lock */
}

void release_lock(volatile unsigned long *lockptr)
{
        clear_bit(0, lockptr);
}

</pre>

<p>
Let's try to understand the logic. Assume that all the 
bits of the lock are initially clear (we use only one 
bit in this implementation). What if two CPUs try to 
execute <b>test_and_set_bit</b> concurrently? The 
implementation guarantees that one CPU will complete 
the operation before the other one begins it. That is, 
the value returned on one CPU is going to be definitely 
false and on the other one, definitely true. The CPU which 
gets the value 0 has succeeded in acquiring the lock 
while the other CPU keeps spinning. The spin loop exits 
only when the CPU which acquired the lock releases it 
by clearing the lock bit. This code should work 
properly. It does, on my system (though the fact that you 
keep getting the correct answer in tens of thousands of 
test runs is no guarantee that your implementation is 
correct - concurrent programming is too tough for such 
simplistic assumptions). The only trouble is 
that it takes a very long time to get the output when 
it is used with our kernel module which keeps 
incrementing count. But that is to be expected. Locking 
does degrade performance.

<h3> The Linux Kernel Spinlock </h3>
<p>
It's easy to use the spin locks provided by the Linux 
kernel. The kernel defines a new type spinlock_t. A 
spin lock is a variable of this type. The lock is 
initialized by a macro SPIN_LOCK_UNLOCKED. 

<pre>

spinlock_t lock = SPIN_LOCK_UNLOCKED;
spin_lock(&amp;lock);   /* acquire the lock, spinning until it is free */
spin_unlock(&amp;lock); /* release the lock */

</pre>

<p>
The x86 kernel implementation contains some hacks to 
improve locking performance - readers can browse 
through include/linux/spinlock.h and include/asm/spinlock.h.

<p>
If a thread executing in kernel space acquires a spin 
lock, and an interrupt service routine starts running 
on the same CPU, the ISR itself trying to 
acquire the lock would create problems. The ISR would 
keep on spinning, which will result in the lock holder 
not being able to release the lock. If code running on 
some other CPU is also spinning for the lock to be 
released, nobody would be able to make progress. So 
there are IRQ-safe versions of the lock and unlock 
functions - they are called <b>spin_lock_irqsave</b> and 
<b>spin_unlock_irqrestore</b>. The first one saves the current 
state of local interrupts, disables local interrupts, 
and tries to acquire the lock. The second one releases 
the lock and restores the state of local interrupts.
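<p>
Usage is symmetric with the plain versions (a sketch; the 
lock name is mine):

<pre>

spinlock_t mylock = SPIN_LOCK_UNLOCKED;
unsigned long flags;

spin_lock_irqsave(&amp;mylock, flags);       /* local IRQs now off */
/* ... critical section shared with an interrupt handler ... */
spin_unlock_irqrestore(&amp;mylock, flags);  /* IRQ state restored */

</pre>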

<p>
On uniprocessor kernels, the spin_lock and spin_unlock 
functions are compiled away - they expand to empty 
statements. But on kernels with preemption, they 
disable and enable preemption.

<h3> Deadlocks </h3>

<p>
If not done properly, locking can create subtle 
problems which are difficult to debug. If a task 
running in kernel mode wants to access/modify two data 
structures, both protected by a lock (say to copy from 
one to the other), then the order in which the locks 
are acquired becomes significant. Say one task does:

<pre>

lock A
lock B

---critical section---

unlock B
unlock A

</pre>

and another task does:

<pre>

lock B
lock A

---critical section---

unlock A
unlock B

</pre>

There is a danger of deadlock if both CPUs complete 
the first locking operation in parallel (say the 
task running on CPU0 acquires lock A and the task running 
on CPU1 acquires lock B). Now, the task on CPU0 will 
spin trying to acquire lock B, and the task on 
CPU1 will spin trying to acquire lock A. Neither 
loop will ever terminate, and the 
system will freeze. Be careful when you try out the 
code I present here - you might have to press the reset 
switch! The moral is that all tasks should acquire 
locks only in one particular order.
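<p>
In kernel C, the broken pattern looks like this (a sketch of 
the idea behind Listing 10):

<pre>

static spinlock_t lock_a = SPIN_LOCK_UNLOCKED;
static spinlock_t lock_b = SPIN_LOCK_UNLOCKED;

/* path taken by one task */
spin_lock(&amp;lock_a);
spin_lock(&amp;lock_b);
/* ... critical section ... */
spin_unlock(&amp;lock_b);
spin_unlock(&amp;lock_a);

/* path taken by the other task - opposite order: deadlock! */
spin_lock(&amp;lock_b);
spin_lock(&amp;lock_a);
/* ... critical section ... */
spin_unlock(&amp;lock_a);
spin_unlock(&amp;lock_b);

</pre>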

<p>
<a href="misc/pramode/10.c.txt">[Listing 10, kernel module demonstrating deadlock]</a>

<p>
Here is a user space program which triggers the
deadlock:

<p>
<a href="misc/pramode/11.c.txt">[Listing 11]</a>

<p>
Simply running the user space program a few times from 
the command prompt did not result in any problem on my 
machine. The reason might be that code running on one CPU 
started only after code running on the other 
had locked both lock_a and lock_b. Running the program 
within a simple shell loop soon made the system freeze.

<pre>

while true
do
./11    # name of the test program
done

</pre>

<p>
This is one of the problems with code containing 
potential deadlocks - sometimes they do not manifest 
themselves, even if we test thoroughly.

<h3> Sequence locks </h3>

<p>
Sequence locks can be used in situations where a writer 
should get immediate access to a critical section even 
in the presence of multiple readers. Readers should 
realize that they have got inconsistent results because 
of a writer interfering, and they should be prepared to 
try reading again and again till they get consistent 
results. A writer should never interfere while another 
writer is active in the critical section. The code 
which implements this locking scheme is fairly simple 
and has been introduced as part of the 2.6 kernel.

<p>
Let's first look at a `toy' program which demonstrates 
the use of sequence locks. We have a writer process which 
keeps incrementing a 32 bit counter implemented as two 
independent 16 bit counters. The idea is to increment 
the higher 16 bits when the lower 16 bits overflow from 
0xffff to 0x0. A reader process keeps reading the two 
16 bit counters and combines them into a 32 bit count value. 
At no point of time should the reader get a count 
lower than the previous count. The maximum 
count in the two 16 bit counters taken together at any 
point of time will always be less than 0xffffffff.

<p>
Two scenarios can give inconsistent results. Say the 
current value of the count is 0x6FFFF (stored as 0xFFFF 
and 0x6 in the two independent counters). The writer 
changes 0xFFFF to 0x0, which the reader reads and 
records as the lower 16 bits of the count. Next, if the 
reader reads the higher 16 bits before the writer 
increments them, the reader would get 0x6. So the combined 
value will be 0x60000, which is less than the previous 
count 0x6FFFF. The problem here is that the reader is 
interrupting the writer.

<p>
Another scenario. Say current total is 0x2FFFF. The 
reader reads the higher 16 bits and gets 0x2. Before 
the reader can sum this up with the lower 16 bits, the 
writer starts running and changes 0x2FFFF to 0x30000, 
0x30001 etc. It is possible that the reader will run 
again only when the count becomes say 0x30006. At this 
point, the reader will get the lower 16 bits, 0x0006. 
Reader combines both parts and gets 0x20006 which is 
less than 0x2FFFF. 
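<p>
The unprotected counter looks roughly like this (a sketch of 
the idea behind Listing 12; the names are mine):

<pre>

static unsigned int low16, high16;      /* the two halves */

static void writer_increment(void)
{
        if (low16 == 0xffff) {
                low16 = 0;      /* a reader peeking between    */
                high16++;       /* these stores sees a bad sum */
        } else {
                low16++;
        }
}

static unsigned int reader_combine(void)
{
        /* may pair a stale half with a fresh one */
        return (high16 &lt;&lt; 16) | low16;
}

</pre>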

<p>
<a href="misc/pramode/12.c.txt">[Listing 12, kernel module demonstrating reader/writer problem]</a>

<p>
Here is a user space test program:

<p>
<a href="misc/pramode/13.c.txt">[Listing 13]</a>

<p>
In situations where there are only a few writers and 
many readers running concurrently, where a writer is so 
impatient that he is not prepared to wait for any of 
the readers, sequence locks can be employed 
effectively. Examples of usage can be found in 
kernel/timer.c. 

<p>
We will modify our program to use seq locks - we shall 
then look at how these locks are implemented.

<p>
<a href="misc/pramode/14.c.txt">[Listing 14, Using sequence locks to solver reader-writer problem]</a>

<p>
Now let's look at the kernel implementation of the 
sequence lock.

<p>
<a href="misc/pramode/15.c.txt">[Listing 15]</a>
<pre>

typedef struct {
        unsigned sequence;
        spinlock_t lock;

} seqlock_t;

#define SEQLOCK_UNLOCKED { 0, SPIN_LOCK_UNLOCKED }

static inline void 
write_seqlock(seqlock_t *sl)
{

        spin_lock(&amp;sl-&gt;lock);
        ++sl-&gt;sequence;
        smp_wmb();                      
}       

static inline void 
write_sequnlock(seqlock_t *sl) 
{
        smp_wmb();
        sl-&gt;sequence++;
        spin_unlock(&amp;sl-&gt;lock);
}

static inline unsigned 
read_seqbegin(const seqlock_t *sl)
{
        unsigned ret = sl-&gt;sequence;
        smp_rmb();
        return ret;
}

static inline int 
read_seqretry(const seqlock_t *sl, unsigned iv)
{
        smp_rmb();
        return (iv &amp; 1) | (sl-&gt;sequence ^ iv);
}

</pre>

<p>
We note that seqlock_t is a structure which contains a 
sequence number and a spin lock - the initial sequence 
number will be 0 and the spin lock will be in the 
unlocked state. A writer will always have to acquire 
this spinlock to enter the critical section - this 
guards against other writers. Once the writer enters 
the critical section, it increments the sequence 
number. Once the writer leaves the critical section, 
the sequence number gets incremented again and the lock 
is unlocked. The important point here is that while the 
writer is operating, the sequence number would be odd. 
No writer in the critical section means sequence number 
will be even. A reader enters the critical section 
without grabbing any lock. If it is lucky, it won't be 
interrupting any writer and no writer will interrupt 
it. How does the reader know that it has been lucky 
enough? Simple: before doing its critical section, the 
reader saves away the sequence number of the 
associated lock. Then, after the critical section is 
over, the reader checks for two things:

<ol>
<li> Whether the saved sequence number was odd. If so, it 
  means that the reader was running while a writer also 
  was there in the critical section.

<li> Whether the saved sequence number and the current 
  sequence number are the same. If not, it means that a 
  writer had interrupted the reading process.
</ol>

<p>
If there have been no interruptions, the reader is happy 
with the value it has got - it's surely a 
consistent value. Otherwise, the reader redoes the 
reading process till it gets a consistent result. The 
read_seqretry function is the heart of this technique.
Its body looks like this:

<pre>

return (iv &amp; 1) | (sl-&gt;sequence ^ iv);

</pre>
<p>
The first part checks whether iv, the old sequence 
number, is odd. The second part checks whether 
the old and the new sequence numbers are identical. The 
smp_rmb and smp_wmb functions which you see in the code 
are there to handle out of order loads and stores, which I 
don't think Intel CPUs do - so those functions need 
not hinder your understanding of the code.


<h3>Conclusion</h3>
<p>
I have tried to present a few things about hyperthreading
and concurrency control in this article. Concurrency is
tricky business - and I am not an expert. I shall be grateful to
readers who point out errors/inconsistencies, if any.

<h3> Further Reading </h3>
<p>
Hyperthreading technology is explained in depth in
this <a href="http://www.intel.com/technology/itj/2002/volume06issue01/art01_hyper/p01_abstract.htm">
Intel Document</a>.
An excellent <a href="http://arstechnica.com/paedia/h/hyperthreading/hyperthreading-1.html">
Ars Technica</a>
document looks at things in a much more graphic and
intuitive way. The <a href="http://www.extremetech.com/article2/0,1558,165010,00.asp">PC
Processor Microarchitecture Technology Primer</a> is an
excellent read for those who are interested in getting
some idea of how stuff like superscalar processing,
out of order execution, branch prediction, etc. works
without resorting to heavy-duty textbooks.

<p>
Readers who are new to inline assembly might refer to
this <a href="http://www-106.ibm.com/developerworks/library/l-ia.html">
IBM Developerworks article</a> for enlightenment. It's
impossible to read kernel code without a rudimentary
understanding of this powerful feature of the GNU C Compiler.

<p>
There is an interesting discussion archived <a href="http://www.sandpile.org/post/msgs/20003998.htm">
here</a> which deals with the necessity of the `lock' prefix on
hyperthreaded systems.

<p>
The book <b>Unix Systems for Modern Architectures</b>
by <b>Curt Schimmel</b> is a must-read for programmers
working on modern SMP capable Unix kernels. The 
implementation of atomic increment using atomic
swap has been lifted from the solution to problem 8.9
of this book. An excellent book for the kernel
newbie is <b>Linux Kernel Development</b> by
<b>Robert Love</b>, of preemptive
kernel fame.


</p>


<!-- *** BEGIN author bio *** -->
<P>&nbsp;
<P>
<!-- *** BEGIN bio *** -->
<hr>
<P>
<img ALIGN="LEFT" ALT="[BIO]" SRC="../gx/2002/note.png" class="bio">
<em>
I am an instructor working for IC Software in Kerala, India. I would have loved
becoming an organic chemist, but I do the second best thing possible, which is
play with Linux and teach programming!
</em>
<br CLEAR="all">
<!-- *** END bio *** -->

<!-- *** END author bio *** -->

<div id="articlefooter">

<p>
Copyright &copy; 2004, Pramode C.E.. Copying license 
<a href="http://linuxgazette.net/copying.html">http://linuxgazette.net/copying.html</a>
</p>

<p>
Published in Issue 103 of Linux Gazette, June 2004
</p>

</div>


<div id="previousnextbottom">
<A HREF="okopnik.html" >&lt;-- prev</A> | <A HREF="ecol.html" >next --&gt;</A>
</div>


</div>






<div id="navigation">

<a href="../index.html">Home</a>
<a href="../faq/index.html">FAQ</a>
<a href="../lg_index.html">Site Map</a>
<a href="../mirrors.html">Mirrors</a>
<a href="../mirrors.html">Translations</a>
<a href="../search.html">Search</a>
<a href="../archives.html">Archives</a>
<a href="../authors/index.html">Authors</a>
<a href="../contact.html">Contact Us</a>

</div>



<div id="breadcrumbs">

<a href="../index.html">Home</a> &gt; 
<a href="index.html">June 2004 (#103)</a> &gt; 
Article

</div>





<img src="../gx/2003/sit3-shine.7-2.gif" id="tux" alt="Tux"/>




</body>
</html>