\documentstyle[A4,twocolumn]{article}

\title{Adaptive locks for frequently scheduled tasks with\\
unpredictable runtimes}

\author{Mark Burgess and Demosthenes Skipitaris\\~\\\em Centre of
Science and Technology, Faculty of Engineering\\\em Oslo College, Cort
Adelers Gate 30\\\em N-0254 Oslo, Norway}

\begin{document}
\maketitle

\begin{abstract}
We present a form of discretionary lock which is designed to render
unreliable but frequently scheduled scripts or programs predictable
even when the execution time of locked operations may grow and exceed
their expected scheduling interval. We implement our locking policy
with lock-unlock semantics and test it on the system administration
language cfengine. The locks are controlled by too-soon and too-late
parameters so that execution times can be controlled within fixed
bounds even when scheduling requests occur randomly in addition to the
periodic scheduling time. This has the added bonus of providing an
anti-spamming functionality.

\end{abstract}

\section{Introduction}

When two or more instantiations of a program use a resource
concurrently, it can lead to contention between the competing
processes and unpredictable results. Application programs (for
example, mail-readers or daemons) usually avoid this situation with
the help of a lock so that only a single instantiation can run at a
given time. A lock is a device which assures that only one process can
use a specific resource at a given time\cite{lockdef,lockdef2}; an
application lock makes the execution of an application program into an
exclusive resource. A typical approach to locking is to write a short
file which contains the process ID of the running
application\cite{applock}.  The file has a unique name and is
therefore trivially-locatable by multiple instantiations of the
program.  An application lock is generally adequate for programs which
start and then expect to continue more or less indefinitely, e.g.\
daemons such as {\tt cron} and mail readers, but it can lead to
unfortunate problems if used to block frequently or periodically run
programs or scripts. If the duration of such a program is capable of
exceeding its scheduling interval, then there could be an overlap
between instantiations of the program or a failure of the program to
be started at the correct time, and this must be dealt with in a
sensible way.  This problem is of particular interest in connection
with automated system administration where complex scripts are often
scheduled by cron, but may also be started by hand.
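
As an aside, the conventional PID-file approach referred to above
might be sketched roughly as follows. This is an illustration of the
traditional scheme only, not cfengine code; the lock file name is
hypothetical:

\small
\begin{verbatim}
  /* Sketch of a conventional PID-file lock
     (illustration only, not cfengine code) */

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <fcntl.h>
  #include <unistd.h>

  int main(void)
  {
  const char *lockfile = "/tmp/mydaemon.pid";
  char buf[32];
  int fd;

  /* O_EXCL: only one process wins the race */
  fd = open(lockfile,
            O_WRONLY|O_CREAT|O_EXCL, 0644);

  if (fd < 0)
     {
     fprintf(stderr,"already running\n");
     exit(1);
     }

  sprintf(buf,"%d\n",(int)getpid());
  write(fd,buf,strlen(buf));
  close(fd);

  /* ... do the real work ... */

  unlink(lockfile);  /* release on exit */
  return 0;
  }
\end{verbatim}
\normalsize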

Let us give a concrete example. Consider a program, run hourly by
cron, which executes a remote command on a series of hosts; one can
imagine a program which distributes or makes copies of key files. The
time for this job to complete depends on many factors: the size of the
files, the speed of the network, the load on the participating hosts
etc. If the network latency time were high, or if an RPC error
occurred then this script or program could hang completely or fail to
complete inside its allotted hour. After the next elapsed hour, the
hanging lock would result in an annoying error and a failure of the
program to perform its task. If left unlocked, there could be
contention between multiple instantiations of the program and
inconsistent results.

Network services present a related problem. Consider a program which
is initiated by a network connection to a particular port for the
purposes of updating one or more resources. It is desirable to lock
such a program to avoid contention between multiple connections. It
might be appropriate to lock the entire process with a single threaded
connection. Service demultiplexers like {\tt inetd} contain the
functionality required to serialize access.  On the other hand, it
might be better to lock only specific resources.  We might wish to go
even further and restrict the frequency at which the program can be
run at all. Such a contingency could be used to prevent spamming of
the network connection or even the accidental wastage of CPU time.
All of the above examples may be thought of in terms of resource
sharing in a concurrent environment.

If we focus our attention more on the problem of resource control, we
gain a new perspective on the problem of multiple program
instantiations.  Rather than locking an entire program, we lock
smaller parts which are independent.  The idea here is that it is
useful to use discretionary locks to control only specific resources
required within a program\cite{tanenbaum1,transactions}.  Such `local'
locks might allow a program to run partially, blocking collisions, but
would admit access to the busy resources on a one-by-one basis. This
might not always be desirable though: the scheme could result in
awkward problems if the resources were critical to the operation of
the program as a whole.  The program might be forced to exit without
performing its function at all and thus time spent executing the
partial program would only be wasted CPU time.  Unlike locking of
kernel resources, or database operations, the co-existence of multiple
programs does not necessarily require us to preserve every operation
and serialize them; it is sometimes sufficient to ensure that only the
most up-to-date instantiation is allowed to run at a given time. It
could also be acceptable to have different instantiations of a program
running concurrently, but in such a way that they did not interfere.

In spite of all the conditionals in the above, it is possible to
address a large proportion of the cases encountered by system
tasks. In this paper we are interested in the first case where it is
meaningful to lock self-contained `objects' within a larger program
(these are usually referred to as {\em atoms} in related
literature). Such a scheme is appropriate for many system
administration scripts which bundle resource-independent
operations. We aim to create a more intelligent approach to the
locking problem, which is robust to unpredictable failures and which
provides certain assurances against hanging processes or failure to
execute.  We do not exclude multiple instantiations of programs
running in different threads, but instead try to ensure that they
cooperate rather than contend. The granularity of the locking scheme
has to be chosen carefully to achieve sensible and predictable
behaviour and we take some time to describe the behaviour in detail.

The locking semantics we describe here are motivated by our desire to
equip the system administration robot {\tt
cfengine}\cite{cf1,cf2,cf3,cf4} with a flexible but autonomous
mechanism for avoiding contention and spamming in a distributed,
multithreaded environment. By introducing our new locking policy,
cfengine can function as an integrated front-end for cron and
network-initiated scripts, effectively creating a single network-wide
file for starting regular and intermittently scheduled programs which
is protected against spamming attacks or accidental
repetition. Although intended for cfengine, the locking policy we have
arrived at is applicable to any situation where programs are scheduled
on a time scale which is comparable to their runtime. Such locks would
be particularly useful as an addition to scripting languages such as
Perl, Guile, Tcl and even Java.

Traditionally attention has been given in the literature to the
problem of locking of shared memory resources and concurrent database
transactions using discretionary locks, mutexes, semaphores and
monitors\cite{tanenbaum1,transactions}. Resource locks in distributed
systems have also been discussed in connection with fragile
communication links\cite{adaptive,concurrentlock} and independent
paral\-lelism\cite{waitfree}. Application locks do not seem to have
enjoyed the same interest in the literature, perhaps because of their
apparent triviality, but they are widely used in concurrent and shared
applications and are closely related to all of the above issues.

In the present work we wish to illustrate how a simple modification of
the most trivial application locks can lead to enhanced autonomy of
scheduled systems.  By implementing such locks in the automated system
administration robot cfengine we show how this is directly relevant to
the reliability of automated system administration.  Our locks are a
generalization of the concurrent lock concept, ignoring the strong
ordering of the atoms, but including garbage collection and protection
against undesirable repetition.

\section{Locking semantics}

All locking begins by defining {\em atomic operations}, or critical
sections: these are the basic pieces of a program that must run to
completion, without disturbance from third parties. In other
words, atoms are all-or-nothing pieces of a program. Atoms are
protected by {\tt GetLock()}, {\tt ReleaseCurrentLock()} parentheses within
the program code.

\begin{verse}
{\tt GetLock}({\em parameters})\\
~\\
~~~~  /* Atom code */\\
~\\
{\tt ReleaseCurrentLock}()
\end{verse}

Serialized access to these atoms is assured by encapsulating each one
with an exclusive lock.  To create a locking policy, one must find the
most efficient way of implementing resource control. If we lock
objects which are too primitive (fine grain), we risk starting
programs which will only run partially, unable to complete because of
busy resources. This would simply constitute a waste of CPU time. On
the other hand, if we lock objects which are too coarse, logically
independent parts of the program will not be started at all. This is
unnecessary and inefficient. In a concurrent environment there is no
reason why independent atoms could not run in separate threads,
allowing several instantiations of a batch program to `flow through'
one another. This assumes however that the order of the atoms is not
important.

By defining suitable atoms to lock, one is able to optimize the
execution of the tasks in a program.  Several approaches to locking
may be considered. A lock-manager daemon is one possibility. This is
analogous to many network license daemons: a daemon hands out tickets
which are valid for a certain lifetime. After the ticket expires, the
program is considered overdue and should be killed.  A major problem
with a daemon based locking mechanism is that it is highly time
consuming and that it is susceptible to precisely the same problems as
those which cause the uncertainty on program runtimes. Use of {\tt
flock()} is another possibility, but this is not completely
portable. A realistic approach needs to be more compact and efficient.

\section{Implementation}

We have chosen to implement locks using regular files and index nodes.
By using a unique naming algorithm, we are able to instantly `know'
the name of a lock without having to search for it or ask a manager
for the data. This minimizes time consuming calls to the network or to
disk.

In order to secure a unique name, we need to provide enough
information to be able to identify each atom uniquely. This is
presently accomplished by passing a string to the lock function which
can be combined with other elements such as the host on which the lock
was created and any other relevant information. For convenience we
classify atoms with an operator/operand pair. For example, consider a
lock request to edit a file. In this case the operator would be
``edit'' and the operand would be the name of the file. The names must
be processed to expunge unfortunate characters which would lead to
illegal file names.

\small
\begin{verbatim}
    /* Needs <ctype.h>. Swap, in place, any
       character which is illegal in a file
       name for an underscore               */

    void CanonifyName(char *buffer)
    {
    char *sp;

    for (sp = buffer; *sp != '\0'; sp++)
       {
       if (!isalnum((unsigned char)*sp))
          {
          *sp = '_';
          }
       }
    }
\end{verbatim}
\normalsize

We use a function {\tt CanonifyName}({\em name}) which converts its
argument in place into a string suitable for use as a filename. It
suffices to swap illegal characters for an underscore, for instance.

In order to function properly, the lock-name must be different for
each distinct atom, but must be constant over time so that multiple
instantiations of the program will always find the same lock. The time
information should therefore not be coded into the name of the lock;
instead, one relies on the time stamps on the inodes to determine
their age. For example, when editing the file {\tt /etc/motd} on host
{\tt dax}, a lock named

\begin{verbatim}
 lock.cfengine_conf.dax.editfile._etc_motd 
\end{verbatim}

\noindent
would be created.
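
The assembly of such a name might be sketched as follows. The helper
{\tt BuildLockName()} and its buffer sizes are hypothetical,
introduced here purely for illustration; in the schematic {\tt
GetLock()} shown later, the name is simply built with {\tt
sprintf()}.

\small
\begin{verbatim}
  #include <stdio.h>
  #include <string.h>

  void CanonifyName(char *buffer); /* above */

  /* Hypothetical helper: assemble the
     constant, unique lock name for one atom */

  void BuildLockName(char *name, size_t len,
                     const char *host,
                     const char *operator,
                     const char *operand)
  {
  char op[256], obj[256];

  strncpy(op,operator,255);   op[255] = '\0';
  strncpy(obj,operand,255);   obj[255] = '\0';

  CanonifyName(op);   /* "editfile" unchanged */
  CanonifyName(obj);  /* "/etc/motd" becomes
                         "_etc_motd"          */

  snprintf(name,len,
           "lock.cfengine_conf.%s.%s.%s",
           host,op,obj);
  }
\end{verbatim}
\normalsize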

We create two kinds of lock within the {\tt GetLock()} call: a lock
for active threads of execution which blocks multiple instantiations
of a process, and a permanent lock which records the last time at
which the resource was accessed. See figure \ref{components}. The
latter information can be encapsulated in a single inode without using
any disk blocks and provides the information necessary to restrict the
frequency of access. We call this an anti-spamming lock.

\begin{figure}[ht]
\centerline{
\begin{picture}(190,100)(0,0)
\put(100,80){\oval(80,40)}
\put(100,80){\makebox(0,0){Atom}}
\put(  0,1){\framebox(70,30){Last lock}}
\put(120,1){\dashbox{3}(70,30){Active lock}}
\put(100,59){\vector(-2,-1){53}}
\put(100,59){\vector( 2,-1){53}}
\end{picture}
}
\caption{The adaptive lock components for an atom.}
\label{components}
\end{figure}

If a lock already exists for a specified atom, and that lock has not
expired, the atom remains locked and access to the atom is denied.
Lock expiry occurs when a certain predefined period of time has
elapsed since the active lock was created. In this case, a garbage
collection mechanism attempts to carefully eliminate the process
attached to the hanging lock (if it still exists) and then remove the
old lock, replacing it with a new one and permitting the killing
process to take over the task.  The third possibility is that no
active lock exists for an atom, but that the time since its previous
execution is too short.  This information is gleaned from the
permanent lock. In that case access to the atom is also denied. This
feature gives us the `anti-spamming' functionality.
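
Because the permanent lock carries no data, it can be maintained
simply by touching an inode. A minimal sketch, with a hypothetical
function name and no error handling, might look like this:

\small
\begin{verbatim}
  #include <sys/types.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <utime.h>

  /* Touch the permanent `last' lock so that
     its mtime records when the atom last
     completed. Uses one inode, no data blocks */

  void TouchLastLock(const char *lastlock)
  {
  int fd = open(lastlock,
                O_WRONLY|O_CREAT, 0600);

  if (fd >= 0)
     {
     close(fd);
     }

  utime(lastlock, NULL);  /* mtime := now */
  }
\end{verbatim}
\normalsize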

\begin{figure*}
\centerline{
\begin{picture}(400,213)(0,0)
\put(0,12){\vector(1,0){400}}
%\put(0,12){\vector(0,1){84}}
\put(50,0){$\Delta t$}
\put(150,0){$\Delta t$}
\put(250,0){$\Delta t$}
\put(350,0){$\Delta t$}
%
\put(  0,8){\line(0,1){8}}
\put(100,8){\line(0,1){8}}
\put(200,8){\line(0,1){8}}
\put(300,8){\line(0,1){8}}
\multiput(  0,8)(0,3){69}{\line(0,1){0.3}}
\multiput(100,8)(0,3){69}{\line(0,1){0.3}}
\multiput(200,8)(0,3){32}{\line(0,1){0.3}}
\multiput(200,129)(0,3){29}{\line(0,1){0.3}}
\multiput(300,8)(0,3){32}{\line(0,1){0.3}}
\multiput(300,129)(0,3){29}{\line(0,1){0.3}}
%
\put(10,54){\vector(-1,0){10}}
\put( 0,50){\line(0,1){8}}
\put(58,54){\vector( 1,0){10}}
\put(68,50){\line(0,1){8}}
\put(10,51){{\tt IfElapsed}}
%
\put(159,138){\vector(-1,0){59}}
\put(100,134){\line(0,1){8}}
\put(219,138){\vector( 1,0){59}}
\put(278,134){\line(0,1){8}}
\put(160,135){{\tt ExpireAfter}}
%
\put(300,188){\framebox(80,25){New cfengine}}
\put(200,146){\dashbox{3}(50,25){Locked}}
\put(300,104){\makebox(30,25){Killed}}
\put(100,104){\framebox(230,25){Long running cfengine}}
\put( 52,62){\dashbox{3}(45,25){Too soon}}
\put(  0,20){\framebox(43,25){Cfengine}}
%
\multiput(300,104)(0,1){25}{\line(0,1){0.2}}
\multiput(301,104)(0,1){25}{\line(0,1){0.2}}
\multiput(302,104)(0,1){25}{\line(0,1){0.2}}
\multiput(303,104)(0,1){25}{\line(0,1){0.2}}
\multiput(304,104)(0,1){25}{\line(0,1){0.2}}
\multiput(305,104)(0,1){25}{\line(0,1){0.2}}
\multiput(306,104)(0,1){25}{\line(0,1){0.2}}
\multiput(307,104)(0,1){25}{\line(0,1){0.2}}
\multiput(308,104)(0,1){25}{\line(0,1){0.2}}
\multiput(309,104)(0,1){25}{\line(0,1){0.2}}
\multiput(310,104)(0,1){25}{\line(0,1){0.2}}
\multiput(311,104)(0,1){25}{\line(0,1){0.2}}
\multiput(312,104)(0,1){25}{\line(0,1){0.2}}
\multiput(313,104)(0,1){25}{\line(0,1){0.2}}
\multiput(314,104)(0,1){25}{\line(0,1){0.2}}
\multiput(315,104)(0,1){25}{\line(0,1){0.2}}
\multiput(316,104)(0,1){25}{\line(0,1){0.2}}
\multiput(317,104)(0,1){25}{\line(0,1){0.2}}
\multiput(318,104)(0,1){25}{\line(0,1){0.2}}
\multiput(319,104)(0,1){25}{\line(0,1){0.2}}
\multiput(320,104)(0,1){25}{\line(0,1){0.2}}
\multiput(321,104)(0,1){25}{\line(0,1){0.2}}
\multiput(322,104)(0,1){25}{\line(0,1){0.2}}
\multiput(323,104)(0,1){25}{\line(0,1){0.2}}
\multiput(324,104)(0,1){25}{\line(0,1){0.2}}
\multiput(325,104)(0,1){25}{\line(0,1){0.2}}
\multiput(326,104)(0,1){25}{\line(0,1){0.2}}
\multiput(327,104)(0,1){25}{\line(0,1){0.2}}
\multiput(328,104)(0,1){25}{\line(0,1){0.2}}
\multiput(329,104)(0,1){25}{\line(0,1){0.2}}
\multiput(330,104)(0,1){25}{\line(0,1){0.2}}
\put(315,188){\vector(0,-1){59}}
\end{picture}
}
\caption{A schematic illustration of the behaviour of locks with
respect to the scheduling interval $\Delta t$ and the parameters {\tt
IfElapsed} and {\tt ExpireAfter}.}
\label{ugh}
\end{figure*}

A record of these lock transactions is kept for subsequent analysis if
required. This indicates when and how locks were created and removed,
thereby allowing problem cases, such as locks which always need to be
removed forcibly, to be traced.

\begin{figure*}
\begin{verbatim}
      GetLock(operator,operand,ifelapsed,expireafter,host,now)
      {
      sprintf(LOG,"%s/program.%s.runlog",LOGDIR,host);
      sprintf(LOCK,"%s/lock.%s.%s.%s",LOCKDIR,host,operator,operand);
      sprintf(LAST,"%s/last.%s.%s.%s",LOCKDIR,host,operator,operand);

      lastcompleted = GetLastLock();       /* Check for non-existent process */
      elapsedtime = (now-lastcompleted) / 60;

      if (elapsedtime < ifelapsed)
         {
         return false;
         }

      lastcompleted = CheckOldLock();       /* Check for existing process */
      elapsedtime = (now-lastcompleted) / 60;

      if (lastcompleted != 0)
         {
         if (elapsedtime >= expireafter)
            {
            pid = GetLockPid();             /* Lock expired */
            KillCarefully(pid);
            unlink(LOCK);
            }
         else
            {
            return false;                   /* Already running */
            }
         }

      SetLock();
      return true;
      }
\end{verbatim}
\caption{A schematic algorithm for implementing the locking
policy. The function call {\tt GetLock()} takes arguments which are
used to build a unique name. Operator and operand pertain to the atom
which is to be locked. The expiry time and elapsed time limits are
times in minutes, and the {\tt now} parameter is the system clock
value for the time at which the lock is created. The function {\tt
GetLastLock()} creates the anti-spamming `last' lock if it does not
already exist. This is important for theoretical deadlock
avoidance.}
\end{figure*}

To make the locking behaviour user-configurable we introduce two
parameters called {\tt ExpireAfter} and {\tt IfElapsed}, which have
values in minutes. See figure \ref{ugh}. {\tt ExpireAfter} describes
the number of minutes after creation at which a lock should expire. It
is measured from the creation time-stamp on the active lock to the
current value of the system clock.  {\tt IfElapsed} describes the
number of minutes after which it becomes acceptable to execute the
same atom again. It is measured from the modification time stamp of
the anti-spamming lock to the current value of the system clock.
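
Both quantities therefore reduce to comparing an inode time stamp with
the system clock. A minimal sketch of such a comparison follows; the
helper name {\tt MinutesSince()} is ours, not cfengine's:

\small
\begin{verbatim}
  #include <sys/types.h>
  #include <sys/stat.h>
  #include <time.h>

  /* Minutes elapsed between a lock file's last
     modification and the clock value `now'    */

  long MinutesSince(const char *lockfile,
                    time_t now)
  {
  struct stat sb;

  if (stat(lockfile, &sb) == -1)
     {
     return -1;         /* no such lock */
     }

  return (long)(now - sb.st_mtime) / 60;
  }
\end{verbatim}
\normalsize

In these terms, access is denied when the value returned for the
anti-spamming lock is smaller than {\tt IfElapsed}, while an active
lock is broken when its value exceeds {\tt ExpireAfter}.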

We choose to read the current time as a parameter to {\tt GetLock()},
rather than reading it directly in the locking function, for the
following reason. The most correct time to use here could be construed
in one of two ways: it could be taken as being the time at which the
program was started, or as the exact time at which the lock creation
takes place. The difference between these times could amount to
seconds, minutes or hours, depending on the nature of the job being
locked.  By using the time at which the program was started for all
locks throughout the program, one effectively treats a `pass' of the
program as a cohesive entity: if one lock expires for a given value of
{\tt ExpireAfter}, they all expire. A certain ordering of atoms can be
preserved. If, on the other hand, one always reads the present value
of the system clock directly, the locking mechanism becomes sensitive
to the length of time it has taken to execute the different parts of
the program. Both policies might be desirable in different situations,
so we do not see fit to impose any particular restriction on this.
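
For instance, under the first policy a pass might read the clock once
and reuse the value throughout. The following call is purely
illustrative, using the schematic {\tt GetLock()} signature from the
figure and example parameter values of 15 and 90 minutes:

\small
\begin{verbatim}
  #include <time.h>

  void ExamplePass(void)
  {
  /* Read the clock once, at the start of the
     pass, and hand the same value to every
     GetLock(), so that all the locks in this
     pass share one notion of `now'           */

  time_t now = time(NULL);

  if (GetLock("editfile","/etc/motd",
              15,90,"dax",now))
     {
     /* ... atom code ... */

     ReleaseCurrentLock();
     }
  }
\end{verbatim}
\normalsize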

When a lock has expired, we try to kill the owner process of the
expired lock.  The process ID of the expired process is read from the
active lock. Then the signals {\tt CONT}, {\tt INT}, {\tt TERM} and
{\tt KILL} are sent in that order.  On some systems, {\tt INT} is the
only signal that will not hang the process permanently in case of a
disk-wait situation, thus {\tt INT} is sent first.  Then the default
terminate signal {\tt TERM} is sent, and finally the non-ignorable
signal {\tt KILL} is sent. Sleep periods of several seconds separate
these calls to give the kernel and program time to respond to the
signals.  The {\tt CONT} signal is placed first in case the process
has been suspended and wants to exit straight away.  This should be
harmless to non-suspended processes.
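
A sketch of this kill sequence, in the shape of the {\tt
KillCarefully()} call used in the schematic figure, is shown below;
the sleep intervals are illustrative:

\small
\begin{verbatim}
  #include <sys/types.h>
  #include <signal.h>
  #include <unistd.h>

  /* Send CONT, INT, TERM, KILL in that order,
     pausing in between; stop as soon as the
     process has gone away                    */

  void KillCarefully(pid_t pid)
  {
  int i;
  int sigs[4] = { SIGCONT, SIGINT,
                  SIGTERM, SIGKILL };

  for (i = 0; i < 4; i++)
     {
     if (kill(pid, sigs[i]) == -1)
        {
        return;     /* no such process */
        }

     sleep(5);      /* time to respond */
     }
  }
\end{verbatim}
\normalsize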

\section{Adaptive locks in cfengine}

Our locking mechanism was designed and implemented with cfengine
version 1.4.$x$ in mind.  Cfengine is a descriptive language and a
configuration robot which can perform distributed system
administration on large networks\cite{cf1,cf2,cf3,cf4}.  A cfengine
program is generally scheduled as a cron job, but can also be
initiated interactively or by remote network connection. Cfengine can
examine many hundreds of files and system processes, and launch dozens
of user scripts, depending on the time of day and the host concerned.
Cfengine's job is to coordinate these activities based on the {\em state}
of the system. The state comprises many variables based on host type,
date, time, and the present condition of the host as compared to a
reference model.  Its total run time involves too many variables to be
practically predictable.

Opening cfengine to the network places an extra onus on its behaviour
with respect to scheduling. Although designed in such a way that it
does not give away any rights to outside users, cfengine is
intentionally constructed so that general users (not just {\tt root})
can be allowed to execute the standard configuration in order to
update or diagnose the system, even when human administrators are not
available. The mere thought of this is enough to send convulsions down
the spines of many system folks, and it would indeed be a cause for
concern unless measures were incorporated to protect such a provision
from abuse. Adaptive locks will therefore play a central role in a
`connected' cfengine environment in the future.

We have tested our adaptive locks with cron initiated cfengine as well
as with remote connections with some success. The locks do indeed
fulfill their role in preventing seizures and overlaps which can occur
due to unforeseen delays. The locking of individual atoms means that,
even though a particular script might run over its allotted time,
other scripts and tasks can be completed without delay, come the next
scheduling time. Silly mistakes can also be dealt with
unproblematically: a cfengine program which starts itself is
impervious to the apparent infinite recursion, provided the {\tt
IfElapsed} parameter is not set to zero. This is, after all, simply an
example of spamming (see below).

Adaptive locks are very important for cfengine: cfengine is a tool
which is supposed to automate basic system administration tasks and
work as a front end for user-scripts, allowing administrators to
collect an entire network's scripts into a single place and providing
a net-wide front-end for cron.  In order to be effective in this role,
cfengine must support a high degree of autonomy.  Cfengine atomizes
operations in different ways. Some operations, such as file editing
and script execution, are locked on a per-file basis. Other operations
which could involve large scale traversals of the file system are
locked per class of operation. The aim of the locking policy is to
make the system safe and efficient -- i.e.\ not to overload the
system with contrary tasks.

Previously, cfengine processes were locked by a single global lock. If
a process were interrupted for some reason, a hanging lock would
remain and cause warning messages to be printed from the affected
hosts the next time cfengine was scheduled. Certain cfengine processes
would overrun their allotted time: typically the weekly runs which
perform extensive system checking and updates of system databases.
This would happen once a week, generating useless mail which everyone
would have been happier not to receive.  Bad NFS connections through
buggy kernels have been known to hang scripts. Also, bugs in cfengine
itself, which manifest themselves only under special conditions, could
result in a core dump and a hanging lock.  Although each isolated
occurrence of these problems was relatively rare, the cumulative
effect on a large network could be substantial enough to be an
irritation.  The system administrator would then be required to chase
after these old locks and remove them.

The new locks allow several cfengines to coexist as different
processes, without interference. Moreover, since one of the purposes
of automation is to minimize the amount of fruitless messages from the
system, the original locking policy was clearly not in tune with
cfengine's autonomous philosophy. Using the new adaptive locks,
cfengine can clean up its own hanging processes without the
intervention of a human, and even better: silently. In large network
environments such silence is golden.

With large scale system checking, the total number of locks used in a
single pass of cfengine might approach several tens or even a hundred
on a busy system, but only one active lock is present per active
thread. (We do not normally expect more than two threads for normal
system administration tasks.) The anti-spamming locks take up only a
single inode each and since most file systems have thousands of spare
inodes, this usage is hardly a concern.  The first part of table
\ref{runtimes} shows runtimes for a small cfengine run which sets 24
locks, while the last part shows a run which sets 32 locks. Some of
the operations involved in the second run are large.  Although the
difference in real time seems large for the smaller run, the
difference in user and system time is much smaller. The actual CPU
time spent to set and remove the locks is not high, which means that
most of the extra real time is spent waiting for the disk when
creating and deleting the locks.  For the larger run, the differences
are almost the same, but here the dominating part of the run is the
cfengine operations themselves, not the administration of locks.

\begin{table}[ht]
\begin{center}
\begin{tabular}{|l|c|c|c|}
\hline
Small & locks & no locks & diff\\
\hline
\hline
real & 3.3  & 1.1 & 2.2 \\
user & 0.3 & 0.3 & 0\\
sys  & 0.4 & 0.3 & 0.1\\
\hline
\end{tabular}
\vskip 0.5cm
\begin{tabular}{|l|c|c|c|}
\hline
Large & locks & no locks & diff\\
\hline
\hline
real & 12.8  & 10.0 & 2.8 \\
user &  2.3 &  2.2 & 0.1\\
sys  &  4.2 &  4.0 & 0.2 \\
\hline
\end{tabular}
\caption{Real, user and system time in seconds for two different
cfengine runs.}
\label{runtimes}
\end{center}
\end{table}

Cfengine can be exposed to infinite loops from which it will recover
gracefully. Figure \ref{cf2cf} illustrates a cfengine program which
calls itself.  Suppose we have a cfengine program which contains three
atomic operations $A$, $B$ and $C$. Suppose also that $B$ is a
shell command which executes cfengine. Let us then examine how the
locks handle the execution of this program, assuming i) that the
scripts have not been executed for a long time $>${\tt IfElapsed} and
ii) that the locking parameters have `sensible' values.

\begin{figure*}
\centerline{
\begin{picture}(400,255)(0,0)
\put( 0,255){\vector( 0,-1){255}}
\multiput(200,250)(0,-9){28}{\makebox(0,0){.}}
\put(360,227){\vector(-1,0){20}}
\put(360,202){\vector(-1,0){20}}
\put(360,227){\line(0,-1){75}}
\put(360,152){\vector(-1,0){20}}
\put(370,177){\vector(-1,0){20}}
\put(370,177){\line(0,-1){125}}
\put(370,127){\vector(-1,0){20}}
\put(370,52){\vector(-1,0){20}}
\put(360,102){\vector(-1,0){20}}
\put(360,77){\vector(-1,0){20}}
\put(360,102){\line(0,-1){75}}
\put(360,27){\vector(-1,0){20}}
\put(10,240){(Start cfengine \#1)}
\put(20,225){\#1 $A$ starts}
\put(210,225){Lock $A$}
\put(20,200){\#1 $A$ ends}
\put(210,200){Unlock $A$, write last lock}
\put(20,175){\#1 $B$ starts (Start cfengine \#2)}
\put(210,175){Lock $B$}
\put(40,150){\#2 $A$ starts / exits}
\put(210,150){Can't lock $A$, too soon}
\put(40,125){\#2 $B$ starts / exits}
\put(210,125){Can't lock $B$, already exists}
\put(40,100){\#2 $C$ starts}
\put(210,100){Lock $C$}
\put(40,75){\#2 $C$ ends}
\put(210,75){Unlock $C$, write last lock}
\put(20,50){\#1 $B$ ends (End cfengine \#2)}
\put(210,50){Unlock $B$, write last lock}
\put(20,25){\#1 $C$ starts / exits}
\put(210,25){Can't lock $C$, too soon}
\put(10,10){(End cfengine \#1)}
\end{picture}
}
\caption{Diagram of actions versus time for a cfengine process which
calls itself recursively. This illustrates the way the locks prevent
infinite recursive loops.}\label{cf2cf}
\end{figure*}

The example in figure \ref{cf2cf} shows how no more than two cfengine
processes will be started. When cfengine is first started, it executes
atom $A$, locking and unlocking it normally.  When it arrives at $B$,
a lock is acquired to run cfengine recursively (since this has not
previously occurred) and a second cfengine process proceeds to
run. Under the auspices of this second process, a new lock is
requested for $A$, but this fails since it is too soon since the last
instance of $A$ from process \#1. Next a lock is requested for $B$,
but this also fails because $B$ is busy and not enough time has passed
for it to expire. Thus we come to $C$. Since $C$ has not been
executed, cfengine obtains a lock for $C$ and executes it to
completion, then releases the lock. Process \#2 is then complete and so
is atom $B$ from cfengine process \#1. The lock for $B$ is released and
cfengine attempts to finish process \#1 by getting a lock for $C$. This
fails however, since $C$ was just executed by process \#2 and not
enough time has elapsed for it to be restarted (or killed). The first
process is then complete.

Notice how two processes flow through one another. The real work in
$A$ and $C$ (which could have been done by a single process) simply
gets shared between two processes, and no harm is done.

A similar sequence of events occurs if a process hangs while executing
an atom (see figure \ref{cf2cf2}). Suppose that an old instantiation
(process \#1) managed to execute $A$ successfully, but hung while
executing atom $B$. Later, after the lock on $B$ has expired, another
cfengine (process \#2) will execute $A$ again, kill the previous lock on
$B$ and execute $B$, then execute $C$. Here we assume that $B$ hangs
for some spurious reason, not because of any fundamental problem with
$B$.

\begin{figure*}
\centerline{
\begin{picture}(400,155)(0,0)
\put(0,155){\vector(0,-1){155}}
\multiput(200,150)(0,-9){17}{\makebox(0,0){.}}
\put(10,140){(Start cfengine \#1)}
\put(20,125){$A$ starts / ends}
\put(210,125){Lock created / removed}
\put(20,100){$B$ starts -- hangs}
\put(210,100){Lock created}
\put(10,75){(Start cfengine \#2)}
\put(210,60){Lock created / removed}
\put(20,60){$A$ starts / ends}
\put(20,35){$B$ killed and restarted / $B$ ends}
\put(210,35){Old lock removed, new one established,}
\put(210,25){then removed}
\put(210,10){Lock created / removed}
\put(20,10){$C$ starts / ends}
\end{picture}
}
\caption{Diagram of actions versus time for a cfengine process which
has hung while executing some action $B$. The lock expires and a new
cfengine takes over the remaining work, killing the old process along
the way.}\label{cf2cf2}
\end{figure*}

Similar scenarios can be constructed with remote connections and more
convoluted loops. All of these either reduce to the examples above or
are defeated by cfengine's refusal to copy from a host to itself via
the network (local copying without socket waits is used instead).
Spamming attacks from malicious users are stifled by the same
anti-spamming locks.

For various reasons our implementation of locks in cfengine includes
logging of lock behaviour. This allows us to trace the execution of
scripts and other atoms in a cfengine program and gain an impression
of how long the individual elements took to complete. This information
could then be fed back into the locking mechanism to optimize the
parameters {\tt IfElapsed} and {\tt ExpireAfter}.

\section{Deadlock and strange loops}

One of the drawbacks with locking mechanisms is that they can, through
unfortunate interactions, lead to deadlock if the system in which they
are used admits circular dependencies. In most of the cases we
encounter in system administration the likelihood of this occurring is
insignificant, but there is nonetheless a theoretical possibility
which is worth addressing.

In a single threaded application, the use of our locking policy
renders deadlock impossible unless the {\tt ExpireAfter} or {\tt
IfElapsed} parameters are set to silly values (see next section).
Deadlock (a non-recoverable hang) can only occur if concurrent
processes are running in such a way that there is circular waiting, or
a tail-chasing loop. We assume that the atoms themselves are safe,
since no locking policy can protect completely against what happens
inside an atom.

Starvation is a possibility however. This means that certain actions
may not be carried out at all.  A simple example of this is the
following: if every instance of an atom overruns its allotted time,
and the expiry time for the atom is shorter than its scheduling
interval, then the atom will be killed at every scheduling interval,
never completing its task even once.
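
Writing $\Delta t$ for the scheduling interval, as in figure
\ref{ugh}, and $t_{\rm run}$ for the typical runtime of the atom (our
notation), this starvation scenario corresponds roughly to
$$
t_{\rm run} > \Delta t \ge {\tt ExpireAfter},
$$
which is precisely the situation excluded by the requirement ${\tt
ExpireAfter} > \Delta t$ recommended in the next section.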

A related concern is that of spamming, or the senseless repetition of
a given atom, either by accident or through malice. Adequate
protection from spamming is only assured if the {\tt IfElapsed}
parameter is not set to a very low value.

An atomic operation which contains an implicit call to itself will
never be able to enter into an infinite loop, since the locks prevent
more than a single instantiation of the atom from existing.

\section{Predictable behaviour}

The success of any locking policy depends on the security of the
locks. Locks must be impervious to careless or malicious interference
from other processes or users. If not secure from interference, it is
a trivial matter to defeat the locks and open programs to
unpredictable behaviour. Deleting all the lock inodes suffices to
subvert the locking mechanism.

Locks may also be defeated by setting the parameters {\tt ExpireAfter}
and {\tt IfElapsed} to zero. In the former case, proper exclusion of
contentious processes will be disabled, and in the latter protection
from spamming will be disabled.

\begin{table}[ht]
\begin{center}
\begin{tabular}{|c|l|}
\hline
{\tt IfElapsed} & Result\\
\hline
\hline
  $T$ & Prevents atom from executing too\\
    ~ & soon (before $T$ minutes)  \\
$T=0$ & Impotent, spamming possible \\
$T\rightarrow\infty$ & Run atoms once only \\
\hline
\end{tabular}
\end{center}
\caption{Lock behaviour for various values of the variable {\tt
IfElapsed}.}
\end{table}

It is assumed that users of the locks will not sabotage themselves by
setting these parameters with silly values. It would be an interesting
investigation to see whether optimal values of these parameters could
be found for a specific type of atom, and whether the values could
even be determined automatically.

\begin{table}[ht]
\begin{center}
\begin{tabular}{|c|l|}
\hline
{\tt ExpireAfter} & Result\\
\hline
\hline
$T$ & Makes atoms interruptable after\\
 ~ & $T$ minutes \\
$T= 0$ & Impotent, nothing is locked \\
$T\rightarrow\infty$ & Never expire, deadlock possible\\
\hline
\end{tabular}
\end{center}
\caption{Lock behaviour for various values of the variable {\tt
ExpireAfter}.}
\end{table}

In order to deal with atoms which frequently overrun their allotted
time, we may note a rule of thumb, namely that {\tt ExpireAfter}
should generally be greater than {\tt IfElapsed}.  If {\tt
ExpireAfter} $<$ {\tt IfElapsed}, expiry will occur every time a new
atom is started after an overrun. This is probably too soon, since the
aim is to give the atoms a chance to complete their work.

The function of the active locks is to enforce a correct or sensible
interleaving of the atomic operations.  The optimal definition of
atoms can play a key role in determining the correctness of
behaviour. The so-called locking {\em granularity} is central to this
issue. A central assumption in our treatment here is that the atoms
themselves do not lead to subversive or incorrect behaviour. No
locking policy can effectively restrict what happens within the atoms.

To illustrate the importance of granularity, consider an extreme
example in which two concurrent threads contain a circular wait loop.
Suppose two threads each run at regular intervals (figure
\ref{deadlock}).

\begin{figure*}
\centerline{
\begin{picture}(260,107)(0,0)
\put(  0,0){\framebox(120,90){}}
\put(140,0){\framebox(120,90){}}
\put(25,50){\framebox(70,30){Wait for $X$}}
\put(25,10){\framebox(70,30){Create $Y$}}
\put(165,50){\framebox(70,30){Wait for $Y$}}
\put(165,10){\framebox(70,30){Create $X$}}
\put( 60,102){\makebox(0,0){Thread \#1}}
\put(200,102){\makebox(0,0){Thread \#2}}
\end{picture}
}
\caption{A two threaded example with circular waiting.}
\label{deadlock}
\end{figure*}

Thread \#1 performs two operations in sequence: the first is to sleep
until object $X$ is created, the second is to create object $Y$.  In
thread \#2, the operation sleeps until $Y$ is created and then object
$X$ is created. Clearly neither thread can proceed in this circular
wait loop and deadlock ensues. Let us now consider the two alternative
ways of locking these actions and the resulting behaviour. If we lock
both actions as a single atom, then expiry will cause the threads to
die and be restarted after a certain time.  However, each time the
threads are started, they fall into the same trap, since they can
never proceed past the first operation. If, on the other hand, we lock
each operation as separate atoms, the deadlock can be broken {\em
provided the locks are correctly removed from the killed process and
the anti-spamming locks are updated}. Then the scenario is as follows.
The first time the threads run, they fall into deadlock. After a
certain time, however, the threads expire and one or more threads is
killed when a new process tries to run the atom.  If we assume that
{\tt IfElapsed} is greater than $\Delta t$, the scheduling interval of
the program, then, as the new threads start, insufficient time will
have elapsed since the last lock was written for each thread, and the
first operation will not be executed. This allows the offending atom
to be hopped over and the deadlock to be circumvented.

The assumptions in this scenario are clear:
\begin{itemize}
\item Locking each operation separately implies that it is safe to
execute the operations independently.

\item The {\tt IfElapsed} parameter must be set to a value which is
greater than the scheduling interval for the atom (not necessarily just its
encapsulating program), and the anti-spamming lock must already
exist. This means that the permanent lock should always be created if
it does not already exist, otherwise deadlock is possible.

\item The {\tt ExpireAfter} parameter must be set to a value which is
greater than the scheduling interval.
\end{itemize}

No greater assurances against deadlock can be given, nor do we attempt
to cover every avenue of circular dependency.  The possible cases are
quite complicated.  If silly values are chosen for the parameters {\tt
IfElapsed} and {\tt ExpireAfter}, we can theoretically end up in a
deadlock situation.  For our purpose of utilizing locks in autonomous
system administration, the likelihood of such strange loops is small
and of mainly theoretical interest.  We therefore decline to analyze
the problem further in this context, but end with the following
claim. If $\Delta t$ is the scheduling interval (the interval at which
you expect to re-run atoms), then
$${\tt ExpireAfter} > \Delta t \ge {\tt IfElapsed}$$ ensures correct
and sensible behaviour. Whatever one sets {\tt ExpireAfter} to, its
effective value can never be less than that of {\tt IfElapsed}, since
the latter defines the rate at which the locks are re-examined.

All of these theoretical diversions should not detract from the real
intention of the parameters: namely to provide reasonable protection from
unforeseen conditions. For normal script execution, on an hourly basis,
we recommend values approximately as follows:

\begin{center}
\begin{tabular}{cc}
$\Delta t$ & 1 hour \\
{\tt IfElapsed} & 15 mins\\
{\tt ExpireAfter} & 1 hour 30 mins
\end{tabular}
\end{center}

\section{Conclusions}

The locking policy introduced in this paper essentially solves the
problem of hanging and crashed processes for cfengine, using a minimum
of system resources. Although the simplicity of the algorithm could
make the autonomous garbage collection procedure inappropriate for
certain programs, in most cases of interest to system administrators,
the behaviour is sensible and correct.  The principal advantage of
these locks is that one can always be confident that the system will
not seize up; the flow of updates remains in motion.

How do system administrators use these locks in practice? The simplest
way, which is completely transparent, is to use cfengine as a
front-end for starting all scripts. This has several advantages, since
cfengine provides a powerful classing engine which can be used to make
a single net-wide cron file. Cfengine can do many things, but it is
valuable even solely as a script scheduler.  The alternative to this
is to implement the locks in Perl or shell or some other scripting
language. This is easily accomplished since the locks use only files
(\verb+echo >> file+) and inodes (\verb+touch file+). Time comparisons
are harder in the shell, but not insurmountable.  Languages like Perl
and Guile/scheme should implement the locks as a library module.

One minor problem we have run into occurs with programs which are
started through calls to {\tt rsh}. In this case, the rsh process does
not always terminate, even when the process started by rsh has exited.
If such a program is killed when a lock expires, processes will not
necessarily die in the intended fashion.  Thus while the new
instantiation of the program may continue to restart the entire task
anew, this can leave hanging processes from the ostensibly-killed
instantiation, which simply clutter up the process table.  A possible
solution would be to kill the entire process group for the rsh, but
this method is not completely portable.  This is presently a teething
problem to be solved.

Adaptive locks contribute an insignificant amount of time to the total
runtime in trials with cfengine, and they eliminate the spurious
messages previously associated with hanging locks.  Our locks are
simple to implement
and may be used in any program where one has atomic operations whose
order need not be serialized into any strong order. An added side
effect is that programs become effectively re-entrant to multiple
threads.

It would make a fascinating study to determine whether the
intelligence of a program like cfengine could be extended to encompass
learning with respect to the jobs it carries out. Could, for
instance, the values of {\tt IfElapsed} and {\tt ExpireAfter} be tuned
automatically from the collective experience of the system itself? For
example, an atom which is frequently killed could be allowed more time
to complete. Conversely, programs which are started at every {\tt
IfElapsed} interval could indicate an attempt to spam the system, and
measures could be taken to warn about or restrict the use of that
atom. It is surprising how many interesting issues can be attached to
such a simple idea as the adaptive lock and we hope to return to some
of them in the future, as part of our programme of research into
self-maintaining operating systems.

\begin{thebibliography}{99}

\bibitem{lockdef} Gregory V. Wilson, {\em Practical Parallel
Programming}, MIT Press (1995)

\bibitem{lockdef2} G.R. Andrews, {\em Concurrent Programming},
Benjamin/Cummings (1991)

\bibitem{applock} Typical applications which use this approach are the
unix {\tt cron} daemon and the {\tt elm} mail reader, to mention just
two.

\bibitem{tanenbaum1} A.S. Tanenbaum, {\em Distributed operating
systems}, Prentice Hall, London (1995).

\bibitem{transactions} O. Wolfson, {\em Locking policies for
distributed data\-bases}, International Conference on Data
Engineering, 315 (1984)

\bibitem{cf1} M. Burgess, {\em A site configuration engine}, Computing
Systems {\bf 8} (3), 309 (1995)

\bibitem{cf2} The cfengine web site:\\
\verb+http://www.iu.hioslo.no/~mark/cfengine+

\bibitem{cf3} M. Burgess and R. Ralston, {\em Distributed resource
administration using cfengine}, Software Practice and Experience (in
press)

\bibitem{cf4} Cfengine, documentation, Free Software Foundation
1995/6. This is also available online at the web site in
ref. \cite{cf2}.

\bibitem{database} O. Wolfson, {\em The performance of locking
protocols in distributed databases}, Proceedings of the Third
International Conference on Data Engineering, 256 (1987)

\bibitem{adaptive} A.P. Sheth, A. Singhal and A.T. Liu, International
Conference on Data Engineering, 474 (1984)

\bibitem{concurrentlock} L. Chiu and M.T. Liu, Proceedings of the
Third International Conference on Data Engineering, 322 (1987)

\bibitem{waitfree} M. Herlihy, {\em Wait free synchronization}, ACM
transactions on programming languages and systems, {\bf 13}, 124
(1991)

\end{thebibliography}

\end{document}