File: string-copying-notes.html

package info (click to toggle)
tla 1.3.5%2Bdfsg-9
  • links: PTS
  • area: main
  • in suites: etch, etch-m68k
  • size: 21,368 kB
  • ctags: 11,679
  • sloc: ansic: 149,782; sh: 16,207; xml: 2,689; lisp: 1,927; makefile: 1,052; cpp: 363; tcl: 230; awk: 48; sed: 25
file content (586 lines) | stat: -rw-r--r-- 14,233 bytes parent folder | download | duplicates (7)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
  <head>
    <title>Tom Lord's Hackery</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    <link href="../../gnuarch.css" rel="stylesheet" type="text/css">
  </head>

  <body>

    <h1 class="page-title"><u>Tom Lord's Hackery</u></h1>

    <div class="mainContent">

<h1 class=essay-title>
  Considering String Allocation and Copying in Tla 1.x<br>
  
</h1>


  <h2>Is String Copying in <code>libawk</code> a Problem Worth Fixing?
</h2>
  <blockquote>
    
    <p>
      <b>Hypothesis:</b>




    <p>
      <i>Excessive string copying is a significant (attention deserving)
    performance problem in <code>tla</code>, and most of the problem is caused by
    the behavior of <code>libawk</code>.</i>




  </blockquote>




  <h2>Background
</h2>
  <p>
    <code>libawk</code> in <code>tla</code> provides two data structures used throughout most
  of <code>tla</code>: a <b>relational table</b> (a <i>2-d, dynamically sized, array of
  strings</i>) and an <b>associative table</b> (a <i>mutable associative mapping
  of strings to strings</i>).  Some basic relational operators are
  provided for the tables.




  <p>
    These data structures are represented quite simply.  The relational
  tables are dynamically allocated arrays of pointers (each element a
  table row), pointing to dynamically allocated arrays of pointers
  (each element a column within one row), pointing to dynamically
  allocated 0-terminated strings of unsigned <code>char</code> values.  The
  associative tables are simple hash tables, with keys and values
  represented by privately allocated, 0-terminated strings of unsigned
  <code>char</code> values.




  <p>
    A noteworthy property of the <code>libawk</code> implementation is that
  operations which create a new table (relational or associative) must
  newly allocate copies of all of the strings in that table,
  regardless of where the original data comes from.




  <p>
    For example: a <code>join</code> operation accepts two tables as input and
  produces as output a third table.  All strings in the output table
  have identical contents to some string in one or both of the input
  tables, but <code>join</code> as currently implemented allocates a new copy of
  those strings anyway.




  <p>
    Another example: suppose that I want two copies of a single table,
  each sorted using a different column as the sort key (as is common
  in some <code>tla</code> operations).  Each string in either table is
  duplicated in the other, the duplicate being a separately allocated
  copy.








  <h2>Envelope Calculations and Anecdotal Profiling/Tuning Results
</h2>
  <p>
    To put some rough numbers to this, I examined an arch import
  of the linux kernel prepared by John Goerzen.   I computed
  an inventory:




  
  <pre>
      % tla inventory --source --both --ids &gt; ,foo

  </pre>

  <p>
    which produces a list of all source files which are not arch control
  files, listing the relative path to each file along with that files
  inventory id, one such pair per line.  This listing disregards all
  arch control files on the grounds that, in theory, arch might be
  modified to shrink the list of such files by a large amount.




  <p>
    The size of the resulting listing is a reasonable approximation of
  the amount of memory that must be dynamically allocated to create an
  in-core table containing the inventory.  It is also approximately
  the amount of memory needed to allocate the keys and values of an
  associative table mapping file paths to ids or vice versa.




  <p>
    The number of lines in that listing is basically half the number of
  calls to <code>malloc</code> needed to create such a table and half the number
  of calls to <code>free</code> needed to free such a table.




  <p>
    Some performance critical operations in arch are likely to create
  (approximately but also &quot;at least&quot;) two (distinctively sorted)
  tables of the <code>(file-path, inventory-id)</code> relationship and two
  associative tables mapping file paths to ids and ids to inventory
  ids.




  <p>
    The listing size for this particular copy of the kernel has a bit
  over 30,00 lines and is just a bit over 1.5MB long.




  <p>
    Four tables (two associative, two relational) will require




  
  <pre>

                n_allocations == 4 * (30000 * 2)
                              == 240000


                n_string_bytes == 4 * 1.5MB
                               == 6MB

  </pre>

  <p>
    All four tables can be constructed from a single one, returned by
  the <code>inventory</code> primitivie.  That construction, in addition to
  allocating new storage for strings, will copy the contents of each
  string three times: 4.5MB of string copying spread across 180000
  separate strings.




  <p>
    These expenses are likely to be small relative to the costs of
  system calls made by arch and, especially on small trees, are
  likely to be insignificant.   Alas, there is a limit to how much
  arch's use of system calls can be optimized and my sense is that
  we are nearing that limit.




  <p>
    In absolute terms (such as &quot;wall clock time&quot;) these and related
  string and table expenses are quite noticable.  For example, several
  arch developers have obtained noteworthy performance improvements
  simply by replacing the unoptimized string copying operations with
  calls to the highly optimized versions typically found in the native
  <code>libc</code> of each host system.




  <p>
    In my own experience, substitution of a silghtly slower <code>malloc</code>
  for a native malloc also has a tangible performance impact and
  quick-n-dirty profiling with <code>gprof</code> suggests that the vast majority
  of these calls originate in <code>libawk</code>.




  <p>
    These calculations and experience with profiling and optimizing 
  suggest that attention to <code>libawk</code>, especially with the aim of
  eliminating string allocations and copies, is likely to be rewarded
  with a somewhat faster, significantly more memory-efficient 
  <code>tla</code>.








  <h2>Secondary Reasons to Want to Change String Handling
</h2>
  <p>
    The performance incentives for changing string handling are 
  not the only reason to reconsider string handling in <code>libawk</code>.




  <p>
    I have three other reasons (at least):




  
    <h3>The <code>libawk</code> Abstractions Do Not Require String Copying
</h3>
    <p>
      <code>libawk</code> was originally intended to provide mutable tables
  of immutable strings.   In other words, clients of the library
  should be free to add or delete rows or columns, keys and values --
  but they should not ever directly modify a string which is an
  element of a table.




    <p>
      As currently presented, the <code>libawk</code> API does not enforce that
  aspect of its abstraction.  To the degree it is successfully
  honored by the rest of the code in <code>tla</code> (which is not entirely
  clear), the abstraction is honored only by careful coding with
  no particular assistence from the <b>C</b> type system.




    <p>
      Modifying the <code>libawk</code> API to enforce the immutability of 
  strings would make client code of that library easier to 
  understand, maintain, verify, and modify.




  
  
  
    <h3>The Representation Type of <code>libawk</code> Strings Will Probably Change Anyway
</h3>
    <p>
      It is conceivable that, as <code>tla</code> acquires support for
  extended character sets in filenames and other strings,
  these will be simply be represented internally as 
  Unicode strings, using the UTF-8 encoding form.




    <p>
      It is also conceivable (and in my view more sensible) if 
  strings might internally be represented as Unicode strings
  using a different encoding form (such as UTF-16 or UTF-32,
  depending on circumstance).




    <p>
      Regardless, either way, the shift from presumed-ASCII strings
  to strings which sometimes include extended characters is 
  likely to be a difficult one to implement.   (Indeed, one
  justification for beginning <code>tla 2.0</code> is to avoid having to make
  this shift on the <code>tla 1.x</code> code base by instead providing a
  cleaner-from-the-start foundation in <code>tla 2.0</code>).




    <p>
      If we want to potentiate making the shift to extended characters in
  <code>tla 1.x</code>, modifying the <code>libawk</code> API so that internal string
  representations are better hidden is a definate help.




    <p>
      Currently, much code in <code>tla 1.x</code> assumes that, for example, 
  the string in the third column of the second row of a table
  can be accessed as a value of type <code>t_uchar *</code> using a 
  construct such as:




    
    <pre>
        t_uchar * table_elt = table[1][2];

    </pre>

    <p>
      Eliminating such constructs by hiding them behind a procedural
  abstraction (i.e., not having clients use array subscripting but,
  instead, having them call a function to access elements)
  would be progress on both eliminating needless string copying
  <b>and</b> preparing the way for extended character sets.




  
  
  
    <h3>Hash-consed and/or Constructive Strings May be Desirable
</h3>
    <p>
      [This is, by far, the most speculative rationale for 
  changing <code>libawk</code>.]




    <p>
      String comparison is an important operation in <code>tla</code>, both 
  comparison for equality and comparison for lexical order.




    <p>
      Unicode, necessarily, makes comparison operations considerably
  more expensive than their ASCII-only analogs.




    <p>
      One <i>sometimes applicable</i> technique for reducing the expense of
  such operations in a string-intensive application is to (a) use only
  immutable strings; (b) never allocate two strings with identical
  contents (always share a single string) -- this gives rise to a fast
  equality test; (c) memoize the results of all string comparisons for
  order, doing so in a manner that allows a quick inference of the
  order between two previously uncompared strings provided that they
  have both been compared to some other string or to two other strings
  whose order is known (or less expensively inferrable).




    <p>
      Another <i>sometimes applicable</i> technique for reducing string copying
  and, in particular, speeding up operations such as substring 
  extraction and string concatenation is to use what might be called
  &quot;constructive strings&quot; (sometimes called &quot;ropes&quot; after an
  implementation by Hans Boehm).




    <p>
      Using reference-counted immutable strings in <code>libawk</code> opens the 
  door to trying out either of these techniques.





  
  



  <h2>Program for Research
</h2>
  <p>
    I am interested in finding an easy way to eliminate <code>libawk</code>-related
  string copying.




  <p>
    There are many ways a <code>libawk</code>-style API might work without so 
  much string copying but I think that the most straightforward
  of these would be to:




  <blockquote>
    
    <p>
      <b>1. Reference-count <code>libawk</code> Strings</b>  Strings do not 
     contain references to other objects and thus are not
     subject to being part of data structure containing a 
     cycle of references.   For that reason, reference counting
     is a natural fit for memory managing them.




    <p>
      <b>2. Make the <code>libawk</code> Types Opaque</b>  The internal 
     representations in <code>libawk</code> should be hidden from 
     clients.



  </blockquote>

  <p>
    My intention, then, isf to proceed first by making such changes to
  <code>libawk</code>.   This will immediately break a good deal of code 
  throughout <code>tla</code>.   The breakage will be of a variety that a 
  <b>C</b> compiler will report it -- this is convenient because the 
  compiler will help to trace down all code which must be modified
  to honor the corrected <code>libawk</code> abstractions.








  <h2>Current Questions
</h2>
  <p>
    Have other Arch developers tried a similar experiment already?




  <p>
    Do other Arch developers agree with the conclusions I draw from
  my envelope calculations and the anecdotal profiling and tuning
  results?




  <p>
    What is the comfort-level among Arch developers regarding the
  prospects of migrating such a change, which is likely to be
  very large in scope, from a <code>tla-1.3.1</code> branch to other branches
  such as <code>baz1.x</code> and <code>tla-1.4</code>?









  <h2>Copyright
</h2>
  <p>
    <b>Copyright (C) 2004 Tom Lord</b>




  <p>
    This program is free software; you can redistribute it and/or modify
 it under the terms of the GNU General Public License as published by
 the Free Software Foundation; either version 2, or (at your option)
 any later version.




  <p>
    This program is distributed in the hope that it will be useful,
 but <b>WITHOUT ANY WARRANTY</b>; without even the implied warranty of
 <b>MERCHANTABILITY</b> or <b>FITNESS FOR A PARTICULAR PURPOSE</b>.  See the
 GNU General Public License for more details.




  <p>
    You should have received a copy of the GNU General Public License
 along with this program; if not, write to the Free Software Foundation,
 Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.




  <p>
    See the file <code>COPYING</code> for further information about
 the copyright and warranty status of this work.







    </div>

    <div class="navLeft">

<p>
  <a href="../../index.html">home</a>





  <h2>GNU Arch
</h2>
  <p>
    <a href="../../web/gnu-arch/what-is.html">what is GNU Arch?</a>




  <p>
    <a href="../../bugs/index.html">GNU Arch Bug Database</a>




  <p>
    <a href="../../index.html#external GNU Arch links">popular GNU Arch links</a>




  <p>
    <a href="../../web/communications/gnu-arch-roadmap.html">GNU Arch development roadmap</a>




  <p>
    <a href="../../web/gnu-arch/popular-writings.html">popular GNU Arch writings</a>






    </div>

  </body>
</html>