# All About Libpas, Phil's Super Fast Malloc
Filip Pizlo, Apple Inc., February 2022
# License
Copyright (c) 2022 Apple Inc. All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY APPLE INC. ``AS IS'' AND ANY
EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL APPLE INC. OR
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
# Introduction
This document describes how libpas works as of [247029@main](https://commits.webkit.org/247029@main), so a bit ahead of
where WebKit was as of [246842@main](https://commits.webkit.org/246842@main). Libpas is a fast and memory-efficient
memory allocation toolkit capable of supporting many heaps at once, engineered with the hopes that someday it'll be used
for comprehensive isoheaping of all malloc/new callsites in C/C++ programs.
Since WebKit [186504@main](https://commits.webkit.org/186504@main), we've been steadily enabling libpas as a
replacement for WebKit's bmalloc and MetaAllocator. This has so far added up to a ~2% Speedometer2 speed-up and
a ~8% memory improvement (on multiple memory benchmarks). Half of the speed-up comes from replacing the MetaAllocator,
which was JavaScriptCore's old way of managing executable memory. Now, JSC uses libpas's jit_heap to manage executable
memory. The other half of the speed-up comes from replacing everything that bmalloc provided -- the fastMalloc API, the
Gigacage API, and the IsoHeap<> API. All of the memory improvement comes from replacing bmalloc (the MetaAllocator was
already fairly memory-efficient).
This document is structured as follows. First I describe the goals of libpas; these are the things that a
malloc-like API created out of libpas should be able to expose as fast and memory-efficient functions. Then I
describe the coding style. Next I tell all about the design. Finally I talk about how libpas is tested.
# Goals of Libpas
Libpas tries to be:
- Fast. The goal is to beat bmalloc performance on single-threaded code. Bmalloc was previously the fastest
known malloc for WebKit.
- Scalable. Libpas is meant to scale well on multi-core devices.
- Memory-efficient. The goal is to beat bmalloc memory usage across the board. Part of the strategy for memory
efficiency is consistent use of first-fit allocation.
- External metadata. Libpas never puts information about a free object inside that object. The metadata is
always elsewhere. So, there's no way for a use-after-free to corrupt libpas's understanding of memory.
- Multiple heap configurations. Not all programs want the same time-memory trade-off. Some malloc users have
very bizarre requirements, like what JavaScriptCore does with its ExecutableAllocator. The goal is to support
all kinds of special allocator needs simultaneously in one library.
- Boatloads of heaps. Libpas was written with the dream of obviating the need for ownership type systems or
other compiler approaches to fixing the type-safety of use-after-frees. This means that we need one heap per
type, and to be 100% strict about it. So, libpas supports tons of heaps.
- Type-awareness. Sometimes, malloc decisions require knowing what the type's size and alignment are, like when
deciding how to split and coalesce memory. Libpas is designed to avoid type errors arising from the malloc's
"skewed" reuse of memory.
- Common free. Users of libpas isoheaps don't have to know the heap of an object when they free it. All objects
should funnel into the same free function. One kind of exception to this requirement is stuff like
ExecutableAllocator, which needs a malloc, but is fine with not calling a common free function.
- Cages. WebKit uses virtual memory reservations called *cages*, in which case WebKit allocates the virtual
memory and the malloc has to associate that memory with some heap. Libpas supports multiple kinds of cages.
# Libpas Style
Libpas is written in C. Ultimately, I chose C because I felt that the language provides better support for
extremely low-level code:
- C++ is usually my favorite, because it makes code easier to write, but for libpas, I wanted something easier
to read. It's easier to read C when auditing for subtle bugs, because there's nothing hidden. C doesn't have
stuff like destructor invocations or operator overloading, which result in surprising effectfulness in
otherwise innocent-looking code. Memory management code like libpas has to be read a lot, so C is better.
- C makes casting between pointers and integers very simple with its style of cast operator. It feels weird to
use the C cast operator in C++, so when I have to do a lot of uintptr_t'ing, I prefer C.
C lets you do most of what C++ can if you rely on `always_inline`. This didn't use to be the case, but modern C
compilers will meat-grind the code with repeated application of the following things:
- Inlining any `always_inline` call except if it's recursive or the function uses some very weird features that
libpas doesn't use (like goto pointer).
- Copy-propagating the values from the callsite into the function that uses the value.
Consequently, passing a function pointer (or struct of function pointers), where the pointer points to an
`always_inline` function and the callee is `always_inline` results in specialization akin to template
monomorphization. This works to any depth; the compiler won't be satisfied until there are no more
`always_inline` function calls. This fortuitous development in compilers allowed me to write very nice template
code in C. Libpas achieves templates in C using config structs that contain function pointers -- sometimes to
`always_inline` functions (when we want specialization and inlining) and sometimes to out-of-line functions
(when we want specialization but not inlining). Additionally, the C template style allows us to have true
polymorphic functions. Lots of libpas slow paths are huge and not at all hot. We don't want that code
specialized for every config. Luckily, this works just fine in C templates -- those polymorphic functions just
pass around a pointer to the config they are using, and dynamically load and call things in that config, almost
exactly the same way that the specialized code would do. This saves a lot of code size versus C++ templates.
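
To illustrate the idiom, here is a miniature made-up example of the C template style; none of these names are real libpas APIs:

    #include <stddef.h>

    #define EXAMPLE_ALWAYS_INLINE static inline __attribute__((always_inline))

    /* A made-up "config": a struct of function pointers that the compiler can
       constant-fold whenever the config is a compile-time constant and the
       functions involved are always_inline. */
    typedef struct {
        size_t (*round_up_size)(size_t size); /* always_inline: specialized and inlined */
        void* (*allocate_slow)(size_t size);  /* out-of-line: specialized but not inlined */
    } example_heap_config;

    EXAMPLE_ALWAYS_INLINE size_t example_round_up_to_16(size_t size)
    {
        return (size + 15) & ~(size_t)15;
    }

    static void* example_allocate_slow(size_t size)
    {
        (void)size;
        return NULL; /* stand-in for a real out-of-line slow path */
    }

    #define EXAMPLE_HEAP_CONFIG ((example_heap_config){ \
            .round_up_size = example_round_up_to_16, \
            .allocate_slow = example_allocate_slow, \
        })

    /* The "template": an always_inline function that takes the config by value.
       Each callsite that passes a constant config gets its own specialization. */
    EXAMPLE_ALWAYS_INLINE void* example_try_allocate(size_t size, example_heap_config config)
    {
        size = config.round_up_size(size); /* inlined away for constant configs */
        return config.allocate_slow(size);
    }

A call like `example_try_allocate(size, EXAMPLE_HEAP_CONFIG)` then boils down to the rounding arithmetic plus one direct out-of-line call, which is the same trick the real heap configs described below rely on.
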
Most of libpas is written in an object-oriented style. Structs are used to create either by-value objects or
heap-allocated objects. It's useful to think of these as classes, but in a loose way, since there are many ways to
do classes in C, and libpas uses whatever techniques are best on a per-class basis. But heap-allocated objects
have a clear convention: for a class named `foo`, we would call the struct `pas_foo`, and for a method `bar` on
`foo`, we would call the function `pas_foo_bar` and have the first parameter be `pas_foo*`. The function that
creates instances of `foo` is called `pas_foo_create` (or `pas_foo_create_blah_blah` in case of overloading) and
returns a `pas_foo*`. The function that destroys `foo` objects is called `pas_foo_destroy` and takes a
`pas_foo*`.
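
As a made-up example of the convention (not a real libpas class), a class `foo` would look roughly like this:

    /* pas_foo.h (hypothetical) */
    typedef struct pas_foo pas_foo;
    struct pas_foo {
        unsigned bar_count;
    };

    pas_foo* pas_foo_create(void);
    void pas_foo_bar(pas_foo* foo);     /* the method "bar" of class "foo" */
    void pas_foo_destroy(pas_foo* foo);
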
Libpas classes are usually implemented in files called `pas_foo.h` (the header that defines the struct and a
subset of functions), `pas_foo_inlines.h` (the header that defines inline functions of `foo` that require
calling functions declared in headers that `pas_foo.h` can't include), and `pas_foo.c` (the implementations of
`foo` functions that can be out-of-line).
Some libpas "classes" are singletons. The standard way of implementing a singleton in libpas is that there is
really no struct, only global variables and functions that are declared in the header. See `pas_page_malloc` or
`pas_debug_heap` for examples of singletons.
Not everything in libpas is a class. In cases where a bunch of not-class-like things can be grouped together in
a way that makes sense, we usually do something like a singleton. In cases where a function can't easily be
grouped together with some class, even a singleton, we name the file it's in after the function. There are lots
of examples of this, like `pas_deallocate` or `pas_get_page_base`. Sometimes this gets fun, like
`pas_get_page_base_and_kind_for_small_other_in_fast_megapage.h`.
Finally, libpas avoids abbreviations even more so than WebKit usually does. Functions that have a quirky
meaning typically have a long name that tells the story. The point is to make it easy to appreciate the subtlety
of the algorithm when reading the code. This is the kind of code where complex situations should look complex
at any abstraction level.
# Design of Libpas
Libpas is organized into roughly eleven areas:
1. Heap configurations. This is the way that we tell libpas how to organize a heap. Heap configurations can
control a lot. They can change obvious things like the minalign and page size, but also more crazy things,
like how to find a page header given a page and vice-versa.
2. The large heaps. This is a first-fit heap based on arrays, cartesian trees, and hashtables. The large
heap has excellent type safety support and can be safely (though not efficiently) used for small objects.
3. Metacircularity. Libpas uses malloc-like APIs internally for managing its state. These are provided by the
so-called bootstrap, utility, and immortal heaps.
4. The segregated heaps and TLCs (thread-local caches). Libpas has a super fast simple segregated storage slab
allocator. It supports type safety and is the most commonly used kind of heap.
5. The bitfit heaps. This is a fast and memory-efficient type-unsafe heap based on slabs and bitmaps.
6. The scavenger. Libpas performs a bunch of periodic bookkeeping tasks in a scavenger thread. This includes,
but is not limited to, returning memory to the OS.
7. Megapages and page header tables. Libpas has multiple clever tricks for rapidly identifying which kind of
heap an object belongs to. This includes an arithmetic hack called megapages and some lock-free hashtables.
8. The enumerator. Libpas supports malloc heap enumeration APIs.
9. The basic configuration template, used to create the `bmalloc_heap` API that is used as a replacement for
all of bmalloc's functionality.
10. The JIT heap config.
11. The fast paths. The various heaps, TLCs, megapages and page header tables are glued together by fast paths
provided for allocation, deallocation, and various utility functions.
## Heap Configurations
The `pas_heap_config` struct defines all of the configurable behaviors of a libpas heap. This includes things
like how the heap gets its memory, what size classes use segregated, bitfit, or large allocators, and a bunch
of other things.
Heap configs are passed by-value to functions that are meant to be specialized and inlined. To support this,
the convention for defining a heap config is that you create a macro (like `BMALLOC_HEAP_CONFIG`) that gives a
heap config literal expression. So, a call like `pas_get_allocation_size(ptr, BMALLOC_HEAP_CONFIG)` will give
you an optimized fast path for getting the allocation size of objects in bmalloc. This works because such fast
paths are `always_inline`.
Heap configs are passed by-pointer to functions that are not meant to be specialized. To support this, all
heap configs also have a global variable like `bmalloc_heap_config`, so we can do things like
`pas_large_heap_try_deallocate(base, &bmalloc_heap_config)`.
Heap configs can have up to two segregated page configs (`config.small_segregated_config` and
`config.medium_segregated_config`) and up to three bitfit page configs (`config.small_bitfit_config`,
`config.medium_bitfit_config`, and `config.marge_bitfit_config`). Any of the page configs can be disabled,
though weird things might happen if the smallest ones are disabled (rather than disabling the bigger ones).
Page configs (`pas_segregated_page_config`, `pas_bitfit_page_config`, and the common supertype,
`pas_page_base_config`) get used in much the same way as heap configs -- either by-value for specialized and
inlined functions or by-pointer for unspecialized functions.
Heap and page configs also support specialized-but-not-inlined functions. These are supported using additional
function pointers in those configs that are filled in using macros -- so you don't need to fill them in
explicitly when creating your own config, like `BMALLOC_HEAP_CONFIG` or `JIT_HEAP_CONFIG`. The macros fill them
in to point at never_inline functions that call some specialized and inlined function with the config passed as
a constant. This means for example that:

    BMALLOC_HEAP_CONFIG.specialized_local_allocator_try_allocate_small_segregated_slow(...);

is an out-of-line direct function call to the specialization of
`pas_local_allocator_try_allocate_small_segregated_slow`. And this would be a virtual call to the same
function:

    const pas_heap_config* config = ...;
    config->specialized_local_allocator_try_allocate_small_segregated_slow(...);

Note that in many cases where you have a `pas_heap_config`, you are in specialized code and the heap config is
a known constant at compile time, so then:

    config.specialized_local_allocator_try_allocate_small_segregated_slow(...);

is an out-of-line direct function call.
## The Large Heaps
Libpas's large heaps serve multiple purposes:
- Everything is bootstrapped on large heaps. When segregated and bitfit heaps allocate memory, they do so from
some large heap.
- Segregated and bitfit heaps have object size ceilings in the tens or hundreds of kilobytes. So, objects that
are too large for the other heaps get allocated from large heaps.
Large heaps are broken into three parts:
1. The large free heap. In libpas jargon, a *free heap* is a heap that requires that deallocation passes the
object size, requires that the freed object size matches the allocated object size for that object, and makes
no guarantees about what kind of mess you'll get yourself into if you fail to obey that rule.
2. The large map. This maps object pointer to size and heap.
3. The large heap. This is an abstraction over both (1) and (2).
Large free heaps just maintain a free-list; they know nothing about allocated objects. But combined with the
large map, the large heaps provide a user-friendly deallocation API: you just need the object pointer, and the
large map figures out the rest, including identifying which free heap the object should be deallocated into, and
the size to pass to that free heap.
Large heaps operate under a single global lock, called the *heap lock*. Most libpas heaps use fine-grained
locking or avoid locking entirely. But for the large heap, libpas currently just uses one lock.
### Large Free Heap
Large free heaps are built out of a generic algorithm that doesn't know how to represent the free-list and gets
instantiated with either of two free-list representations, *simple* and *fast*. The simple large free heap uses
an unordered array of coalesced free object descriptions. The fast large free heap uses a cartesian tree of
coalesced free object descriptions.
A free object description is represented by the `pas_large_free` object in libpas; let's just call it a *large
free* for brevity. Large frees can tell you the beginning and end of a free chunk of memory. They can also tell
if the memory is already known to be zero and what the *type skew* of the free memory is. The large heap can be
used to manage arrays of some type that is either larger than the heap's minimum alignment, or that is smaller
than and not a divisor of the alignment. Especially when this is combined with `memalign`, the free heap will
have to track free memory that isn't type-aligned. Just consider a type of size 1000 that is allocated with
alignment 16384. The rules of memalign say that the size must be 16384 in that case. Assuming that the free heap
had 32768 contiguous bytes of free memory to begin with, it will now have 16384 bytes that start with a type
skew of 384 bytes. The type skew, or `offset_in_type` as libpas calls it, is the offset of the beginning of the
large free inside the heap's type. In extremely complex cases, this means that finding where the first valid
object address is inside a large free for some type and alignment requires computing the offset least common
multiple (see `pas_coalign`), which relies on the right Bézout coefficient of the extended GCD of the type size
and alignment (see `pas_extended_gcd`).
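
To make the type-skew arithmetic concrete, here is a brute-force sketch of the computation that `pas_coalign` is used for; the real algorithm jumps straight to the answer using the extended GCD rather than searching:

    #include <stddef.h>
    #include <stdint.h>

    /* Conceptual sketch, not libpas code: find the first address in [begin, end)
       that is aligned to `alignment` (a power of two) and whose offset in the
       heap's type is zero, given that `begin` itself sits `offset_in_type` bytes
       into the type. Returns 0 if no such address fits. */
    static uintptr_t example_first_type_aligned_address(
        uintptr_t begin, uintptr_t end, size_t offset_in_type,
        size_t type_size, size_t alignment)
    {
        uintptr_t address = (begin + alignment - 1) & ~((uintptr_t)alignment - 1);
        for (; address + type_size <= end; address += alignment) {
            if (!((offset_in_type + (address - begin)) % type_size))
                return address; /* satisfies both the type and the alignment */
        }
        return 0;
    }

In the example above (type size 1000, remaining free memory starting with a type skew of 384 bytes), this linear search is what the offset least common multiple computation avoids.
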
Large frees support an API for coalescing (merging as libpas calls it) and splitting. The generic large free
heap handles searching through large frees to find the first one that matches an allocation request for some
size and alignment. It also handles coalescing freed memory back into the heap, by searching for adjacent free
memory. The searches go through a struct of function pointers that may be implemented either inefficiently (like
the simple large free heap's O(n) search through an unordered array) or efficiently (like the O(1)-ish or
O(log n)-ish operations on the cartesian tree in the fast large free heap). The generic algorithm uses the C
generic idiom so there are no actual function pointer calls at runtime.
Large free heaps allow you to give them callbacks for allocating and deallocating memory. The allocation
callback will get called if you ask a large free heap for memory and it doesn't have it. That allocation
callback could get the memory from the OS, or it could get it from some other heap. The deallocation callback
is for those cases where the large free heap called the allocation callback and then decided it wanted to give
some fraction of that memory back. Both callbacks are optional (can be NULL), though the case of a NULL
allocation callback and non-NULL deallocation callback is not useful since the deallocation callback only gets
called on the path where we had an allocation callback.
Note that large free heaps do not do anything to decommit their free memory. All decommit of memory in large
free heaps is accomplished by the *large sharing pool*, which is part of the scavenger.
### Large Map
The large map is a hashtable that maps object addresses to *large entries*, which contain the size of the object
and the heap it belongs to. The large map has a fairly complex hashtable algorithm because of my past attempts
at making the large heap at least somewhat efficient even for small objects. But it's conceptually a simple part
of the overall algorithm. It's also legal to ask the large map about objects it doesn't know about, in which
case, like a normal hashtable, it will just tell you that it doesn't know about your object. Combined with the
way that segregated and bitfit heaps use megapage tables and page header tables, this means that libpas can do
fall-through to another malloc for objects that libpas doesn't manage.
Note that it might be OK to remove the small object optimizations in the large map. On the other hand, they are
reliable, and they aren't known to increase the cost of the algorithm. Having that capability means that as part
of tuning the algorithm, it's safer than it would otherwise be to try putting some small objects into the
large heap to avoid allocating the data structures required for operating a segregated or bitfit heap.
### The Large Heap
The large free heap and large map are combined into a high-level API with the `pas_large_heap`. In terms of
state, this is just a
`pas_large_free_heap` plus some data to help with small-object optimizations in the large map. The functions
of the large heap do a lot of additional work:
- They give the free heap an allocator for getting new memory. The large heap routes memory allocation
requests to the heap config's allocation callback.
- They ensure that each free heap allocation ends up in the large map.
- They implement deallocation by removing something from the large map and then deallocating it into the free
heap.
- They provide integration with the scavenger's large sharing pool so that free memory can be decommitted.
The large heap is always used as a member of the `pas_heap` object. It's useful to think of `pas_large_heap` as
never being a distinct object; it's more of a way of compartmentalizing `pas_heap`. The heap object also
contains a segregated heap and some other stuff.
## Metacircularity
I'm used to programming with dynamically allocated objects. This lets me build arrays, trees, hashtables,
look-up tables, and all kinds of lock-free data structures. So, I wanted libpas's internals to be able to
allocate objects just like any other kind of algorithm would do. But libpas is engineered so that it can
be a "bottom of the world" malloc -- where it is the implementation of `malloc` and `free` and cannot rely on
any memory allocation primitives other than what the kernel provides. So, libpas uses its own allocation
primitives for its own objects that it uses to implement those primitives. This is bootstrapped as follows:
- The *bootstrap heap* is a simple large free heap. A simple large free heap needs to be able to allocate
exactly one variable-length array of large frees. The bootstrap heap has hacks to allow itself to allocate
that array out of itself. This trick then gives us a complete malloc implementation for internal use by
libpas, albeit one that is quite slow, can only be used under the heap lock, and requires us to know the
object's size when freeing it. All other simple large free heaps allocate their free lists from the bootstrap
heap. The bootstrap heap is the only heap in libpas that asks the OS for memory. All other heaps ask either
the bootstrap heap for memory, or they ask one of the other heaps.
- The *compact reservation* is 128MB of memory that libpas uses for objects that can be pointed at with 24-bit
(3 byte) pointers assuming 8-byte alignment. Libpas needs to manage a lot of heaps, and that requires a lot
of internal meta-data, and having compact pointers reduces the cost of doing this. The compact reservation is
allocated from the bootstrap heap.
- The *immortal heap* is a heap that bump-allocates out of the compact reservation. It's intended for small
objects that are immortal.
- The *compact bootstrap heap* is like the bootstrap heap, except that it allocates its memory from the compact
reservation, and allocates its free list from the bootstrap heap rather than itself.
- The *compact large utility free heap* is a fast large free heap that supports decommitting free memory (see
the scavenger section) and allocates its memory from the compact bootstrap heap.
- The *utility heap* is a segregated heap configured to be as simple as possible (no thread-local caches for
example) and can only be used while holding the heap lock. It only supports objects up to some size
(`PAS_UTILITY_LOOKUP_SIZE_UPPER_BOUND`), supports decommitting free memory, and gets its memory from the
compact bootstrap heap. One example of how the utility heap gets used is the nodes in the cartesian trees
used to implement fast large free heaps. So, for example, the compact large utility free heap relies on the
utility heap.
- The *large utility free heap* is a fast large free heap that supports decommitting free memory and allocates
its memory from the bootstrap heap.
Note how the heaps pull memory from one another. Generally, a heap will not return memory to the heap it got
memory from except to "undo" part of an allocation it had just done. So, this arrangement of
who-pulls-memory-from-who is designed for type safety, memory efficiency, and elegantly supporting weird
alignments:
- Libpas uses decommit rather than unmapping free memory because this ensures that we don't ever change the type
of memory after that memory gets its type for the first time.
- The lower-level heaps (like bootstrap and compact bootstrap) do not support decommit. So, if a higher-level
heap that does support decommit ever returned memory to the lower-level heap, then the memory would never get
decommitted.
- Page allocation APIs don't let us easily allocate with alignment greater than page size. Libpas does this by
over-allocating (allocating size + alignment and then searching for the first aligned start within that larger
reservation). This is all hidden inside the bootstrap heap; all other heaps that want memory on some weird
alignment just ask some other heap for memory (often the bootstrap heap) and that heap, or ultimately the
bootstrap heap, figures out what that means in terms of system calls.
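
A minimal sketch of that over-allocation trick, assuming only a page-granularity allocator (`mmap` here) underneath; the real bootstrap heap also keeps the unused head and tail of the reservation on its free list rather than leaking them:

    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>

    /* Conceptual sketch: allocate `size` bytes aligned to `alignment` (a power of
       two, possibly much larger than the page size) by over-allocating and using
       the first aligned address inside the reservation. */
    static void* example_allocate_with_large_alignment(size_t size, size_t alignment)
    {
        void* raw = mmap(NULL, size + alignment, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (raw == MAP_FAILED)
            return NULL;
        uintptr_t aligned = ((uintptr_t)raw + alignment - 1) & ~((uintptr_t)alignment - 1);
        return (void*)aligned;
    }
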
One missing piece to the metacircularity is having a *fast utility heap* that uses thread-local caches. There is
currently maybe one utility heap callsite that only grabs the heap lock just because it wants to allocate in the
utility heap. There's a possibility of a small speed-up if any callsite like that used a fast utility heap
instead, and then no locking would be required. It's not clear how easy that would be; it's possible that some
bad hacks would be required to allow code that uses TLCs to call into a heap that then also uses TLCs.
## Segregated Heaps and TLCs (Thread Local Caches)
Libpas's great performance is mostly due to the segregated heaps and how they leverage thread-local caches. TLCs
provide a per-thread cache of global memory. In the best case, this cache prevents threads from doing any
synchronization during allocation and deallocation. Even when they do have to do some synchronization, TLCs
make it unlikely that one thread will ever want to acquire a lock held by another thread. The strategy is
three-fold:
- The TLC has a per-size-class allocator that caches some amount of that size class's memory. This means that
the allocation fast path doesn't have to do any locking or atomic instructions except when its cache runs out.
Then, it will have to do some synchronization -- in libpas's case, fine-grained locking and some lock-free
algorithms -- to get more memory. The amount of memory each allocator can cache is bounded (usually 16KB) and
allocators can only hold onto memory for about a second without using it before it gets returned (see the
scavenger section).
- The TLC has a deallocation log. The fast path of deallocating a segregated heap object is just pushing it onto
the deallocation log without any locking. The slow path is to walk the log and free all of the objects. The
libpas deallocation log flush algorithm cleverly avoids doing per-object locking; in the best case it will
acquire a couple of locks before flushing and release them after flushing the whole log.
- When the deallocation log flush frees memory, it tries to first make that memory available exclusively to the
thread that freed it by putting the free memory into the *local view cache* for that size class in that
thread. Memory moves from the local view caches into the global heap only if the view cache is full (has about
1.6MB of memory in it) or if it hasn't been used in about a second (see the scavenger section).
This section lays out the details of how this works. *Segregated heaps* are organized into *segregated
directories*. Each segregated directory is an array of page *views*. Each page view may or may not have a *page*
associated with it. A view can be *exclusive* (the view has the page to itself), *partial* (it's a view into a
page shared by others), or *shared* (it represents a page shared by many partial views). Pages have a *page
boundary* (the address of the beginning of the page) and a *page header* (the object describing the page, which
may or may not actually be inside the page). Pages maintain *alloc bits* to tell which objects are live.
Allocation uses the heap's lookup tables to find the right *allocator index* in the TLC, which yields a *local
allocator*; that allocator usually has a cache of memory to allocate from. When it doesn't, it first tries to
pop a view from the local view cache, and if that fails, it uses the *find first eligible* algorithm on the
corresponding directory to find an eligible view. Once the allocator has a view, it ensures that the view has a
page, and then scans the alloc bits to create a cache of free memory in that page. Deallocation fast paths just
push the object onto the deallocation log. When the log is full, the TLC flushes its log while trying to
amortize lock acquisition. Freeing an object in a page means clearing the corresponding alloc bit. Once enough
alloc bits are clear, either the page's view ends up on the view cache, or the directory is notified to mark the
page either eligible or empty. The sections that follow go into each of these concepts in detail.
### The Thread Local Cache
Each thread has zero or one `pas_thread_local_cache`s. Libpas provides slow paths for allocating and
deallocating without a TLC in cases where TLC creation is forbidden (like when a thread is shutting down). But
usually, allocation and deallocation create a TLC if the thread doesn't already have one. TLCs are structured
as follows:
- TLCs contain a fixed-size deallocation log along with an index that tells how much of the log is full.
Deallocation pushes onto that log.
- TLCs contain a variable-length array of *allocators*, which are really either *local allocators* or *local
view caches*. Allocators are variable length. Clients access allocators using an allocator index, which they
usually get from the directory that the allocator corresponds to.
- TLCs can get reallocated and deallocated, but they always point to a `pas_thread_local_cache_node`, which is
an immortal and compact descriptor of a TLC. TLC nodes are part of a global linked list. Each TLC node may or
may not have a live TLC associated with it. TLCs cannot be created or destroyed unless the heap lock is held,
so if you hold the heap lock, you can iterate the TLC node linked list to find all TLCs.
- The layout of allocator indices in a TLC is controlled by both the directories and the TLC layout data
structure (`pas_thread_local_cache_layout`). This is a global data structure that can tell us about all of the
allocators in a TLC. When holding the heap lock, it's possible to loop over the TLC layout linked list to
find what all of the valid allocator indices are and to introspect what is at those indices.
Thread local caches tend to get large because both the local allocators and local view caches have inline
arrays. The local allocator has an array of bits that tell the allocator where the free objects are. The local
view cache has an array of view pointers (up to 1.6MB / 16KB = 100 entries, each using a 24-bit pointer). When
used in single-heap applications, these overheads don't matter -- they end up accounting for less than
10<sup>-5</sup> of the overall process footprint (not just in WebKit but when I experimented with libpas in
daemons). But when used for many heaps, these overheads are substantial. Given thousands or tens of thousands
of heaps, TLCs account for as much as 1% of memory. So, TLCs support partial decommit. Those pages that only
have allocators that are inactive get decommitted. Note that TLC decommit has landed in the libpas.git repo
as of [247029@main](https://commits.webkit.org/247029@main), but hasn't yet been merged into WebKit.
The TLC deallocation log flush algorithm is designed to achieve two performance optimizations:
- It achieves temporal locality of accesses to page headers. If freeing each object meant flipping a bit in the
page header, then many of those operations would miss cache, since the page header is not accessed by normal
program operation -- it's only accessed during some allocation slow paths and when deallocating. But because
deallocation accesses page headers only during log flush and log flush touches about 1000 objects, it's likely
that the flush will touch the same page header cache lines multiple times.
- It reduces the average number of lock acquisitions needed to free an object. Each page uses its own lock to
protect its page header, and the page header's `alloc_bits`. But deallocation log flush will do no locking or
unlocking if the object at index *i* and the object at index *i+1* use the same lock. Pages can dynamically
select which lock they use (thanks to `page->lock_ptr`), and they select it so that pages allocated out of by
a thread tend to share the same lock, so deallocation log flush usually only just acquires a lock at the start
of the 1000 objects and releases it when it finishes the 1000 objects.
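
Here is a minimal sketch of that flush loop, with stand-in names for the lock primitives and page lookup; the real code also performs the `page->lock_ptr` re-checking dance described in the page header section, and handles pages switching locks mid-flush:

    #include <stddef.h>
    #include <stdint.h>

    typedef struct example_lock example_lock;            /* stands in for pas_lock */
    void example_lock_lock(example_lock* lock);          /* hypothetical */
    void example_lock_unlock(example_lock* lock);        /* hypothetical */

    typedef struct example_page example_page;            /* stands in for the page header */
    example_page* example_page_for_object(uintptr_t object);               /* hypothetical */
    example_lock* example_page_lock(example_page* page);                   /* hypothetical: page->lock_ptr */
    void example_deallocate_in_page(example_page* page, uintptr_t object); /* clears the alloc bit, etc. */

    static void example_flush_deallocation_log(uintptr_t* log, size_t count)
    {
        example_lock* held_lock = NULL;
        for (size_t index = 0; index < count; ++index) {
            example_page* page = example_page_for_object(log[index]);
            example_lock* lock = example_page_lock(page);
            if (lock != held_lock) {
                /* Only pay for locking when the lock changes; runs of objects
                   whose pages share a lock (usually the TLC's lock) are free. */
                if (held_lock)
                    example_lock_unlock(held_lock);
                example_lock_lock(lock);
                held_lock = lock;
            }
            example_deallocate_in_page(page, log[index]);
        }
        if (held_lock)
            example_lock_unlock(held_lock);
    }
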
### The Segregated Heap
The `pas_segregated_heap` object is the part of `pas_heap` that facilitates segregated heap and bitfit heap
allocation. The details of the bitfit heap are discussed in a later section. Segregated heaps can be created
separately from a `pas_heap`, but segregated heaps are almost always part of a `pas_heap` (and it would be easy
to refactor libpas to make it so that segregated heaps are always part of heaps).
Segregated heaps use *size directories* to track actual memory. Most of the exciting action of allocation and
deallocation happens in directories. Each directory corresponds to some size class. Segregated heaps make it
easy to find the directory for a particular size. They also make it easy to iterate over all the directories.
Also, segregated heaps make it easy to find the allocator index for a particular size (using a lookup table that
is essentially a cache of what you would get if you asked for the directory using the size-to-directory lookup
tables and then asked the directory for the allocator index). The most exciting part of the segregated heap
algorithm is `pas_segregated_heap_ensure_size_directory_for_size`, which decides what to do about allocating a
size it hadn't encountered before. This algorithm will either return an existing directory, create a new one, or
even retire an existing one. It handles all of the issues related to type size, type alignment, and the
alignment argument to the current malloc call.
The lookup tables maintained by segregated heaps have some interesting properties:
- They can be decommitted and rematerialized. This is a useful space saving when having lots of isoheaps. The
rematerialization happens because a heap also maintains a linked list of directories, and that linked list
never goes away. Each directory in the linked list knows what its representation would have been in the lookup
tables.
- They are optional. Some heaps can be configured to have a preferred size, called the *basic size class*. This
is very common for isoheaps, which may only ever allocate a single size. For isoheaps based on type, the
basic size class is just that type's size. Other isoheaps dynamically infer a preferred size based on the
first allocation. When a heap only has the basic size class, it will have no lookup tables.
- There are separate lookup tables for smaller sizes (not related to the small_segregated_config -- the
threshold is set separately); these are just arrays whose index is the size divided by the heap's minalign, rounded
up. These may be populated under heap lock while they are accessed without any locks. So, accesses to them
have some guards against races.
- There are separate lookup tables for medium sizes (anything above the threshold for small lookup tables).
The medium table is a sorted array that the allocator binary-searches. It may be mutated, decommitted,
rematerialized, or reallocated under heap lock. The algorithm defends itself against this with a bunch of
compiler fences and a mutation count check. Mutating the table means incrementing-to-mutate before making changes
and incrementing-to-finish after making changes. So, the algorithm for lock-free lookup checks mutation count
before and after, and makes sure they are the same and neither indicates that we are mutating. This involves
clever use of dependency threading (like the ARM64 eor-self trick) to make sure that the mutation count reads
really happen before and after the binary search.
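
Here is a sketch of that mutation-count guard with illustrative names; the real code additionally uses dependency threading to order the reads, which this sketch omits:

    #include <stdbool.h>
    #include <stddef.h>

    #define example_compiler_fence() __asm__ __volatile__("" ::: "memory") /* stands in for pas_compiler_fence */

    typedef struct {
        unsigned mutation_count; /* odd while a mutation is in progress */
        /* ... the sorted array of size entries lives here ... */
    } example_medium_table;

    bool example_binary_search(const example_medium_table* table,
                               size_t size, size_t* result); /* hypothetical */

    /* Returns false if the size wasn't found or the lock-free read can't be
       trusted; either way the caller falls back to the heap-lock slow path. */
    static bool example_try_lookup(const example_medium_table* table,
                                   size_t size, size_t* result)
    {
        unsigned count_before = table->mutation_count;
        example_compiler_fence();
        if (count_before & 1)
            return false; /* a mutation is in progress right now */
        bool found = example_binary_search(table, size, result);
        example_compiler_fence();
        if (table->mutation_count != count_before)
            return false; /* the table changed underneath us */
        return found;
    }
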
### Segregated Directories
Much of the action of managing memory in a segregated heap happens in the segregated directories. There are two
kinds:
- Segregated size directories, which track the views belonging to some size class in some heap. These may be
*exclusive views*, which own a page, or *partial views*, which own part of a *shared page*. Partial views
range in size from just below 512 bytes up to a whole page (in rare cases).
- Segregated shared page directories, which track *shared views*. Each shared view tracks a shared page and which
partial views belong to it. However, when a shared page is decommitted, to save space, the shared view will
forget which partial views belong to it; they will re-register themselves the first time someone allocates in
them.
Both of them rely on the same basic state, though they use it a bit differently:
- A lock-free-access vector of compact view pointers. These are 4-byte pointers. This is possible because views
are always allocated out of the compact reservation (they are usually allocated in the immortal
heap). This vector may be appended to, but existing entries are immutable. So, resizing just avoids deleting
the smaller-sized vectors so that they may still be accessed in case of a race.
- A lock-free-access segmented vector of bitvectors. There are two bitvectors, and we interleave their 32-bit
words of bits. The *eligible* bitvector tells us which views may be allocated out of. This means different
things for size directories than shared page directories. For size directories, these are the views that have
some free memory and nobody is currently doing anything with them. For shared page directories, these are the
shared pages that haven't yet been fully claimed by partial views. The *empty* bitvector tells us which pages
are fully empty and can be decommitted. It's never set for partial views. It means the same thing for both
exclusive views in size directories and shared views in shared page directories.
Both bitvectors are searched in order:
- Eligible bitvectors are searched first-fit.
- Empty bitvectors are searched last-fit.
Searches are made fast because the directory uses the lock-free tricks of `pas_versioned_field` to maintain two
indices:
- A first-eligible index. This always points to the first eligible bit, except in cases where some thread has
set the bit but hasn't gotten around to setting the first-eligible index. In other words, this may have some
lag, but the lag is bounded.
- A last-empty-plus-one index. This always points to the index right after the last empty bit. If it's zero, it
means there are no set empty bits. If it's the number of views, then it means the last view is empty for sure
and there may be any number of other empty views.
These versioned indices can be read without any atomic instructions in many cases, though most mutations to them
require a pair of 128-bit compare-and-swaps.
The eligible bitvector together with the first-eligible index allow for very fast searches to find the first
eligible view. Bitvector searches are fast to begin with, even over a segmented vector, since the segmented
vector has large-enough chunks. Even searching the whole bitvector is quite efficient because of the properties
of bitvector simd (i.e. using a 32-bit or 64-bit or whatever-bit word to hold that many bits). But the
first-eligible index means that most searches never go past where that index points, so we get a mostly-O(1)
behavior when we do have to find the first eligible view.
The empty bitvector gives a similar property for the scavenger, which searches backwards to find empty views.
The efficiency here arises from the fact that empty pages know the timestamp of when they became empty, and the
scavenger will terminate its backwards search when it finds a too-recently emptied page.
Directories have to make some choices about how to add views. View addition happens under heap lock and must be
in this order:
- First we make sure that the bitvectors have enough room for a bit at the new index. The algorithm relies on
the size of the view vector telling us how many views there are, so it's fine for the bitvectors to be too big
for a moment. The segmented vector algorithm used for the bitvectors requires appending to happen under heap
lock but it can run concurrently to accesses to the vector. It accomplishes this by never deallocating the
too-small vector spines.
- Then we append the view to the view vector, possibly reallocating the view vector. Reallocation keeps the
old too-small copy of the vector around, allowing concurrent reads to the vector. The vector append stores the
value, executes an acqrel fence (probably overkill -- probably could just be a store fence), and then
increments the size. This ensures nobody sees the view until we are ready, as sketched below.
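
Here is a sketch of that append protocol with illustrative types; the real vector holds compact view pointers and uses libpas's own fence primitives:

    #include <stdatomic.h>
    #include <stddef.h>

    typedef struct {
        void** entries;   /* stands in for the compact view pointer array */
        size_t size;
        size_t capacity;
    } example_view_vector;

    /* Hypothetical: grows the array if needed, keeping the old too-small buffer
       alive so that racing readers remain safe. */
    void example_ensure_capacity(example_view_vector* vector, size_t capacity);

    /* Called with the heap lock held. Racing readers either see the old size, or
       see the new size only after the entry is already stored. */
    static void example_append_view(example_view_vector* vector, void* view)
    {
        example_ensure_capacity(vector, vector->size + 1);
        vector->entries[vector->size] = view;
        atomic_thread_fence(memory_order_release); /* the real code uses an acqrel fence */
        vector->size++;
    }
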
Directories just choose what kind of view they will create and then create an empty form of that view. So, right
at the point where the vector append happens, the view will report itself as not yet being initialized. However,
any thread can initialize an empty view. The normal flow of allocation means asking a view to "start
allocating". This actually happens in two steps (`will_start_allocating` and `did_start_allocating`). The
will-start step checks if the view needs to commit its memory, which will cause empty exclusive views to
allocate a page. Empty partial views get put into the *partial primordial* state where they grab their first
chunk of memory from some shared view and prepare to possibly grab more chunks of that shared view, depending on
demand. But all of this happens after the directory has created the view and appended it. This means that there
is even the possibility that one thread creates a view, but then some other thread takes it right after it was
appended. In that case, the first thread will loop around and try again, maybe finding some other view that had
been made eligible in the meantime, or again appending another new view.
Size directories maintain additional state to make page management easy and to accelerate allocation.
Size directories that have enabled exclusive views have a `full_alloc_bits` vector that has bits set for those
indices in their pages where an object might start. Pages use bitvectors indexed by minalign, and only set those
bits that correspond to valid object offsets. The `full_alloc_bits` vector is the main way that directories tell
where objects could possibly be in the page. The other way they tell is with
`offset_from_page_boundary_to_first_object` and `offset_from_page_boundary_to_end_of_last_object`, but the
algorithm relies on those a bit less.
Size directories can tell if they have been assigned an allocator index or a view cache index, and control the
policies of when they get them. A directory without an allocator index will allocate out of *baseline
allocators*, which are shared by all threads. Having an allocator index implies that the allocator index has
also been stored in the right places in the heap's lookup tables. Having a view cache index means that
deallocation will put eligible pages on the view cache before marking them eligible in the directory.
### Page Boundaries, Headers and Alloc Bits
"Pages" in the segregated heap are a configurable concept. Things like their size and where their header lives
can be configured by the `pas_segregated_page_config` of their directory. The config can tell which parts of
the page are usable as object payload, and they can provide callbacks for finding the page header.
The page header contains:
- The page's kind. The segregated page header is a subtype of the `pas_page_base`, and we support safely
downcasting from `pas_page_base*` to `pas_segregated_page*`. Having the kind helps with this.
- Whether the page is in use for allocation right now. This field is only for exclusive views. Allocation in
exclusive pages can only happen when some local allocator claims a page. For shared pages, this bit is in
each of the partial views.
- Whether we have freed an object in the page while also allocating in it. This field is only for exclusive
views. When we finish allocating, and the bit is set, we do the eligibility stuff we would have done if we had
freed objects without the page being used for allocation. For shared pages, this bit is in each of the partial
views.
- The sizes of objects that the page manages. Again, this is only used for exclusive views. For shared pages,
each partial view may have a different size directory, and the size directory tells the object size. It's also
possible to get the object size by asking the exclusive view for its size directory, and you will get the same
answer as if you had asked the page in that case.
- A pointer to the lock that the page uses. Pages are locked using a kind of locking dance: you load the
`page->lock_ptr`, lock that lock, and then check if `page->lock_ptr` still points at the lock you tried.
Anyone who holds both the current lock and some other lock can change `page->lock_ptr` to the other lock.
For shared pages, the `lock_ptr` always points at the shared view's ownership lock. For exclusive views, the
libpas allocator will change the lock of a page to be the lock associated with their TLC. If contention
on `page->lock_ptr` happens, then we change the lock back to the view's ownership lock. This means that in
the common case, flushing the deallocation log will encounter page after page that wants to hold the same
lock -- usually the TLC lock. This allows the deallocation log flush to only do a handful of lock acquisitions
for deallocating thousands of objects. A sketch of this locking dance appears after this list.
- The timestamp of when the page became empty, using a unit of time libpas calls *epoch*.
- The view that owns the page. This is either an exclusive view or a *shared handle*, which is the part of the
shared view that gets deallocated for decommitted pages. Note: an obvious improvement is if shared handles
were actually part of the page header; they aren't only because until recently, the page header size had to be
the same for exclusive and shared pages.
- View cache index, if the directory enabled view caching. This allows deallocation to quickly find out which
view cache to use.
- The alloc bits.
- The number of 32-bit alloc bit words that are not empty.
- Optionally, the granule use counts. It's possible for the page config to say that the page size is larger than
the system page size, but that the page is divided up into *granules* which are system page size. In that
case, the page header will have an array of 1-byte use counts per granule, which count the number of objects
in that granule. They also track a special state when the granule is decommitted. The medium_segregated_config
uses this to offer fine-grained decommit of 128KB "pages".
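
The locking dance from the `lock_ptr` bullet above looks roughly like this; names are illustrative, and the real code also handles retargeting a page's lock to the allocating thread's TLC lock:

    #include <stdatomic.h>

    typedef struct example_lock example_lock;     /* stands in for pas_lock */
    void example_lock_lock(example_lock* lock);   /* hypothetical */
    void example_lock_unlock(example_lock* lock); /* hypothetical */

    typedef struct {
        example_lock* _Atomic lock_ptr;
        /* ... the rest of the page header ... */
    } example_page_header;

    static example_lock* example_lock_page(example_page_header* page)
    {
        for (;;) {
            example_lock* lock = atomic_load(&page->lock_ptr);
            example_lock_lock(lock);
            if (atomic_load(&page->lock_ptr) == lock)
                return lock; /* we hold the lock the page currently uses */
            example_lock_unlock(lock); /* the page switched locks; try again */
        }
    }
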
Currently we have two ways of placing the page header: either at the beginning of the page (what we call
the page boundary), or in an object allocated in the utility heap. In the latter case, we use the
mostly-lock-free page header table to map between the page boundary and page header, or vice-versa. The page
config has callbacks that allow either approach. I've also used page config hacking to attempt other kinds of
strategies, like saying that every aligned 16MB chunk of pages has an array of page headers at the start of it;
but those weren't any better than either of the two current approaches.
The most important part of the page header is the alloc bits array and the `num_non_empty_words` counter. This
is where most of the action of allocating and deallocating happens. The magic of the algorithm arises from the
simple bitvector operations we can perform on `page->alloc_bits`, `full_alloc_bits` (from the size
directory in case of exclusive pages or from the partial view in case of shared pages), and the
`allocator->bits`. These operations allow us to achieve most of the algorithm:
- Deallocation clears a bit in `page->alloc_bits` and if this results in the word becoming zero, it decrements
the `num_non_empty_words`. The bit index is just the object's offset shifted by the page config's
`min_align_shift`, which is a compile-time constant in most of the algorithm. If the algorithm makes any bit
(for partials) or any word (for exclusives) empty, it makes the page eligible (either by putting it on a view
cache or marking the view eligible in its owning directory). If `num_non_empty_words` hits zero, the
deallocator also makes the view empty.
- Allocation does a find-first-set-bit on the `allocator->bits`, but in a very efficient way, because the
current 64-bit word of bits that the allocator is on is cached in `allocator->current_word` -- so allocation
rarely searches an array. So, the allocator just loads the current word, does a `ctz` or `clz` kind of
operation (which is super cheap on modern CPUs), left-shifts the result by the page config's minalign, and
adds the `allocator->page_ish` (the address in memory corresponding to the first bit in `current_word`).
That's the allocator fast path.
- We prepare `allocator->bits` by basically saying `allocator->bits = full_alloc_bits & ~page->alloc_bits`. This
is a loop since each of the `bits` is an array of words of bits and each array is the same size. For a 16384
size page (the default for `small_segregated_config`) and a minalign shift of 4 (so minalign = 16, the default
for `small_segregated_config`), this means 1024 bits, or 32 32-bit words, or 128 bytes. The loop over the 32
32-bit words is usually fully unrolled by the compiler. There are no loop-carried dependencies. This loop
shows up in profiles, and though I've tried to make it faster, I've never succeeded.
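
Here is a sketch of the two bit operations just described, using only the fields mentioned above; the real local allocator carries more state (remaining counts, bump ranges, and so on):

    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        uint64_t current_word; /* cached word of free-object bits */
        uintptr_t page_ish;    /* address corresponding to bit 0 of current_word */
    } example_local_allocator;

    /* The allocation fast path: find a set bit, claim it, compute the address. */
    static void* example_try_allocate(example_local_allocator* allocator,
                                      unsigned min_align_shift)
    {
        uint64_t word = allocator->current_word;
        if (!word)
            return NULL; /* refill current_word from the bits array, or go slow */
        unsigned bit_index = (unsigned)__builtin_ctzll(word); /* the "ctz" step */
        allocator->current_word = word & (word - 1); /* clear the claimed bit */
        return (void*)(allocator->page_ish
                       + ((uintptr_t)bit_index << min_align_shift));
    }

    /* Preparing the allocator's bits for a page: free = could-exist & ~allocated. */
    static void example_refill_bits(uint32_t* allocator_bits,
                                    const uint32_t* full_alloc_bits,
                                    const uint32_t* page_alloc_bits,
                                    size_t num_words)
    {
        for (size_t index = 0; index < num_words; ++index)
            allocator_bits[index] = full_alloc_bits[index] & ~page_alloc_bits[index];
    }
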
### Local Allocators
Each size directory can choose to use either baseline allocators or TLC local allocators for allocation. Each
size directory can choose to have a local view cache or not. Baseline allocators are just local allocators that
are global and not part of any TLC; allocation needs to grab a lock to use them. TLC local allocators don't
require any locking to get accessed.
Local allocators can be in any of these modes:
- They are totally uninitialized. All fast paths fail and slow paths will initialize the local allocator by
asking the TLC layout. This state happens if TLC decommit causes a local allocator to become all zero.
- They are in bump allocation mode. Bump allocation happens either when a local allocator decides to allocate in
a totally empty exclusive page, or for primordial partial allocation. In the former case, it's worth about 1%
performance to sometimes bump-allocate. In the latter case, using bump allocation is just convenient -- the
slow path will decide that the partial view should get a certain range of memory within a shared page and it
knows that this memory has never been used before, so it's natural to just set up a bump range over that
memory.
- They are in free bits mode. This is slightly more common than the bump mode. In this mode, the
`allocator->bits` is computed using `full_alloc_bits & ~page->alloc_bits` and contains a bit for the start of
every free object.
- They are in bitfit mode. In this mode, the allocator just forwards allocations to the `pas_bitfit_allocator`.
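As a small illustration of the bump mode above, here is a hypothetical sketch of a bump fast path over a prepared
range of never-before-used memory; the field names are made up and the real fast path also handles alignment and
size accounting:
    #include <stddef.h>
    #include <stdint.h>
    /* Bump allocation over [*bump_current, bump_end): take size bytes from the front. */
    static void* try_bump_allocate(uintptr_t* bump_current, uintptr_t bump_end, size_t size)
    {
        uintptr_t result = *bump_current;
        if (bump_end - result < size)
            return NULL; /* range exhausted; fall back to the slow path */
        *bump_current = result + size;
        return (void*)result;
    }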
Local allocators can be *stopped* at any time; this causes them to just return all of their free memory back to
the heap.
Local allocators in a TLC can be used without any conventional locking. However, there is still synchronization
taking place because the scavenger is allowed to stop allocators. To support this, local allocators set an
`in_use` bit (not atomically, but protected by a `pas_compiler_fence`) before they do any work and clear it when
done. The scavenger thread will suspend threads that have TLCs, and while a thread is suspended, it can
stop any of that TLC's allocators that are not `in_use`.
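Roughly, the `in_use` protocol looks like the following sketch, where `compiler_fence()` stands in for libpas's
compiler-only barrier and the `local_allocator` struct is a placeholder:
    #include <stdbool.h>
    #define compiler_fence() __asm__ volatile("" ::: "memory") /* compiler-only barrier */
    struct local_allocator { bool in_use; /* ...allocation state... */ };
    /* The bit is set and cleared with plain stores; the fences keep the compiler from
       moving allocator work outside the in_use region. The scavenger only inspects the
       bit while the owning thread is suspended, so no atomics are needed. */
    static void local_allocator_do_work(struct local_allocator* allocator)
    {
        allocator->in_use = true;
        compiler_fence();
        /* ... allocate, refill, or otherwise mutate allocator state ... */
        compiler_fence();
        allocator->in_use = false;
    }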
## The Bitfit Heaps
Libpas is usually used with a combination of segregated and large heaps. However, sometimes we want to have a
heap that is more space-efficient than segregated but not quite as slow as large. The bitfit heap follows a
similar style to segregated, but:
- While bitfit has a bit for each minalign index like segregated, bitfit actually uses all of the bits. To
allocate an object in bitfit, all of the bits corresponding to all of the minalign granules that the object
would use have to be free before the allocation and have to be marked as not free after the allocation.
Freeing has to clear all of the bits.
- The same page can have objects of any size allocated in it. For example, if a 100 byte object gets freed, then
it's legal to allocate two 50 byte objects out of the freed space (assuming 50 is a multiple of minalign).
- A bitfit directory does not represent a size class. A bitfit heap has one directory per bitfit page config
and each page config supports a large range of sizes (the largest object is ~250 times larger than the
smallest).
Bitfit pages have `free_bits` as well as `object_end_bits`. The `free_bits` indicates every minalign granule
that is free. For non-free granules, the `object_end_bits` has an entry for every granule that is the last
granule in some live object. These bits get used as follows:
- To allocate, we find the first set free bit and then find the first clear free bit after that. If this range
is big enough for the allocation, we clear all of the free bits and set the object end bit. If it's not big
enough, we keep searching. We do special things (see below) when we cannot allocate in a page.
- To free, we find the first set object end bit, which then gives us the object size. Then we clear the object
end bit and set the free bits.
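A minimal sketch of this allocation search, assuming hypothetical bit helpers and scanning one granule at a time
(the real code scans whole words and also maintains `max_free`):
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    static bool bit_is_set(const uint64_t* bits, size_t i) { return (bits[i >> 6] >> (i & 63)) & 1; }
    static void set_bit(uint64_t* bits, size_t i)   { bits[i >> 6] |= (uint64_t)1 << (i & 63); }
    static void clear_bit(uint64_t* bits, size_t i) { bits[i >> 6] &= ~((uint64_t)1 << (i & 63)); }
    /* Returns the first granule index of the new object, or SIZE_MAX if nothing fits. */
    static size_t bitfit_allocate(uint64_t* free_bits, uint64_t* object_end_bits,
                                  size_t num_granules, size_t granules_needed)
    {
        size_t start = 0;
        while (start < num_granules) {
            while (start < num_granules && !bit_is_set(free_bits, start))
                start++; /* find the first set free bit */
            size_t end = start;
            while (end < num_granules && bit_is_set(free_bits, end))
                end++; /* find the first clear free bit after it */
            if (end - start >= granules_needed) {
                for (size_t i = start; i < start + granules_needed; ++i)
                    clear_bit(free_bits, i);
                set_bit(object_end_bits, start + granules_needed - 1);
                return start;
            }
            start = end; /* range too small; keep searching */
        }
        return SIZE_MAX;
    }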
This basic style of allocation is usually called *bitmap* allocation. Bitfit is a special kind of bitmap
allocation that makes it cheap to find the first page that has enough space for an allocation of a given size.
Bitfit makes allocation fast even when it is managing lots of pages by using two tricks:
- Bitfit directories have an array of bitfit views and a corresponding array of `max_free` bytes. Bitfit views
are monomorphic, unlike the polymorphic views of segregated heaps. Each bitfit view is either uninitialized or
has a bitfit page. A `max_free` byte for a page gives the maximum free object size in that page. So, in the
worst case, we search the `max_free` vector to find the first byte that is large enough for our allocation.
- Bitfit uses size classes to short-circuit the search. Bitfit leverages segregated heaps to create size
classes. Segregated size directories choose at creation time if they want to support segregated allocation or
bitfit allocation. If the latter, the directory is just used as a way of locating the bitfit size class. Like
with segregated, each local allocator is associated with a segregated size directory, even if it's a local
allocator configured for bitfit. Each size class maintains the index of the first view/page in the directory
that has a free object big enough for that size class.
The updates to `max_free` and the short-circuiting indices in size classes happen when an allocation fails in
a page. This is an ideal time to set those indices since failure to allocate happens to also tell you the size
of the largest free object in the page.
When any object is freed in a page, we mark the page as having `PAS_BITFIT_MAX_FREE_UNPROCESSED` and rather than
setting any short-circuiting indices in size classes, we just set the `first_unprocessed_free` index in the
`pas_bitfit_directory`. Allocation will start its search from the minimum of `directory->first_unprocessed_free`
and `size_class->first_free`. All of these short-circuiting indices use `pas_versioned_field` just like how
short-circuiting works in segregated directories.
Bitfit heaps use fine-grained locking in the sense that each view has its own locks. But, there's no attempt
made to have different threads avoid allocating in the same pages. Adding something like view caches to the
bitfit heap is likely to make it much faster. You could even imagine that rather than having the
`directory->first_unprocessed_free`, we instead have freeing in a page put the page's view onto a local view
cache for that bitfit directory, and then we allocate out of the view cache until it's empty. Failure to
allocate in a page in the view cache will then tell us the `max_free`, which will allow us to mark the view
eligible in the directory.
## The Scavenger
Libpas returns memory to the OS by madvising it. This makes sense for a malloc that is trying to give strong
type guarantees. If we unmapped memory, then the memory could be used for some totally unrelated type in the
future. But by just decommitting the memory, we get the memory savings from free pages of memory and we also
get to preserve type safety.
The madvise system call -- and likely any mechanism for saying "this page is empty" -- is expensive enough that
it doesn't make sense to do it anytime a page becomes empty. Pages often become empty only to refill again. In
fact, last time I measured it, just about half of allocations went down the bump allocation path and (except for
at start-up) that path is for completely empty pages. So, libpas has mechanisms for stashing the information
that a page has become empty and then having a *scavenger thread* return that memory to the OS with an madvise
call (or whatever mechanism). The scavenger thread is by default configured to run every 100ms, but will shut
itself down if we have a period of non-use. At each tick, it returns all empty pages that have been empty for
some length of time (currently 300ms). Those two thresholds -- the period and the target age for decommit -- are
independently configurable, and it might make sense (for different reasons) to have either number be bigger than
the other.
This section describes the scavenging algorithm in detail. This is a large fraction of what makes libpas fast
and space-efficient. The algorithm has some crazy things in it that probably didn't work out as well as I wanted,
but nonetheless those things seem to avoid showing up in profiles. That's sort of the outcome of this algorithm
being tuned and twisted so many times during the development of this allocator. First I'll describe the
*deferred decommit log*, which is how we coalesce madvise calls. Then I'll describe the *page sharing pool*,
which is a mechanism for multiple *participants* to report that they have some empty pages. Then I'll describe
how the large heap implements this with the large sharing pool, which is one of the singleton participants. Then
I'll describe the segregated directory participants -- which are a bit different for shared page directories
versus size directories. I'll also describe the bitfit directory participants, which are quite close to their
segregated directory cousins. Then I'll describe some of the things that the scavenger does that aren't to do
with the page sharing pool, like stopping baseline allocators, stopping utility heap allocators, stopping
TLC allocators, flushing TLC deallocation logs, decommitting unused parts of TLCs, and decommitting expendable
memory.
### Deferred Decommit Log
Libpas's decommit algorithm coalesces madvise calls -- so if two adjacent pages, even from totally unrelated
heaps, become empty, then their decommit will be part of one syscall. This is achieved by having places in the
code that want to decommit memory instead add that memory to a deferred decommit log. This log internally uses
a minheap based on address. The log also stores what lock was needed to decommit the range, since the decommit
algorithm relies on fine-grained locking of memory rather than having a global commit lock. So, when the
scavenger asks the page sharing pool to go find some memory to decommit, it gives it a deferred decommit log.
The page sharing pool will usually return all of the empty pages in one go, so the deferred decommit logs can
get somewhat big. They are allocated out of the bootstrap free heap (which isn't the best idea if they ever get
very big, since bootstrap free heap memory is not decommitted -- but right now this is convenient because for
a heap to support decommit, it needs to talk to deferred decommit logs, so we want to avoid infinite recursion).
After the log is filled up, we can decommit everything in the log at once. This involves heapifying the array
and then scanning it backwards while detecting adjacent ranges. This is the loop that actually calls decommit. A
second loop unlocks the locks.
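The coalescing pass can be sketched as follows, assuming ranges already sorted by address and a hypothetical
`decommit` callback; the real code pulls ranges off a minheap and also tracks which commit lock guards each range:
    #include <stddef.h>
    #include <stdint.h>
    typedef struct { uintptr_t begin; uintptr_t end; } decommit_range;
    /* Merge runs of adjacent ranges and issue one decommit call per run. */
    static void decommit_coalesced(const decommit_range* sorted, size_t count,
                                   void (*decommit)(uintptr_t begin, uintptr_t end))
    {
        size_t i = 0;
        while (i < count) {
            uintptr_t begin = sorted[i].begin;
            uintptr_t end = sorted[i].end;
            i++;
            while (i < count && sorted[i].begin == end) {
                end = sorted[i].end; /* next range starts where this one ends: extend the run */
                i++;
            }
            decommit(begin, end);
        }
    }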
Two fun complications arise:
- As the page sharing pool scans memory for empty pages, it may arrive at pages in a random order, which may
be different from any valid order in which to acquire commit locks. So, after acquiring the first commit lock,
all subsequent lock acquisitions are `try_lock`'s. If any `try_lock` fails, the algorithm returns early, and
the deferred decommit log helps detect that a lock failed to be acquired.
- Libpas supports calling into the algorithm even if some commit locks, or the heap lock, are held. It's not
legal to try to acquire any commit locks other than by `try_lock` in that case. The deferred decommit log will
also make sure that it will not relock any commit locks that are already held.
Libpas can be configured to return memory either by madvising it or by mmap-zero-filling it, which has a similar
semantic effect but is slower. Libpas supports both symmetric and asymmetric forms of madvise, though the
asymmetric form is faster and has a slight (though mostly theoretical) memory usage edge. By *asymmetric* I mean
that you call some form of madvise to decommit memory and then do nothing to commit it. This works on Darwin and
it's quite efficient -- the kernel will clear the decommit request and give the page real memory if you access
the page. The memory usage edge of not explicitly committing memory in the memory allocator is that programs may
allocate large arrays and never use the whole array. It's OK to configure libpas to use symmetric decommit, but
the asymmetric variant might be faster or more efficient, if the target OS allows it.
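As a minimal sketch, asymmetric decommit can be as simple as the following, assuming the target OS treats the
advice as "repopulate on next access" (the exact advice flag and its behavior are platform-dependent, and the
symmetric variant would pair this with an explicit re-commit step):
    #include <sys/mman.h>
    /* Tell the kernel these pages are empty; nothing is done on "commit" -- the next
       access simply repopulates them with zero-filled memory. */
    static void decommit_asymmetric(void* base, size_t size)
    {
        madvise(base, size, MADV_DONTNEED);
    }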
### Page Sharing Pool
Different kinds of heaps have different ways of discovering that they have empty pages. Libpas supports three
different kinds of heaps (large, segregated, and bitfit) and one of those heaps has two different ways of
discovering free memory (segregated shared page directories and segregated size directories). The page sharing
pool is a data structure that can handle an arbitrary number of *page sharing participants*, each of which is
able to say whether they have empty pages and whether those empty pages are old enough to be worth decommitting.
Page sharing participants need to be able to answer the following queries:
- Getting the epoch of the oldest free page. This can be approximate; for example, it's OK for the participant
to occasionally give an epoch that is newer than the true oldest so long as this doesn't happen all the time.
We call this the `use_epoch`.
- Telling if the participant thinks it *is eligible*, i.e. has any free pages right now. It's OK for this to
return true even if there aren't free pages. It's not OK for this to return false if there are free pages. If
this returns true and there are no free pages, then it must return false after a bounded number of calls to
`take_least_recently_used`. Usually if a participant incorrectly says that it is eligible, then this state
will clear with exactly one call to `take_least_recently_used`.
- Taking the least recently used empty page (`pas_page_sharing_participant_take_least_recently_used`). This is
allowed to return that there aren't actually any empty pages. If there are free pages, this is allowed to
return any number of them (doesn't have to be just one). Pages are "returned" via a deferred decommit log.
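Conceptually, the participant API boils down to three callbacks, sketched below as a struct of function pointers;
libpas doesn't literally use a vtable (it tags each participant pointer with its type and switches on it), and the
names here are illustrative:
    #include <stdbool.h>
    #include <stdint.h>
    struct deferred_decommit_log; /* pages are "returned" through this log */
    typedef struct {
        uint64_t (*get_use_epoch)(void* participant);    /* approximate epoch of the oldest free page */
        bool (*is_eligible)(void* participant);          /* may return false positives, never false negatives */
        bool (*take_least_recently_used)(void* participant, struct deferred_decommit_log* log);
    } page_sharing_participant_ops;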
Participants also notify the page sharing pool when they see a *delta*. Seeing a delta means one of:
- The participant found a new free page and it previously thought that it didn't have any.
- The participant has discovered that the oldest free page is older than previously thought.
It's OK to only report a delta if the participant knows that it was previously not advertising itself as being
eligible, and it's also OK to report a delta every time that a free page is found. Most participants try to
avoid reporting deltas unless they know that they were not previously eligible. However, some participants (the
segregated and bitfit directories) are sloppy about reporting the epoch of the oldest free page. Those
participants will conservatively report a delta anytime they think that their estimate of the oldest page's age
has changed.
The page sharing pool itself comprises:
- A segmented vector of page sharing participant pointers. Each pointer is tagged with the participant's type,
which helps the pool decide how to get the `use_epoch`, decide whether the participant is eligible, and take
free pages from it.
- A bitvector of participants that have reported deltas.
- A minheap of participants sorted by epoch of the oldest free memory.
- The `current_participant`, which the page sharing pool will ask for pages before doing anything else, so long
as there are no deltas and the current participant continues to be eligible and continues to report the same
use epoch.
If the current participant is not set or does not meet the criteria, the heap lock is taken and all of the
participants that have the delta bit set get reinserted into the minheap based on updated use epochs. It's
possible that the delta bit causes the removal of entries in the minheap (if something stopped being eligible).
It's possible that the delta bit causes the insertion of entries that weren't previously there (if something has
just become eligible). And it's possible for an entry to be removed and then readded (if the use epoch changed).
Then, the minimum of the minheap becomes the current participant, and we can ask it for pages.
The pool itself is a class but it happens to be a singleton right now. It's probably a good idea to keep it as a
class, because as I've experimented with various approaches to organizing memory, I have had versions where
there are many pools. There is always the physical page sharing pool (the singleton), but I once had sharing
pools for moving pages between threads and sharing pools for trading virtual memory.
The page sharing pool exposes APIs for taking memory from the pool. There are two commonly used variants:
- Take a set number of bytes. The page sharing pool will try to take that much memory unless a try_lock fails,
in which case it records that it should take this additional amount of bytes on the next call.
- Take all free pages that are same age or older than some `max_epoch`. This API is called
`pas_physical_page_sharing_pool_scavenge`. This algorithm will only return when it is done. It will reloop in
case of `try_lock` failure, and it does this in a way that avoids spinning.
The expected behavior is that when the page sharing pool has to get a bunch of pages it will usually get a run
of them from some participant -- hence the emphasis on the `current_participant`. However, it's possible that
various tweaks that I've made to the algorithm have made this no longer the case. In that case, it might be
worthwhile to try to come up with a way of recomputing the `current_participant` that doesn't require holding the
`heap_lock`. Maybe even just a lock for the page sharing pool, rather than using the `heap_lock`, would be
enough to get a speed-up, and in the best case that speed-up would make it easier to increase scavenger
frequency.
The page sharing pool's scavenge API is the main part of what the scavenger does. The next sections describe the
inner workings of the page sharing participants.
### Large Sharing Pool
The large sharing pool is a singleton participant in the physical page sharing pool. It tracks which ranges of
pages are empty across all of the large heaps. The idea is to allow large heaps to sometimes split a page among
each other, in which case no single large heap knows whether the page can be decommitted. So, all large heaps as well
as the large utility free heap and compact large utility free heap report when they allocate and deallocate
memory to the large sharing pool.
Internally, the large sharing pool maintains a red-black tree and a minheap. The tree tracks coalesced ranges of
pages and their states. The data structure thinks it knows about all of memory, and it "boots up" with a single
node representing the whole address space and that node claims to be allocated and committed. When heaps that
talk to the large sharing pool acquire memory, they tell the sharing pool that the memory is now free. Any
free-and-committed ranges also reside in the minheap, which is ordered by use epoch (the time when the memory
became free).
The large sharing pool can only be used while holding the `heap_lock`, but it uses a separate lock, the
`pas_virtual_range_common_lock`, for commit and decommit. So, while libpas is blocked in the madvise syscall,
it doesn't have to hold the `heap_lock` and the other lock only gets acquired when committing or decommitting
large memory.
It's natural for the large sharing pool to handle the participant API:
- The large sharing pool participant says it's eligible when the minheap is non-empty.
- The large sharing pool participant reports the minheap's minimum use epoch as its use epoch (though it can be
configured to do something else; that something else may not be interesting anymore).
- The large sharing pool participant takes least recently used by removing the minimum from the minheap and
adding that node's memory range to the deferred decommit log.
The large sharing pool registers itself with the physical page sharing pool when some large heap reports the
first bit of memory to it.
### Segregated Directory Participant
Each segregated directory is a page sharing participant. They register themselves once they have pages that
could become empty. Segregated directories use the empty bits and the last-empty-plus-one index to satisfy the
page sharing pool participant API:
- A directory participant says it's eligible when last-empty-plus-one is nonzero.
- A directory participant reports the use epoch of the last empty page before where last-empty-plus-one points.
Note that in extreme cases this means having to search the bitvector for a set empty bit, since the
last-empty-plus-one could lag in being set to a lower value in case of races.
- A directory participant takes least recently used by searching backwards from last-empty-plus-one and taking
the last empty page. This action also updates last-empty-plus-one using the `pas_versioned_field` lock-free
tricks.
Once an empty page is found the basic idea is:
1. Clear the empty bit.
2. Try to take eligibility; i.e. make the page not eligible. We use the eligible bit as a kind of lock
throughout the segregated heap; for example, pages will not be eligible if they are currently used by some
local allocator. If this fails, we just return. Note that it's up to anyone who makes a page ineligible to
then check if it should set the empty bit after they make it eligible again. So, it's fine for the scavenger
to not set the empty bit again after clearing it and failing to take the eligible bit. This also prevents a
spin where the scavenger keeps trying to look at this allegedly empty page even though it's not eligible. By
clearing the empty bit and not setting it again in this case, the scavenger will avoid this page until it
becomes eligible.
3. Grab the *ownership lock* to say that the page is now decommitted.
4. Grab the commit lock and then put the page on the deferred decommit log.
5. Make the page eligible again.
Sadly, it's a bit more complicated than that:
- Directories don't track pages; they track views. Shared page directories have views of pages whose actual
eligibility is covered by the eligible bits in the segregated size directories that hold the shared page's
partial views. So, insofar as taking eligibility is part of the algorithm, shared page directories have to
take eligibility for each partial view associated with the shared view.
- Shared and exclusive views could both be in a state where they don't even have a page. So, before looking at
anything about the view, it's necessary to take the ownership lock. In fact, to have a guarantee that nobody
is messing with the page, we need to grab the ownership lock and take eligibility. The actual algorithm does
these two things together.
- It's possible for the page to not actually be empty even though the empty bit is set. We don't require that
the empty bit is cleared when a page becomes nonempty.
- Some of the logic of decommitting requires holding the `page->lock_ptr`, which may be a different lock than
the ownership lock. So the algorithm actually takes the ownership lock, then the page lock, then the ownership
lock again, and then the commit lock.
- If a `try_lock` fails, then we set both the eligible and empty bits, since in that case, we really do want
the page sharing pool to come back to us.
Some page configs support segregated pages that have multiple system pages inside them. In that case, the empty
bit gets set when any system page becomes empty (using granule use counts), and the taking algorithm just
decommits the granules rather than decommitting the whole page.
### Bitfit Directory Participant
Bitfit directories also use an empty bitvector and also support granules. Although bitfit directories are an
independent piece of code, their approach to participating in page sharing pools exactly mirrors what segregated
directories do.
This concludes the discussion about the page sharing pool and its participants. Next, I will cover some of the
other things that the scavenger does.
### Stopping Baseline Allocators
The scavenger also routinely stops baseline allocators. Baseline allocators are easy for the scavenger to stop
because they can be used by anyone who holds their lock. So the scavenger can stop any baseline allocator it
wants. It will only stop those allocators that haven't been used in a while (by checking and resetting a dirty
bit).
### Stopping Utility Heap Allocators
The scavenger also does the same thing for utility allocators. This just requires holding the `heap_lock`.
### Stopping TLC Allocators
Allocators in TLCs also get stopped, but this requires more effort. When the scavenger thread is running, it's
possible for any thread to be using any of its allocators and no locks are held when this happens. So, the
scavenger uses the following algorithm:
- First it tries to ask the thread to stop certain allocators. In each allocation slow path, the allocator
checks if any of the other allocators in the TLC have been requested to stop by the scavenger, and if so, it
stops those allocators. That doesn't require special synchronization because the thread that owns the
allocator is the one stopping it.
- If that doesn't work out, the scavenger suspends the thread and stops all allocators that don't have the
`in_use` bit set. The `in_use` bit is set whenever a thread does anything to a local allocator, and cleared
after.
One wrinkle about stopping allocators is that stopped allocators might get decommitted. The way that this is
coordinated is that a stopped allocator is in a special state that means that any allocation attempt will take
a slow path that acquires the TLC's scavenging lock and possibly recommits some pages and then puts the
allocator back into a normal state.
### Flushing TLC Deallocation Logs
The TLC deallocation logs can be flushed by any thread that holds the scavenging lock. So, the scavenger thread
flushes all deallocation logs that haven't been flushed recently.
When a thread flushes its own log, it holds the scavenger lock. However, appending doesn't grab the lock. To
make this work, when the scavenger flushes the log, it:
- Replaces any entry in the log that it deallocated with zero. Actually, the deallocation log flush always does
this.
- Does not reset the `thread_local_cache->deallocation_log_index`. In fact, it doesn't do anything except read
the field.
Because of the structure of the deallocation log flush, it's cheap for it to null-check everything it loads from
the deallocation log. So, when a thread goes to flush its log after the scavenger has done so, it will see a
bunch of null entries, and it will skip them. If a thread tries to append to the deallocation log while the
scavenger is flushing it, then this just works, because it ends up storing the new value above what the
scavenger sees. The object is still in the log, and will get deallocated on the next flush.
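A sketch of this null-tolerant flush, with hypothetical names (`deallocate_slow` stands in for the per-object
deallocation work; the real flush also takes page locks and batches by page):
    #include <stddef.h>
    static void flush_deallocation_log(void** log, size_t log_index, void (*deallocate_slow)(void*))
    {
        for (size_t i = 0; i < log_index; ++i) {
            void* ptr = log[i];
            if (!ptr)
                continue; /* already flushed (e.g. by the scavenger) */
            log[i] = NULL; /* zero the entry so a later flush skips it */
            deallocate_slow(ptr);
        }
    }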
### Decommitting Unused Parts of TLCs
The scavenger can decommit any page of the TLC that consists entirely of stopped allocators. The
zero value triggers a slow path in the local allocator and local view cache that then commits the page and
rematerializes the allocator. This is painstakingly ensured; to keep this property you generally have to audit
the fast paths to see which parts of allocators they access, and make sure that the allocator goes to a slow
path if they are all zero. That slow path then has to check if the allocator is stopped or decommitted, and if
it is either, it grabs the TLC's scavenger lock and recommits and rematerializes.
This feature is particularly valuable because of how big local allocators and local view caches are. When there
are a lot of heaps, this accounts for tens of MBs in some cases. So, being able to decommit unused parts is a
big deal.
### Decommitting Expendable Memory
Segregated heaps maintain lookup tables that map "index", i.e. the object size divided by the heap's minalign,
to allocator indices and directories. There are three tables:
- The index-to-allocator-index table. This only works for small-enough indices.
- The index-to-directory table. This also only works for small-enough indices.
- The medium index-to-directory-tuple binary search array. This works for the not-small-enough indices.
It makes sense to let those tables get large. It might make sense for each isoheap to have a large lookup table.
Currently, they use some smaller threshold. But to make that not use a lot of memory, we need to be able to
decommit memory that is only used by lookup tables of heaps that nobody is using.
So, libpas has this weird thing called `pas_expendable_memory`, which allows us to allocate objects of immortal
memory that automatically get decommitted if we don't "touch" them frequently enough. The scavenger checks the
state of all expendable memory, and decommits those pages that haven't been used in a while. The expendable
memory algorithm is not wired up as a page sharing participant because the scavenger really needs to poke the
whole table maintained by the algorithm every time it does a tick; otherwise the algorithm would not work.
Fortunately, there's never a lot of this kind of memory. So, this isn't a performance problem as far as I know.
Also, it doesn't actually save that much memory right now -- in part because isoheaps currently use smaller-size
tables.
This concludes the discussion of the scavenger. To summarize, the scavenger periodically:
- Stops baseline allocators.
- Stops utility heap allocators.
- Stops TLC allocators.
- Flushes TLC deallocation logs.
- Decommits unused parts of TLCs.
- Decommits expendable memory.
- Asks the physical page sharing pool to scavenge.
While it does these things, it notes whether anything is left in a state where it would be worthwhile to run
again. A steady state might be one where all empty pages have been decommitted, rather than only the old-enough
ones. If a steady state looks like it's being reached, the scavenger first sleeps for
a while, and then shuts down entirely.
## Megapages and Page Header Tables
Libpas provides non-large heaps with two fast and scalable ways of finding out about an object from its
address, as needed during deallocation or user queries like `malloc_size`.
1. Megapages and the megapage table. Libpas describes every 16MB of memory using a two-bit enum called
`pas_fast_megapage_kind`. The zero state indicates that this address does not have a *fast megapage*. The
other two describe two different kinds of fast megapages, one where you know exactly the type of page it
is (small exclusive segregated) and one where you have to find out by asking the page header, but at least
you know that it's small (so usually 16KB).
2. Page header table. This is a lock-free-to-read hashtable that tells you where to find the page header for
some page in memory. For pages that use page headers, we also use the page header table to find out if the
memory address is in memory owned by some kind of "page" (either a segregated page or a bitfit page).
Usually, the small segregated and small bitfit configs use megapages. The medium segregated, medium bitfit, and
marge bitfit configs use page header tables. Large heaps don't use either, since they have the large map.
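The megapage lookup is roughly the following sketch, assuming a packed table of two-bit entries and 16MB
megapages; the enum and field names here are simplified stand-ins for the real ones:
    #include <stdint.h>
    enum fast_megapage_kind {
        not_a_fast_megapage = 0,
        small_exclusive_segregated_megapage = 1,
        small_other_megapage = 2
    };
    static enum fast_megapage_kind megapage_kind(const uint8_t* table, uintptr_t addr)
    {
        uintptr_t index = addr >> 24;     /* which 16MB megapage (16MB = 2^24) */
        uint8_t byte = table[index >> 2]; /* four 2-bit entries per byte */
        return (enum fast_megapage_kind)((byte >> ((index & 3) * 2)) & 3);
    }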
## The Enumerator
Libpas supports the libmalloc enumerator API. It even includes tests for it. The way that this is implemented
revolves around the `pas_enumerator` class. Many of the data structures that are needed by the enumerator have
APIs to support enumeration that take a `pas_enumerator*`. The idea behind how it works is conceptually easy:
just walk the heap -- which is possible to do accurately in libpas -- and report the metadata areas, the areas
that could contain objects, and the live objects. But, we have to do this while walking a heap in a foreign
process, and we get to see that heap through little copies of it that we request a callback to make for us. The
code for enumeration isn't all that pretty but at least it's easy to find most of it by looking for functions
and files that refer to `pas_enumerator`.
But a challenge of enumeration is that it happens when the remote process is stopped at some arbitrary
point. That point has to be an instruction boundary -- so CPU memory model issues aren't a problem. But this
means that the whole algorithm has to ensure that at any instruction boundary, the enumerator will see the right
thing. Logic to help the enumerator still understand the heap at any instruction is spread throughout the
allocation algorithms. Even the trees and hashtables used by the large heap have special hacks to enable them to
be enumerable at any instruction boundary.
The enumerator is maintained so that it's 100% accurate. Any discrepancy -- either in what objects are reported
live or what their sizes are -- gets flagged as an error by `test_pas`. The goal should be to maintain perfect
enumerator accuracy. This makes sense for two reasons:
1. I've yet to see a case where doing so is a performance or memory regression.
2. It makes the enumerator easy to test. I don't know how to test an enumerator that is not accurate by design.
If it's accurate by design, then any discrepancy between the test's understanding of what is live and what
the enumerator reports can be flagged as a test failure. If it wasn't accurate by design, then I don't know
what it would mean for a test to fail or what the tests could even assert.
## The Basic Configuration Template
Libpas's heap configs and page configs allow for a tremendous amount of flexibility. Things like the utility
heap and the `jit_heap` leverage this flexibility to do strange things. However, if you're using libpas to
create a normal malloc, then a lot of the configurability in the heap/page configs is too much. A "normal"
malloc is one that is exposed as a normal API rather than being internal to libpas, and that manages memory that
doesn't have special properties, such as being marked executable but not writeable.
The basic template is provided by `pas_heap_config_utils.h`. To define a new config based on this template, you
need to:
- Add the appropriate heap config and page config kinds to `pas_heap_config_kind.def`,
`pas_segregated_page_config_kind.def`, and `pas_bitfit_page_config_kind.def`. You also have to do this if you
add any kind of config, even one that doesn't use the template.
- Create the files `foo_heap_config.h` and `foo_heap_config.c`. These are mostly boilerplate.
The header file usually looks like this:
    #define ISO_MINALIGN_SHIFT ((size_t)4)
    #define ISO_MINALIGN_SIZE ((size_t)1 << ISO_MINALIGN_SHIFT)
    #define ISO_HEAP_CONFIG PAS_BASIC_HEAP_CONFIG( \
        iso, \
        .activate = pas_heap_config_utils_null_activate, \
        .get_type_size = pas_simple_type_as_heap_type_get_type_size, \
        .get_type_alignment = pas_simple_type_as_heap_type_get_type_alignment, \
        .dump_type = pas_simple_type_as_heap_type_dump, \
        .check_deallocation = true, \
        .small_segregated_min_align_shift = ISO_MINALIGN_SHIFT, \
        .small_segregated_sharing_shift = PAS_SMALL_SHARING_SHIFT, \
        .small_segregated_page_size = PAS_SMALL_PAGE_DEFAULT_SIZE, \
        .small_segregated_wasteage_handicap = PAS_SMALL_PAGE_HANDICAP, \
        .small_exclusive_segregated_logging_mode = pas_segregated_deallocation_size_oblivious_logging_mode, \
        .small_shared_segregated_logging_mode = pas_segregated_deallocation_no_logging_mode, \
        .small_exclusive_segregated_enable_empty_word_eligibility_optimization = false, \
        .small_shared_segregated_enable_empty_word_eligibility_optimization = false, \
        .small_segregated_use_reversed_current_word = PAS_ARM64, \
        .enable_view_cache = false, \
        .use_small_bitfit = true, \
        .small_bitfit_min_align_shift = ISO_MINALIGN_SHIFT, \
        .small_bitfit_page_size = PAS_SMALL_BITFIT_PAGE_DEFAULT_SIZE, \
        .medium_page_size = PAS_MEDIUM_PAGE_DEFAULT_SIZE, \
        .granule_size = PAS_GRANULE_DEFAULT_SIZE, \
        .use_medium_segregated = true, \
        .medium_segregated_min_align_shift = PAS_MIN_MEDIUM_ALIGN_SHIFT, \
        .medium_segregated_sharing_shift = PAS_MEDIUM_SHARING_SHIFT, \
        .medium_segregated_wasteage_handicap = PAS_MEDIUM_PAGE_HANDICAP, \
        .medium_exclusive_segregated_logging_mode = pas_segregated_deallocation_size_aware_logging_mode, \
        .medium_shared_segregated_logging_mode = pas_segregated_deallocation_no_logging_mode, \
        .use_medium_bitfit = true, \
        .medium_bitfit_min_align_shift = PAS_MIN_MEDIUM_ALIGN_SHIFT, \
        .use_marge_bitfit = true, \
        .marge_bitfit_min_align_shift = PAS_MIN_MARGE_ALIGN_SHIFT, \
        .marge_bitfit_page_size = PAS_MARGE_PAGE_DEFAULT_SIZE, \
        .pgm_enabled = false)
    PAS_API extern const pas_heap_config iso_heap_config;
    PAS_BASIC_HEAP_CONFIG_DECLARATIONS(iso, ISO);
Note the use of `PAS_BASIC_HEAP_CONFIG`, which creates a config literal that automatically fills in a bunch of
heap config, segregated page config, and bitfit page config fields based on the arguments you pass to
`PAS_BASIC_HEAP_CONFIG`. The corresponding `.c` file looks like this:
    const pas_heap_config iso_heap_config = ISO_HEAP_CONFIG;
    PAS_BASIC_HEAP_CONFIG_DEFINITIONS(
        iso, ISO,
        .allocate_page_should_zero = false,
        .intrinsic_view_cache_capacity = pas_heap_runtime_config_zero_view_cache_capacity);
Note that this just configures whether new pages are zeroed and what the view cache capacity for the intrinsic
heap is. The *intrinsic heap* is one of the four categories of heaps that the basic heap configuration template
supports:
- Intrinsic heaps are global singleton heaps, like the common heap for primitives. WebKit's fastMalloc bottoms
out in an intrinsic heap.
- Primitive heaps are heaps for primitive untyped values, but they aren't singletons. You can have many
primitive heaps.
- Typed heaps have a type, and the type has a fixed size and alignment. Typed heaps allow allocating single
instances of objects of that type or arrays of that type.
- Flex heaps are for objects with flexible array members. They pretend as if their type has size and alignment
equal to 1, but in practice they are used for objects that have some base size plus a variable-length array.
Note that libpas doesn't correctly manage flex memory in the large heap; we need a variant of the large heap
that knows that you cannot reuse flex memory between different sizes.
The basic heap config template sets up some basic defaults for how heaps work:
- It makes small segregated and small bitfit page configs put the page header at the beginning of the page and
it arranges to have those pages allocated out of megapages.
- It makes medium segregated, medium bitfit, and marge bitfit use page header tables.
- It sets up a way to find things like the page header tables from the enumerator.
- It sets up segregated shared page directories for each of the segregated page configs.
The `bmalloc_heap_config` is an example of a configuration that uses the basic template. If we ever wanted to
put libpas into some other malloc library, we'd probably create a heap config for that library, and we would
probably base it on the basic heap config template (though we don't absolutely have to).
## JIT Heap Config
The JIT heap config is for replacing the MetaAllocator as a way of doing executable memory allocation in WebKit.
It needs to satisfy two requirements of executable memory allocation:
- The allocator cannot read or write the memory it manages, since that memory may have weird permissions at any
time.
- Clients of the executable allocator must be able to in-place shrink allocations.
The large heap trivially supports both requirements. The bitfit heap trivially supports the second requirement,
and can be made to support the first requirement if we use page header tables for all kinds of memory, not just
medium or marge. So, the JIT heap config focuses on just using bitfit and large and it forces bitfit to use
page header tables even for the small bitfit page config.
## Security Considerations
### Probabilistic Guard Malloc
Probabilistic Guard Malloc (PGM) is a new allocator designed to catch use after free attempts and out of bounds accesses.
It behaves similarly to AddressSanitizer (ASAN), but aims to have minimal runtime overhead.
The design of PGM is quite simple. Each time an allocation is performed, an additional guard page is added above and below the newly
allocated page(s). An allocation may span multiple pages. When a deallocation is performed, the allocated page(s) will be protected
using mprotect to ensure that any use-after-free will trigger a crash. Virtual memory addresses are never reused, so we will never run
into a case where object 1 is freed, object 2 is allocated over the same address space, and object 1 then accesses the memory address
space of what is now object 2.
PGM does add notable memory overhead. Each allocation, no matter the size, adds an additional 2 guard pages (8KB for X86_64 and 32KB
for ARM64). In addition, there may be free memory left over in the page(s) allocated for the user. This memory cannot be used by any
other allocation.
We added limits on virtual memory and wasted memory to help limit the memory impact on the overall system. Virtual memory for this
allocator is limited to 1GB. Wasted memory, which is the unused memory in the page(s) allocated by the user, is limited to 1MB.
These overall limits should ensure that the memory impact on the system is minimal, while helping to tackle the problems of catching
use after frees and out of bounds accesses.
## The Fast Paths
All of the discussion in the previous sections is about the innards of libpas. But ultimately, clients want to
just call malloc-like and free-like functions to manage memory. Libpas provides fast path templates that actual
heap implementations reuse to provide malloc/free functions. The fast paths are:
- `pas_try_allocate.h`, which is the single object allocation fast path for isoheaps. This function just takes a
heap and no size; it allocates one object of the size and alignment that the heap's type wants.
- `pas_try_allocate_array.h`, which is the array and aligned allocation fast path for isoheaps. You want to use
it with heaps that have a type, and that type has a size and alignment, and you want to allocate arrays of
that type or instances of that type with special alignment.
- `pas_try_allocate_primitive.h`, which is the primitive object allocation fast path for heaps that don't have
a type (i.e. they have the primitive type as their type -- the type says it has size and alignment equal to
1).
- `pas_try_allocate_intrinsic.h`, which is the intrinsic heap allocation fast path.
- `pas_try_reallocate.h`, which provides variants of all of the allocators that reallocate memory.
- `pas_deallocate.h`, which provides the fast path for `free`.
- `pas_get_allocation_size.h`, which is the fast path for `malloc_size`.
One thing to remember when dealing with the fast paths is that they are engineered so that malloc/free functions
have no stack frame, use no callee-save registers, and don't need to save the LR/FP to the stack. To facilitate this,
we have the fast path call an inline-only fast path, and if that fails, we call a "casual case". The inline-only
fast path makes no out-of-line function calls, since if it did, we'd need a stack frame. The only slow call (to
the casual case) is a tail call. For example:
    static PAS_ALWAYS_INLINE void* bmalloc_try_allocate_inline(size_t size)
    {
        pas_allocation_result result;
        result = bmalloc_try_allocate_impl_inline_only(size, 1);
        if (PAS_LIKELY(result.did_succeed))
            return (void*)result.begin;
        return bmalloc_try_allocate_casual(size);
    }
The way that the `bmalloc_try_allocate_impl_inline_only` and `bmalloc_try_allocate_casual` functions are created
is with:
    PAS_CREATE_TRY_ALLOCATE_INTRINSIC(
        bmalloc_try_allocate_impl,
        BMALLOC_HEAP_CONFIG,
        &bmalloc_intrinsic_runtime_config.base,
        &bmalloc_allocator_counts,
        pas_allocation_result_identity,
        &bmalloc_common_primitive_heap,
        &bmalloc_common_primitive_heap_support,
        pas_intrinsic_heap_is_designated);
All allocation fast paths require this kind of macro that creates a bunch of functions -- both the inline paths
and the out-of-line paths. The deallocation, reallocation, and other fast paths are simpler. For example,
deallocation is just:
    static PAS_ALWAYS_INLINE void bmalloc_deallocate_inline(void* ptr)
    {
        pas_deallocate(ptr, BMALLOC_HEAP_CONFIG);
    }
If you look at `pas_deallocate`, you'll see cleverness that ensures that the slow path call is a tail call,
similarly to how allocators work. However, for deallocation, I haven't had the need to make the slow call
explicit in the client side (the way that `bmalloc_try_allocate_inline` has to explicitly call the slow path).
This concludes the discussion of libpas design.
# Testing Libpas
I've tried to write test cases for every behavior in libpas, to the point that you should feel comfortable
dropping a new libpas in WebKit (or wherever) if `test_pas` passes.
`test_pas` is a white-box component, regression, and unit test suite. It's allowed to call any libpas function,
even internal functions, and sometimes functions that libpas exposes only for the test suite.
Libpas testing errs on the side of being comprehensive even if this creates annoying situations. Many tests
assert detailed things about how many objects fit in a page, what an object's offset is in a page, and things
like that. This means that some behavior changes in libpas that aren't in any way wrong will set off errors in
the test suite. So, it's common to have to rebase tests when making libpas changes.
The libpas test suite is written in C++ and uses its own test harness that forks for each test, so each test
runs in a totally pristine state. Also, the test suite can use `malloc` and `new` as much as it likes, since in
the test suite, libpas does not replace `malloc` and `free`.
The most important libpas tests are the so-called *chaos* tests, which randomly create and destroy objects and
assert that the heap's state is still sane (like that no live objects overlap, that all live objects are
enumerable, etc).
# Conclusion
Libpas is a beast of a malloc, designed for speed, memory efficiency, and type safety. May whoever maintains it
find some joy in this insane codebase!