TABLE OF CONTENTS:
I. Introduction
II. Primitives
III. The whole picture
IV. Processes Vs. threads
V. Communications between core process and plugins
VI. Memory table plugin
VII. SQL issues and *SQL plugins
VIII. pmacctd flows counter implementation
IX. Classifier and connection tracking engines
X. Jumps and flow diversions in Pre-Tagging infrastructure
XI. BGP daemon thread dimensioning
I. Introduction
The goal of this document is to give extra insight into some of the internals of
pmacct (useful for development or simply for constructive criticism). Since March 2005,
this document is complemented by a paper giving an architectural overview of the
project, 'pmacct: steps forward interface counters'; the referred paper is available
for download at the pmacct homepage.
II. Primitives
Individual packets and specific traffic flows are identified by their header fields.
The union of such header fields generates a (rather large) set of primitives. Traffic
aggregates are instead typically identified by a reduced set of primitives. Packets
or flows are merged into aggregates by defining a sub-set of primitives, shaving
off unused primitives from the original set and finally summing their counters
(ie. bytes and packets). Additional operations involved in the process can also
include logical grouping of specific primitives into more general entities (ie. IP
addresses into network prefixes), temporal aggregation (ie. splitting aggregates into
time bins), tagging, sampling and filtering.
What are primitives? They are atomic expressions, ie. "src_port", "dst_host", "proto";
currently the only boolean operator supported to glue expressions together is "and".
Hence, traffic could be aggregated by translating a "who connects where, using which
service" natural language statement into one recognized by pmacct: "src_host,dst_host,
dst_port,proto". The comma, because of the single logical connective "and", is simply
intended as a separator.
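For instance, expressed in a configuration file the same statement would look as follows
(a minimal sketch; the choice of the in-memory table plugin is just illustrative):

  aggregate: src_host, dst_host, dst_port, proto
  plugins: memory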
III. The whole picture
----[ nfacctd loop ]--------------------------------------------
| |
| [ check ] [ handle ] [ Correlate ] |
| ... =====[ Allow ]===== [ maps ]===== [ BGP, IGP ] ... |
| [ table ] |
| nfacctd.c pretag.c bgp/bgp.c |
| pretag_handlers.c isis/isis.c |
| pretag-data.h |
----------------------------------------------------------------
\
|
|
-----[ core process ]--------------------------------------------------------------------------------
| | |
| | [ apply ] [ evaluate ] [ handle ] [ write buffer ] |
| | [ pre_tag_filter ] [ primitives ] |==[ channel buffer ]====[ to plugin ]==== ... |
mirrored | / && && | |
traffic | / [ apply ] [ apply ] | [ handle ] [ write buffer ] |
====================[ aggregate_filter ]====[ post_tag ]======|==[ channel buffer ]====[ to plugin ]==== ... |
NetFlow | \ && | |
| | [ evaluate ] | [ handle ] [ write buffer ] |
| | [ packet sampling ] |==[ channel buffer ]====[ to plugin ]==== ... |
| | |
| \ plugin_hooks.c |
-----------------------------------------------------------------------------------------------------
|
|
/
----[ pmacctd loop ]--------------------------------------------------------------------------------
| |
| [ handle ] [ handle ] [ handle ] [ handle ] [ handle ] |
| ... ====[ link layer ]=====[ IP layer ]====[ fragments ]==== [ flows ]==== [ classification ] ... |
| ll.c nl.c ip_frag.c ip_flow.c nDPI library |
| |
| [ handle ] [ Correlate ] |
| ... ====[ maps ]===== [ BGP, IGP ] ... |
| pretag.c bgp/bgp.c |
| pretag_handlers.c isis/isis.c |
| pretag-data.h |
----------------------------------------------------------------------------------------------------
Except for the protocol specifics, the sfacctd loop is similar to the nfacctd loop; the
same goes for uacctd, which is similar to pmacctd.
IV. Processes Vs. threads
pmacct daemons, ie. pmacctd, nfacctd, rely on both a multi-thread and a multi-process
architecture. Processes are used to encapsulate each plugin instance and, indeed, the
Core Process. Threads are used to encapsulate specific functions within each process,
ie. the BGP daemon thread within the Core Process. The Core Process either captures
packets via the well-known libpcap API (pmacctd) or listens for specific packets coming
from the network (nfacctd, for example, listens for NetFlow packets); packets are then
processed (parsed, filtered, sampled, tagged, aggregated and buffered, if required as
per the runtime config) and sent to the active plugins. Plugins in turn pick up the
aggregated data (struct pkt_data) and handle it in some meaningful way, ie. writing it
to a RDBMS, a memory table, etc. A diagram follows:
|===> [ pmacctd/plugin ]
libpcap shared mem | ZeroMQ
===========> [ pmacctd/core ]==============|===> [ pmacctd/plugin ]
socket
To conclude, a position on threads: threads are a necessity because of the way modern
CPUs are built (ie. multi-core). So far pmacct limits itself to a macro-usage of
threads, ie. where it makes sense to save on IPC or where big memory structures would
lead the pages' copy-on-write to perform horrendously. The rationale is that fine-
grained multi-threading can often become a fertile source of bugs.
V. Communications between core process and plugins
A single running Core Process, which gathers packets or flows from the network, is able to
feed aggregated data to multiple plugins; plugins are distinguished by their name. Names,
for this purpose, need to be unique. Data is pushed to active plugins via either a home-
grown circular queue or a ZeroMQ queue. The rest of this chapter focuses only on the home-
grown circular queue; also, although the home-grown queue is still the default pick (as
it does not bring in any external library dependency), the ZeroMQ queue is currently the
recommended way to go. Each plugin has a private control channel with the Core Process.
Circular queues are encapsulated into a more complex channel structure which also includes:
a copy of the aggregation method, an OOB (Out-of-Band) signalling channel, buffers, one or
more filters and a pointer to the next free queue element. The Core Process simply loops
around all established channels, in a round-robin fashion, feeding data to active plugins.
The circular queue is effectively a shared memory segment. If the Plugin is sleeping (ie.
because new data is not arriving from the network at sustained rates), the Core Process
kicks the specific Plugin, flagging that new data is now available at the specified memory
address; the Plugin catches the message and copies the buffer into its private memory
space. If the Plugin is not sleeping, once it has finished processing the current element,
it will check the next queue element to see whether new data is available. In either case,
the Plugin continues processing the received buffer.
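To make the mechanics more concrete, a minimal C sketch of the Core Process side of a
channel follows; structure and function names are illustrative only and do not mirror
pmacct's actual channel code (see plugin_hooks.c for the real implementation):

  /* Illustrative sketch: a shared circular queue of fixed-size buffers plus
     an out-of-band pipe used to wake up a sleeping plugin. All names here
     are hypothetical, not pmacct's own. */
  #include <string.h>
  #include <unistd.h>

  struct ch_buf {
    int ready;                  /* set when the buffer holds fresh data      */
    char data[10240];           /* transfer buffer ('plugin_buffer_size')    */
  };

  struct channel {
    struct ch_buf *queue;       /* shared memory segment (circular queue)    */
    int qsize;                  /* number of transfer buffers in the queue   */
    int head;                   /* index of the next free queue element      */
    int oob_fd;                 /* write end of the OOB signalling pipe      */
    int plugin_sleeping;        /* whether the plugin needs to be woken up   */
  };

  /* Core Process side: place one aggregated buffer onto the channel */
  void channel_write(struct channel *c, const void *buf, size_t len)
  {
    struct ch_buf *slot = &c->queue[c->head];

    if (slot->ready) return;    /* queue full: plugin lagging, data at risk  */

    if (len > sizeof(slot->data)) len = sizeof(slot->data);
    memcpy(slot->data, buf, len);
    slot->ready = 1;
    c->head = (c->head + 1) % c->qsize;      /* move to the next free element */

    if (c->plugin_sleeping)     /* OOB kick: tell the plugin where data waits */
      (void) write(c->oob_fd, &slot, sizeof(slot));
  }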
The 'plugin_pipe_size' configuration directive allows manual tuning of the circular queue
size; raising its size is vital when facing large volumes of traffic, because the amount of
data pushed onto the queue is directly (linearly) proportional to the number of packets
captured by the core process. A small additional space is allocated for the out-of-band
signalling mechanism, which is UNIX pipe-based. 'plugin_buffer_size' defines the transfer
buffer size and is disabled by default. Its value has to be <= the circular queue size,
hence the queue will be divided into 'plugin_pipe_size'/'plugin_buffer_size' chunks. Let's
write down a few simple equations:
dss = Default Segment Size = 4MB
dbs = Default Buffer Size = sizeof(struct pkt_data)
as = Address Size = sizeof(char *) (depends on hardware architecture)
bs = 'plugin_buffer_size' value
ss = 'plugin_pipe_size' value
a) no 'plugin_buffer_size' and no 'plugin_pipe_size':
circular queue size = 4MB
signalling queue size = (4MB / dbs) * as
b) 'plugin_buffer_size' defined but no 'plugin_pipe_size':
circular queue size = 4MB
signalling queue size = (4MB / bs) * as
c) no 'plugin_buffer_size' but 'plugin_pipe_size' defined:
circular queue size = ss
signalling queue size = (ss / dbs) * as
d) 'plugin_buffer_size' and 'plugin_pipe_size' defined:
circular queue size = ss
signalling queue size = (ss / bs) * as
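As a worked example, take case d) on a 64-bit system (as = 8 bytes) with the buffer and pipe
sizes suggested later in this chapter, ie. a 'plugin_buffer_size' of 10KB and a
'plugin_pipe_size' of 10MB:

  circular queue size   = 10MB
  signalling queue size = (10MB / 10KB) * 8 = 1024 * 8 = 8KB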
If 'plugin_buffer_size' is not defined, it is set to the minimum size possible in order
to contain one element worth of data for the selected aggregation method. The signalling
queue must be able to address the full circular buffer: should that not be possible due
to OS restrictions, ie. on Linux systems /proc/sys/net/core/[rw]mem_max and
/proc/sys/net/core/[rw]mem_default, a log message is printed (along the lines of "ctrl
channel: obtained=<X> bytes target=<Y> bytes", where Y is what is needed in order to
address the circular queue and X is what the underlying OS conceded).
A few final remarks: a) a buffer size of 10KB and a pipe size of 10MB are well-tailored for
most common environments; b) with buffering enabled, attaching the collector to a mute
interface and doing some pings will not show any immediate result (... the data is being
buffered); c) pay attention to the ratio between the buffer size and the pipe size: choose
a ratio not less than 1:100, though keeping it around 1:1000 is strongly advisable; selecting
a reduced ratio could lead to filling the queue: when that happens, a warning message
indicating the risk of data loss is printed. You may alternatively do some calculations
based on the knowledge of your network environment:
average_traffic = packets per second in your network segment
sizeof(struct pkt_data) = ~70 bytes
pipe size >= average_traffic * sizeof(struct pkt_data)
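For example, assuming an average of 50K packets per second on the monitored segment:

  pipe size >= 50000 * 70 bytes = ~3.5MB

that is, roughly one second worth of sustained traffic; sizing the pipe a few times larger
than this lower bound gives the plugins proportionally more headroom before the queue fills.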
circular queue
[ pmacctd/core ] =================================> [ pmacctd/plugin ]
| | | |
| | enqueued buffers free | |
| |==|==|==|==|==|==|==|===========| |
| |
`-------------------------------------------'
OOB signalling queue
VI. Memory table plugin
The In-Memory Table plugin (IMT) stores the aggregates, as they have been assembled by the
core process, in a memory structure organized as a hash table. Such table is divided into
a number of buckets. Aggregates are framed into a structure defined as 'struct acc' and
then direct-mapped to a bucket by means of a modulo function. Collisions in each bucket
are resolved by building collision chains. An auxiliary structure, an LRU cache (Last
Recently Used), is provided to speed up searches and updates in the main table. The LRU
cache saves the last updated or searched element for each bucket: when a new operation on
the bucket is required, the LRU cache is checked first; if it doesn't match, the collision
chain gets traversed.
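A minimal C sketch of such a lookup follows, assuming simplified structures that do not
mirror pmacct's actual 'struct acc' handling:

  #include <stddef.h>

  /* Illustrative lookup into the in-memory table: hash the aggregation key,
     map it to a bucket with a modulo, check the per-bucket LRU cache first,
     then walk the collision chain. Names are simplified, not pmacct's own. */
  struct acc_entry {
    unsigned int key_hash;           /* hash of the aggregation primitives     */
    struct acc_entry *next;          /* collision chain                        */
    /* ... primitives and bytes/packets counters ... */
  };

  struct imt_table {
    struct acc_entry **buckets;      /* 'imt_buckets' chain heads              */
    struct acc_entry **lru;          /* last searched/updated entry per bucket */
    unsigned int num_buckets;
  };

  struct acc_entry *imt_lookup(struct imt_table *t, unsigned int key_hash)
  {
    unsigned int bucket = key_hash % t->num_buckets;   /* direct mapping       */
    struct acc_entry *e = t->lru[bucket];

    if (e && e->key_hash == key_hash) return e;        /* LRU cache hit        */

    for (e = t->buckets[bucket]; e; e = e->next)       /* walk collision chain */
      if (e->key_hash == key_hash) {
        t->lru[bucket] = e;                            /* refresh LRU cache    */
        return e;
      }

    return NULL;   /* not found: caller allocates a new entry from a pool      */
  }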
It's advisable to use a prime number of buckets (defined by the 'imt_buckets' configuration
directive), because it helps in achieving better data dispersion when applying the modulo
function. Collision chains are organized as linked lists of elements, so they should be
kept short because of the linear search over them; having a flat table (so, raising the
number of buckets) helps in keeping chains short. Memory is allocated in large chunks,
called memory pools, to limit as much as possible bad effects (such as thrashing) derived
from dispersion through the memory pages. In fact, the drawbacks of dense use of malloc()
calls are extensively described in every Operating Systems textbook. Memory allocations
are tracked via a linked list of chunk descriptors (struct memory_pool_desc) for later
jobs such as freeing unused memory chunks, operations over the list, etc.
The memory structure can be allocated either 'fixed' or 'dynamic'; when dealing with a
fixed table, all descriptors are allocated in a single shot when the daemon is fired up;
when dealing with a 'dynamic' memory table (which is allowed to grow indefinitely in
memory), new chunks of memory are allocated and added to the list during execution.
Using a fixed table places a maximum limit on the number of entries the table is able
to store; the following calculation may help in dimensioning a fixed table:
ES (Entry Size) ~ 50 bytes
NE (Number of entries)
NE = ('imt_mem_pools_number' * 'imt_mem_pools_size') / ES
Default values are: imt_mem_pools_number = 16, imt_mem_pools_size = 8192; this lets the
default fixed table contain a maximum of slightly more than 2600 aggregates. However,
note that the entry size is indicative and can vary considerably, ie. depending on whether
IPv6 or Layer2 support is enabled at compile time or whether BGP, MPLS, NAT, etc. primitives
are in use as part of the aggregation key. When a fixed size table is needed, it is
better to constrain it by its overall size rather than by the estimated number of entries
to fit.
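For example, to dimension a fixed table able to hold roughly one million aggregates at the
indicative ~50 bytes per entry (illustrative values; the real entry size depends on the
aggregation key, as noted above):

  imt_mem_pools_size   = 8192
  imt_mem_pools_number = 6400
  NE = (6400 * 8192) / 50 = ~1,048,000 entries (about 52MB of memory)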
The IMT plugin does not rely in any way on the realloc() function, but only on mmap(). The
table grows and shrinks with the help of the tracking structures described above. This is
because of a few assumptions about how realloc() works:
(a) it tries to reallocate in place on the original memory block and (b) if (a) fails, it
allocates another memory block and copies the contents of the original block to the new
location. In this scheme (a) can be done in constant time; in (b) only the allocation of
the new memory block and the deallocation of the original block are done in constant time,
while the copy of the previous memory area, for large in-memory tables, could perform
horrendously.
Data stored into the memory structure can be either fetched, erased or zeroed by a client
tool, pmacct, communicating through a Unix Domain socket (/tmp/collect.pipe by default).
The available queries are 'bulk data retrieval', 'group data retrieval' (partial match),
'single entry retrieval' (exact match) and 'erase table'. Additionally, both partial and
full matches may supply a request for resetting the counters.
On the server side, the client query is evaluated: requests that need just a short stroll
through the memory structure are accomplished by the plugin itself, while the others (for
example batch queries or bulk data retrieval) are served by a child process spawned by the
plugin. Because the memory table is allocated 'shared', operations requiring table
modifications by such a child (eg. resetting counters for an entry) are handled by raising
a flag instead: next time the plugin updates that entry, it will also serve any pending
request.
With the introduction of batch queries (which allow grouping up to 4096 requests into a
single query), transfers may be fragmented by the Operating System. The IMT plugin takes
care of recomposing all fragments, also expecting a '\x4' placeholder as 'End of Message'
marker. If an incomplete message is received, it is discarded as soon as the current
transfer timeout expires (1s).
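A minimal C sketch of that reassembly loop follows, under the assumption of a simplified
blocking read (the real plugin drives this with its own polling and timeout handling):

  #include <string.h>
  #include <sys/types.h>
  #include <unistd.h>

  /* Illustrative reassembly of a fragmented batch query on the IMT server
     side: keep reading from the client socket until the '\x4' End of Message
     marker shows up. The real plugin also enforces the 1s transfer timeout. */
  int read_batch_query(int fd, char *buf, size_t buflen)
  {
    size_t len = 0;
    ssize_t n;

    while (len < buflen) {
      n = read(fd, buf + len, buflen - len);
      if (n <= 0) return -1;                   /* error or peer closed: discard */
      len += (size_t) n;
      if (memchr(buf, '\x4', len)) return 0;   /* complete message received     */
    }

    return -1;                                 /* buffer full, no EOM marker    */
  }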
VII. SQL issues and *SQL plugins
Storing aggregates into a persistent backend leaves room for advanced operations, and so
these plugins are intended to offer a wider range of features (eg. writing to a backup
DB if the primary DB fails, counters breakdown, etc.) not available in other plugins.
Let's first give a whole picture of how these SQL plugins work. As packets received
from the core process via the communication channel get unpacked, they are inserted into
a direct-mapped cache; then, at fixed time intervals (configurable via the
'sql_refresh_time' key) the cache scanner kicks in and purges content to the database;
optionally, triggers may be executed (sql_trigger_exec, sql_trigger_time). Data to cache
bucket mapping is computed via a modulo function. If the selected bucket already contains
valid data then a conflict chain is built (or traversed if it already exists); the first
free node encountered is used; if no free nodes are found then two more chances are
explored: if any node has been marked as stale (which happens when an allocated node is
unused for some consecutive time-bins) it is then reused by moving it away from its old
chain; if no free nodes are available then a new one is allocated. Stale nodes are then
retired if they still remain unused for longer times (RETIRE_TIME**2). To speed up node
reuse and retirement, an additional LRU list of nodes is also maintained.
If out of memory or if the maximum number of allowed elements in the cache is reached,
data is immediately written to the database so as to make room for further elements. The
maximum number of allowed elements is defined to prevent the cache from growing in memory
without any control. Such limit is internally calculated as:
max cache elements = sql_cache_entries + ( sql_refresh_time * 100 )
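For example, with an illustrative sql_cache_entries of 32771 buckets and a sql_refresh_time
of 60 seconds:

  max cache elements = 32771 + (60 * 100) = 38771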
As said before, aggregates are pushed into the DB at regular intervals; to speed up this
operation, a queue of pending queries is built as nodes are used; this avoids long walks
through the whole cache structure given that, for various reasons (ie. classification,
sql_startup_delay), not all elements might be eligible for purging.
When the cache scanner kicks in, a new writer process is spawned and put in charge of
processing the pending elements queue; SQL statements are built and sent to the RDBMS.
Writers can, depending on the conditions of the DB, take a long time to complete, ie.
longer than the pmacct interval for purging to the DB, sql_refresh_time. The
sql_max_writers feature allows imposing a maximum number of writers to prevent forming an
endless queue, hence starving system resources, at the expense of data loss.
Because we don't know, at this point, whether an INSERT query would create duplicates, an
UPDATE query is launched first and, only if no rows are affected, an INSERT query is
issued. 'sql_dont_try_update' twists this behaviour and skips directly to INSERT queries;
when enabling this configuration directive, you must be sure there is no risk of duplicate
aggregates in order to avoid data loss.
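To make the logic concrete, a minimal C sketch follows; db_exec() is a hypothetical helper,
not pmacct's actual SQL abstraction, and the column values are placeholders:

  /* Minimal sketch of the UPDATE-then-INSERT fallback; db_exec() is a
     hypothetical helper returning the number of affected rows (< 0 on error)
     and does not belong to pmacct's actual SQL layer. */
  extern int db_exec(void *db, const char *query);

  int sql_store_aggregate(void *db, int dont_try_update)
  {
    int affected = 0;

    if (!dont_try_update) {           /* default behaviour: try UPDATE first */
      affected = db_exec(db, "UPDATE acct SET packets = packets + 10, "
                             "bytes = bytes + 1500 WHERE ip_src = '...'");
      if (affected < 0) return -1;    /* DB error */
    }

    if (affected == 0)                /* no row was updated: INSERT instead  */
      return db_exec(db, "INSERT INTO acct (ip_src, packets, bytes) "
                         "VALUES ('...', 10, 1500)");

    return 0;                         /* existing row updated in place       */
  }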
Data in the cache is never erased but simply marked as invalid; this way, while correctness
of data is still preserved, we avoid wasting CPU cycles (if there is no immediate need to
free memory up).
The number of cache buckets is tunable via the 'sql_cache_entries' configuration key; a
prime number is strongly advisable to ensure a better data dispersion through the cache.
Three notes about the above described process: (a) some time ago the concept of lazy data
refresh deadlines was introduced. Expiration of timers is checked not with the aid of
UNIX signals but when new data comes in. If the data arrival rate is low, data is not
kept stale in the cache: a poll() timeout makes the wheel spin. (b) the SQL plugins'
main loop has been purposely kept sufficiently fast thanks to having no direct interaction
with the RDBMS: it only gets data, computes the modulo and handles both the cache and the
queries queue. (c) the cache has been designed to exploit a kind of temporal locality in
internet flows. A picture follows:
|====> [ cache ] ===|
pipe | |
======> [ pmacctd/SQL plugin ] =====|====> [ cache ] ===|=============+=====| primary DB |=====>
| | |
|====> [ cache ] ===| `-----| secondary DB |===>
Now, let's keep an eye on how aggregates are structured on the DB side. Data is simply
organized in flat tuples, without any external references. After not being fully convinced
by better normalized solutions aimed at satisfying an abstract concept of flexibility, we've
found (and here comes into play the load of mails exchanged with Wim Kerkhoff) that simple
means faster. And spinning the wheel quickly is a key achievement, because pmacct needs not
only to insert new records but also to update existing ones, putting the RDBMS under heavy
pressure when placed in busy network environments and when a high number of primitives is
required.
Now a pair of concluding practical notes: (a) the default SQL table and its primary key are
suitable for many normal usages, however unused fields will be filled with zeroes. We took
this choice a long time ago to allow people to compile the sources and quickly get involved
in the game, without caring too much about SQL details (assumption: whoever is involved in
network management should not necessarily have to be involved in SQL matters as well). So,
everyone with a busy network segment under his feet has to carefully tune the table himself
to avoid performance constraints; the 'sql_optimize_clauses' configuration key evaluates
which primitives have been selected and avoids long 'WHERE' clauses in 'INSERT' and 'UPDATE'
queries. This may involve the creation of auxiliary indexes to let 'UPDATE' queries execute
smoothly. A custom table might be created, trading flexibility for disk space. (b) when
using timestamps to break down aggregates into timeframes ('sql_history' key), the validity
of data is connected not only to the data itself but also to its timeframe; as stated
before, aggregates are pushed into the DB at regular intervals ('sql_refresh_time' key).
Connecting these two elements (refresh time and historical timeframe width) with a
multiplicative factor helps in avoiding transient cache aliasing phenomena and in fully
exploiting cache benefits.
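For instance, an illustrative combination satisfying this is a time-bin width that is an
exact multiple of the refresh interval:

  sql_history: 1h
  sql_refresh_time: 300

Here each one-hour time-bin gets purged to the DB in twelve 300-second rounds.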
VIII. pmacctd flows counter implementation
Let's take the definition of IP flows from RFC3954, titled 'Cisco Systems NetFlow Services Export
Version 9': an IP flow is defined as a set of IP packets passing an observation point in the network
during a certain time interval. All packets that belong to a particular flow have a set of common
properties derived both from the data contained in the packet and from the packet treatment at the
observation point. Packets belonging to a specific flow also feature a very high temporal locality.
While the handmade IP flows implementation in pmacctd maintains the aforementioned
properties, it behaves quite differently when compared with NetFlow. In fact, the typical
NetFlow implementation accumulates packets belonging to the same flow into a single memory
object; when the flow comes to expire (because it either hits a timeout or an intercepted
quit message, ie. TCP FIN/RST), the object is released and pushed into a NetFlow packet
which is in turn sent to the collector. On the contrary, pmacctd does not accumulate; each
packet is looked up against the flow buffer, a memory structure for active flows
bookkeeping: if it belongs to an already active flow, its 'new flow' flag is deactivated
(0); otherwise it's activated (1).
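A minimal C sketch of this behaviour follows, with simplified names that do not mirror
ip_flow.c:

  /* Illustrative take on pmacctd's flows counting: no per-flow accumulation,
     just a lookup in the active-flows buffer and a 'new flow' flag attached
     to each packet. Names are simplified, not those of ip_flow.c. */
  struct flow_key {
    unsigned int ip_src, ip_dst;
    unsigned short port_src, port_dst;
    unsigned char proto;
  };

  extern int  flow_buffer_lookup(const struct flow_key *k);  /* 1 if active  */
  extern void flow_buffer_insert(const struct flow_key *k);

  int set_new_flow_flag(const struct flow_key *k)
  {
    if (flow_buffer_lookup(k)) return 0;   /* known flow: 'new flow' flag = 0 */

    flow_buffer_insert(k);                 /* first packet seen for this flow */
    return 1;                              /* 'new flow' flag = 1             */
  }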
While the above method is economical in terms of resource consumption, it can have some
side-effects: for example, it causes an entry to have a flow value of '0' after either a
reset of the backend memory structure (ie. pmacct -e, pmacct ... -r, etc.) or the beginning
of a new timeframe when historical accounting is enabled (ie. print plugin, 'sql_history',
etc.).
IX. Classifier and connection tracking engines
Classification and connection tracking features were introduced in the pmacctd and uacctd
daemons as early as the 0.10.0 release. As of pmacct 1.7, classification has switched from
the mixed home-grown implementation + L7 layer project to the nDPI library. First, let's
take a look at the global picture; then at how they work:
----[ pmacctd loop ]-------------------------------------------------------------
| [ regular ] |
| [ expression ] |
| ___[ patterns ] |
| / / |
| / ______/ |
| | / |
| | / |
| [ fragment ] [ flow ] [ flow ] [ connection ] |
| ... ==>[ handling ]==>[ handling ]==>[ classification ]==>[ tracking ]==> ... |
| [ engine ] [ engine ] [ engine ] [ engine ] |
| ip_frag.c ip_flow.c nDPI library \ conntrack.c |
| | \___ |
| \ \ |
| \ [ shared ] |
| --[ object ] |
| [ pattern ] |
---------------------------------------------------------------------------------
As the above picture shows, the classification engine is hooked to the flow handling engine.
In fact, being able to successfully classify single packets means we can mark accordingly
the whole bi-directional flow (also referred to as stream) they belong to. The flow engine
determines whether a flow is either newly established or terminated, sets its timeouts per
protocol and state, and handles timestamps. The classification engine coordinates the
efforts to classify the packets by setting a maximum number of classification attempts,
handling bytes/packets accumulators for (still) unknown flows and attaching connection
tracking modules whenever required. In case of successful classification, accumulators are
released and sent to the active plugins which, in turn, whenever possible (ie. counters have
not been cleared, sent to the DB, etc.) will move such amounts from the 'unknown' class to
the newly determined one.
A connection tracking module might be assigned to certain classified streams if they belong
to a protocol which is known to be based on a control channel (ie. FTP, RTSP, SIP, H.323,
etc.). However, some protocols (ie. MSN Messenger) spawn data channels that can still be
distinguished because of some regular patterns in the payload; in such cases a classifier
exists rather than a tracking module. Connection tracking modules hint at IP address/port
pairs for upcoming data streams as signalled by one of the parties on the control channel;
such pieces of information are then used to classify the new data streams.
In this context, the 'snaplen' directive, which specifies the maximum number of bytes to
capture for each packet, is of key importance. In fact, some protocols (mostly text-based,
eg. RTSP, SIP, etc.) benefit from extra bytes because they give more chances to identify
new data streams spawned by the control channel. But it must also be noted that capturing
a larger packet portion requires more system resources. Thus, the right value needs to be
traded off. When enabling classification, values under 200 bytes are often meaningless;
500-750 bytes should be enough even for text-based protocols.
X. Jumps and flow diversions in Pre-Tagging infrastructure
The Pre-Tagging infrastructure allows flexible marking of either packets or flows collected
from the network. It's basically a chain of rules: the first matching rule wins (ie. tags
the collected data). This approach is effective in maintaining a 1-to-1 relationship between
data and tags; but it is somewhat limiting in some scenarios, ie. the need to account
internal traffic to both the sender and the receiver: in this case you would need a 1-to-2
relationship. In this context, being able to handle 1-to-2 relationships becomes a
requirement when sampling comes into play, as any single sample is required to be assigned
to both parties in order for the algorithms to work correctly. Handling such 1-to-2
relationships is precisely the aim of the 'jeq', 'stack' and 'return' keys.
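A hedged sketch of how such a chain might look in a 'pre_tag_map' follows; the networks,
tag values and label name are purely illustrative, and the authoritative map grammar is the
one documented in CONFIG-KEYS and in the examples shipped with the sources:

  ! tag by sender, then jump and also evaluate the receiver side
  set_tag=100   filter='src net 192.168.1.0/24'   jeq=eval_receiver
  set_tag=200   filter='src net 192.168.2.0/24'   jeq=eval_receiver
  set_tag=1000  filter='dst net 192.168.1.0/24'   label=eval_receiver  stack=sum
  set_tag=2000  filter='dst net 192.168.2.0/24'   stack=sum

With such a map, a packet flowing from the first network to the second would end up tagged
100 + 2000 = 2100, encoding both parties in a single composite tag.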
XI. BGP daemon thread dimensioning
The memory structure of the BGP daemon thread can be broken down into three modules: IP
prefixes, BGP information and BGP attributes. Up to version 0.12.3 the multi-RIB nature of
the BGP daemon was achieved by separating IP prefixes and BGP information on a per active
BGP peer basis, while BGP attributes were shared among all the peers. From a logical point
of view these structures related to each other as follows: IP prefix -(1:1)-> BGP
information -(N:1)-> BGP attributes.
Version 0.12.4 sees the introduction of a memory model by which IP prefixes are also shared
among the BGP peers, leading to consistent memory savings whenever multiple BGP peers export
full tables, due to the almost total overlap of information. In this new model, structures
relate to each other as follows: IP prefix -(1:N)-> BGP information -(N:1)-> BGP attributes.
The longest-match nature of IP lookups required raising the BGP peer awareness of the lookup
algorithm in order to fully support the newly established 1:N relation.
Following are some calculations that can ease memory dimensioning in order to support a
certain number of peers, num(B), each exporting the same full routing table, T, which
consists of num(P) prefixes. The shared base of BGP attributes, A, is estimated at roughly
200MB. The BGP information base, I, is also calculated.
sz(P) = 16 (size of an IP prefix)
sz(Tn) = 48 (size of routing table node)
sz(In) = 32 (size of BGP information base node)
sz(THn) = 8 (size of routing table hash node)
T = num(P) * (sz(Tn) + sz(P) + (sz(THn) * bgp_table_peer_buckets))
I = num(B) * num(P) * sz(In)
RIB = T + I + A
Sizes are based on a worst-case scenario, ie. a 64-bit executable with support for IPv6
compiled in. num(P) is a way to simplify calculations while retaining good accuracy;
higher precision can be achieved by using union(P), which is the union of all prefixes
exported by all BGP peers. bgp_table_peer_buckets is set by default to 13 and it is
advisable to keep its value at about 1/10 of the expected number of BGP peers; increasing
such ratio improves lookup performance at the expense of more sustained memory usage.
As an example, let's imagine a scenario where 500 BGP peers export the same 500K IPv4 routes and
50K IPv6 routes and bgp_table_peer_buckets is set to a value of 50:
T = (500K+50K) * (48 + 16 + (8 * 50)) = 256MB
I = 500 * (500K+50K) * 32 = 8.8GB
A = ~200MB
RIB = 256MB + 8.8GB + 200MB = ~9.26GB
Still, this example assumes a 64-bit executable. For an educated guess on how 32-bit
executables would look, it's sufficient to halve the output of the sz() functions presented
above.