.. include:: replace.txt
.. highlight:: bash

Profiling
---------

Memory profiling is essential to identify issues that
may cause memory corruption, which can lead to all sorts of
side effects, such as crashes after many hours of simulation or
wrong results that invalidate the entire simulation.

It can also help track down sources of excessive memory allocations,
the size of these allocations and the memory usage during the simulation.
These can degrade simulation performance, or limit the complexity
and the number of concurrent simulations.

Performance profiling on the other hand is essential for
high-performance applications, as it allows for the identification
of bottlenecks and their mitigation.

Another type of profiling is related to system calls. System call
profiling can be used to debug issues and identify hotspots that
may cause performance problems under specific conditions. Excessive
system calls result in more context switches, which interrupt the
simulations, ultimately slowing them down.

Besides profiling the simulations themselves, which can highlight bottlenecks
in the simulator, we can also profile the compilation process.
This allows us to identify and fix bottlenecks, speeding up
build times.


Memory Profilers
****************

.. _Sanitizers : https://github.com/google/sanitizers/wiki
.. _Heaptrack : https://apps.kde.org/heaptrack/
.. _Valgrind: https://valgrind.org/
.. _MacOS's leaks: https://developer.apple.com/library/archive/documentation/Performance/Conceptual/ManagingMemory/Articles/FindingLeaks.html
.. _Bytehound : https://github.com/koute/bytehound
.. _gperftools : https://github.com/gperftools/gperftools

Memory profilers are tools that help identify memory-related issues.

There are two well-known tools for finding bugs such as use of uninitialized memory,
out-of-bounds accesses, null pointer dereferences and other memory-related errors:

* `Valgrind`_

  * Pros: very rich tooling, with no need to recompile the program being profiled.
  * Cons: very slow and limited to Linux and macOS.
* `Sanitizers`_

  * Pros: sanitizers are distributed along with compilers, such as GCC, Clang and MSVC.
    They are widely available, cross-platform and faster than Valgrind.
  * Cons: false positives, high memory usage, and the fact that the memory sanitizer is incompatible
    with other sanitizers (e.g. the address sanitizer), requiring two instrumented
    compilations and two test runs. The memory sanitizer requires Clang.

There are also tools to count memory allocations, track memory usage and memory leaks,
such as: `Heaptrack`_, `MacOS's leaks`_, `Bytehound`_ and `gperftools`_.

An overview on how to use `Valgrind`_, `Sanitizers`_ and
`Heaptrack`_ is provided in the following sections.

Valgrind
++++++++

.. _manual : https://valgrind.org/docs/manual/mc-manual.html

`Valgrind`_ is a suite of profiling tools, the main one being Memcheck.
To check for memory errors, including leaks, one can call valgrind directly:

.. sourcecode:: console

    valgrind --leak-check=yes ./relative/path/to/program argument1 argument2

Or use the ``ns3`` script:

.. sourcecode:: console

    ./ns3 run "program argument1 argument2" --valgrind

Additional Valgrind options are listed in its `manual`_.
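
As a minimal illustration (a hypothetical snippet, not ns-3 code), the following program
contains two of the error classes mentioned earlier: an out-of-bounds heap read and a
memory leak. Running it under ``valgrind --leak-check=yes`` typically reports both,
each with a stack trace pointing at the offending line.

.. sourcecode:: cpp

    #include <iostream>

    int main()
    {
        int* values = new int[10];
        for (int i = 0; i < 10; i++)
        {
            values[i] = i;
        }
        // Invalid read: one element past the end of the heap block.
        std::cout << values[10] << std::endl;
        // Leak: 'values' is never deleted, so Memcheck reports the block as lost at exit.
        return 0;
    }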

Sanitizers
++++++++++

`Sanitizers`_ are a suite of libraries made by Google and part of the LLVM project,
used to profile programs at runtime and find issues related to undefined behavior,
memory corruption (out-of-bounds accesses, uninitialized memory use), leaks, race
conditions and others.

Sanitizers are shipped with most modern compilers and can be used by instructing the
compiler to link the required libraries and instrument the code.

To build ns-3 with sanitizers, enable the ``NS3_SANITIZE`` option. This can be done
directly via CMake:

.. sourcecode:: console

    ~/ns-3-dev/cmake_cache/$ cmake -DNS3_SANITIZE=ON ..

Or via the ``ns3`` wrapper:

.. sourcecode:: console

    ~/ns-3-dev$ ./ns3 configure --enable-sanitizers

.. _issue 166 : https://gitlab.com/nsnam/ns-3-dev/-/issues/166
.. _MR703 : https://gitlab.com/nsnam/ns-3-dev/-/merge_requests/703
.. _MR704 : https://gitlab.com/nsnam/ns-3-dev/-/merge_requests/704

The memory sanitizer can be enabled with ``NS3_SANITIZE_MEMORY``, but it is not
compatible with ``NS3_SANITIZE`` and only works with the Clang compiler.

Sanitizers have been used to find issues on multiple occasions:

* A global buffer overflow in the LTE module

  * When the wrong index (-1) was used to access an ``int [][]`` variable, a different variable stored nearby in memory was accessed.
  * In the best-case scenario, this results in reading an incorrect value that causes the program to fail.
  * In the worst-case scenario, this value is overwritten, corrupting the program memory.
  * In the likely scenario, the wrong value is read and the program continues running, potentially producing incorrect results.

  .. sourcecode:: console

    ~/ns-3-dev/src/lte/model/lte-amc.cc:303:43: runtime error: index -1 out of bounds for type 'int [110][27]'
    =================================================================
    ==51636==ERROR: AddressSanitizer: global-buffer-overflow on address 0x7fe78cc2dbbc at pc 0x7fe78ba65e65 bp 0x7ffde70b25c0 sp 0x7ffde70b25b0
    READ of size 4 at 0x7fe78cc2dbbc thread T0
        #0 0x7fe78ba65e64 in ns3::LteAmc::GetDlTbSizeFromMcs(int, int) ~/ns-3-dev/src/lte/model/lte-amc.cc:303
        #1 0x7fe78c538aba in ns3::TdTbfqFfMacScheduler::DoSchedDlTriggerReq(ns3::FfMacSchedSapProvider::SchedDlTriggerReqParameters const&) ~/ns-3-dev/src/lte/model/tdtbfq-ff-mac-scheduler.cc:1160
        #2 0x7fe78c564736 in ns3::MemberSchedSapProvider<ns3::TdTbfqFfMacScheduler>::SchedDlTriggerReq(ns3::FfMacSchedSapProvider::SchedDlTriggerReqParameters const&) ~/ns-3-dev/build/include/ns3/ff-mac-sched-sap.h:409
        #3 0x7fe78c215596 in ns3::LteEnbMac::DoSubframeIndication(unsigned int, unsigned int) ~/ns-3-dev/src/lte/model/lte-enb-mac.cc:588
        #4 0x7fe78c20921d in ns3::EnbMacMemberLteEnbPhySapUser::SubframeIndication(unsigned int, unsigned int) ~/ns-3-dev/src/lte/model/lte-enb-mac.cc:297
        #5 0x7fe78b924105 in ns3::LteEnbPhy::StartSubFrame() ~/ns-3-dev/src/lte/model/lte-enb-phy.cc:764
        #6 0x7fe78b949d54 in ns3::MakeEvent<void (ns3::LteEnbPhy::*)(), ns3::LteEnbPhy*>(void (ns3::LteEnbPhy::*)(), ns3::LteEnbPhy*)::EventMemberImpl0::Notify() (~/ns-3-dev/build/lib/libns3-dev-lte-deb.so+0x3a9cd54)
        #7 0x7fe795252022 in ns3::EventImpl::Invoke() ~/ns-3-dev/src/core/model/event-impl.cc:51
        #8 0x7fe795260de2 in ns3::DefaultSimulatorImpl::ProcessOneEvent() ~/ns-3-dev/src/core/model/default-simulator-impl.cc:151
        #9 0x7fe795262dbd in ns3::DefaultSimulatorImpl::Run() ~/ns-3-dev/src/core/model/default-simulator-impl.cc:204
        #10 0x7fe79525436f in ns3::Simulator::Run() ~/ns-3-dev/src/core/model/simulator.cc:176
        #11 0x7fe7b0f77ee2 in LteDistributedFfrAreaTestCase::DoRun() ~/ns-3-dev/src/lte/test/lte-test-frequency-reuse.cc:1777
        #12 0x7fe7952d125a in ns3::TestCase::Run(ns3::TestRunnerImpl*) ~/ns-3-dev/src/core/model/test.cc:363
        #13 0x7fe7952d0f4d in ns3::TestCase::Run(ns3::TestRunnerImpl*) ~/ns-3-dev/src/core/model/test.cc:357
        #14 0x7fe7952e39c0 in ns3::TestRunnerImpl::Run(int, char**) ~/ns-3-dev/src/core/model/test.cc:1094
        #15 0x7fe7952e427e in ns3::TestRunner::Run(int, char**) ~/ns-3-dev/src/core/model/test.cc:1118
        #16 0x564a13d67c9c in main ~/ns-3-dev/utils/test-runner.cc:23
        #17 0x7fe793cde0b2 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x270b2)
        #18 0x564a13d67bbd in _start (~/ns-3-dev/build/utils/test-runner+0xae0bbd)
    0x7fe78cc2dbbc is located 40 bytes to the right of global variable 'McsToItbsUl' defined in '~/ns-3-dev/src/lte/model/lte-amc.cc:105:18' (0x7fe78cc2db20) of size 116
    0x7fe78cc2dbbc is located 4 bytes to the left of global variable 'TransportBlockSizeTable' defined in '~/ns-3-dev/src/lte/model/lte-amc.cc:118:18' (0x7fe78cc2dbc0) of size 11880
    SUMMARY: AddressSanitizer: global-buffer-overflow ~/ns-3-dev/src/lte/model/lte-amc.cc:303 in ns3::LteAmc::GetDlTbSizeFromMcs(int, int)
    Shadow bytes around the buggy address:
      0x0ffd7197db50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 04 f9
      0x0ffd7197db60: f9 f9 f9 f9 00 00 00 00 00 00 00 00 00 00 00 00
    =>0x0ffd7197db70: 00 00 04 f9 f9 f9 f9[f9]00 00 00 00 00 00 00 00
    Shadow byte legend (one shadow byte represents 8 application bytes):
      Addressable:           00
      Partially addressable: 01 02 03 04 05 06 07
      Global redzone:        f9
    ==51636==ABORTING

  * The output above shows the type of error (``global-buffer-overflow``),
    the stack trace of where the bug happened (``LteAmc::GetDlTbSizeFromMcs``),
    the affected variables (``McsToItbsUl`` and ``TransportBlockSizeTable``),
    and a shadow bytes map, showing the wrong access between square brackets.
  * The global redzone (f9) shadow bytes are empty memory allocated between the global variables (00s and 04s),
    left there to be corrupted by the buggy program.
    Any eventual corruption is then traced back to its source, without affecting the program execution.
  * The adopted solution in merge request `MR703`_ was to fix one of the schedulers that could produce the index value of -1,
    and to update the asserts to catch the illegal index value
    (a simplified sketch of this pattern is shown after this list).

* A wrong downcast in the wimax module (note: wimax is no longer a module within the ns-3 mainline):

  * The pointer was incorrectly cast to U16TlvValue instead of U8TlvValue, which can have different sizes in memory,
    leading the program to read the wrong memory address.
    Reading the wrong memory address can result in unexpected or invalid values being read, which could change the
    program flow and corrupt memory, producing wrong simulation results or crashing the program.

  .. sourcecode:: console

    ~/ns-3-dev/src/wimax/model/service-flow.cc:159:86: runtime error: downcast of address 0x6020000148b0 which does not point to an object of type 'U16TlvValue'
    0x6020000148b0: note: object is of type 'ns3::U8TlvValue'
    48 00 00 36  c8 09 02 62 5c 7f 00 00  00 be be be be be be be  03 00 00 00 00 00 00 04  10 00 00 00
                  ^~~~~~~~~~~~~~~~~~~~~~~
                  vptr for 'ns3::U8TlvValue'
    ~/ns-3-dev/src/wimax/model/service-flow.cc:159:99: runtime error: member call on address 0x6020000148b0 which does not point to an object of type 'U16TlvValue'
    0x6020000148b0: note: object is of type 'ns3::U8TlvValue'
    48 00 00 36  c8 09 02 62 5c 7f 00 00  00 be be be be be be be  03 00 00 00 00 00 00 04  10 00 00 00
                  ^~~~~~~~~~~~~~~~~~~~~~~
                  vptr for 'ns3::U8TlvValue'
    ~/ns-3-dev/src/wimax/model/wimax-tlv.cc:589:10: runtime error: member access within address 0x6020000148b0 which does not point to an object of type 'U16TlvValue'
    0x6020000148b0: note: object is of type 'ns3::U8TlvValue'
    48 00 00 36  c8 09 02 62 5c 7f 00 00  00 be be be be be be be  03 00 00 00 00 00 00 04  10 00 00 00
                  ^~~~~~~~~~~~~~~~~~~~~~~
                  vptr for 'ns3::U8TlvValue'

  * The bug was fixed with the correct cast in merge request `MR704`_.
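
The index bug above can be reduced to the following hypothetical sketch (not the actual
ns-3 sources): a global lookup table indexed by a value that can legitimately be -1,
which must never reach the table. Compiled with the address sanitizer, the unguarded
access is reported as a ``global-buffer-overflow``; the asserts mirror the fix adopted
in `MR703`_, catching the illegal index before the table is read.

.. sourcecode:: cpp

    #include <cassert>

    // Global 2D lookup table, similar in spirit to TransportBlockSizeTable.
    int g_table[110][27] = {};

    int GetTbSize(int mcs, int rbgSize)
    {
        // Without these checks, mcs == -1 reads the memory right before the row,
        // which AddressSanitizer flags as a global-buffer-overflow.
        assert(mcs >= 0 && mcs < 27 && "illegal MCS index");
        assert(rbgSize >= 1 && rbgSize <= 110 && "illegal RBG size");
        return g_table[rbgSize - 1][mcs];
    }

    int main()
    {
        return GetTbSize(0, 1);
    }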


Heaptrack
+++++++++

.. _KDE : https://kde.org

`Heaptrack`_ is a utility made by `KDE`_ to trace memory allocations
along with stack traces, allowing developers to identify the code responsible
for possible memory leaks and unnecessary allocations.

For the examples below we used the default configuration of ns-3,
with the output going to the ``build`` directory. The actual executable
for the ``wifi-he-network`` example is ``./build/examples/wireless/ns3-dev-wifi-he-network``, which is what is
executed by ``./ns3 run wifi-he-network``.

To collect information about a program (in this case the ``wifi-he-network``
example), run:

.. sourcecode:: console

   ~ns-3-dev/$ heaptrack ./build/examples/wireless/ns3-dev-wifi-he-network --simulationTime=0.3 --frequency=5 --useRts=1 --minExpectedThroughput=6 --maxExpectedThroughput=745

If you prefer to use the ``ns3`` wrapper, try:

.. sourcecode:: console

   ~ns-3-dev/$ ./ns3 run "wifi-he-network --simulationTime=0.3 --frequency=5 --useRts=1 --minExpectedThroughput=6 --maxExpectedThroughput=745" --heaptrack --no-build

In both cases, heaptrack will print the name of the output file to the terminal:

.. sourcecode:: console

    ~ns-3-dev/$ ./ns3 run "wifi-he-network --simulationTime=0.3 --frequency=5 --useRts=1 --minExpectedThroughput=6 --maxExpectedThroughput=745" --heaptrack --no-build
    heaptrack output will be written to "~ns-3-dev/heaptrack.ns3-dev-wifi-he-network.210305.zst"
    starting application, this might take some time...
    MCS value               Channel width           GI                      Throughput
    0                       20 MHz                  3200 ns                 5.91733 Mbit/s
    0                       20 MHz                  1600 ns                 5.91733 Mbit/s
    ...
    11                      160 MHz                 1600 ns                 479.061 Mbit/s
    11                      160 MHz                 800 ns                  524.459 Mbit/s
    heaptrack stats:
            allocations:            149185947
            leaked allocations:     10467
            temporary allocations:  21145932
    Heaptrack finished! Now run the following to investigate the data:

        heaptrack --analyze "~/ns-3-dev/heaptrack.ns3-dev-wifi-he-network.210305.zst"

The output above shows a summary of the collected stats: ~149 million allocations,
~21 million temporary allocations and ~10 thousand possibly leaked allocations.

If ``heaptrack-gui`` is installed, running ``heaptrack --analyze`` will launch it. If it is not installed,
the command-line interface will be used.

.. sourcecode:: console

    ~/ns-3-dev$ heaptrack --analyze "~/ns-3-dev/heaptrack.ns3-dev-wifi-he-network.210305.zst"
    reading file "~/ns-3-dev/heaptrack.ns3-dev-wifi-he-network.210305.zst" - please wait, this might take some time...
    Debuggee command was: ~/ns-3-dev/build/examples/wireless/ns3-dev-wifi-he-network --simulationTime=0.3 --frequency=5 --useRts=1 --minExpectedThroughput=6 --maxExpectedThroughput=745
    finished reading file, now analyzing data:

    MOST CALLS TO ALLOCATION FUNCTIONS
    23447502 calls to allocation functions with 1.12MB peak consumption from
    ns3::Packet::Copy() const
    in ~/ns-3-dev/build/lib/libns3-dev-network.so
    4320000 calls with 0B peak consumption from:
        ns3::UdpSocketImpl::DoSendTo(ns3::Ptr<>, ns3::Ipv4Address, unsigned short, unsigned char)
        in ~/ns-3-dev/build/lib/libns3-dev-internet.so
        ns3::UdpSocketImpl::DoSend(ns3::Ptr<>)
        in ~/ns-3-dev/build/lib/libns3-dev-internet.so
        ns3::UdpSocketImpl::Send(ns3::Ptr<>, unsigned int)
        in ~/ns-3-dev/build/lib/libns3-dev-internet.so
        ns3::Socket::Send(ns3::Ptr<>)
        in ~/ns-3-dev/build/lib/libns3-dev-network.so
        ns3::UdpClient::Send()
        in ~/ns-3-dev/build/lib/libns3-dev-applications.so
        ns3::DefaultSimulatorImpl::ProcessOneEvent()
        in ~/ns-3-dev/build/lib/libns3-dev-core.so
        ns3::DefaultSimulatorImpl::Run()
        in ~/ns-3-dev/build/lib/libns3-dev-core.so
        main
        in ~/ns-3-dev/build/examples/wireless/ns3-dev-wifi-he-network

    ...

    MOST TEMPORARY ALLOCATIONS
    6182320 temporary allocations of 6182701 allocations in total (99.99%) from
    ns3::QueueDisc::DropBeforeEnqueue(ns3::Ptr<>, char const*)
    in ~/ns-3-dev/build/lib/libns3-dev-traffic-control.so
    1545580 temporary allocations of 1545580 allocations in total (100.00%) from:
        std::_Function_handler<>::_M_invoke(std::_Any_data const&, ns3::Ptr<>&&, char const*&&)
        in ~/ns-3-dev/build/lib/libns3-dev-traffic-control.so
        std::function<>::operator()(ns3::Ptr<>, char const*) const
        in ~/ns-3-dev/build/lib/libns3-dev-traffic-control.so
        ns3::MemPtrCallbackImpl<>::operator()(ns3::Ptr<>, char const*)
        in ~/ns-3-dev/build/lib/libns3-dev-traffic-control.so
        ns3::TracedCallback<>::operator()(ns3::Ptr<>, char const*) const
        in ~/ns-3-dev/build/lib/libns3-dev-traffic-control.so
        ns3::QueueDisc::DropBeforeEnqueue(ns3::Ptr<>, char const*)
        in ~/ns-3-dev/build/lib/libns3-dev-traffic-control.so
        ns3::CoDelQueueDisc::DoEnqueue(ns3::Ptr<>)
        in ~/ns-3-dev/build/lib/libns3-dev-traffic-control.so
        ns3::QueueDisc::Enqueue(ns3::Ptr<>)
        in ~/ns-3-dev/build/lib/libns3-dev-traffic-control.so
        ns3::FqCoDelQueueDisc::DoEnqueue(ns3::Ptr<>)
        in ~/ns-3-dev/build/lib/libns3-dev-traffic-control.so
        ns3::QueueDisc::Enqueue(ns3::Ptr<>)
        in ~/ns-3-dev/build/lib/libns3-dev-traffic-control.so
        ns3::TrafficControlLayer::Send(ns3::Ptr<>, ns3::Ptr<>)
        in ~/ns-3-dev/build/lib/libns3-dev-traffic-control.so
        ns3::Ipv4Interface::Send(ns3::Ptr<>, ns3::Ipv4Header const&, ns3::Ipv4Address)
        in ~/ns-3-dev/build/lib/libns3-dev-internet.so
        ns3::Ipv4L3Protocol::SendRealOut(ns3::Ptr<>, ns3::Ptr<>, ns3::Ipv4Header const&)
        in ~/ns-3-dev/build/lib/libns3-dev-internet.so
        ns3::Ipv4L3Protocol::Send(ns3::Ptr<>, ns3::Ipv4Address, ns3::Ipv4Address, unsigned char, ns3::Ptr<>)
        in ~/ns-3-dev/build/lib/libns3-dev-internet.so
        ns3::UdpL4Protocol::Send(ns3::Ptr<>, ns3::Ipv4Address, ns3::Ipv4Address, unsigned short, unsigned short, ns3::Ptr<>)
        in ~/ns-3-dev/build/lib/libns3-dev-internet.so
        ns3::UdpSocketImpl::DoSendTo(ns3::Ptr<>, ns3::Ipv4Address, unsigned short, unsigned char)
        in ~/ns-3-dev/build/lib/libns3-dev-internet.so
        ns3::UdpSocketImpl::DoSend(ns3::Ptr<>)
        in ~/ns-3-dev/build/lib/libns3-dev-internet.so
        ns3::UdpSocketImpl::Send(ns3::Ptr<>, unsigned int)
        in ~/ns-3-dev/build/lib/libns3-dev-internet.so
        ns3::Socket::Send(ns3::Ptr<>)
        in ~/ns-3-dev/build/lib/libns3-dev-network.so
        ns3::UdpClient::Send()
        in ~/ns-3-dev/build/lib/libns3-dev-applications.so
        ns3::DefaultSimulatorImpl::ProcessOneEvent()
        in ~/ns-3-dev/build/lib/libns3-dev-core.so
        ns3::DefaultSimulatorImpl::Run()
        in ~/ns-3-dev/build/lib/libns3-dev-core.so
        main
        in ~/ns-3-dev/build/examples/wireless/ns3-dev-wifi-he-network

    ...

    total runtime: 156.30s.
    calls to allocation functions: 149185947 (954466/s)
    temporary memory allocations: 21757614 (139201/s)
    peak heap memory consumption: 4.87MB
    peak RSS (including heaptrack overhead): 42.02MB
    total memory leaked: 895.45KB


The terminal output above lists the most frequently called functions that allocated memory.

Here is a short description of what each line of the last block of the output means:

* Allocation functions are all functions that allocated memory, either explicitly
  via C-style ``malloc`` and C++ ``new``, or implicitly via RAII and automatic conversions.

* Temporary memory allocations are allocations that are immediately followed by their
  deallocation, without the data being modified in between (see the sketch after this list).

* Peak heap memory is the maximum memory allocated by the program throughout its execution.
  The memory allocator may reuse memory freed by previous destructor, ``delete`` and ``free`` calls,
  reducing the number of system calls and the maximum memory allocated.

* RSS is the Resident Set Size, which is the amount of physical memory occupied by the process.

* Total memory leaked refers to memory allocated but never freed. This includes static initialization,
  so it is not uncommon for it to be different from 0KB, and a nonzero value is not necessarily a real leak.
  Conversely, a small value does not mean the program is free of memory leaks. Other memory profilers,
  such as Valgrind and the memory sanitizers, are better suited to track down memory leaks.
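
As a small illustration of a temporary allocation (hypothetical code, unrelated to ns-3),
consider a function taking ``const std::string&`` that is repeatedly called with a C string:
each call heap-allocates a temporary ``std::string`` and frees it right afterwards (provided
the text is long enough to defeat the small-string optimization), which heaptrack typically
reports as temporary allocations.

.. sourcecode:: cpp

    #include <iostream>
    #include <string>

    // Hypothetical sink that only reads the string.
    void LogReason(const std::string& reason)
    {
        std::cout << reason.size() << "\n";
    }

    int main()
    {
        for (int i = 0; i < 1000; i++)
        {
            // The char* literal is converted into a temporary std::string here:
            // a heap allocation immediately followed by its deallocation.
            LogReason("packet dropped before enqueue by the traffic-control layer");
        }
        return 0;
    }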

Based on the stack trace, it is fairly easy to locate the corresponding code and act on it to
reduce the number of allocations.

.. _MR830 : https://gitlab.com/nsnam/ns-3-dev/-/merge_requests/830

In the case of ``ns3::QueueDisc::DropBeforeEnqueue`` shown above, the
allocations were caused by the conversion of C strings (``char*``) into C++ strings
(``std::string``) before performing the lookup in the ``ns3::QueueDisc::Stats`` maps.
These unnecessary allocations were prevented by using the transparent
comparator ``std::less<>``, as part of merge request `MR830`_.
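
A minimal sketch of that pattern is shown below (illustrative names, not the actual ns-3
code): declaring the map with the transparent comparator ``std::less<>`` lets ``find()``
compare the ``char*`` argument directly against the stored ``std::string`` keys, so no
temporary string is allocated for the lookup.

.. sourcecode:: cpp

    #include <cstdint>
    #include <map>
    #include <string>

    // std::less<> (note the empty angle brackets) enables heterogeneous lookup.
    std::map<std::string, std::uint64_t, std::less<>> g_dropCounters;

    void CountDrop(const char* reason)
    {
        // With the default comparator std::less<std::string>, this lookup would
        // construct a temporary std::string from 'reason' on every call.
        auto it = g_dropCounters.find(reason);
        if (it != g_dropCounters.end())
        {
            it->second++;
        }
        else
        {
            g_dropCounters.emplace(reason, 1);
        }
    }

    int main()
    {
        CountDrop("Dropped by internal queue");
        CountDrop("Dropped by internal queue");
        return 0;
    }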

Heaptrack also has a GUI that provides the same information printed by the command line interface,
but in a more interactive way.

.. image:: figures/heaptrack.png

Heaptrack was used in merge request `MR830`_ to track and reduce the number of allocations
in the ``wifi-he-network`` example mentioned above. About 29 million unnecessary allocations
were removed, which translates to a 20% reduction. This resulted in a 1.07x speedup of the
test suite with Valgrind (``./test.py -d -g``) and 1.02x speedup without it.


Memray
++++++

.. _Memray : https://bloomberg.github.io/memray/

`Memray`_ is a utility made by Bloomberg to trace memory allocations of Python programs,
including native code called by them. With the accompanying stack traces, developers can track down
possible memory leaks and unnecessary allocations.

Note: Memray is ineffective for profiling the ns-3 Python bindings, since Cppyy hides away
the calls to the ns-3 module libraries. However, it is still useful for Python scripts
in general, for example ones used to parse and consolidate simulation results.

The ``ns3`` script includes a run option to launch Python programs with Memray.
Memray can produce different types of reports, such as a flamegraph in HTML, or
text reports (``summary`` and ``stats``).

.. sourcecode:: console

    ~/ns-3-dev/$ ./ns3 run sample-rng-plot.py --memray
    Writing profile results into memray.output
    Memray WARNING: Correcting symbol for aligned_alloc from 0x7fd97023c890 to 0x7fd97102fce0
    [memray] Successfully generated profile results.

    You can now generate reports from the stored allocation records.
    Some example commands to generate reports:

    /usr/bin/python3 -m memray flamegraph memray.output
    ~/ns-3-dev$ /usr/bin/python3 -m memray stats memray.output
       Total allocations:
             5364235

       Total memory allocated:
             10.748GB

       Histogram of allocation size:
             min: 0.000B
             ----------------------------------------------
             < 8.000B   :  264149 |||
             < 78.000B  : 2051906 |||||||||||||||||||||||
             < 699.000B : 2270941 |||||||||||||||||||||||||
             < 6.064KB  :  608993 |||||||
             < 53.836KB :  165307 ||
             < 477.912KB:    2220 |
             < 4.143MB  :     511 |
             < 36.779MB :     188 |
             < 326.492MB:      19 |
             <=2.830GB  :       1 |
             ----------------------------------------------
             max: 2.830GB

       Allocator type distribution:
              MALLOC: 4647765
              CALLOC: 435525
              REALLOC: 277736
              POSIX_MEMALIGN: 2686
              MMAP: 523

       Top 5 largest allocating locations (by size):
             - include:/usr/local/lib/python3.10/dist-packages/cppyy/__init__.py:243 -> 8.814GB
             - <stack trace unavailable> -> 746.999MB
             - show:~/.local/lib/python3.10/site-packages/matplotlib/backends/backend_gtk4.py:340 -> 263.338MB
             - load_library:/usr/local/lib/python3.10/dist-packages/cppyy/__init__.py:235 -> 245.684MB
             - __init__:/usr/lib/python3.10/ctypes/__init__.py:374 -> 225.797MB

       Top 5 largest allocating locations (by number of allocations):
             - include:/usr/local/lib/python3.10/dist-packages/cppyy/__init__.py:243 -> 2246145
             - show:~/.local/lib/python3.10/site-packages/matplotlib/backends/backend_gtk4.py:340 -> 1264614
             - <stack trace unavailable> -> 1098543
             - __init__:~/.local/lib/python3.10/site-packages/matplotlib/backends/backend_gtk4.py:61 -> 89466
             - run:/usr/lib/python3/dist-packages/gi/overrides/Gio.py:42 -> 79582


Performance Profilers
*********************

.. _Perf : https://perf.wiki.kernel.org/index.php/Tutorial
.. _Hotspot : https://github.com/KDAB/hotspot
.. _AMD uProf : https://www.amd.com/en/developer/uprof.html
.. _Intel VTune : https://www.intel.com/content/www/us/en/docs/vtune-profiler/get-started-guide/2023-1/overview.html
.. _Windows Performance Toolkit : https://docs.microsoft.com/en-us/windows-hardware/test/wpt/
.. _Sysprof : https://wiki.gnome.org/Apps/Sysprof
.. _Oprofile : https://oprofile.sourceforge.io/faq/
.. _Gprof : https://sourceware.org/binutils/docs/gprof/

Performance profilers are programs that collect runtime information and help to
identify performance bottlenecks. In some cases, they can point out hotspots
and suggest solutions.

There are many tools to profile your program, including:

* profilers from CPU manufacturers, such as `AMD uProf`_ and `Intel VTune`_
* profilers from the operating systems, such as Linux's `Perf`_ and `Windows Performance Toolkit`_

  * `Perf`_ also has a few graphical user interfaces available, `Hotspot`_ being one of them
* instrumented compilation and auxiliary tools provided by compilers, such as `Gprof`_
* third-party tools, such as `Sysprof`_ and `Oprofile`_

An overview on how to use `Perf`_ with `Hotspot`_, `AMD uProf`_ and
`Intel VTune`_ is provided in the following sections.

.. _Profiling and optimization :

Profiling and optimization
++++++++++++++++++++++++++

While profilers help point out hotspots of interest, they won't help fix the issues.
At least for now...

After profiling, optimization should be done following the Pareto principle:
focus first on the least amount of work that provides the most benefit.
Let us see an example, based on ``lte-frequency-reuse``. The image
below shows a flamegraph generated from the CPU ``cycles`` metric
collected by perf for this LTE example.

.. image:: figures/perf-lte-frequency-reuse.png

The vertical axis contains stacked bars. Every bar corresponds to a function.
The stack itself represents the call stack (a function calling another,
which calls another, and so on). The width of each bar corresponds to the time it takes to
execute the function itself plus the functions it calls (the bars on top of it).
This is where we start looking for bottlenecks: start at the widest
columns at the top, and move down. As an example, we can select
``LteChunkProcessor::End()`` (highlighted in purple).

.. image:: figures/perf-chunk-processor.png

If we click on ``LteChunkProcessor::End()``, we get a closer view of
where it actually spends its time. We can see many bars related to callback
indirections. Note that each bar is slightly shorter and offset to the right.
This small difference is the overhead of each function call. When everything
is added together, it corresponds to about 35% of the total time of
``LteChunkProcessor::End()``. Yes, callbacks are really expensive,
even in optimized builds. Now, let us take a look at the big gap and the big
``LteAmc`` function being called by this callback.

.. image:: figures/perf-detail.png

Now that we are close to the top of the stack, we have a better idea of
what is actually happening and can optimize more easily.

For example, notice that about 14% of ``LteAmc::CreateCqiFeedbacks()`` is
spent in the function itself (the gap on top of the bar). This usually means one of the following:

- many conditional operations (if-else, switches), causing instruction cache-misses or pipeline stalls in case of wrong branch prediction
- accessing data in erratic patterns, causing data cache-misses (see the sketch after this list)
- waiting for a value to be computed at a congested CPU unit (e.g. slow division, trigonometric), causing pipeline stalls
- a lot of actual work, also known as retired instructions
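
As an example of the second case (hypothetical code, unrelated to the LTE model), the two
functions below sum the same matrix but touch memory in different orders: the column-major
traversal jumps thousands of bytes between consecutive accesses and typically shows up in a
profiler with far more data cache-misses, despite doing exactly the same amount of arithmetic.

.. sourcecode:: cpp

    #include <vector>

    constexpr int N = 1024;

    long SumRowMajor(const std::vector<int>& m)
    {
        long sum = 0;
        for (int i = 0; i < N; i++)
        {
            for (int j = 0; j < N; j++)
            {
                sum += m[i * N + j]; // contiguous accesses: cache-friendly
            }
        }
        return sum;
    }

    long SumColumnMajor(const std::vector<int>& m)
    {
        long sum = 0;
        for (int j = 0; j < N; j++)
        {
            for (int i = 0; i < N; i++)
            {
                sum += m[i * N + j]; // strided accesses: frequent data cache-misses
            }
        }
        return sum;
    }

    int main()
    {
        std::vector<int> matrix(N * N, 1);
        return static_cast<int>(SumRowMajor(matrix) - SumColumnMajor(matrix)); // both sums are equal
    }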

To check which case it really is, one can collect complementary metrics instead of CPU ``cycles``,
such as ``cache-misses``, ``branch-misses``, ``stalled-cycles-frontend`` and ``stalled-cycles-backend``.
It is easy to spot which one of them is the actual culprit, because its bar will be wider.
After figuring out the issue, you need to look at the code to locate its source.
Some profilers, such as `Intel VTune`_ and `AMD uProf`_, can give you the rough location of the
instructions (lines of code) causing these misses and stalls.

We also see that 86% of the time is spent elsewhere: in ``LteMiErrorModel`` and some vector-related calls.
One of these vector calls is ``push_back``, which takes an incredible 30% of the time.
Now let us look at the code to see where we can improve it.

.. sourcecode:: cpp

    std::vector<int>
    LteAmc::CreateCqiFeedbacks(const SpectrumValue& sinr, uint8_t rbgSize)
    {
        NS_LOG_FUNCTION(this);

        std::vector<int> cqi;

        if (m_amcModel == MiErrorModel)
        {
            NS_LOG_DEBUG(this << " AMC-VIENNA RBG size " << (uint16_t)rbgSize);
            NS_ASSERT_MSG(rbgSize > 0, " LteAmc-Vienna: RBG size must be greater than 0");
            std::vector<int> rbgMap;
            int rbId = 0;
            for (auto it = sinr.ConstValuesBegin(); it != sinr.ConstValuesEnd(); it++)
            {
                /// WE CAN CUT 30% OF TIME BY AVOIDING THESE REALLOCATIONS WITH rbgMap.resize(rbgSize) AT BEGINNING
                /// THEN USING std::iota(rbgMap.begin(), rbgMap.end(),rbId+1-rbgSize) INSIDE THE NEXT IF
                rbgMap.push_back(rbId++);
                if ((rbId % rbgSize == 0) || ((it + 1) == sinr.ConstValuesEnd()))
                {
                    uint8_t mcs = 0;
                    TbStats_t tbStats;
                    while (mcs <= 28)
                    {
                        HarqProcessInfoList_t harqInfoList;
                        tbStats = LteMiErrorModel::GetTbDecodificationStats(
                            sinr,
                            rbgMap,
                            (uint16_t)GetDlTbSizeFromMcs(mcs, rbgSize) / 8, /// DIVISIONS ARE EXPENSIVE. OPTIMIZER WILL
                                                                            /// REPLACE THIS TRIVIAL CASE WITH >> 3,
                                                                            /// BUT THIS ISN'T ALWAYS THE CASE.
                                                                            /// IF THE DIVISION IS BY A LOOP INVARIANT,
                                                                            /// THEN COMPUTE ITS INVERSE OUTSIDE
                                                                            /// THE LOOP TO REPLACE THE DIVISION WITH
                                                                            /// A MULTIPLICATION.
                            mcs,
                            harqInfoList);
                        if (tbStats.tbler > 0.1)
                        {
                            break;
                        }
                        mcs++;
                    }
                    if (mcs > 0)
                    {
                        mcs--;
                    }
                    NS_LOG_DEBUG(this << "\t RBG " << rbId << " MCS " << (uint16_t)mcs << " TBLER "
                                      << tbStats.tbler);
                    int rbgCqi = 0;
                    if ((tbStats.tbler > 0.1) && (mcs == 0))
                    {
                        rbgCqi = 0; // any MCS can guarantee the 10 % of BER
                    }
                    else if (mcs == 28)
                    {
                        rbgCqi = 15; // all MCSs can guarantee the 10 % of BER
                    }
                    else
                    {
                        double s = SpectralEfficiencyForMcs[mcs];
                        rbgCqi = 0;
                        while ((rbgCqi < 15) && (SpectralEfficiencyForCqi[rbgCqi + 1] < s))
                        {
                            ++rbgCqi;
                        }
                    }
                    NS_LOG_DEBUG(this << "\t MCS " << (uint16_t)mcs << "-> CQI " << rbgCqi);
                    // fill the cqi vector (per RB basis)
                    /// WE CAN CUT 30% OF TIME BY AVOIDING THESE REALLOCATIONS WITH cqi.resize(cqi.size()+rbgSize)
                    /// THEN std::fill(cqi.rbegin(), cqi.rbegin()+rbgSize, rbgCqi)
                    for (uint8_t j = 0; j < rbgSize; j++)
                    {
                        cqi.push_back(rbgCqi);
                    }
                    rbgMap.clear();
                }
            }
        }

        return cqi;
    }
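
The suggestions in the capitalized comments can be sketched as follows (a simplified,
hypothetical rewrite of just those spots, not the full function): grow the vectors once
instead of calling ``push_back`` element by element, and hoist loop-invariant divisions
out of loops by multiplying with a precomputed reciprocal.

.. sourcecode:: cpp

    #include <cstdint>
    #include <numeric>
    #include <vector>

    // Build the RBG map in one shot: a single allocation plus std::iota,
    // instead of rbgSize push_back calls that may reallocate.
    std::vector<int> MakeRbgMap(int firstRbId, uint8_t rbgSize)
    {
        std::vector<int> rbgMap(rbgSize);
        std::iota(rbgMap.begin(), rbgMap.end(), firstRbId);
        return rbgMap;
    }

    // Append rbgSize copies of rbgCqi at once instead of push_back in a loop.
    void AppendRbgCqi(std::vector<int>& cqi, int rbgCqi, uint8_t rbgSize)
    {
        cqi.resize(cqi.size() + rbgSize, rbgCqi);
    }

    // A division by a loop invariant can be replaced by a multiplication with its
    // reciprocal, computed once outside the loop.
    void Normalize(std::vector<double>& values, double divisor)
    {
        const double inverse = 1.0 / divisor;
        for (auto& v : values)
        {
            v *= inverse; // instead of v /= divisor on every iteration
        }
    }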


Try to explore by yourself with the interactive flamegraph below
(only available for html-based documentation).

.. raw:: html
    :file: figures/perf.svg

Note: the interactive flamegraph SVGs were generated from the ``perf.data`` output recorded by Linux Perf,
which is first converted to text with ``perf script``, then collapsed and finally rendered as SVG.

.. sourcecode:: console

    git clone https://github.com/brendangregg/FlameGraph
    perf script -i perf.data > perf.txt
    ./FlameGraph/stackcollapse-perf.pl perf.txt > perf.folded
    ./FlameGraph/flamegraph.pl perf.folded --width=800 > perf.svg


.. _Linux Perf and Hotspot GUI :

Linux Perf and Hotspot GUI
++++++++++++++++++++++++++

`Perf`_ is the kernel tool to measure performance of the Linux kernel,
drivers and user-space applications.

Perf tracks several performance events; some of the most important for performance profiling are:

* cycles

  * Clocks (time) spent running.
* cache-misses

  * When either data or instructions were not in the L1/L2 caches, requiring
    a L3 or memory access.
* branch-misses

  * How many branch instructions were mispredicted.
    Mispredictions cause the CPU to stall and flush the pipeline,
    slowing down the program.
* stalled-cycles-frontend

  * Cycles wasted by the processor while waiting for the next instruction,
    usually due to an instruction cache miss or a misprediction.
    This starves the CPU pipeline of instructions and slows down the program.
* stalled-cycles-backend

  * Cycles wasted waiting for pipeline resources to finish their work.
    Usually waiting for memory read/write, or executing long-latency instructions.


Just like ``heaptrack``, perf can be executed through the ``ns3`` wrapper's
command template. The following command writes the perf data for ``wifi-he-network``
to the ``perf.data`` output file.

.. sourcecode:: console


    ~/ns-3-dev$ ./ns3 run "wifi-he-network --simulationTime=0.3 --frequency=5 --useRts=1 --minExpectedThroughput=6 --maxExpectedThroughput=745" --command-template "perf record -o ./perf.data --call-graph dwarf --event cycles,cache-misses,branch-misses --sample-cpu %s" --no-build

For ease of use, ``ns3`` also provides the ``--perf`` run option, which
includes the recommended settings.

.. sourcecode:: console

    ~/ns-3-dev$ ./ns3 run "wifi-he-network --simulationTime=0.3 --frequency=5 --useRts=1 --minExpectedThroughput=6 --maxExpectedThroughput=745" --perf --no-build

When running for the first time, you may receive the following error:

.. sourcecode:: console

    ~/ns-3-dev$ ./ns3 run "wifi-he-network --simulationTime=0.3 --frequency=5 --useRts=1 --minExpectedThroughput=6 --maxExpectedThroughput=745" --perf --no-build
    Error:
    Access to performance monitoring and observability operations is limited.
    Consider adjusting /proc/sys/kernel/perf_event_paranoid setting to open
    access to performance monitoring and observability operations for processes
    without CAP_PERFMON, CAP_SYS_PTRACE or CAP_SYS_ADMIN Linux capability.
    More information can be found at 'Perf events and tool security' document:
    https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html
    perf_event_paranoid setting is 1:
      -1: Allow use of (almost) all events by all users
          Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
    >= 0: Disallow raw and ftrace function tracepoint access
    >= 1: Disallow CPU event access
    >= 2: Disallow kernel profiling
    To make the adjusted perf_event_paranoid setting permanent preserve it
    in /etc/sysctl.conf (e.g. kernel.perf_event_paranoid = <setting>)
    Command 'build/examples/wireless/ns3-dev-wifi-he-network-default record --call-graph dwarf -a -e cache-misses,branch-misses,cpu-cycles,instructions,context-switches build/examples/wireless/ns3-dev-wifi-he-network-default -n=100' returned non-zero exit status 255.

This error is caused by insufficient permissions to access performance events from the kernel and CPU.
As explained in the error message, permissions can be granted for the current session
by changing the ``perf_event_paranoid`` setting with ``echo 0 > /proc/sys/kernel/perf_event_paranoid``.
This change can be made permanent by setting it in ``/etc/sysctl.conf``, but
this is not recommended. Administrative permissions (``sudo su``) are required in both cases.

After the program finishes, perf will print recording statistics.

.. sourcecode:: console

    MCS value               Channel width           GI                      Throughput
    0                       20 MHz                  3200 ns                 6.01067 Mbit/s
    0                       20 MHz                  1600 ns                 5.936 Mbit/s
    ...
    11                      160 MHz                 1600 ns                 493.397 Mbit/s
    11                      160 MHz                 800 ns                  534.016 Mbit/s
    [ perf record: Woken up 9529 times to write data ]
    Warning:
    Processed 517638 events and lost 94 chunks!

    Check IO/CPU overload!

    Warning:
    1 out of order events recorded.
    [ perf record: Captured and wrote 2898,307 MB perf.data (436509 samples) ]


Results saved in ``perf.data`` can be reviewed with the ``perf report`` command.

`Hotspot`_ is a GUI for Perf that makes performance profiling more
enjoyable and productive. It can parse ``perf.data`` and present the
results in a friendlier way.

To record the same ``perf.data`` from Hotspot directly, fill in the fields
for the working directory, the path to the executable, the arguments, the perf
events to track and the output directory for the ``perf.data``.
Then run it to start recording.

.. image:: figures/hotspot-setup.png

The cycles per function for this program are shown in the following
image.

.. image:: figures/hotspot-cycles.png

The data is also presented in a tabular format in the bottom-up,
top-down and caller/callee tabs (top left of the screen).

.. image:: figures/hotspot-top-down.png

.. _issue 280 : https://gitlab.com/nsnam/ns-3-dev/-/issues/280
.. _issue 426 : https://gitlab.com/nsnam/ns-3-dev/-/issues/426
.. _MR681 : https://gitlab.com/nsnam/ns-3-dev/-/merge_requests/681
.. _MR685 : https://gitlab.com/nsnam/ns-3-dev/-/merge_requests/685

Hotspot was used to identify performance bottlenecks on multiple occasions:

#.  The ``wifi-primary-channels`` test suite was extremely slow due to unnecessary RF processing.
    The adopted solution was to replace the filtering of the entire channel with filtering of just the desired
    sub-band, and to assume sub-bands are uniformly sized, saving multiplications in the integral
    used to compute the power of each sub-band. This resulted in a 6x speedup with
    ``./ns3 run "test-runner --fullness=TAKES_FOREVER --test-name=wifi-primary-channels"``.
    Hotspot was used along with AMD uProf to track this and other bottlenecks in `issue 426`_.

#.  ``WifiMacQueue::TtlExceeded`` dereferenced out-of-cache data when calling Simulator::Now().
    The adopted solution was to move Simulator::Now() out of TtlExceeded, reuse the value,
    and inline TtlExceeded (a generic sketch of this hoisting pattern is shown after this list).
    This resulted in a ~1.20x speedup with the test suite (``./test.py -d``).
    Hotspot was used along with AMD uProf to track this and other bottlenecks in `issue 280`_
    and merge request `MR681`_.

#.  MpduAggregator and MsduAggregator required an expensive attribute lookup to get the maximum sizes
    from the RegularWifiMac. The adopted solution was to bypass the attribute lookup, which reduced
    cache misses and unnecessary branches. This resulted in a ~1.02x speedup with the test suite
    (``./test.py -d``). More details on: merge requests `MR681`_ and `MR685`_.
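
The hoisting pattern from the second item can be sketched generically (hypothetical code;
``std::chrono::steady_clock::now()`` stands in for ``Simulator::Now()``): instead of
re-evaluating an expensive call for every element, evaluate it once before the loop and
reuse the value.

.. sourcecode:: cpp

    #include <chrono>
    #include <cstddef>
    #include <vector>

    using Clock = std::chrono::steady_clock;

    struct Item
    {
        Clock::time_point expiry;
    };

    // Before: Clock::now() is re-evaluated for every element.
    std::size_t CountExpiredSlow(const std::vector<Item>& items)
    {
        std::size_t count = 0;
        for (const auto& item : items)
        {
            if (item.expiry < Clock::now())
            {
                count++;
            }
        }
        return count;
    }

    // After: the current time is fetched once and reused, mirroring how
    // Simulator::Now() was hoisted out of WifiMacQueue::TtlExceeded.
    std::size_t CountExpiredFast(const std::vector<Item>& items)
    {
        const auto now = Clock::now();
        std::size_t count = 0;
        for (const auto& item : items)
        {
            if (item.expiry < now)
            {
                count++;
            }
        }
        return count;
    }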

Perf on WSL
~~~~~~~~~~~

WSLv1 cannot use perf due to the lack of the Linux kernel and its performance counters.

WSLv2 users need to manually build perf to profile their programs,
which can be accomplished with the following commands:

.. sourcecode:: console

    sudo apt install git flex bison libdwarf-dev libelf-dev libnuma-dev \
             libunwind-dev libnewt-dev libdwarf++0 libelf++0 libdw-dev \
             libbfb0-dev systemtap-sdt-dev libssl-dev libperl-dev \
             python-dev-is-python3 binutils-dev libiberty-dev libzstd-dev \
             libcap-dev libbabeltrace-dev libpfm4-dev libtraceevent-dev pkg-config
    git clone https://github.com/microsoft/WSL2-Linux-Kernel --depth 1
    cd WSL2-Linux-Kernel/tools/perf
    make -j8
    sudo cp perf /usr/local/bin

Note that hardware `performance counters <https://github.com/microsoft/WSL/issues/4678>`_
are only available in Windows 11.

AMD uProf
+++++++++

`AMD uProf`_ works much like `Linux Perf and Hotspot GUI`_, but
is available on more platforms (Linux, Windows and BSD) for machines with AMD
processors. Unlike Perf, it provides more performance
trackers for finer analysis.

To use it, open uProf, then click to profile an application. If you
have already profiled an application, you can reuse those settings for
another application by clicking one of the items in the ``Recently Used
Configurations`` section.

.. image:: figures/uprof-start.png

Fill the fields with the application path, the arguments and
the working directory.

You may need to add the LD_LIBRARY_PATH environment variable
(or PATH on Windows), pointing it to the library output
directory (e.g. ``ns-3-dev/build/lib``).

Then click next:

.. image:: figures/uprof-profile-application.png

Now select custom events and pick the events you want.

The recommended ones for performance profiling are:
 * CYCLES_NOT_IN_HALT

   * Clocks (time) spent running.
 * RETIRED_INST

   * How many instructions were completed.
   * These do not count mispredictions, stalls, etc.
   * Instructions per clock (IPC) = RETIRED_INST / CYCLES_NOT_IN_HALT
 * RETIRED_BR_INST_MISP

   * How many branch instructions were mispredicted.
   * Mispredictions cause the CPU to stall and flush the pipeline,
     slowing down the program.
 * L2_CACHE_MISS.FROM_L1_IC_MISS

   * L2 cache misses caused by instruction L1 cache misses.
   * Results in L3/memory accesses due to missing instructions in L1/L2.
 * L2_CACHE_MISS.FROM_L1_DC_MISS

   * L2 cache misses caused by data L1 cache misses.
   * Results in L3/memory accesses due to missing data in L1/L2.
 * MISALIGNED_LOADS

   * Loads not aligned with processor words.
   * Might result in additional cache and memory accesses.

.. image:: figures/uprof-select-events.png

Now click in advanced options to enable collection of the call stack.

.. image:: figures/uprof-collect-callstack.png

Then click ``Start Profile`` and wait for the program to end.
After it finishes you will be greeted with a hotspot summary screen,
but the ``Analyze`` tab (top of the screen) has sub-tabs with more
relevant information.

In the following image the metrics are shown per module, including the
C library (libc.so.6) which provides the ``malloc`` and ``free`` functions.
Values can be shown in terms of samples or percentages for easier reading
and to decide where to optimize.

.. image:: figures/uprof-stats.png


.. _MR677 : https://gitlab.com/nsnam/ns-3-dev/-/merge_requests/677
.. _MR680 : https://gitlab.com/nsnam/ns-3-dev/-/merge_requests/680
.. _MR681 : https://gitlab.com/nsnam/ns-3-dev/-/merge_requests/681
.. _MR777 : https://gitlab.com/nsnam/ns-3-dev/-/merge_requests/777
.. _semantic interposition: https://maskray.me/blog/2021-05-09-fno-semantic-interposition

Here are a few cases where AMD uProf was used to identify performance bottlenecks:

#.  ``WifiMacQueue::TtlExceeded`` dereferenced out-of-cache data when calling Simulator::Now().
    The adopted solution was to move Simulator::Now() out of TtlExceeded, reuse the value,
    and inline TtlExceeded. This resulted in a ~1.20x speedup with the test suite (``./test.py -d``).
    More details on: `issue 280`_ and merge request `MR681`_.

#.  The ``wifi-primary-channels`` test suite was extremely slow due to unnecessary RF processing.
    The adopted solution was to replace the filtering of the entire channel with filtering of just the desired
    sub-band, and to assume sub-bands are uniformly sized, saving multiplications in the integral
    used to compute the power of each sub-band. This resulted in a 6x speedup with
    ``./ns3 run "test-runner --fullness=TAKES_FOREVER --test-name=wifi-primary-channels"``.
    More details on: `issue 426`_ and merge request `MR677`_.

#.  Continuing the work on ``wifi-primary-channels`` test suite, profiling showed an excessive
    number of cache misses in ``InterferenceHelper::GetNextPosition``.
    This function searches for an iterator on a map, which is very fast
    if the map is small and fits in the cache, which was not the case. After reviewing the code,
    it was noticed that in most cases this call was unnecessary, as the iterator was already known.
    The adopted solution was to reuse the iterator whenever possible.
    This resulted in a 1.78x speedup on top of the previous 6x with
    ``./ns3 run "test-runner --fullness=TAKES_FOREVER --test-name=wifi-primary-channels"``.
    More details on: `issue 426`_ and merge requests `MR677`_ and `MR680`_.

#.  Position-Independent Code libraries (``-fPIC``) have an additional layer of indirection that increases
    instruction cache misses. The adopted solution was to disable `semantic interposition`_ with the flag
    ``-fno-semantic-interposition`` on GCC (this is the default setting on Clang). This results in an
    approximately 1.14x speedup with ``./test.py -d``. More details on: `MR777`_.

Note: all speedups above were measured on the same machine. Results may differ based on clock speeds,
cache sizes, number of cores, memory bandwidth and latency, storage throughput and latency.

Intel VTune
+++++++++++

`Intel VTune`_ works much like `Linux Perf and Hotspot GUI`_, but
is available on more platforms (Linux, Windows and macOS) for machines with Intel
processors. Unlike Perf, it provides more performance
trackers for finer analysis.

When you open the program, you will be greeted by the landing page
shown in the following image. To start a new profiling project, click
on the ``Configure Analysis`` button. If you already have a project,
right-click its entry and select configure analysis to reuse the
settings.

.. image:: figures/vtune-landing.png

A configuration page will open, where you can fill the fields with
the path to the program, arguments, and set working directory and
environment variables.

Note: in this example on Windows using MinGW,
we need to define the ``PATH`` environment variable with the paths
to both ``~/ns-3-dev/build/lib`` and the MinGW binaries folder
(``~/msys64/mingw64/bin``), which contains essential libraries.
On Linux-like systems you will need to define the
``LD_LIBRARY_PATH`` environment variable instead of ``PATH``.

Clicking on the ``Performance Snapshot`` shows the different profiling
options.

.. image:: figures/vtune-configure.png

If executed as is, a quick profiling run is performed to
determine which areas should be profiled in more detail.
For this specific example, it indicates that there are
microarchitectural bottlenecks and low parallelism
(not a surprise, since ns-3 is single-threaded).

.. image:: figures/vtune-perf-snapshot.png

If the ``microarchitecture exploration`` option is selected, cycles,
branch mispredictions, cache misses and other metrics will be collected.

.. image:: figures/vtune-select-uarch-profiling.png

After executing the ``microarchitecture exploration``, a summary will
be shown. Hovering the mouse over the red flags will explain what
each sentence means and how it impacts performance.

.. image:: figures/vtune-uarch-profiling-summary.png

Clicking on the ``Bottom-up`` tab shows all the information per module:
a plethora of stats such as CPU time, instructions retired,
retiring percentage (how many of the dispatched instructions
were executed to completion, usually lower than 100% because of branch
mispredictions), bad speculation, cache misses, unused load ports,
and more.

The stats for the wifi module are shown below. The retiring
metric indicates that about 40% of the dispatched instructions are
executed to completion. The diagram on the right shows that the bottleneck is
in the front-end of the pipeline (red), due to high
instruction cache misses, translation lookaside buffer (TLB)
overhead and unknown branches (most likely callbacks).

.. image:: figures/vtune-uarch-wifi-stats.png

The stats for the core module, more specifically for the
ns3::Object::DoGetObject function, are shown below.
The metrics indicate about 63% bad speculation.
The diagram on the right shows that there are bottlenecks
both in the front-end and due to bad speculation (red).

.. image:: figures/vtune-uarch-core-stats.png


System calls profilers
**********************

.. _dtrace : https://github.com/opendtrace/documentation
.. _strace : https://strace.io/
.. _procmon : https://docs.microsoft.com/en-us/sysinternals/downloads/procmon

System call profilers collect information on which system
calls were made by a program, how long they took to be
fulfilled and how many of them resulted in errors.

There are many system call profilers, including `dtrace`_, `strace`_ and `procmon`_.

An overview on how to use `strace`_ is provided in the following section.

Strace
++++++


`strace`_ is a system call (syscall) profiler for Linux. It can filter
specific syscalls, or gather statistics during the execution.

To collect statistics, use ``strace -c``:

.. sourcecode:: console

    ~ns-3-dev/$ ./ns3 run "wifi-he-network --simulationTime=0.3 --frequency=5 --useRts=1 --minExpectedThroughput=6 --maxExpectedThroughput=745" --command-template "strace -c %s" --no-build
    MCS value               Channel width           GI                      Throughput
    0                       20 MHz                  3200 ns                 5.91733 Mbit/s
    ...
    11                      160 MHz                 800 ns                  524.459 Mbit/s
    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
     37.62    0.004332          13       326       233 openat
     35.46    0.004083           9       415           mmap
    ...
    ------ ----------- ----------- --------- --------- ----------------
    100.00    0.011515           8      1378       251 total

In the example above, the syscalls are listed on the right, after
the time spent on each syscall, the number of calls and the number of errors.

The errors can have multiple causes and may not indicate a real
problem. To check whether they do, strace can log the individual
syscalls with ``strace -o calls.log``:

.. sourcecode:: console

    ~ns-3-dev/$ ./ns3 run "wifi-he-network --simulationTime=0.3 --frequency=5 --useRts=1 --minExpectedThroughput=6 --maxExpectedThroughput=745" --command-template "strace -o calls.log %s" --no-build
    MCS value               Channel width           GI                      Throughput
    0                       20 MHz                  3200 ns                 5.91733 Mbit/s
    ...
    11                      160 MHz                 800 ns                  524.459 Mbit/s


Looking at the ``calls.log`` file, we can see different sections. In the
following excerpt, the example is executed (``execve``), the architecture is
checked (``arch_prctl``), memory is mapped for execution (``mmap``) and the
use of ``LD_PRELOAD`` is checked.

.. sourcecode:: console

    execve("~/ns-3-dev/build/examples/wireless/ns3-dev-wifi-he-network", ["~/ns-3-dev/b"..., "--simulationTime=0.3", "--frequency=5", "--useRts=1", "--minExpectedThroughput=6", "--maxExpectedThroughput=745"], 0x7fffb0f91ad8 /* 3 vars */) = 0
    brk(NULL)                               = 0x563141b37000
    arch_prctl(0x3001 /* ARCH_??? */, 0x7ffff8d63a50) = -1 EINVAL (Invalid argument)
    mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f103c2e9000
    access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)

Then the program searches for the wifi module library and fails multiple times (the errors seen in the table above).

.. sourcecode:: console

    openat(AT_FDCWD, "~/ns-3-dev/build/lib/glibc-hwcaps/x86-64-v3/libns3-dev-wifi.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
    newfstatat(AT_FDCWD, "~/ns-3-dev/build/lib/glibc-hwcaps/x86-64-v3", 0x7ffff8d62c80, 0) = -1 ENOENT (No such file or directory)
    ...
    openat(AT_FDCWD, "~/ns-3-dev/build/lib/x86_64/libns3-dev-wifi.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
    newfstatat(AT_FDCWD, "~/ns-3-dev/build/lib/x86_64", 0x7ffff8d62c80, 0) = -1 ENOENT (No such file or directory)

The library is finally found and its header is read:

.. sourcecode:: console

    openat(AT_FDCWD, "~/ns-3-dev/build/lib/libns3-dev-wifi.so", O_RDONLY|O_CLOEXEC) = 3
    read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0py\30\0\0\0\0\0"..., 832) = 832

The other modules that wifi depends on are then loaded, and execution
continues to the main function of the simulation.
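
The list of libraries the dynamic loader has to resolve for the wifi module
can also be inspected directly with the standard ``ldd`` tool (illustrative
command; the library name matches the one seen in the log above):

.. sourcecode:: console

    ~ns-3-dev/$ ldd build/lib/libns3-dev-wifi.so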

Strace was used to track down issues found while running the ``lena-radio-link-failure`` example.
Its ``strace -c`` table was the following:

.. sourcecode:: console

    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
     31,51    0,246243           2    103480       942 openat
     30,23    0,236284           2    102360           write
     19,90    0,155493           1    102538           close
     16,65    0,130132           1    102426           lseek
      1,05    0,008186          18       437           mmap
      0,21    0,001671          16        99           newfstatat
      0,20    0,001595          11       134           mprotect
      0,18    0,001391          14        98           read
    ...
    ------ ----------- ----------- --------- --------- ----------------
    100,00    0,781554           1    411681       951 total

Notice that the ``openat``, ``write``, ``close`` and ``lseek`` calls
are much more frequent than the other calls. This means
``lena-radio-link-failure`` is repeatedly opening, then seeking, then
writing, then closing at least one file handle.

Using ``strace``, we can easily find which files are accessed most frequently.

.. sourcecode:: console

    ~ns-3-dev/$./ns3 run "lena-radio-link-failure --numberOfEnbs=2 --useIdealRrc=0 --interSiteDistance=700 --simTime=17" --command-template="strace %s"
    ...
    openat(AT_FDCWD, "DlTxPhyStats.txt", O_WRONLY|O_CREAT|O_APPEND, 0666) = 3
    lseek(3, 0, SEEK_END)                   = 9252
    write(3, "635\t1\t1\t1\t0\t20\t1191\t0\t1\t0\n", 26) = 26
    close(3)                                = 0
    openat(AT_FDCWD, "DlMacStats.txt", O_WRONLY|O_CREAT|O_APPEND, 0666) = 3
    lseek(3, 0, SEEK_END)                   = 11100
    write(3, "0.635\t1\t1\t64\t6\t1\t20\t1191\t0\t0\t0\n", 31) = 31
    close(3)                                = 0
    openat(AT_FDCWD, "UlMacStats.txt", O_WRONLY|O_CREAT|O_APPEND, 0666) = 3
    lseek(3, 0, SEEK_END)                   = 8375
    write(3, "0.635\t1\t1\t64\t6\t1\t0\t85\t0\n", 24) = 24
    close(3)                                = 0
    openat(AT_FDCWD, "DlRsrpSinrStats.txt", O_WRONLY|O_CREAT|O_APPEND, 0666) = 3
    lseek(3, 0, SEEK_END)                   = 16058
    write(3, "0.635214\t1\t1\t1\t6.88272e-15\t22.99"..., 37) = 37
    close(3)                                = 0
    openat(AT_FDCWD, "UlTxPhyStats.txt", O_WRONLY|O_CREAT|O_APPEND, 0666) = 3
    ...

.. _MR814: https://gitlab.com/nsnam/ns-3-dev/-/merge_requests/814

With the names of the files, we can look at the code that manipulates them.
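
For example, a plain ``grep`` over the source tree locates the code creating
the trace files seen above (illustrative command; ``DlTxPhyStats`` is one of
the file names from the log):

.. sourcecode:: console

    ~ns-3-dev/$ grep -rn "DlTxPhyStats" src/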

The issue above was found in `MR777`_, where performance for some LTE examples
regressed for no apparent reason. The flame graph below, produced by `AMD uProf`_,
contains four large columns/"flames" in red, which
correspond to the ``write``, ``openat``, ``close`` and ``lseek`` syscalls.

.. image:: figures/uprof-strace-lte.png

Upon closer inspection, these syscalls take a long time to complete due to
the underlying filesystem of the machine running the example (an NTFS mount
using the ntfs-3g FUSE filesystem). In other words, the bottleneck only
exists when running the example on slow file systems
(e.g. FUSE and network file systems).
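
A quick way to check which filesystem the working directory lives on (and
hence whether this bottleneck may apply) is the standard ``df`` utility
(a general Linux tip, not specific to ns-3):

.. sourcecode:: console

    ~ns-3-dev/$ df -Th .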

The merge request `MR814`_ addressed the issue by keeping the files open
throughout the simulation. That alone resulted in a 1.75x speedup.


Compilation Profilers
*********************

Compilation profilers can help identify which steps of the compilation
are slowing it down. These profilers are built into the compilers themselves
and only require third-party tools to consolidate the results.

The GCC feature is mentioned and exemplified below, but it is not the
recommended compilation profiling method; Clang is recommended instead.

GCC
+++

.. _not a priority : https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92396

GCC has a special flag, ``-ftime-report``, which makes it print a table
with the time spent per compilation phase for each compiled file. The printed
output for one file is shown below. The line of ``---`` was inserted for clarity.

.. sourcecode:: console

  Time variable                      usr           sys          wall               GGC
  phase setup            :   0.00 (  0%)   0.00 (  0%)   0.01 (  1%)    1478 kB (  2%)
  phase parsing          :   0.31 ( 46%)   0.17 ( 85%)   0.48 ( 55%)   55432 kB ( 71%)
  phase lang. deferred   :   0.03 (  4%)   0.00 (  0%)   0.03 (  3%)    4287 kB (  5%)
  phase opt and generate :   0.32 ( 48%)   0.03 ( 15%)   0.35 ( 40%)   16635 kB ( 21%)
  phase last asm         :   0.01 (  1%)   0.00 (  0%)   0.01 (  1%)     769 kB (  1%)
  ------------------------------------------------------------------------------------
  |name lookup           :   0.05 (  7%)   0.02 ( 10%)   0.04 (  5%)    2468 kB (  3%)
  |overload resolution   :   0.05 (  7%)   0.00 (  0%)   0.05 (  6%)    4217 kB (  5%)
  dump files             :   0.00 (  0%)   0.00 (  0%)   0.01 (  1%)       0 kB (  0%)
  callgraph construction :   0.01 (  1%)   0.00 (  0%)   0.01 (  1%)    2170 kB (  3%)
  ...
  preprocessing          :   0.05 (  7%)   0.06 ( 30%)   0.10 ( 11%)    1751 kB (  2%)
  parser (global)        :   0.06 (  9%)   0.03 ( 15%)   0.07 (  8%)   16303 kB ( 21%)
  parser struct body     :   0.06 (  9%)   0.04 ( 20%)   0.08 (  9%)   12525 kB ( 16%)
  parser enumerator list :   0.01 (  1%)   0.00 (  0%)   0.00 (  0%)     112 kB (  0%)
  parser function body   :   0.02 (  3%)   0.02 ( 10%)   0.02 (  2%)    3039 kB (  4%)
  parser inl. func. body :   0.03 (  4%)   0.00 (  0%)   0.01 (  1%)    2024 kB (  3%)
  parser inl. meth. body :   0.02 (  3%)   0.01 (  5%)   0.06 (  7%)    5792 kB (  7%)
  template instantiation :   0.09 ( 13%)   0.01 (  5%)   0.13 ( 15%)   12274 kB ( 16%)
  ...
  symout                 :   0.01 (  1%)   0.00 (  0%)   0.02 (  2%)    8114 kB ( 10%)
  ...
  TOTAL                  :   0.67          0.20          0.88          78612 kB

In the table above, the first few lines show the five main compilation steps:
``setup``, ``parsing``, ``lang. deferred`` (C++-specific transformations),
``opt and generate`` (optimize and generate code) and ``last asm`` (produce the binary code).

The lines below the ``---`` line show sub-steps of the five main compilation steps.
For this specific case, parsing of global definitions (21%) and structure bodies (16%),
``template instantiation`` (16%) and ``symout`` (10%) dominate.
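
To reproduce such a table for an ns-3 build, one way to inject the flag is
through the compiler flags at configure time (a sketch, assuming a GCC
toolchain and a clean CMake cache; the exact mechanism to pass extra flags
may vary with your setup):

.. sourcecode:: console

  ~/ns-3-dev$ CXXFLAGS="-ftime-report" ./ns3 configure -d release
  ~/ns-3-dev$ ./ns3 build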

Aggregating this data into a meaningful report that helps focus optimization
efforts is not easy, and it is `not a priority`_ for the GCC developers.

Using the Clang alternative, described next, is recommended.

.. _Clang :

Clang
+++++

.. _ClangBuildAnalyzer : https://github.com/aras-p/ClangBuildAnalyzer

Clang can output very similar results with the ``-ftime-trace`` flag, and the results
can also be aggregated in a more meaningful way. With the help of the third-party tool
`ClangBuildAnalyzer`_, we can get good insights into where to spend time trying to speed up the compilation.

Support for building with ``-ftime-trace``, compiling `ClangBuildAnalyzer`_ and producing a
report for the project has been baked into the CMake project of ns-3, and can be enabled
with ``-DNS3_CLANG_TIMETRACE=ON``.

.. sourcecode:: console

  ~/ns-3-dev/cmake_cache$ cmake -DNS3_CLANG_TIMETRACE=ON ..

Or via ``ns3``:

.. sourcecode:: console

  ~/ns-3-dev$ ./ns3 configure -- -DNS3_CLANG_TIMETRACE=ON

The entire procedure looks like the following:

.. sourcecode:: console

  ~/ns-3-dev$ CXX="clang++" ./ns3 configure -d release --enable-examples --enable-tests -- -DNS3_CLANG_TIMETRACE=ON
  ~/ns-3-dev$ ./ns3 build timeTraceReport
  ~/ns-3-dev$ cat ClangBuildAnalyzerReport.txt
  Analyzing build trace from '~/ns-3-dev/cmake_cache/clangBuildAnalyzerReport.bin'...
  **** Time summary:
  Compilation (2993 times):
    Parsing (frontend):         2476.1 s
    Codegen & opts (backend):   1882.9 s

  **** Files that took longest to parse (compiler frontend):
    8966 ms: src/test/CMakeFiles/libtest.dir/traced/traced-callback-typedef-test-suite.cc.o
    6633 ms: src/wifi/examples/CMakeFiles/wifi-bianchi.dir/wifi-bianchi.cc.o
  ...

  **** Files that took longest to codegen (compiler backend):
  36430 ms: src/wifi/CMakeFiles/libwifi-test.dir/test/block-ack-test-suite.cc.o
  24941 ms: src/wifi/CMakeFiles/libwifi-test.dir/test/wifi-mac-ofdma-test.cc.o
  ...

  **** Templates that took longest to instantiate:
  12651 ms: std::unordered_map<int, int> (615 times, avg 20 ms)
  10950 ms: std::_Hashtable<int, std::pair<const int, int>, std::allocator<std::... (615 times, avg 17 ms)
  10712 ms: std::__detail::__hyperg<long double> (1172 times, avg 9 ms)
  ...

  **** Template sets that took longest to instantiate:
  111660 ms: std::list<$> (27141 times, avg 4 ms)
   79892 ms: std::_List_base<$> (27140 times, avg 2 ms)
   75131 ms: std::map<$> (11752 times, avg 6 ms)
   65214 ms: std::allocator<$> (66622 times, avg 0 ms)
  ...

  **** Functions that took longest to compile:
    7206 ms: OfdmaAckSequenceTest::CheckResults(ns3::Time, ns3::Time, unsigned ch... (~/ns-3-dev/src/wifi/test/wifi-mac-ofdma-test.cc)
    6146 ms: PieQueueDiscTestCase::RunPieTest(ns3::QueueSizeUnit) (~/ns-3-dev/src/traffic-control/test/pie-queue-disc-test-suite.cc)
  ...

  **** Function sets that took longest to compile / optimize:
  14801 ms: std::__cxx11::basic_string<$> ns3::CallbackImplBase::GetCppTypeid<$>() (2342 times, avg 6 ms)
  12013 ms: ns3::CallbackImpl<$>::DoGetTypeid[abi:cxx11]() (1283 times, avg 9 ms)
  10034 ms: ns3::Ptr<$>::~Ptr() (5975 times, avg 1 ms)
   8932 ms: ns3::Callback<$>::DoAssign(ns3::Ptr<$>) (591 times, avg 15 ms)
   6318 ms: ns3::CallbackImpl<$>::DoGetTypeid() (431 times, avg 14 ms)
  ...

  *** Expensive headers:
  293609 ms: ~/ns-3-dev/build/include/ns3/log.h (included 1404 times, avg 209 ms), included via:
    cqa-ff-mac-scheduler.cc.o  (758 ms)
    ipv6-list-routing.cc.o  (746 ms)
    ...

  239884 ms: ~/ns-3-dev/build/include/ns3/nstime.h (included 1093 times, avg 219 ms), included via:
    lte-enb-rrc.cc.o lte-enb-rrc.h  (891 ms)
    wifi-acknowledgment.cc.o wifi-acknowledgment.h  (877 ms)
    ...

  216218 ms: ~/ns-3-dev/build/include/ns3/object.h (included 1205 times, avg 179 ms), included via:
    energy-source-container.cc.o energy-source-container.h energy-source.h  (1192 ms)
    phased-array-model.cc.o phased-array-model.h  (1135 ms)
    ...

  206801 ms: ~/ns-3-dev/build/include/ns3/core-module.h (included 195 times, avg 1060 ms), included via:
    sample-show-progress.cc.o  (1973 ms)
    length-example.cc.o  (1848 ms)
    ...

  193116 ms: /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/basic_string.h (included 1499 times, avg 128 ms), included via:
    model-typeid-creator.h attribute-default-iterator.h type-id.h attribute.h string  (250 ms)
    li-ion-energy-source-helper.h energy-model-helper.h attribute.h string  (243 ms)
    ...

  185075 ms: /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/ios_base.h (included 1495 times, avg 123 ms), included via:
    iomanip  (403 ms)
    mpi-test-fixtures.h iomanip  (364 ms)
    ...

  169464 ms: ~/ns-3-dev/build/include/ns3/ptr.h (included 1399 times, avg 121 ms), included via:
    lte-test-rlc-um-e2e.cc.o config.h  (568 ms)
    lte-test-rlc-um-transmitter.cc.o simulator.h event-id.h  (560 ms)
    ...


    done in 2.8s.

The printed output contains a summary of the time spent on parsing and on code generation, along
with multiple lists for different tracked categories. From the summary, it is clear that parsing times
are very high compared to the optimization time (``-O3``). Skipping the other categories and going straight
to the expensive headers section, we can better understand why parsing times are so high, with some headers
adding as much as 5 minutes of CPU time to the total parsing time.

.. _drastically speed up parsing times : https://gitlab.com/nsnam/ns-3-dev/-/merge_requests/731#note_687176503

Precompiled headers (``-DNS3_PRECOMPILE_HEADERS=ON``) can `drastically speed up parsing times`_;
however, they can also increase ccache misses, reducing the time of the first
compilation at the cost of increasing recompilation times.
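
The option is enabled with the same mechanism as the other CMake switches
shown above:

.. sourcecode:: console

  ~/ns-3-dev$ ./ns3 configure -- -DNS3_PRECOMPILE_HEADERS=ON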

.. _Perfetto UI : https://ui.perfetto.dev/
.. _NinjaTracing : https://github.com/nico/ninjatracing

NinjaTracing
++++++++++++

If the Ninja generator is being used (``./ns3 configure -G Ninja``), its build log
can be used to identify targets slowing down the build process. The `NinjaTracing`_
utility converts the log format into a tracing JSON file.

The following steps show how it can be used:

.. sourcecode:: console

  ~/ns-3-dev$ ./ns3 configure --enable-ninja-tracing
  ~/ns-3-dev$ ./ns3 build
  ~/ns-3-dev$ ./ns3 build ninjaTrace

The output ``ninja_performance_trace.json`` should be located in the ``~/ns-3-dev`` directory.
You can then visualize the results using the ``about:tracing`` panel available in
Chromium-based browsers or with a compatible trace viewer such as `Perfetto UI`_.

It can also be used in conjunction with the `Clang`_ time-trace feature for more granular
information from within the compiler and linker.

.. sourcecode:: console

  ~/ns-3-dev$ CXX=clang++ ./ns3 configure --enable-ninja-tracing -- -DNS3_CLANG_TIMETRACE=ON
  ~/ns-3-dev$ ./ns3 build
  ~/ns-3-dev$ ./ns3 build ninjaTrace


CMake Profiler
**************

CMake has a built-in tracer that makes it possible to track hotspots in the CMake files that slow down
the project configuration. To use the tracer, call CMake directly from a clean CMake cache directory:

.. sourcecode:: console

  ~/ns-3-dev/cmake-cache$ cmake .. --profiling-format=google-trace --profiling-output=../cmake_performance_trace.log


Or using the ns3 wrapper:

.. sourcecode:: console

  ~/ns-3-dev$ ./ns3 configure --trace-performance

A ``cmake_performance_trace.log`` file will be generated in the ns-3-dev directory.
The tracing results can be visualized using the ``about:tracing`` panel available
in Chromium-based browsers or a compatible trace viewer such as `Perfetto UI`_.

After opening the trace file, select the traced process and click on
any of the blocks to inspect the different stacks and find hotspots.
An auxiliary panel containing the function/macro name, arguments
and location can be shown, providing enough information to trace
back the location of each specific call.

Just like in performance profilers, visual inspection makes it easier
to identify hotspots and focus on trying to optimize what matters most.

.. _issue #588: https://gitlab.com/nsnam/ns-3-dev/-/issues/588

The trace below was generated during the discussion of `issue #588`_,
while investigating the long configuration times, especially when using HDDs.

The single largest contributor was CMake's ``configure_file``, used to keep
up-to-date copies of headers in the output directory.

.. image:: figures/perfetto-trace-cmake.png


.. _MR911: https://gitlab.com/nsnam/ns-3-dev/-/merge_requests/911

In `MR911`_, alternatives such as stub headers that include the original header
files, keeping them in their respective modules, and symlinking headers to the
output directory were used to reduce the configuration overhead.

Note: when testing for I/O bottlenecks, you may want to drop the filesystem caches,
otherwise caching may hide the issue. On Linux, the caches can be cleared
using the following command:

.. sourcecode:: console

  ~/ns-3-dev$ sudo sysctl vm.drop_caches=3
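
Since only clean (already written back) pages can be dropped, it also helps
to flush dirty pages first; a combined invocation such as the following is
common (general Linux advice, not specific to ns-3):

.. sourcecode:: console

  ~/ns-3-dev$ sync && sudo sysctl vm.drop_caches=3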