1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657
|
libtracefs(3)
=============
NAME
----
tracefs_sql - Create a synthetic event via an SQL statement
SYNOPSIS
--------
[verse]
--
*#include <tracefs.h>*
struct tracefs_synth pass:[*]*tracefs_sql*(struct tep_handle pass:[*]_tep_, const char pass:[*]_name_,
const char pass:[*]_sql_buffer_, char pass:[**]_err_);
--
DESCRIPTION
-----------
Synthetic events are dynamically created events that attach two existing events
together via one or more matching fields between the two events. It can be used
to find the latency between the events, or to simply pass fields of the first event
on to the second event to display as one event.
The Linux kernel interface to create synthetic events is complex, and there needs
to be a better way to create synthetic events that is easy and can be understood
via existing technology.
If you think of each event as a table, where the fields are the column of the table
and each instance of the event as a row, you can understand how SQL can be used
to attach two events together and form another event (table). Utilizing the
SQL *SELECT* *FROM* *JOIN* *ON* [ *WHERE* ] syntax, a synthetic event can easily
be created from two different events.
For simple SQL queries to make a histogram instead of a synthetic event, see
HISTOGRAMS below.
*tracefs_sql*() takes in a _tep_ handler (See _tep_local_events_(3)) that is used to
verify the events within the _sql_buffer_ expression. The _name_ is the name of the
synthetic event to create. If _err_ points to an address of a string, it will be filled
with a detailed message on any type of parsing error, including fields that do not belong
to an event, or if the events or fields are not properly compared.
The example program below is a fully functional parser where it will create a synthetic
event from a SQL syntax passed in via the command line or a file.
The SQL format is as follows:
*SELECT* <fields> *FROM* <start-event> *JOIN* <end-event> *ON* <matching-fields> *WHERE* <filter>
Note, although the examples show the SQL commands in uppercase, they are not required to
be so. That is, you can use "SELECT" or "select" or "sElEct".
For example:
[source,c]
--
SELECT syscalls.sys_enter_read.fd, syscalls.sys_exit_read.ret FROM syscalls.sys_enter_read
JOIN syscalls.sys_exit_read
ON syscalls.sys_enter_read.common_pid = syscalls.sys_exit_write.common_pid
--
Will create a synthetic event that with the fields:
u64 fd; s64 ret;
Because the function takes a _tep_ handle, and usually all event names are unique, you can
leave off the system (group) name of the event, and *tracefs_sql*() will discover the
system for you.
That is, the above statement would work with:
[source,c]
--
SELECT sys_enter_read.fd, sys_exit_read.ret FROM sys_enter_read JOIN sys_exit_read
ON sys_enter_read.common_pid = sys_exit_write.common_pid
--
The *AS* keyword can be used to name the fields as well as to give an alias to the
events, such that the above can be simplified even more as:
[source,c]
--
SELECT start.fd, end.ret FROM sys_enter_read AS start JOIN sys_exit_read AS end ON start.common_pid = end.common_pid
--
The above aliases _sys_enter_read_ as *start* and _sys_exit_read_ as *end* and uses
those aliases to reference the event throughout the statement.
Using the *AS* keyword in the selection portion of the SQL statement will define what
those fields will be called in the synthetic event.
[source,c]
--
SELECT start.fd AS filed, end.ret AS return FROM sys_enter_read AS start JOIN sys_exit_read AS end
ON start.common_pid = end.common_pid
--
The above labels the _fd_ of _start_ as *filed* and the _ret_ of _end_ as *return* where
the synthetic event that is created will now have the fields:
u64 filed; s64 return;
The fields can also be calculated with results passed to the synthetic event:
[source,c]
--
select start.truesize, end.len, (start.truesize - end.len) as diff from napi_gro_receive_entry as start
JOIN netif_receive_skb as end ON start.skbaddr = end.skbaddr
--
Which would show the *truesize* of the _napi_gro_receive_entry_ event, the actual
_len_ of the content, shown by the _netif_receive_skb_, and the delta between
the two and expressed by the field *diff*.
The code also supports recording the timestamps at either event, and performing calculations
on them. For wakeup latency, you have:
[source,c]
--
select start.pid, (end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as lat from sched_waking as start
JOIN sched_switch as end ON start.pid = end.next_pid
--
The above will create a synthetic event that records the _pid_ of the task being woken up,
and the time difference between the _sched_waking_ event and the _sched_switch_ event.
The *TIMESTAMP_USECS* will truncate the time down to microseconds as the timestamp usually
recorded in the tracing buffer has nanosecond resolution. If you do not want that
truncation, use *TIMESTAMP* instead of *TIMESTAMP_USECS*.
Because it is so common to have:
[source,c]
--
(end.TIMESTAMP_USECS - start.TIMESTAMP_USECS)
--
The above can be represented with *TIMESTAMP_DELTA_USECS* or if nanoseconds are OK, you can
use *TIMESTAMP_DELTA*. That is, the previous select can also be represented by:
[source,c]
--
select start.pid, TIMESTAMP_DELTA_USECS as lat from sched_waking as start JOIN sched_switch as end ON start.pid = end.next_pid
--
Finally, the *WHERE* clause can be added, that will let you add filters on either or both events.
[source,c]
--
select start.pid, (end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as lat from sched_waking as start
JOIN sched_switch as end ON start.pid = end.next_pid
WHERE start.prio < 100 && (!(end.prev_pid < 1 || end.prev_prio > 100) || end.prev_pid == 0)
--
*NOTE*
Although both events can be used together in the *WHERE* clause, they must not be mixed outside
the top most "&&" statements. You can not OR (||) the events together, where a filter of one
event is OR'd to a filter of the other event. This does not make sense, as the synthetic event
requires both events to take place to be recorded. If one is filtered out, then the synthetic
event does not execute.
[source,c]
--
select start.pid, (end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as lat from sched_waking as start
JOIN sched_switch as end ON start.pid = end.next_pid
WHERE start.prio < 100 && end.prev_prio < 100
--
The above is valid.
Where as the below is not.
[source,c]
--
select start.pid, (end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as lat from sched_waking as start
JOIN sched_switch as end ON start.pid = end.next_pid
WHERE start.prio < 100 || end.prev_prio < 100
--
If the kernel supports it, you can pass around a stacktrace between events.
[source, c]
--
select start.prev_pid as pid, (end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as delta, start.STACKTRACE as stack
FROM sched_switch as start JOIN sched_switch as end ON start.prev_pid = end.next_pid
WHERE start.prev_state == 2
--
The above will record a stacktrace when a task is in the UNINTERRUPTIBLE (blocked) state, and trigger
the synthetic event when it is scheduled back in, recording the time delta that it was blocked for.
It will record the stacktrace of where it was when it scheduled out along with the delta.
KEYWORDS AS EVENT FIELDS
------------------------
In some cases, an event may have a keyword. For example, regcache_drop_region has "from"
as a field and the following will not work
[source,c]
--
select from from regcache_drop_region
--
In such cases, add a backslash to the conflicting field, and this will tell the parser
that the "from" is a field and not a keyword:
[source,c]
--
select \from from regcache_drop_region
--
HISTOGRAMS
----------
Simple SQL statements without the *JOIN* *ON* may also be used, which will create a histogram
instead. When doing this, the struct tracefs_hist descriptor can be retrieved from the
returned synthetic event descriptor via the *tracefs_synth_get_start_hist*(3).
In order to utilize the histogram types (see xxx) the CAST command of SQL can be used.
That is:
[source,c]
--
select CAST(common_pid AS comm), CAST(id AS syscall) FROM sys_enter
--
Which produces:
[source,c]
--
# echo 'hist:keys=common_pid.execname,id.syscall' > events/raw_syscalls/sys_enter/trigger
# cat events/raw_syscalls/sys_enter/hist
{ common_pid: bash [ 18248], id: sys_setpgid [109] } hitcount: 1
{ common_pid: sendmail [ 1812], id: sys_read [ 0] } hitcount: 1
{ common_pid: bash [ 18247], id: sys_getpid [ 39] } hitcount: 1
{ common_pid: bash [ 18247], id: sys_dup2 [ 33] } hitcount: 1
{ common_pid: gmain [ 13684], id: sys_inotify_add_watch [254] } hitcount: 1
{ common_pid: cat [ 18247], id: sys_access [ 21] } hitcount: 1
{ common_pid: bash [ 18248], id: sys_getpid [ 39] } hitcount: 1
{ common_pid: cat [ 18247], id: sys_fadvise64 [221] } hitcount: 1
{ common_pid: sendmail [ 1812], id: sys_openat [257] } hitcount: 1
{ common_pid: less [ 18248], id: sys_munmap [ 11] } hitcount: 1
{ common_pid: sendmail [ 1812], id: sys_close [ 3] } hitcount: 1
{ common_pid: gmain [ 1534], id: sys_poll [ 7] } hitcount: 1
{ common_pid: bash [ 18247], id: sys_execve [ 59] } hitcount: 1
--
Note, string fields may not be cast.
The possible types to cast to are:
*HEX* - convert the value to use hex and not decimal
*SYM* - convert a pointer to symbolic (kallsyms values)
*SYM-OFFSET* - convert a pointer to symbolic and include the offset.
*SYSCALL* - convert the number to the mapped system call name
*EXECNAME* or *COMM* - can only be used with the common_pid field. Will show the task
name of the process.
*LOG* or *LOG2* - bucket the key values in a log 2 values (1, 2, 3-4, 5-8, 9-16, 17-32, ...)
The above fields are not case sensitive, and "LOG2" works as good as "log".
A special CAST to _COUNTER_ or __COUNTER__ will make the field a value and not
a key. For example:
[source,c]
--
SELECT common_pid, CAST(bytes_req AS _COUNTER_) FROM kmalloc
--
Which will create
[source,c]
--
echo 'hist:keys=common_pid:vals=bytes_req' > events/kmem/kmalloc/trigger
cat events/kmem/kmalloc/hist
{ common_pid: 1812 } hitcount: 1 bytes_req: 32
{ common_pid: 9111 } hitcount: 2 bytes_req: 272
{ common_pid: 1768 } hitcount: 3 bytes_req: 1112
{ common_pid: 0 } hitcount: 4 bytes_req: 512
{ common_pid: 18297 } hitcount: 11 bytes_req: 2004
--
RETURN VALUE
------------
Returns 0 on success and -1 on failure. On failure, if _err_ is defined, it will be
allocated to hold a detailed description of what went wrong if it the error was caused
by a parsing error, or that an event, field does not exist or is not compatible with
what it was combined with.
CREATE A TOOL
-------------
The below example is a functional program that can be used to parse SQL commands into
synthetic events.
[source, c]
--
man tracefs_sql | sed -ne '/^EXAMPLE/,/FILES/ { /EXAMPLE/d ; /FILES/d ; p}' > sqlhist.c
gcc -o sqlhist sqlhist.c `pkg-config --cflags --libs libtracefs`
--
Then you can run the above examples:
[source, c]
--
sudo ./sqlhist 'select start.pid, (end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as lat from sched_waking as start
JOIN sched_switch as end ON start.pid = end.next_pid
WHERE start.prio < 100 || end.prev_prio < 100'
--
EXAMPLE
-------
[source,c]
--
#include <stdio.h>
#include <stdlib.h>
#include <stdarg.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <tracefs.h>
static void usage(char **argv)
{
fprintf(stderr, "usage: %s [-ed][-n name][-s][-S fields][-m var][-c var][-T][-t dir][-f file | sql-command-line]\n"
" -n name - name of synthetic event 'Anonymous' if left off\n"
" -t dir - use dir instead of /sys/kernel/tracing\n"
" -e - execute the commands to create the synthetic event\n"
" -m - trigger the action when var is a new max.\n"
" -c - trigger the action when var changes.\n"
" -s - used with -m or -c to do a snapshot of the tracing buffer\n"
" -S - used with -m or -c to save fields of the end event (comma deliminated)\n"
" -T - used with -m or -c to do both a snapshot and a trace\n"
" -f file - read sql lines from file otherwise from the command line\n"
" if file is '-' then read from standard input.\n",
argv[0]);
exit(-1);
}
enum action {
ACTION_DEFAULT = 0,
ACTION_SNAPSHOT = (1 << 0),
ACTION_TRACE = (1 << 1),
ACTION_SAVE = (1 << 2),
ACTION_MAX = (1 << 3),
ACTION_CHANGE = (1 << 4),
};
#define ACTIONS ((ACTION_MAX - 1))
static int do_sql(const char *instance_name,
const char *buffer, const char *name, const char *var,
const char *trace_dir, bool execute, int action,
char **save_fields)
{
struct tracefs_synth *synth;
struct tep_handle *tep;
struct trace_seq seq;
enum tracefs_synth_handler handler;
char *err;
int ret;
if ((action & ACTIONS) && !var) {
fprintf(stderr, "Error: -s, -S and -T not supported without -m or -c");
exit(-1);
}
if (!name)
name = "Anonymous";
trace_seq_init(&seq);
tep = tracefs_local_events(trace_dir);
if (!tep) {
if (!trace_dir)
trace_dir = "tracefs directory";
perror(trace_dir);
exit(-1);
}
synth = tracefs_sql(tep, name, buffer, &err);
if (!synth) {
perror("Failed creating synthetic event!");
if (err)
fprintf(stderr, "%s", err);
free(err);
exit(-1);
}
if (tracefs_synth_complete(synth)) {
if (var) {
if (action & ACTION_MAX)
handler = TRACEFS_SYNTH_HANDLE_MAX;
else
handler = TRACEFS_SYNTH_HANDLE_CHANGE;
if (action & ACTION_SAVE) {
ret = tracefs_synth_save(synth, handler, var, save_fields);
if (ret < 0) {
err = "adding save";
goto failed_action;
}
}
if (action & ACTION_TRACE) {
/*
* By doing the trace before snapshot, it will be included
* in the snapshot.
*/
ret = tracefs_synth_trace(synth, handler, var);
if (ret < 0) {
err = "adding trace";
goto failed_action;
}
}
if (action & ACTION_SNAPSHOT) {
ret = tracefs_synth_snapshot(synth, handler, var);
if (ret < 0) {
err = "adding snapshot";
failed_action:
perror(err);
if (errno == ENODEV)
fprintf(stderr, "ERROR: '%s' is not a variable\n",
var);
exit(-1);
}
}
}
tracefs_synth_echo_cmd(&seq, synth);
if (execute) {
ret = tracefs_synth_create(synth);
if (ret < 0) {
fprintf(stderr, "%s\n", tracefs_error_last(NULL));
exit(-1);
}
}
} else {
struct tracefs_instance *instance = NULL;
struct tracefs_hist *hist;
hist = tracefs_synth_get_start_hist(synth);
if (!hist) {
perror("get_start_hist");
exit(-1);
}
if (instance_name) {
if (execute)
instance = tracefs_instance_create(instance_name);
else
instance = tracefs_instance_alloc(trace_dir,
instance_name);
if (!instance) {
perror("Failed to create instance");
exit(-1);
}
}
tracefs_hist_echo_cmd(&seq, instance, hist, 0);
if (execute) {
ret = tracefs_hist_start(instance, hist);
if (ret < 0) {
fprintf(stderr, "%s\n", tracefs_error_last(instance));
exit(-1);
}
}
}
tracefs_synth_free(synth);
trace_seq_do_printf(&seq);
trace_seq_destroy(&seq);
return 0;
}
int main (int argc, char **argv)
{
char *trace_dir = NULL;
char *buffer = NULL;
char buf[BUFSIZ];
int buffer_size = 0;
const char *file = NULL;
const char *instance = NULL;
bool execute = false;
char **save_fields = NULL;
const char *name;
const char *var;
int action = 0;
char *tok;
FILE *fp;
size_t r;
int c;
int i;
for (;;) {
c = getopt(argc, argv, "ht:f:en:m:c:sS:TB:");
if (c == -1)
break;
switch(c) {
case 'h':
usage(argv);
case 't':
trace_dir = optarg;
break;
case 'f':
file = optarg;
break;
case 'e':
execute = true;
break;
case 'm':
action |= ACTION_MAX;
var = optarg;
break;
case 'c':
action |= ACTION_CHANGE;
var = optarg;
break;
case 's':
action |= ACTION_SNAPSHOT;
break;
case 'S':
action |= ACTION_SAVE;
tok = strtok(optarg, ",");
while (tok) {
save_fields = tracefs_list_add(save_fields, tok);
tok = strtok(NULL, ",");
}
if (!save_fields) {
perror(optarg);
exit(-1);
}
break;
case 'T':
action |= ACTION_TRACE | ACTION_SNAPSHOT;
break;
case 'B':
instance = optarg;
break;
case 'n':
name = optarg;
break;
}
}
if ((action & (ACTION_MAX|ACTION_CHANGE)) == (ACTION_MAX|ACTION_CHANGE)) {
fprintf(stderr, "Can not use both -m and -c together\n");
exit(-1);
}
if (file) {
if (!strcmp(file, "-"))
fp = stdin;
else
fp = fopen(file, "r");
if (!fp) {
perror(file);
exit(-1);
}
while ((r = fread(buf, 1, BUFSIZ, fp)) > 0) {
buffer = realloc(buffer, buffer_size + r + 1);
strncpy(buffer + buffer_size, buf, r);
buffer_size += r;
}
fclose(fp);
if (buffer_size)
buffer[buffer_size] = '\0';
} else if (argc == optind) {
usage(argv);
} else {
for (i = optind; i < argc; i++) {
r = strlen(argv[i]);
buffer = realloc(buffer, buffer_size + r + 2);
if (i != optind)
buffer[buffer_size++] = ' ';
strcpy(buffer + buffer_size, argv[i]);
buffer_size += r;
}
}
do_sql(instance, buffer, name, var, trace_dir, execute, action, save_fields);
free(buffer);
return 0;
}
--
FILES
-----
[verse]
--
*tracefs.h*
Header file to include in order to have access to the library APIs.
*-ltracefs*
Linker switch to add when building a program that uses the library.
--
SEE ALSO
--------
*sqlhist*(1),
*libtracefs*(3),
*libtraceevent*(3),
*trace-cmd*(1),
*tracefs_synth_init*(3),
*tracefs_synth_add_match_field*(3),
*tracefs_synth_add_compare_field*(3),
*tracefs_synth_add_start_field*(3),
*tracefs_synth_add_end_field*(3),
*tracefs_synth_append_start_filter*(3),
*tracefs_synth_append_end_filter*(3),
*tracefs_synth_create*(3),
*tracefs_synth_destroy*(3),
*tracefs_synth_free*(3),
*tracefs_synth_echo_cmd*(3),
*tracefs_hist_alloc*(3),
*tracefs_hist_alloc_2d*(3),
*tracefs_hist_alloc_nd*(3),
*tracefs_hist_free*(3),
*tracefs_hist_add_key*(3),
*tracefs_hist_add_value*(3),
*tracefs_hist_add_name*(3),
*tracefs_hist_start*(3),
*tracefs_hist_destory*(3),
*tracefs_hist_add_sort_key*(3),
*tracefs_hist_sort_key_direction*(3)
AUTHOR
------
[verse]
--
*Steven Rostedt* <rostedt@goodmis.org>
*Tzvetomir Stoyanov* <tz.stoyanov@gmail.com>
*sameeruddin shaik* <sameeruddin.shaik8@gmail.com>
--
REPORTING BUGS
--------------
Report bugs to <linux-trace-devel@vger.kernel.org>
LICENSE
-------
libtracefs is Free Software licensed under the GNU LGPL 2.1
RESOURCES
---------
https://git.kernel.org/pub/scm/libs/libtrace/libtracefs.git/
COPYING
-------
Copyright \(C) 2020 VMware, Inc. Free use of this software is granted under
the terms of the GNU Public License (GPL).
|