..  SPDX-License-Identifier: BSD-3-Clause
    Copyright(c) 2010-2015 Intel Corporation.

.. _kni:

Kernel NIC Interface
====================

.. note::

   KNI is deprecated and will be removed in the future.
   See :doc:`../rel_notes/deprecation`.

   The :ref:`virtio_user_as_exception_path` alternative is the preferred way
   to interface with the Linux network stack,
   as it is an in-kernel solution and has similar performance expectations.

.. note::

   KNI is disabled by default in the DPDK build.
   To re-enable the library, remove 'kni' from the "disable_libs" meson option when configuring a build.

The DPDK Kernel NIC Interface (KNI) allows userspace applications access to the Linux* control plane.

KNI provides an interface with the kernel network stack
and allows management of DPDK ports using standard Linux net tools
such as ``ethtool``, ``iproute2`` and ``tcpdump``.

The main use case of KNI is to send exception packets to, and receive them from,
the Linux network stack while the main datapath I/O bypasses the networking stack.

There are alternatives to KNI, all of which are available in upstream Linux:

#. :ref:`virtio_user_as_exception_path`

#. :doc:`../nics/tap` as wrapper to `Linux tun/tap
   <https://www.kernel.org/doc/Documentation/networking/tuntap.txt>`_

The benefits of using KNI over the alternatives are:

*   Faster than existing Linux TUN/TAP interfaces
    (by eliminating system calls and ``copy_to_user()``/``copy_from_user()`` operations).

The disadvantages of the KNI are:

* It is an out-of-tree Linux kernel module,
  which makes updating and distributing the driver more difficult.
  Most users end up building the KNI driver from source,
  which requires the packages and tools to build kernel modules.

* It is not safe: it shares memory between userspace and kernel space,
  and the kernel part directly uses input provided by userspace.
  This makes it hard to upstream the module.

* Requires dedicated kernel cores.

* Only a subset of net device control commands is supported by KNI.

The components of an application using the DPDK Kernel NIC Interface are shown in :numref:`figure_kernel_nic_intf`.

.. _figure_kernel_nic_intf:

.. figure:: img/kernel_nic_intf.*

   Components of a DPDK KNI Application


The DPDK KNI Kernel Module
--------------------------

The KNI kernel loadable module ``rte_kni`` provides the kernel interface
for DPDK applications.

When the ``rte_kni`` module is loaded, it will create a device ``/dev/kni``
that is used by the DPDK KNI API functions to control and communicate with
the kernel module.

The ``rte_kni`` kernel module contains several optional parameters which
can be specified when the module is loaded to control its behavior:

.. code-block:: console

    # modinfo rte_kni.ko
    <snip>
    parm:           lo_mode: KNI loopback mode (default=lo_mode_none):
                    lo_mode_none        Kernel loopback disabled
                    lo_mode_fifo        Enable kernel loopback with fifo
                    lo_mode_fifo_skb    Enable kernel loopback with fifo and skb buffer
                     (charp)
    parm:           kthread_mode: Kernel thread mode (default=single):
                    single    Single kernel thread mode enabled.
                    multiple  Multiple kernel thread mode enabled.
                     (charp)
    parm:           carrier: Default carrier state for KNI interface (default=off):
                    off   Interfaces will be created with carrier state set to off.
                    on    Interfaces will be created with carrier state set to on.
                     (charp)
    parm:           enable_bifurcated: Enable request processing support for
                    bifurcated drivers, which means releasing rtnl_lock before calling
                    userspace callback and supporting async requests (default=off):
                    on    Enable request processing support for bifurcated drivers.
                     (charp)
    parm:           min_scheduling_interval: KNI thread min scheduling interval (default=100 microseconds)
                     (long)
    parm:           max_scheduling_interval: KNI thread max scheduling interval (default=200 microseconds)
                     (long)


Loading the ``rte_kni`` kernel module without any optional parameters is
the typical way a DPDK application gets packets into and out of the kernel
network stack.  Without any parameters, only one kernel thread is created
for all KNI devices to receive packets on the kernel side, loopback mode is
disabled, and the default carrier state of KNI interfaces is set to *off*.

.. code-block:: console

    # insmod <build_dir>/kernel/linux/kni/rte_kni.ko

.. _kni_loopback_mode:

Loopback Mode
~~~~~~~~~~~~~

For testing, the ``rte_kni`` kernel module can be loaded in loopback mode
by specifying the ``lo_mode`` parameter:

.. code-block:: console

    # insmod <build_dir>/kernel/linux/kni/rte_kni.ko lo_mode=lo_mode_fifo

The ``lo_mode_fifo`` loopback option will loop back ring enqueue/dequeue
operations in kernel space.

.. code-block:: console

    # insmod <build_dir>/kernel/linux/kni/rte_kni.ko lo_mode=lo_mode_fifo_skb

The ``lo_mode_fifo_skb`` loopback option will loop back ring enqueue/dequeue
operations and sk buffer copies in kernel space.

If the ``lo_mode`` parameter is not specified, loopback mode is disabled.

.. _kni_kernel_thread_mode:

Kernel Thread Mode
~~~~~~~~~~~~~~~~~~

To allow performance to be tuned, the ``rte_kni`` kernel module
can be loaded with the ``kthread_mode`` parameter.  The ``rte_kni`` kernel
module supports two options: "single kernel thread" mode and "multiple
kernel thread" mode.

Single kernel thread mode is enabled as follows:

.. code-block:: console

    # insmod <build_dir>/kernel/linux/kni/rte_kni.ko kthread_mode=single

This mode will create only one kernel thread for all KNI interfaces to
receive data on the kernel side.  By default, this kernel thread is not
bound to any particular core, but the user can set the core affinity for
this kernel thread by setting the ``core_id`` and ``force_bind`` parameters
in ``struct rte_kni_conf`` when the first KNI interface is created.

For optimum performance, the kernel thread should be bound to a core
on the same socket as the DPDK lcores used in the application.

The KNI kernel module can also be configured to start a separate kernel
thread for each KNI interface created by the DPDK application.  Multiple
kernel thread mode is enabled as follows:

.. code-block:: console

    # insmod <build_dir>/kernel/linux/kni/rte_kni.ko kthread_mode=multiple

This mode will create a separate kernel thread for each KNI interface to
receive data on the kernel side.  The core affinity of each ``kni_thread``
kernel thread can be specified by setting the ``core_id`` and ``force_bind``
parameters in ``struct rte_kni_conf`` when each KNI interface is created.

Multiple kernel thread mode can provide higher, more scalable performance if
sufficient unused cores are available on the host system.

If the ``kthread_mode`` parameter is not specified, the "single kernel
thread" mode is used.

.. _kni_default_carrier_state:

Default Carrier State
~~~~~~~~~~~~~~~~~~~~~

The default carrier state of KNI interfaces created by the ``rte_kni``
kernel module is controlled via the ``carrier`` option when the module
is loaded.

If ``carrier=off`` is specified, the kernel module will leave the carrier
state of the interface *down* when the interface is management enabled.
The DPDK application can set the carrier state of the KNI interface using the
``rte_kni_update_link()`` function.  This is useful for DPDK applications
which require that the carrier state of the KNI interface reflect the
actual link state of the corresponding physical NIC port.

If ``carrier=on`` is specified, the kernel module will automatically set
the carrier state of the interface to *up* when the interface is management
enabled.  This is useful for DPDK applications which use the KNI interface as
a purely virtual interface that does not correspond to any physical hardware
and do not wish to explicitly set the carrier state of the interface with
``rte_kni_update_link()``.  It is also useful for testing in loopback mode
where the NIC port may not be physically connected to anything.

To set the default carrier state to *on*:

.. code-block:: console

    # insmod <build_dir>/kernel/linux/kni/rte_kni.ko carrier=on

To set the default carrier state to *off*:

.. code-block:: console

    # insmod <build_dir>/kernel/linux/kni/rte_kni.ko carrier=off

If the ``carrier`` parameter is not specified, the default carrier state
of KNI interfaces will be set to *off*.

.. _kni_bifurcated_device_support:

Bifurcated Device Support
~~~~~~~~~~~~~~~~~~~~~~~~~

User callbacks are executed while the kernel module holds the ``rtnl`` lock.
This causes a deadlock when a callback runs control commands on another Linux
kernel network interface.

Bifurcated devices have a kernel network driver part, and the
``enable_bifurcated`` parameter is used to prevent this deadlock for them.

To enable bifurcated device support:

.. code-block:: console

    # insmod <build_dir>/kernel/linux/kni/rte_kni.ko enable_bifurcated=on

Enabling bifurcated device support releases the ``rtnl`` lock before calling
the callback and takes it again after the callback returns. It also enables
asynchronous requests, to support callbacks that require the ``rtnl`` lock to
work (interface down).

KNI Kthread Scheduling
~~~~~~~~~~~~~~~~~~~~~~

The ``min_scheduling_interval`` and ``max_scheduling_interval`` parameters
control the rescheduling interval of the KNI kthreads.

This can be useful for use cases that require improved
latency or performance for control plane traffic.

The implementation is backed by Linux high-resolution timers (hrtimers) and uses ``usleep_range``.
Hence, it has the same granularity constraints as this Linux subsystem.

For more information on these timers, see
`Kernel Timers <http://www.kernel.org/doc/Documentation/timers/timers-howto.txt>`_.

To set the ``min_scheduling_interval`` to a value of 100 microseconds:

.. code-block:: console

    # insmod <build_dir>/kernel/linux/kni/rte_kni.ko min_scheduling_interval=100

To set the ``max_scheduling_interval`` to a value of 200 microseconds:

.. code-block:: console

    # insmod <build_dir>/kernel/linux/kni/rte_kni.ko max_scheduling_interval=200

If the ``min_scheduling_interval`` and ``max_scheduling_interval`` parameters are
not specified, the default interval limits are *100* and *200* microseconds respectively.

KNI Creation and Deletion
-------------------------

Before any KNI interfaces can be created, the ``rte_kni`` kernel module must
be loaded into the kernel and configured with the ``rte_kni_init()`` function.

The KNI interfaces are created by a DPDK application dynamically via the
``rte_kni_alloc()`` function.

The ``struct rte_kni_conf`` structure contains fields which allow the
user to specify the interface name, set the MTU size, set an explicit or
random MAC address and control the affinity of the kernel Rx thread(s)
(both single and multi-threaded modes).
By default, the KNI sample application gets the MTU from the matching device,
and in the case of the KNI PMD it is derived from the mbuf buffer length.

The ``struct rte_kni_ops`` structure contains pointers to functions to
handle requests from the ``rte_kni`` kernel module.  These functions
allow DPDK applications to perform actions when the KNI interfaces are
manipulated by control commands or functions external to the application.

For example, the DPDK application may wish to enable/disable a physical
NIC port when a user enables/disables a KNI interface with ``ip link set
[up|down] dev <ifaceX>``.  The DPDK application can register a callback for
``config_network_if`` which will be called when the interface management
state changes.

There are currently five callbacks for which the user can register
application functions:

``config_network_if``:

    Called when the management state of the KNI interface changes.
    For example, when the user runs ``ip link set [up|down] dev <ifaceX>``.

``change_mtu``:

    Called when the user changes the MTU size of the KNI
    interface.  For example, when the user runs ``ip link set mtu <size>
    dev <ifaceX>``.

``config_mac_address``:

    Called when the user changes the MAC address of the KNI interface.
    For example, when the user runs ``ip link set address <MAC>
    dev <ifaceX>``.  If the user sets this callback function to NULL,
    but sets the ``port_id`` field to a value other than -1, a default
    callback handler in the rte_kni library ``kni_config_mac_address()``
    will be called which calls ``rte_eth_dev_default_mac_addr_set()``
    on the specified ``port_id``.

``config_promiscusity``:

    Called when the user changes the promiscuity state of the KNI
    interface.  For example, when the user runs ``ip link set promisc
    [on|off] dev <ifaceX>``. If the user sets this callback function to
    NULL, but sets the ``port_id`` field to a value other than -1, a default
    callback handler in the rte_kni library ``kni_config_promiscusity()``
    will be called which calls ``rte_eth_promiscuous_<enable|disable>()``
    on the specified ``port_id``.

``config_allmulticast``:

    Called when the user changes the allmulticast state of the KNI interface.
    For example, when the user runs ``ifconfig <ifaceX> [-]allmulti``. If the
    user sets this callback function to NULL, but sets the ``port_id`` field to
    a value other than -1, a default callback handler in the rte_kni library
    ``kni_config_allmulticast()`` will be called which calls
    ``rte_eth_allmulticast_<enable|disable>()`` on the specified ``port_id``.

In order to run these callbacks, the application must periodically call
the ``rte_kni_handle_request()`` function.  Any user callback function
registered will be called directly from ``rte_kni_handle_request()`` so
care must be taken to prevent deadlock and to not block any DPDK fastpath
tasks.  Typically DPDK applications which use these callbacks will need
to create a separate thread or secondary process to periodically call
``rte_kni_handle_request()``.
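
The fallback rule described above for ``config_promiscusity`` (user callback
first, library default if the callback is NULL and ``port_id`` is not -1) can
be sketched as a small standalone model.  This is only an illustration: the
``toy_*`` names are hypothetical, ``struct toy_kni_ops`` is a simplified
stand-in and not the real ``struct rte_kni_ops``, and returning -1 for an
unhandled request is an assumption of this sketch.

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>

    /* Toy stand-in for the callback table; NOT the real struct rte_kni_ops. */
    struct toy_kni_ops {
        int port_id; /* -1: no ethdev port is associated with the interface */
        int (*config_promiscusity)(int port_id, uint8_t to_on);
    };

    /* Records what the toy default handler was last asked to do. */
    static int toy_default_port = -1;
    static int toy_default_state = -1;

    /* Counts invocations of the example application callback below. */
    static int toy_user_calls;

    /* Stand-in for the library default, which would reconfigure the port. */
    static int toy_default_promiscusity(int port_id, uint8_t to_on)
    {
        toy_default_port = port_id;
        toy_default_state = to_on ? 1 : 0;
        return 0;
    }

    /* Example application callback: only counts how often it was called. */
    static int toy_user_promiscusity(int port_id, uint8_t to_on)
    {
        (void)port_id;
        (void)to_on;
        toy_user_calls++;
        return 0;
    }

    /*
     * Dispatch one "promiscuity changed" request: user callback first,
     * default handler second, otherwise report the request as unhandled.
     */
    static int toy_handle_promisc_request(const struct toy_kni_ops *ops,
                                          uint8_t to_on)
    {
        if (ops->config_promiscusity != NULL)
            return ops->config_promiscusity(ops->port_id, to_on);
        if (ops->port_id != -1)
            return toy_default_promiscusity(ops->port_id, to_on);
        return -1;
    }

In this model, a table with ``port_id = 3`` and a NULL callback reaches the
default handler, while registering ``toy_user_promiscusity`` bypasses it.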

The KNI interfaces can be deleted by a DPDK application with
``rte_kni_release()``.  All KNI interfaces not explicitly deleted will be
deleted when the ``/dev/kni`` device is closed, either explicitly with
``rte_kni_close()`` or when the DPDK application is closed.

DPDK mbuf Flow
--------------

To minimize the amount of DPDK code running in kernel space, the mbuf mempool is managed in userspace only.
The kernel module will be aware of mbufs,
but all mbuf allocation and free operations will be handled by the DPDK application only.

:numref:`figure_pkt_flow_kni` shows a typical scenario with packets sent in both directions.

.. _figure_pkt_flow_kni:

.. figure:: img/pkt_flow_kni.*

   Packet Flow via mbufs in the DPDK KNI


Use Case: Ingress
-----------------

On the DPDK RX side, the mbuf is allocated by the PMD in the RX thread context.
This thread enqueues the mbuf in the rx_q FIFO,
with the next pointers in the mbuf chain converted to physical addresses.
The KNI thread polls the rx_q of all active KNI devices.
If an mbuf is dequeued, it is converted to a sk_buff and sent to the net stack via netif_rx().
The dequeued mbuf must be freed, so the same pointer is sent back in the free_q FIFO;
any next pointers are converted back to virtual addresses before the mbuf is put in the free_q FIFO.

The RX thread, in the same main loop, polls this FIFO and frees the mbuf after dequeuing it.
The next pointers are converted so that a chained mbuf
spanning different hugepage segments does not cause a kernel crash.
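
The ownership rule in the ingress path (the kernel side consumes the packet
but hands the same mbuf pointer back for userspace to free) can be sketched
with a toy pair of FIFOs.  This is a simplified single-threaded model, not
the KNI implementation: all ``toy_*`` names are hypothetical, the address
conversion of next pointers is omitted, and no memory barriers are shown.

.. code-block:: c

    #include <stddef.h>

    #define TOY_FIFO_SLOTS 8 /* power of two, so wrap-around is a mask */

    struct toy_fifo {
        unsigned int write; /* next slot the producer fills */
        unsigned int read;  /* next slot the consumer empties */
        void *slots[TOY_FIFO_SLOTS];
    };

    struct toy_mbuf {
        int in_use;  /* 1 while the buffer is allocated */
        int payload; /* stands in for packet data */
    };

    /* One slot is always left empty to tell "full" apart from "empty". */
    static int toy_fifo_is_full(const struct toy_fifo *f)
    {
        return ((f->write + 1) & (TOY_FIFO_SLOTS - 1)) == f->read;
    }

    /* Enqueue one pointer; returns 1 on success, 0 if the FIFO is full. */
    static int toy_fifo_put(struct toy_fifo *f, void *p)
    {
        if (toy_fifo_is_full(f))
            return 0;
        f->slots[f->write] = p;
        f->write = (f->write + 1) & (TOY_FIFO_SLOTS - 1);
        return 1;
    }

    /* Dequeue one pointer; returns NULL if the FIFO is empty. */
    static void *toy_fifo_get(struct toy_fifo *f)
    {
        void *p;

        if (f->read == f->write)
            return NULL;
        p = f->slots[f->read];
        f->read = (f->read + 1) & (TOY_FIFO_SLOTS - 1);
        return p;
    }

    /*
     * "Kernel thread" step: take one mbuf from rx_q, copy its payload out
     * (the real module builds a sk_buff here), then return the same pointer
     * through free_q.  The mbuf is never freed on this side.
     * Returns 1 if a packet was delivered, 0 otherwise.
     */
    static int toy_kni_thread_poll(struct toy_fifo *rx_q,
                                   struct toy_fifo *free_q, int *delivered)
    {
        struct toy_mbuf *m;

        if (toy_fifo_is_full(free_q))
            return 0; /* no room to hand the mbuf back yet */
        m = toy_fifo_get(rx_q);
        if (m == NULL)
            return 0;
        *delivered = m->payload;
        return toy_fifo_put(free_q, m);
    }

    /* "Userspace RX thread" step: free every mbuf returned on free_q. */
    static int toy_rx_thread_drain(struct toy_fifo *free_q)
    {
        struct toy_mbuf *m;
        int freed = 0;

        while ((m = toy_fifo_get(free_q)) != NULL) {
            m->in_use = 0; /* stands in for returning it to the mempool */
            freed++;
        }
        return freed;
    }

Checking ``free_q`` for space before dequeuing from ``rx_q`` is a design
choice of this sketch: it keeps the toy kernel side from holding an mbuf it
has no way to return.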

Use Case: Egress
----------------

For packet egress the DPDK application must first enqueue several mbufs to create an mbuf cache on the kernel side.

The packet is received from the Linux net stack through the kni_net_tx() callback.
The mbuf is dequeued (without waiting, thanks to the cache) and filled with data from the sk_buff.
The sk_buff is then freed and the mbuf is sent in the tx_q FIFO.

The DPDK TX thread dequeues the mbuf and sends it to the PMD via ``rte_eth_tx_burst()``.
It then puts the mbuf back in the cache.

IOVA = VA: Support
------------------

KNI operates in the IOVA_VA scheme when both of the following hold:

- LINUX_VERSION_CODE >= KERNEL_VERSION(4, 10, 0), and
- the EAL option ``--iova-mode=va`` is passed, or the bus IOVA scheme selected
  in the DPDK is RTE_IOVA_VA.

Because of the IOVA to KVA address translations, there can be a performance
impact, depending on the KNI use case. As a mitigation, IOVA can be forced to
PA with the EAL ``--iova-mode=pa`` option. An IOVA_DC bus IOMMU scheme can also
result in IOVA as PA.
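
The two conditions for IOVA_VA operation can be written as a single
predicate.  The following is a minimal sketch, not code from the KNI module:
the ``toy_*`` and ``TOY_*`` names are hypothetical, and the version macro
only mirrors the encoding of the kernel's ``KERNEL_VERSION()`` for sublevel
values up to 255.

.. code-block:: c

    #include <stdbool.h>

    /* Same packing as KERNEL_VERSION(a, b, c) for c <= 255. */
    #define TOY_KERNEL_VERSION(a, b, c) \
        (((unsigned int)(a) << 16) + ((unsigned int)(b) << 8) + (unsigned int)(c))

    /* Hypothetical stand-in for the IOVA mode selected by the EAL or bus. */
    enum toy_iova_mode { TOY_IOVA_PA, TOY_IOVA_VA };

    /* True only when the kernel is new enough AND the IOVA mode is VA. */
    static bool toy_kni_uses_iova_va(unsigned int kernel_version_code,
                                     enum toy_iova_mode mode)
    {
        return kernel_version_code >= TOY_KERNEL_VERSION(4, 10, 0) &&
               mode == TOY_IOVA_VA;
    }

For example, a 4.9 kernel fails the first condition even when VA mode is
requested, and a 5.15 kernel with PA mode fails the second.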

Ethtool
-------

Ethtool is a Linux-specific tool with corresponding support in the kernel.
The current version of KNI provides minimal ethtool functionality
including querying version and link state. It does not support link
control, statistics, or dumping device registers.