.. _rendezvous-api:

Rendezvous
==========

.. automodule:: torch.distributed.elastic.rendezvous

Below is a state diagram describing how rendezvous works.

.. image:: etcd_rdzv_diagram.png

Registry
--------

.. autoclass:: RendezvousParameters
    :members:

.. autoclass:: RendezvousHandlerRegistry
    :members:

.. automodule:: torch.distributed.elastic.rendezvous.registry

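
As a rough illustration of the parameters class documented above, the sketch
below builds a ``RendezvousParameters`` object and reads values back from it.
The endpoint, the run id, and the extra ``timeout`` entry are made-up values
chosen for the example, not defaults of the library.

```python
from torch.distributed.elastic.rendezvous import RendezvousParameters

# All concrete values here are illustrative assumptions.
params = RendezvousParameters(
    backend="c10d",             # name of the rendezvous backend to use
    endpoint="localhost:29400", # hypothetical host:port of the backend
    run_id="example-job",       # hypothetical id shared by all nodes of one job
    min_nodes=1,
    max_nodes=2,
    timeout="900",              # extra keyword arguments are kept as config entries
)

# Extra entries can be read back, optionally converted to a type.
timeout_seconds = params.get_as_int("timeout")
missing = params.get("not_set", "fallback")

print(params.backend, params.min_nodes, params.max_nodes, timeout_seconds, missing)
```

Extra configuration is passed as keyword arguments, so backend-specific options
can travel with the same object without changing its constructor.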
Handler
-------

.. currentmodule:: torch.distributed.elastic.rendezvous

.. autoclass:: RendezvousHandler
    :members:

Exceptions
----------

.. autoclass:: RendezvousError
.. autoclass:: RendezvousClosedError
.. autoclass:: RendezvousTimeoutError
.. autoclass:: RendezvousConnectionError
.. autoclass:: RendezvousStateError

Implementations
---------------

Dynamic Rendezvous
******************

.. currentmodule:: torch.distributed.elastic.rendezvous.dynamic_rendezvous

.. autofunction:: create_handler

.. autoclass:: DynamicRendezvousHandler()
    :members: from_backend

.. autoclass:: RendezvousBackend
    :members:

.. autoclass:: RendezvousTimeout
    :members:

C10d Backend
^^^^^^^^^^^^

.. currentmodule:: torch.distributed.elastic.rendezvous.c10d_rendezvous_backend

.. autofunction:: create_backend

.. autoclass:: C10dRendezvousBackend
    :members:

Etcd Backend
^^^^^^^^^^^^

.. currentmodule:: torch.distributed.elastic.rendezvous.etcd_rendezvous_backend

.. autofunction:: create_backend

.. autoclass:: EtcdRendezvousBackend
    :members:

Etcd Rendezvous (Legacy)
************************

.. warning::
    The ``DynamicRendezvousHandler`` class supersedes the ``EtcdRendezvousHandler``
    class, and is recommended for most users. ``EtcdRendezvousHandler`` is in
    maintenance mode and will be deprecated in the future.

.. currentmodule:: torch.distributed.elastic.rendezvous.etcd_rendezvous

.. autoclass:: EtcdRendezvousHandler

Etcd Store
**********

The ``EtcdStore`` is the C10d ``Store`` instance type returned by
``next_rendezvous()`` when etcd is used as the rendezvous backend.

.. currentmodule:: torch.distributed.elastic.rendezvous.etcd_store

.. autoclass:: EtcdStore
    :members:

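
To give a feel for the C10d ``Store`` interface mentioned above, the sketch
below exercises ``set``, ``get``, and ``add``. It uses the in-memory
``HashStore`` from ``torch.distributed`` purely as a stand-in so that it runs
without an etcd server; it is not ``EtcdStore`` itself, and the key names are
invented for the example.

```python
from torch.distributed import HashStore

# HashStore is an in-memory c10d Store, used here only as a stand-in
# for EtcdStore (assumption: no etcd server is available).
store = HashStore()

# Publish a value under a key; other workers could read it back.
store.set("master_addr", "10.0.0.1")  # hypothetical key and value
addr = store.get("master_addr")        # values come back as bytes

# Atomically increment a shared counter, e.g. to count joined workers.
store.add("joined_workers", 1)
joined = store.add("joined_workers", 1)

print(addr.decode(), joined)
```

Code written against these ``Store`` methods should not need to know which
concrete store the rendezvous handler returned.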
Etcd Server
***********

The ``EtcdServer`` is a convenience class for starting and stopping an etcd
server in a subprocess. This is useful for testing or for single-node
(multi-worker) deployments where manually setting up a separate etcd server
is cumbersome.

.. warning::
    For production and multi-node deployments, please consider properly
    deploying a highly available etcd server, as it is the single point
    of failure for your distributed jobs.

.. currentmodule:: torch.distributed.elastic.rendezvous.etcd_server

.. autoclass:: EtcdServer

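
The start and stop cycle described above might look like the following
sketch. It assumes that an ``etcd`` binary and the ``python-etcd`` package are
installed; ``run_with_local_etcd`` is a hypothetical helper name introduced
here for illustration, not part of the library.

```python
def run_with_local_etcd(data_dir=None):
    """Start a throwaway etcd server, return its endpoint, then stop it.

    Hypothetical helper for illustration. Assumes an etcd binary can be
    found by EtcdServer and that the python-etcd package is installed.
    """
    # Imported lazily so that defining the helper does not require etcd.
    from torch.distributed.elastic.rendezvous.etcd_server import EtcdServer

    server = EtcdServer(data_dir)  # data_dir=None lets the server pick a temp dir
    server.start()
    try:
        # "host:port" string that a rendezvous endpoint could point to.
        return server.get_endpoint()
    finally:
        # Always stop the subprocess, even if the body above fails.
        server.stop()


# Example call (commented out because it needs an etcd binary):
# print(run_with_local_etcd())
```

Wrapping ``stop()`` in ``finally`` keeps a failed test from leaving an etcd
subprocess behind.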