# Examples for Distributed Training
## Examples with NVIDIA GPUs
Note: We recommend the [NVIDIA PyG Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pyg/tags) for the best results and easiest setup on NVIDIA GPUs.
### Examples with cuGraph
[cuGraph](https://github.com/rapidsai/cugraph) is a collection of packages focused on GPU-accelerated graph analytics including support for property graphs and scaling up to thousands of GPUs. cuGraph supports the creation and manipulation of graphs followed by the execution of scalable fast graph algorithms. It is part of the [RAPIDS](https://rapids.ai) accelerated data science framework.
[cuGraph GNN](https://github.com/rapidsai/cugraph-gnn) is a collection of GPU-accelerated plugins that support PyTorch and PyG natively through the _cuGraph-PyG_ and _WholeGraph_ subprojects. cuGraph GNN is built on top of cuGraph, leveraging its low-level [pylibcugraph](https://github.com/rapidsai/cugraph/python/pylibcugraph) API and its C++ primitives for sampling and other GNN operations ([libcugraph](https://github.com/rapidsai/cugraph/python/libcugraph)). It also includes the `libwholegraph` and `pylibwholegraph` libraries for high-performance distributed edgelist and embedding storage. Users can work with these lower-level libraries directly, or through the higher-level API in cuGraph-PyG, which directly implements the `GraphStore`, `FeatureStore`, `NodeLoader`, and `LinkLoader` interfaces.
Complete documentation on the RAPIDS graph packages, including `cugraph`, `cugraph-pyg`, `pylibwholegraph`, and `pylibcugraph`, is available on the [RAPIDS docs pages](https://docs.rapids.ai/api/cugraph/nightly/graph_support).
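As a rough sketch of how the higher-level cuGraph-PyG API plugs into a training script: the store and loader names below (`cugraph_pyg.data.GraphStore`/`FeatureStore`, `cugraph_pyg.loader.NeighborLoader`) and the exact indexing keys are assumptions based on the interfaces described above, not code from these examples, and the import is guarded so the snippet is inert without the RAPIDS stack installed.

```python
# Hedged sketch: wiring cuGraph-PyG's GraphStore/FeatureStore into a
# neighbor-sampling loader. All cugraph_pyg names here are assumptions
# drawn from its documented drop-in interfaces; the guard below keeps the
# snippet harmless on machines without RAPIDS.
try:
    from cugraph_pyg.data import FeatureStore, GraphStore
    from cugraph_pyg.loader import NeighborLoader
    HAVE_CUGRAPH_PYG = True
except ImportError:
    HAVE_CUGRAPH_PYG = False


def make_loader(edge_index, features, input_nodes):
    """Build a GPU-accelerated neighbor loader when cuGraph-PyG is present."""
    if not HAVE_CUGRAPH_PYG:
        raise RuntimeError("cugraph-pyg is not installed")
    graph_store = GraphStore()
    # Register the edge list under a (src, rel, dst) key in COO layout.
    graph_store[("node", "to", "node"), "coo"] = edge_index
    feature_store = FeatureStore()
    # Register node features under the attribute name "x".
    feature_store["node", "x", None] = features
    return NeighborLoader(
        (feature_store, graph_store),
        num_neighbors=[10, 10],  # two-hop fanout, chosen arbitrarily here
        batch_size=1024,
        input_nodes=input_nodes,
    )


print("cugraph-pyg available:", HAVE_CUGRAPH_PYG)
```

Because the loader implements the same interface as `torch_geometric.loader.NeighborLoader`, the rest of the training loop can stay unchanged when switching backends.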
| Example | Scalability | Description |
| ------------------------------------------------------------------------------ | ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------- |
| [`ogbn_train_cugraph.py`](./ogbn_train_cugraph.py) | single-node | Single-node, multi-GPU version of `ogbn_train.py` using [cuGraph](https://www.nvidia.com/en-us/on-demand/session/gtc24-s61197/). |
| [`papers100m_gcn_cugraph_multinode.py`](./papers100m_gcn_cugraph_multinode.py) | multi-node | Example for training GNNs on a homogeneous graph on multiple nodes using [cuGraph](https://www.nvidia.com/en-us/on-demand/session/gtc24-s61197/). |
### Examples with Pure PyTorch
| Example | Scalability | Description |
| ---------------------------------------------------------------------------------- | ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| [`distributed_batching.py`](./distributed_batching.py) | single-node | Example for training GNNs on multiple graphs. (deprecated in favor of [`ogbn_train_cugraph.py`](./ogbn_train_cugraph.py)) |
| [`distributed_sampling.py`](./distributed_sampling.py) | single-node | Example for training GNNs on a homogeneous graph with neighbor sampling. (deprecated in favor of [`ogbn_train_cugraph.py`](./ogbn_train_cugraph.py)) |
| [`distributed_sampling_multinode.py`](./distributed_sampling_multinode.py) | multi-node | Example for training GNNs on a homogeneous graph with neighbor sampling on multiple nodes. (deprecated in favor of [`papers100m_gcn_cugraph_multinode.py`](./papers100m_gcn_cugraph_multinode.py)) |
| [`distributed_sampling_multinode.sbatch`](./distributed_sampling_multinode.sbatch) | multi-node | Example for submitting a training job to a Slurm cluster using [`distributed_sampling_multinode.py`](./distributed_sampling_multinode.py). |
| [`papers100m_gcn.py`](./papers100m_gcn.py) | single-node | Example for training GNNs on the `ogbn-papers100M` homogeneous graph w/ ~1.6B edges. (deprecated in favor of [`ogbn_train_cugraph.py`](./ogbn_train_cugraph.py)) |
| [`papers100m_gcn_multinode.py`](./papers100m_gcn_multinode.py) | multi-node | Example for training GNNs on a homogeneous graph on multiple nodes. (deprecated in favor of [`papers100m_gcn_cugraph_multinode.py`](./papers100m_gcn_cugraph_multinode.py)) |
| [`pcqm4m_ogb.py`](./pcqm4m_ogb.py) | single-node | Example for training GNNs for a graph-level regression task. |
| [`mag240m_graphsage.py`](./mag240m_graphsage.py) | single-node | Example for training GNNs on a large heterogeneous graph. |
| [`taobao.py`](./taobao.py) | single-node | Example for training link prediction GNNs on a heterogeneous graph. (deprecated in favor of [`taobao_mnmg.py`](https://github.com/rapidsai/cugraph-gnn/blob/branch-25.04/python/cugraph-pyg/cugraph_pyg/examples/taobao_mnmg.py) with [cuGraph](https://www.nvidia.com/en-us/on-demand/session/gtc24-s61197/)) |
| [`model_parallel.py`](./model_parallel.py) | single-node | Example for model parallelism by manually placing layers on each GPU. |
| [`data_parallel.py`](./data_parallel.py) | single-node | Example for training GNNs on multiple graphs. Note that [`torch_geometric.nn.DataParallel`](https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#torch_geometric.nn.data_parallel.DataParallel) is deprecated and [discouraged](https://github.com/pytorch/pytorch/issues/65936). |
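The model-parallel pattern referenced by `model_parallel.py` can be sketched in plain PyTorch. This is a minimal illustration under stated assumptions, not the example's actual code: the stage sizes and module names are made up, and the sketch falls back to CPU for both stages when fewer than two GPUs are visible.

```python
import torch
import torch.nn as nn


class TwoStageModel(nn.Module):
    """Minimal model-parallel sketch: each stage lives on its own device,
    and activations are moved between devices inside forward()."""

    def __init__(self):
        super().__init__()
        n_gpus = torch.cuda.device_count()
        # Place each stage on its own GPU when two are available;
        # otherwise keep everything on CPU so the sketch still runs.
        self.dev0 = torch.device("cuda:0") if n_gpus >= 2 else torch.device("cpu")
        self.dev1 = torch.device("cuda:1") if n_gpus >= 2 else torch.device("cpu")
        self.stage0 = nn.Linear(16, 32).to(self.dev0)
        self.stage1 = nn.Linear(32, 8).to(self.dev1)

    def forward(self, x):
        x = torch.relu(self.stage0(x.to(self.dev0)))
        # Activations cross the device boundary here.
        return self.stage1(x.to(self.dev1))


model = TwoStageModel()
out = model(torch.randn(4, 16))
print(out.shape)  # torch.Size([4, 8])
```

Unlike data parallelism, only the activations (not gradients of replicated weights) move between devices, which is why this pattern suits models too large for a single GPU's memory.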
## Examples with Intel GPUs (XPUs)
| Example | Scalability | Description |
| -------------------------------------------------------------- | ---------------------- | ------------------------------------------------------------------------ |
| [`distributed_sampling_xpu.py`](./distributed_sampling_xpu.py) | single-node, multi-GPU | Example for training GNNs on a homogeneous graph with neighbor sampling. |