Distributed architectures use several compute entities (nodes for short).
Nodes can be PCs, workstations, or nodes of a cluster.
They can be connected by TCP/IP or other interconnects such as InfiniBand.
An alternative term is heterogeneous systems.

Bart uses the Message Passing Interface (MPI) [1] to work on distributed systems.

# Basic Requirements
1.) Install (Open-)MPI on all nodes
2.) Make sure the bart executable works on all nodes (compiling on each node may be necessary, especially if different operating systems are used)
3.) Build bart with MPI=1
4.) Set up ssh connections between the nodes (requires an ssh server on each node as well)
5.) Set up a file share between the nodes (for reading/writing files)
6.) Run bart with mpirun and the parallel flags (bart -p)
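
A quick way to check part of this list on a node is to test whether the required tools are on the PATH. The following is only a minimal sketch: the helper name missing_tools is made up for this example, and it only checks that the tools exist, not that their versions match across nodes.

```sh
#!/bin/sh
# Hypothetical helper: print the names of the given tools that are not on PATH.
missing_tools() {
    missing=""
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
    done
    # Strip the leading space; empty output means every tool was found.
    echo "${missing# }"
}

# Tools implied by the list above: the MPI launcher, the ssh client, and bart.
result=$(missing_tools mpirun ssh bart)
if [ -n "$result" ]; then
    echo "missing on $(uname -n): $result"
else
    echo "all required tools found on $(uname -n)"
fi
```

Run it once on every node, since each node needs its own working installation.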

# Bart command line interface (no MPI required)
- bart -p <pflags> [-s dim0 ... dimN] [-e maxdim0 ... maxdimN] [-S] tool <args>
  - -l <lflags>: specifies the loop dimensions, which will be parallelized
  - -s: gives the start indices or (if -e is not given) the slice that should be processed
  - -e: gives the exclusive upper bound in each dimension (imagine a for loop with i < maxdim)
  - Example 1: loop in dimension 13 from item 1 to 3
    - bart -l $(bart bitmask 13) -s 1 -e 4 fft 3 example_file.ra
  - Example 2: process only slice 2 in a 3D stack
    - bart -l $(bart bitmask 3) -s 2 fft 1 example_file.ra
  - Example 3: loop in dimension 13 over all items (3 items required in this dimension for this example)
    - bart -l $(bart bitmask 13) -e 3 <tool> <args> example_file.ra
  - Example 4: loop in dimensions 12 and 13 over items 2 to 3 and 3 to 5
    - bart -l $(bart bitmask 12 13) -s 2:3 -e 3:5 fft 1 example_file.ra
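
The flags passed to -l are a bitmask in which bit d is set for every selected dimension d. As an illustration of what 'bart bitmask' returns in the examples above, here is a small stand-alone sketch in plain shell arithmetic; the function below is a stand-in written for this example and is not part of bart.

```sh
# Stand-in for 'bart bitmask': set bit d for each dimension index d given.
bitmask() {
    mask=0
    for dim in "$@"; do
        mask=$(( mask | (1 << dim) ))
    done
    echo "$mask"
}

bitmask 13       # prints 8192
bitmask 12 13    # prints 12288
```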

# Useful commands
- [Set up ssh key]: ssh-keygen -t rsa -b 2048 -f ~/.ssh/mpi
   - generates an RSA key with a bit length of 2048 in the file ~/.ssh/mpi
- [Copy ssh key]: ssh-copy-id <user>@<node>
   - copies your public key to the account <user> on <node>; 'ssh-copy-id' may have to be installed separately
   - Advantage: you cannot accidentally share your private key
- [Set up file share over ssh]: sshfs <user>@<node>:<wd> <wd_on_localhost>
  - mounts <wd> on <wd_on_localhost>
  - <user>: user on <node>. Note: if ssh is set up correctly, you don't need the same user on each node
- [Unmount file share over ssh]: fusermount -u <abs_path>
- [Run bart with MPI]: mpirun -n <x> --host <host>:<slots> -x <env_var> -wdir <wd> bart -p <pflags> [-s dim0 ... dimN] [-e max0 ... maxN] [-S] tool <args> : ...
   - -n: <x> specifies how many slots should be used, slots != processes
   - --host <host>: IP address or, better, ssh alias; <slots> specifies the maximum number of slots available on that host and has to correspond to <x>
   - -x <env_var>: sets an environment variable if needed (e.g. useful for DEBUG_LEVEL)
   - -wdir <wd>: specifies the working directory, which can be different on each node
   - bart has to be the absolute path to the bart executable on each node if it is not the same on all nodes or not in <wd>
   - ':' separates different node configurations (basically, repeat everything for each node with adapted parameters; if the parameters are the same on all nodes, you don't need this, just use --hostfile <list of compute entities>)
- [Set up symbolic link]: ln -s <source> <target>
   - Set up a symbolic link to create the same file structure on all nodes and circumvent the need for '-wdir'
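
When all nodes use the same parameters, the --hostfile variant is usually the shorter one. The sketch below assumes Open MPI's hostfile format ("<host> slots=<n>"). The node names node1 and node2 are placeholders. The mpirun command is only printed, not executed, so it can be checked first.

```sh
# Write a hostfile for two hypothetical nodes with two slots each.
hostfile=$(mktemp)
cat > "$hostfile" <<'EOF'
node1 slots=2
node2 slots=2
EOF

# Sum the slots so that -n corresponds to what the hostfile offers.
total_slots=$(awk -F'slots=' '{ sum += $2 } END { print sum }' "$hostfile")

# Print the resulting command line; <pflags>, tool and <args> stay placeholders.
echo "mpirun -n $total_slots --hostfile $hostfile bart -p <pflags> tool <args>"
```

Deriving -n from the hostfile avoids the mismatch between <x> and <slots> mentioned above.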


# Useful files
- ~/.ssh/config:
"Host <node_name other nodes should refer to>
  User <user of this node>
  HostName <ip of other node>
  IdentityFile <private key whose public part was copied to the node>"
  - alternatively, set up '/etc/hosts' to use names for nodes
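
For illustration, a filled-in entry could be generated as follows. This is only a sketch: the helper name ssh_host_entry, the alias node1, the user alice and the address 192.0.2.10 (from a range reserved for documentation) are all placeholders. Only the key path ~/.ssh/mpi is taken from the ssh-keygen command above.

```sh
# Hypothetical helper: print one ~/.ssh/config entry for alias, user, address, key.
ssh_host_entry() {
    printf 'Host %s\n  User %s\n  HostName %s\n  IdentityFile %s\n' "$1" "$2" "$3" "$4"
}

# The tilde is quoted on purpose; ssh expands it itself when reading the config.
ssh_host_entry node1 alice 192.0.2.10 '~/.ssh/mpi'
```

After reviewing the output, it can be appended to ~/.ssh/config with '>> ~/.ssh/config'. 'ssh node1' should then connect without further options.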



# Troubleshooting
- Q: Nothing happens at all after running 'mpirun'
  A: Check the MPI versions on each node; they should be the same or at least have the same major version
     'mpirun --version'

- Q: "No protocol specified":
  A: This is a warning/error from ssh. It means that no protocol for X is specified

- Q: "ssh: connect to host <ip addr> port 22: Connection refused"
  A: Most likely you have to install openssh-server on the host

- Q: "WARNING: Open MPI failed to TCP connect to a peer MPI process."
  A: Most likely you didn't set up ssh correctly; check each node connection again.
     You should be able to connect from each node to every other node and vice versa

- Q: "/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33'
      /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34'"
  A: Most likely your nodes use different OS (distributions) with different versions of GLIBC.
     Compile bart on each node separately.
     Use the absolute path to this binary in your call of 'mpirun'; different nodes can be separated by ':'
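
The version check from the first question can be partly automated. The sketch below assumes the Open MPI output format, where the first line of 'mpirun --version' looks like "mpirun (Open MPI) 4.1.2"; other MPI implementations print something different. The helper name mpi_major is made up for this example, and the two version strings are sample input, not real measurements.

```sh
# Hypothetical helper: extract the major version from the first line of the
# 'mpirun --version' output passed as the first argument.
mpi_major() {
    echo "$1" | sed -n '1s/.* \([0-9][0-9]*\)\.[0-9].*/\1/p'
}

# Sample strings standing in for the output collected from two nodes.
a=$(mpi_major "mpirun (Open MPI) 4.1.2")
b=$(mpi_major "mpirun (Open MPI) 4.0.3")
[ "$a" = "$b" ] && echo "major versions match ($a)"
```

On a real setup, the sample strings would be replaced by "$(ssh <node> mpirun --version)" for each node.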

[1] https://mpitutorial.com/