1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614
|
<!--
%\VignetteIndexEntry{Parallel Workers on Other Machines}
%\VignetteAuthor{Henrik Bengtsson}
%\VignetteKeyword{R}
%\VignetteKeyword{package}
%\VignetteKeyword{vignette}
%\VignetteEngine{parallelly::selfonly}
-->
# Introduction
Sometimes it is not sufficient to parallize on a single computer - it
cannot provide all of the compute power we are looking for. When we
hit this limit, a natural next level is to look at other computers
near us, e.g. desktops in an office or other computers we have access
to remotely. In this vignette, we will cover how to run parallel R
workers on other machines. Sometimes we distinguish between local
machines and on remote machines, where _local_ machines are machines
considered to be on the same local area network (LAN) and that might
share a common file system. _Remote_ machines are machines that are on
a different network and that do not share a common file system with
the main R computer. In most cases the distinction between local and
remote machines does not matter, but in some cases we can take
advantages of workers being local.
Regardless of running parallel workers on local or remote machines, we
need a way to connect to the machines and launch R on them.
## SSH and R configuration (once)
### Verifying SSH access
The most common approach to connect to another machine is via Secure
Shell (SSH). Linux, macOS, and MS Windows all have a built-in SSH
client called `ssh`. Consider we have another Linux machine called
`n1.remote.org`, it can be accessed via SSH, and we have an account
`alice` on that machine. For the case of these instructions, it does
not matter whether `n1.remote.org` is on our local network (LAN) or a
remote machine on the internet. Also, to make it clear that we do not have to have the same username on `n1.remote.org` and on our local machine, we will use `ally` as the username on our local machine.
To access the `alice` user account on `n1.remote.org` from our local
computer, we open a terminal on the local computer and then SSH to the
other machine as:
```sh
{ally@local}$ ssh alice@n1.remote.org
alice@n1.remote.org's password: *************
{alice@n1}$
```
The commands to call are what follows after the prompt. The prompt on
our local machine is `{ally@local}$`, which tells us that our username
is `ally` and the name of the local machine is `local`. The prompt on
the `n1.remote.org` machine is `{alice@n1}$`, which tells us that our
username on that machine is `alice` and that the machine is called
`n1` on that system.
To return to our local machine, exit the SSH shell by typing `exit`;
```sh
{alice@n1}$ exit
{ally@local}$
```
If we get this far, we have confirmed that we have SSH access to this
machine.
### Configure password-less SSH access
Launching parallel R workers is typically done automatically in the
background, which means it cumbersome, or even impossible, to enter
the SSH password for each machine we wish to connect to. The solution
is to configure SSH to connect with _public-private keys_, which
pre-establish SSH authentication between the main machine and the
machine to connect to. As this is common practice when working with
SSH, there are numerous online tutorials explaining how to configure
private-public SSH key pairs. Please consult one of them for the
details, but the gist is to use (i) `ssh-keygen` to generate the
public-private SSH keys on your local machine, and then (ii)
`ssh-copy-id` to deploy the public key on the machine you want to
connect to.
Step 1: Generate public-private SSH keys locally
```sh
{ally@local}$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/ally/.ssh/id_rsa):
Created directory '/home/ally/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/ally/.ssh/id_rsa
Your public key has been saved in /home/ally/.ssh/id_rsa.pub
The key fingerprint is:
SHA256:Sx48uXZTUL12SKKUzWB77e/Pm3TifqrDIbOnJ0pEWHY ally@local
The key's randomart image is:
+---[RSA 3072]----+
| o E=.. |
| + ooo+.o |
| . ..o..o.o |
| o ..o .+ .|
| S .... |
| + =o.. . |
| * o= ...o|
| o .o.=..++|
| ...=.++=*|
+----[SHA256]-----+
```
Step 2: Copy the public SSH key to the other machine
```sh
{ally@local}$ ssh-copy-id alice@n1.remote.org
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/ally/.ssh/id_rsa.pub"
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
alice@n1.remote.org:s password: *************
Number of key(s) added: 1
Now try logging into the machine, with: "ssh 'alice@n1.remote.org'"
and check to make sure that only the key(s) you wanted were added.
```
At this point, we should be able to SSH to the other machine without
having to enter a password;
```sh
{ally@local}$ ssh alice@n1.remote.org
{alice@n1}$
```
Type `exit` to return to your local machine.
Note, if you later want to connect to other machines,
e.g. `n2.remote.org` or `hpc.my-university.edu`, you may re-use the
above generated keys for those systems to. In other words, you do not
have to use `ssh-keygen` to generate new keys for those machines.
### Verifying R exists on the other machine
In order to run parallel R workers on another machine, it (i) needs to
be installed on that machine, and (ii) ideally readily available by
calling `Rscript`. Parallel R workers are launched via `Rscript`,
instead of the more commonly known `R` command - both come with all R
installation, i.e. if you have one of them, you have the other too.
To verify that R is installed on the other machine, SSH to the machine and call `Rscript --version`;
```sh
{ally@local}$ ssh alice@n1.remote.org
{alice@n1}$ Rscript --version
Rscript (R) version 4.4.2 (2024-10-31)
```
If you get:
```sh
{alice@n1}$ Rscript --version
Rscript: command not found
```
then R is either not installed on that machine, or it cannot be
found. If it is installed, but cannot be found, make sure that
environment variable `PATH` his configured properly on that machine.
### Final checks
With password-less SSH access, and R being available, on the other
machine, we should be able to SSH into the other machine and query the
R version in a single call:
```sh
{ally@local}$ ssh alice@n1.remote.org Rscript --version
Rscript (R) version 4.4.2 (2024-10-31)
{ally@local}$
```
This is all that is needed to launch one or more parallel R workers on
machine `n1.remote.org` running under user `alice`. We can test this
from within R with the **parallelly** package using:
```sh
{ally@local}$ R --quiet
library(parallely)
cl <- makeClusterPSOCK("n1.remote.org", user = "alice")
print(cl)
#> Socket cluster with 1 nodes where 1 node is on host 'n1.remote.org'
#> (R version 4.4.2 (2024-10-31), platform x86_64-pc-linux-gnu)
parallel::stopCluster(cl)
```
If you want to run parallel workers on other machines, repeat the
above for each machine. After this, you will be able to launch
parallel R workers on these machines with little efforts.
### Machine-specific SSH customization (recommended)
Some machines do not use the default port 22 to answer on SSH
connection requests. If the machine uses another port, say, port 2201,
then we canspecify that via option `-p port`, when we connect to it,
e.g.
```sh
{ally@local}$ ssh -p 2201 alice@n1.remote.org
```
In R, we can specify argument `port=port` as in:
```sh
cl <- makeClusterPSOCK("n1.remote.org", port = 2201, user = "alice")
```
Now, it can be tedious to have to remember custom SSH ports and
usernames when setting up remote workers in R. It also adds noise and
distraction having such details in the R script, and not to mention
the fact that the R script has a specific username hardcoded into the
code makes the script less reproducible for other users - they need to
change the code to match their username. One way to avoid having to
give specific SSH options when calling `ssh` in the terminal, or
`makeClusterPSOCK()` in R, is to configure these settings in SSH. This
can be done via a file called `~/.ssh/config` _on your local
machine_. This file does not exist by default, so you would have to
create it yourself, if missing. It is a plain text file, so you should
use a plain text editor to create and edit it.
To configure SSH to use port 2201 and username `alice` whenever
connecting to `n1.remote.org`, the `~/.ssh/config` file should
contain the following entry:
```plain
Host n1.remote.org
User alice
Port 2201
```
With this, we can connect to `n1.remote.org` by just using:
```sh
{ally@local}$ ssh n1.remote.org
{alice@n1}$
```
SSH will then connect to the machine as if we had specified also `-p
2201` and `-l alice`. These settings will also be picked up when we
connect via R, meaning the following will also work:
```sh
cl <- makeClusterPSOCK("n1.remote.org")
```
To achieve the same for other machines, add another entry for them,
e.g.
```plain
Host n1.remote.org
User alice
Port 2201
Host n2.remote.org
User alice
Port 2201
Host hpc.my-university.edu
User alice.bobson
```
When hosts on the same system share the same setting, one can use
globbing to configure them the same way. For instance, the above can
be shorted to:
```plain
Host n?.remote.org
User alice
Port 2201
Host hpc.my-university.edu
User alice.bobson
```
Being able to connect to remote machines by just specifying their
hostnames is convenient and simplifies also the R code. Because of
this, we recommend setting up also `~/.ssh/config`.
# Examples
## Example: Two parallel workers on a single remote machine
Our first example sets up two parallel workers on the remote machine
`n1.remote.org`. For this to work, we need SSH access to the machine,
and it must have R installed, as explained in the above
section. Contrary to local parallel workers, the number of parallel
workers on remote machines is specified by repeating the machine name
an equal number of times;
```r
library(parallelly)
workers <- c("n1.remote.org", "n1.remote.org")
cl <- makeClusterPSOCK(workers, user = "alice")
print(cl)
#> Socket cluster with 2 nodes where 2 nodes are on host 'n1.remote.org'
#> (R version 4.4.2 (2024-10-31), platform x86_64-pc-linux-gnu).
```
_Comment_: In the **parallel** package, a parallel worker is referred
to a parallel node, or short _node_, which is why we use the same term
in the **parallelly** package.
Note, contrary to parallel workers running on the local machine,
parallel workers on remote machines are launched sequentially, that is
one after each other. Because of this, the setup time for a remote
parallel cluster will increase linearly with the number of remote
parallel workers.
_Technical details_: If we would add `verbose = TRUE` to
`makeClusterPSOCK()`, we would learn that the parallel workers are
launched in the background by R using something like:
```
'/usr/bin/ssh' -R 11058:localhost:11058 -l alice n1.remote.org Rscript ...
'/usr/bin/ssh' -R 11059:localhost:11059 -l alice n1.remote.org Rscript ...
```
This tells us that there is one active SSH connection per parallel
worker. It also reveals that that each of these connections uses a so
called _reverse tunnel_, which is used to establish a unique
communication channel between the main R process and the corresponding
parallel worker. It also this use of reverse tunneling that avoids
having to configure dynamic DNS (DDNS) and port-forwarding in our
local firewalls, which is cumbersome and requires administrative
rights. When using **parallelly**, there is no need for administrative
rights - any non-privileged user can launch remote parallel R workers.
## Example: Two parallel workers on two remote machines
This example sets up a parallel worker on each of two remote machines
`n1.remote.org` and `n2.remote.org`. It works very similar to the
previous example, but now the two SSH connections go to two different
machines rather than the same.
```r
library(parallelly)
workers <- c("n1.remote.org", "n2.remote.org")
cl <- makeClusterPSOCK(workers, user = "alice")
print(cl)
#> Socket cluster with 2 nodes where 1 node is on host 'n1.remote.org'
#> (R version 4.4.2 (2024-10-31), platform x86_64-pc-linux-gnu),
#> 1 node is on host 'n2.remote.org' (R version 4.4.2 (2024-10-31),
#> platform x86_64-pc-linux-gnu)
```
_Technical details_: If we would add `verbose = TRUE` also in this
case, we would see:
```
'/usr/bin/ssh' -R 11464:localhost:11464 -l alice n1.remote.org Rscript ...
'/usr/bin/ssh' -R 11465:localhost:11464 -l alice n2.remote.org Rscript ...
```
Recall, if we have configured SSH to pick up the username `alice` from
`~/.ssh/config` on our local machine, as shown in the previous
section, we could have skipped the `user` argument, and just used:
```r
workers <- c("n1.remote.org", "n2.remote.org")
cl <- makeClusterPSOCK(workers)
```
Note how these instructions for setting up a parallel cluster on these
two machines would be the identical for another user that has
configured their personal `~/.ssh/config` file.
## Example: Three parallel workers on two remote machines
When we now understand that we control the number of parallel workers
on a specific machine by replicate the machine name, we also know how
to launch different number of parallel workers on different machines.
From now on, we will also assume that the remote username no longer
has to be specified, because it has already been configured via the
`~/.ssh/config` file. With this, we can sets up two parallel workers
on `n1.remote.org` and one on `n2.remote.org`, by:
```r
library(parallelly)
workers <- c("n1.remote.org", "n1.remote.org", "n2.remote.org")
cl <- makeClusterPSOCK(workers)
print(cl)
#> Socket cluster with 3 nodes where 2 nodes are on host 'n1.remote.org'
#> (R version 4.4.2 (2024-10-31), platform x86_64-pc-linux-gnu),
#> 1 node is on host 'n2.remote.org' (R version 4.4.2 (2024-10-31),
#> platform x86_64-pc-linux-gnu)
```
Again, the `user` argument does not have to be specified, because it is configured in `~/.ssh/config`.
To generalize to many workers, we can use the `rep()` function. For example,
```r
workers <- c(rep("n1.remote.org", 3), rep("n2.remote.org", 4))
```
sets up three workers on `n1.remote.org` and four on `n2.remote.org`,
totaling seven parallel workers.
## Example: A mix of local and remote workers
As an alternative to `makeClusterPSOCK(n)`, we can use
`makeClusterPSOCK(workers)` to set up parallelly workers running on
the local machine. By convention, the name `localhost` is an alias to
your local machine. This means, we can use:
```sh
library(parallelly)
workers <- rep("localhost", 4)
cl_local <- makeClusterPSOCK(workers)
print(cl_local)
#> Socket cluster with 4 nodes where 4 nodes are on host 'localhost'
#> (R version 4.4.2 (2024-10-31), platform x86_64-pc-linux-gnu)
```
to launch four local parallel workers. Note how we did not have to
specify `user = "ally"`. This is because the default username is
always the local username. Next, assume we want to add another four
parallel workers running on `n1.remote.org`. We already know we can
set these up as:
```sh
library(parallelly)
workers <- rep("n1.remote.org", 4)
cl_remote <- makeClusterPSOCK(workers, user = "alice")
print(cl_remote)
#> Socket cluster with 4 nodes where 4 nodes are on host 'n1.remote.org'
#> (R version 4.4.2 (2024-10-31), platform x86_64-pc-linux-gnu).
```
At this point, we have two independent clusters of parallel workers:
`cl_local` and `cl_remote`. We can combine them into a single
cluster using:
```r
cl <- c(cl_local, cl_remote)
print(cl)
#> Socket cluster with 8 nodes where 4 nodes are on host 'localhost'
#> (R version 4.4.2 (2024-10-31), platform x86_64-pc-linux-gnu), 4
#> nodes are on host 'n1.remote.org' (R version 4.4.2 (2024-10-31),
#> platform x86_64-pc-linux-gnu)
```
To emphasize the usefulness of customizing our SSH connections via
`~/.ssh/config`, if the remote username would already have been
already configured there, we would be able to set up the full cluster
in one single call, as in:
```sh
library(parallelly)
workers <- c(rep("localhost", 4), rep("n1.remote.org", 4)
cl <- makeClusterPSOCK(workers)
```
## Example: Parallel workers on a remote machine accessed via dedicated login machine
Sometimes a remote machine, where we want to run R, is only accessible
via an intermediate login machine, which in SSH terms may also be
referred to as a "jumphost". For example, assume machine
`secret1.remote.org` can only be accessed by first logging into
`login.remote.org` as in:
```sh
{ally@local}$ ssh alice@login.remote.org
{alice@login}$ ssh alice@secret1.remote.org
{alice@secret1}$
```
To achive the same in a single SSH call, we can specify the "jumphost"
`-J hostname` option for SSH, as in:
```sh
{ally@local}$ ssh -J alice@login.remote.org alice@secret1.remote.org
{alice@secret1}$
```
We can use the `rshopts` argument of `makeClusterPSOCK()` to achieve
the same when setting up parallel workers. To launch three parallel
workers on `secret1.remote.org`, use:
```r
workers <- rep("secret1.remote.org", 3)
cl <- makeClusterPSOCK(
workers,
rshopts = c("-J", "login.remote.org"),
user = "alice"
)
```
A more convenient solution is to configure the jumphost in
`~/.ssh/config`, as in:
```plain
Host *.remote.org
User alice
Host secret?.remote.org
ProxyJump login.remote.org
```
This will cause any SSH connection to a machine on the `remote.org`
network to use username `alice`. It will also cause any SSH
connection to machines `secret1.remote.org`, `secret2.remote.org`, and
so on, to use jumphost `login.remote.org`. You can verify that all
this work by:
```sh
{ally@local}$ ssh login.remote.org
{alice@login}$
```
and then:
```sh
{ally@local}$ ssh secret1.remote.org
{alice@secret1}$
```
If the above work, then the following will work from within R:
```r
library(parallelly)
workers <- rep("secret1.remote.org", 3)
cl <- makeClusterPSOCK(workers)
```
# Special needs and tweaks
The above sections cover most common use cases for setting up a
parallel cluster from a local Linux, macOS, and MS Windows
machine. However, there are cases where there above does not work, or
you prefer to use another solution. This section aims to cover such
alternatives.
## Example: Remote workers ignoring any remote .Rprofile settings
To launch parallel workers skipping any `~/.Rprofile` settings on the
remote machines, we can pass option `--no-init-file` to `Rscript` via
argument `rscript_args`. For example,
```r
workers <- rep("n1.remote.org", 2)
cl <- makeClusterPSOCK(workers, rscript_args = "--no-init-file")
```
will launch two parallel workers on `n1.remote.org` ignoring any
`.Rprofile` files.
## Example: Use PuTTY on MS Windows to connect to remote worker
If you run on an MS Windows machine and prefer to use PuTTY to manage
your SSH connections, or for other reasons cannot use the built-in
`ssh` client, you can tell `makeClusterPSOCK()` to use PuTTY and your
PuTTY settings via various arguments.
Here is an example that launches two parallel workers on
`n1.remote.org` running under user `alice` connecting via SSH port
2201 using PuTTY and public-private SSH keys in file
`C:/Users/ally/.ssh/putty.ppk`:
```r
workers <- "n1.remote.org"
cl <- makeClusterPSOCK(
workers,
user = "alice",
rshcmd = "<putty-plink>",
rshopts = c("-P", 2201, "-i", "C:/Users/ally/.ssh/putty.ppk")
)
```
## Example: Two remote workers running on MS Windows
Thus far we have considered our remote machines to run a Unix-like
operating system, e.g. Linux or macOS. If your remote machines run MS
Windows, you can use similar techniques to launch parallel workers
there as well. For this to work, the remote MS Windows machines must
accept incoming SSH connections, which is something most Windows
machines are not configured to do by default. If you do not know set
that up, or if you do not have the system permissions to do so, please
reach out to you system administrator of those machines.
Assuming we have SSH access to two MS Windows machines,
`mswin1.remote.org` and `mswin2.remote.org`, everything works the same
as before, except that we need to specify also argument `rscript_sh =
"cmd"`;
```r
workers <- c("mswin1.remote.org", "mswin2.remote.org")
cl <- makeClusterPSOCK(workers, rscript_sh = "cmd")
```
That argument specifies that the parallel R workers should be launched
on the remote machines via MS Windows' `cmd.exe` shell.
|