TODO list for distcc
See also TODO comments in source files and doc/todo/
convert web site from latte to, say, php
Latte is nice, but seems to be gradually becoming unmaintained;
it is not packaged in Ubuntu or Gentoo.
failing to resolve a host should be a soft failure
State files depend on host byte order and break when DISTCC_DIR is
shared between heterogeneous machines. Of course, sharing this
directory is probably a bad idea anyhow.
error messages get badly interleaved
Large writes are not always atomic.
When running parallel compiles that produce many warnings/errors,
the errors can get mixed up, both between lines and within lines.
Someone suggested writing out through stdio, but I don't see why
that would particularly help.
This needs to be done even when writing to a file.
It might be more useful to voluntarily write output one line at a
time so as to increase the chance that each line is written atomically.
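A minimal sketch of that one-line-per-write idea, assuming we copy the
compiler's output through a helper like the hypothetical copy_lines()
below; writes of less than PIPE_BUF bytes to a pipe are atomic, so whole
lines are much less likely to be split:

    /* Sketch only, not existing distcc code: copy compiler output to our
     * own stderr one complete line per write() call, so concurrent clients
     * are less likely to interleave output mid-line. */
    #include <string.h>
    #include <unistd.h>

    static void copy_lines(int in_fd, int out_fd)
    {
        char buf[4096];
        size_t used = 0;
        ssize_t n;

        while ((n = read(in_fd, buf + used, sizeof buf - used)) > 0) {
            used += (size_t) n;
            char *nl;
            while ((nl = memchr(buf, '\n', used)) != NULL) {
                size_t line_len = (size_t) (nl - buf) + 1;
                (void) write(out_fd, buf, line_len);   /* one line per write() */
                memmove(buf, buf + line_len, used - line_len);
                used -= line_len;
            }
            if (used == sizeof buf) {                  /* very long line: flush as-is */
                (void) write(out_fd, buf, used);
                used = 0;
            }
        }
        if (used > 0)
            (void) write(out_fd, buf, used);           /* trailing partial line */
    }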
hardcode "gcc" not "cc"?
I don't think distcc works with anything else. So why take the
risk of calling anything else?
monitor/state/lock files on NFS
What happens if the processes are missing?
Should we perhaps specially handle files created by a remote
machine, e.g. put the hostname in?
Handle ESTALE or short EOF.
http://groups.google.com/groups?selm=netappCJyvKo.MrI%40netcom.com
if connection fails, reschedule remotely?
See messages from Heiko
Perhaps if compilation on one remote machine fails, try another, rather
than falling back to localhost?
However, we do need to make sure that if all remote possibilities
are eliminated, then we still run locally.
Perhaps we should more carefully distinguish e.g. "failed to connect",
"server dropped connection", etc etc.
Backing off from downed machines makes this a little
less necessary.
auto-check socklen_t mess
Dmitri says:
> By the way, about the accept() argument type (int, size_t, or socklen_t)
> issue I had already reported in a previous post, an autoconf macro is
> available after all. See:
> http://www.gnu.org/software/ac-archive/htmldoc/ac_prototype_accept.html
> http://www.gnu.org/software/ac-archive/htmldoc/ac_func_accept_argtypes.html
> It would be better to fix this specific issue, as I think it could break
> 64-bit builds where the type of the argument is actually important.
Perhaps it should be int if not defined. See accept(2).
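A tiny sketch of that fallback, assuming configure exports an
autoconf-style HAVE_SOCKLEN_T define (the macro name is an assumption):

    /* Sketch: if configure did not find socklen_t, fall back to int,
     * as accept(2) historically used.  HAVE_SOCKLEN_T is an assumed
     * configure-produced define, not necessarily what ours emits. */
    #include <sys/socket.h>

    #ifndef HAVE_SOCKLEN_T
    typedef int socklen_t;
    #endif

    /* accept() calls can then uniformly use:
     *   socklen_t len = sizeof sa;
     *   newfd = accept(listen_fd, (struct sockaddr *) &sa, &len);
     */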
some kind of memory leak in gnome monitor?
"compiler not found"
Perhaps distinguish this as a separate error case. We need to
trap the error from exec on the server, pass that back across the
network, and then handle it specially on the client.
Back off from the machine where it failed and retry locally?
This would be more useful when we explicitly set compiler versions.
scheduler should allow for clock/bus speed
(Perhaps front-side bus speed is dominant, since compiling won't
fit in cache?)
Suppose we have one 2GHz and one 1GHz machine. Jobs will take
roughly twice as long to run on the second one; conversely we can
run two jobs on the first one in the time it takes to run one on
the second machine.
gkrellm monitor for distcc
Ought to work with client-server mode
Possibly easier than writing everything ourselves
Show number of running jobs?
don't change the path
Rather than getting into this mess of changing the $PATH, perhaps
we should just check more carefully at the moment that we execute
things?
One problem with this might be interaction with ccache. If we
have doubly masqueraded distcc:ccache:gcc, then ccache probably
needs to see itself as the first item on the path to be able to
find the right gcc.
Perhaps we should remove items from the path, rather than trimming
the path?
It would be good to unify the code in dcc_support_masquerade()
with dcc_trim_path().
Perhaps distccd should do path munging when it gets a request
rather than at startup. It's ugly that the daemon's idea of the
correct path may be wrong if files are changed after the daemon is
started.
DEPENDENCIES_OUTPUT
The problem is when the preprocessor and compiler are both run
with this variable set. The compiler appends directions for
compiling from the temporary .i to the temporary .o, which is
unwanted.
This is a problem for ccache (now fixed), but not a problem for
distcc because we always run the preprocessor and always locally.
The remote compiler doesn't see the variable.
multiple cleanup calls at end
why is this happening?
monitor
There are two things that could be monitored. The first is the daemon
running on `this' computer; the second is the client that is sending
jobs across the network.
lisa writes:
Some things I'd like to see for the daemon:
1) Uptime
2) Configuration (port, lzo compression? ssh enabled? etc)
3) Number of jobs done (and a spread of the types of errors reported)
4) Average throughput
5) Current compiling tasks (pid, Src, filesize, filename, time recv'd)
For the client (ie, distccmon-gnome replacement):
On a per job basis
For SEND and RECEIVE state:
1) Current throughput
2) Type of connection (ssh? port? lzo?)
3) pid
4) filename
For (remote) COMPILE state:
1) The actual pre-processed filename
2) Type of connection (ssh? port? lzo?)
3) pid
For (local) LINKING state:
1) The actual pre-processed filename
2) pid
3) Where the object code was compiled
4) The post-link filename (gcc ... -o [display this])
For (local) PREPROCESS state:
1) filename
2) pid
A tall order to be sure, and it'd suck to do the GUI... but you asked.
:)
unlink .i file as soon as it has been opened for sending
Might help with vm performance by hinting to the kernel that it
will be discarded.
Possibly reduces the chances of temporary files being left behind.
However, will not work on Windows.
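A sketch of the open-then-unlink ordering; open_and_unlink() is a
hypothetical helper, not existing distcc code:

    /* Sketch: open the temporary .i file for sending and unlink it at
     * once, so the kernel can discard it as soon as the last descriptor
     * closes, even if we crash first.  Not portable to Windows, as noted. */
    #include <fcntl.h>
    #include <unistd.h>

    int open_and_unlink(const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd == -1)
            return -1;
        if (unlink(path) == -1) {
            /* not fatal; we can still send the file */
        }
        return fd;
    }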
variable to add extra remote cflags, to handle icpc
handle -xc++, etc
Can be either one or two arguments.
handle -Wp,-MF
Some makefiles seem to generate this. Aarg!
Installation as an SSH subsystem
Might make use easier on Windows. I don't see any real advantage
on Unix.
I think the daemon should already be OK for this. It just
requires a slight change in the way we call ssh from the client.
In fact, if you just wrote a small script that rearranged the
arguments and put that in DISTCC_SSH then everything would
probably be fine already.
control through command line
Handle options like
--distcc-verbose
--distcc-hosts=
to allow options to be set on the command line.
I'm not sure this is a good idea or useful.
If we have produced a .i file and need to fall back to running locally
then use that rather than the original source. On the other hand,
falling back to running the original command is possibly more robust.
Make absolutely sure that if we fail, the .o file is removed.
Perhaps it would be better to receive to a temporary file and then
rename into place? On the other hand, gcc seems to just write
directly, and if we fail or crash then Make ought to know not to
use it.
Count the preprocessor, and any compilations run locally, against the
load of localhost. In doing this, make sure that we cannot deadlock
against a load limit, by having a case where we need to hold one lock and
take another to make progress. I don't think there should be any such case
-- we can release the cpp lock before starting the main compiler.
allow more control over verbosity
For example, for the client, it would be nice to get just 'info'
level messages about things that can or can't be distributed.
split gcc-specific argument parsing into a separate module
boredom
When there are too many jobs submitted by make, then we have to
wait until any slot is available. Unfortunately there is no
OS-level locking system I can think of that allows us to block
waiting for any one of a number of resources.
If there are no slots to run, then at the moment we just sleep for
2s. This is OK, but can leave the processor idle. It would be
better to be woken up by other processes as they exit. One way to
do this would be to listen on a named pipe for notifications.
This must be backed up by a sleep timer because we may not get the
notification if e.g. the other process is killed. Also it won't
work on Cygwin, which doesn't have named pipes.
Simply doing a select() on a pipe allows us to block for a while
or until signalled. Simply doing a nonblocking write of one byte
to the pipe ought to allow waking up exactly one of the sleepers.
Using an OS level semaphore to guard access to slots might work
with some fudging, but there is no good portable implementation of
them so it is moot.
When woken, the clients can do one full round of trying to get a
slot and then go back to sleep.
This "guides" the OS scheduler towards keeping (almost) the exact
number of clients activated, without too many of them spinning.
We can't make the timeout too high, or the client will idle for a
long time waiting for it. But if we make it too low then we have
the thundering herd problem that currently exists...
Perhaps this is overengineering: people shouldn't make the -j
number so high that this is hit very often, and we need to have
the timeout anyhow, so why not just rely on it.
Just listening on a pipe is cheaper than checking all the locks.
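A rough sketch of the pipe-based wakeup, assuming a well-known fifo
(say under DISTCC_DIR) already opened nonblocking; the helper names are
made up:

    /* Sketch: waiters block in select() on the fifo with a fallback
     * timeout; a finishing client does a one-byte write, which wakes
     * roughly one sleeper.  The fifo fd is assumed opened O_NONBLOCK. */
    #include <sys/select.h>
    #include <unistd.h>

    /* Wait until someone signals a free slot, or the fallback timeout expires. */
    void wait_for_free_slot(int fifo_fd, int timeout_secs)
    {
        fd_set rfds;
        struct timeval tv = { timeout_secs, 0 };
        char c;

        FD_ZERO(&rfds);
        FD_SET(fifo_fd, &rfds);
        if (select(fifo_fd + 1, &rfds, NULL, NULL, &tv) > 0)
            (void) read(fifo_fd, &c, 1);   /* consume the wakeup byte */
        /* either way, go back and retry all the slot locks */
    }

    /* Called by a client as it releases its slot. */
    void signal_free_slot(int fifo_fd)
    {
        (void) write(fifo_fd, "x", 1);     /* nonblocking; ok if nobody is waiting */
    }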
intel CC
Does not understand the .ii extension.
We need to specify -xc++ to make it properly compile C++ from
preprocessed source.
Is it OK to just get the user to add this?
Perhaps we could add it always?
Do we need a DISTCC_ADD_OPTIONS variable?
clean up temp files when a client is signalled
Interrupting a compilation is pretty common. It might be good to
handle this more cleanly.
We can also remove status files. This would reduce the need for
monitor clients to handle dead state files, which might reduce
problems to do with viewing compilations by another user.
globally visible status files
Perhaps store in a world-writable /var/lib/distcc, so that
they're visible even when TMPDIR or HOME has been reset, as when
building with emerge.
Another good case to support is compilation from inside a chroot
jail.
It might also be nice to be able to see other people using your
machine either as a client or as a server.
This requires passing a trust boundary when publishing information
across accounts. The directory needs to be writable and the
programs need to be robust against other users trying to cause
mischief.
It's perhaps not great to allow that kind of security issue in a
default installation. Should we really create a mode 777
directory by default? umask will put some restrictions on what
can be seen.
Alternatively, have an environment variable that sets the state
location. If people want it globally visible they can set it to a
global location.
dnotify in monitor
This has been implemented, but I pulled it out because I'm not
convinced it is a good idea.
Signals into GTK seem to cause some trouble when running from
valgrind etc.
Polling is not too expensive, and is nice and simple. It also
allows easier ways to handle corner cases like cleaning up state
files left over after a compiler is terminated.
Could set up dnotify on the state directory so that we don't have
to keep polling it. This would slightly reduce our CPU usage when
idle, and might allow for faster updates when busy.
We still have to scan the whole directory though, so we don't want
to do it too often.
I'm not sure how to nicely integrate this into GNOME though.
dnotify sends us a signal, which doesn't seem to fit in well with
the GNOME system. Perhaps the dummy pipe trick? Or perhaps we
can jump out of the signal handler?
We can't call GTK code from inside a signal handler.
state changes are "committed" by renaming the file, so we'd want
to listen for DN_RENAME I think.
We need to make sure not to get into a loop by reacting to our own
delete events.
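A Linux-only sketch of the dnotify hookup combined with the dummy-pipe
trick mentioned above; watch_state_dir() is hypothetical and error
handling is omitted:

    /* Sketch: ask for a signal when anything in the state directory is
     * renamed (state changes are committed by rename).  The handler only
     * writes to a pipe, so the GTK main loop can watch the pipe's read
     * end instead of doing GTK work inside a signal handler. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <signal.h>
    #include <unistd.h>

    static int wake_pipe[2];               /* [0] watched by the main loop */

    static void on_dir_change(int sig)
    {
        (void) sig;
        (void) write(wake_pipe[1], "x", 1);    /* async-signal-safe */
    }

    int watch_state_dir(const char *dir)
    {
        int dir_fd = open(dir, O_RDONLY);
        if (dir_fd == -1 || pipe(wake_pipe) == -1)
            return -1;
        signal(SIGRTMIN, on_dir_change);
        fcntl(dir_fd, F_SETSIG, SIGRTMIN);
        fcntl(dir_fd, F_NOTIFY, DN_RENAME | DN_MULTISHOT);
        return wake_pipe[0];               /* add this fd to the GNOME main loop */
    }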
SSH connection hoarding
It might be nice to hold open SSH connections to avoid the network
and CPU overhead of opening new ones.
However, fsh is far too slow, probably because of being written in
Python.
It's only going to work on systems which can pass file descriptors
and therefore needs to be optional. Probably this only works on
Unix.
Building the kernel between the three x2000s seems to make
localhost thrash. A few jobs (but not many) get passed out to the
other machines.
Perhaps for C++ or something with really large files fsh would be
better because the cost of starting Python would be amortized
across more work.
I don't think this needs to be done in distcc. It can be a
completely separate project to just rewrite fsh into C. Indeed
you could even be compatible with the Python implementation and
just write the short-lived client bit in C.
Masquerade
It might be nice to automatically create the directory and
symlinks. However we don't know what compiler names they'll want
to hook...
Probably the best that we can do is provide clear instructions for
users or package distributors to set this up.
Packaging
Perhaps build RPMS and .debs?
Is it easy to build a static (or LSB-compliant?) .rpm on Debian?
What about an apt repository?
Statistics
Accumulate statistics on how many jobs are built on various machines.
Want to be able to do something like "watch ccache -s".
Perhaps just dump files into a status directory where they can be
examined?
Ignore (or delete) files over ~60s old. This avoids problems with
files hanging around from interrupted compilations.
refactor name handling
Common function that looks at file extensions and returns
information about them
- what is the preprocessed form of this extension?
- does this need preprocessing?
- is this a source file?
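A sketch of what such a common function might look like, using a single
table keyed by extension (names are hypothetical; the mappings follow
gcc's usual .c -> .i and C++ -> .ii conventions):

    #include <string.h>

    struct ext_info {
        const char *ext;            /* source extension */
        const char *preprocessed;   /* extension after cpp, or NULL */
        int needs_cpp;              /* does this need preprocessing? */
        int is_source;              /* is this a source file at all? */
    };

    static const struct ext_info ext_table[] = {
        { ".c",   ".i",  1, 1 },
        { ".cc",  ".ii", 1, 1 },
        { ".cpp", ".ii", 1, 1 },
        { ".cxx", ".ii", 1, 1 },
        { ".i",   NULL,  0, 1 },
        { ".ii",  NULL,  0, 1 },
        { ".s",   NULL,  0, 1 },
        { NULL,   NULL,  0, 0 }
    };

    const struct ext_info *lookup_extension(const char *filename)
    {
        const char *dot = strrchr(filename, '.');
        if (!dot)
            return NULL;
        for (const struct ext_info *e = ext_table; e->ext; e++)
            if (strcmp(dot, e->ext) == 0)
                return e;
        return NULL;                /* not a recognized source file */
    }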
check that EINTR is handled in all cases
check that all lengths are unsigned 32-bit
I think this is done, but it's worth checking a bit more.
abort when cpp fails
The same SIGCHLD handling approach used to feed the compiler from
a fifo might be used to abort early if the preprocessor fails.
This will happen reasonably often, whenever there is a problem
with an include, ifdef, comment, etc.
It might save waiting for a long connection to complete.
One complication is that we know the compiler ought to consume all
its input but we don't know when cpp ought to finish. So the
sigchld handler will have to check if it failed or not. If it
failed, then abort compilation. If it did not fail, then keep
going with the connection or whatever.
This is probably not worthwhile at the moment because connections
generally seem faster than waiting for cpp.
feed compiler from fifo
Probably quite desirable, because it allows the compiler to start
work sooner.
This was originally removed because of some hitches to do with
process termination. I think it can be put back in reliably, but
only if this is fixed. Perhaps we need to write to the compiler
in nonblocking mode?
Perhaps it would be better to talk to both the compiler and
network in nonblocking mode? It is pretty desirable to pull
information from the network as soon as possible, so that the TCP
windows and buffers can open right up.
Check CVS to remember what originally went wrong here.
Events that we need to consider:
Client forks
Compiler opens pipe
Client exits
Server opens pipe
There are a few possibilities here:
Client opens fifo, reads all input, and exits. The normal
success case.
Client never reads from fifo and just exits. Would happen if
the compiler command line was wrong.
Client reads from fifo but not the whole thing, and then
exits.
Opening the fifo is a synchronization point: in blocking mode
neither the compiler nor the server can proceed past it until the other
one opens it. If the compiler exits, then the server ought to be
broken out of it by a SIGCHLD. But there is a race condition
here: the SIGCHLD might happen just before the open() call.
We need to either jump out of the signal handler and abort the
compilation, or use a non-blocking open and a dummy pipe to break
the select().
If we jump out with longjmp then this makes the code a bit
convoluted.
Alternatively the signal handler could just do a nonblocking open
on the pipe, which would allow the open to complete, if it had not
already.
This was last supported in 0.12. That version doesn't handle the
compiler exiting without opening the pipe though.
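A sketch of the nonblocking-open-from-the-signal-handler idea described
above; fifo_path and the handler name are assumptions, and the remaining
race is noted in the comments:

    /* Sketch: the server feeds the compiler through a fifo, so its
     * blocking open(fifo, O_WRONLY) hangs if the compiler dies before
     * opening the read side.  A SIGCHLD handler that opens the read side
     * nonblocking lets that open() return; the main code then notices
     * child_exited and aborts.  open() is async-signal-safe. */
    #include <fcntl.h>
    #include <signal.h>
    #include <unistd.h>

    static volatile sig_atomic_t child_exited;
    static const char *fifo_path;           /* set before forking the compiler */

    static void on_sigchld(int sig)
    {
        (void) sig;
        child_exited = 1;
        int fd = open(fifo_path, O_RDONLY | O_NONBLOCK);
        if (fd != -1)
            close(fd);   /* closing immediately leaves a small race; sketch only */
    }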
streaming input output
We could start sending the preprocessed source out before it is
complete. This would require a protocol that allows us to send
little chunks from various streams, followed by an EOF.
This can certainly be done -- fsh and ssh do it. However,
particularly if we want to allow for streaming more than one thing
at a time, getting all the timing conditions right to avoid
deadlock caused by bubbles of data in TCP pipes is hard. rsync has had
trouble with this. It's even more hairy when running over ssh.
So on the whole I am very skeptical about doing this. Even when
refactored into a general 'distexec', this is more about batch
than interactive processing.
assemble on client
May be useful if there is a cross compiler but no cross assembler,
as is supposed to be the case for PPC AIX. See thread by Stuart D
Gathman. Would also allow piping output back to client, if the
protocol was changed to support that.
web site
http://user-mode-linux.sourceforge.net/thanks.html
sendfile
perhaps try sendfile to receive as well, if this works on any platforms.
static linking
cachegrind shows that a large fraction of client runtime is spent in the
dynamic linker, which is kind of a waste. In principle using dietlibc
might reduce the fixed overhead of the client. However, the nsswitch
functions are always dynamically linked: even if we try to produce a
static client it will include dlopen and eventually indirectly get libc,
so it's probably not practical.
testing
How to use Debian's make-kpkg with distcc? Does it work with the
masquerade feature?
http://moin.conectiva.com.br/files/AptRpm/attachments/apt-0.5.5cnc4.1.tar.bz2
coverage
Try running with gcov. May require all tests to be run from the same
directory (no chdir) so that the .da files can accumulate properly.
slow networks
Use Linux Traffic Control to simulate compilation across a slow
network.
scheduling onto localhost
Where does local execution fit into the picture?
Perhaps we could talk to a daemon on localhost to coordinate with
other processes, though that's a bit yucky.
However the client should use the same information and shared
state as the daemon when deciding whether it can take on another
job.
At the moment we just use a fixed number of slots, by default 4,
and this seems to work adequately.
make "localhost" less magic
Recognizing this magic string and treating it differently from
127.0.0.1 or the canonical name of the host is perhaps a bit
strange. People do seem to get it wrong. I can't think of a
better simple solution though.
blacklist/lock by IP, not by name
Means we need reliable addr-to-string for IPv4 and IPv6.
Any downside to this?
Would fix Zygo's open Debian bug.
DNS multi-A-records
build.foo.com expands to a list of all IP addresses for building.
Need to choose an appropriate target that has the right compilers.
Probably not a good idea.
If we go to using DNS roundrobin records, or if people have the same
HOSTS set on different machines, then we can't rely on the ordering of
hosts. Perhaps we should always shuffle them?
ssh is an interesting case because we probably want to open the
connection using the hostname, so that the ssh config's "Host"
sections can have the proper effect.
Sometimes people use multi A records for machines with several
routeable interfaces. In that case it would be bad to assume the
machine can run multiple jobs, and it is better to let the
resolver work out which address to use.
DNS SRV records
Can only be updated by zone administrator -- unless you have
dynamic DNS, which is quite possible.
better scheduler
What's the best way to schedule jobs? Multiprocessor machines present
a considerable complication, because we ought to schedule to them even
if they're already busy.
We don't know how many more jobs will arrive in the future. This
might be the first of many, or it might be the last, or all jobs might
be sequenced in this stage of compilation.
Generic OS scheduling theory suggests (??) that we should schedule a
job in the place where it is likely to complete fastest. In other
words, we should put it on the fastest CPU that's not currently busy.
We can't control the overall amount of concurrency -- that's down to
Make. I think all we really want is to keep roughly the same number
of jobs running on each machine.
I would rather not require all clients to know the capabilities of the
machines they might like to use, but it's probably acceptable.
We could also take the current load of the CPUs into account, but I'm
not sure if we could get the information back fast enough for it to
make a difference.
Note that loadavg on Linux includes processes stuck in D state,
which are not necessarily using any CPU.
We want to approximate all tasks on the network being in a single queue,
from which the servers invite tasks as cycles become available.
However, we also want to preserve the classic-TCP model of clients opening
connections to servers, because this makes the security model
straightforward, works over plain TCP, and also can work over SSH.
http://www.cs.panam.edu/~meng/Course/CS6354/Notes/meng/master/node4.html
Research this more.
We "commit" to using a particular server at the last possible moment: when
we start sending a job to it. This is almost certainly preferable to
queueing up on a particular server when we don't know that it will be the
next one free.
One analogy for this is patients waiting in a medical center to see one of
several doctors. They all wait in a common waiting room (the queue) until
a doctor (server) is free. Normally the doctors would come into the
waiting room to say "who's next?", but the constraint of running over TCP
means that in our case the doctors cannot initiate the transaction.
One approach would be to have a central controller (ie receptionist), who
knows which clients are waiting and which servers are free, but I don't
really think the complexity is justified at this stage.
Imagine if the clients sat so that they could see which doctor had their
door open and was ready to accept a new patient. The first client who
sees that then gets up to go through that door. There is a possibility of
a race when two patients head for the door at the same time, but we just
need to make sure that only one of them wins, and that the other returns
to her seat and keeps looking rather than getting stuck.
Ideally this will be built on top of some mechanism that does not rely on
polling.
I had wondered whether it would work to use refused TCP connections to
indicate that a server's door is closed, but I think that is no good.
It seems that at least on Linux, and probably on other platforms, you
cannot set the TCP SYN backlog down to zero for a socket. The kernel will
still accept new connections on behalf of the process if it is listening,
even if it's asked for no backlog and if it's not accepting them yet.
netstat shows these connections as already established, even though the
server has not accepted them yet.
It looks like the only way to reliably have the server turn away
connections is to either close its listening socket when it's too busy, or
drop connections. This would work OK, but it forces the client into
retrying, which is inefficient and ugly.
Suppose clients connect and then wait for a prompt from the server before
they begin to send. For multiple servers the client would keep opening
connections to new machines until it got an invitation to send a job.
This requires a change to the protocol, but it can be made backward
compatible if need be, though perhaps even that isn't required.
This would have the advantage of working over either TCP or SSH. The main
problem is that the client will potentially need to open connections to
many machines before it can proceed.
We almost certainly need to do this with nonblocking IO, but that should
be reasonably portable.
Local compilation needs to be handled by lockfiles or some similar
mechanism.
So in pseudocode this will be something like
looking_fds = []
while not accepted:
    select() on looking_fds:
        if any have failed, remove them
        if any have sent an invitation:
            close all the others
            use the accepted connection
    open a new connection
I'm not sure if connections should be opened in random order or the order
they're listed.
Clients are almost certainly not going to be accepted in the order in
which they arrive.
If the client sends its job early then it doesn't hurt anybody else. I
suppose it could open a lot of connections but that sort of fairness issue
is not really something that distcc needs to handle. (Just block the user
if they misbehave.)
We can't use select() to check for the ability to run a process locally.
Perhaps the select() needs to timeout and we can then, say, check the load
average.
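A C rendering of the pseudocode above, assuming the server signals its
invitation with a single byte; dcc_open_connection() is a placeholder for
a nonblocking connect to the next host in the list:

    /* Sketch of the client side: keep opening connections until one
     * server sends an invitation byte, then drop the rest. */
    #include <sys/select.h>
    #include <unistd.h>

    extern int dcc_open_connection(int host_index);   /* assumed helper */

    int wait_for_invitation(int nhosts)
    {
        int fds[64], nfds = 0, next_host = 0;

        for (;;) {
            if (next_host < nhosts && nfds < 64) {
                int fd = dcc_open_connection(next_host++);
                if (fd != -1)
                    fds[nfds++] = fd;
            }
            if (nfds == 0 && next_host >= nhosts)
                return -1;                          /* nobody left to try */

            fd_set rfds;
            FD_ZERO(&rfds);
            int maxfd = -1, i;
            for (i = 0; i < nfds; i++) {
                FD_SET(fds[i], &rfds);
                if (fds[i] > maxfd)
                    maxfd = fds[i];
            }
            struct timeval tv = { 1, 0 };
            if (select(maxfd + 1, &rfds, NULL, NULL, &tv) <= 0)
                continue;       /* timed out: caller could check localhost here */

            for (i = 0; i < nfds; i++) {
                char c;
                if (!FD_ISSET(fds[i], &rfds))
                    continue;
                if (read(fds[i], &c, 1) == 1) {     /* got an invitation */
                    int winner = fds[i], j;
                    for (j = 0; j < nfds; j++)
                        if (fds[j] != winner)
                            close(fds[j]);          /* close all the others */
                    return winner;
                }
                close(fds[i]);                      /* dropped or refused */
                fds[i--] = fds[--nfds];
            }
        }
    }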
problems with new protocol
Does anyone actually want this? I really need an example of
somewhere where it would be useful.
The server may need to know the right extension for the temporary
file to make the compiler behave in the right way. In fact,
knowing the acceptable temporary filenames is part of the
application definition.
Compression
Can compression automatically be turned on, rather than requiring
user configuration? I can't tell at the moment when would be the
right time to do that.
Is it cheap enough to always have it on? We not only pay the cost
of compression, but we also need to give up on using sendfile()
and therefore pay for more kernel-userspace transitions and some
data copying. Therefore probably not, at least for GigE.
User Manual
The UML manual is very good
- Add some documentation of the benchmark system. Does this belong
in the manual, or in a separate manual?
- FAQ: Can't you check the gcc version? No, because gcc binaries that
report the same version number can have different behaviours, perhaps due
to vendor/distributor patches.
Just cpp and linker?
Is it easy to describe how to install only the bits of gcc needed for
distcc clients? Basically the driver, header, linker, and specs. Would
this save much space?
Certainly installing gcc is much easier than installing a full cross
development environment, because you don't need headers or libraries. So
if you have a target machine that is a bit slower but not terrible (or you
don't have many of them) it might be convenient to do most of your builds
on the target, but rely on helpers with cross-compilers to help out.
-g support
I'm told that gcc may fix this properly in a future release. There would
then be no need to kludge around it in distcc.
Perhaps detect the -g option, and then absolutify filenames passed to the
compiler. This will cause absolute filenames to appear in error messages,
but I don't see any easy way to have both correct stabs info and also
correct error messages.
Is anything else wrong with this approach?
kill compiler
If the client is killed, it will close the connection. The server ought
to kill the compiler so as to prevent runaway processes on the server.
This probably involves select()ing for read on the connection.
The compilation will complete relatively soon anyhow, so it's not worth
doing this unless there is a simple implementation.
tcp fiddling
I wonder if increasing the maximum window size (sys.net.core.wmem_default,
etc) will help anything? It's probably dominated by scheduling
inefficiency at the moment.
The client does seem to spend time in wait_for_tcp_memory, which
might benefit from increasing the available memory.
benchmark
Try aspell and xmms, which may have strange Makefiles.
glibc
gtk/glib
glibc++
qt
gcc
gdb
linux
openoffice
mozilla
rsync-like distributed caching
Look in the remote machine's cache as well.
Perhaps use a SQUID-like broadcast of the file digest and other critical
details to find out if any machine in the workgroup has the file cached.
Perhaps this could be built on top of a more general file-caching
mechanism that maps from hash to body. At the moment this sounds like
premature optimization.
Send source as an rdiff against the previous version.
Needs to be able to fall back to just sending plain text of course.
Perhaps use different compression for source and binary.
librsync is probably not stable enough to do this very well.
--ping option
It would be nice to have a --ping client option to contact
all the remote servers, and perhaps return some kind of interesting
information.
Output should be machine-parseable e.g. to use in removing
unreachable machines from the host list.
Perhaps send little fixed signatures, based on --version. Would
this ever be useful?
non-CC-specific Protocol
Perhaps rather than getting the server to reinterpret the command
line, we should mark the input and output parameters on the client.
So what's sent across the network might be
distcc -c @@INPUT@@ -o @@OUTPUT@@
It's probably better to add additional protocol sections to say
which words should be the input and output files than to use magic
values.
The attraction is that this would allow a particularly knotty part
of code to be included only in the client and run only once. If any
bugs are fixed in this, then only the client will need to be
upgraded. This might remove most of the gcc-specific knowledge from
the server.
Different clients might be used to support various very different
distributable jobs.
We ought to allow for running commands that don't take an input or
output file, in case we want to run "gcc --version".
The drawback is that probably new servers need to be installed to
handle the new protocol version.
I don't know if there's really a compelling reason to do this. If
the argument parser depends on things that can only be seen on the
client, such as checking whether files exist, then this may be
needed.
The server needs to use an appropriately-named temporary file.
gcc weirdnesses:
distcc needs to handle $COMPILER_PATH and
$GCC_EXEC_PREFIX in some sensible way, if there is one.
Not urgent because I have never heard of them being used.
networking timeouts:
Also we want a timeout for name resolution. The GNU resolver has
a specific feature to do this. On other systems we probably need
to use alarm(), but that might be more trouble than it is worth. Jonas
Jensen says:
Timing out the connect call could be done easier than this, just by
interrupting it with a SIGALRM, but that's not enough to abort
gethostbyname. This method of longjmp'ing from a signal handler is what
they use in curl, so it should be ok.
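A sketch of that alarm()/siglongjmp technique around gethostbyname(); the
wrapper name and timeout handling are assumptions:

    /* Sketch: time out name resolution with alarm() and siglongjmp(), as
     * described above.  gethostbyname() is not interruptible any other
     * portable way on older resolvers. */
    #include <netdb.h>
    #include <setjmp.h>
    #include <signal.h>
    #include <unistd.h>

    static sigjmp_buf resolve_timeout;

    static void on_alarm(int sig)
    {
        (void) sig;
        siglongjmp(resolve_timeout, 1);
    }

    struct hostent *resolve_with_timeout(const char *name, unsigned secs)
    {
        struct hostent *he = NULL;

        if (sigsetjmp(resolve_timeout, 1) == 0) {
            signal(SIGALRM, on_alarm);
            alarm(secs);
            he = gethostbyname(name);
        }
        alarm(0);                /* cancel the timer either way */
        return he;               /* NULL on timeout or lookup failure */
    }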
configurable timeout?
Maybe make the various timeouts configurable? Isn't it possible
to choose values that suit everyone?
Maybe the initial connection timeout should be shorter?
waitstatus
Make sure that native waitstatus formats are the same as the
Unix/Linux/BSD formats used on the wire. (See
<http://www.opengroup.org/onlinepubs/007904975/functions/wait.html>,
which says they may only be interpreted by macros.) I don't know
of any system where they're different.
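A defensive sketch: decode the native status with the POSIX macros and
re-encode it explicitly for the wire, so nothing depends on the native
layout (the wire layout chosen here is an assumption):

    /* Sketch only: interpret the native wait status with the standard
     * macros and re-encode it in the BSD-style layout (exit status in the
     * high byte, terminating signal in the low byte).  On Linux/BSD this
     * is effectively a no-op, matching the note above. */
    #include <sys/wait.h>

    int encode_waitstatus_for_wire(int status)
    {
        if (WIFEXITED(status))
            return WEXITSTATUS(status) << 8;
        if (WIFSIGNALED(status))
            return WTERMSIG(status);
        return 1 << 8;           /* anything else: report generic failure */
    }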
override compiler name
distcc could support cross-compilation by a per-volunteer option to
override the compiler name. On the local host, it might invoke gcc
directly, but on some volunteers it might be necessary to specify a more
detailed description of the compiler to get the appropriate cross tool.
This might be insufficient for Makefiles that need to call several
different compilers, perhaps gcc and g++ or different versions of gcc.
Perhaps they can make do with changing the DISTCC host settings at
appropriate times.
I'm not convinced this complexity is justified.
Rusty is doing this in ccontrol, which is possibly a better place
for it.
use spawn() on Windows
fork() is very slow. Can we get away with only using spawn()?
Installable package for Windows
Also, it would be nice to have an easily installable package for Windows
that makes the machine be a Cygwin-based compile volunteer. It probably
needs to include cross-compilers for Linux (or whatever), or at least
simple instructions for building them.
autodetection (Rendezvous, etc)
http://dotlocal.org/mdnsd/
The Apple licence is apparently not GPL compatible.
Brad reckons SLP is a better fit.
Automatic detection ("zero configuration") of compile volunteers is
probably not a good idea, because it might be complicated to implement,
and would possibly cause breakage by distributing to machines which are
not properly configured.
OpenMOSIX autodiscovery
what is this?
central configuration
Notwithstanding the previous point, centralized configuration for a site
would be good, and probably quite practical. Setting up a list of
machines centrally rather than configuring each one sounds more friendly.
The most likely design is to use DNS SRV records (RFC2052), or perhaps
multi-RR A records. For example, compile.ozlabs.foo.com would resolve to
all relevant machines. Another possibility would be to use SLP, the
Service Location Protocol, but that adds a larger dependency and it seems
not to be widely deployed.
Large-scale Distribution
distcc in its present form works well on small numbers of close machines
owned by the same people. It might be an interesting project to
investigate scaling up to large numbers of machines, which potentially do
not trust each other. This would make distcc somewhat more like other
"peer-to-peer" systems like Freenet and Napster.
preprocess remotely
Some people might like to assume that all the machines have the same
headers installed, in which case we really can preprocess remotely and
only ship the source. Imagine e.g. a Clearcase environment where the same
filesystem view is mounted on all machines, and they're all running the
exact same system release.
It's probably not really a good idea, because it will be marginally faster
but much more risky. It is possible, though, and perhaps people building
files with enormous headers would like it.
Perhaps those people should just use a different tool like dmake, etc.
Local variables:
mode: indented-text
indent-tabs-mode: nil
End: