1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
|
SERVICE/WORKER ARCHITECTURE
* pause/resume job
* cancel job
* cancel map
* slow node
* one data point more expensive than the others
* infinite loop for some values
* exceptions for some values
* worker not stateless: one calculation interferes with the next
* worker memory leak
* worker requires initial state
* worker state changes between map calls
* client requests logging on service, any worker, or all workers
* logging based on condition
* rerun value with logging when worker fails
* parallel random number generator, with seed control
* worker which needs workers
CLUSTER MANAGEMENT
* priority jobs
* start/stop workers when job starts
* add client to processing pool, with preference for client jobs
* add new node
* machine reboots: exchange, service, worker, monitor
* unreliable processor/memory on some nodes
* workers which use local disk but don't clean up after themselves
* service monitoring
* identify all processes for a given user
* identify idle machines
JOB MANAGER
* job manager machine reboots
* job manager upgrades
* progress thumbnail
* client detach/attach
* retrieve results
* notify user when job starts (message queue or email)
DEPLOYMENT
* egg basket and trusted users
* different users having conflicting dependencies
* worker implemented in java, IDL, Matlab, R, ...
* client implemented java, IDL, Matlab, R, ...
* code movement while developing worker
DATA MANAGEMENT
* big file, independent disks
* storing run results
SCALABILITY
* workstation, dedicated cluster, batch queue, NoW, EC2, teragrid
SECURITY (SNS, Teragrid, shared clusters)
* service/workers run as user
* exchange runs as user
* job manager behind firewall
* cluster behind firewall
|