Name
ganglia - distributed monitoring system
Version
ganglia 2.5.7
The latest version of this software and document will always be found at
http://ganglia.sourceforge.net/. You are currently reading $Revision:
1.12 $ of this document.
Synopsis
   ______                  ___
  / ____/___ _____  ____ _/ (_)___ _
 / / __/ __ `/ __ \/ __ `/ / / __ `/
/ /_/ / /_/ / / / / /_/ / / / /_/ /
\____/\__,_/_/ /_/\__, /_/_/\__,_/
                 /____/  Distributed Monitoring System
Ganglia is a scalable distributed monitoring system for high-performance
computing systems such as clusters and Grids. It is based on a
hierarchical design targeted at federations of clusters. It relies on a
multicast-based listen/announce protocol to monitor state within
clusters and uses a tree of point-to-point connections amongst
representative cluster nodes to federate clusters and aggregate their
state. It leverages widely used technologies such as XML for data
representation, XDR for compact, portable data transport, and RRDtool
for data storage and visualization. It uses carefully engineered data
structures and algorithms to achieve very low per-node overheads and
high concurrency. The implementation is robust, has been ported to an
extensive set of operating systems and processor architectures, and is
currently in use on over 500 clusters around the world. It has been used
to link clusters across university campuses and around the world and can
scale to handle clusters with 2000 nodes.
The ganglia system consists of two unique daemons, a PHP-based web
frontend, and a few other small utility programs.
Ganglia Monitoring Daemon (gmond)
Gmond is a multi-threaded daemon which runs on each cluster node you
want to monitor. Installation is easy. You don't have to have a
common NFS filesystem or a database backend, install special
accounts, maintain configuration files or other annoying hassles.
Gmond is its own redundant, distributed database.
Gmond has four main responsibilities: monitor changes in host state,
multicast relevant changes, listen to the state of all other ganglia
nodes via a multicast channel and answer requests for an XML
description of the cluster state.
Each gmond transmits information in two different ways:
multicasting host state in external data representation (XDR) format
or sending XML over a TCP connection.
Ganglia Meta Daemon (gmetad)
Federation in Ganglia is achieved using a tree of point-to-point
connections amongst representative cluster nodes to aggregate the
state of multiple clusters. At each node in the tree, a Ganglia Meta
Daemon ("gmetad") periodically polls a collection of child data
sources, parses the collected XML, saves all numeric, volatile
metrics to round-robin databases and exports the aggregated XML over
a TCP socket to clients. Data sources may be either "gmond"
daemons, representing specific clusters, or other "gmetad" daemons,
representing sets of clusters. Data sources use source IP addresses
for access control and can be specified using multiple IP addresses
for failover. The latter capability is natural for aggregating data
from clusters since each "gmond" daemon contains the entire state of
its cluster.
Ganglia PHP Web Frontend
The Ganglia web frontend provides a view of the gathered information
via real-time dynamic web pages. Most importantly, it displays
Ganglia data in a meaningful way for system administrators and
computer users. Although the web frontend to ganglia started as a
simple HTML view of the XML tree, it has evolved into a system that
keeps a colorful history of all collected data.
The Ganglia web frontend caters to system administrators and users.
For example, one can view the CPU utilization over the past hour,
day, week, month, or year. The web frontend shows similar graphs for
Memory usage, disk usage, network statistics, number of running
processes, and all other Ganglia metrics.
The web frontend depends on the existence of "gmetad", which
provides it with data from several Ganglia sources. Specifically,
the web frontend connects to local port 8651 (by default) and
expects to receive a Ganglia XML tree. The web pages themselves are
highly dynamic; any change to the Ganglia data appears immediately
on the site. This behavior leads to a very responsive site, but
requires that the full XML tree be parsed on every page access.
Therefore, the Ganglia web frontend should run on a fairly powerful,
dedicated machine if it presents a large amount of data.
The Ganglia web frontend is written in the PHP scripting language,
and uses graphs generated by "gmetad" to display history
information. It has been tested on many flavours of Unix (primarily
Linux) with the Apache webserver and the PHP 4.1 module.
Installation
The latest version of all ganglia software can always be downloaded from
http://ganglia.sourceforge.net/downloads.php
Ganglia runs on Linux (i386, ia64, sparc, alpha, powerpc, m68k, mips,
arm, hppa, s390), Solaris, FreeBSD, AIX, IRIX, Tru64, HPUX, MacOS X and
Windows (cygwin beta), making it as portable as it is scalable.
Monitoring Core Installation
If you use the Linux RPMs provided on the ganglia web site, you can skip
to the end of this section.
Ganglia uses the GNU autoconf so compilation and installation of the
monitoring core is basically
% ./configure
% make
% make install
but there are some issues that you need to take a look at first.
Kernel multicast support
Currently ganglia will only run on machines with multicast support.
The vast majority of machines have multicast support by default. If
you have problems running ganglia, missing multicast support is a
common root cause. Later versions of ganglia will not have the
multicast requirement.
Gmetad is not installed by default
Since "gmetad" relies on the Round-Robin Database Tool ( see
http://www.rrdtool.com/ ) it will not be compiled unless you
explicitly request it with the --with-gmetad flag.
% ./configure --with-gmetad
The configure script will fail if it cannot find the rrdtool library
and header files. By default, it expects to find them at
/usr/include/rrd.h and /usr/lib/librrd.a. If you installed them in
different locations then you need to add the following configure
flags
% ./configure CFLAGS="-I/rrd/header/path" CPPFLAGS="-I/rrd/header/path" \
LDFLAGS="-L/rrd/library/path" --with-gmetad
of course, you need to substitute "/rrd/header/path" and
"/rrd/library/path" with the real location of the rrd tool header
file and library respectively.
AIX should not be compiled with shared libraries
You must add the "--disable-shared" and "--enable-static" configure
flags if you are running on AIX
% ./configure --disable-shared --enable-static
GEXEC confusion
GEXEC is a scalable cluster remote execution system which provides
fast, RSA authenticated remote execution of parallel and distributed
jobs. It provides transparent forwarding of stdin, stdout, stderr,
and signals to and from remote processes, provides local environment
propagation, and is designed to be robust and to scale to systems
over 1000 nodes. Internally, GEXEC operates by building an n-ary
tree of TCP sockets and threads between gexec daemons and
propagating control information up and down the tree. By using
hierarchical control, GEXEC distributes both the work and resource
usage associated with massive amounts of parallelism across multiple
nodes, thereby eliminating problems associated with single node
resource limits (e.g., limits on the number of file descriptors on
front-end nodes). (from http://www.theether.org/gexec )
"gexec" is a great cluster execution tool but integrating it with
ganglia is very clumsy to say the least. GEXEC can run standalone
without access to a ganglia "gmond". In standalone mode gexec will
use the hosts listed in your GEXEC_SVRS variable to run on. For
example, say I want to run "hostname" on three machines in my
cluster: "host1", "host2" and "host3". I use the following command
line.
% GEXEC_SVRS="host1 host2 host3" gexec -n 3 hostname
and gexec would build an n-ary tree (binary tree by default) of TCP
sockets to those machines and run the command "hostname".
As an added feature, you can have "gexec" pull a host list from a
locally running gmond and use that as the host list instead of
GEXEC_SVRS. The list is load balanced and "gexec" will start the job
on the *n* least-loaded machines.
For example..
% gexec -n 5 hostname
will run the command "hostname" on the five least-loaded machines in
a cluster.
To turn on the "gexec" feature in ganglia you must configure ganglia
with the "--enable-gexec" flag
% ./configure --enable-gexec
Enabling "gexec" means that by default any host running gmond will
send a special multicast message announcing that gexec is installed
on it and open for requests.
Now the question is, what if I don't want gexec to run on every host
in my cluster? For example, you may not want to have "gexec" run
jobs on your cluster frontend nodes.
You simply add the following line to your "gmond" configuration file
("/etc/gmond.conf" by default)
no_gexec on
Simple huh? I know the configuration file option, "no_gexec", seems
crazy (and it is). Why have an option that says "yes to no gexec"?
The early versions of gmond didn't use a configuration file but
instead commandline options. One of the commandline options was
simply "--no-gexec" and the default was to announce gexec as on.
Once you have successfully run
% ./configure
% make
% make install
you should find the following files installed in "/usr" (by default).
/usr/bin/gstat
/usr/bin/gmetric
/usr/sbin/gmond
/usr/sbin/gmetad
If you installed ganglia using RPMs then these files will be installed
when you install the RPM. The RPM is installed simply by running
% rpm -Uvh ganglia-monitor-core-2.5.7-1.i386.rpm
Once you have the necessary binaries installed, you can test your
installation by running
% ./gmond
This will start the ganglia monitoring daemon. You should then be able
to run
% telnet localhost 8649
And get an XML description of the state of your machine (and any other
hosts running gmond at the time).
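Beyond eyeballing the stream in telnet, the XML can be post-processed
with standard Unix tools. For example, extracting just the host names
from a saved dump (a sketch; the sample below is a hypothetical,
heavily simplified fragment of what gmond actually emits):

```shell
# Hypothetical, simplified sample of a gmond XML dump (real output
# carries many more attributes and METRIC elements).
cat > /tmp/sample.xml <<'EOF'
<GANGLIA_XML VERSION="2.5.7" SOURCE="gmond">
<CLUSTER NAME="My Cluster" OWNER="unspecified">
<HOST NAME="host1" IP="10.0.0.1"/>
<HOST NAME="host2" IP="10.0.0.2"/>
</CLUSTER>
</GANGLIA_XML>
EOF

# Pull out just the host names from the dump.
grep -o 'HOST NAME="[^"]*"' /tmp/sample.xml | sed 's/HOST NAME="//;s/"$//'
```

The same pipeline works on a dump captured from a live gmond with
telnet or netcat.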
If you are installing by source on Linux, scripts are provided to start
"gmetad" and "gmond" at system startup. They are easy to install from
the source root.
% cp ./gmond/gmond.init /etc/rc.d/init.d/gmond
% chkconfig --add gmond
% chkconfig --list gmond
gmond 0:off 1:off 2:on 3:on 4:on 5:on 6:off
% /etc/rc.d/init.d/gmond start
Starting GANGLIA gmond: [ OK ]
Repeat this step with gmetad.
PHP Web Frontend Installation
1. Unzip the webfrontend distribution in your website tree. This is
often under the directory "/var/www/html", however look for the
variable "DocumentRoot" in your Apache configuration files to be
sure. All the PHP script files use relative URLs in their links, so
you may place the "ganglia/" directory anywhere convenient. I like
to unzip *tar.gz files with one tar command:
% cd /var/www/html
% tar xvzf gmetad-webfrontend-2.5.7.tar.gz
2. Ensure your webserver understands how to process PHP script files.
Currently, the web frontend contains certain php language that
requires PHP version 4 or greater. Processing PHP script files
usually requires a webserver module, such as the "mod_php" for the
popular Apache webserver. In RedHat Linux, the RPM package that
provides this module is called simply "php".
For Apache, the "mod_php" module must be enabled. The following lines
should appear somewhere in Apache's *conf files. This example
applies to RedHat and Mandrake Linux. The actual filenames may vary
on your system. If you installed the php module using an RPM
package, this work will have been done automatically.
<IfDefine HAVE_PHP4>
LoadModule php4_module extramodules/libphp4.so
AddModule mod_php4.c
</IfDefine>
AddType application/x-httpd-php .php .php4 .php3 .phtml
AddType application/x-httpd-php-source .phps
3. The webfrontend requires the existence of the gmetad package on the
webserver. Follow the installation instructions on the gmetad page.
Specifically, the webfrontend requires rrdtool and the "rrds/"
directory from gmetad. If you are a power user, you may use NFS to
simulate the local existence of the rrds.
4. Test your installation. Visit the URL
http://localhost/ganglia/
with a web browser, where localhost is the address of your
webserver.
Installation of the web frontend is simplified on Linux by using rpm.
% rpm -Uvh gmetad-webfrontend-2.5.7-1.i386.rpm
Preparing... ########################################### [100%]
1:gmetad-webfrontend ########################################### [100%]
Configuration
Gmond Configuration
While the default options for gmond will work for most clusters, gmond
is very flexible and can be customized with the configuration file
"/etc/gmond.conf".
"/etc/gmond.conf" is not required; its absence will only cause gmond
to start in a default configuration. Here is a sample gmond.conf
configuration file with comments to help you configure gmond
# This is the configuration file for the Ganglia Monitor Daemon (gmond)
# Documentation can be found at http://ganglia.sourceforge.net/docs/
#
# To change a value from its default simply uncomment the line
# and alter the value
#####################
#
# The name of the cluster this node is a part of
# default: "unspecified"
# name "My Cluster"
#
# The owner of this cluster. Represents an administrative
# domain. The pair name/owner should be unique for all clusters
# in the world.
# default: "unspecified"
# owner "My Organization"
#
# The latitude and longitude GPS coordinates of this cluster on earth.
# Specified to 1 mile accuracy with two decimal places per axis in Decimal
# DMS format: "N61.18 W130.50".
# default: "unspecified"
# latlong "N32.87 W117.22"
#
# The URL for more information on the Cluster. Intended to give purpose,
# owner, administration, and account details for this cluster.
# default: "unspecified"
# url "http://www.mycluster.edu/"
#
# The location of this host in the cluster. Given as a 3D coordinate:
# "Rack,Rank,Plane" that corresponds to a Euclidean coordinate "x,y,z".
# default: "unspecified"
# location "0,0,0"
#
# The multicast channel for gmond to send/receive data on
# default: 239.2.11.71
# mcast_channel 239.2.11.71
#
# The multicast port for gmond to send/receive data on
# default: 8649
# mcast_port 8649
#
# The multicast interface for gmond to send/receive data on
# default: the kernel decides based on routing configuration
# mcast_if eth1
#
# The multicast Time-To-Live (TTL) for outgoing messages
# default: 1
# mcast_ttl 1
#
# The number of threads listening to multicast traffic
# default: 2
# mcast_threads 2
#
# Which port should gmond listen for XML requests on
# default: 8649
# xml_port 8649
#
# The number of threads answering XML requests
# default: 2
# xml_threads 2
#
# Hosts ASIDE from "127.0.0.1"/localhost and those multicasting
# on the same multicast channel which you will share your XML
# data with. Multiple hosts are allowed on multiple lines.
# Can be specified with either hostnames or IP addresses.
# default: none
# trusted_hosts 1.1.1.1 1.1.1.2 1.1.1.3 \
# 2.3.2.3 3.4.3.4 5.6.5.6
#
# The number of nodes in your cluster. This value is used in the
# creation of the cluster hash.
# default: 1024
# num_nodes 1024
#
# The number of custom metrics this gmond will be storing. This
# value is used in the creation of the host custom_metrics hash.
# default: 16
# num_custom_metrics 16
#
# Run gmond in "mute" mode. Gmond will only listen to the multicast
# channel but will not send any data on the channel.
# default: off
# mute on
#
# Run gmond in "deaf" mode. Gmond will only send data on the multicast
# channel but will not listen/store any data from the channel.
# default: off
# deaf on
#
# Run gmond in "debug" mode. Gmond will not background. Debug messages
# are sent to stdout. Value from 0-100. The higher the number the more
# detailed debugging information will be sent.
# default: 0
# debug_level 10
#
# If you don't want gmond to setuid, set this to "on"
# default: off
# no_setuid on
#
# Which user should gmond run as?
# default: nobody
# setuid nobody
#
# If you do not want this host to appear in the gexec host list, set
# this value to "on"
# default: off
# no_gexec on
#
# If you want any host which connects to the gmond XML to receive
# data, then set this value to "on"
# default: off
# all_trusted on
If you want to customize the operation of gmond, simply edit this file
and save it to "/etc/gmond.conf". You can create multiple gmond
configurations by writing the configuration file to a different file,
say "/etc/gmond_test.conf", and then using the "--conf" option of gmond
to specify which configuration file to use.
% ./gmond --conf=/etc/gmond_test.conf
would start gmond with the settings in /etc/gmond_test.conf
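For example, a minimal customized configuration might set only the
cluster identity, leaving everything else at its default (the values
here are placeholders, and the option names are taken from the sample
file above):

```
# /etc/gmond_test.conf -- hypothetical example values
name   "Test Cluster"
owner  "My Organization"
url    "http://www.mycluster.edu/"
```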
Gmetad Configuration
The behavior of the Ganglia Meta Daemon is completely controlled by a
single configuration file which is by default "/etc/gmetad.conf". For
gmetad to do anything useful you must specify at least one "data_source"
in the configuration. The format of the data_source line is as follows
data_source "Cluster A" 127.0.0.1 1.2.3.4:8655 1.2.3.5:8625
data_source "Cluster B" 1.2.4.4:8655
In this example, there are two unique data sources: "Cluster A" and
"Cluster B". The Cluster A data source has three redundant sources. If
gmetad cannot pull the data from the first source, it will continue
trying the other sources in order.
If you do not specify a port number, gmetad will assume the default
ganglia port which is 8649 (U*N*I*X on a phone key pad)
Here is a sample gmetad configuration file with comments
# This is an example of a Ganglia Meta Daemon configuration file
# http://ganglia.sourceforge.net/
#
#-------------------------------------------------------------------------------
# Setting the debug_level to 1 will keep the daemon in the foreground and
# show only error messages. Setting this value higher than 1 will make
# gmetad output debugging information and stay in the foreground.
# default: 0
# debug_level 10
#
#-------------------------------------------------------------------------------
# What to monitor. The most important section of this file.
#
# The data_source tag specifies either a cluster or a grid to
# monitor. If we detect the source is a cluster, we will maintain a complete
# set of RRD databases for it, which can be used to create historical
# graphs of the metrics. If the source is a grid (it comes from another gmetad),
# we will only maintain summary RRDs for it.
#
# Format:
# data_source "my cluster" [polling interval] address1:port address2:port ...
#
# The keyword 'data_source' must immediately be followed by a unique
# string which identifies the source, then an optional polling interval in
# seconds. The source will be polled at this interval on average.
# If the polling interval is omitted, 15sec is assumed.
#
# A list of machines which service the data source follows, in the
# format ip:port, or name:port. If a port is not specified then 8649
# (the default gmond port) is assumed.
# default: There is no default value
#
# data_source "my cluster" 10 localhost my.machine.edu:8649 1.2.3.5:8655
# data_source "my grid" 50 1.3.4.7:8655 grid.org:8651 grid-backup.org:8651
# data_source "another source" 1.3.4.7:8655 1.3.4.8
data_source "my cluster" localhost
#
#-------------------------------------------------------------------------------
# Scalability mode. If on, we summarize over downstream grids, and respect
# authority tags. If off, we take on 2.5.0-era behavior: we do not wrap our output
# in <GRID></GRID> tags, we ignore all <GRID> tags we see, and always assume
# we are the "authority" on data source feeds. This approach does not scale to
# large groups of clusters, but is provided for backwards compatibility.
# default: on
# scalable off
#
#-------------------------------------------------------------------------------
# The name of this Grid. All the data sources above will be wrapped in a GRID
# tag with this name.
# default: Unspecified
# gridname "MyGrid"
#
#-------------------------------------------------------------------------------
# The authority URL for this grid. Used by other gmetads to locate graphs
# for our data sources. Generally points to a ganglia/
# website on this machine.
# default: "http://hostname/ganglia/",
# where hostname is the name of this machine, as defined by gethostname().
# authority "http://mycluster.org/newprefix/"
#
#-------------------------------------------------------------------------------
# List of machines this gmetad will share XML with. Localhost
# is always trusted.
# default: There is no default value
# trusted_hosts 127.0.0.1 169.229.50.165 my.gmetad.org
#
#-------------------------------------------------------------------------------
# If you want any host which connects to the gmetad XML to receive
# data, then set this value to "on"
# default: off
# all_trusted on
#
#-------------------------------------------------------------------------------
# If you don't want gmetad to setuid then set this to off
# default: on
# setuid off
#
#-------------------------------------------------------------------------------
# User gmetad will setuid to (defaults to "nobody")
# default: "nobody"
# setuid_username "nobody"
#
#-------------------------------------------------------------------------------
# The port gmetad will answer requests for XML
# default: 8651
# xml_port 8651
#
#-------------------------------------------------------------------------------
# The port gmetad will answer queries for XML. This facility allows
# simple subtree and summation views of the XML tree.
# default: 8652
# interactive_port 8652
#
#-------------------------------------------------------------------------------
# The number of threads answering XML requests
# default: 4
# server_threads 10
#
#-------------------------------------------------------------------------------
# Where gmetad stores its round-robin databases
# default: "/var/lib/ganglia/rrds"
# rrd_rootdir "/some/other/place"
"gmetad" has a "--conf" option to allow you to specify alternate
configuration files
% ./gmetad --conf=/tmp/my_custom_config.conf
PHP Web Frontend Configuration
Most configuration parameters reside in the "ganglia/conf.php" file.
Here you may alter the template, gmetad location, RRDtool location, and
set the default time range and metrics for graphs.
The static portions of the Ganglia website are themable. This means you
can alter elements such as section labels, some links, and images to
suit your individual tastes and environment. The "template_name"
variable names a directory containing the current theme. Ganglia uses
TemplatePower to implement themes. A user-defined skin must conform to
the template interface as defined by the default theme. Essentially, the
variable names and START/END blocks in a custom theme must remain the
same as the default, but all other HTML elements may be changed.
Other configuration variables in "conf.php" specify the location of
gmetad's files, and where to find the rrdtool program. These locations
need only be changed if you do not run gmetad on the webserver.
Otherwise the default locations should work fine. The "default_range"
variable specifies what range of time to show on the graphs by default,
with possible values of hour, day, week, month, year. The
"default_metric" parameter specifies which metric to show on the cluster
view page by default.
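A sketch of what the relevant entries in "ganglia/conf.php" might look
like (the variable names follow the text above, but the exact syntax
and defaults in your distribution may differ):

```php
<?php
# Hypothetical excerpt from ganglia/conf.php
$template_name  = "default";     # directory containing the current theme
$default_range  = "day";         # hour, day, week, month, or year
$default_metric = "load_one";    # metric shown on the cluster view page
?>
```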
Commandline Tools
There are two commandline tools that work with "gmond" to add custom
metrics and query the current state of a cluster: "gmetric" and "gstat"
respectively.
Gmetric
The Ganglia Metric Tool (gmetric) allows you to easily monitor any
arbitrary host metrics you like, expanding on the core metrics that
gmond measures by default.
If you want help with the gmetric syntax, simply use the "--help"
commandline option
% gmetric --help
gmetric 2.5.7
Purpose:
The Ganglia Metric Client (gmetric) announces a metric
value to all Ganglia Monitoring Daemons (gmonds) that are listening
on the cluster multicast channel.
Usage: gmetric [OPTIONS]...
-h --help Print help and exit
-V --version Print version and exit
-nSTRING --name=STRING Name of the metric
-vSTRING --value=STRING Value of the metric
-tSTRING --type=STRING Either string|int8|uint8|int16|uint16|int32|uint32|float|double
-uSTRING --units=STRING Unit of measure for the value e.g. Kilobytes, Celcius
-sSTRING --slope=STRING Either zero|positive|negative|both (default='both')
-xINT --tmax=INT The maximum time in seconds between gmetric calls (default=60)
-dINT --dmax=INT The lifetime in seconds of this metric (default=0)
-cSTRING --mcast_channel=STRING Multicast channel to send/receive on (default='239.2.11.71')
-pINT --mcast_port=INT Multicast port to send/receive on (default=8649)
-iSTRING --mcast_if=STRING Network interface to multicast on e.g. 'eth1' (default='kernel decides')
-lINT --mcast_ttl=INT Multicast Time-To-Live (TTL) (default=1)
The gmetric tool formats a special multicast message and sends it to all
gmonds that are listening.
All metrics in ganglia have a name, value, type and optionally units.
For example, say I wanted to measure the temperature of my CPU
(something gmond doesn't do by default) then I could multicast this
metric with name="temperature", value="63", type="int16" and
units="Celcius".
Assume I have a program called "cputemp" which prints the CPU
temperature as text
% cputemp
63
I could easily send this data to all listening gmonds by running
% gmetric --name temperature --value `cputemp` --type int16 --units Celcius
Check the exit value of gmetric to see if it successfully sent the data:
0 on success and -1 on failure.
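A small wrapper script can act on that exit value. In this sketch the
two external programs are stubbed out with shell functions so the
example is self-contained; replace the stubs with the real cputemp and
gmetric on a monitored host:

```shell
# Stubs standing in for the real programs, so this sketch runs anywhere.
cputemp() { echo 63; }   # stand-in for the real sensor reader
gmetric() { return 0; }  # stand-in for the real gmetric client

TEMP=$(cputemp) || exit 1
if gmetric --name temperature --value "$TEMP" --type int16 --units Celcius; then
    echo "metric sent: $TEMP"
else
    echo "gmetric failed" >&2
fi
```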
To constantly sample this temperature metric, you just need to add this
command to your cron table.
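A crontab entry along these lines would sample the metric every five
minutes (the path to cputemp is hypothetical, and the */5 step syntax
assumes a Vixie-style cron):

```
# m  h  dom mon dow  command
*/5  *  *   *   *    gmetric --name temperature --value "`/usr/local/bin/cputemp`" --type int16 --units Celcius
```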
Gstat
The Ganglia Cluster Status Tool (gstat) is a commandline utility that
allows you to get a status report for your cluster.
To get help with the commandline options, simply pass "gstat" the
"--help" option
% gstat --help
gstat 2.5.7
Purpose:
The Ganglia Status Client (gstat) connects with a
Ganglia Monitoring Daemon (gmond) and output a load-balanced list
of cluster hosts
Usage: gstat [OPTIONS]...
-h --help Print help and exit
-V --version Print version and exit
-a --all List all hosts. Not just hosts running gexec (default=off)
-d --dead Print only the hosts which are dead (default=off)
-m --mpifile Print a load-balanced mpifile (default=off)
-1 --single_line Print host and information all on one line (default=off)
-l --list Print ONLY the host list (default=off)
-iSTRING --gmond_ip=STRING Specify the ip address of the gmond to query (default='127.0.0.1')
-pINT --gmond_port=INT Specify the gmond port to query (default=8649)
Frequently Asked Questions (FAQ)
What metrics does ganglia collect on platform x?
This table describes all the metrics that ganglia collects and shows
which platforms each metric is supported on. (The following table is
not yet complete.)
Metric Name Description Platforms
-----------------------------------------------------------------------
boottime System boot timestamp l,f
bread_sec
bwrite_sec
bytes_in Number of bytes in per second l,f
bytes_out Number of bytes out per second l,f
cpu_aidle Percent of time since boot idle CPU l
cpu_arm
cpu_avm
cpu_idle Percent CPU idle l,f
cpu_intr
cpu_nice Percent CPU nice l,f
cpu_num Number of CPUs l,f
cpu_rm
cpu_speed Speed in MHz of CPU l,f
cpu_ssys
cpu_system Percent CPU system l,f
cpu_user Percent CPU user l,f
cpu_vm
cpu_wait
cpu_wio
disk_free Total free disk space l,f
disk_total Total available disk space l,f
load_fifteen Fifteen minute load average l,f
load_five Five minute load average l,f
load_one One minute load average l,f
location GPS coordinates for host e
lread_sec
lwrite_sec
machine_type
mem_buffers Amount of buffered memory l,f
mem_cached Amount of cached memory l,f
mem_free Amount of available memory l,f
mem_shared Amount of shared memory l,f
mem_total Amount of available memory l,f
mtu Network maximum transmission unit l,f
os_name Operating system name l,f
os_release Operating system release (version) l,f
part_max_used Maximum percent used for all partitions l,f
phread_sec
phwrite_sec
pkts_in Packets in per second l,f
pkts_out Packets out per second l,f
proc_run Total number of running processes l,f
proc_total Total number of processes l,f
rcache
swap_free Amount of available swap memory l,f
swap_total Total amount of swap memory l,f
sys_clock Current time on host l,f
wcache
Platform key:
l = Linux, f = FreeBSD, a = AIX, c = Cygwin
m = MacOS, i = IRIX, h = HPUX, t = Tru64
e = Every Platform
If you are interested in how the metrics are collected, just take a
look in directory "./gmond/machines" in the source distribution.
There is a single source file in the directory for each platform
that is supported.
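Each metric in the table is reported in gmond's XML stream as a METRIC
element. The sample below is hand-written (attributes abbreviated, values
invented) to sketch the shape of that output and how one value can be
pulled out with sed:

```shell
# Hand-written sample of gmond-style METRIC elements; not a real capture.
cat > sample.xml <<'EOF'
<HOST NAME="n1" IP="10.0.0.1">
<METRIC NAME="load_one" VAL="0.25" TYPE="float" UNITS=""/>
<METRIC NAME="cpu_num" VAL="2" TYPE="uint16" UNITS="CPUs"/>
</HOST>
EOF
# Extract the VAL attribute of the load_one metric; prints 0.25
sed -n 's/.*NAME="load_one" VAL="\([^"]*\)".*/\1/p' sample.xml
```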
What does the error "Process XML (x): XML_ParseBuffer() error at line x:
not well-formed" mean?
This error occurs when one ganglia component reads data
from another ganglia component and finds that the XML is not
well-formed. It most commonly appears when the PHP
web frontend tries to read the XML stream from gmetad.
To troubleshoot this problem, capture the XML stream from the ganglia
component in question (gmetad or gmond). This is easy to do if you have
telnet installed. Simply log in to the machine running the component
and run:
% telnet localhost 8651
By default, gmetad exports its XML on port 8651 and gmond exports
its XML on port 8649. Modify the port number above to suit your
configuration.
When you connect to the port you should get an XML stream. If not,
look in the process table on the machine to ensure that the
component is actually running.
Once you are getting an XML stream, capture it to a file by running:
% telnet localhost 8651 > XML.txt
Connection closed by foreign host.
If you open the file "XML.txt", you will see the captured XML
stream. You will need to remove the first three lines of
"XML.txt", which will read...
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
Those lines are output by "telnet", not by the ganglia component
(I wish telnet would send those messages to "stderr", but they are
sent to "stdout").
There are many ways that XML can be malformed. A great tool for
validating XML is "xmllint", which will read the file and find
the line containing the error.
% xmllint --valid XML.txt
will read your captured XML stream, validate it against the ganglia
DTD and check that it is well-formed XML. "xmllint" will print the
entire XML stream if there are no errors. If there are errors they
will be reported with line numbers. For example...
/tmp/XML.txt:3393: error: Opening and ending tag mismatch: HOST and CLUSTER
</CLUSTER>
^
/tmp/XML.txt:3394: error: Opening and ending tag mismatch: CLUSTER and GANGLIA_XML
</GANGLIA_XML>
^
/tmp/XML.txt:3395: error: Premature end of data in tag GANGLIA_XML
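To see this class of error in action, here is a sketch that feeds
xmllint a deliberately malformed sample, much like the mismatch shown
above. The --noout flag suppresses echoing the document, and the check is
guarded in case xmllint is not installed:

```shell
# Deliberately malformed sample: HOST is never closed before CLUSTER ends.
cat > bad.xml <<'EOF'
<GANGLIA_XML>
<CLUSTER NAME="test">
<HOST NAME="n1">
</CLUSTER>
</GANGLIA_XML>
EOF
if command -v xmllint >/dev/null 2>&1; then
    xmllint --noout bad.xml 2> errors.txt || true
    cat errors.txt   # typically an "Opening and ending tag mismatch" with line numbers
else
    echo "xmllint not installed" > errors.txt
fi
```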
If you get errors, open "XML.txt" and go to the line numbers in
question. See if you can understand based on your configuration how
these errors could occur. If you cannot fix the problem yourself,
please email your "XML.txt" and output from "xmllint" to
"ganglia-developers@lists.sourceforge.net". Please include
information about the version of each component in question along
with the operating system they are running on. The more details we
have about your configuration the more likely it is we will be able
to help you. Also, all mail to "ganglia-developers" is archived
and available to read on the web. You may want to modify "XML.txt"
to remove any sensitive information.
How do I remove a host from the list?
A common problem that people have is not being able to remove a host
from the ganglia web frontend.
Here is a common scenario:
1. All hosts in a cluster are multicasting on the ganglia channel.
2. One of the hosts fails or is moved for whatever reason.
3. All the hosts in the cluster report that the host is "dead" or
"expired".
4. The sysadmin wants to remove this host from the "dead" list.
Unfortunately, there is currently no nice way to remove a single dead
host from the list. All data in gmond is soft state, so you will need
to restart all gmond and gmetad processes. It is important to note
that ALL dead hosts will be flushed from the record by restarting
the processes (since a restarted process must hear from a host at
least once before it can report it as expired).
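As a sketch, the restart might look like the following on a Red Hat-style
system. Init-script locations and service names vary by distribution, and
you would repeat this on every node (for example via ssh), so treat these
paths as assumptions:

```shell
# Hypothetical restart of the ganglia daemons on one host; the init-script
# paths are assumptions and are skipped if absent.
for svc in gmond gmetad; do
    if [ -x "/etc/init.d/$svc" ]; then
        "/etc/init.d/$svc" restart || true
        echo "$svc restart attempted" >> restart_log.txt
    else
        echo "$svc init script not found" >> restart_log.txt
    fi
done
cat restart_log.txt
```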
How good is Solaris, IRIX, Tru64 support?
Here is an email from Steve Wagner about the state of ganglia on
Solaris, IRIX and Tru64. Steve is to thank for porting ganglia to
Solaris and Tru64. He also helped with the IRIX port.
State of the IRIX port:
* CPU percentage stuff hasn't improved despite my efforts. I fear there
may be a flaw in the way I'm summing counters for all the CPUs.
* Auto-detection of network interfaces apparently segfaults.
* Memory and load reporting appear to be running properly.
* CPU speed is not being reported properly on multi-proc machines.
* Total/running processes are not reported.
* gmetad untested.
* Monitoring core apparently stable in foreground, background being tested
(had a segfault earlier).
State of the Tru64 port:
* CPU percentage stuff here works perfectly.
* Memory and swap usage stats are suspected to be inaccurate.
* Total/running processes are not reported.
* gmetad untested.
* Monitoring core apparently stable in foreground and background.
State of the Solaris port:
* CPU percentages are slightly off, but correct enough for trending
purposes.
* Load, ncpus, CPU speed, breads/writes, lreads/writes, phreads/writes,
and rcache/wcache are all accurate.
* Memory/swap statistics are suspiciously flat, but local stats bear
this out (and they *are* being updated) so I haven't investigated
further.
* Total processes are counted, but not running ones.
* gmetad appears stable
Anyway, all three ports I've been messing with are usable and fairly
stable. Although there are areas for improvement I think we really can't
keep hogging all this good stuff - what I'm looking at is ready for
release.
Where are the debian packages?
Here is an email message from Preston Smith for Debian users
Debian packages for Debian 3.0 (woody) are available at
http://www.physics.purdue.edu/~psmith/ganglia
(i386, sparc, and powerpc are there presently, more architectures will
appear when I get them built.)
Packages for "unstable" (sid) will be available in the main Debian
archive soon.
Also, a CVS note: I checked in the debian/ directory used to create
debian packages.
How should I configure multihomed machines?
Here is an email that Matt Massie sent to a user having problems
with multihomed machines
i need to add a section in the documentation talking about this since it
seems to be a common question.
when you use...
mcast_if eth1
.. in /etc/gmond.conf that tells gmond to send its data out the "eth1"
network interface but that doesn't necessarily mean that the source
address of the packets will match the "eth1" interface. to make sure that
data sent out eth1 has the correct source address run the following...
% route add -host 239.2.11.71 dev eth1
... before starting gmond. that should do the trick for you.
-matt
> I have seen some post related to some issues
> with gmond + multicast running on a dual nic
> frontend.
>
> Currently I am experiencing a weird behavior
>
> I have the following setup:
>
> -----------------------
> | web server + gmetad |
> -----------------------
> |
> |
> |
> ----------------------
> | eth0 A.B.C.112 |
> | |
> | Frontend + gmond |
> | |
> | eth1 192.168.100.1 |
> ----------------------
> |
> |
>
> 26 nodes each
> gmond
>
> In the frontend /etc/gmond.conf I have the
> following statement: mcast_if eth1
>
> The 26 nodes are correctly reported.
>
> However the Frontend is never reported.
>
> I am running iptables on the Frontend, and I am seeing
> things like:
>
> INPUT packet died: IN=eth1 OUT= MAC= SRC=A.B.C.112 DST=239.2.11.71
> LEN=36 TOS=0x00 PREC=0x00 TTL=1 ID=53740 DF PROTO=UDP SPT=41608 DPT=8649
> LEN=16
>
> I would have expected the source to be 192.168.100.1 with mcast_if eth1
>
> Any idea ?
How should I configure my Cisco Catalyst Switches?
Perhaps information regarding gmond on networks set up through Cisco
Catalyst switches should be mentioned in the ganglia documentation.
I think by default multicast traffic on the Catalyst will flood all
devices unless configured properly. Here is a relevant snippet from a
message forum, with a link to a Cisco document.
If what you are trying to do, is minimizing the impact on your
network due to a multicast application, this link may describe what
you want to do: http://www.cisco.com/warp/public/473/38.html
We set up our switches according to this after a consultant came in
and installed an application multicasting several hundred packets
per second. This made the network functional again.
Getting Support
The tired and thirsty prospector threw himself down at the edge of the
watering hole and started to drink. But then he looked around and saw
skulls and bones everywhere. "Uh-oh," he thought. "This watering hole
is reserved for skeletons." --Jack Handey
There are three mailing lists available to you: "ganglia-general",
"ganglia-developers" and "ganglia-announce". You can join these lists or
read their archives by visiting
https://sourceforge.net/mail/?group_id=43021
All of the ganglia mailing lists are closed: in order to post to a
list, you must be subscribed to it. We're sorry for the
inconvenience; however, it is very easy to subscribe to and
unsubscribe from the lists. We had to close the mailing lists
because of SPAM problems.
When you need help please follow these steps until your problem is
resolved.
1. completely read the documentation
2. check the "ganglia-general" archive to see if other people have had
the same problem
3. post your support request to the "ganglia-general" mailing list
4. check the "ganglia-developers" archive
5. post your question to the "ganglia-developers" list
Please send all bugs, patches, and feature requests to the
"ganglia-developers" list after you have checked the
"ganglia-developers" archive to see if the question has already been
asked and answered.
Copyright
Copyright (C) 2002,2003 University of California, Berkeley
The ganglia source tree incorporates source code from several other
projects as well.
Copyright (c) 2000 Dug Song <dugsong@monkey.org>
Copyright (C) 1999,2000,2001,2002 Lukas Schroeder <lukas@azzit.de>,
and others.
Copyright (C) 1991, 1992, 1996, 1998, 1999 Free Software Foundation, Inc.
Copyright (C) 2000 David Helder
Copyright (C) 2000 Andrew Lanoix
Authors
Matt Massie <massie@CS.Berkeley.EDU>
and the Ganglia Development Team...
Bas van der Vlies basv Developer basv at users.sourceforge.net
Neil T. Spring bluehal Developer bluehal at users.sourceforge.net
Brooks Davis brooks_en_davis Developer brooks_en_davis at users.sourceforge.net
Eric Fraser fraze Developer fraze at users.sourceforge.net
greg bruno gregbruno Developer gregbruno at users.sourceforge.net
Jeff Layton laytonjb Developer laytonjb at users.sourceforge.net
Doc Schneider maddocbuddha Developer maddocbuddha at users.sourceforge.net
Mason Katz masonkatz Developer masonkatz at users.sourceforge.net
Mike Howard mhoward Developer mhoward at users.sourceforge.net
Oliver Mössinger olivpass Developer olivpass at users.sourceforge.net
Preston Smith pmsmith Developer pmsmith at users.sourceforge.net
Federico David Sacerdoti sacerdoti Developer sacerdoti at users.sourceforge.net
Tim Cera timcera Developer timcera at users.sourceforge.net
Mathew Benson wintermute11 Developer wintermute11 at users.sourceforge.net
Contributors
There have been dozens of contributors who have provided patches and
helpful bug reports. We need to list them here later.