File: debug.adoc

package info (click to toggle)
ntpsec 1.2.0%2Bdfsg1-4
links: PTS, VCS
area: main
in suites: bullseye
size: 10,044 kB
sloc: ansic: 60,737; python: 31,610; sh: 1,494; yacc: 1,291; makefile: 176; javascript: 138
file content (284 lines) | stat: -rw-r--r-- 14,425 bytes
parent folder | download | duplicates (3)
= NTP Debugging Techniques
include::include-html.ad[]

[cols="10%,90%",frame="none",grid="none",style="verse"]
|==============================
|image:pic/pogo.gif[]|
{millshome}pictures.html[from 'Pogo', Walt Kelly]

We make house calls and bring our own bugs.

|==============================

== More Help

include::includes/install.adoc[]

'''''

== Initial Startup

This page discusses +ntpd+ program monitoring and debugging techniques
using the link:ntpq.html[+ntpq+ - standard NTP query program], either
on the local server or from a remote machine. The +ntpq+ program
implements the management functions specified in the NTP specification
https://tools.ietf.org/html/rfc5905[RFC 5905].  In addition, the
program can be used to send remote configuration commands to the
server.

The +ntpd+ daemon can operate in two modes, depending on the presence
of the +-n+ command-line option. Without the option the daemon
detaches from the controlling terminal and proceeds autonomously. With
one or more +-d+ options the daemon generates special trace output
useful for debugging. In general, interpretation of this output
requires reference to the sources. However, a single +-d+ does produce
only mildly cryptic output and can be very useful in finding problems
with configuration and network troubles.

Any Firewall needs to allow NTP traffic.
The server side of +ntpd+ listens on UDP port 123.  The client side
also sends from port 123 but not all implementations do that
and +ntpq+ sends from system assigned ports.
If you are running an link:authentic.html#nts[NTS] server, you also
need to allow TCP port 4460.

Other problems are apparent in the system log, which ordinarily shows
the startup banner, some cryptic initialization data and the computed
precision value. Event messages at startup and during regular operation
are sent to the optional +protostats+ monitor file, as described on the
link:decode.html[Event Messages and Status Words] page. These and other
error messages are sent to the system log, as described on the
link:msyslog.html[+ntpd+ System Log Messages] page. In real emergencies
the daemon will send a terminal error message to the system log and then
cease operation.

The next most common problem is incorrect DNS names. Check that each DNS
name used in the configuration file exists and that the address responds
to the Unix +ping+ command. The Unix +traceroute+
utility can be used to verify a partial or complete path exists. Most
problems reported to the NTP newsgroup are not NTP problems, but
problems with the network or firewall configuration.

If you use GPS, and your time is off by 19 years, you may have been
bitten by the GPS 1024 week number rollover bug - WNRO.
Please see link:rollover.html[Rollover issues in time sources]

== Verifying Correct Operation

Unless using the +iburst+ option, the client normally takes a few
minutes to synchronize to a server. If the client time at startup
happens to be more than 1000 s distant from NTP time, the daemon exits
with a message to the system log directing the operator to manually set
the time within 1000 s and restart. If the time is less than 1000 s but
more than 128 s distant, a step correction occurs and the daemon
restarts automatically.

When started for the first time and a frequency file - usually ntp.drift -
is not present, the daemon enters a special mode in order to calibrate the
frequency. This takes 900 ms during which the time is not disciplined. When
calibration is complete, the daemon creates the frequency file and enters
normal mode to amortize whatever residual offset remains.

The +ntpq+ commands +pe+, +as+ and +rv+ are normally sufficient to
verify correct operation and assess nominal performance. The
link:ntpq.html#pe[+pe+] command displays a list showing the DNS name or
IP address for each association along with selected status and
statistics variables. The first character in each line is the tally
code, which shows which associations are candidates to set the system
clock and of these which one is the system peer. The encoding is shown
in the +select+ field of the link:decode.html#peer[peer status word].

The link:ntpq.html#as[+as+] command displays a list of associations and
association identifiers. Note the +condition+ column, which reflects the
tally code. The link:ntpq.html#rv[+rv+] command displays the
link:ntpq.html#system[system variables] billboard, including the
link:decode.html#sys[system status word]. The
link:ntpq.html#rv[+rv assocID+] command, where +assocID+ is the
association ID, displays the link:ntpq.html#peer[peer variables]
billboard, including the link:decode.html#peer[peer status word]. Note
that, except for explicit calendar dates, times are in milliseconds and
frequencies are in parts-per-million (ppm).

A detailed explanation of the system, peer and clock variables in the
billboards is beyond the scope of this page; however, a comprehensive
explanation for each one is in the NTPv4 protocol specification. The
following observations will be useful in debugging and monitoring.

1.  The server has successfully synchronized to its sources if the
+leap+ peer variable has value other than 3 (11b) The client has
successfully synchronized to the server when the +leap+ system variable
has value other than 3.
2.  The +reach+ peer variable is an 8-bit shift register displayed in
octal format. When a valid packet is received, the rightmost bit is lit.
When a packet is sent, the register is shifted left one bit with 0
replacing the rightmost bit. If the +reach+ value is nonzero, the server
is reachable; otherwise, it is unreachable. Note that, even if all
servers become unreachable, the system continues to show valid time to
dependent applications.
3.  A useful indicator of miscellaneous problems is the +flash+ peer
variable, which shows the result of 13 sanity tests. It contains the
link:decode.html#flash[flash status word] bits, commonly called
flashers, which displays the current errors for the association. These
bits should all be zero for a valid server.
4.  The three peer variables +filtdelay+, +filtoffset+ and +filtdisp+
show the delay, offset and jitter statistics for each of the last eight
measurement rounds. These statistics and their trends are valuable
performance indicators for the server, client and the network. For
instance, large fluctuations in delay and jitter suggest network
congestion. Missing clock filter stages suggest packet losses in the
network.
5.  The synchronization distance, defined as one-half the delay plus the
dispersion, represents the maximum error statistic. The jitter
represents the expected error statistic. The maximum error and expected
error calculated from the peer variables represents the quality metric
for the server. The maximum error and expected error calculated from the
system variables represents the quality metric for the client. If the
root synchronization distance for any server exceeds 1.5 s, called the
select threshold, the server is considered invalid.

Sometimes the time distribution of errors can be revealing. It's a
good idea to look occasionally at the plots produced by
link:ntpviz.html[ntpviz].

== Large Frequency Errors

The frequency tolerance of computer clock oscillators varies widely,
sometimes above 500 ppm. While the daemon can handle frequency errors up
to 500 ppm, or 43 seconds per day, values much above 100 ppm reduce the
headroom, especially at the lowest poll intervals. To determine the
particular oscillator frequency, start +ntpd+ using the +noselect+
option with the +server+ configuration command.

Record the time of day and offset displayed by the +ntpq+
link:ntpq.html#peer[+peer+] command. Wait for an hour or so and record the
time of day and offset. Calculate the frequency as the offset difference
divided by the time difference. If the frequency offset is much above 100 ppm,
the link:ntpfrob.html[{ntpfrobman}] program might be useful to adjust the
kernel clock frequency below that value. For systems that do not support
this program, this might be one using a command in the system startup
file.

== Access Controls

Provisions are included in +ntpd+ for access controls which deflect
unwanted traffic from selected hosts or networks. The controls described
on the link:accopt.html[Access Control Options] include detailed packet
filter operations based on source address and address mask. Normally,
filtered packets are dropped without notice other than to increment
tally counters. However, the server can be configured to send a
"kiss-o'-death" (KoD) packet to the client either when explicitly
configured or when cryptographic authentication fails for some reason.
The client association is permanently disabled, the access denied bit
(BOGON4) is set in the flash variable and a message is sent to the system
log.

The access control provisions include a limit on the packet rate from a
host or network. If an incoming packet exceeds the limit, it is dropped
and a KoD sent to the source. If this occurs after the client
association has synchronized, the association is not disabled, but a
message is sent to the system log. See the link:accopt.html[Access
Control Options] page for further information.

== Large Delay Variations

In some reported scenarios an access line may show low to moderate
network delays during some period of the day and moderate to high delays
during other periods. Often the delay on one direction of transmission
dominates, which can result in large time offset errors, sometimes in
the range up to a few seconds. It is not usually convenient to run
+ntpd+ throughout the day in such scenarios, since this could result in
several time steps, especially if the condition persists for greater
than the stepout threshold.

Specific provisions have been built into +ntpd+ to cope with these
problems. The scheme is called "huff-'n-puff and is described on the
link:miscopt.html[Miscellaneous Options] page. An alternative approach
in such scenarios is first to calibrate the local clock frequency error
by running +ntpd+ in continuous mode during the quiet interval and let
it write the frequency to the +ntp.drift+ file. Then, run +ntpd -q+ from
a cron job each day at some time in the quiet interval. In systems with
the nanokernel or microkernel performance enhancements, including
Solaris, Tru64, Linux and FreeBSD, the kernel continuously disciplines
the frequency so that the residual correction produced by +ntpd+ is
usually less than a few milliseconds.

== Cryptographic Authentication

Reliable source authentication requires the use of symmetric key
link:authopt.html[Authentication Options] page. In symmetric key
cryptography servers and clients share session keys contained in a
secret key file. In public key cryptography, the server has a
private key, never shared, and a public key with unrestricted
distribution. Symmetric kays can be produced by
the link:ntpkeygen.html[+ntpkeygen+] program.

Problems with symmetric key authentication are usually due to mismatched
keys or improper use of the +trustedkey+ command. A simple way to check
for problems is to use the trace facility, which is enabled using the
+ntpd -d+ command line. As each packet is received a trace line is
displayed which shows the authentication status in the +auth+ field. A
status of 1 indicates the packet was successful authenticated; otherwise
it has failed.

== Debugging Checklist

If the +ntpq+ or program does not show that messages are being
received by the daemon or that received messages do not result in
correct synchronization, verify the following:

1.  Check the system log for +ntpd+ messages about configuration errors,
name-lookup failures or initialization problems. Common system log
messages are summarized on the link:msyslog.html[+ntpd+ System Log
Messages] page.  If you specify a log file, be sure to check in
your main syslog file (and be sure it logs entries from ntpd) since
some of the errors are logged before it switches to the specified
log file.

2. Check to be sure that only one copy of +ntpd+ is running.

3.  Verify using +ping+ or other utility that packets actually do make
the round trip between the client and server. Verify using +dig+,
+nslookup+ or other utility that the DNS server names do exist and
resolve to valid Internet addresses. Be aware that ICMP (ping) packets
may be firewalled or filtered anywhere in the path. Ping failure does not
explicitly mean that the client and server cannot exchange NTP's
UDP traffic.

4.  Check that the remote NTP server is up and running. The usual
evidence that it is not is a +Connection refused+ message.

5.  Using the +ntpq+ program, verify that the packets received and
packets sent counters are incrementing. If the sent counter does not
increment and the configuration file includes configured servers,
something may be wrong in the host network or interface configuration.
If the sent counter does increment, but the received counter does not
increment, something may be wrong in the network or the server NTP
daemon may not be running or the server itself may be down or not
responding.

6.  If both the sent and received counters do increment, but the +reach+
values in the peers billboard with +ntpq+ continues to show zero,
received packets are probably being discarded for some reason. If this
is the case, the cause should be evident from the +flash+ variable as
discussed above and on the +ntpq+ page. It could be that the server has
disabled access for the client address, in which case the +refid+ field
in the +ntpq+ peers billboard will show a kiss code. See
link:decode.html#kiss[Kiss Codes] for a full list of the codes and their
meanings.

7.  If the +reach+ values in the peers billboard show the servers are
alive and responding, note the tattletale symbols at the left margin,
which indicate the status of each server resulting from the various
grooming and mitigation algorithms. The interpretation of these symbols
is discussed on the +ntpq+ page. After a few minutes of operation, one
or another of the reachable server candidates should show a * tattletale
symbol. If this doesn't happen, the intersection algorithm, which
classifies the servers as truechimers or falsetickers, may be unable to
find a majority of truechimers among the server population.

8.  If all else fails, see the FAQ and/or the discussion and briefings
at the project website.

'''''

include::includes/footer.adoc[]