KERNEL_NOTES FOR DIABLO
(0) Location of options
Diablo compilation options mainly appear in two files: lib/config.h and
lib/vendor.h. lib/config.h is supposed to hold only permanent
configuration options. The more advanced options are usually disabled
unless it is possible to do preprocessor conditionals on the OS version.
Generally speaking, any option overrides that you do should be done in
lib/vendor.h.
(I) Use of mmap()
Diablo requires at least shared read-only file maps to work properly.
This is known to work on Sun, Solaris, IRIX, and FreeBSD.
BSDI releases, including 3.0, are known to have serious problems with mmap()
and it is not suggested that you run diablo on them.
Once you get past shared read-only file maps, you get into shared
read-write file maps, shared read-write anonymous maps, and sys-v
shared memory maps. These are optional. I believe SunOS, Solaris,
IRIX, and FreeBSD support shared r/w maps, but SunOS does not support
anonymous maps (Solaris does). Most systems support sys-v shared memory.
I have only tested advanced mmap features on FreeBSD.
Diablo will work fine with systems which do not have a unified buffer
cache for read+write mmaps, which means all mmap features will work
with FreeBSD 2.2.x or greater just fine.
Memory allocation features:
USE_ANON_MMAP Allows diablo to use an anonymous private r/w mmap
to allocate memory. This typically incurs the least overhead.
USE_FALL_MMAP Diablo uses a temporary file private mmap which it
then remove()s to allocate memory. May or may not
work well depending on how the filesystem works.
The default is to simply use malloc().
USE_SPAM_RW_MAP Use a read+write mmap() for the spam cache file,
otherwise uses a read-only mmap and seek+write to update it.
USE_SPAM_SHM Use sysv-shared memory to map the spam cache. The
spam cache will be read from its file into shared
memory on diablo startup and written back on final
exit. This is the most efficient spam-cache memory
option in diablo and should be used whenever possible.
USE_PCOMMIT_RW_MAP Use a read+write mmap() for the precommit cache,
otherwise uses a read-only mmap and seek+write to update it.
USE_PCOMMIT_SHM Use sysv-shared memory to map the precommit cache.
This can double dhistory lookup performance and lead
to better stability under extreme loading conditions
when used with DO_PCOMMIT_POSTCACHE. This option is
recommended where sysv shared memory is available.
DO_PCOMMIT_POSTCACHE Use the precommit cache to hold recent dhistory file
hits. Recommended only if USE_PCOMMIT_RW_MAP or
USE_PCOMMIT_SHM is set.
(II) memory, disk, and cpu
A 100 MIPS class cpu is suggested for up to 40 feeds, a 200 MIPS class cpu
is suggested otherwise. Nominally, a pentium-pro 200 running Linux or
FreeBSD, a Sun-ultra running solaris, or a 150MHz R4400 or better SGI box
running IRIX is suggested. I use FreeBSD boxes.
A minimum of 128MB of ram is required (mainly to maintain the dhistory
file efficiently). If you have more than 30 feeds, 192MB of ram is
suggested. If you have more than 70 feeds, 256MB of ram is suggested.
The more memory the merrier.
The minimum recommended disk configuration is three fast 4G disks.
sd0 would be used as the root disk, but half of it (2G) would be the
/news partition. sd1 and sd2 would be striped together to make an 8G
spool. A stripe size of 2048 sectors (1 MByte) is suggested. NOTE that
a large /news partition is required. It must not only hold the dhistory
file and a backup of the dhistory file, it must also hold outgoing queue
files and not blow up if outgoing feeds have problems and start to
back up. /news/dqueue can easily take a gig all by itself.
The nominally recommended disk configuration is two fast 2G disks and
two or more fast 4G disks, with /news striped on the first two disks and
the spool striped on the second two.
An ultra-wide SCSI controller is recommended. One will generally be
sufficient, but if you intend to run more than 80 feeds you should
consider having two. UW is suggested for the transaction rate, not
the disk throughput. Well-cooled Seagate drives are recommended.
The machine should not ever have to swap, but swap should be configured
to allow the machine to retire idle processes. I suggest configuring
128MB of swap on every disk to spread any swap activity around.
(III) file descriptors, process limits, datasize resource limits
Configure the system to support a minimum of 512 descriptors per process
and at least 4096 descriptors for the system as a whole. The system
must support at least 512 processes per user and 1024 total processes.
This may involve both kernel configuration and resource limit settings.
The datasize limit should be at least 128MB.
(IV) NBUFs - kernel filesystem buffers
On kernels for which filesystem buffers are static, configure a large
number of buffers. If you have 256MB of ram, I would dedicate half
of it to filesystem buffers.
On kernels which have a dynamic buffer cache (FreeBSD, for example), but
do not have a unified buffer cache, NBUF should be configured to at least
6144 (around 24 MBytes of KVM) because mmap support is implemented on top of the
primary buffer cache, which is dynamic. If you configure too much, you
will reduce the system's ability to manage its memory.
The typical FreeBSD kernel config line is:
	options "NBUF=6144"
(V) DHistory file tuning
Diablo should be able to handle upwards of 3000 accepted articles/min
and message-id history lookups (check/ihave) rates between 40,000 and
100,000 lookups/minute. The actual performance depends heavily on
the amount of memory you have and the number of diablo processes
in contention with each other.
Many kernels will bog down on internal filesystem locks as the number
of incoming feeds rises. You need to worry once you get over 35 or so
simultaneous diablo processes. Adding memory or reducing the size of
the dhistory file will help here.
The dhistory file defaults to a 14 day retention and will stabilize
at between 350 and 400 MBytes given an article rate of 800,000 articles/day
(a full feed as of this writing). You can compile diablo with a lower
retention by setting either USE_SHORT_REMEMBER in lib/vendor.h or
setting a specific REMEMBERDAYS in lib/vendor.h. USE_SHORT_REMEMBER
sets the retention to 7 days and the dhistory file will stabilize at
between 175 and 200 MBytes.
The DHistory file hash table size is programmable, but not dynamic.
The default is 4 million entries. You can change this with the -h
option in diload. For example, '-h 8m'. The hash table size must
be a power of 2. The new hash table size will then take effect when
you next run biweekly.trim. Either 4m or 8m is recommended. NOTE:
if you make a mistake specifying the hash table size, you can blow
up the news system so be careful.
(VI) Tuning outgoing feeds to INN
Please examine the samples/dnewsfeeds file. Generally speaking, you need
to tune any outgoing feeds to INN reader boxes.
You want to do two things: First, you want to make sure the spam filter
is configured properly and turned on. The spam filter is turned on by
default in Diablo 1.12 or greater. The sample dnewsfeeds file contains a
spam filter starter which you should use.
Second, you should consider cutting control messages in front of articles
and then delaying non-control messages by 5 minutes. This will allow
cancel controls to leap ahead of articles and reduce INN's article write
overhead (which is usually the big bottleneck in INN).
Typically, you separate control messages out by creating two separate
feeds to your reader box. The first one has a 'delgroupany control.*',
and the second one has a 'requiregroup control.*'. Taking the example
from the sample dnewsfeeds file:
label nntp2a
    ... other add and delgroups ...
    delgroupany control.*
end

label nntp2c
    ... other add and delgroups ...
    requiregroup control.*
end
Then, in dnntpspool.ctl you program the normal feed for queue-delayed,
to delay it by 5 minutes (assuming you run dspoolout from cron every 5
minutes), and you program the control feed as realtime. Also, if you
don't mind slightly longer delays, q2 may be a better choice than q1.
nntp2a oldnntp.best.com 500 n4 q1
nntp2c nntp1x.ba.best.com 500 n4 realtime
(VII) Tuning dexpire
There are two cron jobs that deal with dexpire. The first is called
quadhr.expire and nominally runs dexpire every four hours (6 times a day).
The second is called hourly.expire and attempts to rerun dexpire if
the quadhr cron fails.
DExpire in Diablo is very fast. Since diablo stores multiple articles
per spool file, DExpire is able to free up disk space very quickly and
you should not be scared of running it often. DExpire's biggest cost
is that it must scan the dhistory file. Unlike INN's expire, dexpire
does not rewrite the dhistory file. Instead, it expires entries in
place which is considerably faster.
The sample expiration cron jobs adm/quadhr.expire and adm/hourly.expire
set a free space target of 2 gigabytes. This is the suggested free space
target if you run expire every 4 hours and is designed to deal with
large influxes of data that may occur in a 4 hour period. You can run
a tighter free space target if you run dexpire more often. You can
probably get away with a 1 gigabyte (1000 megabyte) free space target
if you run dexpire every 2 or 3 hours, but I suggest leaving the free
space target alone.
(VIII) Typical Performance from news1.best.com
news1.best.com is a FreeBSD 2.2.x box running on a PPro 200 with 192 MB
of ram, one 2940UW SCSI controller, and three 4G Seagate ST34371W's.
One 4G disk holds the system and /news; the other two are striped for
the spool. It is partitioned as follows:
Filesystem 1K-blocks Used Avail Capacity Mounted on
/dev/sd0a 127151 49473 67506 42% /
/dev/sd0d 63567 1369 57113 2% /var
/dev/sd0e 465940 43490 385175 10% /var/log
/dev/sd0f 232474 5 213872 0% /var/tmp
/dev/sd0g 1017327 432274 503667 46% /usr
/dev/sd0h 1705391 720650 848310 46% /news <--- too small
/dev/ccd0c 8176355 5596859 1925387 74% /news/spool/news
procfs 4 4 0 100% /proc
The ccd partition is configured with a 4M stripe, designed to
optimally handle a large number of diablo processes each writing
to its own private, but large, spool file.
ccd0 8192 none /dev/sd1d /dev/sd2d
The machine is currently configured with 95 feeds, of which 15 are
'official' fully transited backbone feeds and another 10 are
fully transited backup backbone feeds. Another 20 send me message-id's
equivalent to nearly full feeds. Most of the remainder are mainly
outgoing feeds to T1 customers and their incoming component is minor.
When news1.best.com is taken down for 30 minutes, then brought
up again, it gets pounded by about half of its feeds and is
able to put away around 25 articles/sec and around 500
message-id lookups/sec. What this means, basically, is that
although the machine is able to catch up in real articles,
many of the feeds continue to get behind for a short period of time
(500/45 = 11 checks/sec/full-incoming-feed, not quite enough).
The reason the check rate is so low is basically due to the load on
the system. 90+ diablo processes all pounding away on the caches
and the disks reduces efficiency all around. Half the feeds would
result in almost triple the efficiency due not only to the lower
level of pounding, but also due to the greater amount of memory available.
The real issue is one of message-id load. I run news1.best.com
with a high message-id load on purpose... most news admins do not need
45 full incoming message-id loads to get good news coverage... four
or five will do just as well.
A simple rule of thumb for a news admin is to take full feeds only
from half a dozen or so sources and ask the remaining sources to only
send you locally posted articles.
In any case, with news1.best.com, the caches start to recover once the
articles have begun to catch up and get back in sync. The message-id
lookup capability increases from 500/sec to 10000/sec and the incoming
feeds catch up very quickly after that.
Disk I/O is limited by seeking, so the transfers/sec statistic is often
more useful than throughput statistics. Once caught up, news1.best.com
stabilizes at around 30 tps on each of its three disks. When catching up,
under its heaviest load, sd0 hits around 90 tps which is basically
saturation, while sd1 and sd2 stabilize at between 60 and 70 tps. The
disks are theoretically capable of around 100 tps (averaged).
There are a number of ways to reduce the dhistory file load. Reducing the
number of full incoming feeds to a reasonable number (4 or 5) is one
way. Another way is to stripe /news AND the spool rather than just the
spool. A third way is to simply pack in more memory for better caching.
A fourth way is to reduce the default history retention (see the release
notes for setting REMEMBERDAYS) from 14 days to 9 to significantly
reduce the size of the dhistory file. Probably the best way to reduce
the dhistory file load is with better management of incoming feeds, only
a few actually need to be full feeds.
After you handle the dhistory file load, tuning realtime vs non-realtime
feeds comes next. realtime feeds should only be used under certain
conditions. If you are a large ISP providing feeds to your T1 customers,
making those feeds realtime gets news to them rather than them getting it
over your internet backhaul from someone else. If you peer at a MAE,
where you do not pay on a bandwidth basis, making feeds that go over that
link in realtime will reduce the load on other feeds that go over more
expensive links, especially if your MAE peers return the favor. Local
feeds to newsreader boxes do not have to be realtime, nor do most other
feeds. Why make a feed over a costly internet backhaul realtime when all
it does is increase your outgoing bandwidth?