<pre>Network Working Group R. Moats
Request for Comments: 2517 R. Huber
Category: Informational AT&T
February 1999
<span class="h1">Building Directories from DNS: Experiences from WWWSeeker</span>
Status of this Memo
This memo provides information for the Internet community. It does
not specify an Internet standard of any kind. Distribution of this
memo is unlimited.
Copyright Notice
Copyright (C) The Internet Society (1999). All Rights Reserved.
Abstract
There has been much discussion and several documents written about
the need for an Internet Directory. Recently, this discussion has
focused on ways to discover an organization's domain name without
relying on use of DNS as a directory service. This memo discusses
lessons that were learned during InterNIC Directory and Database
Services' development and operation of WWWSeeker, an application that
finds a web site given information about the name and location of an
organization. The back end database that drives this application was
built from information obtained from domain registries via WHOIS and
other protocols. We present this information to help future
implementors avoid some of the blind alleys that we have already
explored. This work builds on the Netfind system that was created by
Mike Schwartz and his team at the University of Colorado at Boulder
[<a href="#ref-1" title="&quot;Applying an Information Gathering Architecture to Netfind: A White Pages Tool for a Changing and Growing Internet&quot;">1</a>].
<span class="h2"><a class="selflink" id="section-1" href="#section-1">1</a>. Introduction</span>
Over time, there have been several RFCs [<a href="#ref-2" title="&quot;Plan for Internet Directory Services&quot;">2</a>, <a href="#ref-3" title="&quot;A Strategic Plan for Deploying an Internet X.500 Directory Service&quot;">3</a>, <a href="#ref-4" title="&quot;White Pages Meeting Report&quot;">4</a>] about approaches
for providing Internet Directories. Many of the earlier documents
discussed white pages directories that supply mappings from a
person's name to their telephone number, email address, etc.
More recently, there has been discussion of directories that map from
a company name to a domain name or web site. Many people are using
DNS as a directory today to find this type of information about a
given company. Typically when DNS is used, users guess the domain
name of the company they are looking for and then prepend "www.".
This makes it highly desirable for a company to have an easily
<span class="grey">Moats & Huber Informational [Page 1]</span></pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-2" ></span>
<span class="grey"><a href="./rfc2517">RFC 2517</a> Building Directories from DNS February 1999</span>
guessable name.
There are two major problems here. As the number of assigned names
increases, it becomes more difficult to get an easily guessable name.
Also, the TLD must be guessed as well as the name. While many users
just guess ".COM" as the "default" TLD today, there are many two-
letter country code top-level domains in current use as well as other
gTLDs (.NET, .ORG, and possibly .EDU) with the prospect of additional
gTLDs in the future. As the number of TLDs in general use increases,
guessing gets more difficult.
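
To make the guessing problem concrete, the following is a minimal
Python sketch of the guesses a user works through by hand. The TLD
list and the rule for squashing a company name into a single label
are assumptions made for illustration; they are not taken from any
deployed system.

```python
# Assumed, illustrative TLD list: the gTLDs named in the text plus two
# example country codes. A real list would be much longer.
CANDIDATE_TLDS = ["com", "net", "org", "edu", "us", "uk"]


def guess_hostnames(company, tlds=CANDIDATE_TLDS):
    """Return the "www." hostnames a user might try for a company name."""
    # Assumption: users squash the name into one lowercase label,
    # keeping only letters, digits, and hyphens.
    label = "".join(ch for ch in company.lower() if ch.isalnum() or ch == "-")
    return ["www.%s.%s" % (label, tld) for tld in tlds]
```

Each additional TLD adds one more guess per candidate label, which is
why guessing scales poorly as the number of TLDs in general use grows.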
Between July 1996 and our shutdown in March 1998, the InterNIC
Directory and Database Services project maintained the Netfind search
engine [<a href="#ref-1" title="&quot;Applying an Information Gathering Architecture to Netfind: A White Pages Tool for a Changing and Growing Internet&quot;">1</a>] and the associated database that maps organization
information to domain names. This database thus acted as the type of
Internet directory that associates company names with domain names.
We also built WWWSeeker, a system that used the Netfind database to
find web sites associated with a given organization. The experience
gained from maintaining and growing this database provides valuable
insight into the issues of providing a directory service. We present
it here to allow future implementors to avoid some of the blind
alleys that we have already explored.
<span class="h2"><a class="selflink" id="section-2" href="#section-2">2</a>. Directory Population</span>
<span class="h3"><a class="selflink" id="section-2.1" href="#section-2.1">2.1</a> What to do?</span>
There are two issues in populating a directory: finding all the
domain names (building the skeleton) and associating those domains
with entities (adding the meat). These two issues are discussed
below.
<span class="h3"><a class="selflink" id="section-2.2" href="#section-2.2">2.2</a> Building the skeleton</span>
In "building the skeleton", it is popular to suggest using a variant
of a "tree walk" to determine the domains that need to be added to
the directory. Our experience is that this is neither a reasonable
nor an efficient proposal for maintaining such a directory. Except
for some infrequent and long-standing DNS surveys [<a href="#ref-5" title="&quot;Network Wizards Internet Domain Survey&quot;">5</a>], DNS "tree
walks" tend to be discouraged by the Internet community, especially
given that the frequency of DNS changes would require a new tree walk
monthly (if not more often). Instead, our experience has shown that
data on allocated DNS domains can usually be retrieved in bulk
fashion with FTP, HTTP, or Gopher (we have used each of these for
particular TLDs). This has the added advantage of both "building the
skeleton" and "adding the meat" at the same time. Our favorite
method for finding a server that has allocated DNS domain information
is to start with the list maintained at
</pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-3" ></span>
<a href="http://www.alldomains.com/countryindex.html">http://www.alldomains.com/countryindex.html</a> and go from there.
Before this was available, it was necessary to hunt for a registry
using trial and error.
When maintaining the database, existing domains may be verified via
direct DNS lookups rather than a "tree walk." "Tree walks" should
therefore be the choice of last resort for directory population, and
bulk retrieval should be used whenever possible.
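
Verification by direct lookup can be sketched as follows. This is an
illustration only: the memo does not say which record types were
checked, so the sketch assumes that an address lookup on the name
itself is an acceptable test, and the function names are hypothetical.
A domain that has only NS or MX records would be reported as stale
here; checking those record types needs a DNS library rather than the
standard socket module.

```python
import socket


def domain_resolves(name, lookup=socket.getaddrinfo):
    """Return True if a direct lookup of `name` yields an address."""
    try:
        return len(lookup(name, None)) > 0
    except socket.gaierror:
        # The resolver reported a failure (for example, no such name).
        return False


def verify_domains(names, lookup=socket.getaddrinfo):
    """Split known domains into (still resolving, candidates for removal)."""
    alive, stale = [], []
    for name in names:
        (alive if domain_resolves(name, lookup) else stale).append(name)
    return alive, stale
```

Each known domain costs one query against its own servers, so no zone
needs to be enumerated. The `lookup` parameter exists so the logic can
be exercised without network access.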
<span class="h3"><a class="selflink" id="section-2.3" href="#section-2.3">2.3</a> Adding the meat</span>
A possibility for populating a directory ("adding the meat") is to
use an automated system that makes repeated queries using the WHOIS
protocol to gather information about the organization that owns a
domain. The queries would be made against a WHOIS server located
with the above method. At the conclusion of the InterNIC Directory
and Database Services project, our backend database contained about
2.9 million records built from data that could be retrieved via
WHOIS. The entire database contained 3.25 million records, with the
additional records coming from sources other than WHOIS.
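
A single automated WHOIS query can be sketched in a few lines of
Python. The sketch relies only on the basic shape of the protocol (a
query line sent over TCP port 43, with the reply read until the server
closes the connection). The function name, the timeout, and the choice
of reply decoding are assumptions, and this is not the code used by
the project.

```python
import socket


def whois_query(query, server, port=43, timeout=10.0,
                connect=socket.create_connection):
    """Send one WHOIS query and return the server's full text reply."""
    with connect((server, port), timeout) as conn:
        # A WHOIS request is the query text followed by CRLF.
        conn.sendall(query.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # the server closes the connection when done
                break
            chunks.append(data)
    # Assumption: decode as latin-1, which accepts any byte sequence.
    return b"".join(chunks).decode("latin-1")
```

An automated gatherer would call this once per domain against the
registry's WHOIS server and then parse the reply. Reply layouts differ
between registries, so parsing is left out of the sketch.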
In our experience this information contains many factual and
typographical errors and requires further examination and processing
to improve its quality. Further, TLD registrars that support WHOIS
typically only support WHOIS information for second level domains
(e.g., ne.us) as opposed to lower level domains (e.g.,
windrose.omaha.ne.us). Also, there are TLDs without registrars, TLDs
without WHOIS support, and still other TLDs that use other methods
(HTTP, FTP, gopher) for providing organizational information. Based
on our experience, an implementor of an internet directory needs to
support multiple protocols for directory population. An automated
WHOIS search tool is necessary, but isn't enough.
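
One way to organize multi-protocol population is a per-TLD table that
names the retrieval method for each registry. The sketch below is
hypothetical throughout: the memo does not list which TLD used which
protocol, so both tables are supplied by the caller, and the function
names are invented for illustration.

```python
def tld_of(domain):
    """Return the top-level label of a domain name, lowercased."""
    return domain.rstrip(".").rsplit(".", 1)[-1].lower()


def populate(domains, sources, fetchers):
    """Fetch a record for each domain using its registry's method.

    sources:  TLD -> (method, host), e.g. ("whois", "whois.nic.example")
    fetchers: method -> callable(domain, host) returning a record
    Both tables are hypothetical; the memo does not list them.
    """
    records, skipped = {}, []
    for domain in domains:
        source = sources.get(tld_of(domain))
        if source is None or source[0] not in fetchers:
            skipped.append(domain)  # no known registry source
            continue
        method, host = source
        records[domain] = fetchers[method](domain, host)
    return records, skipped
```

Domains whose TLD has no usable source are returned separately so
that they can be handled by hand or by another method.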
<span class="h2"><a class="selflink" id="section-3" href="#section-3">3</a>. Directory Updating: Full Rebuilds vs Incremental Updates</span>
Given the size of our database in April 1998 when it was last
generated, a complete rebuild of the database that is available from
WHOIS lookups would require between 134.2 and 167.8 days just for
WHOIS lookups from a Sun SPARCstation 20. This estimate does not
include other considerations (for example, inverting the token tree
required about 24 hours processing time on a Sun SPARCstation 20)
that would increase the amount of time to rebuild the entire
database.
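
The rebuild estimate can be read as a per-lookup cost. The arithmetic
below assumes one sequential lookup for each of the roughly 2.9
million WHOIS-derived records mentioned in Section 2.3. That reading
is an inference from the published figures, not a statement in the
memo.

```python
SECONDS_PER_DAY = 86_400
WHOIS_RECORDS = 2_900_000  # WHOIS-derived records, from Section 2.3


def seconds_per_lookup(total_days, records=WHOIS_RECORDS):
    """Average seconds per lookup implied by a total rebuild time."""
    return total_days * SECONDS_PER_DAY / records


low = seconds_per_lookup(134.2)
high = seconds_per_lookup(167.8)
print("%.1f to %.1f seconds per lookup" % (low, high))
# prints "4.0 to 5.0 seconds per lookup"
```

Under that assumption, the quoted range corresponds to about 4 to 5
seconds per WHOIS lookup.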
Whether this is feasible depends on the frequency of database updates
provided. Because of the rate of growth of allocated domain names
(150K-200K new allocated domains per month in early 1998), we
provided monthly updates of the database. To rebuild the database
</pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-4" ></span>
each month (based on the above time estimate) would require between 3
and 5 machines to be dedicated full time (independent of machine
architecture). Instead, we checkpointed the allocated domain list
and rebuilt on an incremental basis during one weekend of the month.
This allowed us to complete the update on between 1 and 4 machines (3
Sun SPARCstation 20s and a dual-processor Sparcserver 690) without
full dedication over a couple of days. Further, by coupling
incremental updates with periodic refresh of existing data (which can
be done during another part of the month and doesn't require full
dedication of machine hardware), older records would be periodically
updated when the underlying information changes. The tradeoff is
timeliness and accuracy of data (some data in the database may be
old) against hardware and processing costs.
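
The checkpoint approach can be sketched as a set difference between
last month's domain list and the current one, so that only new names
need fresh lookups. The memo does not describe the mechanism, so this
is an assumed reconstruction with hypothetical function names. The
cost helper assumes about 4 to 5 seconds per lookup, the rate implied
by the 134.2 to 167.8 day estimate for roughly 2.9 million WHOIS
records.

```python
def plan_incremental_update(checkpoint, current):
    """Compare last month's domain list with the current one."""
    old, new = set(checkpoint), set(current)
    added = sorted(new - old)    # names that need a first lookup
    removed = sorted(old - new)  # names that are candidates for deletion
    return added, removed


def lookup_days(domains, seconds_each, machines=1):
    """Days of elapsed time to look up `domains` names across machines."""
    return domains * seconds_each / 86_400 / machines


# 150K to 200K new domains per month, spread over 4 machines.
print("%.1f to %.1f days" % (lookup_days(150_000, 4, 4),
                             lookup_days(200_000, 5, 4)))
# prints "1.7 to 2.9 days"
```

Under these assumptions a month of growth takes roughly two to three
days on four machines, which is in line with the "couple of days"
reported above.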
<span class="h2"><a class="selflink" id="section-4" href="#section-4">4</a>. Directory Presentation: Distributed vs Monolithic</span>
While a distributed directory is a desirable goal, we maintained our
database as a monolithic structure. Given past growth, it is not
clear at what point migrating to a distributed directory becomes
actually necessary to support customer queries. Our last database
contained over 3.25 million records in a flat ASCII file. Searching
was done by a PERL script that searched an inverted tree (also produced by a
PERL script). While admittedly primitive, this configuration
supported over 200,000 database queries per month from our production
servers.
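
The idea of searching a flat file through an inverted structure can
be sketched as below. The original was written in PERL and its tree
layout is not described in this memo, so the sketch substitutes a
plain token-to-record dictionary in Python. The record format (one
text line per domain) and the tokenizing rule are assumptions.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase a record and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(records):
    """Map each token to the set of record numbers that contain it."""
    index = defaultdict(set)
    for number, record in enumerate(records):
        for token in tokenize(record):
            index[token].add(number)
    return index


def search(index, records, query):
    """Return the records that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in tokens))
    return [records[n] for n in sorted(hits)]
```

A query touches only the entries for its own tokens rather than
scanning every record, which is what lets a simple structure of this
kind keep up with a growing flat file.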
Increasing the database size only requires more disk space to hold
the database and inverted tree. Of course, using database technology
would probably improve performance and scalability, but we had not
reached the point where this technology was required.
<span class="h2"><a class="selflink" id="section-5" href="#section-5">5</a>. Security Considerations</span>
The underlying data for the type of directory discussed in this
document is already generally available through WHOIS, DNS, and other
standard interfaces. No new information is made available by using
these techniques, though many types of search become much easier. To
the extent that easier access to this data makes it easier to find
specific sites or machines to attack, security may be decreased.
The protocols discussed here do not have built-in security features.
If one source machine is spoofed while the directory data is being
gathered, substantial amounts of incorrect and misleading data could
be pulled into the directory and be spread to a wider audience.
</pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-5" ></span>
In general, building a directory from registry data will not open any
new security holes since the data is already available to the public.
Existing security and accuracy problems with the data sources are
likely to be amplified.
<span class="h2"><a class="selflink" id="section-6" href="#section-6">6</a>. Acknowledgments</span>
The work described in this document was partially supported by the
National Science Foundation under Cooperative Agreement NCR-9218179.
<span class="h2"><a class="selflink" id="section-7" href="#section-7">7</a>. References</span>
[<a id="ref-1">1</a>] M. F. Schwartz, C. Pu. "Applying an Information
Gathering Architecture to Netfind: A White Pages Tool for a
Changing and Growing Internet", University of Colorado Technical
Report CU-CS-656-93. December 1993, revised July 1994.
URL:ftp://ftp.cs.colorado.edu/pub/cs/techreports/schwartz/Netfind
[<a id="ref-2">2</a>] Sollins, K., "Plan for Internet Directory Services", <a href="./rfc1107">RFC 1107</a>,
July 1989.
[<a id="ref-3">3</a>] Hardcastle-Kille, S., Huizer, E., Cerf, V., Hobby, R. and S.
Kent, "A Strategic Plan for Deploying an Internet X.500 Directory
Service", <a href="./rfc1430">RFC 1430</a>, February 1993.
[<a id="ref-4">4</a>] Postel, J. and C. Anderson, "White Pages Meeting Report", <a href="./rfc1588">RFC</a>
<a href="./rfc1588">1588</a>, February 1994.
[<a id="ref-5">5</a>] M. Lottor, "Network Wizards Internet Domain Survey", available
from <a href="http://www.nw.com/zone/WWW/top.html">http://www.nw.com/zone/WWW/top.html</a>
</pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-6" ></span>
<span class="h2"><a class="selflink" id="section-8" href="#section-8">8</a>. Authors' Addresses</span>
Ryan Moats
AT&T
15621 Drexel Circle
Omaha, NE 68135-2358
USA
EMail: jayhawk@att.com
Rick Huber
AT&T
Room C3-3B30, 200 Laurel Ave. South
Middletown, NJ 07748
USA
EMail: rvh@att.com
</pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-7" ></span>
<span class="h2"><a class="selflink" id="section-9" href="#section-9">9</a>. Full Copyright Statement</span>
Copyright (C) The Internet Society (1999). All Rights Reserved.
This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph are
included on all such copies and derivative works. However, this
document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of
developing Internet standards in which case the procedures for
copyrights defined in the Internet Standards process must be
followed, or as required to translate it into languages other than
English.
The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.
This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
</pre>