<!-- Converted by db4-upgrade version 1.1 -->
<section
xmlns="http://docbook.org/ns/docbook" version="5.0"
xmlns:xlink="http://www.w3.org/1999/xlink"
xml:id="Address_Standardizer"
>
<title>Address Standardizer</title>
<para>This is a fork of the <link xlink:href="http://www.pagcgeo.org/docs/html/pagc-11.html">PAGC standardizer</link> (original code for this portion was <link xlink:href="http://sourceforge.net/p/pagc/code/360/tree/branches/sew-refactor/postgresql">PAGC PostgreSQL Address Standardizer</link>). </para>
<para>The address standardizer is a single-line address parser that takes an input address and normalizes it based on a set of rules stored in a table, together with helper lex and gaz tables.</para>
<para>The code is built into a single PostgreSQL extension library called <code>address_standardizer</code>, which can be installed with <code>CREATE EXTENSION address_standardizer;</code>. In addition to the <code>address_standardizer</code> extension, a sample data extension called <code>address_standardizer_data_us</code> is built, which contains gaz, lex, and rules tables for US data. This extension can be installed via: <code>CREATE EXTENSION address_standardizer_data_us;</code></para>
<para>The code for this extension can be found in the PostGIS source tree under <filename>extensions/address_standardizer</filename> and is currently self-contained.</para>
<para>For installation instructions refer to: <xref linkend="installing_pagc_address_standardizer"/>.</para>
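<para>As a minimal sketch, assuming both extensions are available on the server, the following installs them and then checks the installed versions using the standard <code>pg_extension</code> catalog. The <code>IF NOT EXISTS</code> clause is an addition here so the statements can be re-run safely.</para>
<programlisting>-- Install the standardizer and the sample US data (safe to re-run)
CREATE EXTENSION IF NOT EXISTS address_standardizer;
CREATE EXTENSION IF NOT EXISTS address_standardizer_data_us;

-- Confirm what is installed in the current database
SELECT extname, extversion
FROM pg_extension
WHERE extname LIKE 'address_standardizer%'
ORDER BY extname;</programlisting>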
<section xml:id="Address_Standardizer_Basics"><title>How the Parser Works</title>
<para>The parser works from right to left, looking first at the macro elements
(postcode, state/province, city), and then at the micro elements to determine
whether we are dealing with a house number and street, an intersection, or a landmark.
It currently does not look for a country code or name, but that could be
introduced in the future.</para>
<variablelist>
<varlistentry>
<term>Country code</term>
<listitem><para>Assumed to be US or CA, based on whether the postcode is a US or Canadian postcode, or the state/province is a US state or Canadian province; otherwise US.</para></listitem>
</varlistentry>
<varlistentry>
<term>Postcode/zipcode</term>
<listitem><para>These are recognized using Perl-compatible regular expressions.
These regexes are currently in <filename>parseaddress-api.c</filename> and are relatively
simple to change if needed.</para></listitem>
</varlistentry>
<varlistentry>
<term>State/province</term>
<listitem><para>These are recognized using Perl-compatible regular expressions.
These regexes are currently in <filename>parseaddress-api.c</filename> but could be moved
into includes in the future for easier maintenance.</para></listitem>
</varlistentry>
</variablelist>
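<para>The macro elements described above can be inspected with <code>parse_address</code>, which is documented later in this chapter. This is only an illustrative sketch: the sample address is arbitrary and no particular output is claimed here.</para>
<programlisting>-- Show the macro elements (city, state, zip, country) that the parser
-- extracts from the right-hand side of a single-line address
SELECT city, state, zip, zipplus, country
FROM parse_address('1 Devonshire Place, Boston, MA 02109');</programlisting>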
</section>
<section xml:id="Address_Standardizer_Types">
<title>Address Standardizer Types</title><info>
<abstract>
<para>This section lists the PostgreSQL data types installed by the Address Standardizer extension. Note that we describe the casting behavior of these types, which is very
important, especially when designing your own functions.
</para>
</abstract>
</info>
<refentry xml:id="stdaddr">
<refnamediv>
<refname>stdaddr</refname>
<refpurpose>A composite type that consists of the elements of an address. This is the return type for the <varname>standardize_address</varname> function.</refpurpose>
</refnamediv>
<refsection>
<title>Description</title>
<para>A composite type that consists of the elements of an address. This is the return type for the <xref linkend="standardize_address"/> function. Some descriptions of the elements are borrowed from <link xlink:href="http://www.pagcgeo.org/docs/html/pagc-12.html#ss12.1">PAGC Postal Attributes</link>.</para>
<para>The token numbers denote the output reference number in the <xref linkend="rulestab"/>.</para>
<para>&address_standardizer_required;</para>
<variablelist>
<varlistentry>
<term><varname>building</varname></term>
<listitem>
<para> is text (token number <code>0</code>): Refers to building number or name. Unparsed building identifiers and types. Generally blank for most addresses.</para>
</listitem>
</varlistentry>
<varlistentry><term><varname>house_num</varname></term>
<listitem>
<para>is text (token number <code>1</code>): This is the street number on a street. Example <emphasis>75</emphasis> in <code>75 State Street</code>.</para>
</listitem>
</varlistentry>
<varlistentry><term><varname>predir</varname></term><listitem>
<para> is text (token number <code>2</code>): STREET NAME PRE-DIRECTIONAL such as North, South, East, West etc.</para>
</listitem></varlistentry>
<varlistentry><term><varname>qual</varname></term>
<listitem>
<para>is text (token number <code>3</code>): STREET NAME PRE-MODIFIER Example <emphasis>OLD</emphasis> in <code>3715 OLD HIGHWAY 99</code>.</para>
</listitem>
</varlistentry>
<varlistentry><term><varname>pretype</varname></term>
<listitem>
<para> is text (token number <code>4</code>): STREET PREFIX TYPE</para>
</listitem>
</varlistentry>
<varlistentry><term><varname>name</varname></term>
<listitem>
<para>is text (token number <code>5</code>): STREET NAME</para>
</listitem>
</varlistentry>
<varlistentry><term><varname>suftype</varname></term>
<listitem>
<para>is text (token number <code>6</code>): STREET POST TYPE e.g. St, Ave, Cir. A street type following the root street name. Example <emphasis>STREET</emphasis> in <code>75 State Street</code>.</para>
</listitem>
</varlistentry>
<varlistentry><term><varname>sufdir</varname></term>
<listitem>
<para>is text (token number <code>7</code>): STREET POST-DIRECTIONAL. A directional modifier that follows the street name. Example <emphasis>WEST</emphasis> in <code>3715 TENTH AVENUE WEST</code>.</para>
</listitem>
</varlistentry>
<varlistentry><term><varname>ruralroute</varname></term>
<listitem>
<para>is text (token number <code>8</code>): RURAL ROUTE. Example <emphasis>7</emphasis> in <code>RR 7</code>.</para>
</listitem>
</varlistentry>
<varlistentry><term><varname>extra</varname></term>
<listitem>
<para>is text: Extra information like Floor number.</para>
</listitem>
</varlistentry>
<varlistentry><term><varname>city</varname></term>
<listitem>
<para>is text (token number <code>10</code>): Example Boston.</para>
</listitem>
</varlistentry>
<varlistentry><term><varname>state</varname></term>
<listitem>
<para>is text (token number <code>11</code>): Example <code>MASSACHUSETTS</code></para>
</listitem>
</varlistentry>
<varlistentry><term><varname>country</varname></term>
<listitem>
<para>is text (token number <code>12</code>): Example <code>USA</code></para>
</listitem>
</varlistentry>
<varlistentry><term><varname>postcode</varname></term>
<listitem>
<para>is text POSTAL CODE (ZIP CODE) (token number <code>13</code>): Example <code>02109</code></para>
</listitem>
</varlistentry>
<varlistentry><term><varname>box</varname></term>
<listitem>
<para>is text POSTAL BOX NUMBER (token numbers <code>14</code> and <code>15</code>): Example <emphasis>3B</emphasis> in <code>BOX 3B</code>.</para>
</listitem>
</varlistentry>
<varlistentry><term><varname>unit</varname></term>
<listitem>
<para>is text Apartment number or Suite Number (token number <code>17</code>): Example <emphasis>3B</emphasis> in <code>APT 3B</code>.</para>
</listitem>
</varlistentry>
</variablelist>
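<para>Because <varname>stdaddr</varname> is a composite type, individual elements are read with the usual PostgreSQL <code>(value).field</code> syntax. The sketch below assumes the <code>address_standardizer_data_us</code> tables (<code>us_lex</code>, <code>us_gaz</code>, <code>us_rules</code>) are installed; the aliases <code>s</code> and <code>t</code> are arbitrary.</para>
<programlisting>-- Pick individual stdaddr elements out of the composite result
SELECT (s).house_num, (s).name, (s).suftype,
       (s).unit, (s).city, (s).state, (s).postcode
FROM (
  SELECT standardize_address('us_lex', 'us_gaz', 'us_rules',
           'One Devonshire Place, PH 301, Boston, MA 02109') AS s
) AS t;</programlisting>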
</refsection>
</refentry>
</section>
<section xml:id="Address_Standardizer_Tables">
<title>Address Standardizer Tables</title><info>
<abstract>
<para>This section lists the PostgreSQL table formats used by the address_standardizer for normalizing addresses. Note that these tables do not need to be named the same as what is referenced here. You can have different lex, gaz, rules tables for each country for example or for your custom geocoder. The names of these tables get passed into the address standardizer functions.
</para>
<para>The packaged extension <varname>address_standardizer_data_us</varname> contains data for standardizing US addresses.</para>
</abstract>
</info>
<refentry xml:id="rulestab">
<refnamediv>
<refname>rules table</refname>
<refpurpose>The rules table contains a set of rules that map a sequence of address input tokens to a standardized output sequence. A rule is defined as a set of input tokens followed by -1 (terminator), followed by a set of output tokens, followed by -1, followed by a number denoting the kind of rule, followed by the ranking of the rule.</refpurpose>
</refnamediv>
<refsection>
<title>Description</title>
<para>A rules table must have at least the following columns, though you are allowed to add more for your own uses. </para>
<variablelist>
<varlistentry>
<term><varname>id</varname></term>
<listitem>
<para>Primary key of table</para>
</listitem>
</varlistentry>
<varlistentry><term><varname>rule</varname></term>
<listitem>
<para>text field denoting the rule. Details at <link xlink:href="http://www.pagcgeo.org/docs/html/pagc-12.html#--r-rec--">PAGC Address Standardizer Rule records</link>.</para>
<para>A rule consists of a set of non-negative integers representing input tokens, terminated by a -1, followed by an equal number of non-negative integers representing postal attributes, terminated by a -1, followed by an integer representing a rule type, followed by an integer representing the rank of the rule. The rules are ranked from 0 (lowest) to 17 (highest).</para>
<para>So, for example, the rule <code>2 0 2 22 3 -1 5 5 6 7 3 -1 2 6</code> maps the sequence of input tokens <emphasis>TYPE NUMBER TYPE DIRECT QUALIF</emphasis> to the output sequence <emphasis>STREET STREET SUFTYP SUFDIR QUALIF</emphasis>. The rule is an ARC_C rule of rank 6.</para>
<para>Numbers for corresponding output tokens are listed in <xref linkend="stdaddr"/>.</para>
</listitem>
</varlistentry>
</variablelist>
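<para>To make the rule layout concrete, the following sketch splits the example rule into its four parts using only built-in PostgreSQL string functions. It needs no tables. It assumes the parts are separated by whitespace, as in the example rule; the column aliases are illustrative.</para>
<programlisting>-- Split a rule into input tokens, output tokens, rule type and rank
WITH parts AS (
  SELECT regexp_split_to_array(
           '2 0 2 22 3 -1 5 5 6 7 3 -1 2 6',
           '\s+-1\s+') AS p   -- split on the two -1 terminators
)
SELECT p[1]                          AS input_tokens,   -- 2 0 2 22 3
       p[2]                          AS output_tokens,  -- 5 5 6 7 3
       split_part(p[3], ' ', 1)::int AS rule_type,      -- 2 (ARC_C)
       split_part(p[3], ' ', 2)::int AS rule_rank       -- 6
FROM parts;</programlisting>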
</refsection>
<refsection xml:id="rule_input_tokens"><title>Input Tokens</title>
<para>Each rule starts with a set of input tokens followed by a terminator <code>-1</code>. Valid input tokens excerpted from <link xlink:href="http://www.pagcgeo.org/docs/html/pagc-12.html#ss12.2">PAGC Input Tokens</link> are as follows:</para>
<para><emphasis role="bold">Form-Based Input Tokens</emphasis></para>
<variablelist>
<varlistentry>
<term><varname>AMPERS</varname></term>
<listitem>
<para>(13). The ampersand (&amp;) is frequently used to abbreviate the word "and".</para>
</listitem>
</varlistentry>
<varlistentry>
<term><varname>DASH</varname></term>
<listitem>
<para>(9). A punctuation character.</para>
</listitem>
</varlistentry>
<varlistentry>
<term><varname>DOUBLE</varname></term>
<listitem>
<para>(21). A sequence of two letters. Often used as identifiers.</para>
</listitem>
</varlistentry>
<varlistentry>
<term><varname>FRACT</varname></term>
<listitem>
<para>(25). Fractions are sometimes used in civic numbers or unit numbers.</para>
</listitem>
</varlistentry>
<varlistentry>
<term><varname>MIXED</varname></term>
<listitem>
<para>(23). An alphanumeric string that contains both letters and digits. Used for identifiers.</para>
</listitem>
</varlistentry>
<varlistentry>
<term><varname>NUMBER</varname></term>
<listitem>
<para>(0). A string of digits.</para>
</listitem>
</varlistentry>
<varlistentry>
<term><varname>ORD</varname></term>
<listitem>
<para>(15). Representations such as First or 1st. Often used in street names.</para>
</listitem>
</varlistentry>
<varlistentry>
<term><varname>SINGLE</varname></term>
<listitem>
<para>(18). A single letter.</para>
</listitem>
</varlistentry>
<varlistentry>
<term><varname>WORD</varname></term>
<listitem>
<para>(1). A word is a string of letters of arbitrary length. A single letter can be both a SINGLE and a WORD.</para>
</listitem>
</varlistentry>
</variablelist>
<para><emphasis role="bold">Function-based Input Tokens</emphasis></para>
<variablelist>
<varlistentry>
<term><varname>BOXH</varname></term>
<listitem>
<para>(14). Words used to denote post office boxes. For example <emphasis>Box</emphasis> or <emphasis>PO Box</emphasis>.</para>
</listitem>
</varlistentry>
<varlistentry>
<term><varname>BUILDH</varname></term>
<listitem>
<para>(19). Words used to denote buildings or building complexes, usually as a prefix. For example: <emphasis>Tower</emphasis> in <emphasis>Tower 7A</emphasis>.</para>
</listitem>
</varlistentry>
<varlistentry>
<term><varname>BUILDT</varname></term>
<listitem>
<para>(24). Words and abbreviations used to denote buildings or building complexes, usually as a suffix. For example: <emphasis>Shopping Centre</emphasis>.</para>
</listitem>
</varlistentry>
<varlistentry>
<term><varname>DIRECT</varname></term>
<listitem>
<para>(22). Words used to denote directions, for example <emphasis>North</emphasis>.</para>
</listitem>
</varlistentry>
<varlistentry>
<term><varname>MILE</varname></term>
<listitem>
<para>(20). Words used to denote milepost addresses.</para>
</listitem>
</varlistentry>
<varlistentry>
<term><varname>ROAD</varname></term>
<listitem>
<para>(6). Words and abbreviations used to denote highways and roads. For example: the <emphasis>Interstate</emphasis> in <emphasis>Interstate 5</emphasis>.</para>
</listitem>
</varlistentry>
<varlistentry>
<term><varname>RR</varname></term>
<listitem>
<para>(8). Words and abbreviations used to denote rural routes. For example: <emphasis>RR</emphasis>.</para>
</listitem>
</varlistentry>
<varlistentry>
<term><varname>TYPE</varname></term>
<listitem>
<para>(2). Words and abbreviations used to denote street types. For example: <emphasis>ST</emphasis> or <emphasis>AVE</emphasis>.</para>
</listitem>
</varlistentry>
<varlistentry>
<term><varname>UNITH</varname></term>
<listitem>
<para>(16). Words and abbreviations used to denote internal subaddresses. For example, <emphasis>APT</emphasis> or <emphasis>UNIT</emphasis>.</para>
</listitem>
</varlistentry>
</variablelist>
<para><emphasis role="bold">Postal Type Input Tokens</emphasis></para>
<variablelist>
<varlistentry>
<term><varname>QUINT</varname></term>
<listitem>
<para>(28). A 5-digit number. Identifies a ZIP Code.</para>
</listitem>
</varlistentry>
<varlistentry>
<term><varname>QUAD</varname></term>
<listitem>
<para>(29). A 4 digit number. Identifies ZIP4.</para>
</listitem>
</varlistentry>
<varlistentry>
<term><varname>PCH</varname></term>
<listitem>
<para>(27). A 3 character sequence of letter number letter. Identifies an FSA, the first 3 characters of a Canadian postal code.</para>
</listitem>
</varlistentry>
<varlistentry>
<term><varname>PCT</varname></term>
<listitem>
<para>(26). A 3 character sequence of number letter number. Identifies an LDU, the last 3 characters of a Canadian postal code.</para>
</listitem>
</varlistentry>
</variablelist>
<para><emphasis role="bold">Stopwords</emphasis></para>
<para>STOPWORDS combine with WORDS. In rules a string of multiple WORDs and STOPWORDs will be represented by a single WORD token.</para>
<variablelist>
<varlistentry>
<term><varname>STOPWORD</varname></term>
<listitem>
<para>(7). A word with low lexical significance, that can be omitted in parsing. For example: <emphasis>THE</emphasis>.</para>
</listitem>
</varlistentry>
</variablelist>
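<para>The token numbers above are the values stored in the <varname>token</varname> column of a lex table. As an illustrative sketch, assuming the sample <code>us_lex</code> table from <code>address_standardizer_data_us</code> is installed, the query below lists some of the words classified as <varname>DIRECT</varname> (22). The <code>LIMIT</code> is only there to keep the output short.</para>
<programlisting>-- Words in the sample US lexicon that are classified as DIRECT (22)
SELECT word, stdword, token
FROM us_lex
WHERE token = 22
ORDER BY word
LIMIT 10;</programlisting>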
</refsection>
<refsection><title>Output Tokens</title>
<para>The output tokens, in order, follow the first <code>-1</code> (terminator) and are themselves followed by a terminator <code>-1</code>. Numbers for corresponding output tokens are listed in <xref linkend="stdaddr"/>. Which output tokens are allowed depends on the kind of rule. Output tokens valid for each rule type are listed in <xref linkend="rule_types_rank"/>.</para>
</refsection>
<refsection xml:id="rule_types_rank"><title>Rule Types and Rank</title>
<para>The final part of the rule is the rule type which is denoted by one of the following, followed by a rule rank. The rules are ranked from 0 (lowest) to 17 (highest).</para>
<para><emphasis role="bold"><varname>MACRO_C</varname></emphasis></para>
<para>(token number = "<emphasis role="bold">0</emphasis>"). The class of rules for parsing MACRO clauses such as <emphasis>PLACE STATE ZIP</emphasis>.</para>
<para><emphasis role="bold"><varname>MACRO_C</varname> output tokens</emphasis> (excerpted from <link xlink:href="http://www.pagcgeo.org/docs/html/pagc-12.html#--r-typ--">http://www.pagcgeo.org/docs/html/pagc-12.html#--r-typ--</link>).</para>
<variablelist>
<varlistentry>
<term><varname>CITY</varname></term>
<listitem>
<para>(token number "10"). Example "Albany"</para>
</listitem>
</varlistentry>
<varlistentry>
<term><varname>STATE</varname></term>
<listitem>
<para>(token number "11"). Example "NY"</para>
</listitem>
</varlistentry>
<varlistentry>
<term><varname>NATION</varname></term>
<listitem>
<para>(token number "12"). This attribute is not used in most reference files. Example "USA"</para>
</listitem>
</varlistentry>
<varlistentry>
<term><varname>POSTAL</varname></term>
<listitem>
<para>(token number "13"). (SADS elements "ZIP CODE" , "PLUS 4" ). This attribute is used for both the US Zip and the Canadian Postal Codes.</para>
</listitem>
</varlistentry>
</variablelist>
<para><emphasis role="bold"><varname>MICRO_C</varname></emphasis></para>
<para>(token number = "<emphasis role="bold">1</emphasis>"). The class of rules for parsing full MICRO clauses (such as House, street, sufdir, predir, pretyp, suftype, qualif), i.e. ARC_C plus CIVIC_C. These rules are not used in the build phase.</para>
<para><emphasis role="bold"><varname>MICRO_C</varname> output tokens</emphasis> (excerpted from <link xlink:href="http://www.pagcgeo.org/docs/html/pagc-12.html#--r-typ--">http://www.pagcgeo.org/docs/html/pagc-12.html#--r-typ--</link>).</para>
<variablelist>
<varlistentry><term><varname>HOUSE</varname></term>
<listitem>
<para>is text (token number <code>1</code>): This is the street number on a street. Example <emphasis>75</emphasis> in <code>75 State Street</code>.</para>
</listitem>
</varlistentry>
<varlistentry><term><varname>predir</varname></term><listitem>
<para> is text (token number <code>2</code>): STREET NAME PRE-DIRECTIONAL such as North, South, East, West etc.</para>
</listitem></varlistentry>
<varlistentry><term><varname>qual</varname></term>
<listitem>
<para>is text (token number <code>3</code>): STREET NAME PRE-MODIFIER Example <emphasis>OLD</emphasis> in <code>3715 OLD HIGHWAY 99</code>.</para>
</listitem>
</varlistentry>
<varlistentry><term><varname>pretype</varname></term>
<listitem>
<para> is text (token number <code>4</code>): STREET PREFIX TYPE</para>
</listitem>
</varlistentry>
<varlistentry><term><varname>street</varname></term>
<listitem>
<para>is text (token number <code>5</code>): STREET NAME</para>
</listitem>
</varlistentry>
<varlistentry><term><varname>suftype</varname></term>
<listitem>
<para>is text (token number <code>6</code>): STREET POST TYPE e.g. St, Ave, Cir. A street type following the root street name. Example <emphasis>STREET</emphasis> in <code>75 State Street</code>.</para>
</listitem>
</varlistentry>
<varlistentry><term><varname>sufdir</varname></term>
<listitem>
<para>is text (token number <code>7</code>): STREET POST-DIRECTIONAL. A directional modifier that follows the street name. Example <emphasis>WEST</emphasis> in <code>3715 TENTH AVENUE WEST</code>.</para>
</listitem>
</varlistentry>
</variablelist>
<para><emphasis role="bold"><varname>ARC_C</varname></emphasis></para>
<para>(token number = "<emphasis role="bold">2</emphasis>"). The class of rules for parsing MICRO clauses, excluding the HOUSE attribute. As such, it uses the same set of output tokens as MICRO_C, minus the HOUSE token.</para>
<para><emphasis role="bold"><varname>CIVIC_C</varname></emphasis></para>
<para>(token number = "<emphasis role="bold">3</emphasis>"). The class of rules for parsing the HOUSE attribute.</para>
<para><emphasis role="bold"><varname>EXTRA_C</varname></emphasis></para>
<para>(token number = "<emphasis role="bold">4</emphasis>"). The class of rules for parsing EXTRA attributes - attributes excluded from geocoding. These rules are not used in the build phase.</para>
<para><emphasis role="bold"><varname>EXTRA_C</varname> output tokens</emphasis> (excerpted from <link xlink:href="http://www.pagcgeo.org/docs/html/pagc-12.html#--r-typ--">http://www.pagcgeo.org/docs/html/pagc-12.html#--r-typ--</link>).</para>
<variablelist>
<varlistentry><term><varname>BLDNG</varname></term>
<listitem>
<para>(token number <code>0</code>): Unparsed building identifiers and types.</para>
</listitem>
</varlistentry>
<varlistentry><term><varname>BOXH</varname></term>
<listitem>
<para>(token number <code>14</code>): The <emphasis role="bold">BOX</emphasis> in <code>BOX 3B</code></para>
</listitem>
</varlistentry>
<varlistentry><term><varname>BOXT</varname></term>
<listitem>
<para>(token number <code>15</code>): The <emphasis role="bold">3B</emphasis> in <code>BOX 3B</code></para>
</listitem>
</varlistentry>
<varlistentry><term><varname>RR</varname></term>
<listitem>
<para>(token number <code>8</code>): The <emphasis role="bold">RR</emphasis> in <code>RR 7</code></para>
</listitem>
</varlistentry>
<varlistentry><term><varname>UNITH</varname></term>
<listitem>
<para>(token number <code>16</code>): The <emphasis role="bold">APT</emphasis> in <code>APT 3B</code></para>
</listitem>
</varlistentry>
<varlistentry><term><varname>UNITT</varname></term>
<listitem>
<para>(token number <code>17</code>): The <emphasis role="bold">3B</emphasis> in <code>APT 3B</code></para>
</listitem>
</varlistentry>
<varlistentry><term><varname>UNKNWN</varname></term>
<listitem>
<para>(token number <code>9</code>): An otherwise unclassified output.</para>
</listitem>
</varlistentry>
</variablelist>
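<para>As a rough sketch of how the rule type and rank can be read back from a rules table, the query below counts the rules in the sample <code>us_rules</code> table by class and reports the range of ranks in each. It assumes <code>address_standardizer_data_us</code> is installed and that each rule ends with <code>-1 type rank</code> as described above; rows that do not match that pattern are skipped. <code>regexp_match</code> requires PostgreSQL 10 or later.</para>
<programlisting>-- Count rules per class and show the rank range used by each class
SELECT CASE m[1]::int
         WHEN 0 THEN 'MACRO_C'
         WHEN 1 THEN 'MICRO_C'
         WHEN 2 THEN 'ARC_C'
         WHEN 3 THEN 'CIVIC_C'
         WHEN 4 THEN 'EXTRA_C'
         ELSE 'other'
       END              AS rule_class,
       count(*)         AS n_rules,
       min(m[2]::int)   AS min_rank,
       max(m[2]::int)   AS max_rank
FROM (
  -- m[1] = rule type, m[2] = rank, taken from the end of the rule text
  SELECT regexp_match(rule, '-1\s+(\d+)\s+(\d+)\s*$') AS m
  FROM us_rules
) AS t
WHERE m IS NOT NULL
GROUP BY 1
ORDER BY 1;</programlisting>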
</refsection>
</refentry>
<refentry xml:id="lextab">
<refnamediv>
<refname>lex table</refname>
<refpurpose>A lex table is used to classify alphanumeric input and associate that input with (a) input tokens (see <xref linkend="rule_input_tokens"/>) and (b) standardized representations.</refpurpose>
</refnamediv>
<refsection>
<title>Description</title>
<para>A lex (short for lexicon) table is used to classify alphanumeric input and associate that input with (a) <xref linkend="rule_input_tokens"/> and (b) standardized representations. For example, in these tables you will find <code>ONE</code> mapped to stdword: <code>1</code>.</para>
<para>A lex table has at least the following columns. You may add more columns if you wish for your own purposes.</para>
<variablelist>
<varlistentry>
<term><varname>id</varname></term>
<listitem>
<para>Primary key of table</para>
</listitem>
</varlistentry>
<varlistentry><term><varname>seq</varname></term>
<listitem>
<para>integer: definition number?</para>
</listitem>
</varlistentry>
<varlistentry><term><varname>word</varname></term>
<listitem>
<para>text: the input word</para>
</listitem>
</varlistentry>
<varlistentry><term><varname>stdword</varname></term>
<listitem>
<para>text: the standardized replacement word</para>
</listitem>
</varlistentry>
<varlistentry><term><varname>token</varname></term>
<listitem>
<para>integer: the kind of word it is. Only if it is used in this context will it be replaced. Refer to <link xlink:href="http://www.pagcgeo.org/docs/html/pagc-12.html#--i-tok--">PAGC Tokens</link>.</para>
</listitem>
</varlistentry>
</variablelist>
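<para>A custom lex table with the minimum columns might look like the sketch below. The table name <code>my_lex</code> and the <varname>seq</varname> values are hypothetical; the two sample rows mirror the <code>ONE</code> to <code>1</code> mapping mentioned above, classified both as <varname>NUMBER</varname> (0) and as <varname>WORD</varname> (1).</para>
<programlisting>-- Hypothetical custom lexicon with the required columns
CREATE TABLE my_lex (
  id      serial PRIMARY KEY,
  seq     integer,
  word    text,
  stdword text,
  token   integer
);

-- 'ONE' can be read either as a NUMBER (0) or as a WORD (1)
INSERT INTO my_lex (seq, word, stdword, token)
VALUES (1, 'ONE', '1', 0),
       (2, 'ONE', '1', 1);</programlisting>
<para>The table name is then passed as the lex table argument of the address standardizer functions in place of <code>us_lex</code>.</para>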
</refsection>
</refentry>
<refentry xml:id="gaztab">
<refnamediv>
<refname>gaz table</refname>
<refpurpose>A gaz table is used to standardize place names and associate that input with (a) input tokens (see <xref linkend="rule_input_tokens"/>) and (b) standardized representations.</refpurpose>
</refnamediv>
<refsection>
<title>Description</title>
<para>A gaz (short for gazetteer) table is used to standardize place names and associate that input with (a) <xref linkend="rule_input_tokens"/> and (b) standardized representations. For example, if you are in the US, you may load these with state names and their associated abbreviations.</para>
<para>A gaz table has at least the following columns. You may add more columns if you wish for your own purposes.</para>
<variablelist>
<varlistentry>
<term><varname>id</varname></term>
<listitem>
<para>Primary key of table</para>
</listitem>
</varlistentry>
<varlistentry><term><varname>seq</varname></term>
<listitem>
<para>integer: definition number? - identifier used for that instance of the word</para>
</listitem>
</varlistentry>
<varlistentry><term><varname>word</varname></term>
<listitem>
<para>text: the input word</para>
</listitem>
</varlistentry>
<varlistentry><term><varname>stdword</varname></term>
<listitem>
<para>text: the standardized replacement word</para>
</listitem>
</varlistentry>
<varlistentry><term><varname>token</varname></term>
<listitem>
<para>integer: the kind of word it is. Only if it is used in this context will it be replaced. Refer to <link xlink:href="http://www.pagcgeo.org/docs/html/pagc-12.html#--i-tok--">PAGC Tokens</link>.</para>
</listitem>
</varlistentry>
</variablelist>
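<para>To see how a place name is standardized, you can look it up in a gaz table directly. This sketch assumes the sample <code>us_gaz</code> table from <code>address_standardizer_data_us</code> is installed and that words are stored in upper case; the search terms are arbitrary examples.</para>
<programlisting>-- Look up how a state name and its abbreviation are standardized
SELECT seq, word, stdword, token
FROM us_gaz
WHERE word IN ('MA', 'MASSACHUSETTS')
ORDER BY word, seq;</programlisting>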
</refsection>
</refentry>
</section>
<section xml:id="Address_Standardizer_Functions"><title>Address Standardizer Functions</title>
<refentry xml:id="debug_standardize_address">
<refnamediv>
<refname>debug_standardize_address</refname>
<refpurpose>Returns JSON-formatted text listing the parse tokens and standardizations</refpurpose>
</refnamediv>
<refsynopsisdiv>
<funcsynopsis>
<funcprototype>
<funcdef>text <function>debug_standardize_address</function></funcdef>
<paramdef><type>text </type> <parameter>lextab</parameter></paramdef>
<paramdef><type>text </type> <parameter>gaztab</parameter></paramdef>
<paramdef><type>text </type> <parameter>rultab</parameter></paramdef>
<paramdef><type>text </type> <parameter>micro</parameter></paramdef>
<paramdef choice="opt"><type>text </type> <parameter>macro=NULL</parameter></paramdef>
</funcprototype>
</funcsynopsis>
</refsynopsisdiv>
<refsection>
<title>Description</title>
<para>This is a function for debugging address standardizer rules and lex/gaz mappings. It returns JSON-formatted text that includes the matching rules, the mapping of tokens, and the best standardized address (<xref linkend="stdaddr"/>) form of an input address, given the names of a <xref linkend="lextab"/>, a <xref linkend="gaztab"/>, and a <xref linkend="rulestab"/>, and an address.</para>
<para>For single-line addresses, use just <varname>micro</varname>.</para>
<para>For two-line addresses, use a <varname>micro</varname> consisting of the standard first line of a postal address, e.g. <code>house_num street</code>, and a <varname>macro</varname> consisting of the standard second line of a postal address, e.g. <code>city, state postal_code country</code>.</para>
<para>Elements returned in the JSON document are:</para>
<variablelist>
<varlistentry>
<term><varname>input_tokens</varname></term>
<listitem>
<para>For each word in the input address, returns the position of the word,
token categorization of the word, and the standard word it is mapped to.
Note that for some input words, you might get back multiple records because some inputs can be categorized
as more than one thing. </para>
</listitem>
</varlistentry>
<varlistentry><term><varname>rules</varname></term>
<listitem>
<para>The set of rules matching the input and the corresponding score for each. The first rule (highest scoring) is
the one used for standardization.</para>
</listitem>
</varlistentry>
<varlistentry><term><varname>stdaddr</varname></term>
<listitem>
<para>The standardized address elements <xref linkend="stdaddr"/> that would be returned when running <xref linkend="standardize_address"/></para>
</listitem>
</varlistentry>
</variablelist>
<!-- use this format if new function -->
<para role="availability" conformance="3.4.0">Availability: 3.4.0</para>
<para>&address_standardizer_required;</para>
</refsection>
<refsection>
<title>Examples</title>
<para>Using address_standardizer_data_us extension</para>
<programlisting>CREATE EXTENSION address_standardizer_data_us; -- only needs to be done once</programlisting>
<para>Variant 1: Single line address and returning the input tokens</para>
<programlisting>SELECT it->>'pos' AS position, it->>'word' AS word, it->>'stdword' AS standardized_word,
it->>'token' AS token, it->>'token-code' AS token_code
FROM jsonb(
debug_standardize_address('us_lex',
'us_gaz', 'us_rules', 'One Devonshire Place, PH 301, Boston, MA 02109')
) AS s, jsonb_array_elements(s->'input_tokens') AS it;</programlisting>
<screen>position | word | standardized_word | token | token_code
----------+------------+-------------------+--------+------------
0 | ONE | 1 | NUMBER | 0
0 | ONE | 1 | WORD | 1
1 | DEVONSHIRE | DEVONSHIRE | WORD | 1
2 | PLACE | PLACE | TYPE | 2
3 | PH | PATH | TYPE | 2
3 | PH | PENTHOUSE | UNITT | 17
4 | 301 | 301 | NUMBER | 0
(7 rows)</screen>
<para>Variant 2: Multi-line address and returning first rule input mappings and score</para>
<programlisting>SELECT (s->'rules'->0->>'score')::numeric AS score, it->>'pos' AS position,
it->>'input-word' AS word, it->>'input-token' AS input_token, it->>'mapped-word' AS standardized_word,
it->>'output-token' AS output_token
FROM jsonb(
debug_standardize_address('us_lex',
'us_gaz', 'us_rules', 'One Devonshire Place, PH 301', 'Boston, MA 02109')
) AS s, jsonb_array_elements(s->'rules'->0->'rule_tokens') AS it;</programlisting>
<screen> score | position | word | input_token | standardized_word | output_token
----------+----------+------------+-------------+-------------------+--------------
0.876250 | 0 | ONE | NUMBER | 1 | HOUSE
0.876250 | 1 | DEVONSHIRE | WORD | DEVONSHIRE | STREET
0.876250 | 2 | PLACE | TYPE | PLACE | SUFTYP
0.876250 | 3 | PH | UNITT | PENTHOUSE | UNITT
0.876250 | 4 | 301 | NUMBER | 301 | UNITT
(5 rows)
</screen>
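<para>To compare the winning rule against the other candidates, the whole <varname>rules</varname> array can be expanded rather than only its first element. This is a sketch under the assumption that every element of <varname>rules</varname> carries the <varname>score</varname> and <varname>rule_tokens</varname> keys used in the example above, and that <varname>stdaddr</varname> is a JSON object; the aliases <varname>rule_rank</varname> and <varname>token_count</varname> are illustrative.</para>
<programlisting>-- List every matching rule in the order returned, with its score
SELECT r.ordinality AS rule_rank, (r.value->>'score')::numeric AS score,
       jsonb_array_length(r.value->'rule_tokens') AS token_count
FROM jsonb(
        debug_standardize_address('us_lex',
            'us_gaz', 'us_rules', 'One Devonshire Place, PH 301', 'Boston, MA 02109')
     ) AS s, jsonb_array_elements(s->'rules') WITH ORDINALITY AS r(value, ordinality)
ORDER BY r.ordinality;

-- Show the standardized elements as key/value pairs
-- (assumes the stdaddr key holds a JSON object)
SELECT e.key, e.value
FROM jsonb(
        debug_standardize_address('us_lex',
            'us_gaz', 'us_rules', 'One Devonshire Place, PH 301', 'Boston, MA 02109')
     ) AS s, jsonb_each_text(s->'stdaddr') AS e;</programlisting>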
</refsection>
<!-- Optionally add a "See Also" section -->
<refsection>
<title>See Also</title>
<para><xref linkend="stdaddr"/>, <xref linkend="rulestab"/>, <xref linkend="lextab"/>, <xref linkend="gaztab"/>, <xref linkend="Pagc_Normalize_Address"/></para>
</refsection>
</refentry>
<refentry xml:id="parse_address">
<refnamediv>
<refname>parse_address</refname>
<refpurpose>Takes a one-line address and breaks it into parts</refpurpose>
</refnamediv>
<refsynopsisdiv>
<funcsynopsis>
<funcprototype>
<funcdef>record <function>parse_address</function></funcdef>
<paramdef><type>text </type> <parameter>address</parameter></paramdef>
</funcprototype>
</funcsynopsis>
</refsynopsisdiv>
<refsection>
<title>Description</title>
<para>Takes an address as input and returns a record consisting of the fields <emphasis>num</emphasis>, <emphasis>street</emphasis>, <emphasis>street2</emphasis>,
<emphasis>address1</emphasis>, <emphasis>city</emphasis>, <emphasis>state</emphasis>, <emphasis>zip</emphasis>, <emphasis>zipplus</emphasis>, <emphasis>country</emphasis>.</para>
<!-- use this format if new function -->
<para role="availability" conformance="2.2.0">Availability: 2.2.0</para>
<para>&address_standardizer_required;</para>
</refsection>
<refsection>
<title>Examples</title>
<para>Single Address</para>
<programlisting>SELECT num, street, city, zip, zipplus
FROM parse_address('1 Devonshire Place, Boston, MA 02109-1234') AS a;</programlisting>
<screen>
num | street | city | zip | zipplus
-----+------------------+--------+-------+---------
1 | Devonshire Place | Boston | 02109 | 1234 </screen>
<para>Table of addresses</para>
<programlisting>-- basic table
CREATE TABLE places(addid serial PRIMARY KEY, address text);
INSERT INTO places(address)
VALUES ('529 Main Street, Boston MA, 02129'),
('77 Massachusetts Avenue, Cambridge, MA 02139'),
('25 Wizard of Oz, Walaford, KS 99912323'),
('26 Capen Street, Medford, MA'),
('124 Mount Auburn St, Cambridge, Massachusetts 02138'),
('950 Main Street, Worcester, MA 01610');
-- parse the addresses
-- if you want all fields you can use (a).*
SELECT addid, (a).num, (a).street, (a).city, (a).state, (a).zip, (a).zipplus
FROM (SELECT addid, parse_address(address) As a
FROM places) AS p;</programlisting>
<screen> addid | num | street | city | state | zip | zipplus
-------+-----+----------------------+-----------+-------+-------+---------
1 | 529 | Main Street | Boston | MA | 02129 |
2 | 77 | Massachusetts Avenue | Cambridge | MA | 02139 |
3 | 25 | Wizard of Oz | Walaford | KS | 99912 | 323
4 | 26 | Capen Street | Medford | MA | |
5 | 124 | Mount Auburn St | Cambridge | MA | 02138 |
6 | 950 | Main Street | Worcester | MA | 01610 |
(6 rows)</screen>
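<para>Parsed fields can also be used to filter rows. The following sketch assumes the <varname>places</varname> table created above and calls <function>parse_address</function> in the <code>FROM</code> clause with <code>LATERAL</code>, so that its output fields can be referenced directly; it looks for addresses where no ZIP code was extracted.</para>
<programlisting>-- find addresses for which no zip could be parsed
-- COALESCE treats a NULL zip and an empty zip the same way
SELECT p.addid, p.address
FROM places AS p
     CROSS JOIN LATERAL parse_address(p.address) AS a
WHERE COALESCE(a.zip, '') = '';</programlisting>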
</refsection>
<!-- Optionally add a "See Also" section -->
<refsection>
<title>See Also</title>
<para/>
</refsection>
</refentry>
<refentry xml:id="standardize_address">
<refnamediv>
<refname>standardize_address</refname>
<refpurpose>Returns an stdaddr form of an input address utilizing lex, gaz, and rule tables.</refpurpose>
</refnamediv>
<refsynopsisdiv>
<funcsynopsis>
<funcprototype>
<funcdef>stdaddr <function>standardize_address</function></funcdef>
<paramdef><type>text </type> <parameter>lextab</parameter></paramdef>
<paramdef><type>text </type> <parameter>gaztab</parameter></paramdef>
<paramdef><type>text </type> <parameter>rultab</parameter></paramdef>
<paramdef><type>text </type> <parameter>address</parameter></paramdef>
</funcprototype>
<funcprototype>
<funcdef>stdaddr <function>standardize_address</function></funcdef>
<paramdef><type>text </type> <parameter>lextab</parameter></paramdef>
<paramdef><type>text </type> <parameter>gaztab</parameter></paramdef>
<paramdef><type>text </type> <parameter>rultab</parameter></paramdef>
<paramdef><type>text </type> <parameter>micro</parameter></paramdef>
<paramdef><type>text </type> <parameter>macro</parameter></paramdef>
</funcprototype>
</funcsynopsis>
</refsynopsisdiv>
<refsection>
<title>Description</title>
<para>Returns an <xref linkend="stdaddr"/> form of an input address, using the names of a <xref linkend="lextab"/> table, a <xref linkend="gaztab"/> table, and a <xref linkend="rulestab"/> table together with the address text.</para>
<para>Variant 1: Takes an address as a single line.</para>
<para>Variant 2: Takes an address as two parts: a <varname>micro</varname> consisting of the standard first line of a postal address, e.g. <code>house_num street</code>, and a <varname>macro</varname> consisting of the standard second line of a postal address, e.g. <code>city, state postal_code country</code>.</para>
<!-- use this format if new function -->
<para role="availability" conformance="2.2.0">Availability: 2.2.0</para>
<para>&address_standardizer_required;</para>
</refsection>
<refsection>
<title>Examples</title>
<para>Using address_standardizer_data_us extension</para>
<programlisting>CREATE EXTENSION address_standardizer_data_us; -- only needs to be done once</programlisting>
<para>Variant 1: Single-line address. This doesn't work well with non-US addresses.</para>
<programlisting>SELECT house_num, name, suftype, city, country, state, unit FROM standardize_address('us_lex',
'us_gaz', 'us_rules', 'One Devonshire Place, PH 301, Boston, MA 02109');</programlisting>
<screen>house_num | name | suftype | city | country | state | unit
----------+------------+---------+--------+---------+---------------+-----------------
1 | DEVONSHIRE | PLACE | BOSTON | USA | MASSACHUSETTS | # PENTHOUSE 301</screen>
<para>Using tables packaged with tiger geocoder. This example only works if you installed <varname>postgis_tiger_geocoder</varname>.</para>
<programlisting>SELECT * FROM standardize_address('tiger.pagc_lex',
'tiger.pagc_gaz', 'tiger.pagc_rules', 'One Devonshire Place, PH 301, Boston, MA 02109-1234');</programlisting>
<para>To make the output easier to read, we dump it using the hstore extension. You need to install it first with <code>CREATE EXTENSION hstore;</code>.</para>
<programlisting>SELECT (each(hstore(p))).*
FROM standardize_address('tiger.pagc_lex', 'tiger.pagc_gaz',
'tiger.pagc_rules', 'One Devonshire Place, PH 301, Boston, MA 02109') As p;</programlisting>
<screen> key | value
------------+-----------------
box |
city | BOSTON
name | DEVONSHIRE
qual |
unit | # PENTHOUSE 301
extra |
state | MA
predir |
sufdir |
country | USA
pretype |
suftype | PL
building |
postcode | 02109
house_num | 1
ruralroute |
(16 rows)
</screen>
<para>Variant 2: As a two-part address</para>
<programlisting>SELECT (each(hstore(p))).*
FROM standardize_address('tiger.pagc_lex', 'tiger.pagc_gaz',
'tiger.pagc_rules', 'One Devonshire Place, PH 301', 'Boston, MA 02109, US') As p;</programlisting>
<screen> key | value
------------+-----------------
box |
city | BOSTON
name | DEVONSHIRE
qual |
unit | # PENTHOUSE 301
extra |
state | MA
predir |
sufdir |
country | USA
pretype |
suftype | PL
building |
postcode | 02109
house_num | 1
ruralroute |
(16 rows)</screen>
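<para>The examples above standardize one literal address at a time. To standardize a set of addresses, the function can be called once per row with <code>LATERAL</code>. This is a minimal sketch, assuming the <varname>address_standardizer_data_us</varname> tables are installed; the inline <varname>addresses</varname> row set and its column names are illustrative.</para>
<programlisting>-- standardize each row of an inline set of addresses
WITH addresses(addid, address) AS (
    VALUES (1, '529 Main Street, Boston MA, 02129'),
           (2, '77 Massachusetts Avenue, Cambridge, MA 02139')
)
SELECT a.addid, s.house_num, s.name, s.suftype, s.city, s.state, s.postcode
FROM addresses AS a
     CROSS JOIN LATERAL standardize_address('us_lex',
        'us_gaz', 'us_rules', a.address) AS s;</programlisting>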
</refsection>
<!-- Optionally add a "See Also" section -->
<refsection>
<title>See Also</title>
<para><xref linkend="stdaddr"/>, <xref linkend="rulestab"/>, <xref linkend="lextab"/>, <xref linkend="gaztab"/>, <xref linkend="Pagc_Normalize_Address"/></para>
</refsection>
</refentry>
</section>
</section>