1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907
|
; (ld "fast.lisp" :ld-pre-eval-print t)
; (certify-book "fast")
; A Proof of the Correctness of the Boyer-Moore Fast String Searching Algorithm
; by J Strother Moore and Matt Martinez
; May 2008
; ---------------------------------------------------------------------------
; Summmary
; In this book we define the Boyer-Moore fast string searching algorithm and
; prove it correct. The original Boyer-Moore algorithm is described in
; "A Fast String Searching Algorithm," R.S. Boyer and J S. Moore, Communications
; of the Association for Computing Machinery, 20(10), 1977, pp. 762-772.
; In that version, two arrays were used, called delta1 and delta2. The skip
; was the maximum of those two arrays. Here we deal with an improvement of the
; original algorithm in which only one array, called delta, is used and it
; insures a skip that is never shorter than the maximum of delta1 and delta2
; and is often greater than either. A key fact about the original algorithm is
; that preprocessing can be done in linear time in the length, k, of the
; pattern plus linear time in the alphabet size, a. We do not worry about
; preprocessing time, but the naive way of setting up our delta has complexity
; a*k^2.
; Roughly speaking the algorithm has three main parts: preprocessing, the loop
; which skips ahead by the contents of an array, and a wrapper which enters the
; loop with the right arguments, including the array set up by the
; preprocessing. The loop is not, in general, total -- that is, it does not
; terminate for arbitrary values of the array. Therefore, we cannot define the
; loop with the general array as an argument. Instead, we define the loop to
; skip ahead by the value computed by a function, delta, of the character just
; read, an index into the pattern, and the pattern. We prove this loop is
; total and admit it; then we define the wrapper that calls it appropriately.
; The loop is called ``fast-loop'' and the wrapper is called is called ``fast''.
; At no point in the execution of the algorithm is an array used. But we define
; the preprocessing anyway and we prove that indexing into the 2-dimensional array
; it computes returns the appropriate value of delta.
; To prove correctness we define the naive string searching algorithm, which we
; name ``correct'', and prove that ``fast'' is equivalent to ``correct''.
; These pieces -- the preprocessing to produce an array, a total recursive
; function that implements the algorithm in terms of the individually computed
; values of delta, and the proof of the correctness of the algorithm -- can be
; used to prove that code is correct. We do that in another book.
; ---------------------------------------------------------------------------
; Preamble
; This book relies on a collection of utility lemmas about firstn, nthcdr, and other
; basic list processing functions.
(in-package "ACL2")
(include-book "utilities")
; ---------------------------------------------------------------------------
; The Obviously Correct String Searching Algorithm
; We start by defining the obviously correct string searching algorithm. The
; fundamental notion is that of an ``exact match,'' the equality of
; corresponding characters from the two strings at a given alignment.
(defun xmatch (pat j txt i)
(declare (xargs :measure (nfix (- (length pat) j))))
(cond ((not (natp j)) nil)
((>= j (length pat)) t)
((>= i (length txt)) nil)
((equal (char pat j)
(char txt i))
(xmatch pat (+ 1 j)
txt (+ 1 i)))
(t nil)))
; The correct algorithm just looks for the first xmatch of pat in txt, by
; trying them all starting at 0.
(defun correct-loop (pat txt i)
(declare (xargs :measure (nfix (- (length txt) i))))
(cond ((not (natp i)) nil)
((>= i (length txt)) nil)
((xmatch pat 0 txt i) i)
(t (correct-loop pat txt (+ 1 i)))))
(defun correct (pat txt)
(correct-loop pat txt 0))
; One might think the empty pat can always be found at location 0 of txt.
; However, this algorithm always return either a legal index into txt (where
; the first match with pat occurs) or else it returns nil. Since 0 is not a
; legal index into the empty txt, the algorithm returns nil when both pat and
; txt are empty.
; ---------------------------------------------------------------------------
; The Boyer-Moore Algorithm -- Preprocessing
; We assume the reader is familiar with the Boyer-Moore algorithm. In this
; section we show the preprocessing.
; We follow the convention that if a variable's name ends in the character "~"
; the variable ranges over lists. We explain later. The reader might want to
; pronounce "pat~" as "pat prime". We cannot use pat' as a variable in ACL2.
; First we define the amount, delta, by which you can skip. It is defined in
; terms of the first "pseudo match" you find for dt~ in pat~, where dt~ is the
; list of characters discovered in the txt before the algorithm encountered the
; unequal characters that caused it to skip ahead. "Pmatch" stands for "pseudo
; match" and allows dt~ to "fall off" the left end of pat~ and still match, in
; the sense that (A B C D E) pseudo matches (C D E F G).
; Note that pseudo-xmatch is defined on lists, not strings. We discuss this
; issue in more detail below. But every string has a unique list counterpart
; and some functions are more naturally defined with lists. Mathematically,
; the notion that one string, pat of length n, matches at location i in another
; string txt is really just an equality: the substring of length n starting at
; location i in txt IS pat. We didn't implement xmatch that way because it
; would require constructing the various substrings as objects, which is
; inefficient. It is more efficient to just compare corresponding characters.
; But the code we're about to write is used only in the preprocessing. This is
; only meant to specify what the array contains, not necessarily how it is
; computed. So here it is more natural to use the "mathematical" notion of
; match. But if we're doing that -- if we're in the business of constructing
; substrings of strings -- it is easier to deal with lists.
(defun pmatch (dt~ pat~ j)
(declare (xargs :guard (and (true-listp dt~)
(true-listp pat~)
(integerp j))))
(if (< j 0)
(equal (firstn (len (nthcdr (- j) dt~)) pat~)
(nthcdr (- j) dt~))
(equal (firstn (len dt~) (nthcdr j pat~))
dt~)))
; ``X marks the spot.'' The function x computes the right most position (to
; the left of j) at which dt~ occurs in pat~. For example,
; 0 1 2 3 4 5 6 7 8
; (x '(a b c) '(a b c a b c x b c) 6) = 3
; If we have just discovered dt in txt upon reading txt[i], we want to slide
; pat so that the character at x in pat is aligned with the one at i in txt.
(defun x (dt~ pat~ j)
(declare (xargs :measure (+ 1 (nfix (+ (len dt~) j)))
:guard (and (true-listp dt~)
(true-listp pat~)
(integerp j))))
(cond ((or (not (integerp j))
(not (true-listp dt~)))
0)
((pmatch dt~ pat~ j) j)
(t (x dt~ pat~ (- j 1)))))
; This computes how much to increment i to align the discovered text with the
; pat at x and then shift i further to align with the end of pat in that new
; alignment.
(defun delta (v j pat)
(declare (xargs :guard (and (characterp v)
(natp j)
(stringp pat))))
(let* ((pat~ (coerce pat 'list))
(dt~ (cons v (nthcdr (+ j 1) pat~))))
(+ (- (len pat~) 1) (- (x dt~ pat~ (- j 1))))))
; Now we set up an "array", really a list of lists. There will be 256 rows,
; one for each of the ASCII codes. Each row will have n entries, where n is
; the length of pat. At row c and column j we will store (delta c' j pat),
; where c' is the character corresponding to ASCII code c.
; Note: The "j" in the code below runs between 1 and (length pat) and we store in
; column j-1.
(defun setup-delta1-row (c j pat row)
(if (zp j)
row
(setup-delta1-row c
(- j 1)
pat
(cons (delta (code-char c) (- j 1) pat) row))))
(defun setup-delta (c pat ans)
(if (zp c)
ans
(setup-delta (- c 1)
pat
(cons (setup-delta1-row (- c 1) (length pat) pat nil)
ans))))
(defun preprocess (pat)
(setup-delta 256 pat nil))
(defun index2 (array c j)
(nth j (nth c array)))
; For example, consider the pattern
; j = 01234567
; pat = EABCDABC
; Note that the substring ABC occurs in it twice, once starting at j=1 and
; again starting at j=5. In its first occurrence, it is preceded by E and in
; its second it is preceded by D.
; Here is the 2-dimensional array computed by preprocess. Most of the rows are
; identical because most of the ASCII characters do not occur in the pattern.
; j = 0 1 2 3 4 5 6 7 ; ASCII code (char)
; ; for row
; ((15 14 13 12 11 10 9 8) ; 0
; (15 14 13 12 11 10 9 8) ; 1
; ... ; ...
; (15 14 13 12 11 10 9 8) ; 63 (?)
; (15 14 13 12 11 10 9 8) ; 64 (@)
; (15 14 13 12 11 6 9 2) ; 65 (A)
; (15 14 13 12 11 10 5 1) ; 66 (B)
; (15 14 13 12 11 10 9 4) ; 67 (C)
; (15 14 13 12 11 10 9 3) ; 68 (D)
; (15 14 13 12 7 10 9 7) ; 69 (E)
; (15 14 13 12 11 10 9 8) ; 70 (F)
; ... ; ...
; (15 14 13 12 11 10 9 8) ; 255
; )
; Below we show several examples of the use of the array. Suppose pat and txt
; are aligned as shown and we are reading the character at txt position i.
; Suppose we read an F (ASCII code 70). This is suggested in the display
; below.
; j = 01234567
; pat = EABCDABC
; txt = xxxxxxxxxxxxFxxxxxxxxxxxxx
; i = |
; Then we index the array with (index2 array 70 7), since the corresponding
; character in pat is at j=7. The array's contents is 8, which means we can
; skip ahead by 8 (i.e., increment i by 8) to produce this alignment:
; pat = EABCDABC
; txt = xxxxxxxxxxxxFxxxxxxxxxxxx
; i = |
; Suppose instead we are in this alignment:
; j = 01234567
; pat = EABCDABC
; txt = xxxxxxxxxEABCxxxxxxxxxxxxx
; i = |
; That is, we have matched the ABC at the end of pat with the corresponding
; characters of txt and now read txt at i and get an E (ASCII 69). Then
; (index array 69 4) = 7, so we increment i by 7 and get:
; pat = EABCDABC
; txt = xxxxxxxxxEABCxxxxxxxxxxxxx
; i = |
; This shifts the pattern forward to align the discovered text, "EABC", with
; its right-most occurrence in pat.
; ---------------------------------------------------------------------------
; A Total Recursive Function that Performs the Boyer-Moore Search
; Fast will be defined in terms of fast-loop, which is like boyer-moore-loop
; but (a) checks for all the preconditions we assume and (b) uses delta
; directly instead of an array. To prove that fast-loop terminates, we need
; some theorems:
(defthm integerp-delta
(implies (integerp j)
(integerp (delta v j pat)))
:rule-classes :type-prescription)
(defthm x-mono
(implies (and (true-listp dt~)
(integerp j))
(<= (x dt~ pat~ j) j))
:rule-classes :linear)
; This is the key one. It shows that delta is big enough to make sure the
; pattern monotonically shifts to the right.
(defthm delta-very-positive
(implies (and (stringp pat)
(natp j)
(< j (length pat))
(not (equal v (char pat j))))
(<= (- (length pat) j)
(delta v j pat)))
:rule-classes :linear)
; (The truth is we do not need integerp-delta and delta-very-positive until we
; get to the penultimate lemma about fast-loop and correct-loop at the string
; level. If we did, we'd have to disable delta here since it is non-recursive.
; For now, we just leave delta enabled and prove these two results as needed
; from the defun of delta and x-mono. But when we disable delta to stay at the
; string level, we'll need those two.)
; To prove that a function terminates we show a measure that decreases as the
; function recurs. Our measure is a lexicographic combination of two natural
; numbers. The first measures how far the right-hand of the pattern is from
; the right-hand end of the text. The second is how far down the pattern we've
; matched. When we shift the pattern, the first number decreases. When we
; leave the pattern in place and decrement j as we match right-to-left, the
; second number decreases.
(defun measure (pat j txt i)
(declare (xargs :guard (and (stringp pat)
(integerp j)
(<= -1 j)
(stringp txt)
(integerp i)
(<= j i))))
(llist (- (+ (length txt) (length pat)) ; first component
(+ i (- (length pat) j) -1))
(+ j 1))) ; second component
; ---------------------------------------------------------------------------
; Aside: This is a hack to print some trace output when the algorithm runs.
; Note that it prints only for short pat and txt. That is because we want the
; printing to fit neatly on a line. When the strings get "too long" it runs
; anyway, but it doesn't print trace output.
(defun show-config (pat j txt i)
(declare (xargs :guard (and (stringp pat)
(integerp j)
(<= -1 j)
(stringp txt)
(integerp i)
(<= j i))))
(cond
((> (+ (length pat)
(length txt))
55)
nil)
(t
(cw "~%~
pat:~t4~s0~t5j=~x1~%~
txt: ~s2~t5i=~x3~%~
~t6|~t5m=~x7~%"
pat ; 0
j ; 1
txt ; 2
i ; 3
(+ 5 (- i j)) ; 4 tab position for pat
61 ; 5 tab position for j= and i=
(+ 5 i) ; 6 tab position for | marker
(measure pat j txt i) ; 7 measure that is supposed to decrease
))))
; ---------------------------------------------------------------------------
; Here is the runnable, terminating, version of the Boyer-Moore loop. We'll
; prove that in a moment. First we admit it and prove that it terminates.
; Then we'll prove it is equal to boyer-moore-loop under certain conditions.
(defun fast-loop (pat j txt i)
(declare (xargs :measure (measure pat j txt i)
:well-founded-relation l<
:guard (and (stringp pat)
(integerp j)
(<= -1 j)
(< j (length pat))
(stringp txt)
(integerp i)
(<= j i))))
(prog2$
(show-config pat j txt i)
(cond
((not (and (stringp pat)
(integerp j)
(stringp txt)
(integerp i)
(<= -1 j)
(< j (length pat))
(<= j i)))
nil)
((< j 0)
(+ 1 i))
((<= (length txt) i)
nil)
((equal (char pat j) (char txt i))
(fast-loop pat (- j 1) txt (- i 1)))
(t (fast-loop pat
(- (length pat) 1)
txt
(+ i (delta (char txt i)
j
pat)))))))
; Here is the Boyer-Moore algorithm, which doesn't actually precompute all
; possible deltas but just computes the ones it needs, every time it needs one.
(defun fast (pat txt)
(declare (xargs :guard (and (stringp pat)
(stringp txt))))
(if (equal pat "")
(if (equal txt "")
nil
0)
(fast-loop pat
(- (length pat) 1)
txt
(- (length pat) 1))))
; This function terminates. All the subroutines in it were proved to
; terminate.
; ---------------------------------------------------------------------------
; Proving That the Preprocessing is Correct
; Now we prove that preprocessing is correct and causes the index2 expression
; used in boyer-moore-loop to produce delta when array is (preprocess pat).
(defthm car-nthcdr-setup-delta1-row
(implies (natp maxpat)
(equal (car (nthcdr j (setup-delta1-row c maxpat pat ans)))
(if (natp j)
(if (< j maxpat)
(delta (code-char c) j pat)
(nth (- j maxpat) ans))
(if (zp maxpat)
(car ans)
(delta (code-char c) 0 pat))))))
(defthm car-nthcdr-setup-delta
(implies (natp maxchar)
(equal (car (nthcdr c (setup-delta maxchar pat ans)))
(if (natp c)
(if (< c maxchar)
(setup-delta1-row c (length pat) pat nil)
(nth (- c maxchar) ans))
(if (zp maxchar)
(car ans)
(setup-delta1-row 0 (length pat) pat nil))))))
(defthm preprocess-correct
(implies (and (stringp pat)
(characterp v)
(natp j)
(< j (length pat)))
(equal (index2 (preprocess pat) (char-code v) j)
(delta v j pat))))
(in-theory (disable preprocess index2))
; ---------------------------------------------------------------------------
; Our Basic Proof Strategy
; For any theorem about strings there is a corresponding theorem about
; true-lists of characters. In fact, because we almost never use the fact that
; there are only a finite number of characters, for any theorem about strings
; there is almost always a corresponding theorem about true-lists.
; Of particular interest is the fact that if you have a string pat then you can
; convert it to a unique true-list of characters with (coerce pat 'list). Many
; string processing functions are defined in terms of their list processing
; counterparts. For example, if pat is a string, then (length pat) is (len
; (coerce pat 'list)) and the jth character of pat, i.e., (char pat j), is
; defined to be (nth j (coerce pat 'list)). So, a theorem about a stringp pat
; that talks about length and char might be proved by proving the corresponding
; theorem about a true-listp pat~ that talks about len and nth. Instantiating
; the latter, replacing pat~ with (coerce pat 'list), proves the former.
; In this file, the variables pat and txt range over strings and the variables
; pat~ and txt~ range over lists.
; We call len the "list level counterpart" of length. Similarly, pat~ is the
; "list level counterpart" of pat.
; There are two "great ideas" in our proof strategy.
; The first is that xmatch has a list level counterpart that can be expressed
; entirely with equality and the familiar functions firstn and nthcdr. Firstn
; returns the first n elements of a list and nthcdr returns all but the first
; n elements of a list.
; Let pat and txt be strings and let pat~ and txt~ be their list level
; counterparts. Let n be the length of pat (i.e., the len of pat~). Then the
; list level counterpart of (xmatch pat j txt i) is
;
; (equal (firstn n
; (nthcdr i txt'))
; (nthcdr j pat'))
; That is the xmatch holds iff the first n elements of the sublist of txt'
; starting at i is equal to the sublist of pat' starting at j. This "trades"
; xmatch for an equality between two list expressions in terms of firstn and
; nthcdr.
; Here is the formal statement of this first great strategy.
(defthm xmatch-trade
(implies (and (stringp pat)
(stringp txt)
(natp j)
(natp i))
(equal (xmatch pat j txt i)
(equal (firstn (len (nthcdr j (coerce pat 'list)))
(nthcdr i (coerce txt 'list)))
(nthcdr j (coerce pat 'list))))))
; As long as this lemma is active, xmatch expressions will be traded for firstn
; and nthcdr equalities. We will disable this lemma when we no longer want
; that to happen.
; The second great idea is that we can prove things about firstn and nthcdr
; by what ACL2 calls "destructor elimination." See the lemma
; firstn-nthcdr-elim in utilities, which states that
; (implies (and (natp n)
; (< n (len x)))
; (equal (append (firstn n x) (nthcdr n x)) x)).
; The strategy is as follows.
; Suppose you have a list x and some index n into it. Suppose you
; you want to prove something about (firstn n x) and/or (nthcdr n x),
; such as:
; (implies (and (true-listp x)
; (natp n)
; (< n (len x)))
; (p n (firstn n x) (nthcdr n x) x))
; Then you can prove instead:
; (implies (and (true-listp a)
; (< 0 (len b))
; (true-listp b))
; (p (len a) a b (append a b)))
; because if you prove the latter, you can instantiate it, replacing a by
; (firstn n x) and b by (nthcdr n x) and use facts in the utilities book to
; recover the former. The key fact is, of course, firstn-nthcdr-elim.
; This strategy "trades" firstn and nthcdr for append! The utilities book not
; only implements this strategy but contains a lot of theorems to simplify
; firstn, nthcdr, and append expressions. In the interests of canonicalizing
; everything, utilities also rewrites (nth i x) to (car (nthcdr i x)).
; ---------------------------------------------------------------------------
; The Proof Plan
; We want to prove that fast is correct, i.e.,
; (defthm fast-is-correct
; (implies (and (stringp pat)
; (stringp txt))
; (equal (fast pat txt)
; (correct pat txt))))
; from which it will follow that boyer-moore is correct. Below we sketch the
; proof.
; The plan has some special markers in it (*** 1 ***), (*** 2 ***), etc.
; These markers help you locate the corresponding steps in the ACL2 proof in
; the next section.
; Proof Plan
; To prove fast-is-correct (*** 1 ***), we need to prove that the two loops are
; equivalent, i.e., that
; (equal (fast-loop pat j txt i)
; (correct-loop pat txt (- i j)))
; under suitable hypotheses. The main hypothesis is that pat starting at j+1
; matches txt starting at i+1. The actual theorem we prove is called
; fast-loop-is-correct-loop below (*** 2 ***).
; The difference expression above reflects the fact that the j and i in the
; fast-loop term point to the right end of the current alignment while the
; index in correct-loop points to the left end.
; To prove this we'll do an induction to unwind fast-loop. There will be two
; inductive cases, one where we're backing up because the corresponding
; characters were identical and one where we're skipping forward by delta. We
; discuss the skipping forward case.
; The inductive step in that case will look something like this:
; (implies (and ...
; (not (equal (char pat j) (char txt i))) ; [hyp 1]
; ...
; (equal (fast-loop pat j' txt i') ; [hyp 2]
; (correct-loop pat txt (- i' j')))
; ...)
; (equal (fast-loop pat j txt i) ; [concl]
; (correct-loop pat txt (- i j))))
; where [hyp 1] specifies the case we're in (we've found mismatched chars) and
; [hyp 2] is the inductive hypothesis, where we've assumed the formula for j
; replaced by j' and i replaced by i'.
; The values of j' and i' are just those that are used in the recursive
; call of fast-loop in this case:
; j' = (- (length pat) 1)
; i' = (+ i (delta (char txt i) j pat))
; The conclusion, [concl], equates fast-loop to correct-loop. But the
; fast-loop call will expand under the [hyp 1] case analysis and become
; (fast-loop pat j' txt i'). Note that this is the term [hyp2] mentions.
; Then we will use the induction hypothesis, [hyp2], and produce a new
; conclusion of the form:
; (equal (correct-loop pat txt (- i' j'))
; (correct-loop pat txt (- i j)))
; So the proof about fast-loop is EASY if we can prove this lemma about
; correct-loop! The actual lemma is called crux below (*** 3 ***). If we
; substitute in the values of j' and i' and do a little arithmetic we see the
; main idea:
; [crux]:
; (equal (correct-loop pat txt
; (+ 1 i
; (- (length pat))
; (delta (char txt i) j pat)))
; (correct-loop pat txt
; (- i j)))
; Of course, crux has all sorts of hypotheses, the main two being that (a) pat
; starting at j+1 matches txt starting at i+1 and (b) the characters at pat[j]
; and txt[i] are different.
; Think of the 3rd argument of correct-loop being the index of a proposed
; alignment of pat and txt. Correct-loop just tests whether there is a match
; there and if not, moves the proposed alignment down by 1. Crux tells us that
; we can shift down by a larger amount than 1 without missing a match.
; That's the crux of the Boyer-Moore algorithm.
; The main lesson is that we can do the proof of the equivalence of fast-loop
; and correct-loop with one standard induction on fast-loop, if we can prove
; that correct-loop allows the same big skips that fast-loop does. This will
; take some clever reasoning about matches, mismatched characters, and delta.
; We'd like to do that reasoning at the list level, not the string level.
; So to prove crux (which is about strings) we will jump to its list level
; counterpart, which is called crux~ (*** 4 ***). But to even state the lemma
; at the list level we need a recursive function on lists that is the
; list-level correspondent of the string-level correct-loop.
; We call that function correct-loop~ (*** 5 ***) and we prove that it is the
; list-level counterpart of correct-loop in (*** 6 ***). If you look at [crux]
; and think about how to jump into the list world, we'll replace correct-loop
; by correct-loop~, pat by pat~ txt by txt~ and length by len. But we also
; have to deal with the delta and char expressions, which operate on strings.
; Fortunately, char just becomes nth after we coerce txt to a list and delta
; can be expanded to be expressed in terms of x, which deals with lists
; already.
; So with those transformations we get:
; [crux~]:
; (equal
; (correct-loop~ pat~ txt~
; (+ i
; (- (x
; (cons (car (nthcdr i txt~)) (nthcdr j (cdr pat~)))
; pat~
; (+ -1 j)))))
; (correct-loop~ pat~ txt~
; (+ i (- j))))
; This may look ugly but it's exactly the list-level counterpart of the crux of
; the algorithm.
; So how do you prove that correct-loop~ allows jumps to alignment x?
; The answer is really simple:
; First, correct-loop~ allows jumps by ANY amount -- provided no matches are
; skipped! To state this we need to invent a list-level predicate that says
; there are no matches between here and there. That predicate is called clear
; (*** 7 ***) and the basic idea is formalized as clear-implies-skip (*** 8
; ***).
; Second, we have to prove that there are no matches in the region jumped over
; when we shift the pattern to the x alignment. That is proved in clear-x
; (*** 9 ***).
; We cannot find a good way ``code'' these lemmas, clear-implies-skip and
; clear-x, that will allow ACL2 to use them automatically as rewrite rules and
; prove crux~. So we prove crux~ by an explicit hint that instantiates the two
; lemmas.
; "Q.E.D." -- But this is just the plan!
; ---------------------------------------------------------------------------
; The ACL2 Proof
; Below we carry out the proof plan. The numbers appear in different order
; because ACL2 proofs are presented bottom up and our proof was more or less
; top down. We don't provide many comments because the numbering and plan
; pretty much say all there is to say.
(defun correct-loop~ (pat~ txt~ i) ; (*** 5 ***)
(declare (xargs :measure (nfix (- (len txt~) i))))
(cond ((not (natp i)) nil)
((>= i (len txt~)) nil)
((equal (firstn (len pat~) (nthcdr i txt~))
pat~)
i)
(t (correct-loop~ pat~ txt~ (+ 1 i)))))
(defthm correct-loop-trade ; (*** 6 ***)
(implies (and (stringp pat)
(stringp txt))
(equal (correct-loop pat txt i)
(correct-loop~ (coerce pat 'list)
(coerce txt 'list)
i))))
(defun clear (pat~ txt~ k n) ; (*** 7 ***)
(declare (xargs :measure (nfix n)))
(cond ((zp n) t)
((equal (firstn (len pat~) (nthcdr k txt~))
pat~)
nil)
(t (clear pat~ txt~ (+ k 1) (- n 1)))))
; Next is one of the two lemmas in our decomposition of crux~. It is not
; stated as a rewrite rule because in actual fact when the lemma is used k,
; below, becomes (- i j) and the (+ k n) becomes a difference expression.
(defthm clear-implies-skip ; (*** 8 ***)
(implies (and (natp k)
(< k (len txt~))
(natp n)
(clear pat~ txt~ k n))
(equal (correct-loop~ pat~ txt~ (+ k n))
(correct-loop~ pat~ txt~ k)))
:rule-classes nil)
; The next lemma, clear-x, is the other half of our decomposition. The trick
; to stating it is how one writes dt, the discovered text. Technically, it is
; txt[i] consed onto the tail end of pat starting at j+1. But this (a) hides
; it relation to txt -- dt is in fact just a certain substring of txt starting
; at i -- and (b) is a function of j. But to prove the theorem below we must
; induct on j. We cannot tolerate dt being a function of the induction
; variable. So we have to generalize the theorem and not deal with the actual
; dt but with some arbitrary substring of txt of d characters. But this
; introduction of d and the separation of dt from pat (now it is just a piece
; of txt) makes this lemma unuseful as a rewrite rule in crux~.
(defthm clear-x ; (*** 9 ***)
(implies (and (true-listp pat~)
(true-listp txt~)
(consp pat~)
(natp i)
(natp d)
(<= (+ i d) (len txt~))
(integerp j)
(natp (- i j))
(natp (+ j d))
(<= (+ j d) (len pat~)))
(clear pat~
txt~
(- i j)
(- j (x (firstn d (nthcdr i txt~)) pat~ j))))
:rule-classes nil)
(defthm crux~ ; (*** 4 ***)
(implies (and (true-listp pat~)
(true-listp txt~)
(integerp j)
(<= -1 j)
(integerp i)
(<= -1 i)
(<= 0 j)
(< i (len txt~))
(not (equal (nth j pat~) (nth i txt~)))
(consp pat~)
(<= 0 (len pat~))
(< j (len pat~))
(<= j i)
(equal (firstn (len (nthcdr (+ 1 j) pat~))
(nthcdr (+ 1 i) txt~))
(nthcdr (+ 1 j) pat~)))
(equal
(correct-loop~
pat~ txt~
(+ i
(- (x (cons (car (nthcdr i txt~)) (cdr (nthcdr j pat~)))
pat~ (+ -1 j)))))
(correct-loop~ pat~ txt~ (+ i (- j)))))
; Note: Recall the plan: You can skip over any distance if you know there are
; no matches in the region and we know there are no matches in the region
; scanned by x. We could not find a way to express the two lemmas
; (numbers 8 and 9 above) to make ACL2 just use them automatically. So we
; literally provide ACL2 with the required instantiations of them below.
:hints
(("Goal"
:use
((:instance
clear-x
(d (- (len pat~) j)))
(:instance
clear-implies-skip
(pat~ pat~)
(txt~ txt~)
(k (+ i (- j)))
(n
(+ j
(- (x (cons (nth i txt~) (nthcdr j (cdr pat~)))
pat~ (+ -1 j))))))))))
(defthm crux ; (*** 3 ***)
(implies (and (stringp pat)
(stringp txt)
(integerp j)
(<= -1 j)
(integerp i)
(<= -1 i)
(<= 0 j)
(< i (length txt))
(not (equal (char pat j)
(char txt i)))
(not (equal pat ""))
(< j (length pat))
(<= j i)
(xmatch pat (+ 1 j) txt (+ 1 i)))
(equal (correct-loop pat txt
(+ 1 i (- (length pat))
(delta (char txt i) j pat)))
(correct-loop pat txt (+ i (- j))))))
; Note: The plan calls for us to ascend to the string level now and prove that
; fast-loop is correct-loop by a routine fast-loop induction. The plan left
; out two things that come up in trying to carry out that proof.
; This first lemma, below, says that (xmatch pat j txt i) always succeeds if j
; is (length pat). Recall that the equivalence of fast-loop and correct-loop
; has the hypothesis that pat starting at j+1 matches txt starting at i+1. But
; when we slide the pattern down, j becomes (- (length pat) 1). That is, the
; "pre-matched" part of the pattern becomes empty. This theorem lets us detach
; the induction hypothesis in that case.
(defthm empty-xmatch
(implies (and (stringp pat)
(stringp txt)
(natp i))
(xmatch pat (length pat) txt i)))
; The next lemma says that correct-loop fails if the pat overhangs the right
; end of the txt. The fast loop stops under this condition, but the correct
; loop keeps going until the left end of the pat is past the end of txt. This
; lemma tells us correct-loop could stop early.
(defthm early-termination
(implies (and (natp k)
(stringp pat)
(stringp txt)
(<= (length txt) (+ k (- (length pat) 1))))
(not (correct-loop pat txt k))))
; Now we want to do everything else at the string level. So we disable the
; rules that are driving us down to the list level.
(in-theory (disable xmatch-trade
correct-loop-trade
delta
length
char))
; So now we're at the string level again and we're almost done! Here we do a routine
; fast-loop induction and appeal to crux:
(defthm fast-loop-is-correct-loop ; (*** 2 ***)
(implies (and (stringp pat)
(integerp j)
(stringp txt)
(integerp i)
(<= -1 j)
(< j (length pat))
(<= j i)
(not (equal pat ""))
(xmatch pat (+ j 1) txt (+ i 1)))
(equal (fast-loop pat j txt i)
(correct-loop pat txt (- i j)))))
; And now we're done.
(defthm fast-is-correct ; (*** 1 ***)
(implies (and (stringp pat)
(stringp txt))
(equal (fast pat txt)
(correct pat txt))))
; Q.E.D.
|