#!/usr/bin/ruby
# DocDiff - Document DIFF for Editors - (docdiff.rb)
=begin comment =================================================
Version: 0.1.5
Created: Sat, 09 Dec 2000 21:52:49 +0900
Updated: Wed, 17 Jan 2001 23:51:32 +0900
Copyright: Copyright 2000-2001 Hisashi Morita
License: Same as Ruby's

Description:
  Compares two text files by character or by word/morpheme,
  and outputs the result in pseudo HTML format.

Usage:
  docdiff.rb [--bychar|--bymorpheme] [--html|--esc] file1 file2
  ex: docdiff.rb --html oldtext newtext > difference.html

Requirements:
  Ruby (Tested with Ruby 1.6.2/i386-linux)
  GNU diff (Tested with GNU diffutils 2.7/i386-linux)

Recommended:
  Web Browser (Useful when viewing output.
  Tested with Netscape Navigator 4.7x/Win32)

Optional:
  ChaSen (Required for morpheme analysis.
  Tested with ChaSen 2.0/i386-linux.
  I have not tested using ChaSen on Windows.)

Why not plain diff?:
  DocDiff is a character/word-oriented tool for comparing text
  documents, while diff is a line-oriented tool for program source
  code. Unlike program source code, text documents have no carriage
  return in the middle of a paragraph, so diff does not work well
  with them. (A whole paragraph is reported as inserted/removed.)
  Microsoft Word (>=97 afaik) has a function to compare two text
  files and detect changes within a line, but it does not fully meet
  my need. (It does not try to pick up small changes.)
  Most of all, I prefer going with Free Software.

Known Problem:
  * Some of the "identical" portions of the two documents may be
    missing from the output. This happens since diff is designed to
    pick up "different" parts only. If you encounter this kind of
    problem, try increasing the number of lines given to the -u and
    --horizon-lines options, though I guess 100000 is large enough.
  * You need to prepare a ~/.chasenrc.docdiff file in order to use
    ChaSen. Sorry for the inconvenience, but ChaSen command line
    options are a bit too simple.
  * It cannot show changes in spaces and carriage returns precisely.
    If you need to inspect trivial changes, @raw_diff might help.
    If you happen to deal with program code somehow, remove <br>
    tags at end of each line, and wrap the whole output with <pre></pre>
    tags, so that spaces are not ignored.
  * Accepts only ASCII and EUC-JP code as input for now. Sorry.

Preparation for using ChaSen:
  Copy /usr/lib/chasen/chasenrc to ~/.chasenrc.docdiff
  Check the location of your ChaSen dic files, and edit
  ~/.chasenrc.docdiff to enable these options.
  ----
  (文法ファイル /usr/lib/chasen/dic) ; this depends on OS
  (空白品詞 (記号空白)) ; don't let ChaSen ignore ASCII spaces and tabs
  (EOS文字列 "\n") ; remove "EOS" at eos. LF only is preferred
  ----
  (Things would be easier if ChaSen accepted these as command line
  options.)
  Windows users might have to set the HOME environment variable.
  In Windows 98, it may be like this (not tested at all!):
    set HOME=c:\home

Links:
  Ruby: http://www.ruby-lang.org/
  ChaSen: http://chasen.aist-nara.ac.jp/

History:
  0.0.1 (Sat, 09 Dec 2000 +0900):
    1 hour hack.
  0.0.2 (Sun, 10 Dec 2000 +0900):
    Rewrote to use class.
  0.1.0a1 (Sat, 16 Dec 2000 +0900):
    Added ChaSen support.
    Converted scripts from SJIS/CRLF to EUC-JP/LF.
  0.1.0 (Tue, 19 Dec 2000 +0900):
    ChaSen works fine now.
    GetOptLong was introduced.
  0.1.1 (Mon, 25 Dec 2000 +0900):
    Bug fixing and some cleanup.
    Now it quotes <>& when output in HTML.
    Added (unnecessary) support for escape sequence.
  0.1.2 (Thu, 28 Dec 2000 +0900):
    Bug fix, mostly.
  0.1.3 (Tue, 09 Jan 2001 +0900):
    Tested with Ruby 1.6.2.
    Fixed "meth(a,b,)" bug with kind help from Masatoshi Seki.
    Thank you.
    Now I use only Linux for development, but it should work fine
    on Windows too, except for ChaSen stuff.
  0.1.4 (Tue, 16 Jan 2001 +0900):
    Now a run of changed characters such as "ab" is wrapped in one
    pair of tags, instead of an ugly pair of tags per character.
    Thanks again to Masatoshi Seki for the suggestion.
    Fixed hidden bug ('puts' was used to output result).
    Some code clean-up, though still hairy enough.
  0.1.5 (Wed, 17 Jan 2001 +0900):
    Erased useless old code which was already commented out.
    Added documentation. (Updated README, more comments)

Things to do:
  * Release in public.
  * Separate the following:
    command-line handling / class definitions / configuration
  * Automatic Japanese code conversion with Kconv. (EUC-JP/SJIS)

Things to do (low priority):
  * Optimization. (Slow as 1 minute for 100KB*2 textfiles)
  * Valid HTML4 and hopefully XHTML support.
  * Better algo.
  * Cleaner code.
  * Extract info on unchanged/changed part by Range objects or
    something, so that we can make some patching utility later.
    Maybe I need to prepare --ascii and --japanese switch for
    specifying how to count. (byte or character)
    I've been looking forward to m17n Ruby.
  * ChaSen/Ruby library in C (any volunteer?)
  * Exception handling (this may never be done)
  * Implement diffing function internally (obviously impossible)
=end comment ===================================================

# ==============================================================
# Configuration section. Edit.
# Nothing yet. Maybe I should do "require 'docdiff.config'"?
# ==============================================================

# Automatic platform detection and configuration (duh!)
# ("darwin" also contains "win", so it is excluded explicitly.)
if /win/ =~ RUBY_PLATFORM and /darwin/ !~ RUBY_PLATFORM # Win32
  $DIFF = "diff.exe"
  $CHASEN = "chasen.exe"
  $CHASENRC = ENV["HOME"] + "/.chasenrc.docdiff" # supports Cygwin only
  $KCODE = "SJIS"
else # UNIX and other platform
  $DIFF = "diff"
  $CHASEN = "chasen"
  $CHASENRC = ENV["HOME"] + "/.chasenrc.docdiff"
  $KCODE = "EUC"
end

# ==============================================================
require 'jcode' # Support for multibyte characters (>=Japanese)
require 'tempfile'
require 'getoptlong' # Library for handling command line options

# ==============================================================
class String # Adding some useful function to built-in String class...
  def quote_html # Quote HTML special characters. ("<" to "&lt;", etc.)
    dic = {'<'=>'&lt;', '>'=>'&gt;', '&'=>'&amp;'}
    pat = "(" + dic.keys.sort.join("|") + ")" #=> (key1|key2| ... |keyX)
    pat = Regexp::compile(pat) #=> /pat/
    # I know it's stupid to compile RE each time, but anyways.
    return self.gsub(pat){dic[$1]}
  end
end

# ==============================================================
class Docdiff
  def initialize(doc1 = [], doc2 = [])
    @compare_method = "by_char"
    @doc1 = doc1; @doc2 = doc2
    @raw_diff = []; @stripped_diff = []
    @marks = []
    @elements = []
  end
  attr_accessor :compare_method # by_char or by_morpheme
  attr_reader :doc1 # text document 1 ("before editing")
  attr_reader :doc2 # text document 2 ("after editing")
  attr_reader :raw_diff # raw output from diff command
  attr_reader :stripped_diff # diff output (cleaned up a bit)
  attr_reader :marks # leading marks (+,-," ") in "diff -u" output
  attr_reader :elements # strings that follow +- in "diff -u" output

  def split_by_char(array_of_lines)
    # split text by character
    # ["l1\n", "l2\n"] -> ["l", "1", "\n", "l", "2", "\n"]
    # collect (not each) is needed here: each returns the original
    # array and throws the normalized lines away.
    arraytmp = array_of_lines.collect {|line| "#{line.chomp}\n"}
    arraytmp.join.split('')
    # still something is wrong, but working fine...
  end

  def split_by_morpheme(array_of_lines) # using tempfile, not writing to stdin
    # split text by morpheme, using ChaSen
    # the output we expect is like:
    # "1 morpheme for 1 line, LF at end of each line, delimiter is empty line"
    # for detail, try 'cat textfile | chasen -F "%m\n"'
    returnvaluearray = []
    tf = Tempfile.new("__FILE__")
    tf.print(array_of_lines)
    tf.close
    chasenopt = "-F \"%m\n\" -r #{$CHASENRC}"
    chasen = IO.popen("#{$CHASEN} #{chasenopt} #{tf.path}", "rb+")
    returnvaluearray << chasen.readlines
    chasen.close
    tf.close(true)
    returnvaluearray.flatten.collect {|m|"#{m.chomp}"}
    # ["hello,\n", " \n", "world.\n"] -> ["hello,", " ", "world."]
  end

  def compare(doc1 = @doc1, doc2 = @doc2)
    # Write 1 char for 1 line on tempfiles, so that diff can compare.
    # If Tempfile had a 'puts' method, it would be simpler.
    tf1 = Tempfile.new("__FILE__")
    tf2 = Tempfile.new("__FILE__")
    case @compare_method
    when "by_char"
      tf1.print(split_by_char(doc1).collect {|c|"#{c.chomp}\n"})
      tf2.print(split_by_char(doc2).collect {|c|"#{c.chomp}\n"})
    when "by_morpheme"
      tf1.print(split_by_morpheme(doc1).collect {|m|"#{m}\n"})
      tf2.print(split_by_morpheme(doc2).collect {|m|"#{m}\n"})
    else
      tf1.print(split_by_char(doc1).collect {|c|"#{c.chomp}\n"})
      tf2.print(split_by_char(doc2).collect {|c|"#{c.chomp}\n"})
    end
    tf1.close
    tf2.close
    # Now, let's compare the two, using GNU diff
    # -d: check carefully
    # -u: unified format; add specified num of lines before and after
    # --horizon-lines: don't cut specified num of lines at head & tail
    diffopt = "-d -u -100000 --horizon-lines=100000"
    diff = IO.popen("#{$DIFF} #{diffopt} #{tf1.path} #{tf2.path}", 'rb+')
    @raw_diff = diff.readlines
    diff.close
    # Erase temporary files
    tf1.close(true)
    tf2.close(true)
    # Remove first 3 lines of garbage from diff -u output
    @stripped_diff = @raw_diff[3..@raw_diff.size-1]
    # separate marker and compared text itself.
    # BTW, pure blank line means CR, so leave it untouched.
    # otherwise, chomp the CRLF at the end.
    @raw_diff[3 .. (@raw_diff.size - 1)].each do |line| # skip 3 lines
      mark = line[0..0] # status marker, one of: +, -, " "
      element = line[1..(line.size - 1)]
      if element != "\n" # if it's not a CR, chomp it.
        element = element.chomp
      end
      @marks << mark; @elements << element
    end # end each
  end # end compare

  def html # Returns HTMLized result
    # HTML Tag bindings. Define as you like.
    # (Assumption: the tag strings were lost in this copy of the script;
    # <del>, <ins>, <u> and <em> below are plausible stand-ins.)
    remove_start = "<del>"
    remove_end = "</del>"
    insert_start = "<ins>"
    insert_end = "</ins>"
    #insert_start = "<u>" # Not that good. Looks too much like strike-through.
    #insert_end = "</u>"
    #insert_start = "<em>" # Yet another emphasizing option.
    #insert_end = "</em>"
    carriage_return = "<br>"
    result = []
    # check how marks change and mark-up the text with tags.
    for i in (0 .. (@elements.size - 1)) # Same as (0..-1)
      # Quote HTML special characters before adding tags.
      element = @elements[i].quote_html
      case "#{@marks[i-1]}#{@marks[i]}"
      when " +"; result << (insert_start + element)
      when " -"; result << (remove_start + element)
      when "+ "; result << (insert_end + element)
      when "+-"; result << (insert_end + remove_start + element)
      when "- "; result << (remove_end + element)
      when "-+"; result << (remove_end + insert_start + element)
      else # We can ignore "  ", "++", "--"
        result << element
      end # end case
      if (@elements[i+1] == "\n") # insert "CR" when blank line is coming
        result << carriage_return
      elsif (@elements[i+1] == nil) # hey, it seems like EOF.
        result << carriage_return << "\n" # add real CR (or LF)
      end # end if
    end # end for
    return result
  end # def html

  def esc # Returns result which is styled with escape-sequences.
    # Maybe nobody wants this feature...
    insert_start = ulon = "\033[4m" # Underline on
    insert_end = uloff = "\033[24m" # Underline off
    remove_start = xoron = "\033[7m" # XOR on
    remove_end = xoroff = "\033[27m" # XOR off
    #insert_start = boldon = "\033[1m" # Bold on (does not work in my env.)
    #insert_end = boldoff = "\033[21m" # Bold off (does not work in my env)
    carriage_return = "" # No need for consideration in esc-sequence.
    result = []
    # check how marks change and mark-up the text with tags.
    for i in (0 .. (@elements.size - 1))
      # Duplicate before adding tags. Unnecessary, maybe.
      element = @elements[i].dup
      case "#{@marks[i-1]}#{@marks[i]}"
      when " +"; result << (insert_start + element)
      when " -"; result << (remove_start + element)
      when "+ "; result << (insert_end + element)
      when "+-"; result << (insert_end + remove_start + element)
      when "- "; result << (remove_end + element)
      when "-+"; result << (remove_end + insert_start + element)
      else # We can ignore "  ", "++", "--"
        result << element
      end # end case
      if (@elements[i+1] == "\n") # insert "CR" when blank line is coming
        result << carriage_return
      elsif (@elements[i+1] == nil) # hey, it seems like EOF.
        result << carriage_return << "\n" # add real CR (or LF)
      end # end if
    end # end for
    return result
  end # def esc
end # end class docdiff

# ==============================================================
# Really ugly test code, but who cares.
if __FILE__ == $0 # If this script is run as stand-alone script..
  $output_format = "html" # default output format

  # Usage example
  def usage
    $stderr.printf "Usage:\n"
    $stderr.printf "%s [--html|--esc][--bychar|--bymorpheme] file1 file2\n", $0
    $stderr.printf "For example:\n%s oldtext newtext > diff.html\n", $0
  end

  if ARGV == []
    usage; exit
  end

  $getoptlong = GetoptLong.new(['--bychar', '-c', GetoptLong::NO_ARGUMENT],
                               ['--bymorpheme','-m', GetoptLong::NO_ARGUMENT],
                               ['--html', '-h', GetoptLong::NO_ARGUMENT],
                               ['--esc', '-e', GetoptLong::NO_ARGUMENT])
  docdiff = Docdiff.new # Docdiff.new(doc1, doc2) is also acceptable,
                        # though it's ugly I guess.

  # Take care of command line option, using GetOptLong library.
  begin
    $getoptlong.each do |optname, optarg|
      case optname
      when '--bychar'
        docdiff.compare_method = "by_char"
      when '--bymorpheme'
        docdiff.compare_method = "by_morpheme"
      when '--html'
        $output_format = "html"
      when '--esc'
        $output_format = "esc"
      end
    end
  rescue
    usage
    exit(1)
  end

  # Read 2 text files, and then compare.
  doc1 = File.open(ARGV.shift, "rb").readlines
  doc2 = File.open(ARGV.shift, "rb").readlines
  docdiff.compare(doc1, doc2) # arguments not necessary if given at new()
  # puts docdiff.raw_diff # Uncomment this for debugging
  if $output_format == "html"
    print docdiff.html
  elsif $output_format == "esc"
    print docdiff.esc
  else
    $stderr.print "Error: Output format not specified.\n"
  end
end # end if __FILE__ == $0
# ==============================================================
# End of Script
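=begin comment =================================================
Appendix (illustration only, not used by the script):
  The html and esc methods above decide where to open and close tags
  by looking at how the "diff -u" mark changes from one element to the
  next. The idea can be tried without diff or ChaSen with a small
  sketch like the one below. The method name markup_runs, its default
  <ins>/<del> tags and the sample data are all hypothetical and are not
  part of DocDiff. Unlike the methods above, this sketch assumes that
  an unchanged element precedes the input and it closes a run that is
  still open at the end.

```ruby
# Hypothetical helper: wrap each run of inserted ("+") or removed ("-")
# elements in a single pair of tags, given "diff -u" style marks.
def markup_runs(marks, elements,
                tags = { "+" => ["<ins>", "</ins>"],
                         "-" => ["<del>", "</del>"] })
  out = []
  prev = " " # pretend an unchanged element precedes the input
  marks.zip(elements).each do |mark, element|
    if mark != prev
      out << tags[prev][1] if tags.key?(prev) # close the run that just ended
      out << tags[mark][0] if tags.key?(mark) # open the run that starts here
    end
    out << element
    prev = mark
  end
  out << tags[prev][1] if tags.key?(prev) # close a run left open at the end
  out.join
end

# "bc" was removed and "X" was inserted between "a" and "d".
marks    = [" ", "-", "-", "+", " "]
elements = ["a", "b", "c", "X", "d"]
puts markup_runs(marks, elements)
```

  Passing a different tags hash (for example escape sequences instead
  of HTML tags) would give the same grouping with another style.
=end comment ===================================================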