#!/usr/bin/ruby
# DocDiff - Document DIFF for Editors - (docdiff.rb)

=begin comment =================================================
Version: 0.1.5
Created: Sat, 09 Dec 2000 21:52:49 +0900
Updated: Wed, 17 Jan 2001 23:51:32 +0900
Copyright: Copyright 2000-2001 Hisashi Morita
License: Same as Ruby's
Description:
    Compares two text files by character or by word/morpheme,
    and outputs the result in pseudo-HTML format.
Usage:
    docdiff.rb [--bychar|--bymorpheme] [--html|--esc] file1 file2
    ex: docdiff.rb --html oldtext newtext > difference.html
Requirements:
    Ruby (Tested with Ruby 1.6.2/i386-linux)
    GNU diff (Tested with GNU diffutils 2.7/i386-linux)
Recommended:
    Web Browser (Useful when viewing output.  
        Tested with Netscape Navigator 4.7x/Win32)
Optional:
    ChaSen (Required for morpheme analysis.  
        Tested with ChaSen 2.0/i386-linux.
        I have not tested using ChaSen on Windows.)
Why not plain diff?:
    DocDiff is a character/word-oriented tool for comparing text
    documents, while diff is a line-oriented tool for program
    source code.
    Unlike program source code, text documents have no line
    breaks in the middle of a paragraph, so diff does not work
    well with them. (A whole paragraph is reported as inserted/removed.)
    Microsoft Word (>=97 afaik) has a function to compare two
    text files and detect changes within a line, but it does not
    fully meet my needs. (It does not try to pick up small changes.)
    Most of all, I prefer going with Free Software.
Known Problems:
  * Some of the "identical" portions of the two documents may be
    missing from the output.  This happens because diff is designed
    to pick up "different" parts only.  If you encounter this kind
    of problem, try increasing the number of lines given to the -u
    and --horizon-lines options, though I guess 100000 is large enough.
  * You need to prepare a ~/.chasenrc.docdiff file in order to
    use ChaSen.  Sorry for the inconvenience, but ChaSen's command
    line options are a bit too simple.
  * It cannot show changes in spaces and line breaks precisely.
    If you need to inspect trivial changes, @raw_diff might help.
    If you happen to deal with program code somehow, remove the <br>
    tags at the end of each line, and wrap the whole output in
    <pre></pre> tags, so that spaces are not ignored.
  * Accepts only ASCII and EUC-JP encoded input for now.  Sorry.
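    A minimal sketch of the <pre> workaround from the third item above.
    wrap_in_pre is a hypothetical helper, not part of docdiff.rb, and it
    assumes every <br> sits at the very end of a line of the HTML output.

```ruby
# Hypothetical post-processing helper (not part of docdiff.rb).
def wrap_in_pre(html_output)
  body = html_output.gsub(/<br>$/, "")  # drop the <br> at the end of each line
  "<pre>\n#{body}</pre>\n"              # <pre> keeps spaces and line breaks
end

print wrap_in_pre("a  b<br>\n  c<br>\n")
```

    A <br> that is followed by a closing tag on the same line is left
    alone by this pattern, so the result may still need a manual check.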
Preparation for using ChaSen:
    Copy /usr/lib/chasen/chasenrc to ~/.chasenrc.docdiff.
    Check the location of your ChaSen dictionary files,
    and edit ~/.chasenrc.docdiff so that these options are in effect:
    ----
    (文法ファイル  /usr/lib/chasen/dic) ; this depends on OS
    (空白品詞 (記号 空白)) ; don't let ChaSen ignore ASCII spaces and tabs
    (EOS文字列 "\n") ; remove "EOS" at eos.  LF only is preferred
    ----
    (Things would be easier if ChaSen accepted these as command
    line options.)
    Windows users might have to set HOME environment var.
    In Windows 98, it may be like this (not tested at all!):
    set HOME=c:\home
Links:
    Ruby:   http://www.ruby-lang.org/
    ChaSen: http://chasen.aist-nara.ac.jp/
History:
    0.0.1 (Sat, 09 Dec 2000 +0900): 1 hour hack.
    0.0.2 (Sun, 10 Dec 2000 +0900): Rewrote to use class.
    0.1.0a1 (Sat, 16 Dec 2000 +0900): Added ChaSen support.
      Converted scripts from SJIS/CRLF to EUC-Ja/LF.
    0.1.0 (Tue, 19 Dec 2000 +0900): ChaSen works fine now.
      GetOptLong was introduced.
    0.1.1 (Mon, 25 Dec 2000 +0900): Bug fixing and some cleanup.
      Now it quotes <>& when output in HTML.
      Added (unnecessary) support for escape sequence.
    0.1.2 (Thu, 28 Dec 2000 +0900): Bug fix, mostly.
    0.1.3 (Tue, 09 Jan 2001 +0900): Tested with Ruby 1.6.2.
      Fixed "meth(a,b,)" bug with kind help from Masatoshi Seki.
      Thank you.
      Now I use only Linux for development, but it should work
      fine on Windows too, except for Chasen stuff.
    0.1.4 (Tue, 16 Jan 2001 +0900):
      Now the output is like <tag>ab</tag>, instead of ugly
      <tag>a</tag><tag>b</tag>.  Thanks again to Masatoshi Seki
      for suggestion.
      Fixed hidden bug ('puts' was used to output result).
      Some code clean-up, though still hairy enough.
    0.1.5 (Wed, 17 Jan 2001 +0900):
      Erased useless old code that was already commented out.
      Added documentation. (Updated README, more comments)
Things to do:
  * Release to the public.
  * Separate the following:
    command-line handling / class definitions / configuration
  * Automatic Japanese code conversion with Kconv. (EUC-JP/SJIS)
Things to do (low priority):
  * Optimization.  (As slow as 1 minute for two 100KB text files)
  * Valid HTML4 and hopefully XHTML support.
  * Better algo.  
  * Cleaner code.
  * Extract info on unchanged/changed part by Range objects or 
    something, so that we can make some patching utility later.  
    Maybe I need to prepare --ascii and --japanese switch for 
    specifying how to count. (byte or character)
    I've been looking forward to m17n Ruby.
  * ChaSen/Ruby library in C (any volunteer?)
  * Exception handling (this may never be done)
  * Implement diffing function internally (obviously impossible)
=end comment ===================================================

# ==============================================================
# Configuration section.  Edit.
# Nothing yet.  Maybe I should do "require 'docdiff.config'"?

# ==============================================================
# Automatic platform detection and configuration (duh!)
if /mswin|mingw|cygwin|bccwin/ =~ RUBY_PLATFORM  # a bare /win/ also matches "darwin"
  # Win32
  $DIFF = "diff.exe"
  $CHASEN = "chasen.exe"
  $CHASENRC = ENV["HOME"] + "/.chasenrc.docdiff"  # supports Cygwin only
  $KCODE = "SJIS"
else
  # UNIX and other platform
  $DIFF = "diff"
  $CHASEN = "chasen"
  $CHASENRC = ENV["HOME"] + "/.chasenrc.docdiff"
  $KCODE = "EUC"
end

# ==============================================================
require 'jcode'  # Support for multibyte characters (e.g. Japanese)
require 'tempfile'
require 'getoptlong'  # Library for handling command line options

# ==============================================================
class String  # Add a helper method to the built-in String class.

  def quote_html  # Quote HTML special characters. ("<" to "&lt;", etc.)
    dic = {'<'=>'&lt;', '>'=>'&gt;', '&'=>'&amp;'}
    # A literal pattern is compiled once, not on every call.
    return self.gsub(/[<>&]/) {|c| dic[c]}
  end

end
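
=begin comment =================================================
A minimal stand-alone sketch of the quoting rule used by quote_html
above.  quote_html_sketch is a hypothetical name used only for this
illustration; it takes the text as an argument instead of extending
String.

```ruby
# Hypothetical stand-alone version of the quoting rule.
def quote_html_sketch(text)
  dic = {'<'=>'&lt;', '>'=>'&gt;', '&'=>'&amp;'}
  text.gsub(/[<>&]/) {|c| dic[c]}  # one pass over the text
end

puts quote_html_sketch("if a < b && b > c")
```

All three characters are replaced in a single pass, so the "&" of an
already produced "&lt;" is never quoted a second time.
=end comment ===================================================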

# ==============================================================
class Docdiff

  def initialize(doc1 = [], doc2 = [])
    @compare_method = "by_char"
    @doc1 = doc1; @doc2 = doc2
    @raw_diff = []; @stripped_diff = []
    @marks = []
    @elements = []
  end

  attr_accessor :compare_method  # by_char or by_morpheme
  attr_reader :doc1           # text document 1 ("before editing")
  attr_reader :doc2           # text document 2 ("after editing")
  attr_reader :raw_diff       # raw output from diff command
  attr_reader :stripped_diff  # diff output (cleaned up a bit)
  attr_reader :marks          # leading marks (+,-," ") in "diff -u" output
  attr_reader :elements       # strings that follow +- in "diff -u" output

  def split_by_char(array_of_lines)
    # split text by character
    # ["l1\n", "l2\n"] -> ["l", "1", "\n", "l", "2", "\n"]
    # collect (not each) is needed here: each returns the array unchanged.
    arraytmp = array_of_lines.collect {|line| "#{line.chomp}\n"}
    arraytmp.join.split('')
  end
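
=begin comment =================================================
A stand-alone sketch of how character splitting feeds diff: the text
is cut into characters, then written one character per line.  The
names split_by_char_sketch and one_per_line are hypothetical, and the
sketch assumes that split('') cuts per character for the encoding in
use.

```ruby
# Hypothetical stand-alone helpers (not the methods of Docdiff).
def split_by_char_sketch(array_of_lines)
  # Normalize every line ending to LF, then cut into characters.
  array_of_lines.collect {|line| "#{line.chomp}\n"}.join.split('')
end

def one_per_line(units)
  # A "\n" unit becomes an empty line, which marks a line break.
  units.collect {|u| "#{u.chomp}\n"}.join
end

chars = split_by_char_sketch(["l1\r\n", "l2\n"])
p chars
print one_per_line(chars)
```

Because a line break turns into an empty line, the line-oriented diff
can report it like any other inserted or removed character.
=end comment ===================================================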

  def split_by_morpheme(array_of_lines)  # using tempfile, not writing to stdin
    # split text by morpheme, using ChaSen
    # the output we expect is like:
    # "1 morpheme for 1 line, LF at end of each line, delimiter is empty line"
    # for detail, try 'cat textfile | chasen -F "%m\n"'
    returnvaluearray = []
    tf = Tempfile.new("docdiff")  # basename of the temporary file
    tf.print(array_of_lines)
    tf.close
    chasenopt = "-F \"%m\n\" -r #{$CHASENRC}"
    chasen = IO.popen("#{$CHASEN} #{chasenopt} #{tf.path}", "rb+")
    returnvaluearray << chasen.readlines
    chasen.close
    tf.close(true)
    returnvaluearray.flatten.collect {|m| m.chomp}
    # ["hello,\n", " \n", "world.\n"] -> ["hello,", " ", "world."]
  end

  def compare(doc1 = @doc1, doc2 = @doc2)

    # Write 1 char for 1 line on tempfiles, so that diff can compare.
    # If Tempfile had a 'put' method, it would be simpler.
    tf1 = Tempfile.new("docdiff")  # basename of the temporary file
    tf2 = Tempfile.new("docdiff")
    case @compare_method
    when "by_morpheme"
      tf1.print(split_by_morpheme(doc1).collect {|m|"#{m}\n"})
      tf2.print(split_by_morpheme(doc2).collect {|m|"#{m}\n"})
    else  # "by_char", which is also the fallback for unknown methods
      tf1.print(split_by_char(doc1).collect {|c|"#{c.chomp}\n"})
      tf2.print(split_by_char(doc2).collect {|c|"#{c.chomp}\n"})
    end
    tf1.close
    tf2.close

    # Now, let's compare the two, using GNU diff
    # -d: check carefully
    # -u: unified format; add specified num of lines before and after
    # --horizon-lines: don't cut specified num of lines at head & tail
    diffopt = "-d -u -100000 --horizon-lines=100000"
    diff = IO.popen("#{$DIFF} #{diffopt} #{tf1.path} #{tf2.path}", 'rb+')
    @raw_diff = diff.readlines
    diff.close

    # Erase temporary files
    tf1.close(true)
    tf2.close(true)

    # Remove the 3 header lines ("---", "+++", "@@") from diff -u output.
    # diff prints nothing for identical input, hence the "|| []".
    @stripped_diff = @raw_diff[3..-1] || []

    # Separate the marker from the compared text itself.
    # BTW, a pure blank line stands for a line break, so leave it untouched.
    # Otherwise, chomp the line terminator at the end.

    @stripped_diff.each do |line|
      mark = line[0..0]  # status marker, that are: +, -, " "
      element = line[1..(line.size - 1)]
      if element != "\n"  # if it's not a CR, chomp it.
        element = element.chomp
      end
      @marks << mark; @elements << element
    end  # end each
  end  # end compare
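
=begin comment =================================================
A stand-alone sketch of the parsing step at the end of compare: each
body line of "diff -u" output is cut into its leading mark and the
element that follows.  parse_unified_body is a hypothetical name, and
the sample lines are hand-written for illustration, not captured
from a real diff run.

```ruby
# Hypothetical stand-alone version of the mark/element separation.
def parse_unified_body(lines)
  marks = []; elements = []
  lines.each do |line|
    mark = line[0..0]          # "+", "-" or " "
    element = line[1..-1]      # the rest of the line
    element = element.chomp if element != "\n"  # keep a bare line break
    marks << mark; elements << element
  end
  [marks, elements]
end

sample = ["--- old\n", "+++ new\n", "@@ -1,4 +1,5 @@\n",
          " c\n", " a\n", "+r\n", " t\n", " \n"]
marks, elements = parse_unified_body(sample[3..-1])  # skip the 3 header lines
p marks
p elements
```

The two arrays stay index-aligned, which is what the html and esc
methods rely on when they walk over them.
=end comment ===================================================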

  def html  # Returns HTMLized result

    # HTML Tag bindings.  Define as you like.
    remove_start = "<strike>"
    remove_end = "</strike>"
    insert_start = "<strong>"
    insert_end = "</strong>"
    #insert_start = "<u>"  # Not that good. Looks too much like <strike>.
    #insert_end = "</u>"
    #insert_start = "<font color='red'>"  # Yet another emphasizing option.
    #insert_end = "</font>"
    carriage_return = "<br>"

    result = []
    
    # check how marks change and mark-up the text with tags.
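    for i in (0 .. (@elements.size - 1))

      # Quote HTML special characters before adding tags.
      element = @elements[i].quote_html

      # Treat the first element as following an unchanged one;
      # @marks[-1] would wrap around to the last mark.
      prev_mark = (i == 0) ? " " : @marks[i-1]
      case "#{prev_mark}#{@marks[i]}"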
      when " +";  result << (insert_start + element)
      when " -";  result << (remove_start + element)
      when "+ ";  result << (insert_end + element)
      when "+-";  result << (insert_end + remove_start + element)
      when "- ";  result << (remove_end + element)
      when "-+";  result << (remove_end + insert_start + element)
      else  # We can ignore "  ", "++", "--"
        result << element
      end  # end case
      if (@elements[i+1] == "\n")  # insert "CR" when blank line is coming
        result << carriage_return
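      elsif (@elements[i+1] == nil)  # hey, it seems like EOF.
        result << insert_end if @marks[i] == "+"  # close a tag still open at EOF
        result << remove_end if @marks[i] == "-"
        result << carriage_return << "\n"  # add real CR (or LF)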
      end  # end if
    end  # end for

    return result

  end  # def html
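
=begin comment =================================================
A stand-alone sketch of the tagging idea behind html: a tag is opened
or closed only where the mark changes, so a run of equal marks ends
up inside one pair of tags.  tag_runs is a hypothetical name; unlike
the method above it tracks the previous mark in a variable and closes
a tag that is still open at the end, and it does no HTML quoting or
<br> handling.

```ruby
# Hypothetical stand-alone version of the mark-transition tagging.
def tag_runs(marks, elements,
             ins = ["<strong>", "</strong>"], del = ["<strike>", "</strike>"])
  result = []
  prev = " "  # pretend an unchanged element precedes the first one
  marks.each_with_index do |mark, i|
    if mark != prev
      result << ins[1] if prev == "+"  # close the run that just ended
      result << del[1] if prev == "-"
      result << ins[0] if mark == "+"  # open the run that starts here
      result << del[0] if mark == "-"
    end
    result << elements[i]
    prev = mark
  end
  result << ins[1] if prev == "+"  # close a run still open at the end
  result << del[1] if prev == "-"
  result.join
end

puts tag_runs([" ", " ", "-", "+", "+", " "], ["c", "a", "t", "r", "s", "!"])
```

Passing other start/end pairs, such as the escape sequences used by
esc, gives the same structure with different decoration.
=end comment ===================================================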

  def esc  # Returns result which is styled with escape-sequences.
           # Maybe nobody wants this feature...

    insert_start  = ulon = "\033[4m"     # Underline on
    insert_end   = uloff = "\033[24m"    # Underline off
    remove_start = xoron = "\033[7m"     # Reverse video on
    remove_end  = xoroff = "\033[27m"    # Reverse video off
    #insert_start = boldon = "\033[1m"    # Bold on  (does not work in my env.)
    #insert_end  = boldoff = "\033[22m"   # Bold off (21 is double underline on many terminals)
    carriage_return = ""  # No need for consideration in esc-sequence.
    
    result = []
    
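    # check how marks change and mark-up the text with escape sequences.
    for i in (0 .. (@elements.size - 1))

      # Duplicate before adding escape sequences.  Unnecessary, maybe.
      element = @elements[i].dup

      # Treat the first element as following an unchanged one;
      # @marks[-1] would wrap around to the last mark.
      prev_mark = (i == 0) ? " " : @marks[i-1]
      case "#{prev_mark}#{@marks[i]}"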
      when " +";  result << (insert_start + element)
      when " -";  result << (remove_start + element)
      when "+ ";  result << (insert_end + element)
      when "+-";  result << (insert_end + remove_start + element)
      when "- ";  result << (remove_end + element)
      when "-+";  result << (remove_end + insert_start + element)
      else  # We can ignore "  ", "++", "--"
        result << element
      end  # end case
      if (@elements[i+1] == "\n")  # insert "CR" when blank line is coming
        result << carriage_return
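      elsif (@elements[i+1] == nil)  # hey, it seems like EOF.
        result << insert_end if @marks[i] == "+"  # switch off a style still on at EOF
        result << remove_end if @marks[i] == "-"
        result << carriage_return << "\n"  # add real CR (or LF)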
      end  # end if
    end  # end for

    return result

  end  # def esc

end  # end class docdiff

# ==============================================================
# Really ugly test code, but who cares.
if __FILE__ == $0  # If this script is run as stand-alone script..

  $output_format = "html"  # default output format

  # Usage example
  def usage
    $stderr.printf "Usage:\n"
    $stderr.printf "%s [--html|--esc] [--bychar|--bymorpheme] file1 file2\n", $0
    $stderr.printf "For example:\n%s oldtext newtext > diff.html\n", $0
  end

  if ARGV == []
    usage; exit
  end

  $getoptlong = 
    GetoptLong.new(['--bychar',    '-c', GetoptLong::NO_ARGUMENT],
                   ['--bymorpheme','-m', GetoptLong::NO_ARGUMENT],
                   ['--html',      '-h', GetoptLong::NO_ARGUMENT],
                   ['--esc',       '-e', GetoptLong::NO_ARGUMENT])

  docdiff = Docdiff.new  # Docdiff.new(doc1, doc2) is also acceptable,
                         # though it's ugly I guess.

  # Take care of command line option, using GetOptLong library.
  begin
    $getoptlong.each do |optname, optarg|
      case optname
      when '--bychar'
        docdiff.compare_method = "by_char"
      when '--bymorpheme'
        docdiff.compare_method = "by_morpheme"
      when '--html'
        $output_format = "html"
      when '--esc'
        $output_format = "esc"
      end
    end
  rescue
    usage
    exit(1)
  end  

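  # Read 2 text files, and then compare.
  if ARGV.size < 2  # both file names are required
    usage; exit(1)
  end
  doc1 = File.open(ARGV.shift, "rb") {|f| f.readlines}  # block form closes the file
  doc2 = File.open(ARGV.shift, "rb") {|f| f.readlines}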
  docdiff.compare(doc1, doc2)  # not necessary if given at new()
#  puts docdiff.raw_diff  # Uncomment this for debugging

  if $output_format == "html"
    print docdiff.html
  elsif $output_format == "esc"
    print docdiff.esc
  else
    $stderr.print "Error: Output format not specified.\n"
  end

end  # end if __FILE__ == $0

# ==============================================================
# End of Script
