1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277
|
require 'strscan'
# EmailReplyParser is a small library to parse plain text email content. The
# goal is to identify which fragments are quoted, part of a signature, or
# original body content. We want to support both top and bottom posters, so
# no simple "REPLY ABOVE HERE" content is used.
#
# Beyond RFC 5322 (which is handled by the [Ruby mail gem][mail]), there aren't
# any real standards for how emails are created. This attempts to parse out
# common conventions for things like replies:
#
# this is some text
#
# On <date>, <author> wrote:
# > blah blah
# > blah blah
#
# ... and signatures:
#
# this is some text
#
# --
# Bob
# http://homepage.com/~bob
#
# Each of these are parsed into Fragment objects.
#
# EmailReplyParser also attempts to figure out which of these blocks should
# be hidden from users.
#
# [mail]: https://github.com/mikel/mail
class EmailReplyParser
VERSION = "0.5.8"
# Public: Splits an email body into a list of Fragments.
#
# text - A String email body.
#
# Returns an Email instance.
def self.read(text)
Email.new.read(text)
end
# Public: Get the text of the visible portions of the given email body.
#
# text - A String email body.
#
# Returns a String.
def self.parse_reply(text)
self.read(text).visible_text
end
### Emails
# An Email instance represents a parsed body String.
class Email
# Emails have an Array of Fragments.
attr_reader :fragments
def initialize
@fragments = []
end
# Public: Gets the combined text of the visible fragments of the email body.
#
# Returns a String.
def visible_text
fragments.select{|f| !f.hidden?}.map{|f| f.to_s}.join("\n").rstrip
end
# Splits the given text into a list of Fragments. This is roughly done by
# reversing the text and parsing from the bottom to the top. This way we
# can check for 'On <date>, <author> wrote:' lines above quoted blocks.
#
# text - A String email body.
#
# Returns this same Email instance.
def read(text)
# in 1.9 we want to operate on the raw bytes
text = text.dup.force_encoding('binary') if text.respond_to?(:force_encoding)
# Normalize line endings.
text.gsub!("\r\n", "\n")
# Check for multi-line reply headers. Some clients break up
# the "On DATE, NAME <EMAIL> wrote:" line into multiple lines.
if text =~ /^(?!On.*On\s.+?wrote:)(On\s(.+?)wrote:)$/nm
# Remove all new lines from the reply header.
text.gsub! $1, $1.gsub("\n", " ")
end
# Some users may reply directly above a line of underscores.
# In order to ensure that these fragments are split correctly,
# make sure that all lines of underscores are preceded by
# at least two newline characters.
text.gsub!(/([^\n])(?=\n_{7}_+)$/m, "\\1\n")
# The text is reversed initially due to the way we check for hidden
# fragments.
text = text.reverse
# This determines if any 'visible' Fragment has been found. Once any
# visible Fragment is found, stop looking for hidden ones.
@found_visible = false
# This instance variable points to the current Fragment. If the matched
# line fits, it should be added to this Fragment. Otherwise, finish it
# and start a new Fragment.
@fragment = nil
# Use the StringScanner to pull out each line of the email content.
@scanner = StringScanner.new(text)
while line = @scanner.scan_until(/\n/n)
scan_line(line)
end
# Be sure to parse the last line of the email.
if (last_line = @scanner.rest.to_s).size > 0
scan_line(last_line)
end
# Finish up the final fragment. Finishing a fragment will detect any
# attributes (hidden, signature, reply), and join each line into a
# string.
finish_fragment
@scanner = @fragment = nil
# Now that parsing is done, reverse the order.
@fragments.reverse!
self
end
private
EMPTY = "".freeze
SIGNATURE = '(?m)(--\s*$|__\s*$|\w-$)|(^(\w+\s*){1,3} ym morf tneS$)'
begin
require 're2'
SIG_REGEX = RE2::Regexp.new(SIGNATURE)
rescue LoadError
SIG_REGEX = Regexp.new(SIGNATURE)
end
### Line-by-Line Parsing
# Scans the given line of text and figures out which fragment it belongs
# to.
#
# line - A String line of text from the email.
#
# Returns nothing.
def scan_line(line)
line.chomp!("\n")
line.lstrip! unless SIG_REGEX.match(line)
# We're looking for leading `>`'s to see if this line is part of a
# quoted Fragment.
is_quoted = !!(line =~ /(>+)$/n)
# Mark the current Fragment as a signature if the current line is empty
# and the Fragment starts with a common signature indicator.
if @fragment && line == EMPTY
if SIG_REGEX.match @fragment.lines.last
@fragment.signature = true
finish_fragment
end
end
# If the line matches the current fragment, add it. Note that a common
# reply header also counts as part of the quoted Fragment, even though
# it doesn't start with `>`.
if @fragment &&
((@fragment.quoted? == is_quoted) ||
(@fragment.quoted? && (quote_header?(line) || line == EMPTY)))
@fragment.lines << line
# Otherwise, finish the fragment and start a new one.
else
finish_fragment
@fragment = Fragment.new(is_quoted, line)
end
end
# Detects if a given line is a header above a quoted area. It is only
# checked for lines preceding quoted regions.
#
# line - A String line of text from the email.
#
# Returns true if the line is a valid header, or false.
def quote_header?(line)
line =~ /^:etorw.*nO$/n
end
# Builds the fragment string and reverses it, after all lines have been
# added. It also checks to see if this Fragment is hidden. The hidden
# Fragment check reads from the bottom to the top.
#
# Any quoted Fragments or signature Fragments are marked hidden if they
# are below any visible Fragments. Visible Fragments are expected to
# contain original content by the author. If they are below a quoted
# Fragment, then the Fragment should be visible to give context to the
# reply.
#
# some original text (visible)
#
# > do you have any two's? (quoted, visible)
#
# Go fish! (visible)
#
# > --
# > Player 1 (quoted, hidden)
#
# --
# Player 2 (signature, hidden)
#
def finish_fragment
if @fragment
@fragment.finish
if !@found_visible
if @fragment.quoted? || @fragment.signature? ||
@fragment.to_s.strip == EMPTY
@fragment.hidden = true
else
@found_visible = true
end
end
@fragments << @fragment
end
@fragment = nil
end
end
### Fragments
# Represents a group of paragraphs in the email sharing common attributes.
# Paragraphs should get their own fragment if they are a quoted area or a
# signature.
class Fragment < Struct.new(:quoted, :signature, :hidden)
# This is an Array of String lines of content. Since the content is
# reversed, this array is backwards, and contains reversed strings.
attr_reader :lines,
# This is reserved for the joined String that is build when this Fragment
# is finished.
:content
def initialize(quoted, first_line)
self.signature = self.hidden = false
self.quoted = quoted
@lines = [first_line]
@content = nil
@lines.compact!
end
alias quoted? quoted
alias signature? signature
alias hidden? hidden
# Builds the string content by joining the lines and reversing them.
#
# Returns nothing.
def finish
@content = @lines.join("\n")
@lines = nil
@content.reverse!
end
def to_s
@content
end
def inspect
to_s.inspect
end
end
end
|