1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389
|
<pre>Internet Engineering Task Force (IETF) T. Sirainen
Request for Comments: 6203 March 2011
Category: Standards Track
ISSN: 2070-1721
<span class="h1">IMAP4 Extension for Fuzzy Search</span>
Abstract
This document describes an IMAP protocol extension enabling a server
to perform searches with inexact matching and assigning relevancy
scores for matched messages.
Status of This Memo
This is an Internet Standards Track document.
This document is a product of the Internet Engineering Task Force
(IETF). It represents the consensus of the IETF community. It has
received public review and has been approved for publication by the
Internet Engineering Steering Group (IESG). Further information on
Internet Standards is available in <a href="./rfc5741#section-2">Section 2 of RFC 5741</a>.
Information about the current status of this document, any errata,
and how to provide feedback on it may be obtained at
<a href="http://www.rfc-editor.org/info/rfc6203">http://www.rfc-editor.org/info/rfc6203</a>.
Copyright Notice
Copyright (c) 2011 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to <a href="https://www.rfc-editor.org/bcp/bcp78">BCP 78</a> and the IETF Trust's Legal
Provisions Relating to IETF Documents
(<a href="http://trustee.ietf.org/license-info">http://trustee.ietf.org/license-info</a>) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
<span class="grey">Sirainen Standards Track [Page 1]</span></pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-2" ></span>
<span class="grey"><a href="./rfc6203">RFC 6203</a> IMAP4 FUZZY Search March 2011</span>
<span class="h2"><a class="selflink" id="section-1" href="#section-1">1</a>. Introduction</span>
When humans perform searches in IMAP clients, they typically want to
see the most relevant search results first. IMAP servers are able to
do this in the most efficient way when they're free to internally
decide how searches should match messages. This document describes a
new SEARCH=FUZZY extension that provides such functionality.
<span class="h2"><a class="selflink" id="section-2" href="#section-2">2</a>. Conventions Used in This Document</span>
In examples, "C:" indicates lines sent by a client that is connected
to a server. "S:" indicates lines sent by the server to the client.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in <a href="./rfc2119">RFC 2119</a> [<a href="#ref-KEYWORDS" title=""Key words for use in RFCs to Indicate Requirement Levels"">KEYWORDS</a>].
<span class="h2"><a class="selflink" id="section-3" href="#section-3">3</a>. The FUZZY Search Key</span>
The FUZZY search key takes another search key as its argument. The
server is allowed to perform all matching in an implementation-
defined manner for this search key, including ignoring the active
comparator as defined by [<a href="./rfc5255" title=""Internet Message Access Protocol Internationalization"">RFC5255</a>]. Typically, this would be used to
search for strings. For example:
C: A1 SEARCH FUZZY (SUBJECT "IMAP break")
S: * SEARCH 1 5 10
S: A1 OK Search completed.
Besides matching messages with a subject of "IMAP break", the above
search may also match messages with subjects "broken IMAP", "IMAP is
broken", or anything else the server decides that might be a good
match.
This example does a fuzzy SUBJECT search, but a non-fuzzy FROM
search:
C: A2 SEARCH FUZZY SUBJECT work FROM user@example.com
S: * SEARCH 1 4
S: A2 OK Search completed.
How the server handles multiple separate FUZZY search keys is
implementation-defined.
Fuzzy search algorithms might change, or the results of the
algorithms might be different from search to search, so that fuzzy
searches with the same parameters might give different results for
1) the same user at different times, 2) different users (searches
<span class="grey">Sirainen Standards Track [Page 2]</span></pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-3" ></span>
<span class="grey"><a href="./rfc6203">RFC 6203</a> IMAP4 FUZZY Search March 2011</span>
executed simultaneously), or 3) different users (searches executed at
different times). For example, a fuzzy search might adapt to a
user's search habits in an attempt to give more relevant results (in
a "learning" manner). Such differences can also occur because of
operational decisions, such as load balancing. Clients asking for
"fuzzy" really are requesting search results in a not-necessarily-
deterministic way and need to give the user appropriate warning about
that.
<span class="h2"><a class="selflink" id="section-4" href="#section-4">4</a>. Relevancy Scores for Search Results</span>
Servers SHOULD assign a search relevancy score for each matched
message when the FUZZY search key is given. Relevancy scores are
given in the range 1-100, where 100 is the highest relevancy. The
relevancy scores SHOULD use the full 1-100 range, so that clients can
show them to users in a meaningful way, e.g., as a percentage value.
As the name already indicates, relevancy scores specify how relevant
to the search the matched message is. It's not necessarily the same
as how precisely the message matched. For example, a message whose
subject fuzzily matches the search string might get a higher
relevancy score than a message whose body had the exact string in the
middle of a sentence. When multiple search keys are matched fuzzily,
how the relevancy score is calculated is server-dependent.
If the server also advertises the ESEARCH capability as defined by
[<a href="#ref-ESEARCH" title=""IMAP4 Extension to SEARCH Command for Controlling What Kind of Information Is Returned"">ESEARCH</a>], the relevancy scores can be retrieved using the new
RELEVANCY return option for SEARCH:
C: B1 SEARCH RETURN (RELEVANCY ALL) FUZZY TEXT "Helo"
S: * ESEARCH (TAG "B1") ALL 1,5,10 RELEVANCY (4 99 42)
S: B1 OK Search completed.
In the example above, the server would treat "hello", "help", and
other similar strings as fuzzily matching the misspelled "Helo".
The RELEVANCY return option MUST NOT be used unless a FUZZY search
key is also given. Note that SEARCH results aren't sorted by
relevancy; SORT is needed for that.
<span class="h2"><a class="selflink" id="section-5" href="#section-5">5</a>. Fuzzy Matching with Non-String Search Keys</span>
Fuzzy matching is not limited to just string matching. All search
keys SHOULD be matched fuzzily, although exactly what that means for
different search keys is left for server implementations to decide --
including deciding that fuzzy matching is meaningless for a
particular key, and falling back to exact matching. Some suggestions
are given below.
<span class="grey">Sirainen Standards Track [Page 3]</span></pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-4" ></span>
<span class="grey"><a href="./rfc6203">RFC 6203</a> IMAP4 FUZZY Search March 2011</span>
Dates:
A typical example could be when a user wants to find a message
"from Dave about a week ago". A client could perform this search
using SEARCH FUZZY (FROM "Dave" SINCE 21-Jan-2009 BEFORE
24-Jan-2009). The server could return messages outside the
specified date range, but the further away the message is, the
lower the relevancy score.
Sizes:
These should be handled similarly to dates. If a user wants to
search for "about 1 MB attachments", the client could do this by
sending SEARCH FUZZY (LARGER 900000 SMALLER 1100000). Again, the
further away the message size is from the specified range, the
lower the relevancy score.
Flags:
If other search criteria match, the server could return messages
that don't have the specified flags set, but with lower relevancy
scores. SEARCH SUBJECT "xyz" FUZZY ANSWERED, for example, might
be useful if the user thinks the message he is looking for has the
ANSWERED flag set, but he isn't sure.
Unique Identifiers (UIDs), sequences, modification sequences: These
are examples of keys for which exact matching probably makes sense.
Alternatively, a server might choose, for instance, to expand a UID
range by 5% on each side.
<span class="h2"><a class="selflink" id="section-6" href="#section-6">6</a>. Extensions to SORT and SEARCH</span>
If the server also advertises the SORT capability as defined by
[<a href="#ref-SORT" title=""Internet Message Access Protocol - SORT and THREAD Extensions"">SORT</a>], the results can be sorted by the new RELEVANCY sort criteria:
C: C1 SORT (RELEVANCY) UTF-8 FUZZY SUBJECT "Helo"
S: * SORT 5 10 1
S: C1 OK Sort completed.
The message with the highest score is returned first. As with the
RELEVANCY return option, RELEVANCY sort criteria MUST NOT be used
unless a FUZZY search key is also given.
If the server also advertises the ESORT capability as defined by
[<a href="#ref-CONTEXT" title=""Contexts for IMAP4"">CONTEXT</a>], the relevancy scores can be retrieved using the new
RELEVANCY return option for SORT:
C: C2 SORT RETURN (RELEVANCY ALL) (RELEVANCY) UTF-8 FUZZY TEXT
"Helo"
S: * ESEARCH (TAG "C2") ALL 5,10,1 RELEVANCY (99 42 4)
S: C2 OK Sort completed.
<span class="grey">Sirainen Standards Track [Page 4]</span></pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-5" ></span>
<span class="grey"><a href="./rfc6203">RFC 6203</a> IMAP4 FUZZY Search March 2011</span>
Furthermore, if the server advertises the CONTEXT=SORT (or
CONTEXT=SEARCH) capability, then the client can limit the number of
returned messages to a SORT (or a SEARCH) by using the PARTIAL return
option. For example, this returns the 10 most relevant messages:
C: C3 SORT RETURN (PARTIAL 1:10) (RELEVANCY) UTF-8 FUZZY TEXT
"World"
S: * ESEARCH (TAG "C3") PARTIAL (1:10 42,9,34,13,15,4,2,7,23,82)
S: C3 OK Sort completed.
<span class="h2"><a class="selflink" id="section-7" href="#section-7">7</a>. Formal Syntax</span>
The following syntax specification uses the augmented Backus-Naur
Form (BNF) as described in [<a href="#ref-ABNF" title=""Augmented BNF for Syntax Specifications: ABNF"">ABNF</a>]. It includes definitions from
[<a href="./rfc3501" title=""INTERNET MESSAGE ACCESS PROTOCOL - VERSION 4rev1"">RFC3501</a>], [<a href="#ref-IMAP-ABNF" title=""Collected Extensions to IMAP4 ABNF"">IMAP-ABNF</a>], and [<a href="#ref-SORT" title=""Internet Message Access Protocol - SORT and THREAD Extensions"">SORT</a>].
capability =/ "SEARCH=FUZZY"
score = 1*3DIGIT
;; (1 <= n <= 100)
score-list = "(" [score *(SP score)] ")"
search-key =/ "FUZZY" SP search-key
search-return-data =/ "RELEVANCY" SP score-list
;; Conforms to <search-return-data>, from [<a href="#ref-IMAP-ABNF" title=""Collected Extensions to IMAP4 ABNF"">IMAP-ABNF</a>]
search-return-opt =/ "RELEVANCY"
;; Conforms to <search-return-opt>, from [<a href="#ref-IMAP-ABNF" title=""Collected Extensions to IMAP4 ABNF"">IMAP-ABNF</a>]
sort-key =/ "RELEVANCY"
<span class="h2"><a class="selflink" id="section-8" href="#section-8">8</a>. Security Considerations</span>
Implementation of this extension might enable denial-of-service
attacks against server resources. Servers MAY limit the resources
that a single search (or a single user) may use. Additionally,
implementors should be aware of the following: Fuzzy search engines
are often complex with non-obvious disk space, memory, and/or CPU
usage patterns. Server implementors should at least test the fuzzy-
search behavior with large messages that contain very long words
and/or unique random strings. Also, very long search keys might
cause excessive memory or CPU usage.
Invalid input may also be problematic. For example, if the search
engine takes a UTF-8 stream as input, it might fail more or less
badly when illegal UTF-8 sequences are fed to it from a message whose
<span class="grey">Sirainen Standards Track [Page 5]</span></pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-6" ></span>
<span class="grey"><a href="./rfc6203">RFC 6203</a> IMAP4 FUZZY Search March 2011</span>
character set was claimed to be UTF-8. This could be avoided by
validating all the input and, for example, replacing illegal UTF-8
sequences with the Unicode replacement character (U+FFFD).
Search relevancy rankings might be susceptible to "poisoning" by
smart attackers using certain keywords or hidden markup (e.g., HTML)
in their messages to boost the rankings. This can't be fully
prevented by servers, so clients should prepare for it by at least
allowing users to see all the search results, rather than hiding
results below a certain score.
<span class="h2"><a class="selflink" id="section-9" href="#section-9">9</a>. IANA Considerations</span>
IMAP4 capabilities are registered by publishing a standards track or
IESG-approved experimental RFC. The "Internet Message Access
Protocol (IMAP) 4 Capabilities Registry" is available from
<a href="http://www.iana.org/">http://www.iana.org/</a>.
This document defines the SEARCH=FUZZY IMAP capability. IANA has
added it to the registry.
<span class="h2"><a class="selflink" id="section-10" href="#section-10">10</a>. Acknowledgements</span>
Alexey Melnikov, Zoltan Ordogh, Barry Leiba, Cyrus Daboo, and Dave
Cridland have helped with this document.
<span class="h2"><a class="selflink" id="section-11" href="#section-11">11</a>. Normative References</span>
[<a id="ref-ABNF">ABNF</a>] Crocker, D., Ed. and P. Overell, "Augmented BNF for
Syntax Specifications: ABNF", STD 68, <a href="./rfc5234">RFC 5234</a>,
January 2008.
[<a id="ref-CONTEXT">CONTEXT</a>] Cridland, D. and C. King, "Contexts for IMAP4",
<a href="./rfc5267">RFC 5267</a>, July 2008.
[<a id="ref-ESEARCH">ESEARCH</a>] Melnikov, A. and D. Cridland, "IMAP4 Extension to SEARCH
Command for Controlling What Kind of Information Is
Returned", <a href="./rfc4731">RFC 4731</a>, November 2006.
[<a id="ref-IMAP-ABNF">IMAP-ABNF</a>] Melnikov, A. and C. Daboo, "Collected Extensions to
IMAP4 ABNF", <a href="./rfc4466">RFC 4466</a>, April 2006.
[<a id="ref-KEYWORDS">KEYWORDS</a>] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", <a href="https://www.rfc-editor.org/bcp/bcp14">BCP 14</a>, <a href="./rfc2119">RFC 2119</a>, March 1997.
[<a id="ref-RFC3501">RFC3501</a>] Crispin, M., "INTERNET MESSAGE ACCESS PROTOCOL - VERSION
4rev1", <a href="./rfc3501">RFC 3501</a>, March 2003.
<span class="grey">Sirainen Standards Track [Page 6]</span></pre>
<hr class='noprint'/><!--NewPage--><pre class='newpage'><span id="page-7" ></span>
<span class="grey"><a href="./rfc6203">RFC 6203</a> IMAP4 FUZZY Search March 2011</span>
[<a id="ref-RFC5255">RFC5255</a>] Newman, C., Gulbrandsen, A., and A. Melnikov, "Internet
Message Access Protocol Internationalization", <a href="./rfc5255">RFC 5255</a>,
June 2008.
[<a id="ref-SORT">SORT</a>] Crispin, M. and K. Murchison, "Internet Message Access
Protocol - SORT and THREAD Extensions", <a href="./rfc5256">RFC 5256</a>,
June 2008.
Author's Address
Timo Sirainen
EMail: tss@iki.fi
Sirainen Standards Track [Page 7]
</pre>
|