Description: force treating scanned string as UTF-8 when compiled regex is UTF-8
re::engine::RE2 is documented (in the BUGS section of its README)
as not handling UTF-8 correctly.
.
Without this patch,
scanning a Latin1 string with a UTF-8 regex reports wrong positions
or potentially crashes,
and misses e.g. "£" (which the Perl regex engine matches correctly).
.
With this patch,
scanning a UTF-8 string with a UTF-8 regex should behave correctly,
but still misses e.g. "£".
.
Scanning should be safer and more correct for UTF-8 strings,
with the only known side effect being slower scanning of non-UTF-8 strings,
since the string is always upgraded to UTF-8.
For faster scanning of a known-ASCII string, use an ASCII regex.
Origin: https://github.com/dgl/re-engine-RE2/pull/8
Author: Todd Richmond <trichmond@proofpoint.com>
Bug: https://rt.cpan.org/Public/Bug/Display.html?id=116747
Bug: https://rt.cpan.org/Public/Bug/Display.html?id=131618
Last-Update: 2023-06-21
---
This patch header follows DEP-3: http://dep.debian.net/deps/dep3/
--- a/re2_xs.cc
+++ b/re2_xs.cc
@@ -101,10 +101,12 @@
// XXX: Need to compile two versions?
/* The pattern is not UTF-8. Tell RE2 to treat it as Latin1. */
#ifdef RXf_UTF8
- if (!(flags & RXf_UTF8))
+ if (flags & RXf_UTF8)
#else
- if (!SvUTF8(pattern))
+ if (SvUTF8(pattern))
#endif
+ extflags |= RXf_MATCH_UTF8;
+ else
options.set_encoding(RE2::Options::EncodingLatin1);
options.set_log_errors(false);
@@ -311,7 +313,7 @@
RE2::Options options;
options.Copy(previous->options());
- return new RE2 (re2::StringPiece(RX_WRAPPED(rx), RX_WRAPLEN(rx)), options);
+ return new RE2 (previous->pattern(), options);
}
SV *