From 27682d02f7ac0669043faeb419dd5a104eecfb73 Mon Sep 17 00:00:00 2001
From: Dan Kogai <dankogai+github@gmail.com>
Date: Tue, 15 Sep 2015 22:49:12 +0900
Subject: [PATCH] Address https://rt.cpan.org/Public/Bug/Display.html?id=107043
! Unicode/Unicode.xs Unicode/Unicode.pm
Address RT#107043: previously, if no BOM was found, the routine died.
When you decode from UTF-(16|32) without an explicit -BE or -LE and the
input has no BOM, Encode now assumes BE, according to RFC 2781 and the
Unicode Standard version 8.0.
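The fallback rule described above can be sketched in isolation. This is a
minimal illustration, not the code in Unicode.xs: the names utf16_order,
utf16_byte_order and utf16_read_unit are hypothetical, and it handles only
UTF-16, assuming the caller passes the raw input bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical byte-order tag used only in this sketch. */
enum utf16_order { UTF16_BE, UTF16_LE };

/* Choose the byte order for a UTF-16 buffer.
 * *skip receives the number of BOM bytes to skip (2 if a BOM was
 * found, 0 otherwise, so the first code unit is decoded as data). */
static enum utf16_order
utf16_byte_order(const uint8_t *buf, size_t len, size_t *skip)
{
    *skip = 0;
    if (len >= 2) {
        if (buf[0] == 0xFE && buf[1] == 0xFF) {
            *skip = 2;
            return UTF16_BE;
        }
        if (buf[0] == 0xFF && buf[1] == 0xFE) {
            *skip = 2;
            return UTF16_LE;
        }
    }
    /* No BOM: fall back to big-endian instead of failing. */
    return UTF16_BE;
}

/* Read one 16-bit code unit from two bytes in the given order. */
static uint16_t
utf16_read_unit(const uint8_t *p, enum utf16_order order)
{
    return order == UTF16_BE ? (uint16_t)((p[0] << 8) | p[1])
                             : (uint16_t)((p[1] << 8) | p[0]);
}
```

With no BOM, the bytes 0x00 0x41 are therefore read as U+0041 and nothing
is skipped, which corresponds to the "s -= size" rewind in the hunk below.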
Bug: https://rt.cpan.org/Public/Bug/Display.html?id=107043
Bug-Debian: https://bugs.debian.org/799086
--- a/Unicode/Unicode.pm
+++ b/Unicode/Unicode.pm
@@ -176,7 +176,13 @@
When BE or LE is omitted during decode(), it checks if BOM is at the
beginning of the string; if one is found, the endianness is set to
-what the BOM says. If no BOM is found, the routine dies.
+what the BOM says.
+
+=item Default Byte Order
+
+When no BOM is found, Encode 2.76 and below croaked. Since Encode
+2.77 (and 2.63-1+deb8u1), it falls back to BE accordingly to RFC2781 and the Unicode
+Standard version 8.0
=item *
--- a/Unicode/Unicode.xs
+++ b/Unicode/Unicode.xs
@@ -164,9 +164,19 @@
endian = 'V';
}
else {
- croak("%"SVf":Unrecognised BOM %"UVxf,
- *hv_fetch((HV *)SvRV(obj),"Name",4,0),
- bom);
+ /* No BOM found, use big-endian fallback as specified in
+ * RFC2781 and the Unicode Standard version 8.0:
+ *
+ * The UTF-16 encoding scheme may or may not begin with
+ * a BOM. However, when there is no BOM, and in the
+ * absence of a higher-level protocol, the byte order
+ * of the UTF-16 encoding scheme is big-endian.
+ *
+ * If the first two octets of the text is not 0xFE
+ * followed by 0xFF, and is not 0xFF followed by 0xFE,
+ * then the text SHOULD be interpreted as big-endian.
+ */
+ s -= size;
}
}
#if 1