# This file is part of the Detox package.
#
# Copyright (c) Doug Harple <detox.dharple@gmail.com>
#
# For the full copyright and license information, please view the LICENSE
# file that was distributed with this source code.
#
# Special thanks to:
# - https://www.unicode.org/Public/5.2.0/ucd/UnicodeData.txt
# - https://www.unicode.org/Public/15.1.0/ucd/UnicodeData.txt
# - https://metacpan.org/pod/Text::Unidecode
# - https://github.com/Behat/Transliterator
# - https://www.unicode.org/charts/
#
start
#
# C0 Controls and Basic Latin
#
# The ASCII characters 0x20-0x7E are listed here because detox should convert
# multibyte encodings of them to their single-byte forms.
#
# For instance, the byte sequence 0xC0 0xA0 is really just 0x20 (space).
#
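# A worked decoding of that example, as a sketch. The lead byte 0xC0 marks a
# two-byte UTF-8 sequence, so the payload is 5 bits followed by 6 bits:
#
#   0xC0 = 110 00000  ->  payload 00000
#   0xA0 = 10 100000  ->  payload 100000
#   (0x00 << 6) | 0x20 = 0x20
#
# This is an "overlong" form: 0x20 already fits in one byte, so strict UTF-8
# decoders reject the two-byte version.
#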
# https://unicode.org/charts/PDF/U0000.pdf
#
0x0001 "_" # START OF HEADING
0x0002 "_" # START OF TEXT
0x0003 "_" # END OF TEXT
0x0004 "_" # END OF TRANSMISSION
0x0005 "_" # ENQUIRY
0x0006 "_" # ACKNOWLEDGE
0x0007 "_" # BELL
0x0008 "_" # BACKSPACE
0x0009 "_" # CHARACTER TABULATION
0x000A "_" # LINE FEED (LF)
0x000B "_" # LINE TABULATION
0x000C "_" # FORM FEED (FF)
0x000D "_" # CARRIAGE RETURN (CR)
0x000E "_" # SHIFT OUT
0x000F "_" # SHIFT IN
0x0010 "_" # DATA LINK ESCAPE
0x0011 "_" # DEVICE CONTROL ONE
0x0012 "_" # DEVICE CONTROL TWO
0x0013 "_" # DEVICE CONTROL THREE
0x0014 "_" # DEVICE CONTROL FOUR
0x0015 "_" # NEGATIVE ACKNOWLEDGE
0x0016 "_" # SYNCHRONOUS IDLE
0x0017 "_" # END OF TRANSMISSION BLOCK
0x0018 "_" # CANCEL
0x0019 "_" # END OF MEDIUM
0x001A "_" # SUBSTITUTE
0x001B "_" # ESCAPE
0x001C "_" # INFORMATION SEPARATOR FOUR
0x001D "_" # INFORMATION SEPARATOR THREE
0x001E "_" # INFORMATION SEPARATOR TWO
0x001F "_" # INFORMATION SEPARATOR ONE
0x0020 " " # SPACE
0x0021 ! # EXCLAMATION MARK
0x0022 '"' # QUOTATION MARK
0x0023 # # NUMBER SIGN
0x0024 $ # DOLLAR SIGN
0x0025 % # PERCENT SIGN
0x0026 & # AMPERSAND
0x0027 "'" # APOSTROPHE
0x0028 ( # LEFT PARENTHESIS
0x0029 ) # RIGHT PARENTHESIS
0x002A * # ASTERISK
0x002B + # PLUS SIGN
0x002C , # COMMA
0x002D - # HYPHEN-MINUS
0x002E . # FULL STOP
0x002F / # SOLIDUS
0x0030 0 # DIGIT ZERO
0x0031 1 # DIGIT ONE
0x0032 2 # DIGIT TWO
0x0033 3 # DIGIT THREE
0x0034 4 # DIGIT FOUR
0x0035 5 # DIGIT FIVE
0x0036 6 # DIGIT SIX
0x0037 7 # DIGIT SEVEN
0x0038 8 # DIGIT EIGHT
0x0039 9 # DIGIT NINE
0x003A : # COLON
0x003B ; # SEMICOLON
0x003C < # LESS-THAN SIGN
0x003D = # EQUALS SIGN
0x003E > # GREATER-THAN SIGN
0x003F ? # QUESTION MARK
0x0040 @ # COMMERCIAL AT
0x0041 A # LATIN CAPITAL LETTER A
0x0042 B # LATIN CAPITAL LETTER B
0x0043 C # LATIN CAPITAL LETTER C
0x0044 D # LATIN CAPITAL LETTER D
0x0045 E # LATIN CAPITAL LETTER E
0x0046 F # LATIN CAPITAL LETTER F
0x0047 G # LATIN CAPITAL LETTER G
0x0048 H # LATIN CAPITAL LETTER H
0x0049 I # LATIN CAPITAL LETTER I
0x004A J # LATIN CAPITAL LETTER J
0x004B K # LATIN CAPITAL LETTER K
0x004C L # LATIN CAPITAL LETTER L
0x004D M # LATIN CAPITAL LETTER M
0x004E N # LATIN CAPITAL LETTER N
0x004F O # LATIN CAPITAL LETTER O
0x0050 P # LATIN CAPITAL LETTER P
0x0051 Q # LATIN CAPITAL LETTER Q
0x0052 R # LATIN CAPITAL LETTER R
0x0053 S # LATIN CAPITAL LETTER S
0x0054 T # LATIN CAPITAL LETTER T
0x0055 U # LATIN CAPITAL LETTER U
0x0056 V # LATIN CAPITAL LETTER V
0x0057 W # LATIN CAPITAL LETTER W
0x0058 X # LATIN CAPITAL LETTER X
0x0059 Y # LATIN CAPITAL LETTER Y
0x005A Z # LATIN CAPITAL LETTER Z
0x005B [ # LEFT SQUARE BRACKET
0x005C \ # REVERSE SOLIDUS
0x005D ] # RIGHT SQUARE BRACKET
0x005E ^ # CIRCUMFLEX ACCENT
0x005F _ # LOW LINE
0x0060 ` # GRAVE ACCENT
0x0061 a # LATIN SMALL LETTER A
0x0062 b # LATIN SMALL LETTER B
0x0063 c # LATIN SMALL LETTER C
0x0064 d # LATIN SMALL LETTER D
0x0065 e # LATIN SMALL LETTER E
0x0066 f # LATIN SMALL LETTER F
0x0067 g # LATIN SMALL LETTER G
0x0068 h # LATIN SMALL LETTER H
0x0069 i # LATIN SMALL LETTER I
0x006A j # LATIN SMALL LETTER J
0x006B k # LATIN SMALL LETTER K
0x006C l # LATIN SMALL LETTER L
0x006D m # LATIN SMALL LETTER M
0x006E n # LATIN SMALL LETTER N
0x006F o # LATIN SMALL LETTER O
0x0070 p # LATIN SMALL LETTER P
0x0071 q # LATIN SMALL LETTER Q
0x0072 r # LATIN SMALL LETTER R
0x0073 s # LATIN SMALL LETTER S
0x0074 t # LATIN SMALL LETTER T
0x0075 u # LATIN SMALL LETTER U
0x0076 v # LATIN SMALL LETTER V
0x0077 w # LATIN SMALL LETTER W
0x0078 x # LATIN SMALL LETTER X
0x0079 y # LATIN SMALL LETTER Y
0x007A z # LATIN SMALL LETTER Z
0x007B { # LEFT CURLY BRACKET
0x007C | # VERTICAL LINE
0x007D } # RIGHT CURLY BRACKET
0x007E ~ # TILDE
0x007F "_" # DELETE
#
# Latin-1 Supplement - 0x0080-0x00FF
#
# https://unicode.org/charts/PDF/U0080.pdf
#
0x00A0 " " # NO-BREAK SPACE
0x00AD - # SOFT HYPHEN
#
# General Punctuation - 0x2000-0x206F
#
# https://unicode.org/charts/PDF/U2000.pdf
#
0x2000 " " # EN QUAD
0x2001 " " # EM QUAD
0x2002 " " # EN SPACE
0x2003 " " # EM SPACE
0x2004 " " # THREE-PER-EM SPACE
0x2005 " " # FOUR-PER-EM SPACE
0x2006 " " # SIX-PER-EM SPACE
0x2007 " " # FIGURE SPACE
0x2008 " " # PUNCTUATION SPACE
0x2009 " " # THIN SPACE
0x200A " " # HAIR SPACE
0x200B "" # ZERO WIDTH SPACE
0x200C "" # ZERO WIDTH NON-JOINER
0x200D "" # ZERO WIDTH JOINER
0x200E "" # LEFT-TO-RIGHT MARK
0x200F "" # RIGHT-TO-LEFT MARK
0x2010 - # HYPHEN
0x2011 - # NON-BREAKING HYPHEN
0x2012 - # FIGURE DASH
0x2013 - # EN DASH
0x2014 - # EM DASH
0x2015 - # HORIZONTAL BAR
0x2017 _ # DOUBLE LOW LINE
0x2028 " " # LINE SEPARATOR
0x2029 " " # PARAGRAPH SEPARATOR
0x202A " " # LEFT-TO-RIGHT EMBEDDING
0x202B " " # RIGHT-TO-LEFT EMBEDDING
0x202C " " # POP DIRECTIONAL FORMATTING
0x202D " " # LEFT-TO-RIGHT OVERRIDE
0x202E " " # RIGHT-TO-LEFT OVERRIDE
0x202F " " # NARROW NO-BREAK SPACE
0x2044 "_" # FRACTION SLASH
0x205F " " # MEDIUM MATHEMATICAL SPACE
0x2060 "" # WORD JOINER
end