1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782
|
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- CHAPTER NOT IN USE.... now mostly in the dbconcepts chapter -->
<!--
-
- This file is part of the OpenLink Software Virtuoso Open-Source (VOS)
- project.
-
- Copyright (C) 1998-2018 OpenLink Software
-
- This project is free software; you can redistribute it and/or modify it
- under the terms of the GNU General Public License as published by the
- Free Software Foundation; only version 2 of the License, dated June 1991.
-
- This program is distributed in the hope that it will be useful, but
- WITHOUT ANY WARRANTY; without even the implied warranty of
- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
- General Public License for more details.
-
- You should have received a copy of the GNU General Public License along
- with this program; if not, write to the Free Software Foundation, Inc.,
- 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
-
-
-->
<!--
$Id$
-->
<chapter label="intl.xml" id="intl">
<title>International Character Support and Compatibility</title>
<abstract>
<para>Virtuoso Internationalization features. This chapter describes the storage and manipulation of international character data.</para>
</abstract>
<!-- ======================================== -->
<sect1 id="internationaldesc">
<title>Internationalization</title>
<para>
National strings are best represented as Unicode (NCHAR/LONG NVARCHAR) columns.
There is no guarantee that values stored inside narrow (VARCHAR/LONG VARCHAR)
columns will get correctly represented. If the client application is also
Unicode there is no internationalization conversions taking place.
But most of the current applications are using narrow characters.
The national caret defines how strings will get converted from Narrow to Wide
characters and back throughout Virtuoso, e.g. <type>cast ('test' as nvarchar)</type>
or <type>SQLBindCol</type> of a <type>SQL_WCHAR</type> column to a <type>SQL_C_CHAR</type>
in the ODBC client.
The character set is defined as an array of 255 (without the zero) Unicode codes
describing the location of each character from the narrow caret into the Unicode
space. It has a "primary" or "preferred" name and a list of aliases.
</para>
<para>
Character sets in Virtuoso are kept inside the system table SYS_CHARSETS. It's layout is :
</para>
<programlisting>
CREATE TABLE SYS_CHARSETS (
CS_NAME varchar, -- The "preferred" charset name
CS_TABLE long nvarchar, -- the mapping table of length 255 Wide chars
CS_ALIASES long varchar -- serialized vector of aliases
);
</programlisting>
<para>
The CS_NAME and CS_ALIASES are granted for SELECT to PUBLIC.
To simplify the retrieval of all the official & unofficial names of carets there is a
system function :
</para>
&charsets_list;
<para>
which returns the list of all "official" names and aliases either as a vector
(make_resultset = 0) or as a resultset (make_resultset = 1).
</para>
<para>
There are a number of character set definitions preloaded into this table.
Currently these are:
</para>
<simplelist>
<member>GOST19768-87</member>
<member>IBM437</member>
<member>IBM850</member>
<member>IBM855</member>
<member>IBM866</member>
<member>IBM874</member>
<member>ISO-8859-1</member>
<member>ISO-8859-10</member>
<member>ISO-8859-11</member>
<member>ISO-8859-13</member>
<member>ISO-8859-14</member>
<member>ISO-8859-15</member>
<member>ISO-8859-2</member>
<member>ISO-8859-3</member>
<member>ISO-8859-4</member>
<member>ISO-8859-5</member>
<member>ISO-8859-6</member>
<member>ISO-8859-7</member>
<member>ISO-8859-8</member>
<member>ISO-8859-9</member>
<member>KOI-0</member>
<member>KOI-7</member>
<member>KOI8-A</member>
<member>KOI8-B</member>
<member>KOI8-E</member>
<member>KOI8-F</member>
<member>KOI8-R</member>
<member>KOI8-U</member>
<member>MAC-UKRAINIAN</member>
<member>MIK</member>
<member>WINDOWS-1250</member>
<member>WINDOWS-1251</member>
<member>WINDOWS-1252</member>
<member>WINDOWS-1257</member>
</simplelist>
<para>
New carets can be defined using the function:
</para>
CHARSET_DEFINE (in NAME varchar, in CHARSET_STRING any, in ALIASES any)
<para>
User defined carets can be dropped by deleting the row from the SYS_CHARSETS table
and restarting the server.
</para>
<para>
NAME is the "preferred" name of the caret
CHARSET_STRING an 255 bytes long NCHAR defining the Unicode codes for narrow chars 1-255
ALIASES - vector of aliases or NULL;
</para>
<para>
Each translation is done in accordance to a "current character set".
This is a connection attribute. It gets it's value as follows:
</para>
<simplelist>
<member>
1. if the client supplies a CHARSET ODBC Connect string attribute (either from the
DSN definition or to a SQLDriverConnect) the name is searched into the currently
defined carets in the SYS_CHARSETS and if there is a match that character set becomes
a default.
</member>
<member>
2. If the database default character set ('Charset' parameter in the
'Parameters' section of virtuoso.ini) is defined it becomes current.
</member>
<member>
3. If this is not defined then the default character set becomes ISO-8859-1 which maps
the narrow chars as wide using equality. At any time the user can explicitly set
the character set either with SQLSetConnectAttribute (..., SQL_CHARSET (=5002), ...) or
by executing the command SET CHARSET='name|alias'.
</member>
</simplelist>
<para>
The current charset "preferred" name (as a string) is returned by the system function:
</para>
CURRENT_CHARSET()
<para>
There is a notion of a Database default character set which gets used if the client
doesn't supply it's own and in some special cases (like XML Views & FOR XML AUTO).
</para>
<para>
Types of translations from Unicode char to Narrow char supported by Virtuoso
</para>
<itemizedlist mark="bullet">
<listitem>
<para>String translation:</para>
<simplelist>
<member>If the Unicode represents a part of the US-ASCII (0-127) then it's value gets used</member>
<member>If a mapping to narrow of that Unicode is found in the character set then use it</member>
<member>If none of the above then the narrow '?' is returned.</member>
</simplelist>
</listitem>
<listitem>
<para>Command translation:</para>
<simplelist>
<member>If the Unicode represents a part of the US-ASCII (0-127) then it's value gets used</member>
<member>If a mapping to narrow of that Unicode is found then use it</member>
<member>If none of the above then the Unicode gets escaped by using the form \xNNNN (hexadecimal)</member>
</simplelist>
</listitem>
<listitem>
<para>HTTP/XML translation:</para>
<simplelist>
<member>If the Unicode represents a part of the US-ASCII (0-127) then it's value
gets used after escaping the special symbols (<, >, & etc.) with their escapes</member>
<member>If a mapping to narrow of that Unicode is found in the character set then use it.
The narrow char is checked if needs to be escaped</member>
<member>If none of the above then the Unicode gets escaped by using the form &#DDDDDD; (decimal)</member>
</simplelist>
</listitem>
</itemizedlist>
<sect2 id="charsetclientusage">
<title>Usage of the Character Set in the ODBC/UDBC/CLI Clients</title>
<para>
This describes where exactly a translation is done in the case of an ODBC/UDBC/CLI client.
These are described as one as the Virtuoso CLI is the same as the ODBC/UDBC.
</para>
<para>
These are described as one as the Virtuoso CLI is the same as the ODBC/UDBC.
</para>
<para>
1.SQLPrepareW, SQLExecDirectW, SQLNativeSQLW
The Unicode arguments of these will become narrow strings by using the "Command encoding".
This will then get recognized by the SQL Parser and converted to Unicode if they are
in a NCHAR constant (N'...');
</para>
<para>
2.SQL_C_WCHAR -> SQL_xxx and SQL_Nxxx -> SQL_C_xxx (except SQL_C_WCHAR) bindings
(SQLBindCol, SQLBindParam, SQLGetData ...)
The Unicode strings of these will become narrow strings by using the "String encoding";
</para>
</sect2>
<sect2 id="charsetserverusage">
<title>Usage of the Character Sets in the ODBC/UDBC/CLI Server</title>
<para>
The server uses the character set in the CAST operator when converting NCHAR/LONG NVARCHAR
to any other type.
</para>
</sect2>
<sect2 id="charsethttpusage">
<title>Usage of the Character Set in the HTTP Server</title>
<para>
The HTTP server, unless otherwise specified (by http_header for example) when
returning the HTTP header to the client appends and charset=xxxx to the
Content-Type: HTTP header field. It uses the charset mainly to format correctly
values using the http_value or it's VSP equivalent <?= ...>.
In there the wide values & XML entities (the result of any XML processing function
like xpath_contains) get represented using the "HTTP/XML transformation".
Same goes for the FOR XML AUTO, XML Views and WebDAV content.
</para>
</sect2>
<sect2 id="charsetxmlproc">
<title>Usage of the Character Sets in the XML Processors</title>
<para>
The Virtuoso embedded XML parser processes correctly all the encodings defined
in the SYS_CHARSETS table + UTF8
</para>
</sect2>
<sect2 id="gensql">
<title>The Generation of the SQL</title>
<sect3 id="inputproc">
<title>Input Processing</title>
<para>The XPATH & xpath_contains translate their expressions as follows:</para>
<simplelist>
<member>
Narrow strings: these get translated to Unicode as per the character set
and then to UTF-8 which is the internal encoding used by the Virtuoso XML tools.
</member>
<member>
SQL Views and FOR XML AUTO take their values from narrow columns by firstly
converting them to Unicode based on the DATABASE character set and then to UTF-8
</member>
</simplelist>
</sect3>
<sect3 id="outputproc">
<title>Output Processing</title>
<simplelist>
<member>
Almost all of the XML processors/generators return their values as type
DV_XML_ENTITY (__tag() 230). If such a value is cased to get it's character
representation either by CAST or by http_value the it gets converted to narrow
using "HTTP/XML translation".
</member>
<member>
In the special case when some XPATH expressions return string values they
get returned as NCHAR values to the clients which then convert these to
narrow (if needed) as described earlier.
</member>
</simplelist>
</sect3>
</sect2>
</sect1>
<sect1 id="intlcollationsdef">
<title>Creating a collation</title>
<para>
The collations are stored into the SYS_COLLATIONS system table.
A collation can be created by supplying a "collation definition" text file to
the "collation_define" SQL function. The collation definition file contains a list of
the "exceptions" to the binary collation order represented with the character
code = collation weight pairs. For example a case insensitive collation can be defined by specifying
all the lower case letters to have the same collation weights as the corresponding uppercase ones.
</para>
<sect2 id="coldeffile">
<title>Collation Definition File</title>
<para>
The format of the collation definition file should follow the following guidelines:
</para>
<itemizedlist mark="bullet">
<listitem>
<para>Each definition should reside on a separate line.</para>
</listitem>
<listitem>
<para>The format of the definition is: CHAR=CODE, where CHAR and CODE an be either the
letters themselves, or their decimal codes. (Example: 67=68 is equal to C=D using the ASCII character
set). Note that the codes can exceed the byte boundary (for Unicode collations).
</para>
</listitem>
</itemizedlist>
</sect2>
<sect2 id="coll_define">
<title>coll_define SQL Function</title>
<programlisting>
collation_define (COLLATION_NAME, FILE_PATH, ADD_TYPE)
</programlisting>
<para>
COLLATION_NAME is the name to be assigned to the collation.
</para>
<para>
FILE_PATH is the path (in the server's OS) of the collation definition file.
</para>
<para>
ADD_TYPE is the type of the new collation: 1 for 8-bit collation (256 bytes blob); 2 for UNICODE collation
(65536 Unicode blob). There is a special value of 0 for that parameter instructing the function to only
check the validity of the definition file and to return a resultset containing the codes of the valid exception
definitions.
</para>
</sect2>
<sect2 id="intlsys_collations">
<title>Collations system table</title>
<para>
The SYS_COLLATIONS system table holds the data for all the defined collations. It has the following structure:
</para>
<programlisting>
CREATE TABLE SYS_COLLATIONS (
COLL_NAME VARCHAR,
COLL_TABLE LONG VARBINARY,
COLL_IS_WIDE INTEGER);
</programlisting>
<para>
COLL_NAME is the fully qualified name of the collation (it's identifier)
</para>
<para>
COLL_TABLE holds the collation table (256 bytes or 65536 wide chars)
</para>
<para>
COLL_IS_WIDE show the collation's type (0 for CHAR and 1 for NCHAR). Note that a 8bit collation
cannot be used by a NCHAR data and vice versa.
</para>
<para>
Collation can be deleted by deleting its row from SYS_COLLATIONS.
</para>
<note><title>Note</title>
<para>The collation will still be available until the server is restarted, as it's definition is cached into memory.
</para>
</note>
</sect2>
</sect1>
<sect1 id="8BIT">
<title>8 bit characters</title>
<sect2 id="collation">
<title>Collations</title>
<para>
Virtuoso supports collation orders different from the binary for CHAR and VARCHAR fields as per ANSI SQL-92.
</para>
<para>
When comparing strings using a collation Virtuoso compares the "weights" of the characters
instead of their codes. This allows making different characters compare like equal (example:
case-insensitive comparisons).
</para>
<para>
The collation is a property of the column holding the data. This means that all the comparisons including
that column will use it's collation. Note that the SQL functions will strip the collation data from the
column (example : if a Column "CompanyName" has an assigned collation "Spanish"
then LEFT (CompanyName, 10) will use the default collation).
</para>
<para>
The collation can be defined per column (on table creation) and per database (as a configuration parameter).
There is a user-accessible collation definition procedure.
There is a special form of the CAST operator which allows casting a column to collation.
</para>
<para>
A collation identifier has the same form as any other SQL identifier (qualifier.owner.name) and it can be
escaped as the other identifiers.
</para>
<sect3 id="tablecoll">
<title>Defining collation for a table column</title>
<para>
The collation for a column is assigned on table creation using the following syntax:
</para>
<programlisting>
create table TABLE_NAME (
...
COLLATED VARCHAR(50) COLLATE Spanish,
COLLATED CHAR(20) COLLATE DB.DBA.Spanish,
....
)
</programlisting>
<para>
Note that assigning a collation to a non-character column gives an error.
</para>
<para>
If the COLLATE is omitted, the default database collation is used.
</para>
<para>
On database start-up the collation for each table's column is looked into the SYS_COLLATIONS TABLE
and if not found, the COLLATE attribute is ignored until the next restart.
</para>
</sect3>
<sect3 id="dbcoll">
<title>Defining database wide collation</title>
<para>
The database's default collation is defined by the configuration parameter "Collation" in
the "Parameters" section of virtuoso.ini
</para>
</sect3>
</sect2>
<sect2 id="8BITBIF">
<title>8 bit utility system functions</title>
&charset_recode;
&charsets_list;
</sect2>
</sect1>
<sect1 id="twobyteunicode">
<title>Unicode characters support</title>
<para>
Virtuoso allows 16-bit Unicode data to be stored and retrieved from the database fields.
The data are stored internally as UTF-8 encoded strings for storage space optimization.
The Unicode fields are easily intermixable with other character data as all the SQL functions
support wide string case and convert to the most wide character representation on demand.
</para>
<sect2 id="intldatatypes">
<title>Unicode data types</title>
<para>
There are 3 additional data types to enable storing of Unicode data
</para>
<itemizedlist mark="bullet">
<listitem>
<para>NCHAR</para>
</listitem>
<listitem>
<para>NVARCHAR</para>
</listitem>
<listitem>
<para>LONG NVARCHAR</para>
</listitem>
</itemizedlist>
<para>
All the Unicode types are equivalent of the corresponding "narrow" type (CHAR,
VARCHAR and LONG VARCHAR), except that instead of storing data as one byte they allow
Unicode characters. Their lengths are defined and returned in characters instead of bytes.
They collate according to the active wide char collation (if any).
These types can be used anywhere the narrow char types can be used, except in LIKE conditions.
</para>
<para>
When there is a need to convert a wide string to a narrow one and vice versa, then a character set is
used. A character set returns a wide string code for a wide char. For example there can be a definition of
ISO-8859-5 "narrow" character set which describes mapping of NON-ASCII character codes
to their Unicode equivalents. Virtuoso relies on the fact that the ASCII character codes are represented
in Unicode by just type-casting and in UTF8 as one-byte tokens with the same value as in ASCII.
</para>
<para>
When conversion is done server-side (using cast, some of the SQL built-in functions, etc.) the wide
characters are converted to narrow using a system-independent server-side character set. In it's
absence the Latin1 character set is used which project narrow character codes into the Unicode space
as equally-valued wide character codes.
</para>
<para>
Things are different when the conversion is done client-side (for example when binding a VARCHAR to a
wide buffer). Then the default client's system character set is used.
</para>
<para>
The wide character literals have the ANSI SQL-92 syntax : N'xxx' (prefixing normal literals
with the letter N). These strings process escapes with a values large enough to represent all the
Unicode characters.
</para>
</sect2>
<sect2 id="widefunc">
<title>Built-in SQL Functions and Wide Characters</title>
<para>
All the built in SQL functions which do take a character attribute(s) and do have a character input are
calculating their output type as follows:
</para>
<para>
If there is any attribute which is a wide string or a wide blob, then the result is a wide string. Otherwise the output character type is narrow.
</para>
<para>
There are some functions, like make_string, which do have character result type, but do not have a
character parameters. This is resolved by leaving the make_string to produce narrow strings, and
providing a make_wstring equivalent for wide output.
</para>
</sect2>
<sect2 id="wideodbc">
<title>Client-side changes to support wide characters</title>
<para>
The ODBC client implements the SQL...W functions (like SQLConnectW) that do take Unicode arguments.
This contributes to a faster wide character processing and allows binding of SQL_C_WCHAR output type.
As the SQL parser does not allow Unicode data in SQL commands, they should be bound as parameters
or should be represented as escapes.
</para>
</sect2>
<sect2 id="nvdb">
<title>Virtual database and national language support</title>
<para>
For narrow strings attached tables are using always the default collation of the data source.
</para>
<para>
Remote wide string columns are mapped to the appropriate wide character type in Virtuoso.
The data are then passed as intact in case of wide-to-wide mapping. When the data are converted
client-side in the VDB the Server's system character set is used (where available).
</para>
</sect2>
</sect1>
<sect1 id="xmlprocandftencoding">
<title>XML Processing & Free Text Encoding Issues</title>
<para>
Being an international standard, XML notation is self-protected from encoding errors. Most of the common
errors may easily be eliminated by writing proper XML prologs in documents.</para>
<sect2 id="encodingsvscharsets">
<title>Encodings: The Difference Between Encodings & Character Sets</title>
<para>
Not all documents may be converted to Unicode by using simple character sets.
Some of them are stored in so-called "multibyte" encodings.
It means that every letter (or ideograph) is represented as a sequence of
one or more bytes, not by exactly one byte. The conversion from such representation
to Unicode and back is usually significantly slower than simple transformation
via character sets, so they are supported by data import operations only, but not by
internal RDBMS routines.</para>
<para>
The Virtuoso Server "knows" some number of built-in encodings, such as UTF-8,
UTF-16BE and UTF-16LE. It can load additional encoding descriptions from
a "UCM" file, and can automatically create a new encoding from a known charset with
the same name. See the <link linkend="ucmencodings">UCM Encodings</link> section
for more details.</para>
<para>
An encoding may be used in the following places:
</para>
<simplelist>
<member>The XML/HTML parser to convert source text to Unicode.</member>
<member>The free-text indexing engine to convert plain-text or XML documents
to Unicode during the indexing.</member>
<member>It may be used by the compiler of free-text search expressions
to convert string constants of the expression to Unicode.</member>
<member>It may be used to convert string constants of XPath/XQuery expressions.</member>
</simplelist>
<para>You can only use character sets, not encodings as an ODBC connection
character set, as a character set attribute of a column of a database table,
as an output encoding of the built-in XSLT processor (it is for future versions).
UTF-8 is an exception, it is supported in many places where other encodings are not.</para>
<!-- not clear -->
<tip><title>Security Note:</title>
<para>Two strings converted to Unicode may be identical, but this does not
guarantee that their source strings were equal byte-by-byte due to the
nature of some encodings. For this reason you should avoid processing
authorization data that are neither in Unicode nor in one of the standard
character sets (single-byte encodings). Multibyte encodings and user-defined
character sets may be unsafe for such purposes.
</para></tip>
<sect3 id="ucmencodings">
<title>UCM Encodings</title>
<para>
The description of a multibyte encodings is much longer than the description
of a character set. It is inconvenient to keep such amounts of data inside
the executable. Virtuoso can load descriptions of required encodings
from external files in UCM format. Every UCM file describes one encoding.</para>
<para>
Virtuoso loads UCM files at system initialization. The list of UCM files
is kept in the <link linkend="VIRTINI">Virtuoso INI file</link> under a
section called [Ucms]. This section should contain a UcmPath parameter and
one or more parameters with names Ucm1, Ucm2, Ucm3 and so on (up to Ucm99).</para>
<para>
The UcmPath parameter specifies the directory where UCM files are located,
and every UcmNN parameter specifies the name of a UCM file to load and
a list of names that the encoding can be identified by in the
<?xml ... encoding="..." ?> XML preamble. A vertical bar character is
used to delimit names in the list.</para>
<example id="ucmdefininifile">
<title>Sample [Ucms] Section</title>
<programlisting>
[Ucms]
UcmPath = /usr/local/javalib/ucm
Ucm1 = java-Cp933-1.3-P.ucm,Cp933
Ucm2 = java-Cp949-1.3-P.ucm,Cp949|Korean
</programlisting>
<para>This section describes two UCM files located in /usr/local/javalib/ucm
directory: data from java-Cp933-1.3-P.ucm will be used for documents in the
'Cp933' encoding; data from java-Cp949-1.3-P.ucm will be used for documents
in the 'Cp949' encoding and for documents in the 'Korean' encoding
(because these two names refers to the same encoding). </para>
</example>
<note><title>Note:</title>
<para>The encoding name specified inside the UCM file itself is not used.</para>
</note>
<para>
The Virtuoso server will log the results of processing each UCM file specified
in the Virtuoso INI file. If a UCM file specified is not found or contains
syntax errors, the error is logged, otherwise only the type and name(s) of
the encoding are logged.</para>
<note><title>Note:</title>
<para>If the virtuoso.ini contains a misspelled name of a parameter or section,
the parameter (or a whole section) is ignored without being reported as an error.
It is always wise to verify that the log contains a record about the encoding(s)
you load.</para></note>
<tip><title>See Also:</title>
<para>UCM files can be found freely from various sites concerning the "International Components
for Unicode" project, such as: <ulink url="http://www-124.ibm.com/icu">IBM ICU Homepage</ulink>
or the <ulink url="http://oss.software.ibm.com/cvs/icu/charset/data/ucm/">IBM UCM files directory</ulink>.</para>
<para>The <link linkend="cinterface">C Interface</link> chapter contains
further information regarding user customizable support for new encodings and
languages. For almost all tasks, it is enough to define a new charset or to
load an additional UCM file, but some special tasks may require writing
additional C code.</para>
</tip>
</sect3>
<sect3 id="encodingattr"><title>The <parameter>Encoding</parameter> Attribute</title>
<para>If an XML document contains the <parameter>encoding</parameter>
parameter in its</para>
<programlisting><?xml ... ?></programlisting>
<para>prolog declaration, it will be properly decoded and converted into
UTF-8, so the application code is free from encoding problems. If the
value of this attribute is the name of a pre-set or user-defined character
set, that character set will be used. Virtuoso will recognize names such as
<parameter>UTF-8</parameter> and <parameter>UTF8</parameter> as multi-character
or special encodings. Virtuoso recognizes both official names and aliases.</para>
<para>If an encoding is not specified in an XML prolog, or if the document
contains no prolog, the default encoding will be used to read the
document. If a built-in SQL function invokes the XML parser, it will
have an optional argument <parameter>parser_mode</parameter> to
specify whether source text should be parsed as strict XML or as HTML.
If the source text is 8-bit, then UTF-8 will be used as the default
encoding for "XML mode", and ISO-8859-1 (Latin-1) will be the
default for "HTML mode". If the source text is of some
wide-character type, Unicode is the default. To make another encoding
the default, you may specify its official or alias name as the
<parameter>content_encoding</parameter> argument of a built-in
function you call.</para>
</sect3>
</sect2>
<sect2 id="encodingxpathexp">
<title>Encoding in XPath Expressions</title>
<para>
Sometimes applications should perform XPath queries using the encoding specified
by client. For example, a search engine may ask a user to specify a pattern to
search and use the browser's current encoding as a hint to parse the pattern
properly. In such cases you may wish to use the <parameter>__enc</parameter>
XPath option to specify the encoding used for the rest of XPath string:</para>
<example><title>Specifying Search Encodings in XPath</title>
<para>Create a sample table and store an XML with non-Latin-1 characters</para>
<programlisting>
create table ENC_XML_SAMPLE (
ID integer,
XPER long varchar,
primary key (ID)
);
insert into ENC_XML_SAMPLE (ID, XPER)<!-- values (1, xml_persistent ('<?xml version="1.0" encoding="WINDOWS-1251" ?><book><cit> , (.)</cit></book>')); -->
values (
1,
xml_persistent ('<?xml version="1.0" encoding="WINDOWS-1251" ?>
<book><cit>Îí äîáàâèë
êàðòîøêè,
ïîñîëèë è
ïîñòàâèë
àêâàðèóì íà
îãîíü
(Ì.Æâàíåöêèé
)</cit></book>'
)
);
...
</programlisting>
<para>Find the IDs of all XML documents whose texts contain a specified
phrase. Note that there are pairs of single quotes (not double
quotes) around <parameter>KOI8-R</parameter>. The encoding name should
be in single quotes, but because it is inside a string constant the
quotes must be duplicated.
</para>
<programlisting>
select ID from ENC_XML_SAMPLE where
xcontains (XPER, '[__enc ''KOI8-R''] //cit[text-contains(.,
"''ÐÏÓÔÁ×ÉÌ
ÁË×ÁÒÉÕÍ
ÎÁ ÏÇÏÎØ''")]');
<!-- select ID from ENC_XML_SAMPLE where xcontains (XPER, '[__enc ''KOI8-R''] //cit[text-contains(., "'' ''")]'); -->
</programlisting>
</example>
</sect2>
<sect2 id="encodinginfttsp">
<title>Encoding in Free Text Search Indexes & Patterns</title>
<para>
Like XML applications, free text searching may have encoding problems,
and Virtuoso offers a similar solution for them.</para>
<para>
Both the CREATE TEXT INDEX statement and vt_create_text_index() Virtuoso/PL procedure
have an optional argument to specify the encoding of the indexed data.
The specified encoding will be applied to all source text documents
(if the TEXT INDEX was created), or to all XML documents that have no encoding
attribute of the sort <?xml ... encoding="..." ?>
(if the TEXT XML INDEX was created).</para>
<para>
The option <parameter>__enc</parameter> may be specified at the
beginning of free text search pattern, even if the pattern is inside
an XPath statement:</para>
<example>
<title>Specifying an Encoding for Free Text Searching</title>
<para>
Create a sample table and store a sample of text with non-Latin-1 characters
(assuming that client encoding is Windows-1251)
</para>
<programlisting>
create table ENC_TEXT_SAMPLE (
ID integer,
TEXT long nvarchar,
primary key (ID)
);
insert into ENC_TEXT_SAMPLE (ID, XPER)<!-- values (1, N' , (.)')); -->
values (
1,
'<?xml version="1.0" encoding="WINDOWS-1251" ?>
Îí äîáàâèë
êàðòîøêè,
ïîñîëèë è
ïîñòàâèë
àêâàðèóì
íà îãîíü
(Ì.Æâàíåöêèé')
);
...
</programlisting>
<para>Find the IDs of all text documents whose texts contain a specified phrase.
</para>
<!-- select ID from ENC_TEXT_SAMPLE where contains (TEXT, '[__enc ''KOI8-R''] " "'); -->
<programlisting>
select ID from ENC_SAMPLE where
contains (TEXT, '[__enc ''KOI8-R'']
"ÐÏÓÔÁ×ÉÌ
ÁË×ÁÒÉÕÍ
ÎÁ ÏÇÏÎØ"'
);
</programlisting>
<para>Encoding may be applied locally to an argument of the text-search predicate.
It may be used if the document contains citations in different encodings or if
the XML document contains non-ASCII characters in names of tags or attributes,
or if the encoding affects character codes of ASCII symbols such as '/' or '['.
</para>
<programlisting>
select ID from ENC_XML_SAMPLE where
xcontains (XPER, '//cit[text-contains(., "[__enc ''KOI8-R'']
''ÐÏÓÔÁ×ÉÌ
ÁË×ÁÒÉÕÍ ÎÁ
ÏÇÏÎØ''")]'
);
</programlisting>
</example>
<note><title>Note:</title>
<para>
You may have free-text a expression written as a literal constant:
e.g. if the argument of text-contains XPath function is a literal constant.
Be careful to not declare the __enc twice, once in the beginning of the whole
XPath expression and then again in the beginning of the free-text expression
constant, because words of the text expression will thus be converted twice.</para></note>
</sect2>
</sect1>
</chapter>
|