<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=9"/>
<title>wxWidgets: Unicode Support in wxWidgets</title>
<link href="tabs.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="dynsections.js"></script>
<link href="doxygen.css" rel="stylesheet" type="text/css" />
<link href="extra_stylesheet.css" rel="stylesheet" type="text/css"/>
</head>
<body>
<div id="page_container">
<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
<div id="titlearea">
<table cellspacing="0" cellpadding="0" style="width: 100%;">
<tbody>
<tr>
<td id="projectlogo">
<a href="http://www.wxwidgets.org/" target="_new">
<img alt="wxWidgets" src="logo.png"/>
</a>
</td>
<td style="padding-left: 0.5em; text-align: right;">
<span id="projectnumber">Version: 3.0.2</span>
</td>
</tr>
</tbody>
</table>
</div>
<!-- Generated by Doxygen 1.8.2 -->
<div id="navrow1" class="tabs">
<ul class="tablist">
<li><a href="index.html"><span>Main Page</span></a></li>
<li class="current"><a href="pages.html"><span>Related Pages</span></a></li>
<li><a href="modules.html"><span>Categories</span></a></li>
<li><a href="annotated.html"><span>Classes</span></a></li>
<li><a href="files.html"><span>Files</span></a></li>
</ul>
</div>
<div id="nav-path" class="navpath">
<ul>
<li class="navelem"><a class="el" href="index.html">Documentation</a></li><li class="navelem"><a class="el" href="page_topics.html">Programming Guides</a></li> </ul>
</div>
</div><!-- top -->
<div class="header">
<div class="headertitle">
<div class="title">Unicode Support in wxWidgets </div> </div>
</div><!--header-->
<div class="contents">
<div class="toc"><h3>Table of Contents</h3>
<ul><li class="level1"><a href="#overview_unicode_what">What is Unicode?</a></li>
<li class="level1"><a href="#overview_unicode_encodings">Unicode Representations and Terminology</a></li>
<li class="level1"><a href="#overview_unicode_supportin">Unicode Support in wxWidgets</a><ul><li class="level2"><a href="#overview_unicode_support_default">Unicode is Always Used by Default</a></li>
<li class="level2"><a href="#overview_unicode_support_utf">Choosing Unicode Representation</a></li>
<li class="level2"><a href="#overview_unicode_settings">Unicode Related Preprocessor Symbols</a></li>
</ul>
</li>
<li class="level1"><a href="#overview_unicode_pitfalls">Potential Unicode Pitfalls</a><ul><li class="level2"><a href="#overview_unicode_compilation_errors">Unicode-Related Compilation Errors</a></li>
<li class="level2"><a href="#overview_unicode_data_loss">Data Loss due To Unicode Conversion Errors</a></li>
<li class="level2"><a href="#overview_unicode_performance">Performance Implications of Using UTF-8</a></li>
</ul>
</li>
<li class="level1"><a href="#overview_unicode_supportout">Unicode and the Outside World</a></li>
</ul>
</div>
<div class="textblock"><p>This section describes how wxWidgets supports Unicode and how this can affect your programs.</p>
<p>Notice that Unicode support changed radically in wxWidgets 3.0 and a lot of existing material pertaining to previous versions of the library is no longer correct. Please see <a class="el" href="overview_changes_since28.html#overview_changes_unicode">Unicode-related Changes</a> for the details of these changes.</p>
<p>You can skip the first two sections if you're already familiar with Unicode and wish to jump directly to the details of its support in the library.</p>
<h1><a class="anchor" id="overview_unicode_what"></a>
What is Unicode?</h1>
<p>Unicode is a standard for character encoding which addresses the shortcomings of the previous standards (e.g. the ASCII standard) by using code units of 8, 16 or 32 bits to encode each character. This provides enough code points (see below for the definition) to encode all of the world's languages at once. More details about Unicode may be found at <a href="http://www.unicode.org/">http://www.unicode.org/</a>.</p>
<p>From a practical point of view, using Unicode is almost a requirement when writing applications for an international audience. Moreover, any application reading files which it didn't produce, or receiving data over the network from other services, should be ready to deal with Unicode.</p>
<h1><a class="anchor" id="overview_unicode_encodings"></a>
Unicode Representations and Terminology</h1>
<p>When working with Unicode, it's important to define the meaning of some terms.</p>
<p>A <b><em>glyph</em></b> is a particular image (usually part of a font) that represents a character or part of a character. Any character may have one or more associated glyphs; e.g. some of the possible glyphs for the capital letter 'A' are:</p>
<div class="image">
<img src="overview_unicode_glyphs.png" alt="overview_unicode_glyphs.png"/>
</div>
<p>Unicode assigns a number, called a <b><em>code point</em></b>, to each character of almost any existing alphabet/script; it's typically written in documentation and on the Unicode website as <code>U+xxxx</code> where <code>xxxx</code> is a hexadecimal number.</p>
<p>Note that typically one character is assigned exactly one code point, but there are exceptions, such as the so-called <em>precomposed characters</em> (see <a href="http://en.wikipedia.org/wiki/Precomposed_character">http://en.wikipedia.org/wiki/Precomposed_character</a>) or the <em>ligatures</em>. In these cases a single "character" may be mapped to more than one code point or, vice versa, more than one character may be mapped to a single code point.</p>
<p>The Unicode standard divides the space of all possible code points into <b><em>planes</em></b>; a plane is a range of 65,536 (10000<sub>16</sub>) contiguous Unicode code points. Planes are numbered from 0 to 16, where the first one is the <em>BMP</em>, or Basic Multilingual Plane. The BMP contains characters for almost all modern languages, and a large number of special characters. The other planes mainly contain historic scripts and special-purpose characters, or are unused.</p>
<p>Code points are represented in computer memory as a sequence of one or more <b><em>code units</em></b>, where a code unit is a unit of memory: 8, 16, or 32 bits. More precisely, a code unit is the minimal bit combination that can represent a unit of encoded text for processing or interchange.</p>
<p>The <b><em>UTF</em></b> or Unicode Transformation Formats are algorithms mapping the Unicode code points to code unit sequences. The simplest of them is <b>UTF-32</b> where each code unit consists of 32 bits (4 bytes) and each code point is always represented by a single code unit (fixed length encoding). (Note that even UTF-32 is still not completely trivial as the mapping is different for little and big-endian architectures). UTF-32 is commonly used under Unix systems for internal representation of Unicode strings.</p>
<p>Another very widespread standard is <b>UTF-16</b> which is used by Microsoft Windows: it encodes the first (approximately) 64 thousand Unicode code points (the BMP) using a single 16-bit code unit (2 bytes) each and uses a pair of 16-bit code units to encode the characters beyond this. These pairs are called <em>surrogate pairs</em>. Thus UTF-16 uses a variable number of code units to encode each code point.</p>
<p>Finally, the most widespread encoding used for the external Unicode storage (e.g. files and network protocols) is <b>UTF-8</b> which is byte-oriented and so avoids the endianness ambiguities of UTF-16 and UTF-32. UTF-8 uses code units of 8 bits (1 byte); code points beyond the 7-bit ASCII range are represented using a variable number of bytes, which makes random access less efficient than with UTF-32 when it is used for internal representation.</p>
<p>As a visual aid to understanding the differences between the various concepts described so far, look at the different UTF representations of the same code point:</p>
<div class="image">
<img src="overview_unicode_codes.png" alt="overview_unicode_codes.png"/>
</div>
<p>In this particular case UTF-8 requires more space than UTF-16 (3 bytes instead of 2).</p>
<p>Note that from the C/C++ programmer's perspective the situation is further complicated by the fact that the standard type <code>wchar_t</code>, which is usually used to represent Unicode ("wide") strings in C/C++, doesn't have the same size on all platforms. It is 4 bytes under Unix systems, corresponding to the tradition of using UTF-32, but only 2 bytes under Windows, as required for compatibility with the OS which uses UTF-16.</p>
<p>Typically, when UTF-8 is used, code units are stored in <code>char</code> variables, since <code>char</code> is 8 bits wide on almost all systems; when using UTF-16, code units are typically stored in <code>wchar_t</code> variables since <code>wchar_t</code> is at least 16 bits wide on all systems. This is also the approach used by <a class="el" href="classwx_string.html" title="String class for passing textual data to or receiving it from wxWidgets.">wxString</a>. See <a class="el" href="overview_string.html">wxString Overview</a> for more info.</p>
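<p>The mapping from code points to code units described above can be sketched in a few lines of standard C++. This is only an illustration and does not use wxWidgets; the function names <code>encodeUtf8</code> and <code>encodeUtf16</code> are hypothetical, and U+20AC in the usage note is merely an example of a code point needing 3 bytes in UTF-8 but a single 16-bit code unit in UTF-16 (it is not necessarily the one shown in the image).</p>

```cpp
#include <cassert>
#include <string>

// Encode one Unicode code point as UTF-8 (1 to 4 code units of 8 bits each).
std::string encodeUtf8(char32_t cp)
{
    // Surrogate values and values above U+10FFFF are not valid code points.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    std::string out;
    if (cp < 0x80) {                 // 7-bit ASCII: one byte
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {         // two bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // rest of the BMP: three bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // other planes: four bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encode one Unicode code point as UTF-16 (1 or 2 code units of 16 bits each).
std::u16string encodeUtf16(char32_t cp)
{
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    std::u16string out;
    if (cp < 0x10000) {              // BMP: a single code unit
        out += static_cast<char16_t>(cp);
    } else {                         // beyond the BMP: a surrogate pair
        const char32_t v = cp - 0x10000;
        out += static_cast<char16_t>(0xD800 + (v >> 10));   // high surrogate
        out += static_cast<char16_t>(0xDC00 + (v & 0x3FF)); // low surrogate
    }
    return out;
}
```

<p>With these helpers, <code>encodeUtf8(0x20AC)</code> yields three code units while <code>encodeUtf16(0x20AC)</code> yields one, and a code point outside the BMP such as U+1F600 yields a surrogate pair in UTF-16.</p>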
<p>See also <a href="http://unicode.org/glossary/">http://unicode.org/glossary/</a> for the official definitions of the terms reported above.</p>
<h1><a class="anchor" id="overview_unicode_supportin"></a>
Unicode Support in wxWidgets</h1>
<h2><a class="anchor" id="overview_unicode_support_default"></a>
Unicode is Always Used by Default</h2>
<p>Since wxWidgets 3.0 Unicode support is always enabled and while building the library without it is still possible, it is not recommended any longer and will cease to be supported in the near future. This means that internally only Unicode strings are used and that, under Microsoft Windows, Unicode system API is used which means that wxWidgets programs require the Microsoft Layer for Unicode to run on Windows 95/98/ME.</p>
<p>However, unlike the Unicode build mode of the previous versions of wxWidgets, this support is mostly transparent: you can still continue to work with the <b>narrow</b> (i.e. current locale-encoded <code>char*</code>) strings even if <b>wide</b> (i.e. UTF-16-encoded <code>wchar_t*</code> or UTF-8-encoded <code>char*</code>) strings are also supported. Any wxWidgets function accepts arguments of either type as both kinds of strings are implicitly converted to <a class="el" href="classwx_string.html" title="String class for passing textual data to or receiving it from wxWidgets.">wxString</a>, so both </p>
<div class="fragment"><div class="line"><a class="code" href="group__group__funcmacro__dialog.html#ga193c64ed4802e379799cdb42de252647" title="Show a general purpose message dialog.">wxMessageBox</a>(<span class="stringliteral">"Hello, world!"</span>);</div>
</div><!-- fragment --><p> and the somewhat less usual </p>
<div class="fragment"><div class="line"><a class="code" href="group__group__funcmacro__dialog.html#ga193c64ed4802e379799cdb42de252647" title="Show a general purpose message dialog.">wxMessageBox</a>(L<span class="stringliteral">"Salut \u00E0 toi!"</span>); <span class="comment">// U+00E0 is "Latin Small Letter a with Grave"</span></div>
</div><!-- fragment --><p> work as expected.</p>
<p>Notice that the narrow strings used with wxWidgets are <em>always</em> assumed to be in the current locale encoding, so writing </p>
<div class="fragment"><div class="line"><a class="code" href="group__group__funcmacro__dialog.html#ga193c64ed4802e379799cdb42de252647" title="Show a general purpose message dialog.">wxMessageBox</a>(<span class="stringliteral">"Salut à toi!"</span>);</div>
</div><!-- fragment --><p> wouldn't work if the encoding used on the user system is incompatible with ISO-8859-1 (or even if the sources were compiled under a different locale in the case of gcc). In particular, the most common encoding used under modern Unix systems is UTF-8 and, as the string above is not a valid UTF-8 byte sequence, nothing would be displayed at all in this case. Thus it is important to <b>never use 8-bit (instead of 7-bit) characters directly in the program source</b> but use wide strings or, alternatively, write: </p>
<div class="fragment"><div class="line"><a class="code" href="group__group__funcmacro__dialog.html#ga193c64ed4802e379799cdb42de252647" title="Show a general purpose message dialog.">wxMessageBox</a>(<a class="code" href="classwx_string.html#a2ddc1b7c8e1eb9adbf5874dead5b180b" title="Converts C string encoded in UTF-8 to wxString.">wxString::FromUTF8</a>(<span class="stringliteral">"Salut \xC3\xA0 toi!"</span>));</div>
<div class="line"> <span class="comment">// in UTF-8 the character U+00E0 is encoded as the two bytes 0xC3 0xA0</span></div>
</div><!-- fragment --><p>In a similar way, <a class="el" href="classwx_string.html" title="String class for passing textual data to or receiving it from wxWidgets.">wxString</a> provides access to its contents as either a <code>wchar_t</code> or a <code>char</code> character buffer. Of course, the latter only works if the string contains data representable in the current locale encoding. This will always be the case if the string was initially constructed from a narrow string or if it contains only 7-bit ASCII data, but otherwise this conversion is not guaranteed to succeed. And as with the <a class="el" href="classwx_string.html#a2ddc1b7c8e1eb9adbf5874dead5b180b" title="Converts C string encoded in UTF-8 to wxString.">wxString::FromUTF8()</a> example above, you can always use <a class="el" href="classwx_string.html#ac923e0bcfda57ec5064dcade9808db94" title="Same as utf8_str().">wxString::ToUTF8()</a> to retrieve the string contents in UTF-8 encoding – this, unlike converting to <code>char*</code> using the current locale, never fails.</p>
<p>For more info about how <a class="el" href="classwx_string.html" title="String class for passing textual data to or receiving it from wxWidgets.">wxString</a> works, please see the <a class="el" href="overview_string.html">wxString Overview</a>.</p>
<p>To summarize, Unicode support in wxWidgets is mostly <b>transparent</b> for the application and if you use <a class="el" href="classwx_string.html" title="String class for passing textual data to or receiving it from wxWidgets.">wxString</a> objects for storing all the character data in your program there is really nothing special to do. However you should be aware of the potential problems covered by the following section.</p>
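<p>To see why an ISO-8859-1 encoded "Salut à toi!" cannot be interpreted as UTF-8, it may help to look at a small well-formedness check. The following is a minimal sketch in standard C++, independent of wxWidgets; the function name <code>isValidUtf8</code> is hypothetical and the check is not the one wxWidgets itself performs.</p>

```cpp
#include <cstddef>
#include <string>

// Returns true if `bytes` is a well-formed UTF-8 sequence.
bool isValidUtf8(const std::string& bytes)
{
    // Smallest code point that legitimately needs 0, 1, 2 or 3 continuation
    // bytes; anything smaller is an "overlong" (and thus invalid) encoding.
    static const char32_t minForExtra[] = { 0, 0x80, 0x800, 0x10000 };

    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra;  // number of continuation bytes expected
        char32_t cp;        // code point being decoded

        if (lead < 0x80)                { extra = 0; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
        else return false;  // stray continuation byte or invalid lead byte

        if (bytes.size() - i <= extra)
            return false;   // sequence is truncated

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                return false;  // not of the form 10xxxxxx
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < minForExtra[extra]) return false;           // overlong form
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;      // surrogate
        if (cp > 0x10FFFF) return false;                     // out of range

        i += extra + 1;
    }
    return true;
}
```

<p>Here <code>isValidUtf8("Salut \xE0 toi!")</code> is false, because the ISO-8859-1 byte 0xE0 looks like the start of a 3-byte sequence but is followed by a space, whereas <code>isValidUtf8("Salut \xC3\xA0 toi!")</code> is true.</p>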
<h2><a class="anchor" id="overview_unicode_support_utf"></a>
Choosing Unicode Representation</h2>
<p>wxWidgets uses the system <code>wchar_t</code> in the <a class="el" href="classwx_string.html" title="String class for passing textual data to or receiving it from wxWidgets.">wxString</a> implementation by default under all systems. Thus, under Microsoft Windows, UCS-2 (simplified version of UTF-16 without support for surrogate characters) is used as <code>wchar_t</code> is 2 bytes on this platform. Under Unix systems, including Mac OS X, UCS-4 (also known as UTF-32) is used by default; however, it is also possible to build wxWidgets to use UTF-8 internally by passing the <code>--enable-utf8</code> option to configure.</p>
<p>The interface provided by <a class="el" href="classwx_string.html" title="String class for passing textual data to or receiving it from wxWidgets.">wxString</a> is the same independently of the format used internally. However, different formats have specific advantages and disadvantages. Notably, under Unix, the underlying graphical toolkit (e.g. GTK+) usually uses UTF-8 encoded strings, and using the same representation for the strings in wxWidgets avoids converting from UTF-32 to UTF-8 and vice versa each time a string is shown in the UI or retrieved from it. The overhead of such conversions is usually negligible for small strings but may be important for some programs. If you believe that it would be advantageous to use UTF-8 for the strings in your particular application, you may rebuild wxWidgets to use UTF-8 as explained above (notice that this is currently not supported under Microsoft Windows and arguably doesn't make much sense there as Windows itself uses UTF-16 and not UTF-8) but be sure to be aware of the performance implications (see <a class="el" href="overview_unicode.html#overview_unicode_performance">Performance Implications of Using UTF-8</a>) of using UTF-8 in <a class="el" href="classwx_string.html" title="String class for passing textual data to or receiving it from wxWidgets.">wxString</a> before doing this!</p>
<p>Generally speaking, you should only use the non-default UTF-8 build in specific circumstances, e.g. when building for resource-constrained systems where the overhead of conversions (and also the reduced memory usage of UTF-8 compared to UTF-32 for the European languages) can be important. If the environment in which your program is running is under your control – as is quite often the case in such scenarios – consider ensuring that the system always uses a UTF-8 locale and use the <code>--enable-utf8only</code> configure option to disable support for the other locales and consider all strings to be in UTF-8. This further reduces the code size and removes the need for conversions in more cases.</p>
<h2><a class="anchor" id="overview_unicode_settings"></a>
Unicode Related Preprocessor Symbols</h2>
<p><code>wxUSE_UNICODE</code> is now defined as 1 to indicate Unicode support. It can be explicitly set to 0 in <code>setup.h</code> under MSW, or you can use <code>--disable-unicode</code> under Unix, but doing this is strongly discouraged. By default, <code>wxUSE_UNICODE_WCHAR</code> is also defined as 1; however, in the UTF-8 build (described in the previous section) it is set to 0 and <code>wxUSE_UNICODE_UTF8</code>, which is usually 0, is set to 1 instead. In the latter case, <code>wxUSE_UTF8_LOCALE_ONLY</code> can also be set to 1 to indicate that all strings are considered to be in UTF-8.</p>
<h1><a class="anchor" id="overview_unicode_pitfalls"></a>
Potential Unicode Pitfalls</h1>
<p>The problems can be separated into three broad classes:</p>
<h2><a class="anchor" id="overview_unicode_compilation_errors"></a>
Unicode-Related Compilation Errors</h2>
<p>Because of the need to support implicit conversions to both <code>char</code> and <code>wchar_t</code>, the <a class="el" href="classwx_string.html" title="String class for passing textual data to or receiving it from wxWidgets.">wxString</a> implementation is rather involved and many of its operators don't return the types which they could be naively expected to return. For example, <code>operator[]</code> returns neither a <code>char</code> nor a <code>wchar_t</code> but an object of a helper class <a class="el" href="classwx_uni_char.html" title="This class represents a single Unicode character.">wxUniChar</a> or <a class="el" href="classwx_uni_char_ref.html" title="Writeable reference to a character in wxString.">wxUniCharRef</a> which is implicitly convertible to either. Usually you don't need to worry about this as the conversions do their work behind the scenes; however, in some cases this doesn't work. Here are some examples, using a <a class="el" href="classwx_string.html" title="String class for passing textual data to or receiving it from wxWidgets.">wxString</a> object <code>s</code> and some integer <code>n</code>: </p>
<ul>
<li>Writing <div class="fragment"><div class="line"><span class="keywordflow">switch</span> ( s[n] ) </div>
</div><!-- fragment --> doesn't work because the argument of the switch statement must be an integer expression, so you need to replace <code>s[n]</code> with <div class="fragment"><div class="line">s[n].GetValue() </div>
</div><!-- fragment -->. You may also force the conversion to <code>char</code> or <code>wchar_t</code> by using an explicit cast but beware that converting the value to char uses the conversion to current locale and may return 0 if it fails. Finally notice that writing <div class="fragment"><div class="line">(<a class="code" href="group__group__funcmacro__string.html#gad42f64d8c82f1ce4ae58773a89b2d6a7" title="wxChar is defined to bechar when wxUSE_UNICODE==0wchar_t when wxUSE_UNICODE==1 (the default)...">wxChar</a>)s[n] </div>
</div><!-- fragment --> works both with wxWidgets 3.0 and previous library versions and so should be used for writing code which should be compatible with both 2.8 and 3.0.</li>
</ul>
<ul>
<li>Similarly, <div class="fragment"><div class="line">&amp;s[n] </div>
</div><!-- fragment --> doesn't yield a pointer to char so you may not pass it to functions expecting <code>char*</code> or <code>wchar_t*</code>. Consider using string iterators instead if possible or replace this expression with <div class="fragment"><div class="line">s.<a class="code" href="classwx_string.html#a6418ec90c6d4ffe0b05702be1b35df4f" title="Returns a lightweight intermediate class which is in turn implicitly convertible to both const char* ...">c_str</a>() + n </div>
</div><!-- fragment --> otherwise.</li>
</ul>
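<p>The <code>switch</code> problem described above is a general C++ rule: a class object can be used as a <code>switch</code> condition only if it has exactly one non-explicit conversion to an integral or enumeration type. A minimal sketch with a hypothetical stand-in class <code>CharProxy</code> (this is not the actual wxUniChar implementation, and <code>Classify</code> is likewise an illustrative name) shows why an explicit accessor such as <code>GetValue()</code> resolves the ambiguity:</p>

```cpp
#include <cstdint>

// Hypothetical stand-in for a character proxy which, like the helper classes
// described above, converts implicitly to both char and wchar_t.
class CharProxy
{
public:
    explicit CharProxy(std::uint32_t value) : m_value(value) {}

    // Explicit accessor returning the code point as an integer.
    std::uint32_t GetValue() const { return m_value; }

    // Two implicit conversions to integral types: convenient in most
    // expressions, but ambiguous where the compiler must pick just one.
    operator char() const
        { return m_value < 0x80 ? static_cast<char>(m_value) : '\0'; }
    operator wchar_t() const
        { return static_cast<wchar_t>(m_value); }

private:
    std::uint32_t m_value;
};

int Classify(const CharProxy& ch)
{
    // "switch ( ch )" would not compile here: the compiler cannot choose
    // between operator char() and operator wchar_t().
    switch ( ch.GetValue() )
    {
        case 'a':  return 1;  // plain ASCII letter
        case 0xE0: return 2;  // U+00E0, Latin Small Letter a with Grave
        default:   return 0;
    }
}
```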
<p>Another class of problems is related to the fact that the value returned by <code>c_str()</code> itself is also not just a pointer to a buffer but a value of the helper class wxCStrData which is implicitly convertible to both narrow and wide strings. Again, this is mostly unnoticeable but can result in some problems:</p>
<ul>
<li><p class="startli">You shouldn't pass <code>c_str()</code> result to vararg functions such as standard <code>printf()</code>. Some compilers (notably g++) warn about this but even if they don't, this </p>
<div class="fragment"><div class="line">printf(<span class="stringliteral">"Hello, %s"</span>, s.<a class="code" href="classwx_string.html#a6418ec90c6d4ffe0b05702be1b35df4f" title="Returns a lightweight intermediate class which is in turn implicitly convertible to both const char* ...">c_str</a>()) </div>
</div><!-- fragment --><p> is not going to work. It can be corrected in one of the following ways:</p>
<ul>
<li>Preferred: <div class="fragment"><div class="line">wxPrintf(<span class="stringliteral">"Hello, %s"</span>, s) </div>
</div><!-- fragment --> (notice the absence of <code>c_str()</code>, it is not needed at all with wxWidgets functions)</li>
<li>Compatible with wxWidgets 2.8: <div class="fragment"><div class="line">wxPrintf(<span class="stringliteral">"Hello, %s"</span>, s.<a class="code" href="classwx_string.html#a6418ec90c6d4ffe0b05702be1b35df4f" title="Returns a lightweight intermediate class which is in turn implicitly convertible to both const char* ...">c_str</a>()) </div>
</div><!-- fragment --></li>
<li>Using an explicit conversion to narrow, multibyte, string: <div class="fragment"><div class="line">printf(<span class="stringliteral">"Hello, %s"</span>, (<span class="keyword">const</span> <span class="keywordtype">char</span> *)s.<a class="code" href="classwx_string.html#adcfd12e6d0765b1d74bccc3d63d02e98" title="Returns the multibyte (C string) representation of the string using conv's wxMBConv::cWC2MB method an...">mb_str</a>()) </div>
</div><!-- fragment --></li>
<li>Using a cast to force the issue (listed only for completeness): <div class="fragment"><div class="line">printf(<span class="stringliteral">"Hello, %s"</span>, (<span class="keyword">const</span> <span class="keywordtype">char</span> *)s.<a class="code" href="classwx_string.html#a6418ec90c6d4ffe0b05702be1b35df4f" title="Returns a lightweight intermediate class which is in turn implicitly convertible to both const char* ...">c_str</a>()) </div>
</div><!-- fragment --></li>
</ul>
</li>
</ul>
<ul>
<li>The result of <code>c_str()</code> cannot be cast to <code>char*</code> but only to <code>const char*</code>. Of course, modifying the string via the pointer returned by this method has never been possible, but unfortunately it was occasionally useful to use a <code>const_cast</code> here to pass the value to const-incorrect functions. This can be done either using the new <a class="el" href="classwx_string.html#aedcaea87fc347a940263a533bd56846f" title="Returns an object with string data that is implicitly convertible to char* pointer.">wxString::char_str()</a> (and matching <code>wchar_str()</code>) method or by writing a double cast: <div class="fragment"><div class="line">(<span class="keywordtype">char</span> *)(<span class="keyword">const</span> <span class="keywordtype">char</span> *)s.<a class="code" href="classwx_string.html#a6418ec90c6d4ffe0b05702be1b35df4f" title="Returns a lightweight intermediate class which is in turn implicitly convertible to both const char* ...">c_str</a>() </div>
</div><!-- fragment --></li>
</ul>
<ul>
<li>One of the unfortunate consequences of the possibility to pass <a class="el" href="classwx_string.html" title="String class for passing textual data to or receiving it from wxWidgets.">wxString</a> to <code>wxPrintf()</code> without using <code>c_str()</code> is that it is now impossible to pass the elements of unnamed enumerations to <code>wxPrintf()</code> and other similar vararg functions, i.e. <div class="fragment"><div class="line"><span class="keyword">enum</span> { Red, Green, Blue };</div>
<div class="line">wxPrintf(<span class="stringliteral">"Red is %d"</span>, Red);</div>
</div><!-- fragment --> doesn't compile. The easiest workaround is to give a name to the enum.</li>
</ul>
<p>Other unexpected compilation errors may arise but they should happen even more rarely than the above-mentioned ones and the solution should usually be quite simple: just use the explicit methods of <a class="el" href="classwx_uni_char.html" title="This class represents a single Unicode character.">wxUniChar</a> and wxCStrData classes instead of relying on their implicit conversions if the compiler can't choose among them.</p>
<h2><a class="anchor" id="overview_unicode_data_loss"></a>
Data Loss due To Unicode Conversion Errors</h2>
<p>The <a class="el" href="classwx_string.html" title="String class for passing textual data to or receiving it from wxWidgets.">wxString</a> API provides implicit conversion of the internal Unicode string contents to narrow, char strings. This can be very convenient and is absolutely necessary for backwards compatibility with the existing code using wxWidgets; however, it is a rather dangerous operation as it can easily give unexpected results if the string contents aren't convertible to the current locale encoding.</p>
<p>To be precise, the conversion will always succeed if the string was created from a narrow string initially. It will also succeed if the current encoding is UTF-8 as all Unicode strings are representable in this encoding. However initializing the string using <a class="el" href="classwx_string.html#a2ddc1b7c8e1eb9adbf5874dead5b180b" title="Converts C string encoded in UTF-8 to wxString.">wxString::FromUTF8()</a> method and then accessing it as a char string via its <a class="el" href="classwx_string.html#a6418ec90c6d4ffe0b05702be1b35df4f" title="Returns a lightweight intermediate class which is in turn implicitly convertible to both const char* ...">wxString::c_str()</a> method is a recipe for disaster as the program may work perfectly well during testing on Unix systems using UTF-8 locale but completely fail under Windows where UTF-8 locales are never used because <a class="el" href="classwx_string.html#a6418ec90c6d4ffe0b05702be1b35df4f" title="Returns a lightweight intermediate class which is in turn implicitly convertible to both const char* ...">wxString::c_str()</a> would return an empty string.</p>
<p>The simplest way to ensure that this doesn't happen is to avoid conversions to <code>char*</code> completely by using <a class="el" href="classwx_string.html" title="String class for passing textual data to or receiving it from wxWidgets.">wxString</a> throughout your program. However, if the program never manipulates strings containing 8-bit (non-ASCII) characters internally, using <code>char*</code> pointers is safe as well. So existing code needs to be reviewed when upgrading to wxWidgets 3.0, and new code should be written with this in mind, ideally avoiding implicit conversions to <code>char*</code>.</p>
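<p>The failure mode described in this section can be illustrated without wxWidgets. In the following minimal sketch, ISO-8859-1 stands in for "the current locale encoding" and the hypothetical function <code>toLatin1OrEmpty</code> mimics the all-or-nothing behaviour described above, where a failed conversion yields an empty string; it is an illustration under these assumptions, not the conversion code used by the library.</p>

```cpp
#include <string>

// Convert a UTF-32 string to ISO-8859-1, returning an empty string if any
// character is not representable in the target encoding.
std::string toLatin1OrEmpty(const std::u32string& text)
{
    std::string out;
    for (char32_t cp : text) {
        if (cp > 0xFF)
            return std::string();  // conversion fails as a whole
        // ISO-8859-1 maps code points U+0000..U+00FF directly to bytes.
        out += static_cast<char>(static_cast<unsigned char>(cp));
    }
    return out;
}
```

<p>A string such as <code>U"Salut \u00E0 toi!"</code> converts successfully because U+00E0 exists in ISO-8859-1, but a string containing U+20AC produces an empty result: exactly the kind of silent data loss that can go unnoticed when testing only under a UTF-8 locale.</p>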
<h2><a class="anchor" id="overview_unicode_performance"></a>
Performance Implications of Using UTF-8</h2>
<p>As mentioned above, under Unix systems <a class="el" href="classwx_string.html" title="String class for passing textual data to or receiving it from wxWidgets.">wxString</a> class can use variable-width UTF-8 encoding for internal representation. In this case it can't guarantee constant-time access to N-th element of the string any longer as to find the position of this character in the string we have to examine all the preceding ones. Usually this doesn't matter much because most algorithms used on the strings examine them sequentially anyhow and because <a class="el" href="classwx_string.html" title="String class for passing textual data to or receiving it from wxWidgets.">wxString</a> implements a cache for iterating over the string by index but it can have serious consequences for algorithms using random access to string elements as they typically acquire O(N^2) time complexity instead of O(N) where N is the length of the string.</p>
<p>Even with this index cache, indexed access should be replaced with sequential access using string iterators. For example, a typical loop: </p>
<div class="fragment"><div class="line"><a class="code" href="classwx_string.html" title="String class for passing textual data to or receiving it from wxWidgets.">wxString</a> s(<span class="stringliteral">"hello"</span>);</div>
<div class="line"><span class="keywordflow">for</span> ( <span class="keywordtype">size_t</span> i = 0; i &lt; s.<a class="code" href="classwx_string.html#af63f200410b56436a830550905e20539">length</a>(); i++ )</div>
<div class="line">{</div>
<div class="line"> <span class="keywordtype">wchar_t</span> ch = s[i];</div>
<div class="line"></div>
<div class="line"> <span class="comment">// do something with it</span></div>
<div class="line">}</div>
</div><!-- fragment --><p> should be rewritten as </p>
<div class="fragment"><div class="line"><a class="code" href="classwx_string.html" title="String class for passing textual data to or receiving it from wxWidgets.">wxString</a> s(<span class="stringliteral">"hello"</span>);</div>
<div class="line"><span class="keywordflow">for</span> ( wxString::const_iterator i = s.<a class="code" href="classwx_string.html#ad59ca2dd208720b3cce07d90bcb90093">begin</a>(); i != s.<a class="code" href="classwx_string.html#a6a0f235fff88df5e6b16b5f0e1e719cc">end</a>(); ++i )</div>
<div class="line">{</div>
<div class="line"> <span class="keywordtype">wchar_t</span> ch = *i;</div>
<div class="line"></div>
<div class="line"> <span class="comment">// do something with it</span></div>
<div class="line">}</div>
</div><!-- fragment --><p>Another, similar, alternative is to use pointer arithmetic: </p>
<div class="fragment"><div class="line"><a class="code" href="classwx_string.html" title="String class for passing textual data to or receiving it from wxWidgets.">wxString</a> s(<span class="stringliteral">"hello"</span>);</div>
<div class="line"><span class="keywordflow">for</span> ( <span class="keyword">const</span> <span class="keywordtype">wchar_t</span> *p = s.<a class="code" href="classwx_string.html#a6cd4782263a3ed4064eca915eb6e27e6" title="Converts the strings contents to the wide character representation and returns it as a temporary wxWC...">wc_str</a>(); *p; p++ )</div>
<div class="line">{</div>
<div class="line"> <span class="keywordtype">wchar_t</span> ch = *p;</div>
<div class="line"></div>
<div class="line"> <span class="comment">// do something with it</span></div>
<div class="line">}</div>
</div><!-- fragment --><p> However, this doesn't work correctly for strings with embedded <code>NUL</code> characters, because the loop stops at the first one. Iterators are generally preferred because, unlike raw pointers, they provide some run-time checks (at least in debug builds). If you do use raw pointers, prefer <code>wchar_t</code> pointers to <code>char</code> ones to avoid the data loss caused by conversion, as discussed in the previous section.</p>
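<p>The embedded <code>NUL</code> problem can be illustrated with a minimal sketch that uses only the standard library. Here <code>std::wstring</code> stands in for <a class="el" href="classwx_string.html" title="String class for passing textual data to or receiving it from wxWidgets.">wxString</a> so that the example builds without wxWidgets, and the helper function names are hypothetical: </p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts characters by walking a raw pointer until the first NUL,
// which is what the pointer-arithmetic loop does.
std::size_t count_until_nul(const wchar_t* p)
{
    std::size_t n = 0;
    for ( ; *p; ++p )
        ++n;
    return n;
}

// Counts characters by walking iterators from begin() to end().
std::size_t count_with_iterators(const std::wstring& s)
{
    std::size_t n = 0;
    for ( std::wstring::const_iterator i = s.begin(); i != s.end(); ++i )
        ++n;
    return n;
}

// Builds a 5-character string with a NUL in the middle: 'a' 'b' NUL 'c' 'd'.
std::wstring make_string_with_embedded_nul()
{
    const std::wstring s(L"ab\0cd", 5);
    assert(s.size() == 5);
    return s;
}
```

<p>For the string built by <code>make_string_with_embedded_nul()</code>, the pointer loop sees only the two characters before the <code>NUL</code>, while the iterator loop visits all five.</p>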
<h1><a class="anchor" id="overview_unicode_supportout"></a>
Unicode and the Outside World</h1>
<p>Even though wxWidgets always uses Unicode internally, not all the other libraries and programs do and even those that do use Unicode may use a different encoding of it. So you need to be able to convert the data to various representations and the <a class="el" href="classwx_string.html" title="String class for passing textual data to or receiving it from wxWidgets.">wxString</a> methods <a class="el" href="classwx_string.html#a2fec30dc8959d4fd3a56cd148bf1e57a" title="Converts the string to an ASCII, 7-bit string in the form of a wxCharBuffer (Unicode builds only) or ...">wxString::ToAscii()</a>, <a class="el" href="classwx_string.html#ac923e0bcfda57ec5064dcade9808db94" title="Same as utf8_str().">wxString::ToUTF8()</a> (or its synonym <a class="el" href="classwx_string.html#ad71e3ded85939db8af9eeadfa02719ac" title="Converts the strings contents to UTF-8 and returns it either as a temporary wxCharBuffer object or as...">wxString::utf8_str()</a>), <a class="el" href="classwx_string.html#adcfd12e6d0765b1d74bccc3d63d02e98" title="Returns the multibyte (C string) representation of the string using conv's wxMBConv::cWC2MB method an...">wxString::mb_str()</a>, <a class="el" href="classwx_string.html#a6418ec90c6d4ffe0b05702be1b35df4f" title="Returns a lightweight intermediate class which is in turn implicitly convertible to both const char* ...">wxString::c_str()</a> and <a class="el" href="classwx_string.html#a6cd4782263a3ed4064eca915eb6e27e6" title="Converts the strings contents to the wide character representation and returns it as a temporary wxWC...">wxString::wc_str()</a> can be used for this.</p>
<p><a class="el" href="classwx_string.html#a2fec30dc8959d4fd3a56cd148bf1e57a" title="Converts the string to an ASCII, 7-bit string in the form of a wxCharBuffer (Unicode builds only) or ...">wxString::ToAscii()</a> should only be used for strings containing nothing but 7-bit ASCII characters; anything else will be replaced by some substitution character. <a class="el" href="classwx_string.html#adcfd12e6d0765b1d74bccc3d63d02e98" title="Returns the multibyte (C string) representation of the string using conv's wxMBConv::cWC2MB method an...">wxString::mb_str()</a> converts the string to the encoding used by the current locale and so can return an empty string if the string contains characters not representable in that encoding, as explained in <a class="el" href="overview_unicode.html#overview_unicode_data_loss">Data Loss due To Unicode Conversion Errors</a>. The same applies to <a class="el" href="classwx_string.html#a6418ec90c6d4ffe0b05702be1b35df4f" title="Returns a lightweight intermediate class which is in turn implicitly convertible to both const char* ...">wxString::c_str()</a> if its result is used as a narrow string. Finally, <a class="el" href="classwx_string.html#ac923e0bcfda57ec5064dcade9808db94" title="Same as utf8_str().">wxString::ToUTF8()</a> and <a class="el" href="classwx_string.html#a6cd4782263a3ed4064eca915eb6e27e6" title="Converts the strings contents to the wide character representation and returns it as a temporary wxWC...">wxString::wc_str()</a> never fail: they always return, respectively, a pointer to a <code>char</code> string containing the UTF-8 representation of the string and a pointer to a <code>wchar_t</code> string.</p>
<p><a class="el" href="classwx_string.html" title="String class for passing textual data to or receiving it from wxWidgets.">wxString</a> also provides two convenience functions: <a class="el" href="classwx_string.html#a5aedc23e9cc2774237d99148d0622661" title="Converts given buffer of binary data from 8-bit string to wxString.">wxString::From8BitData()</a> and <a class="el" href="classwx_string.html#afa91a632574bcbba1bf35b54f2c5562a" title="Converts the string to an 8-bit string in ISO-8859-1 encoding in the form of a wxCharBuffer (Unicode ...">wxString::To8BitData()</a>. They can be used to create a <a class="el" href="classwx_string.html" title="String class for passing textual data to or receiving it from wxWidgets.">wxString</a> from arbitrary binary data without assuming that it is in the current locale encoding, and then to get the data back, again without any conversion or, rather, by undoing the conversion used by <a class="el" href="classwx_string.html#a5aedc23e9cc2774237d99148d0622661" title="Converts given buffer of binary data from 8-bit string to wxString.">wxString::From8BitData()</a>. Because of this you should only use <a class="el" href="classwx_string.html#afa91a632574bcbba1bf35b54f2c5562a" title="Converts the string to an 8-bit string in ISO-8859-1 encoding in the form of a wxCharBuffer (Unicode ...">wxString::To8BitData()</a> for the strings created using <a class="el" href="classwx_string.html#a5aedc23e9cc2774237d99148d0622661" title="Converts given buffer of binary data from 8-bit string to wxString.">wxString::From8BitData()</a>. 
Also notice that, in spite of the availability of these functions, <a class="el" href="classwx_string.html" title="String class for passing textual data to or receiving it from wxWidgets.">wxString</a> is not the ideal class for storing arbitrary binary data, as it can take up to 4 times more space than needed (when using the <code>wchar_t</code> internal representation on systems where wide characters are 4 bytes in size), and you should consider using <a class="el" href="classwx_memory_buffer.html" title="A wxMemoryBuffer is a useful data structure for storing arbitrary sized blocks of memory...">wxMemoryBuffer</a> instead.</p>
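<p>The round trip and the storage overhead described above can be modelled with a small standard-library sketch. The functions below are hypothetical stand-ins, not the wxWidgets implementation: they assume the mapping is one wide character per input byte with the same numeric value, which matches the ISO-8859-1 behaviour documented for <code>To8BitData()</code>. </p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for wxString::From8BitData(): every input byte
// becomes one wide character with the same numeric value (0..255).
std::wstring from_8bit_data(const std::string& bytes)
{
    std::wstring out;
    out.reserve(bytes.size());
    for ( std::size_t i = 0; i < bytes.size(); ++i )
        out.push_back(static_cast<wchar_t>(static_cast<unsigned char>(bytes[i])));
    return out;
}

// Hypothetical stand-in for wxString::To8BitData(): undoes the mapping above.
// It is only meaningful for strings produced by from_8bit_data().
std::string to_8bit_data(const std::wstring& s)
{
    std::string out;
    out.reserve(s.size());
    for ( std::size_t i = 0; i < s.size(); ++i )
    {
        // Anything above 0xFF did not come from from_8bit_data().
        assert(static_cast<unsigned long>(s[i]) <= 0xFFul);
        out.push_back(static_cast<char>(static_cast<unsigned char>(s[i])));
    }
    return out;
}

// Number of bytes needed to hold n payload bytes in the wide representation.
std::size_t wide_storage_bytes(std::size_t n)
{
    return n * sizeof(wchar_t);
}
```

<p>In this model every payload byte occupies <code>sizeof(wchar_t)</code> bytes, which is where the factor of up to 4 comes from on systems with 4-byte wide characters.</p>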
<p>A final word of caution: most of these functions may return either a pointer directly to the internal string buffer or a temporary <a class="el" href="classwx_char_buffer.html" title="This is a specialization of wxCharTypeBuffer<T> for char type.">wxCharBuffer</a> or <a class="el" href="classwx_w_char_buffer.html" title="This is a specialization of wxCharTypeBuffer<T> for wchar_t type.">wxWCharBuffer</a> object. Such objects are implicitly convertible to <code>char</code> and <code>wchar_t</code> pointers, respectively, and so the result of, for example, <a class="el" href="classwx_string.html#ac923e0bcfda57ec5064dcade9808db94" title="Same as utf8_str().">wxString::ToUTF8()</a> can always be passed directly to a function taking <code>const char*</code>. However, code such as </p>
<div class="fragment"><div class="line"><span class="keyword">const</span> <span class="keywordtype">char</span> *p = s.<a class="code" href="classwx_string.html#ac923e0bcfda57ec5064dcade9808db94" title="Same as utf8_str().">ToUTF8</a>();</div>
<div class="line">...</div>
<div class="line">puts(p); <span class="comment">// or call any other function taking const char *</span></div>
</div><!-- fragment --><p> does <b>not</b> work because the temporary buffer returned by <a class="el" href="classwx_string.html#ac923e0bcfda57ec5064dcade9808db94" title="Same as utf8_str().">wxString::ToUTF8()</a> is destroyed at the end of the statement that initializes <code>p</code>, which leaves <code>p</code> dangling. To correct this, you should use </p>
<div class="fragment"><div class="line"><span class="keyword">const</span> <a class="code" href="classwx_scoped_char_type_buffer.html">wxScopedCharBuffer</a> p(s.<a class="code" href="classwx_string.html#ac923e0bcfda57ec5064dcade9808db94" title="Same as utf8_str().">ToUTF8</a>());</div>
<div class="line">puts(p);</div>
</div><!-- fragment --><p> which does work.</p>
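<p>The lifetime rule behind this can be sketched without wxWidgets. In the following standard-library model, <code>CharBuffer</code> and <code>to_utf8()</code> are hypothetical stand-ins for <code>wxCharBuffer</code> and <code>wxString::ToUTF8()</code>; it shows only the two safe patterns, since dereferencing the dangling pointer would be undefined behaviour: </p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for wxCharBuffer: owns a copy of the converted
// bytes and converts implicitly to const char*.
class CharBuffer
{
public:
    explicit CharBuffer(const std::string& bytes) : m_bytes(bytes) {}
    operator const char*() const { return m_bytes.c_str(); }

private:
    std::string m_bytes;
};

// Hypothetical stand-in for wxString::ToUTF8(): returns the buffer by value,
// so each call yields a new temporary object.
CharBuffer to_utf8(const std::string& s)
{
    return CharBuffer(s);
}

// Safe pattern 1: pass the temporary directly. It lives until the end of
// the full expression, which includes the call to strlen().
std::size_t length_direct(const std::string& s)
{
    return std::strlen(to_utf8(s));
}

// Safe pattern 2: keep the buffer object itself in a named variable, so the
// pointer obtained from it stays valid for as long as the variable exists.
std::size_t length_via_named_buffer(const std::string& s)
{
    const CharBuffer buf(to_utf8(s));
    const char* p = buf; // valid while buf is alive
    assert(p != NULL);
    return std::strlen(p);
}
```

<p>Storing only the <code>const char*</code> obtained from <code>to_utf8(s)</code> would reproduce the broken pattern, because the owning object would be gone by the next statement.</p>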
<p>Similarly, wxWX2WCbuf can be used for the return type of <a class="el" href="classwx_string.html#a6cd4782263a3ed4064eca915eb6e27e6" title="Converts the strings contents to the wide character representation and returns it as a temporary wxWC...">wxString::wc_str()</a>. But, once again, none of these cryptic types is really needed if you just pass the return value of any of the functions mentioned in this section to another function directly. </p>
</div></div><!-- contents -->
<address class="footer">
<small>
Generated on Thu Nov 27 2014 13:46:42 for wxWidgets by <a href="http://www.doxygen.org/index.html" target="_new">Doxygen</a> 1.8.2
</small>
</address>
<script src="wxwidgets.js" type="text/javascript"></script>
</div><!-- #page_container -->
</body>
</html>