#LyX 1.5.5 created this file. For more info see http://www.lyx.org/
\lyxformat 276
\begin_document
\begin_header
\textclass report
\begin_preamble
\usepackage{hyperref}
\end_preamble
\language english
\inputencoding auto
\font_roman default
\font_sans default
\font_typewriter default
\font_default_family default
\font_sc false
\font_osf false
\font_sf_scale 100
\font_tt_scale 100
\graphics default
\paperfontsize default
\spacing single
\papersize a4paper
\use_geometry false
\use_amsmath 1
\use_esint 1
\cite_engine basic
\use_bibtopic false
\paperorientation portrait
\secnumdepth 3
\tocdepth 3
\paragraph_separation skip
\defskip medskip
\quotes_language english
\papercolumns 1
\papersides 1
\paperpagestyle default
\tracking_changes false
\output_changes false
\author ""
\author ""
\end_header
\begin_body
\begin_layout Title
The new string unicode type
\end_layout
\begin_layout Author
Marco van de Voort
\end_layout
\begin_layout Standard
Version: 0.02
\begin_inset Graphics
filename unicode_small.jpg
\end_inset
\end_layout
\begin_layout Standard
\begin_inset LatexCommand tableofcontents
\end_inset
\end_layout
\begin_layout Section
Introduction
\end_layout
\begin_layout Standard
Lately there has been some discussion about a new unicode type, mostly due
to a request from the Lazarus team to support unicode in file operations
(for filenames, not handling of unicode files).
A few proposals were made on the fpc-pascal mailing list, and some discussion
followed, but it died out, and a lot of the details of the proposals
were only discussed in subthreads.
\end_layout
\begin_layout Standard
I decided to try to summarize all positions and requirements, at least as
I saw them, as a kind of discussion document.
During the discussions I also detailed the requirements I had in mind a
bit more, so I decided to write them down too.
\end_layout
\begin_layout Standard
Versioning:
\end_layout
\begin_layout Itemize
First version mostly my own writeup.
Was originally meant to highlight the flaws that I saw in Florian's original
proposal.
There might still be some of the negative sentiment left, please skip it.
\end_layout
\begin_layout Itemize
Second version mostly Florian's feedback which I commented on
\end_layout
\begin_layout Itemize
Third version vastly expanded the Tiburón paragraph when CG lifted the veil
a bit late July/early August, and the hybrid model.
\end_layout
\begin_layout Itemize
Fourth version added the
\begin_inset Quotes eld
\end_inset
economics
\begin_inset Quotes erd
\end_inset
paragraph, expanded the hybrid model and mentions Yury's proposal and wiki
page.
\end_layout
\begin_layout Subsection
Tiburón
\end_layout
\begin_layout Standard
Tiburón is the codename for what is supposed to be the next version of Delphi
(version 2008?), and is supposed to have unicode.
While we currently do not follow Delphi compatibility slavishly, it should
only be broken if there are good reasons.
A main reason for this is to not make life too hard on Delphi open source
projects that also want to support FPC/Lazarus.
Slowly details about Tiburón are starting to appear in CG oriented blogs
(e.g.
Andreas Bauer's).
\end_layout
\begin_layout Itemize
A new utf-16 ref counted unicode stringtype is added.
\end_layout
\begin_deeper
\begin_layout Itemize
s[x] doesn't take care of surrogates.
\end_layout
\begin_layout Itemize
It is not yet clear if and how it supports endianness.
\end_layout
\end_deeper
\begin_layout Itemize
Ansistring becomes a basetype for all 1 byte based encodings (ansi, codepages,
UTF-8), based on the fact that for internal windows functions, UTF-8 is treated
as a codepage.
\end_layout
\begin_deeper
\begin_layout Itemize
To define a stringtype for a certain (Windows) codepage enumeration value,
write: type mycodepagestring = type ansistring(1251);
\end_layout
\begin_layout Itemize
Conversions that write a non UTF-8 codepage can be lossy.
\end_layout
\begin_layout Itemize
UTF-8 is codepage 65001 (ident CP_UTF8)
\end_layout
\begin_layout Itemize
codepage $FFFF is used for an
\begin_inset Quotes eld
\end_inset
Rawbyte
\begin_inset Quotes erd
\end_inset
ansistring that is never converted, it's binary copied into the target.
\end_layout
\begin_layout Itemize
Probably codepage value $0 is used for the old ansistring.
The conversions to and from this type (which codepage?) are not clear.
\end_layout
\begin_layout Itemize
It seems that the typing of ansistring has become stronger and honors TYPE:
a type declared as something = TYPE ansistring is now really an incompatible
type.
\end_layout
\begin_layout Itemize
Conversions are done over UTF-16, but this might be a Windows implementation
detail.
(IOW on Unix use UTF-8)
\end_layout
\end_deeper
\begin_layout Standard
This quick summary has three aspects I don't like for porting to FPC:
\end_layout
\begin_layout Enumerate
The use of windows specific codepage enumeration values in language syntax.
However maybe they are really serious about the constants use, and this
is livable.
In my opinion it is the VCLs job to encapsulate the winapi gore, and if
it can't be avoided, at least encourage a clean use.
Daniel notes that there are not many platform independent choices to begin
with.
\end_layout
\begin_layout Enumerate
The fact that conversions between codepages are automated and can fail.
(see also the discussion about codepages in the critique of Florian's proposal)
This means that if you use codepage strings, you must be very careful with
your codepaths, so that you can be pretty sure that there aren't alternate
paths that mutilate data.
\end_layout
\begin_layout Enumerate
UTF-8 and UTF-16 are scattered over two different types.
This solution is non-orthogonal.
\end_layout
\begin_layout Standard
They probably had the same problem as we have with multiple-encoding types (see
\begin_inset Quotes eld
\end_inset
Granularity of []
\begin_inset Quotes erd
\end_inset
), but chose to keep this compiletime by dividing the types according to
1 or 2 byte granularity.
Maybe this also has some advantages in the compiler (being able to treat
tunicodestring and twidestring the same here and there).
And they don't support UTF-32, probably because windows doesn't (or it
isn't used).
\end_layout
\begin_layout Standard
Another question mark is the fact that a lot of new ansistring variants
are introduced that are apparently type safe.
This raises the question what the stringtype is in e.g.
variant (my guess: all non ansi ansistrings are converted to either widestring
or a new tunicodestring field).
\end_layout
\begin_layout Subsection
The encodings
\end_layout
\begin_layout Standard
The three main encodings are UTF-8, UTF-16 and UTF-32.
An important property of these is that they are basically different ways
to describe the same thing, so they can be converted to each other pretty
easily and safely.
Note that the multi byte encodings (16 and 32 (?)) also have big endian
and little endian variants.
\end_layout
\begin_layout Standard
However for now I'm going to
\series bold
forget about big endian versus little endian
\series default
.
This kind of cross-platform compatibility is fairly rarely a problem.
Only files that are shared between different architectures need to insert
conversions, and this can be better done manually.
The same goes for arbitrary other sources that might have a different encoding.
\end_layout
\begin_layout Standard
Besides these three main encodings, conversions of the string type to and
from the older codepages could be useful too, because the world won't become
unicode instantly, and ansistrings are here to stay for a while.
Most notably Florian's proposal has some (potential) support for other
codepages too, though not many details.
\end_layout
\begin_layout Subsection
Economics of the encodings
\end_layout
\begin_layout Standard
In one of the unicode discussions Daniel posted this link:
\begin_inset LatexCommand url
name " http://unicode.org/notes/tn12/ "
target " http://unicode.org/notes/tn12/ "
\end_inset
.
I just had some discussion about this on a different mailing list on the
subject of which is the ideal encoding, and here are some of my comments about
encoding economics.
Note that not all points are meant as arguments in favour of UTF-8 per
se, just observations.
\end_layout
\begin_layout Itemize
First of all, the question is mostly irrelevant since the choice of
primary encoding (and endianness if >8 bits) for a platform/target has been
made already by the OS and the general ABI.
Deviating from this to simplify things for possible multiplatform programmers
at the expense of people programming for the platform natively is IMHO not an
option.
\end_layout
\begin_layout Itemize
An often heard but misinformed statement is that everything but ansi is worse
in utf-8.
This is not true: everything above ascii up to codepoint $0800 is equal
in size (two bytes) between UTF-8 and UTF-16.
This range contains Cyrillic as well as several popular languages from
the Semitic group like Hebrew and Arabic.
\end_layout
\begin_layout Itemize
The simplicity of UTF-16 is quoted in a lot of places, the above link included.
While some may see it as acceptable to cut corners in applications, it is
IMHO not acceptable to break full unicode compliance in a serious library,
and most of all, an RTL.
This means that most speed dependent routines in an app must be able to
handle UTF-16 surrogates and maybe also endianness.
I personally think that serious applications shouldn't cut corners either.
Note though that surrogates don't hinder all string routines.
\end_layout
\begin_layout Itemize
Routines that don't need to decode UTF-8 multi-byte sequences and encounter
mostly Latin scripts are faster in UTF-8.
(fewer bytes to move)
\end_layout
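\begin_layout Standard
The size claims above can be checked with a small sketch.
Utf8Bytes and Utf16Bytes are hypothetical helper names, not existing RTL
functions; they only return the number of bytes one codepoint needs and
do not reject the surrogate range.
\end_layout
\begin_layout LyX-Code
program encsize;
\end_layout
\begin_layout LyX-Code
{$mode objfpc}
\end_layout
\begin_layout LyX-Code
// Bytes needed to store one codepoint in UTF-8.
\end_layout
\begin_layout LyX-Code
function Utf8Bytes(cp: Cardinal): Integer;
\end_layout
\begin_layout LyX-Code
begin
\end_layout
\begin_layout LyX-Code
  if cp < $80 then Result := 1
\end_layout
\begin_layout LyX-Code
  else if cp < $800 then Result := 2
\end_layout
\begin_layout LyX-Code
  else if cp < $10000 then Result := 3
\end_layout
\begin_layout LyX-Code
  else Result := 4;
\end_layout
\begin_layout LyX-Code
end;
\end_layout
\begin_layout LyX-Code
// Bytes needed to store one codepoint in UTF-16.
\end_layout
\begin_layout LyX-Code
function Utf16Bytes(cp: Cardinal): Integer;
\end_layout
\begin_layout LyX-Code
begin
\end_layout
\begin_layout LyX-Code
  if cp < $10000 then Result := 2 else Result := 4;
\end_layout
\begin_layout LyX-Code
end;
\end_layout
\begin_layout LyX-Code
begin
\end_layout
\begin_layout LyX-Code
  // U+0416 is a Cyrillic letter, U+4E2D a CJK ideograph.
\end_layout
\begin_layout LyX-Code
  WriteLn(Utf8Bytes($0416), ' ', Utf16Bytes($0416));
\end_layout
\begin_layout LyX-Code
  WriteLn(Utf8Bytes($4E2D), ' ', Utf16Bytes($4E2D));
\end_layout
\begin_layout LyX-Code
end.
\end_layout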
\begin_layout Standard
Btw I use
\begin_inset LatexCommand url
name "http://www.unicode.org/roadmaps/bmp/"
target "http://www.unicode.org/roadmaps/bmp/"
\end_inset
to quickly see what language groups are where in the BMP.
\end_layout
\begin_layout Subsection
Granularity of []
\begin_inset LatexCommand index
name "par:granularity"
\end_inset
\end_layout
\begin_layout Standard
One of the benefits of the discussion was that it called some attention
to the s[] operator.
First because it was a possible weakness of Florian's proposal (that got
remedied later), but the more important one from a design perspective is
what c:=s[5]; is supposed to mean with (s in [UTF8,UTF16,UTF32]).
\end_layout
\begin_layout Standard
Let's take utf16 for a moment, and assume we have 10 codepoints, and every
second one is a surrogate pair.
Then there are two possible meanings:
\end_layout
\begin_layout Subsubsection
Meaning 1: index means codepoints
\end_layout
\begin_layout Standard
In this meaning, a string is (a view on) an array of codepoints.
So
\end_layout
\begin_layout Standard
c:=s[5]; means the 5th codepoint.
A codepoint can be >2 bytes, so the type of
\begin_inset Quotes eld
\end_inset
c
\begin_inset Quotes erd
\end_inset
must be able to contain a 32-bit value.
The four codepoints before it include two surrogate pairs, so the address
of this char is @s[1]+4*2+2*2=@s[1]+12 (all in bytes).
\end_layout
\begin_layout Subsubsection
Meaning II: index means granularity of the encoding
\end_layout
\begin_layout Standard
In this meaning the string is (a view on) an array with the granularity of
the encoding.
So 1 in the case of UTF-8, 2 in the case of UTF-16 etc.
\end_layout
\begin_layout Standard
c:=s[5]; in UTF-16 means the element at @s[1]+4*2=@s[1]+8 (all in bytes),
whether or not it is part of a surrogate pair.
\end_layout
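\begin_layout Standard
To make the difference between the two meanings concrete, here is a minimal
sketch, not RTL code.
CodeUnitIndex is a hypothetical helper, and a widestring stands in for a
UTF-16 string.
It finds the element at which the n-th codepoint starts: under meaning 1
every s[n] needs such a scan, under meaning II s[n] is a plain array access.
\end_layout
\begin_layout LyX-Code
program cpindex;
\end_layout
\begin_layout LyX-Code
{$mode objfpc}
\end_layout
\begin_layout LyX-Code
// Index (in 16-bit elements) of the n-th codepoint, or 0 if the
\end_layout
\begin_layout LyX-Code
// string holds fewer than n codepoints.
\end_layout
\begin_layout LyX-Code
function CodeUnitIndex(const s: WideString; n: Integer): Integer;
\end_layout
\begin_layout LyX-Code
var
\end_layout
\begin_layout LyX-Code
  i, cp: Integer;
\end_layout
\begin_layout LyX-Code
begin
\end_layout
\begin_layout LyX-Code
  Result := 0;
\end_layout
\begin_layout LyX-Code
  if n < 1 then Exit;
\end_layout
\begin_layout LyX-Code
  i := 1;
\end_layout
\begin_layout LyX-Code
  cp := 1;
\end_layout
\begin_layout LyX-Code
  while (i <= Length(s)) and (cp < n) do
\end_layout
\begin_layout LyX-Code
  begin
\end_layout
\begin_layout LyX-Code
    // A high surrogate ($D800..$DBFF) opens a two-element codepoint.
\end_layout
\begin_layout LyX-Code
    if (Ord(s[i]) >= $D800) and (Ord(s[i]) <= $DBFF) then
\end_layout
\begin_layout LyX-Code
      Inc(i, 2)
\end_layout
\begin_layout LyX-Code
    else
\end_layout
\begin_layout LyX-Code
      Inc(i);
\end_layout
\begin_layout LyX-Code
    Inc(cp);
\end_layout
\begin_layout LyX-Code
  end;
\end_layout
\begin_layout LyX-Code
  if i <= Length(s) then Result := i;
\end_layout
\begin_layout LyX-Code
end;
\end_layout
\begin_layout LyX-Code
var
\end_layout
\begin_layout LyX-Code
  s: WideString;
\end_layout
\begin_layout LyX-Code
begin
\end_layout
\begin_layout LyX-Code
  SetLength(s, 4);
\end_layout
\begin_layout LyX-Code
  s[1] := 'a';
\end_layout
\begin_layout LyX-Code
  s[2] := WideChar($D835); // high surrogate
\end_layout
\begin_layout LyX-Code
  s[3] := WideChar($DC00); // low surrogate
\end_layout
\begin_layout LyX-Code
  s[4] := 'b';
\end_layout
\begin_layout LyX-Code
  // Meaning 1: where does the third codepoint ('b') start?
\end_layout
\begin_layout LyX-Code
  WriteLn(CodeUnitIndex(s, 3));
\end_layout
\begin_layout LyX-Code
  // Meaning II: s[3] is simply the third 16-bit element.
\end_layout
\begin_layout LyX-Code
  WriteLn(Ord(s[3]));
\end_layout
\begin_layout LyX-Code
end.
\end_layout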
\begin_layout Subsubsection
Granularity conclusion
\end_layout
\begin_layout Standard
(Note that the same problem also goes for Length(s).
Is it the number of codepoints, or of elements in the granularity of the
encoding?)
\end_layout
\begin_layout Standard
The problem with the array of codepoints is that typical code like
\end_layout
\begin_layout LyX-Code
for i:=1 to length(s) do
\end_layout
\begin_layout LyX-Code
s[i]:=' ';
\end_layout
\begin_layout Standard
is very expensive since
\end_layout
\begin_layout Itemize
the address of s[x] depends on all codepoints before codepoint x.
This can make the above loop quadratic in the number of codepoint steps
(on average (n^2)/2).
Most platforms also use a procedure to iterate over codepoints.
\end_layout
\begin_layout Itemize
each codepoint assignment can possibly be an insertion or deletion of bytes,
since the assigned codepoint can be smaller or larger than the codepoint
already in place.
\end_layout
\begin_layout Standard
IMHO this opens a can of worms where we don't want to go
\begin_inset Foot
status collapsed
\begin_layout Standard
Since we don't have any optimizations that optimize loops in an advanced
way, I don't think it is acceptable to waive this point in the hope that
future optimizations will solve this.
\end_layout
\end_inset
.
However it might be an argument to (also) support UTF-32, since that does
allow fairly easy char manipulation, with minimal limitations: If it is
a routine that is not really much used, the simplest way to convert would
be to do something like
\end_layout
\begin_layout LyX-Code
procedure dosomething (var s:utf16string);
\end_layout
\begin_layout LyX-Code
var internals: utf32string;
\end_layout
\begin_layout LyX-Code
begin
\end_layout
\begin_layout LyX-Code
internals:=s; // force conversion to utf32.
\end_layout
\begin_layout LyX-Code
<<insert old ansistring code here, but only update
\begin_inset Quotes eld
\end_inset
char
\begin_inset Quotes erd
\end_inset
to a 32-bits type>>
\end_layout
\begin_layout LyX-Code
s:=internals; // convert back.
\end_layout
\begin_layout LyX-Code
end;
\end_layout
\begin_layout Standard
Of course this is not perfect (e.g.
charsets won't work because even a charset for the defined codepoints would
be in the magnitude of 125k), but it is easy, and avoids messing too much
with working code.
\end_layout
\begin_layout Section
Requirements
\end_layout
\begin_layout Standard
The requirements are a bit of a problem because there are several factors
that are not compatible with each other (e.g.
speed and ease of use), and tradeoffs vary.
Anyway the main requirements in a very broad definition are:
\end_layout
\begin_layout Itemize
Ease of use
\end_layout
\begin_layout Itemize
Reasonable to good performance
\end_layout
\begin_layout Itemize
Compatible with normal ansistring handling as much as reasonably possible.
\end_layout
\begin_layout Itemize
Compatible with Tiburón
\end_layout
\begin_layout Itemize
Multi platform aspects.
\end_layout
\begin_layout Itemize
Respect certain FPC traditions, most notably the need to combine code from
different origins/styles into one program
(e.g.
shortstring TP code and ansistring Delphi code are currently combinable
in one program, and a single directive controls the meaning of the
\begin_inset Quotes eld
\end_inset
string
\begin_inset Quotes erd
\end_inset
type to make it compatible on a per unit basis with both)
\end_layout
\begin_layout Standard
Most of the unices use UTF-8, Windows uses UTF-16.
\end_layout
\begin_layout Subsection
Required
\begin_inset Quotes eld
\end_inset
new
\begin_inset Quotes erd
\end_inset
functions.
\end_layout
\begin_layout Enumerate
Regardless which choice is made for the default (see
\begin_inset LatexCommand vref
reference "par:granularity"
\end_inset
), Length(s) should be available in both meanings: length in codepoints
and in granularity length.
\end_layout
\begin_layout Enumerate
charat(n) - returns codepoint [n]...
assuming we chose the encoding granularity.
\end_layout
\begin_layout Enumerate
charnext (strnext out of delphi compat?)
\end_layout
\begin_layout Standard
How many of these will/should be (partially) inlinable? Is it worth it?
It seems that most libc's use functions, not macros, which might be an
indicator that procedural overhead is less than the actual operation.
\end_layout
\begin_layout Subsection
The Windows
\begin_inset Quotes eld
\end_inset
W
\begin_inset Quotes erd
\end_inset
problem
\end_layout
\begin_layout Standard
Sideways related is the windows problem that on NT special functions must
be called for unicode strings; all these functions end in -W instead of
-A.
Also all these symbols (and their record definitions) are typically organized
in the windows header source as
\end_layout
\begin_layout LyX-Code
{$ifdef unicode}
\end_layout
\begin_layout LyX-Code
procedure xxx(arguments); stdcall; external 'kernel32.dll' name 'xxxW';
\end_layout
\begin_layout LyX-Code
{$else}
\end_layout
\begin_layout LyX-Code
procedure xxx(arguments); stdcall; external 'kernel32.dll' name 'xxxA';
\end_layout
\begin_layout LyX-Code
{$endif}
\end_layout
\begin_layout Standard
The actual problem is that these calls don't exist (or work) on windows
9x.
There are several solutions for this problem:
\end_layout
\begin_layout Enumerate
A combination of runtime OS detection and loading.
Problem is that the windows header sets are huge, and there is a great
potential for error.
\end_layout
\begin_layout Enumerate
Splitting the win32 target over unicode support.
The current implementation is parameterized and moved to a shared dir.
The win9x target sets some types and defines and imports these includefiles,
and so does the NT-unicode target, which defines UNICODE.
\end_layout
\begin_layout Standard
Personally I like the splitting.
Note that the
\begin_inset Quotes eld
\end_inset
win9x
\begin_inset Quotes erd
\end_inset
target will still work on win NT/2k/XP, and is in fact a
\begin_inset Quotes eld
\end_inset
real
\begin_inset Quotes erd
\end_inset
win32.
Note that the target names were picked in a hurry, maybe
\begin_inset Quotes eld
\end_inset
win32
\begin_inset Quotes erd
\end_inset
and
\begin_inset Quotes eld
\end_inset
winnt
\begin_inset Quotes erd
\end_inset
are better target names.
\end_layout
\begin_layout Standard
\series bold
NOTE:
\series default
I haven't seen any clear evidence yet to which set of API functions UTF-8
needs to be passed (NT only).
If any.
\end_layout
\begin_layout Section
The proposals
\end_layout
\begin_layout Standard
In the mailing list discussion there were 3 proposals that I'll summarize briefly
below.
\end_layout
\begin_layout Subsection
Felipe's proposal.
\end_layout
\begin_layout Standard
Felipe's proposal was the first, and was mostly still oriented towards the
direct File I/O problem.
He proposed to use UTF-16 exclusively.
Period.
\end_layout
\begin_layout Standard
Advantages
\end_layout
\begin_layout Enumerate
Simplicity
\end_layout
\begin_layout Standard
Disadvantages
\end_layout
\begin_layout Enumerate
No way to support UTF-8, this means that all dealing with UTF-8 (the main
encoding on Unix) must be manual on pchar level or through careful use
of ansistring workarounds, or face heavy repeated conversion penalties.
This also means code must be written to pass a readonly unicode string
to a library on unix, instead of simply passing pwidechar(s).
It is a windows centric proposition.
\end_layout
\begin_layout Enumerate
No utf-32, so also no simple way to manipulate individual codepoints.
\end_layout
\begin_layout Standard
Keep in mind that this also means some complications for e.g.
standard file I/O, that must change from UTF-8 to UTF-16.
\end_layout
\begin_layout Subsection
Marco's proposal
\end_layout
\begin_layout Standard
This proposal was more in line with earlier discussions on core, simply
have three separate types for the three encodings, that autoconvert reasonably,
and the implementation is nearly the same.
To keep RTL size down, most system calls would only accept strings in the
system encoding, except for VAR parameters that need to be wrapped or double
implemented.
\end_layout
\begin_layout Standard
So for clarity: a utf8string, a utf16string and a utf32string type.
\end_layout
\begin_layout Standard
Advantages
\end_layout
\begin_layout Enumerate
The string types that a routine uses signal the encodings it accepts/returns.
\end_layout
\begin_layout Enumerate
Maximum speed for code that uses only one encoding, no conversion, no runtime
behaviour.
\end_layout
\begin_layout Enumerate
The fact that the types have exactly the same content in a different
representation (4 types, together with UTF-32 and the COM widestring) made
me hope that the implementation would not be that much more complicated than
one + a bunch of special options and directives.
\end_layout
\begin_layout Enumerate
Interfacing with systems with a different encoding is simple.
Convert to correct type if not already, and then typecast.
\end_layout
\begin_layout Enumerate
Tiburón code could simply use UTF16 string everywhere (a simple {$H like
directive), and be nearly or totally compatible, and yet mixable.
\end_layout
\begin_layout Standard
Disadvantages
\end_layout
\begin_layout Enumerate
Most new types, thus also the most conversions.
\end_layout
\begin_layout Enumerate
Separate types, so one can't pass UTF-8 string to a procedure with a var
or out parameter of UTF-16 type.
\end_layout
\begin_layout Enumerate
Only overloading and conversion as instrument for routines that must accept
multiple encodings.
Not unlike ansistring and shortstring, IOW with the same problems.
\end_layout
\begin_layout Enumerate
More types also means a lot more vt<x> constants in tvarrecs, variants etc.
\end_layout
\begin_layout Enumerate
Prefix records of types can't be Tiburón compatible
\end_layout
\begin_layout Subsubsection
Aliases
\end_layout
\begin_layout Standard
To make this work properly, there will be some additional aliases:
\end_layout
\begin_layout Itemize
An alias to a type that always is the same as the system encoding.
If you use this you are always safe performance wise.
\end_layout
\begin_layout Itemize
An alias to utf16string under whatever identifier Tiburón uses for it.
\end_layout
\begin_layout Standard
This also means that encoding agnostic code should use the system encoding,
since the average string will probably be in the system encoding.
\end_layout
\begin_layout Subsection
Florian's proposal.
\end_layout
\begin_layout Standard
Florian proposed to have a single unicode type that can represent the three
encodings (UTF-x), and maybe others too (the old ascii codepages as well
as LE vs BE).
The principle is the same as ansistring, additional needed info is prefixed
at addresses before s[1].
Currently it is only the encoding type, but it could be expanded.
\end_layout
\begin_layout Standard
There are a lot more implementation details to be resolved in this proposal.
\end_layout
\begin_layout Standard
Florian says the following about the granularity of the type.
\end_layout
\begin_layout Quote
to overcome the indexing problem efficiently when using an encoding field
(this is not about surrogates), we could do the following: introduce a
compiler switch {$unicodestringindex default,byte,word,dword}.
In default mode the compiler gets a shifting value from the encoding field
(this is 4 bytes anyways and could be split into 1 byte shifting, 2 bytes
encoding, 1 bytes reserved).
In the other modes the compiler uses the given size when indexing.
For example, a Tiberion (or how is it called?) switch could set this to
word.
\end_layout
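\begin_layout Standard
As an illustration only, the default mode described in this quote could
look roughly like the sketch below.
The record layout and the names TEncodingField and ElementOffset are
assumptions based on the 1+2+1 byte split mentioned in the quote; nothing
here is an agreed design.
\end_layout
\begin_layout LyX-Code
program encfield;
\end_layout
\begin_layout LyX-Code
{$mode objfpc}
\end_layout
\begin_layout LyX-Code
type
\end_layout
\begin_layout LyX-Code
  // Hypothetical layout of the 4-byte encoding field.
\end_layout
\begin_layout LyX-Code
  TEncodingField = packed record
\end_layout
\begin_layout LyX-Code
    Shift: Byte;     // log2 of the element size: 0, 1 or 2
\end_layout
\begin_layout LyX-Code
    Encoding: Word;  // encoding identifier
\end_layout
\begin_layout LyX-Code
    Reserved: Byte;
\end_layout
\begin_layout LyX-Code
  end;
\end_layout
\begin_layout LyX-Code
// Byte offset of s[i] (1-based) relative to s[1] in default mode.
\end_layout
\begin_layout LyX-Code
function ElementOffset(const f: TEncodingField; i: Integer): Integer;
\end_layout
\begin_layout LyX-Code
begin
\end_layout
\begin_layout LyX-Code
  Result := (i - 1) shl f.Shift;
\end_layout
\begin_layout LyX-Code
end;
\end_layout
\begin_layout LyX-Code
var
\end_layout
\begin_layout LyX-Code
  f: TEncodingField;
\end_layout
\begin_layout LyX-Code
begin
\end_layout
\begin_layout LyX-Code
  f.Shift := 1;      // 2-byte elements, as for UTF-16
\end_layout
\begin_layout LyX-Code
  f.Encoding := 0;   // placeholder value
\end_layout
\begin_layout LyX-Code
  f.Reserved := 0;
\end_layout
\begin_layout LyX-Code
  WriteLn(ElementOffset(f, 5));
\end_layout
\begin_layout LyX-Code
end.
\end_layout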
\begin_layout Standard
Later however he says (in response to the below granularity challenge)
\end_layout
\begin_layout Quote
I described this already in detail in my first mail: just in one of the
four bytes available for storing the encoding.
\end_layout
\begin_layout Standard
Now I'm confused :)
\end_layout
\begin_layout Standard
Anyway about the performance he says:
\end_layout
\begin_layout Quote
The approach has the big advantage, that you really need all procedures
only once if desired.
For example e.g.
linux would get only utf-8 routines by default, utf-16 is converted to
utf-8 at the entry of the helper procedures if needed.
Usually, no conversion would be necessary because you see seldomly utf-16
in linux applications so only the check if the input strings are really
utf-8 is necessary, this is very cheap because the data is anyways already
in a cache line.
\end_layout
\begin_layout Standard
He also says
\end_layout
\begin_layout Quote
Keep in mind in your response, that we want also handle other formats than
utf-8 or utf-16 if needed :)
\end_layout
\begin_layout Standard
Michael says:
\end_layout
\begin_layout Quote
For the LCL/fpGUI/MSEGui programmers, nothing changes, you can even throw
away your own conversion routines.
You need only a single call just prior to passing a string to the OS/GUI
system: ForceEncoding().
No ifdefs needed, all is transparant.
\end_layout
\begin_layout Standard
The type is a bit too complex to have a series of simple advantages and
disadvantages, so I am just going to describe some of the problems, and ask
for clarification.
\end_layout
\begin_layout Subsubsection
The biggest problem: can't declare what type to expect.
\end_layout
\begin_layout Standard
My initial reaction was
\begin_inset Quotes eld
\end_inset
oh my, a runtime type in Pascal, what about performance? It will be pretty
much like variants, and they are known to be slow.
We will become Perl/Python
\begin_inset Quotes erd
\end_inset
\end_layout
\begin_layout Standard
However, while I still have serious doubts about performance, that is not
the bigger problem, since with pretty much any solution you can always
isolate the speed-dependent part, force the encoding to be constant
(preferably the system encoding), and be done with it.
Moreover, there is much to
\end_layout
\begin_layout Standard
The
\emph on
bigger
\emph default
problem, however, is that you no longer declare the encoding in the types
of parameters, local variables and return values.
This means manually inserting Michael's ForceEncoding() calls everywhere,
also in existing Tiburón code.
It invalidates my own requirement (admittedly not one of Florian's) that
existing code keeps running with only some global mode settings.
(assuming Tiburón is
\begin_inset Quotes eld
\end_inset
existing code
\begin_inset Quotes erd
\end_inset
)
\end_layout
\begin_layout Standard
I can illustrate that with two examples, or thought experiments:
\end_layout
\begin_layout Subsubsection
Existing code
\end_layout
\begin_layout Standard
Assume I have a unit with UTF-16 Tiburón code, and another unit with UTF-8
code of Lazarus descent in which I globally replaced
\begin_inset Quotes eld
\end_inset
ansistring
\begin_inset Quotes erd
\end_inset
by unicodestring (or whatever identifier for the native type) to upgrade
it to
\begin_inset Quotes eld
\end_inset
native
\begin_inset Quotes erd
\end_inset
unicode on a Unix target.
\end_layout
\begin_layout Standard
Now we want these to call each other, and neither of them is prepared
for the polymorphic type to contain the wrong encoding.
Worse, literals in the Tiburón code will probably be created in the native
(UTF-8) encoding.
In turn, the UTF-8 routines might occasionally receive a string that has
passed through the Tiburón code and contains UTF-16 encoding.
The only solution is to hunt through the _entire_ source code for all these
points, and insert ForceEncoding() statements for all parameters and after
every assignment of a literal.
Here another potential problem surfaces: an empty string might not be forceable.
\end_layout
\begin_layout Standard
This is an extremely hard sell to Delphi users, and IMHO not necessary anyway.
Something will have to be done about this.
\end_layout
\begin_layout Standard
A solution would be the hybrid proposal, see the separate paragraph further
down.
It is more or less the declarative behaviour of my proposal combined with
the implementation of Florian's.
\end_layout
\begin_layout Subsubsection
The granularity
\end_layout
\begin_layout Standard
The granularity problem is closely related to the previous one: if you write
a procedure, you must be prepared to handle all encodings.
Now assume I honour that, and I am trying to make a procedure that understands
both encodings, e.g.
a dual encoding version of the
\begin_inset Quotes eld
\end_inset
granularity
\begin_inset Quotes erd
\end_inset
problem above.
Then according to Florian's first quote above
\emph on
I only have one compile-time granularity, while the encoding of my unicodestring
is determined at runtime!
\emph default
\end_layout
\begin_layout LyX-Code
{$unicodestringindex <what to put here?>}
\end_layout
\begin_layout LyX-Code
\end_layout
\begin_layout LyX-Code
procedure myuniversalstringroutine(s:tunicodestring);
\end_layout
\begin_layout LyX-Code
begin
\end_layout
\begin_layout LyX-Code
if encodingof(s)=utf_8 Then
\end_layout
\begin_layout LyX-Code
begin
\end_layout
\begin_layout LyX-Code
for i:=1 to length(s) do // s in single bytes
\end_layout
\begin_layout LyX-Code
s[i]:='a'; // s[i] in single byte values.
type of literal?
\end_layout
\begin_layout LyX-Code
end
\end_layout
\begin_layout LyX-Code
else
\end_layout
\begin_layout LyX-Code
begin // utf 16
\end_layout
\begin_layout LyX-Code
for i:=1 to length(s) do // length(s) in 2 byte values
\end_layout
\begin_layout LyX-Code
s[i]:='a'; // s[i] in two byte values.
type of literal?
\end_layout
\begin_layout LyX-Code
end
\end_layout
\begin_layout LyX-Code
\end_layout
\begin_layout LyX-Code
end;
\end_layout
\begin_layout LyX-Code
\end_layout
\begin_layout LyX-Code
begin
\end_layout
\begin_layout LyX-Code
myuniversalstringroutine(getutf16stringroutine);
\end_layout
\begin_layout LyX-Code
myuniversalstringroutine(getutf8stringroutine);
\end_layout
\begin_layout LyX-Code
end;
\end_layout
\begin_layout LyX-Code
\end_layout
\begin_layout Standard
The conclusion of this is IMHO that the shift size should be part of the
runtime string too, i.e. a value of 1, 2 or 4 stored somewhere at a negative
offset from the pointer.
This carries a performance penalty, since s[4] then has to be resolved at
runtime instead of at compile time.
\end_layout
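\begin_layout Standard
To make that cost a bit more concrete, below is a minimal sketch of what
indexing looks like once the element size travels with the string.
The record TPolyString, its ElemSize field and the GetElem helper are
hypothetical names invented for this illustration only; they are not part
of Florian's proposal or of the RTL, and the little endian payload is an
assumption too.
The point is merely that s[i] becomes a multiplication by a runtime value
followed by a size dependent load.
\end_layout
\begin_layout LyX-Code
program polyindex;
\end_layout
\begin_layout LyX-Code
{$mode objfpc}
\end_layout
\begin_layout LyX-Code
type
\end_layout
\begin_layout LyX-Code
  TPolyString = record
\end_layout
\begin_layout LyX-Code
    ElemSize: Byte;       // hypothetical: 1, 2 or 4 bytes per element
\end_layout
\begin_layout LyX-Code
    Data: array of Byte;  // raw payload, assumed little endian
\end_layout
\begin_layout LyX-Code
  end;
\end_layout
\begin_layout LyX-Code
\end_layout
\begin_layout LyX-Code
// Equivalent of s[i] (1-based) when the element size is only known at runtime.
\end_layout
\begin_layout LyX-Code
function GetElem(const s: TPolyString; i: SizeInt): Cardinal;
\end_layout
\begin_layout LyX-Code
var
\end_layout
\begin_layout LyX-Code
  k, base: SizeInt;
\end_layout
\begin_layout LyX-Code
begin
\end_layout
\begin_layout LyX-Code
  Result := 0;
\end_layout
\begin_layout LyX-Code
  base := (i - 1) * s.ElemSize;          // runtime scale, not a constant
\end_layout
\begin_layout LyX-Code
  for k := s.ElemSize - 1 downto 0 do    // assemble the element byte by byte
\end_layout
\begin_layout LyX-Code
    Result := (Result shl 8) or s.Data[base + k];
\end_layout
\begin_layout LyX-Code
end;
\end_layout
\begin_layout LyX-Code
\end_layout
\begin_layout LyX-Code
var
\end_layout
\begin_layout LyX-Code
  s: TPolyString;
\end_layout
\begin_layout LyX-Code
begin
\end_layout
\begin_layout LyX-Code
  s.ElemSize := 2;
\end_layout
\begin_layout LyX-Code
  SetLength(s.Data, 4);
\end_layout
\begin_layout LyX-Code
  s.Data[0] := $61; s.Data[1] := $00;    // 'a' as UTF-16LE
\end_layout
\begin_layout LyX-Code
  s.Data[2] := $AC; s.Data[3] := $20;    // U+20AC as UTF-16LE
\end_layout
\begin_layout LyX-Code
  WriteLn(GetElem(s, 2));                // prints 8364 (= $20AC)
\end_layout
\begin_layout LyX-Code
end.
\end_layout
\begin_layout Standard
With a compile-time granularity the same access is a single load with a
constant scale, which is the difference I am worried about here.
\end_layout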
\begin_layout Subsubsection
Performance
\end_layout
\begin_layout Standard
A runtime solution is generally slower than a compile-time one.
While performance isn't my biggest gripe, the problem is that I only see
a small advantage in return: working VAR parameters and a lower need for
overloading.
In exchange we get a lot more checks (the encoding check must come after
the nil check, which will complicate code generation).
\end_layout
\begin_layout Standard
Florian claims to partially earn this back with fewer conversions overall,
but I don't buy that.
Simply having a type alias for whatever encoding is the system encoding
will achieve the same.
Moreover, the decision which operand to convert then lies with the compiler,
which generally has more information at its disposal than the runtime library.
Take for instance the following example:
\end_layout
\begin_layout LyX-Code
var s1: utf8string; // utf-8 is the system encoding, we're on unix
\end_layout
\begin_layout LyX-Code
s2: utf16string;
\end_layout
\begin_layout LyX-Code
s1:=someinit8();
\end_layout
\begin_layout LyX-Code
s2:=someinit16();
\end_layout
\begin_layout LyX-Code
s1:=s2+s1;
\end_layout
\begin_layout LyX-Code
utf8routine(s1);
\end_layout
\begin_layout Standard
(note that in Florian's proposal all string types are the same; in his
case, read the declarations of s1 and s2 as
\begin_inset Quotes eld
\end_inset
strings initialised with a utf-8/16 value
\begin_inset Quotes erd
\end_inset
)
\end_layout
\begin_layout Standard
Now the runtime libs can probably not exploit the fact that the system encoding
is more useful, and s1:=s2+s1; might end up converting the utf-8 type to
utf-16, and storing the utf-16 result in s1.
And then the check in utf8routine() will have to change the encoding again.
\end_layout
\begin_layout Standard
Also the
\begin_inset Quotes eld
\end_inset
leave out routines
\begin_inset Quotes erd
\end_inset
argument is IMHO bogus, since if the types of my proposal autoconvert (not
unlike uniquestring()), the more complex routines like the bulky floating
point and datetimeformatters could also be available only in the system
encoding (which is most likely to happen), give or take a few small wrappers
to work around VAR parameter problems.
\end_layout
\begin_layout Subsubsection
Alternate encodings.
\end_layout
\begin_layout Standard
Florian also mentioned an interest in supporting the old codepages as part
of the requirements.
I don't know if that was only a teaser because his proposal had more leeway
for that or because he
\emph on
really
\emph default
saw a case and a need for that.
\end_layout
\begin_layout Standard
However, while I entertained the idea as interesting for a while, I'm not
so convinced this is doable, for two main reasons:
\end_layout
\begin_layout Itemize
the UTF-x to UTF-y conversions are guaranteed to work if the data is not
corrupt, and if there are corner cases, they are few and far between.
But each codepage can only represent a really small subset of the codepoints
of the UTF encodings, and of the other codepages.
The error handling is IMHO a problem.
\end_layout
\begin_layout Itemize
Because the encoding of the polymorphic type doesn't change unless forced,
these strange encodings could penetrate everywhere in your codebase when
strings are simply passed on unmodified.
The number of exceptions from unexpected encodings and conversion failures
all over your (till now working) code will be confusing, unless you want to
manually wrap all string code in try..except in case some conversion goes wrong.
\end_layout
\begin_layout Subsubsection
Florian's response
\end_layout
\begin_layout Standard
The discussion about this article doesn't seem to have changed much about
either party's viewpoint, except maybe on the
\begin_inset Quotes eld
\end_inset
existing code
\begin_inset Quotes erd
\end_inset
problem,
\begin_inset Foot
status collapsed
\begin_layout Standard
Note that existing code is not only code that is
\begin_inset Quotes eld
\end_inset
old
\begin_inset Quotes erd
\end_inset
or
\begin_inset Quotes eld
\end_inset
Tiburón
\begin_inset Quotes erd
\end_inset
but in general all code that can only accept one encoding.
\end_layout
\end_inset
\end_layout
\begin_layout Standard
(quote Florian)
\end_layout
\begin_layout Standard
Indeed, it requires some work but there are several possibilities:
\end_layout
\begin_layout Enumerate
add a switch for runtime checks about string encoding
\end_layout
\begin_layout Enumerate
add a switch to enforce encoding at procedure entries and for function results
\end_layout
\begin_layout Standard
The code needs to be reworked anyways.
\end_layout
\begin_layout Standard
(...end quote..)
\end_layout
\begin_layout Standard
I think this is butt ugly, and overly complicated, but at least it fixes
my biggest problem.
Maybe if we can predeclare a lot of these as types, we can actually confine
the clutter.
\end_layout
\begin_layout Standard
\series bold
note:
\series default
see also the
\begin_inset LatexCommand vref
reference "sub:Problems-of-hybrid:"
\end_inset
paragraph, and the hybrid paragraph in general.
There are complications.
\end_layout
\begin_layout Subsection
Yury's proposal
\end_layout
\begin_layout Standard
Yury wrote something up independently at
\begin_inset LatexCommand htmlurl
name "FPC wiki about FPC Unicode support"
target "http://wiki.freepascal.org/FPC_Unicode_support"
\end_inset
.
It is the same basic idea as Florian's: encoding type and granularity of
encoding in the prefix of the string.
He goes a step further and also seems to hint at reimplementing the existing
types on top of this scheme (which is not realistic for shortstring, and
maybe not for widestring).
\end_layout
\begin_layout Standard
What I like in Yury's proposal is that he combines Florian's implementation
with declarations that show real types, which I favour; in short, it is
essentially the hybrid model of the next paragraph in the rough.
The hybrid model does divide the encodings over two types, the new
unicodestring and ansistring (the latter for the codepage stuff; if we do
that, there is no need to be Tiburón incompatible).
\begin_layout Subsection
The hybrid model
\end_layout
\begin_layout Standard
This is just a short thought experiment; this part hasn't been discussed
much with Florian and Michael yet (though Yury seems to have come up with
it independently).
The main reason for it is that the typing is my main grudge against Florian's
proposal, and the performance less so.
It builds a bit on Florian's willingness to tackle some of those problems
with directives.
If that gives enough leeway to define types, Florian's proposal morphs
into this hybrid model.
\end_layout
\begin_layout Standard
So assume we combine Florian's proposal with some of the requirements (but
not the implementation) that are the basis for Marco's example.
This means one base unicode type that can be parameterized into four types
for declaration purposes: a single implementation of a generic, runtime-dependent
unicodestring as per Florian's proposal, plus separate (sub)types per encoding
(TUTF8string, TUTF16string and TUTF32string).
These latter might not be real (compiler) types, but could be defined as below.
\end_layout
\begin_layout LyX-Code
Type
\end_layout
\begin_layout LyX-Code
tutf8string = type tunicodestring(Mandatory_UTF8); // or however we style
the modifier.
\end_layout
\begin_layout LyX-Code
\end_layout
\begin_layout LyX-Code
alternate syntax (?), more in Florian's style with directives
\end_layout
\begin_layout LyX-Code
\end_layout
\begin_layout LyX-Code
type
\end_layout
\begin_layout LyX-Code
{$unicodetype mandatory_utf8}
\end_layout
\begin_layout LyX-Code
tutf8string = tunicodestring;
\end_layout
\begin_layout LyX-Code
{$unicodetype general}
\end_layout
\begin_layout LyX-Code
\end_layout
\begin_layout Standard
However, because these forced types are 100% compatible with the general
type, far fewer overloads are needed for VAR parameters and helpers (e.g.
for variant, which only contains the general type).
The hybrid model offers:
\end_layout
\begin_layout Itemize
the desired compiletime declarative behaviour, to be able to declare when
a certain routine only accepts/expects a certain encoding.
\end_layout
\begin_layout Itemize
the ability to have compiletime type knowledge to rearrange expressions
to prefer a certain encoding result (see the performance paragraph) by
using a different declaration (much like the Tiburón ansistring), if all
components are typed.
\end_layout
\begin_layout Itemize
In Tiburón mode, the string type is equal to TUTF16string, but can be mixed
with any of the other types.
\end_layout
\begin_layout Itemize
On implementation level, a single runtime implementation.
No 3 ways of overloading.
\end_layout
\begin_layout Itemize
The whole situation is then somewhat analogous to shortstring vs shortstring[]
(from a typing point of view).
All RTL routines take var shortstring, and accept all sizes.
However if you want to only support a certain size (like extensions), you
can declare it using var s:shortstring[2].
The unicode equivalent would constrain the expected encoding rather than
the size.
Also, e.g. variant would hold an FPC unicodestring, which is compatible
without conversion with utf8string, utf16string and utf32string.
\end_layout
\begin_layout Standard
The main advantage would be keeping the number of type-dependent routines
(as opposed to the more general routines) down, while retaining the
compile-time typed behaviour.
I am slowly becoming convinced this might be a doable way, but I need
Florian's input for that.
\end_layout
\begin_layout Standard
As a bonus, expanding this hybrid with Tiburón functionality is also possible,
with quite high Tiburón compat, at the expense of having two UTF-8 types:
\end_layout
\begin_layout Itemize
Implement the hybrid type as above, except that TUnicodestring only has
the three base encodings.
\end_layout
\begin_layout Itemize
Implement the Tiburón ansistring, utf-8 included.
This then also covers the codepage support.
\end_layout
\begin_layout Itemize
In Delphi (Tiburón?) mode, the default unicodestring is an alias for
TFlorianString(talwaysutf16).
\end_layout
\begin_layout Standard
This trick allows Tiburón code to simply be added under the relevant IFDEFs,
and keep working.
And to gain maximum performance (by avoiding too many conversions) on Unix
with utf-8, people could remove the Tiburón flags on a per unit basis after
inspecting the encoding state of a unit.
\end_layout
\begin_layout Standard
The problems that I can think of are that there is still a VAR problem,
and that the type and conversion situation in the compiler might get
complicated (a lot of combinations), even though the number of overloads
might be smaller.
\end_layout
\begin_layout Subsubsection
Problems of hybrid: var
\begin_inset LatexCommand label
name "sub:Problems-of-hybrid:"
\end_inset
\end_layout
\begin_layout Standard
VAR remains a problem, but afaik it is fixable.
\end_layout
\begin_layout Standard
Assume we have an RTL routine that does
\end_layout
\begin_layout LyX-Code
procedure stringroutine (var s:TUNICODESTRING);
\end_layout
\begin_layout LyX-Code
begin
\end_layout
\begin_layout LyX-Code
forceencoding(s,utf16); // code only can deal with utf16
\end_layout
\begin_layout LyX-Code
process; // the utf16 processing code.
\end_layout
\begin_layout LyX-Code
end;
\end_layout
\begin_layout LyX-Code
\end_layout
\begin_layout LyX-Code
and
\end_layout
\begin_layout LyX-Code
\end_layout
\begin_layout LyX-Code
var n : tUTF8String;
\end_layout
\begin_layout LyX-Code
begin
\end_layout
\begin_layout LyX-Code
{assign n}
\end_layout
\begin_layout LyX-Code
stringroutine(n); // we can pass, since this is not a fully different
type, but a TUNICODESTRING with a bit of affinity
\end_layout
\begin_layout LyX-Code
// BUT: here n would be UTF16, a violation of the type declaration.
\end_layout
\begin_layout LyX-Code
end
\end_layout
\begin_layout Standard
This means that the compiler should insert a forceencoding after passing
a string with encoding affinity to a generic VAR parameter.
I hope that is doable.
\end_layout
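\begin_layout Standard
As a rough sketch of what that inserted call amounts to, the program below
simulates the situation with an explicit record.
TPolyString, TEnc and this ForceEncoding are hypothetical stand-ins made
up for the illustration, not the proposed compiler type; the sketch also
assumes a Free Pascal RTL that provides UnicodeString, UTF8String, UTF8Encode
and UTF8Decode.
The last ForceEncoding call in the main block is written by hand here, but
it is the one the compiler would have to generate.
\end_layout
\begin_layout LyX-Code
program varaffinity;
\end_layout
\begin_layout LyX-Code
{$mode objfpc}{$H+}
\end_layout
\begin_layout LyX-Code
type
\end_layout
\begin_layout LyX-Code
  TEnc = (encUtf8, encUtf16);
\end_layout
\begin_layout LyX-Code
  // Hypothetical stand-in for the polymorphic string: a tag plus payload.
\end_layout
\begin_layout LyX-Code
  TPolyString = record
\end_layout
\begin_layout LyX-Code
    Enc: TEnc;
\end_layout
\begin_layout LyX-Code
    U8: UTF8String;      // valid when Enc = encUtf8
\end_layout
\begin_layout LyX-Code
    U16: UnicodeString;  // valid when Enc = encUtf16
\end_layout
\begin_layout LyX-Code
  end;
\end_layout
\begin_layout LyX-Code
\end_layout
\begin_layout LyX-Code
procedure ForceEncoding(var s: TPolyString; target: TEnc);
\end_layout
\begin_layout LyX-Code
begin
\end_layout
\begin_layout LyX-Code
  if s.Enc = target then
\end_layout
\begin_layout LyX-Code
    Exit;                // already in the requested encoding
\end_layout
\begin_layout LyX-Code
  if target = encUtf16 then
\end_layout
\begin_layout LyX-Code
  begin
\end_layout
\begin_layout LyX-Code
    s.U16 := UTF8Decode(s.U8);
\end_layout
\begin_layout LyX-Code
    s.U8 := '';
\end_layout
\begin_layout LyX-Code
  end
\end_layout
\begin_layout LyX-Code
  else
\end_layout
\begin_layout LyX-Code
  begin
\end_layout
\begin_layout LyX-Code
    s.U8 := UTF8Encode(s.U16);
\end_layout
\begin_layout LyX-Code
    s.U16 := '';
\end_layout
\begin_layout LyX-Code
  end;
\end_layout
\begin_layout LyX-Code
  s.Enc := target;
\end_layout
\begin_layout LyX-Code
end;
\end_layout
\begin_layout LyX-Code
\end_layout
\begin_layout LyX-Code
// Generic routine: internally it can only deal with UTF-16.
\end_layout
\begin_layout LyX-Code
procedure StringRoutine(var s: TPolyString);
\end_layout
\begin_layout LyX-Code
begin
\end_layout
\begin_layout LyX-Code
  ForceEncoding(s, encUtf16);
\end_layout
\begin_layout LyX-Code
end;
\end_layout
\begin_layout LyX-Code
\end_layout
\begin_layout LyX-Code
var
\end_layout
\begin_layout LyX-Code
  n: TPolyString;        // plays the role of a variable declared as tUTF8String
\end_layout
\begin_layout LyX-Code
begin
\end_layout
\begin_layout LyX-Code
  n.Enc := encUtf8;
\end_layout
\begin_layout LyX-Code
  n.U8 := 'abc';
\end_layout
\begin_layout LyX-Code
  StringRoutine(n);            // n comes back tagged as UTF-16
\end_layout
\begin_layout LyX-Code
  ForceEncoding(n, encUtf8);   // the call the compiler would have to insert
\end_layout
\begin_layout LyX-Code
  WriteLn(n.Enc = encUtf8);    // prints TRUE
\end_layout
\begin_layout LyX-Code
end.
\end_layout
\begin_layout Standard
Note that this costs a second conversion on every such call, so routines
that are called often with typed strings would still benefit from a typed
variant.
\end_layout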
\begin_layout LyX-Code
\end_layout
\end_body
\end_document