1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235
|
// © 2016 and later: Unicode, Inc. and others.
// License & terms of use: http://www.unicode.org/copyright.html
/*
*******************************************************************************
* Copyright (C) 2009-2014, International Business Machines Corporation and
* others. All Rights Reserved.
*******************************************************************************
*/
#include <_foundation_unicode/utypes.h>
#if !UCONFIG_NO_COLLATION
#include <_foundation_unicode/alphaindex.h>
#include <_foundation_unicode/coll.h>
#include <_foundation_unicode/localpointer.h>
#include <_foundation_unicode/normalizer2.h>
#include <_foundation_unicode/tblcoll.h>
#include <_foundation_unicode/uchar.h>
#include <_foundation_unicode/ulocdata.h>
#include <_foundation_unicode/uniset.h>
#include <_foundation_unicode/uobject.h>
#include <_foundation_unicode/usetiter.h>
#include <_foundation_unicode/utf16.h>
#include "cmemory.h"
#include "cstring.h"
#include "uassert.h"
#include "uvector.h"
#include "uvectr64.h"
//#include <string>
//#include <iostream>
U_NAMESPACE_BEGIN
namespace {
/**
* Prefix string for Chinese index buckets.
* See http://unicode.org/repos/cldr/trunk/specs/ldml/tr35-collation.html#Collation_Indexes
*/
const char16_t BASE[1] = { 0xFDD0 };
const int32_t BASE_LENGTH = 1;
UBool isOneLabelBetterThanOther(const Normalizer2 &nfkdNormalizer,
const UnicodeString &one, const UnicodeString &other);
} // namespace
static int32_t U_CALLCONV
collatorComparator(const void *context, const void *left, const void *right);
static int32_t U_CALLCONV
recordCompareFn(const void *context, const void *left, const void *right);
// UVector<Record *> support function, delete a Record.
static void U_CALLCONV
alphaIndex_deleteRecord(void *obj) {
delete static_cast<AlphabeticIndex::Record *>(obj);
}
namespace {
UnicodeString *ownedString(const UnicodeString &s, LocalPointer<UnicodeString> &owned,
UErrorCode &errorCode) {
if (U_FAILURE(errorCode)) { return nullptr; }
if (owned.isValid()) {
return owned.orphan();
}
UnicodeString *p = new UnicodeString(s);
if (p == nullptr) {
errorCode = U_MEMORY_ALLOCATION_ERROR;
}
return p;
}
inline UnicodeString *getString(const UVector &list, int32_t i) {
return static_cast<UnicodeString *>(list[i]);
}
inline AlphabeticIndex::Bucket *getBucket(const UVector &list, int32_t i) {
return static_cast<AlphabeticIndex::Bucket *>(list[i]);
}
inline AlphabeticIndex::Record *getRecord(const UVector &list, int32_t i) {
return static_cast<AlphabeticIndex::Record *>(list[i]);
}
/**
* Like Java Collections.binarySearch(List, String, Comparator).
*
* @return the index>=0 where the item was found,
* or the index<0 for inserting the string at ~index in sorted order
*/
int32_t binarySearch(const UVector &list, const UnicodeString &s, const Collator &coll) {
if (list.size() == 0) { return ~0; }
int32_t start = 0;
int32_t limit = list.size();
for (;;) {
int32_t i = (start + limit) / 2;
const UnicodeString *si = static_cast<UnicodeString *>(list.elementAt(i));
UErrorCode errorCode = U_ZERO_ERROR;
UCollationResult cmp = coll.compare(s, *si, errorCode);
if (cmp == UCOL_EQUAL) {
return i;
} else if (cmp < 0) {
if (i == start) {
return ~start; // insert s before *si
}
limit = i;
} else {
if (i == start) {
return ~(start + 1); // insert s after *si
}
start = i;
}
}
}
} // namespace
// The BucketList is not in the anonymous namespace because only Clang
// seems to support its use in other classes from there.
// However, we also don't need U_I18N_API because it is not used from outside the i18n library.
class BucketList : public UObject {
public:
BucketList(UVector *bucketList, UVector *publicBucketList)
: bucketList_(bucketList), immutableVisibleList_(publicBucketList) {
int32_t displayIndex = 0;
for (int32_t i = 0; i < publicBucketList->size(); ++i) {
getBucket(*publicBucketList, i)->displayIndex_ = displayIndex++;
}
}
// The virtual destructor must not be inline.
// See ticket #8454 for details.
virtual ~BucketList();
int32_t getBucketCount() const {
return immutableVisibleList_->size();
}
int32_t getBucketIndex(const UnicodeString &name, const Collator &collatorPrimaryOnly,
UErrorCode &errorCode) {
// binary search
int32_t start = 0;
int32_t limit = bucketList_->size();
while ((start + 1) < limit) {
int32_t i = (start + limit) / 2;
const AlphabeticIndex::Bucket *bucket = getBucket(*bucketList_, i);
UCollationResult nameVsBucket =
collatorPrimaryOnly.compare(name, bucket->lowerBoundary_, errorCode);
if (nameVsBucket < 0) {
limit = i;
} else {
start = i;
}
}
const AlphabeticIndex::Bucket *bucket = getBucket(*bucketList_, start);
if (bucket->displayBucket_ != nullptr) {
bucket = bucket->displayBucket_;
}
return bucket->displayIndex_;
}
/** All of the buckets, visible and invisible. */
UVector *bucketList_;
/** Just the visible buckets. */
UVector *immutableVisibleList_;
};
BucketList::~BucketList() {
delete bucketList_;
if (immutableVisibleList_ != bucketList_) {
delete immutableVisibleList_;
}
}
AlphabeticIndex::ImmutableIndex::~ImmutableIndex() {
delete buckets_;
delete collatorPrimaryOnly_;
}
int32_t
AlphabeticIndex::ImmutableIndex::getBucketCount() const {
return buckets_->getBucketCount();
}
int32_t
AlphabeticIndex::ImmutableIndex::getBucketIndex(
const UnicodeString &name, UErrorCode &errorCode) const {
return buckets_->getBucketIndex(name, *collatorPrimaryOnly_, errorCode);
}
const AlphabeticIndex::Bucket *
AlphabeticIndex::ImmutableIndex::getBucket(int32_t index) const {
if (0 <= index && index < buckets_->getBucketCount()) {
return icu::getBucket(*buckets_->immutableVisibleList_, index);
} else {
return nullptr;
}
}
AlphabeticIndex::AlphabeticIndex(const Locale &locale, UErrorCode &status)
: inputList_(nullptr),
labelsIterIndex_(-1), itemsIterIndex_(0), currentBucket_(nullptr),
maxLabelCount_(99),
initialLabels_(nullptr), firstCharsInScripts_(nullptr),
collator_(nullptr), collatorPrimaryOnly_(nullptr),
buckets_(nullptr) {
init(&locale, status);
}
AlphabeticIndex::AlphabeticIndex(RuleBasedCollator *collator, UErrorCode &status)
: inputList_(nullptr),
labelsIterIndex_(-1), itemsIterIndex_(0), currentBucket_(nullptr),
maxLabelCount_(99),
initialLabels_(nullptr), firstCharsInScripts_(nullptr),
collator_(collator), collatorPrimaryOnly_(nullptr),
buckets_(nullptr) {
init(nullptr, status);
}
AlphabeticIndex::~AlphabeticIndex() {
delete collator_;
delete collatorPrimaryOnly_;
delete firstCharsInScripts_;
delete buckets_;
delete inputList_;
delete initialLabels_;
}
AlphabeticIndex &AlphabeticIndex::addLabels(const UnicodeSet &additions, UErrorCode &status) {
if (U_FAILURE(status)) {
return *this;
}
initialLabels_->addAll(additions);
clearBuckets();
return *this;
}
AlphabeticIndex &AlphabeticIndex::addLabels(const Locale &locale, UErrorCode &status) {
addIndexExemplars(locale, status);
clearBuckets();
return *this;
}
AlphabeticIndex::ImmutableIndex *AlphabeticIndex::buildImmutableIndex(UErrorCode &errorCode) {
if (U_FAILURE(errorCode)) { return nullptr; }
// In C++, the ImmutableIndex must own its copy of the BucketList,
// even if it contains no records, for proper memory management.
// We could clone the buckets_ if they are not nullptr,
// but that would be worth it only if this method is called multiple times,
// or called after using the old-style bucket iterator API.
LocalPointer<BucketList> immutableBucketList(createBucketList(errorCode));
LocalPointer<RuleBasedCollator> coll(collatorPrimaryOnly_->clone());
if (immutableBucketList.isNull() || coll.isNull()) {
errorCode = U_MEMORY_ALLOCATION_ERROR;
return nullptr;
}
ImmutableIndex *immIndex = new ImmutableIndex(immutableBucketList.getAlias(), coll.getAlias());
if (immIndex == nullptr) {
errorCode = U_MEMORY_ALLOCATION_ERROR;
return nullptr;
}
// The ImmutableIndex adopted its parameter objects.
immutableBucketList.orphan();
coll.orphan();
return immIndex;
}
int32_t AlphabeticIndex::getBucketCount(UErrorCode &status) {
initBuckets(status);
if (U_FAILURE(status)) {
return 0;
}
return buckets_->getBucketCount();
}
int32_t AlphabeticIndex::getRecordCount(UErrorCode &status) {
if (U_FAILURE(status) || inputList_ == nullptr) {
return 0;
}
return inputList_->size();
}
void AlphabeticIndex::initLabels(UVector &indexCharacters, UErrorCode &errorCode) const {
U_ASSERT(indexCharacters.hasDeleter());
const Normalizer2 *nfkdNormalizer = Normalizer2::getNFKDInstance(errorCode);
if (U_FAILURE(errorCode)) { return; }
const UnicodeString &firstScriptBoundary = *getString(*firstCharsInScripts_, 0);
const UnicodeString &overflowBoundary =
*getString(*firstCharsInScripts_, firstCharsInScripts_->size() - 1);
// We make a sorted array of elements.
// Some of the input may be redundant.
// That is, we might have c, ch, d, where "ch" sorts just like "c", "h".
// We filter out those cases.
UnicodeSetIterator iter(*initialLabels_);
while (U_SUCCESS(errorCode) && iter.next()) {
const UnicodeString *item = &iter.getString();
LocalPointer<UnicodeString> ownedItem;
UBool checkDistinct;
int32_t itemLength = item->length();
if (!item->hasMoreChar32Than(0, itemLength, 1)) {
checkDistinct = false;
} else if(item->charAt(itemLength - 1) == 0x2a && // '*'
item->charAt(itemLength - 2) != 0x2a) {
// Use a label if it is marked with one trailing star,
// even if the label string sorts the same when all contractions are suppressed.
ownedItem.adoptInstead(new UnicodeString(*item, 0, itemLength - 1));
item = ownedItem.getAlias();
if (item == nullptr) {
errorCode = U_MEMORY_ALLOCATION_ERROR;
return;
}
checkDistinct = false;
} else {
checkDistinct = true;
}
if (collatorPrimaryOnly_->compare(*item, firstScriptBoundary, errorCode) < 0) {
// Ignore a primary-ignorable or non-alphabetic index character.
} else if (collatorPrimaryOnly_->compare(*item, overflowBoundary, errorCode) >= 0) {
// Ignore an index character that will land in the overflow bucket.
} else if (checkDistinct &&
collatorPrimaryOnly_->compare(*item, separated(*item), errorCode) == 0) {
// Ignore a multi-code point index character that does not sort distinctly
// from the sequence of its separate characters.
} else {
int32_t insertionPoint = binarySearch(indexCharacters, *item, *collatorPrimaryOnly_);
if (insertionPoint < 0) {
indexCharacters.insertElementAt(
ownedString(*item, ownedItem, errorCode), ~insertionPoint, errorCode);
} else {
const UnicodeString &itemAlreadyIn = *getString(indexCharacters, insertionPoint);
if (isOneLabelBetterThanOther(*nfkdNormalizer, *item, itemAlreadyIn)) {
indexCharacters.setElementAt(
ownedString(*item, ownedItem, errorCode), insertionPoint);
}
}
}
}
if (U_FAILURE(errorCode)) { return; }
// if the result is still too large, cut down to maxLabelCount_ elements, by removing every nth element
int32_t size = indexCharacters.size() - 1;
if (size > maxLabelCount_) {
int32_t count = 0;
int32_t old = -1;
for (int32_t i = 0; i < indexCharacters.size();) {
++count;
int32_t bump = count * maxLabelCount_ / size;
if (bump == old) {
indexCharacters.removeElementAt(i);
} else {
old = bump;
++i;
}
}
}
}
namespace {
const UnicodeString &fixLabel(const UnicodeString ¤t, UnicodeString &temp) {
if (!current.startsWith(BASE, BASE_LENGTH)) {
return current;
}
char16_t rest = current.charAt(BASE_LENGTH);
if (0x2800 < rest && rest <= 0x28FF) { // stroke count
int32_t count = rest-0x2800;
temp.setTo((char16_t)(0x30 + count % 10));
if (count >= 10) {
count /= 10;
temp.insert(0, (char16_t)(0x30 + count % 10));
if (count >= 10) {
count /= 10;
temp.insert(0, (char16_t)(0x30 + count));
}
}
return temp.append((char16_t)0x5283);
}
return temp.setTo(current, BASE_LENGTH);
}
UBool hasMultiplePrimaryWeights(
const RuleBasedCollator &coll, uint32_t variableTop,
const UnicodeString &s, UVector64 &ces, UErrorCode &errorCode) {
ces.removeAllElements();
coll.internalGetCEs(s, ces, errorCode);
if (U_FAILURE(errorCode)) { return false; }
UBool seenPrimary = false;
for (int32_t i = 0; i < ces.size(); ++i) {
int64_t ce = ces.elementAti(i);
uint32_t p = (uint32_t)(ce >> 32);
if (p > variableTop) {
// not primary ignorable
if (seenPrimary) {
return true;
}
seenPrimary = true;
}
}
return false;
}
} // namespace
BucketList *AlphabeticIndex::createBucketList(UErrorCode &errorCode) const {
// Initialize indexCharacters.
UVector indexCharacters(errorCode);
indexCharacters.setDeleter(uprv_deleteUObject);
initLabels(indexCharacters, errorCode);
if (U_FAILURE(errorCode)) { return nullptr; }
// Variables for hasMultiplePrimaryWeights().
UVector64 ces(errorCode);
uint32_t variableTop;
if (collatorPrimaryOnly_->getAttribute(UCOL_ALTERNATE_HANDLING, errorCode) == UCOL_SHIFTED) {
variableTop = collatorPrimaryOnly_->getVariableTop(errorCode);
} else {
variableTop = 0;
}
UBool hasInvisibleBuckets = false;
// Helper arrays for Chinese Pinyin collation.
Bucket *asciiBuckets[26] = {
nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr,
nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr
};
Bucket *pinyinBuckets[26] = {
nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr,
nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr
};
UBool hasPinyin = false;
LocalPointer<UVector> bucketList(new UVector(errorCode), errorCode);
if (U_FAILURE(errorCode)) {
return nullptr;
}
bucketList->setDeleter(uprv_deleteUObject);
// underflow bucket
LocalPointer<Bucket> bucket(new Bucket(getUnderflowLabel(), emptyString_, U_ALPHAINDEX_UNDERFLOW), errorCode);
if (U_FAILURE(errorCode)) {
return nullptr;
}
bucketList->adoptElement(bucket.orphan(), errorCode);
if (U_FAILURE(errorCode)) { return nullptr; }
UnicodeString temp;
// fix up the list, adding underflow, additions, overflow
// Insert inflow labels as needed.
int32_t scriptIndex = -1;
const UnicodeString *scriptUpperBoundary = &emptyString_;
for (int32_t i = 0; i < indexCharacters.size(); ++i) {
UnicodeString ¤t = *getString(indexCharacters, i);
if (collatorPrimaryOnly_->compare(current, *scriptUpperBoundary, errorCode) >= 0) {
// We crossed the script boundary into a new script.
const UnicodeString &inflowBoundary = *scriptUpperBoundary;
UBool skippedScript = false;
for (;;) {
scriptUpperBoundary = getString(*firstCharsInScripts_, ++scriptIndex);
if (collatorPrimaryOnly_->compare(current, *scriptUpperBoundary, errorCode) < 0) {
break;
}
skippedScript = true;
}
if (skippedScript && bucketList->size() > 1) {
// We are skipping one or more scripts,
// and we are not just getting out of the underflow label.
bucket.adoptInsteadAndCheckErrorCode(
new Bucket(getInflowLabel(), inflowBoundary, U_ALPHAINDEX_INFLOW), errorCode);
bucketList->adoptElement(bucket.orphan(), errorCode);
if (U_FAILURE(errorCode)) { return nullptr; }
}
}
// Add a bucket with the current label.
bucket.adoptInsteadAndCheckErrorCode(
new Bucket(fixLabel(current, temp), current, U_ALPHAINDEX_NORMAL), errorCode);
bucketList->adoptElement(bucket.orphan(), errorCode);
if (U_FAILURE(errorCode)) { return nullptr; }
// Remember ASCII and Pinyin buckets for Pinyin redirects.
char16_t c;
if (current.length() == 1 && 0x41 <= (c = current.charAt(0)) && c <= 0x5A) { // A-Z
asciiBuckets[c - 0x41] = (Bucket *)bucketList->lastElement();
} else if (current.length() == BASE_LENGTH + 1 && current.startsWith(BASE, BASE_LENGTH) &&
0x41 <= (c = current.charAt(BASE_LENGTH)) && c <= 0x5A) {
pinyinBuckets[c - 0x41] = (Bucket *)bucketList->lastElement();
hasPinyin = true;
}
// Check for multiple primary weights.
if (!current.startsWith(BASE, BASE_LENGTH) &&
hasMultiplePrimaryWeights(*collatorPrimaryOnly_, variableTop, current,
ces, errorCode) &&
current.charAt(current.length() - 1) != 0xFFFF /* !current.endsWith("\uffff") */) {
// "AE-ligature" or "Sch" etc.
for (int32_t j = bucketList->size() - 2;; --j) {
Bucket *singleBucket = getBucket(*bucketList, j);
if (singleBucket->labelType_ != U_ALPHAINDEX_NORMAL) {
// There is no single-character bucket since the last
// underflow or inflow label.
break;
}
if (singleBucket->displayBucket_ == nullptr &&
!hasMultiplePrimaryWeights(*collatorPrimaryOnly_, variableTop,
singleBucket->lowerBoundary_,
ces, errorCode)) {
// Add an invisible bucket that redirects strings greater than the expansion
// to the previous single-character bucket.
// For example, after ... Q R S Sch we add Sch\uFFFF->S
// and after ... Q R S Sch Sch\uFFFF St we add St\uFFFF->S.
bucket.adoptInsteadAndCheckErrorCode(new Bucket(emptyString_,
UnicodeString(current).append((char16_t)0xFFFF),
U_ALPHAINDEX_NORMAL),
errorCode);
if (U_FAILURE(errorCode)) {
return nullptr;
}
bucket->displayBucket_ = singleBucket;
bucketList->adoptElement(bucket.orphan(), errorCode);
if (U_FAILURE(errorCode)) { return nullptr; }
hasInvisibleBuckets = true;
break;
}
}
}
}
if (U_FAILURE(errorCode)) { return nullptr; }
if (bucketList->size() == 1) {
// No real labels, show only the underflow label.
BucketList *bl = new BucketList(bucketList.getAlias(), bucketList.getAlias());
if (bl == nullptr) {
errorCode = U_MEMORY_ALLOCATION_ERROR;
return nullptr;
}
bucketList.orphan();
return bl;
}
// overflow bucket
bucket.adoptInsteadAndCheckErrorCode(
new Bucket(getOverflowLabel(), *scriptUpperBoundary, U_ALPHAINDEX_OVERFLOW), errorCode);
bucketList->adoptElement(bucket.orphan(), errorCode); // final
if (U_FAILURE(errorCode)) { return nullptr; }
if (hasPinyin) {
// Redirect Pinyin buckets.
Bucket *asciiBucket = nullptr;
for (int32_t i = 0; i < 26; ++i) {
if (asciiBuckets[i] != nullptr) {
asciiBucket = asciiBuckets[i];
}
if (pinyinBuckets[i] != nullptr && asciiBucket != nullptr) {
pinyinBuckets[i]->displayBucket_ = asciiBucket;
hasInvisibleBuckets = true;
}
}
}
if (U_FAILURE(errorCode)) { return nullptr; }
if (!hasInvisibleBuckets) {
BucketList *bl = new BucketList(bucketList.getAlias(), bucketList.getAlias());
if (bl == nullptr) {
errorCode = U_MEMORY_ALLOCATION_ERROR;
return nullptr;
}
bucketList.orphan();
return bl;
}
// Merge inflow buckets that are visually adjacent.
// Iterate backwards: Merge inflow into overflow rather than the other way around.
int32_t i = bucketList->size() - 1;
Bucket *nextBucket = getBucket(*bucketList, i);
while (--i > 0) {
Bucket *bucket = getBucket(*bucketList, i);
if (bucket->displayBucket_ != nullptr) {
continue; // skip invisible buckets
}
if (bucket->labelType_ == U_ALPHAINDEX_INFLOW) {
if (nextBucket->labelType_ != U_ALPHAINDEX_NORMAL) {
bucket->displayBucket_ = nextBucket;
continue;
}
}
nextBucket = bucket;
}
LocalPointer<UVector> publicBucketList(new UVector(errorCode), errorCode);
if (U_FAILURE(errorCode)) {
return nullptr;
}
// Do not call publicBucketList->setDeleter():
// This vector shares its objects with the bucketList.
for (int32_t j = 0; j < bucketList->size(); ++j) {
Bucket *bucket = getBucket(*bucketList, j);
if (bucket->displayBucket_ == nullptr) {
publicBucketList->addElement(bucket, errorCode);
}
}
if (U_FAILURE(errorCode)) { return nullptr; }
BucketList *bl = new BucketList(bucketList.getAlias(), publicBucketList.getAlias());
if (bl == nullptr) {
errorCode = U_MEMORY_ALLOCATION_ERROR;
return nullptr;
}
bucketList.orphan();
publicBucketList.orphan();
return bl;
}
/**
* Creates an index, and buckets and sorts the list of records into the index.
*/
void AlphabeticIndex::initBuckets(UErrorCode &errorCode) {
if (U_FAILURE(errorCode) || buckets_ != nullptr) {
return;
}
buckets_ = createBucketList(errorCode);
if (U_FAILURE(errorCode) || inputList_ == nullptr || inputList_->isEmpty()) {
return;
}
// Sort the records by name.
// Stable sort preserves input order of collation duplicates.
inputList_->sortWithUComparator(recordCompareFn, collator_, errorCode);
// Now, we traverse all of the input, which is now sorted.
// If the item doesn't go in the current bucket, we find the next bucket that contains it.
// This makes the process order n*log(n), since we just sort the list and then do a linear process.
// However, if the user adds an item at a time and then gets the buckets, this isn't efficient, so
// we need to improve it for that case.
Bucket *currentBucket = getBucket(*buckets_->bucketList_, 0);
int32_t bucketIndex = 1;
Bucket *nextBucket;
const UnicodeString *upperBoundary;
if (bucketIndex < buckets_->bucketList_->size()) {
nextBucket = getBucket(*buckets_->bucketList_, bucketIndex++);
upperBoundary = &nextBucket->lowerBoundary_;
} else {
nextBucket = nullptr;
upperBoundary = nullptr;
}
for (int32_t i = 0; i < inputList_->size(); ++i) {
Record *r = getRecord(*inputList_, i);
// if the current bucket isn't the right one, find the one that is
// We have a special flag for the last bucket so that we don't look any further
while (upperBoundary != nullptr &&
collatorPrimaryOnly_->compare(r->name_, *upperBoundary, errorCode) >= 0) {
currentBucket = nextBucket;
// now reset the boundary that we compare against
if (bucketIndex < buckets_->bucketList_->size()) {
nextBucket = getBucket(*buckets_->bucketList_, bucketIndex++);
upperBoundary = &nextBucket->lowerBoundary_;
} else {
upperBoundary = nullptr;
}
}
// now put the record into the bucket.
Bucket *bucket = currentBucket;
if (bucket->displayBucket_ != nullptr) {
bucket = bucket->displayBucket_;
}
if (bucket->records_ == nullptr) {
LocalPointer<UVector> records(new UVector(errorCode), errorCode);
if (U_FAILURE(errorCode)) {
return;
}
bucket->records_ = records.orphan();
}
bucket->records_->addElement(r, errorCode);
}
}
void AlphabeticIndex::clearBuckets() {
if (buckets_ != nullptr) {
delete buckets_;
buckets_ = nullptr;
internalResetBucketIterator();
}
}
void AlphabeticIndex::internalResetBucketIterator() {
labelsIterIndex_ = -1;
currentBucket_ = nullptr;
}
void AlphabeticIndex::addIndexExemplars(const Locale &locale, UErrorCode &status) {
LocalULocaleDataPointer uld(ulocdata_open(locale.getName(), &status));
if (U_FAILURE(status)) {
return;
}
UnicodeSet exemplars;
ulocdata_getExemplarSet(uld.getAlias(), exemplars.toUSet(), 0, ULOCDATA_ES_INDEX, &status);
if (U_SUCCESS(status)) {
initialLabels_->addAll(exemplars);
return;
}
status = U_ZERO_ERROR; // Clear out U_MISSING_RESOURCE_ERROR
// The locale data did not include explicit Index characters.
// Synthesize a set of them from the locale's standard exemplar characters.
ulocdata_getExemplarSet(uld.getAlias(), exemplars.toUSet(), 0, ULOCDATA_ES_STANDARD, &status);
if (U_FAILURE(status)) {
return;
}
// question: should we add auxiliary exemplars?
if (exemplars.containsSome(0x61, 0x7A) /* a-z */ || exemplars.isEmpty()) {
exemplars.add(0x61, 0x7A);
}
if (exemplars.containsSome(0xAC00, 0xD7A3)) { // Hangul syllables
// cut down to small list
exemplars.remove(0xAC00, 0xD7A3).
add(0xAC00).add(0xB098).add(0xB2E4).add(0xB77C).
add(0xB9C8).add(0xBC14).add(0xC0AC).add(0xC544).
add(0xC790).add(0xCC28).add(0xCE74).add(0xD0C0).
add(0xD30C).add(0xD558);
}
if (exemplars.containsSome(0x1200, 0x137F)) { // Ethiopic block
// cut down to small list
// make use of the fact that Ethiopic is allocated in 8's, where
// the base is 0 mod 8.
UnicodeSet ethiopic(UnicodeString(u"[ሀለሐመሠረሰሸቀቈቐቘበቨተቸኀኈነኘአከኰኸዀወዐዘዠየደዸጀገጐጘጠጨጰጸፀፈፐፘ]"), status);
ethiopic.retainAll(exemplars);
exemplars.remove(u'ሀ', 0x137F).addAll(ethiopic);
}
// Upper-case any that aren't already so.
// (We only do this for synthesized index characters.)
UnicodeSetIterator it(exemplars);
UnicodeString upperC;
while (it.next()) {
const UnicodeString &exemplarC = it.getString();
upperC = exemplarC;
upperC.toUpper(locale);
initialLabels_->add(upperC);
}
}
UBool AlphabeticIndex::addChineseIndexCharacters(UErrorCode &errorCode) {
UnicodeSet contractions;
collatorPrimaryOnly_->internalAddContractions(BASE[0], contractions, errorCode);
if (U_FAILURE(errorCode) || contractions.isEmpty()) { return false; }
initialLabels_->addAll(contractions);
UnicodeSetIterator iter(contractions);
while (iter.next()) {
const UnicodeString &s = iter.getString();
U_ASSERT (s.startsWith(BASE, BASE_LENGTH));
char16_t c = s.charAt(s.length() - 1);
if (0x41 <= c && c <= 0x5A) { // A-Z
// There are Pinyin labels, add ASCII A-Z labels as well.
initialLabels_->add(0x41, 0x5A); // A-Z
break;
}
}
return true;
}
/*
* Return the string with interspersed CGJs. Input must have more than 2 codepoints.
*/
static const char16_t CGJ = 0x034F;
UnicodeString AlphabeticIndex::separated(const UnicodeString &item) {
UnicodeString result;
if (item.length() == 0) {
return result;
}
int32_t i = 0;
for (;;) {
UChar32 cp = item.char32At(i);
result.append(cp);
i = item.moveIndex32(i, 1);
if (i >= item.length()) {
break;
}
result.append(CGJ);
}
return result;
}
bool AlphabeticIndex::operator==(const AlphabeticIndex& /* other */) const {
return false;
}
bool AlphabeticIndex::operator!=(const AlphabeticIndex& /* other */) const {
return false;
}
const RuleBasedCollator &AlphabeticIndex::getCollator() const {
return *collator_;
}
const UnicodeString &AlphabeticIndex::getInflowLabel() const {
return inflowLabel_;
}
const UnicodeString &AlphabeticIndex::getOverflowLabel() const {
return overflowLabel_;
}
const UnicodeString &AlphabeticIndex::getUnderflowLabel() const {
return underflowLabel_;
}
AlphabeticIndex &AlphabeticIndex::setInflowLabel(const UnicodeString &label, UErrorCode &/*status*/) {
inflowLabel_ = label;
clearBuckets();
return *this;
}
AlphabeticIndex &AlphabeticIndex::setOverflowLabel(const UnicodeString &label, UErrorCode &/*status*/) {
overflowLabel_ = label;
clearBuckets();
return *this;
}
AlphabeticIndex &AlphabeticIndex::setUnderflowLabel(const UnicodeString &label, UErrorCode &/*status*/) {
underflowLabel_ = label;
clearBuckets();
return *this;
}
int32_t AlphabeticIndex::getMaxLabelCount() const {
return maxLabelCount_;
}
AlphabeticIndex &AlphabeticIndex::setMaxLabelCount(int32_t maxLabelCount, UErrorCode &status) {
if (U_FAILURE(status)) {
return *this;
}
if (maxLabelCount <= 0) {
status = U_ILLEGAL_ARGUMENT_ERROR;
return *this;
}
maxLabelCount_ = maxLabelCount;
clearBuckets();
return *this;
}
//
// init() - Common code for constructors.
//
void AlphabeticIndex::init(const Locale *locale, UErrorCode &status) {
if (U_FAILURE(status)) { return; }
if (locale == nullptr && collator_ == nullptr) {
status = U_ILLEGAL_ARGUMENT_ERROR;
return;
}
initialLabels_ = new UnicodeSet();
if (initialLabels_ == nullptr) {
status = U_MEMORY_ALLOCATION_ERROR;
return;
}
inflowLabel_.setTo((char16_t)0x2026); // Ellipsis
overflowLabel_ = inflowLabel_;
underflowLabel_ = inflowLabel_;
if (collator_ == nullptr) {
Collator *coll = Collator::createInstance(*locale, status);
if (U_FAILURE(status)) {
delete coll;
return;
}
if (coll == nullptr) {
status = U_MEMORY_ALLOCATION_ERROR;
return;
}
collator_ = dynamic_cast<RuleBasedCollator *>(coll);
if (collator_ == nullptr) {
delete coll;
status = U_UNSUPPORTED_ERROR;
return;
}
}
collatorPrimaryOnly_ = collator_->clone();
if (collatorPrimaryOnly_ == nullptr) {
status = U_MEMORY_ALLOCATION_ERROR;
return;
}
collatorPrimaryOnly_->setAttribute(UCOL_STRENGTH, UCOL_PRIMARY, status);
firstCharsInScripts_ = firstStringsInScript(status);
if (U_FAILURE(status)) { return; }
firstCharsInScripts_->sortWithUComparator(collatorComparator, collatorPrimaryOnly_, status);
// Guard against a degenerate collator where
// some script boundary strings are primary ignorable.
for (;;) {
if (U_FAILURE(status)) { return; }
if (firstCharsInScripts_->isEmpty()) {
// AlphabeticIndex requires some non-ignorable script boundary strings.
status = U_ILLEGAL_ARGUMENT_ERROR;
return;
}
if (collatorPrimaryOnly_->compare(
*static_cast<UnicodeString *>(firstCharsInScripts_->elementAt(0)),
emptyString_, status) == UCOL_EQUAL) {
firstCharsInScripts_->removeElementAt(0);
} else {
break;
}
}
// Chinese index characters, which are specific to each of the several Chinese tailorings,
// take precedence over the single locale data exemplar set per language.
if (!addChineseIndexCharacters(status) && locale != nullptr) {
addIndexExemplars(*locale, status);
}
}
//
// Comparison function for UVector<UnicodeString *> sorting with a collator.
//
static int32_t U_CALLCONV
collatorComparator(const void *context, const void *left, const void *right) {
const UElement *leftElement = static_cast<const UElement *>(left);
const UElement *rightElement = static_cast<const UElement *>(right);
const UnicodeString *leftString = static_cast<const UnicodeString *>(leftElement->pointer);
const UnicodeString *rightString = static_cast<const UnicodeString *>(rightElement->pointer);
if (leftString == rightString) {
// Catches case where both are nullptr
return 0;
}
if (leftString == nullptr) {
return 1;
}
if (rightString == nullptr) {
return -1;
}
const Collator *col = static_cast<const Collator *>(context);
UErrorCode errorCode = U_ZERO_ERROR;
return col->compare(*leftString, *rightString, errorCode);
}
//
// Comparison function for UVector<Record *> sorting with a collator.
//
static int32_t U_CALLCONV
recordCompareFn(const void *context, const void *left, const void *right) {
const UElement *leftElement = static_cast<const UElement *>(left);
const UElement *rightElement = static_cast<const UElement *>(right);
const AlphabeticIndex::Record *leftRec = static_cast<const AlphabeticIndex::Record *>(leftElement->pointer);
const AlphabeticIndex::Record *rightRec = static_cast<const AlphabeticIndex::Record *>(rightElement->pointer);
const Collator *col = static_cast<const Collator *>(context);
UErrorCode errorCode = U_ZERO_ERROR;
return col->compare(leftRec->name_, rightRec->name_, errorCode);
}
UVector *AlphabeticIndex::firstStringsInScript(UErrorCode &status) {
if (U_FAILURE(status)) {
return nullptr;
}
LocalPointer<UVector> dest(new UVector(status), status);
if (U_FAILURE(status)) {
return nullptr;
}
dest->setDeleter(uprv_deleteUObject);
// Fetch the script-first-primary contractions which are defined in the root collator.
// They all start with U+FDD1.
UnicodeSet set;
collatorPrimaryOnly_->internalAddContractions(0xFDD1, set, status);
if (U_FAILURE(status)) {
return nullptr;
}
if (set.isEmpty()) {
status = U_UNSUPPORTED_ERROR;
return nullptr;
}
UnicodeSetIterator iter(set);
while (iter.next()) {
const UnicodeString &boundary = iter.getString();
uint32_t gcMask = U_GET_GC_MASK(boundary.char32At(1));
if ((gcMask & (U_GC_L_MASK | U_GC_CN_MASK)) == 0) {
// Ignore boundaries for the special reordering groups.
// Take only those for "real scripts" (where the sample character is a Letter,
// and the one for unassigned implicit weights (Cn).
continue;
}
LocalPointer<UnicodeString> s(new UnicodeString(boundary), status);
dest->adoptElement(s.orphan(), status);
if (U_FAILURE(status)) {
return nullptr;
}
}
return dest.orphan();
}
namespace {
/**
* Returns true if one index character string is "better" than the other.
* Shorter NFKD is better, and otherwise NFKD-binary-less-than is
* better, and otherwise binary-less-than is better.
*/
UBool isOneLabelBetterThanOther(const Normalizer2 &nfkdNormalizer,
const UnicodeString &one, const UnicodeString &other) {
// This is called with primary-equal strings, but never with one.equals(other).
UErrorCode status = U_ZERO_ERROR;
UnicodeString n1 = nfkdNormalizer.normalize(one, status);
UnicodeString n2 = nfkdNormalizer.normalize(other, status);
if (U_FAILURE(status)) { return false; }
int32_t result = n1.countChar32() - n2.countChar32();
if (result != 0) {
return result < 0;
}
result = n1.compareCodePointOrder(n2);
if (result != 0) {
return result < 0;
}
return one.compareCodePointOrder(other) < 0;
}
} // namespace
//
// Constructor & Destructor for AlphabeticIndex::Record
//
// Records are internal only, instances are not directly surfaced in the public API.
// This class is mostly struct-like, with all public fields.
AlphabeticIndex::Record::Record(const UnicodeString &name, const void *data)
: name_(name), data_(data) {}
AlphabeticIndex::Record::~Record() {
}
AlphabeticIndex & AlphabeticIndex::addRecord(const UnicodeString &name, const void *data, UErrorCode &status) {
if (U_FAILURE(status)) {
return *this;
}
if (inputList_ == nullptr) {
LocalPointer<UVector> inputList(new UVector(status), status);
if (U_FAILURE(status)) {
return *this;
}
inputList_ = inputList.orphan();
inputList_->setDeleter(alphaIndex_deleteRecord);
}
LocalPointer<Record> r(new Record(name, data), status);
inputList_->adoptElement(r.orphan(), status);
if (U_FAILURE(status)) {
return *this;
}
clearBuckets();
//std::string ss;
//std::string ss2;
//std::cout << "added record: name = \"" << r->name_.toUTF8String(ss) << "\"" <<
// " sortingName = \"" << r->sortingName_.toUTF8String(ss2) << "\"" << std::endl;
return *this;
}
AlphabeticIndex &AlphabeticIndex::clearRecords(UErrorCode &status) {
if (U_SUCCESS(status) && inputList_ != nullptr && !inputList_->isEmpty()) {
inputList_->removeAllElements();
clearBuckets();
}
return *this;
}
int32_t AlphabeticIndex::getBucketIndex(const UnicodeString &name, UErrorCode &status) {
initBuckets(status);
if (U_FAILURE(status)) {
return 0;
}
return buckets_->getBucketIndex(name, *collatorPrimaryOnly_, status);
}
int32_t AlphabeticIndex::getBucketIndex() const {
return labelsIterIndex_;
}
UBool AlphabeticIndex::nextBucket(UErrorCode &status) {
if (U_FAILURE(status)) {
return false;
}
if (buckets_ == nullptr && currentBucket_ != nullptr) {
status = U_ENUM_OUT_OF_SYNC_ERROR;
return false;
}
initBuckets(status);
if (U_FAILURE(status)) {
return false;
}
++labelsIterIndex_;
if (labelsIterIndex_ >= buckets_->getBucketCount()) {
labelsIterIndex_ = buckets_->getBucketCount();
return false;
}
currentBucket_ = getBucket(*buckets_->immutableVisibleList_, labelsIterIndex_);
resetRecordIterator();
return true;
}
const UnicodeString &AlphabeticIndex::getBucketLabel() const {
if (currentBucket_ != nullptr) {
return currentBucket_->label_;
} else {
return emptyString_;
}
}
UAlphabeticIndexLabelType AlphabeticIndex::getBucketLabelType() const {
if (currentBucket_ != nullptr) {
return currentBucket_->labelType_;
} else {
return U_ALPHAINDEX_NORMAL;
}
}
int32_t AlphabeticIndex::getBucketRecordCount() const {
if (currentBucket_ != nullptr && currentBucket_->records_ != nullptr) {
return currentBucket_->records_->size();
} else {
return 0;
}
}
AlphabeticIndex &AlphabeticIndex::resetBucketIterator(UErrorCode &status) {
if (U_FAILURE(status)) {
return *this;
}
internalResetBucketIterator();
return *this;
}
UBool AlphabeticIndex::nextRecord(UErrorCode &status) {
if (U_FAILURE(status)) {
return false;
}
if (currentBucket_ == nullptr) {
// We are trying to iterate over the items in a bucket, but there is no
// current bucket from the enumeration of buckets.
status = U_INVALID_STATE_ERROR;
return false;
}
if (buckets_ == nullptr) {
status = U_ENUM_OUT_OF_SYNC_ERROR;
return false;
}
if (currentBucket_->records_ == nullptr) {
return false;
}
++itemsIterIndex_;
if (itemsIterIndex_ >= currentBucket_->records_->size()) {
itemsIterIndex_ = currentBucket_->records_->size();
return false;
}
return true;
}
const UnicodeString &AlphabeticIndex::getRecordName() const {
const UnicodeString *retStr = &emptyString_;
if (currentBucket_ != nullptr && currentBucket_->records_ != nullptr &&
itemsIterIndex_ >= 0 &&
itemsIterIndex_ < currentBucket_->records_->size()) {
Record *item = static_cast<Record *>(currentBucket_->records_->elementAt(itemsIterIndex_));
retStr = &item->name_;
}
return *retStr;
}
const void *AlphabeticIndex::getRecordData() const {
const void *retPtr = nullptr;
if (currentBucket_ != nullptr && currentBucket_->records_ != nullptr &&
itemsIterIndex_ >= 0 &&
itemsIterIndex_ < currentBucket_->records_->size()) {
Record *item = static_cast<Record *>(currentBucket_->records_->elementAt(itemsIterIndex_));
retPtr = item->data_;
}
return retPtr;
}
AlphabeticIndex & AlphabeticIndex::resetRecordIterator() {
itemsIterIndex_ = -1;
return *this;
}
AlphabeticIndex::Bucket::Bucket(const UnicodeString &label,
const UnicodeString &lowerBoundary,
UAlphabeticIndexLabelType type)
: label_(label), lowerBoundary_(lowerBoundary), labelType_(type),
displayBucket_(nullptr), displayIndex_(-1),
records_(nullptr) {
}
AlphabeticIndex::Bucket::~Bucket() {
delete records_;
}
U_NAMESPACE_END
#endif // !UCONFIG_NO_COLLATION
|