1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348
  
     | 
    
      # Copyright (C) 2016 and later: Unicode, Inc. and others.
# License & terms of use: http://www.unicode.org/copyright.html
# Copyright (C) 2010-2014, International Business Machines Corporation and others.
# All Rights Reserved.                  
#
# Commands for regenerating ICU4C locale data (.txt files) from CLDR,
# updated to apply to CLDR 37 / ICU 67 and later versions.
#
# The process requires local copies of
#    - CLDR (the source of most of the data, and some Java tools)
#    - The complete ICU source tree, including:
#      tools - includes the LdmlConverter build tool and associated config files
#      icu4c - the target for converted CLDR data, and source for ICU4J data;
#              includes tests for the converted data
#      icu4j - the target for updated data jars; includes tests for the converted
#              data
#
# For an official CLDR data integration into ICU, these should be clean, freshly
# checked-out. For released CLDR sources, an alternative to checking out sources
# for a given version is downloading the zipped sources for the common (core.zip)
# and tools (tools.zip) directory subtrees from the Data column in
# [http://cldr.unicode.org/index/downloads].
#
# The versions of each of these must match. Included with the release notes for
# ICU is the version number and/or a CLDR git tag name for the revision of CLDR
# that was the source of the data for that release of ICU.
#
# Besides a standard JDK, the process also requires ant and maven
# (http://ant.apache.org/),
# plus the xml-apis.jar from the Apache xalan package
# (http://xml.apache.org/xalan-j/downloads.html).
#
# You will also need to have performed the CLDR Maven setup (non-Eclipse version)
# per http://cldr.unicode.org/development/maven 
#
# Note: Enough things can (and will) fail in this process that it is best to
#   run the commands separately from an interactive shell. They should all
#   copy and paste without problems.
#
# It is often useful to save logs of the output of many of the steps in this
# process. The commands below put log files in /tmp; you may want to put them
# somewhere else.
#
#----
#
# There are several environment variables that need to be defined.
#
# a) Java- and ant-related variables
#
# JAVA_HOME:     Path to JDK (a directory, containing e.g. bin/java, bin/javac,
#                etc.); on many systems this can be set using
#                `/usr/libexec/java_home`.
#
# ANT_OPTS:      You may want to set:
#
#                -Xmx4096m, to give Java more memory; otherwise it may run out
#                 of heap.
#
# b) CLDR-related variables
#
# CLDR_DIR:      This is the path to the to root of standard CLDR sources, below
#                which are the common and tools directories.
#
# CLDR_CLASSES:  Path to the CLDR Tools classes directory. If not set, defaults
#                to $CLDR_DIR/tools/java/classes
#
# CLDR_TMP_DIR:  Parent of temporary CLDR production data.
#                Defaults to $CLDR_DIR/../cldr-aux (sibling to CLDR_DIR).
#
#                *** NOTE ***: In CLDR 36 and 37, the GenerateProductionData tool
#                no longer generates data by default into $CLDR_TMP_DIR/production;
#                instead it generates data into $CLDR_DIR/../cldr-staging/production
#                (though there is a command-line option to override this). However
#                the rest of the build still assumes that the generated data is in
#                $CLDR_TMP_DIR/production. So CLDR_TMP_DIR must be defined to be
#                $CLDR_DIR/../cldr-staging
#
# c) ICU-related variables
# These variables only need to be set if you're directly reusing the
# commands below.
#
# ICU4C_DIR:     Path to root of ICU4C sources, below which is the source dir.
#
# ICU4J_ROOT:    Path to root of ICU4J sources, below which is the main dir.
#
# TOOLS_ROOT:    Path to root of ICU tools directory, below which is (e.g.) the
#                cldr and unicodetools dirs.
#
#----
#
# If you are adding or removing locales, or specific kinds of locale data,
# there are some xml files in the ICU sources that need to be updated (these xml
# files are used in addition to the CLDR files as inputs to the CLDR data build
# process for ICU):
#
# The primary file to edit for ICU 67 and later is
#
#    $TOOLS_ROOT/cldr/cldr-to-icu/build-icu-data.xml
#
#----
#
# For an official CLDR data integration into ICU, there are some additional
# considerations:
#
# a) Don't commit anything in ICU sources (and possibly any changes in CLDR
#    sources, depending on their nature) until you have finished testing and
#    resolving build issues and test failures for both ICU4C and ICU4J.
#
# b) There are version numbers that may need manual updating in CLDR (other
#    version numbers get updated automatically, based on these):
#
#    common/dtd/ldml.dtd                            - update cldrVersion
#    common/dtd/ldmlBCP47.dtd                       - update cldrVersion
#    common/dtd/ldmlSupplemental.dtd                - update cldrVersion
#    common/dtd/ldmlSupplemental.dtd                - updateunicodeVersion
#    keyboards/dtd/ldmlKeyboard.dtd                 - update cldrVersion
#    tools/java/org/unicode/cldr/util/CLDRFile.java - update GEN_VERSION
#
# c) After everything is committed, you will need to tag the CLDR and ICU
#    sources that ended up being used for the integration; see step 16
#    below.
#
################################################################################
# 1a. Java and ant variables, adjust for your system
export JAVA_HOME=`/usr/libexec/java_home`
export ANT_OPTS="-Xmx4096m"
# 1b. CLDR variables, adjust for your setup; with cygwin it might be e.g.
# CLDR_DIR=`cygpath -wp /build/cldr`
export CLDR_DIR=$HOME/cldr-myfork
# 1c. ICU variables
export ICU4C_DIR=$HOME/icu-myfork/icu4c
export ICU4J_ROOT=$HOME/icu-myfork/icu4j
export TOOLS_ROOT=$HOME/icu-myfork/tools
# 2. Build and install the CLDR jar
cd $TOOLS_ROOT/cldr
ant install-cldr-libs
See the $TOOLS_ROOT/cldr/lib/README.txt file for more information on the CLDR
jar and the install-cldr-jars.sh script.
# 3. Configure ICU4C, build and test without new data first, to verify that
# there are no pre-existing errors. Here <platform> is the runConfigureICU
# code for the platform you are building, e.g. Linux, MacOSX, Cygwin.
cd $ICU4C_DIR/source
./runConfigureICU <platform>
make clean
make check 2>&1 | tee /tmp/icu4c-oldData-makeCheck.txt
# 4a. Generate the CLDR production data. This process uses ant with ICU's
# data/build.xml
#
# Running "ant cleanprod" is necessary to clean out the production data directory
# (usually $CLDR_TMP_DIR/production ), required if any CLDR data has changed.
#
# Running "ant setup" is not required, but it will print useful errors to
# debug issues with your path when it fails.
cd $ICU4C_DIR/source/data
ant cleanprod
ant setup
ant proddata 2>&1 | tee /tmp/cldr-newData-proddataLog.txt
# 4b. Build the new ICU4C data files; these include .txt files and .py files.
# These new files will replace whatever was already present in the ICU4C sources.
# This process uses the LdmlConverter in $TOOLS_ROOT/cldr/cldr-to-icu/;
# see $TOOLS_ROOT/cldr/cldr-to-icu/README.txt
#
# This process will take several minutes, during most of which there will be no log
# output (so do not assume nothing is happening). Keep a log so you can investigate
# anything that looks suspicious.
#
# Note that "ant clean" should not be run before this. The build-icu-data.xml process
# will automatically run its own "clean" step to delete files it cannot determine to
# be ones that it would generate, except for pasts listed in <retain> elements such as
# coll/de__PHONEBOOK.txt, coll/de_.txt, etc.
#
# Before running Ant to regenerate the data, make any necessary changes to the
# build-icu-data.xml file, such as adding new locales etc.
cd $TOOLS_ROOT/cldr/cldr-to-icu
ant -f build-icu-data.xml -DcldrDataDir="$CLDR_TMP_DIR/production" | tee /tmp/cldr-newData-builddataLog.txt
# 4c. Update the CLDR testData files needed by ICU4C and ICU4J tests, ensuring
# they're representative of the newest CLDR data.
cd $TOOLS_ROOT/cldr
ant copy-cldr-testdata
# 4d. Copy from CLDR common/testData/localeIdentifiers/localeCanonicalization.txt
# into icu4c/source/test/testdata/localeCanonicalization.txt
# and icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode/localeCanonicalization.txt
# and add the following line to the beginning of these two files
# # File copied from cldr common/testData/localeIdentifiers/localeCanonicalization.txt
# 4e. For the time being, manually re-add the lstm entries in data/brkitr/root.txt
open $ICU4C_DIR/source/data/brkitr/root.txt 
# paste the following block at the end, after the dictionaries block>
    lstm{
        Thai{"Thai_graphclust_model4_heavy.res"}
        Mymr{"Burmese_graphclust_model5_heavy.res"}
    }
# 5. Check which data files have modifications, which have been added or removed
# (if there are no changes, you may not need to proceed further). Make sure the
# list seems reasonable.
cd $ICU4C_DIR/source/data
git status
# 5a. You may also want to check which files were modified in CLDR production data:
cd $CLDR_TMP_DIR
git status
# 6. Fix any errors, investigate any warnings.
#
# Fixing may entail modifying CLDR source data or TOOLS_ROOT config files or
# tooling.
# 7. Now rebuild ICU4C with the new data and run make check tests.
# Again, keep a log so you can investigate the errors.
cd $ICU4C_DIR/source
# 7a. If any files were added or removed (likely), re-run configure:
./runConfigureICU <platform>
make clean
# 7b. Now do the rebuild.
make check 2>&1 | tee /tmp/icu4c-newData-makeCheck.txt
# 8. Investigate each test case failure. The first run processing new CLDR data
# from the Survey Tool can result in thousands of failures (in many cases, one
# CLDR data fix can resolve hundreds of test failures). If the error is caused
# by bad CLDR data, then file a CLDR bug, fix the data, and regenerate from
# step 4. If the data is OK but the testcase needs to be updated because the
# data has legitimately changed, then update the testcase. You will check in
# the updated testcases along with the new ICU data at the end of this process.
# Note that if the new data has any differences in structure, you will have to
# update test/testdata/structLocale.txt or /tsutil/cldrtest/TestLocaleStructure
# may fail.
# Repeat steps 4-7 until there are no errors.
# 9. You can also run the make check tests in exhaustive mode. As an alternative
# you can run them as part of the pre-merge tests by adding the following as a
# comment in the pull request: "/azp run CI-Exhaustive". You should do one or the
# other; the exhaustive tests are *not* run automatically on each pull request,
# and are only run occasionally on the default branch.
cd $ICU4C_DIR/source
export INTLTEST_OPTS="-e"
export CINTLTST_OPTS="-e"
make check 2>&1 | tee /tmp/icu4c-newData-makeCheckEx.txt
# 10. Again, investigate each failure, fixing CLDR data or ICU test cases as
# appropriate, and repeating steps 4-7 and 9 until there are no errors.
# 11. Now with ICU4J, build and test without new data first, to verify that
# there are no pre-existing errors (or at least to have the pre-existing errors
# as a base for comparison):
cd $ICU4J_ROOT
ant check 2>&1 | tee /tmp/icu4j-oldData-antCheck.txt
# 12. Transfer the data to ICU4J:
cd $ICU4C_DIR/source
# 12a. You need to reconfigure ICU4C to include the unicore data.
ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ./runConfigureICU <platform>
# 12b. Now build the jar files.
cd $ICU4C_DIR/source/data
# The following 2 lines are required to include the unicore data:
make clean
make -j6
make icu4j-data-install
cd $ICU4C_DIR/source/test/testdata
make icu4j-data-install
# 13. Now rebuild ICU4J with the new data and run tests:
# Keep a log so you can investigate the errors.
cd $ICU4J_ROOT
ant check 2>&1 | tee /tmp/icu4j-newData-antCheck.txt
# 14. Investigate test case failures; fix test cases and repeat from step 12,
# or fix CLDR data and repeat from step 4, as appropriate, until there are no
# more failures in ICU4C or ICU4J (except failures that were present before you
# began testing the new CLDR data).
# Note that certain data changes and related test failures may require the
# rebuilding of other kinds of data. For example:
# a) Changes to locale matching data may cause failures in e.g. the following:
#      com.ibm.icu.dev.test.util.LocaleDistanceTest (testLoadedDataSameAsBuiltFromScratch)
#      com.ibm.icu.dev.test.util.LocaleMatcherTest (testLikelySubtagsLoadedDataSameAsBuiltFromScratch)
#    To address these requires building and running the tool
#      icu4j/tools/misc/src/com/ibm/icu/dev/tool/locale/LocaleDistanceBuilder.java
#    to regenerate the file icu4c/source/data/misc/langInfo.txt and then regenerating
#    the ICU4J data jars.
# b) Changes to plurals data may cause failures in e.g. the following
#      com.ibm.icu.dev.test.format.PluralRulesTest (TestLocales)
#    To address these requires updating the LOCALE_SNAPSHOT data in
#      icu4j/main/tests/core/src/com/ibm/icu/dev/test/format/PluralRulesTest.java
#    by modifying the TestLocales() test there to run generateLOCALE_SNAPSHOT() and then
#    copying in the updated data.
# 15. Check the file changes; then git add or git rm as necessary, and
# commit the changes.
cd $HOME/icu/
cd ..
git status
# git add or remove as necessary
# commit
# 16. For an official CLDR data integration into ICU, now tag the CLDR and
# ICU sources with an appropriate CLDR milestone (you can check previous
# tags for format), e.g.:
cd $CLDR_DIR
git tag ...
git push --tags
cd $HOME/icu
git tag ...
git push --tags
# 17. You should also commit and tag the update production data in CLDR_TMP_DIR
# using the same tag as for CLDR_DIR above:
cd $CLDR_TMP_DIR
# git add or remove as necessary
# commit
git tag ...
git push --tags
 
     |