File: forder.c

package info (click to toggle)
r-cran-data.table 1.10.0-1~bpo8%2B1
  • links: PTS, VCS
  • area: main
  • in suites: jessie-backports
  • size: 12,816 kB
  • sloc: ansic: 9,751; sh: 10; makefile: 6
file content (1531 lines) | stat: -rw-r--r-- 77,283 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
1161
1162
1163
1164
1165
1166
1167
1168
1169
1170
1171
1172
1173
1174
1175
1176
1177
1178
1179
1180
1181
1182
1183
1184
1185
1186
1187
1188
1189
1190
1191
1192
1193
1194
1195
1196
1197
1198
1199
1200
1201
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
1214
1215
1216
1217
1218
1219
1220
1221
1222
1223
1224
1225
1226
1227
1228
1229
1230
1231
1232
1233
1234
1235
1236
1237
1238
1239
1240
1241
1242
1243
1244
1245
1246
1247
1248
1249
1250
1251
1252
1253
1254
1255
1256
1257
1258
1259
1260
1261
1262
1263
1264
1265
1266
1267
1268
1269
1270
1271
1272
1273
1274
1275
1276
1277
1278
1279
1280
1281
1282
1283
1284
1285
1286
1287
1288
1289
1290
1291
1292
1293
1294
1295
1296
1297
1298
1299
1300
1301
1302
1303
1304
1305
1306
1307
1308
1309
1310
1311
1312
1313
1314
1315
1316
1317
1318
1319
1320
1321
1322
1323
1324
1325
1326
1327
1328
1329
1330
1331
1332
1333
1334
1335
1336
1337
1338
1339
1340
1341
1342
1343
1344
1345
1346
1347
1348
1349
1350
1351
1352
1353
1354
1355
1356
1357
1358
1359
1360
1361
1362
1363
1364
1365
1366
1367
1368
1369
1370
1371
1372
1373
1374
1375
1376
1377
1378
1379
1380
1381
1382
1383
1384
1385
1386
1387
1388
1389
1390
1391
1392
1393
1394
1395
1396
1397
1398
1399
1400
1401
1402
1403
1404
1405
1406
1407
1408
1409
1410
1411
1412
1413
1414
1415
1416
1417
1418
1419
1420
1421
1422
1423
1424
1425
1426
1427
1428
1429
1430
1431
1432
1433
1434
1435
1436
1437
1438
1439
1440
1441
1442
1443
1444
1445
1446
1447
1448
1449
1450
1451
1452
1453
1454
1455
1456
1457
1458
1459
1460
1461
1462
1463
1464
1465
1466
1467
1468
1469
1470
1471
1472
1473
1474
1475
1476
1477
1478
1479
1480
1481
1482
1483
1484
1485
1486
1487
1488
1489
1490
1491
1492
1493
1494
1495
1496
1497
1498
1499
1500
1501
1502
1503
1504
1505
1506
1507
1508
1509
1510
1511
1512
1513
1514
1515
1516
1517
1518
1519
1520
1521
1522
1523
1524
1525
1526
1527
1528
1529
1530
1531
#include "data.table.h"
// #define TIMING_ON

/* 
    - Only forder() and *twiddle() functions are meant for use by other C code in data.table, hence all other functions here except forder and *twiddle are static.
    - The static functions here make use of global variables, whose scope is limited to the same translation unit; therefore access from outside this file must be via forder and/or *twiddle only. 
    - The coding techniques deployed here are for efficiency; e.g. i) the static functions are recursive or called repetitively and we wish to minimise stack overhead, or ii) reach outside themselves to place results in the end result directly rather than returning small bits of memory.
*/

static int *gs[2] = {NULL};                                         // gs = groupsizes e.g. 23,12,87,2,1,34,...
static int flip = 0;                                                // two vectors flip flopped: flip and 1-flip
static int gsalloc[2] = {0};                                        // allocated stack size
static int gsngrp[2] = {0};                                         // used
static int gsmax[2] = {0};                                          // max grpn so far
static int gsmaxalloc = 0;                                          // max size of stack, set by forder to nrows
static Rboolean stackgrps = TRUE;                                   // switched off for last column when not needed by setkey
static Rboolean sortStr = TRUE;                                     // TRUE for setkey, FALSE for by=
static int *newo = NULL;                                            // used by forder and [i|d|c]sort to reorder order. not needed if length(by)==1

static int nalast = -1;                                             // =1, 0, -1 for TRUE, NA, FALSE respectively. Value rewritten inside forder().
                                                                    // note that na.last=NA (0) removes NAs, not retains them.
static int order = 1;                                               // =1, -1 for ascending and descending order respectively

#define N_SMALL 200                                                 // replaced n < 200 with n < N_SMALL. Easier to change later
#define N_RANGE 100000                                              // range limit for counting sort. UPDATE: should be less than INT_MAX (see setRange for details)

#define Error(...) do {savetl_end(); error(__VA_ARGS__);} while(0)  // http://gcc.gnu.org/onlinedocs/cpp/Swallowing-the-Semicolon.html#Swallowing-the-Semicolon
#undef warning
#define warning(...) Do not use warning in this file                // since it can be turned to error via warn=2
/* use malloc/realloc (not Calloc/Realloc) so we can trap errors 
and call savetl_end() before the error(). */

static void growstack(int newlen) {
    if (newlen==0) newlen=100000;                                   // no link to icount range restriction, just 100,000 seems a good minimum at 0.4MB.
    if (newlen>gsmaxalloc) newlen=gsmaxalloc;
    gs[flip] = realloc(gs[flip], newlen*sizeof(int));
    if (gs[flip] == NULL) Error("Failed to realloc working memory stack to %d*4bytes (flip=%d)", newlen, flip);
    gsalloc[flip] = newlen;
}

static void push(int x) {
    if (!stackgrps || x==0) return;
    if (gsalloc[flip] == gsngrp[flip]) growstack(gsngrp[flip]*2);
    gs[flip][gsngrp[flip]++] = x;
    if (x > gsmax[flip]) gsmax[flip] = x;
}

static void mpush(int x, int n) {
    if (!stackgrps || x==0) return;
    if (gsalloc[flip] < gsngrp[flip]+n) growstack((gsngrp[flip]+n)*2);
    for (int i=0; i<n; i++) gs[flip][gsngrp[flip]++] = x;
    if (x > gsmax[flip]) gsmax[flip] = x;
}

static void flipflop() {
    flip = 1-flip;
    gsngrp[flip] = 0;
    gsmax[flip] = 0;
    if (gsalloc[flip] < gsalloc[1-flip]) growstack(gsalloc[1-flip]*2);
}

static void gsfree() {
    free(gs[0]); free(gs[1]);
    gs[0] = NULL; gs[1]= NULL;
    flip = 0;
    gsalloc[0] = gsalloc[1] = 0;
    gsngrp[0] = gsngrp[1] = 0;
    gsmax[0] = gsmax[1] = 0;
    gsmaxalloc = 0;
}

#ifdef TIMING_ON
  // many calls to clock() can be expensive, hence compiled out rather than switch(verbose)
  #include <time.h>
  #define NBLOCK 20  
  static clock_t tblock[NBLOCK], tstart;
  static int nblock[NBLOCK];
  #define TBEG() tstart = clock();
  #define TEND(i) tblock[i] += clock()-tstart; nblock[i]++; tstart = clock();
#else
  #define TBEG()
  #define TEND(i)
#endif


/* 
   icount originally copied from do_radixsort in src/main/sort.c @ rev 51389. Then reworked here again in forder.c in v1.8.11
   base::sort.list(method="radix") turns out not to be a radix sort, but a counting sort, and we like it.
   See http://r.789695.n4.nabble.com/method-radix-in-sort-list-isn-t-actually-a-radix-sort-tp3309470p3309470.html
   Main changes :
     1. Negatives are fine. Btw, wish raised for simple change to base R : https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=15644
     2. Doesn't cummulate through 0's for speed in repeated calls of sparse counts by saving memset back to 0 of many 0
     3. Separated setRange so forder can redirect to iradix
*/

static int range, xmin;                                                  // used by both icount and forder
static void setRange(int *x, int n)
{
    int i, tmp;
    xmin = NA_INTEGER;     // used by forder
    int xmax = NA_INTEGER; // declared locally as we only need xmin outside
    double overflow;
    
    i = 0;
    while(i<n && x[i]==NA_INTEGER) i++;
    if (i<n) xmax = xmin = x[i];
    for(; i < n; i++) {
        tmp = x[i];
        if(tmp == NA_INTEGER) continue;
        if (tmp > xmax) xmax = tmp;
        else if (tmp < xmin) xmin = tmp;
    }
    if(xmin == NA_INTEGER) {range = NA_INTEGER; return;}                    // all NAs, nothing to do    
    
    overflow = (double)xmax - (double)xmin + 1;                             // ex: x=c(-2147483647L, NA_integer_, 1L) results in overflowing int range.
    if (overflow > INT_MAX) {range = INT_MAX; return;}                      // detect and force iradix here, since icount is out of the picture
    range = xmax-xmin+1;
    
    return;
}

// x*order results in integer overflow when -1*NA, so careful to avoid that here :
static inline int icheck(int x) {
    return ((nalast != 1) ? ((x != NA_INTEGER) ? x*order : x) : ((x != NA_INTEGER) ? (x*order)-1 : INT_MAX)); // if nalast==1, NAs must go last.
}


static void icount(int *x, int *o, int n)
/* Counting sort:
    1. Places the ordering into o directly, overwriting whatever was there
    2. Doesn't change x
    3. Pushes group sizes onto stack
*/
{
    int i=0, tmp;
    int napos = range;  // always count NA in last bucket and we'll account for nalast option in due course
    static unsigned int counts[N_RANGE+1] = {0};                            // static is IMPORTANT, counting sort is called repetitively.
    /* counts are set back to 0 at the end efficiently. 1e5 = 0.4MB i.e 
    tiny. We'll only use the front part of it, as large as range. So it's 
    just reserving space, not using it. Have defined N_RANGE to be 100000.*/
    if (range > N_RANGE) Error("Internal error: range = %d; isorted can't handle range > %d", range, N_RANGE);
    for(i=0; i<n; i++) {
        if (x[i] == NA_INTEGER) counts[napos]++;             // For nalast=NA case, we won't remove/skip NAs, rather set 'o' indices
        else counts[x[i]-xmin]++;                            // to 0. subset will skip them. We can't know how many NAs to skip 
    }                                                        // beforehand - i.e. while allocating "ans" vector
    // TO DO: at this point if the last count==n then it's all the same number and we can stop now.
    // Idea from Terdiman, then improved on that by not needing to loop through counts.
    
    tmp = 0;
    if (nalast!=1 && counts[napos]) {
        push(counts[napos]);
        tmp += counts[napos];
    }
    int w = (order==1) ? 0 : range-1;                                   // *** BLOCK 4 ***
    for (i=0; i<range; i++) 
    /* no point in adding tmp<n && i<=range, since range includes max, 
       need to go to max, unlike 256 loops elsewhere in forder.c */
    {
        if (counts[w]) {                                                    // cumulate but not through 0's. Helps resetting zeros when n<range, below.
            push(counts[w]);
            counts[w] = (tmp += counts[w]);
        }
        w += order; // order is +1 or -1
    }
    if (nalast==1 && counts[napos]) {
        push(counts[napos]);
        counts[napos] = (tmp += counts[napos]);
    }
    for(i=n-1; i>=0; i--) {
        o[--counts[(x[i] == NA_INTEGER) ? napos : x[i]-xmin]] = (int)(i+1);    // This way na.last=TRUE/FALSE cases will have just a single if-check overhead.
    }
    if (nalast == 0)                                                        // nalast = 1, -1 are both taken care already.
        for (i=0; i<n; i++) o[i] = (x[o[i]-1] == NA_INTEGER) ? 0 : o[i];    // nalast = 0 is dealt with separately as it just sets o to 0
                                                                            // at those indices where x is NA. x[o[i]-1] because x is not modifed here.

    /* counts were cumulated above so leaves non zero. 
    Faster to clear up now ready for next time. */
    if (n < range) {
        /* Many zeros in counts already. Loop through n instead, 
        doesn't matter if we set to 0 several times on any repeats */
        counts[napos]=0;
        for (i=0; i<n; i++) {
            if (x[i]!=NA_INTEGER) counts[x[i]-xmin]=0;
        }
    } else {
        memset(counts, 0, (range+1)*sizeof(int));                           // *** BLOCK 6 ***
    }
    return;
}

static void iinsert(int *x, int *o, int n)
/*  orders both x and o by reference in-place. Fast for small vectors, low overhead.
    don't be tempted to binsearch backwards here, have to shift anyway; 
    many memmove would have overhead and do the same thing. */
/*  when nalast == 0, iinsert will be called only from within iradix, where o[.] = 0 
    for x[.]=NA is already taken care of */
{
    int i, j, xtmp, otmp, tt;
    for (i=1; i<n; i++) {
        xtmp = x[i];
        if (xtmp < x[i-1]) {
            j = i-1;
            otmp = o[i];
            while (j>=0 && xtmp < x[j]) { 
                x[j+1] = x[j];
                o[j+1] = o[j];
                j--;
            }
            x[j+1] = xtmp;
            o[j+1] = otmp;
        }
    }    
    tt = 0;
    for (i=1; i<n; i++) if (x[i]==x[i-1]) tt++; else { push(tt+1); tt=0; }
    push(tt+1);
}

/*
  iradix is a counting sort performed forwards from MSB to LSB, with some tricks 
  and short circuits building on Terdiman and Herf.
    http://codercorner.com/RadixSortRevisited.htm 
    http://stereopsis.com/radix.html
  
    ~ Note they are LSD, but we do MSD here which is more complicated, for efficiency.
    ~ NAs need no special treatment as NA is the most negative integer in R (checked in init.c once, 
      for efficiency) so NA naturally sort to the front.
    ~ Using 4-pass 1-byte radix for the following reasons :
        *  11-bit (Herf) reduces to 3-passes (3*11=33) yes, and LSD need random access 
           to o vector in each pass 1:n so reduction in passes
        *  is good, but ...
        *  ... Terdiman's idea to skip a radix if all values are equal occurs less the wider the 
               radix. A narrower radix benefits more from that.
        *      That's detected here using a single 'if', an improvement on Terdiman's exposition 
               of a single loop to find if any count==n
        *  The pass through counts bites when radix is wider, because we repetitively call this 
           iradix from fastorder forwards.
        *  Herf's parallel histogramming is neat. In 4-pass 1-byte it needs 4*256 storage, that's 
           tiny, and can be static. 4*256 << 3*2048
        *  4-pass 1-byte is simpler and tighter code than 3-pass 11-bit, giving modern optimizers 
           and modern CPUs a better chance. We may get
        *  lucky anyway, if one or two of the 4-passes are skipped.

     Recall: there are no comparisons at all in counting and radix, there is wide random access 
     in each LSD radix pass, though.
*/

static unsigned int radixcounts[8][257] = {{0}};                            // 4 are used for iradix, 8 for dradix and i64radix
static int skip[8];
/* global because iradix and iradix_r interact and are called repetitively. 
   counts are set back to 0 after each use, to benefit from skipped radix. */
static void *radix_xsub=NULL;
static size_t radix_xsuballoc=0;

static int *otmp=NULL, otmp_alloc=0;
static void alloc_otmp(int n) {
    if (otmp_alloc >= n) return;
    otmp = (int *)realloc(otmp, n * sizeof(int));
    if (otmp == NULL) Error("Failed to allocate working memory for otmp. Requested %d * %d bytes", n, sizeof(int));
    otmp_alloc = n;
}

static void *xtmp=NULL; int xtmp_alloc=0;                                   // TO DO: save xtmp if possible, see allocs in forder
static void alloc_xtmp(int n) {                                             // TO DO: currently always the largest type (double) but 
                                                                            // could be int if that's all that's needed
    if (xtmp_alloc >= n) return;
    xtmp = (double *)realloc(xtmp, n * sizeof(double));
    if (xtmp == NULL) Error("Failed to allocate working memory for xtmp. Requested %d * %d bytes", n, sizeof(double));
    xtmp_alloc = n;
}

static void iradix_r(int *xsub, int *osub, int n, int radix);

static void iradix(int *x, int *o, int n)
   /* As icount :
       Places the ordering into o directly, overwriting whatever was there
       Doesn't change x
       Pushes group sizes onto stack */
{
    int i, j, radix, nextradix, itmp, thisgrpn, maxgrpn;
    unsigned int thisx=0, shift, *thiscounts;

    for (i=0;i<n;i++) {
        /* parallel histogramming pass; i.e. count occurrences of 
        0:255 in each byte.  Sequential so almost negligible. */
        thisx = (unsigned int)(icheck(x[i])) - INT_MIN;                     // relies on overflow behaviour. And shouldn't -INT_MIN be up in iradix?
        radixcounts[0][thisx & 0xFF]++;                                     // unrolled since inside n-loop
        radixcounts[1][thisx >> 8 & 0xFF]++;
        radixcounts[2][thisx >> 16 & 0xFF]++;
        radixcounts[3][thisx >> 24 & 0xFF]++;
    }
    for (radix=0; radix<4; radix++) {
        /* any(count == n) => all radix must have been that value => 
        last x (still thisx) was that value */
        i = thisx >> (radix*8) & 0xFF;
        skip[radix] = radixcounts[radix][i] == n;
        if (skip[radix]) radixcounts[radix][i] = 0;                         // clear it now, the other counts must be 0 already
    }
    
    radix = 3;  // MSD
    while (radix>=0 && skip[radix]) radix--;
    if (radix==-1) {                                                        // All radix are skipped; i.e. one number repeated n times.
        if (nalast == 0 && x[0] == NA_INTEGER)                              // all values are identical. return 0 if nalast=0 & all NA
            for (i=0; i<n; i++) o[i] = 0;                                   // because of 'return', have to take care of it here.
        else for (i=0; i<n; i++) o[i] = (i+1);
        push(n); 
        return; 
    }
    for (i=radix-1; i>=0; i--) {
        if (!skip[i]) memset(radixcounts[i], 0, 257*sizeof(unsigned int));
        /* clear the counts as we only needed the parallel pass for skip[] 
           and we're going to use radixcounts again below. Can't use parallel 
           lower counts in MSD radix, unlike LSD. */
    }
    thiscounts = radixcounts[radix];
    shift = radix * 8;
    
    itmp = thiscounts[0];
    maxgrpn = itmp;
    for (i=1; itmp<n && i<256; i++) {
        thisgrpn = thiscounts[i];
        if (thisgrpn) {                                                     // don't cummulate through 0s, important below.
            if (thisgrpn>maxgrpn) maxgrpn = thisgrpn;
            thiscounts[i] = (itmp += thisgrpn);
        }
    }
    for (i=n-1; i>=0; i--) {
        thisx = ((unsigned int)(icheck(x[i])) - INT_MIN) >> shift & 0xFF;
        o[--thiscounts[thisx]] = i+1;
    }
    
    if (radix_xsuballoc < maxgrpn) {
        // The largest group according to the first non-skipped radix, so could be big (if radix is needed on first column)
        // TO DO: could include extra bits to divide the first radix up more. Often the MSD has groups in just 0-4 out of 256.
        // free'd at the end of forder once we're done calling iradix repetitively
        radix_xsub = (int *)realloc(radix_xsub, maxgrpn*sizeof(double));    // realloc(NULL) == malloc
        if (!radix_xsub) Error("Failed to realloc working memory %d*8bytes (xsub in iradix), radix=%d", maxgrpn, radix);
        radix_xsuballoc = maxgrpn;
    }
    
    alloc_otmp(maxgrpn);   // TO DO: can we leave this to forder and remove these calls??
    alloc_xtmp(maxgrpn);   // TO DO: doesn't need to be sizeof(double) always, see inside
    
    nextradix = radix-1;
    while (nextradix>=0 && skip[nextradix]) nextradix--;
    if (thiscounts[0] != 0) Error("Internal error. thiscounts[0]=%d but should have been decremented to 0. dradix=%d", thiscounts[0], radix);
    thiscounts[256] = n;
    itmp = 0;
    for (i=1; itmp<n && i<=256; i++) {
        if (thiscounts[i] == 0) continue;
        thisgrpn = thiscounts[i] - itmp;                                    // undo cumulate; i.e. diff
        if (thisgrpn == 1 || nextradix==-1) {
            push(thisgrpn);
        } else {
            for (j=0; j<thisgrpn; j++)
                ((int *)radix_xsub)[j] = icheck(x[o[itmp+j]-1]);            // this is why this xsub here can't be the same memory as xsub in forder.
            iradix_r(radix_xsub, o+itmp, thisgrpn, nextradix);              // changes xsub and o by reference recursively.
        }
        itmp = thiscounts[i];
        thiscounts[i] = 0;
    }
    if (nalast == 0)                                                        // nalast = 1, -1 are both taken care already.
        for (i=0; i<n; i++) o[i] = (x[o[i]-1] == NA_INTEGER) ? 0 : o[i];    // nalast = 0 is dealt with separately as it just sets o to 0
                                                                            // at those indices where x is NA. x[o[i]-1] because x is not 
                                                                            // modified by reference unlike iinsert or iradix_r
}

static void iradix_r(int *xsub, int *osub, int n, int radix)
    // xsub is a recursive offset into xsub working memory above in iradix, reordered by reference.
    // osub is a an offset into the main answer o, reordered by reference.
    // radix iterates 3,2,1,0
{
    int i, j, itmp, thisx, thisgrpn, nextradix, shift;
    unsigned int *thiscounts;
    
    if (n < N_SMALL) {              // N_SMALL=200 is guess based on limited testing. Needs calibrate().
                                    // Was 50 based on sum(1:50)=1275 worst -vs- 256 cummulate + 256 memset + allowance since reverse order is unlikely.
        iinsert(xsub, osub, n);     // when nalast==0, iinsert will be called only from within iradix.
        return;
    }
    
    shift = radix*8;
    thiscounts = radixcounts[radix];
    
    for (i=0; i<n; i++) {
        thisx = (unsigned int)xsub[i] - INT_MIN;                                // sequential in xsub
        thiscounts[thisx >> shift & 0xFF]++;
    }
    itmp = thiscounts[0];
    for (i=1; itmp<n && i<256; i++)
        if (thiscounts[i]) thiscounts[i] = (itmp += thiscounts[i]);             // don't cummulate through 0s, important below
    for (i=n-1; i>=0; i--) {
        thisx = ((unsigned int)xsub[i] - INT_MIN) >> shift & 0xFF;
        j = --thiscounts[thisx];
        otmp[j] = osub[i];
        ((int *)xtmp)[j] = xsub[i];
    }
    memcpy(osub, otmp, n*sizeof(int));
    memcpy(xsub, xtmp, n*sizeof(int));
    
    nextradix = radix-1;
    while (nextradix>=0 && skip[nextradix]) nextradix--;
    /* TO DO:  If nextradix==-1 AND no further columns from forder AND !retGrp, we're 
               done. We have o. Remember to memset thiscounts before returning. */
    
    if (thiscounts[0] != 0) Error("Logical error. thiscounts[0]=%d but should have been decremented to 0. radix=%d", thiscounts[0], radix);
    thiscounts[256] = n;
    itmp = 0;
    for (i=1; itmp<n && i<=256; i++) {
        if (thiscounts[i] == 0) continue;
        thisgrpn = thiscounts[i] - itmp;  // undo cummulate; i.e. diff
        if (thisgrpn == 1 || nextradix==-1) {
            push(thisgrpn);
        } else {
            iradix_r(xsub+itmp, osub+itmp, thisgrpn, nextradix);
        }
        itmp = thiscounts[i];
        thiscounts[i] = 0;
    }
}

// dradix from Arun's fastradixdouble.c
// + changed to MSD and hooked into forder framework here.
// + replaced tolerance with rounding s.f.

// No rounding by default, for now. Handles #1642, #1728, #1463, #485
static int dround = 0;
static unsigned long long dmask1;
static unsigned long long dmask2;

SEXP setNumericRounding(SEXP droundArg)
// init.c has initial call with default of 2
{
    if (!isInteger(droundArg) || LENGTH(droundArg)!=1) error("Must an integer or numeric vector length 1");
    if (INTEGER(droundArg)[0] < 0 || INTEGER(droundArg)[0] > 2) error("Must be 2 (default) or 1 or 0");
    dround = INTEGER(droundArg)[0];
    dmask1 = dround ? 1 << (8*dround-1) : 0;
    dmask2 = 0xffffffffffffffff << dround*8;
    return R_NilValue;
}

SEXP getNumericRounding()
{
    return ScalarInteger(dround);
}

static union {double d;
              unsigned long long ull;} u;
            //  int i;
            //  unsigned int ui;} u;

unsigned long long dtwiddle(void *p, int i, int order)
{
    u.d = order*((double *)p)[i];                               // take care of 'order' right at the beginning
    if (R_FINITE(u.d)) {
        u.ull = (u.d) ? u.ull + ((u.ull & dmask1) << 1) : 0;    // handle 0, -0 case. Fix for issues/743.
                                                                // tested on vector length 100e6. was the fastest fix (see results at the bottom of page)
    } else if (ISNAN(u.d)) {
     /* 1. NA twiddled to all bits 0, sorts first.  R's value 1954 cleared.
        2. NaN twiddled to set just bit 13, sorts immediately after NA. 13th bit to be 
           consistent with "quiet" na bit but any bit outside last 2 bytes would do.
           (ref: http://r.789695.n4.nabble.com/Question-re-NA-NaNs-in-R-td4685014.html) 
        3. This also normalises a difference between NA on 32bit R (bit 13 set) and 64bit R (bit 13 not set) 
        4. -Inf twiddled to : 0 sign, exponent all 0, mantissa all 1, sorts after NaN
        5. +Inf twiddled to : 1 sign, exponent all 1, mantissa all 0, sorts last since finite 
           numbers are defined by not-all-1 in exponent */
        u.ull = (ISNA(u.d) ? 0 : (1ULL << 51));
        return (nalast == 1 ? ~u.ull : u.ull);        
    }
    unsigned long long mask = (u.ull & 0x8000000000000000) ? 
                     0xffffffffffffffff : 0x8000000000000000;       // always flip sign bit and if negative (sign bit was set) flip other bits too
    return( (u.ull ^ mask) & dmask2 );
}

unsigned long long i64twiddle(void *p, int i, int order)
// 'order' is in effect now - ascending and descending order implemented. Default 
// case (setkey) will not be affected much because nalast != 1 and order == 1 are 
// defaults. 
{
    u.d = ((double *)p)[i];
    u.ull ^= 0x8000000000000000;
    if (nalast != 1) {
        if (order != 1 && u.ull != 0) u.ull ^= 0xffffffffffffffff;
    } else {
        if ( (order == 1 && u.ull==0) || (order != 1) )
            u.ull ^= 0xffffffffffffffff;
    }
    // another way - equivalent as above.
    // if (order == 1) {// u.ull = 0 now means NA
    //     if (nalast == 1 && u.ull == 0) u.ull ^= 0xffffffffffffffff;
    // } else {
    //     if (!(nalast != 1 && u.ull == 0)) u.ull ^= 0xffffffffffffffff;
    // }
    return u.ull;
}

Rboolean dnan(void *p, int i) {
    u.d = ((double *)p)[i];
    return (ISNAN(u.d));
}

Rboolean i64nan(void *p, int i) {
    u.d = ((double *)p)[i];
    return ((u.ull ^ 0x8000000000000000) == 0);
}

/*
unsigned long long i32twiddle(void *p, int i)
{
    u.i = ((int *)p)[i];  // a cast of int to long would introduce 31 redundant bits after the sign bit (better to be packed together in 32)
    unsigned int mask = -(int)(u.ui >> 31) | 0x80000000;
    u.ui ^= mask;
    return( u.ull );
}
*/

unsigned long long (*twiddle)(void *, int, int);
// integer64 has NA = 0x8000000000000000. And it gives TRUE for all ISNAN(.) when '.' is -ve number.
// So, ISNAN(.) would just provide wrong results. This was particularly an issue while implementing
// DT[order(., na.last=NA)] where '.' is an integer64 column. Therefore, 'is_nan'. This is basically 
// ISNAN(.) for double and (u.ull ^ 0x8000000000000000 == 0) for integer64.
Rboolean (*is_nan)(void *, int);
size_t colSize=8;  // the size of the column type (4 or 8). Just 8 currently until iradix is merged in.

static void dradix_r(unsigned char *xsub, int *osub, int n, int radix);

#ifdef WORDS_BIGENDIAN
#define RADIX_BYTE colSize-radix-1
#else
#define RADIX_BYTE radix
#endif

static void dradix(unsigned char *x, int *o, int n)
{
    int i, j, radix, nextradix, itmp, thisgrpn, maxgrpn;
    unsigned int *thiscounts;
    unsigned long long thisx=0;
    // see comments in iradix for structure.  This follows the same. TO DO: merge iradix in here (almost ready)
    for (i=0;i<n;i++) {
        thisx = twiddle(x,i,order);
        for (radix=0; radix<colSize; radix++)
            radixcounts[radix][((unsigned char *)&thisx)[RADIX_BYTE]]++;
        // if dround==2 then radix 0 and 1 will be all 0 here and skipped.
        /* on little endian, 0 is the least significant bits (the right) 
        / and 7 is the most including sign (the left); i.e. reversed. */
    }
    for (radix=0; radix<colSize; radix++) {
        i = ((unsigned char *)&thisx)[RADIX_BYTE];                           // thisx is the last x after loop above
        skip[radix] = radixcounts[radix][i] == n;
        if (skip[radix]) radixcounts[radix][i] = 0;                     // clear it now, the other counts must be 0 already
    }
    radix = colSize-1;  // MSD
    while (radix>=0 && skip[radix]) radix--;
    if (radix==-1) {                                                    // All radix are skipped; i.e. one number repeated n times.
        if (nalast == 0 && is_nan(x, 0))                               // all values are identical. return 0 if nalast=0 & all NA
            for (i=0; i<n; i++) o[i] = 0;                               // because of 'return', have to take care of it here.
        else for (i=0; i<n; i++) o[i] = (i+1);
        push(n);
        return;
    }
    for (i=radix-1; i>=0; i--) {  // clear the lower radix counts, we only did them to know skip. will be reused within each group
        if (!skip[i]) memset(radixcounts[i], 0, 257*sizeof(unsigned int));
    }
    thiscounts = radixcounts[radix];
    itmp = thiscounts[0];
    maxgrpn = itmp;
    for (i=1; itmp<n && i<256; i++) {
        thisgrpn = thiscounts[i];
        if (thisgrpn) {  // don't cummulate through 0s, important below
            if (thisgrpn>maxgrpn) maxgrpn = thisgrpn;
            thiscounts[i] = (itmp += thisgrpn);
        }
    }
    for (i=n-1; i>=0; i--) {
        thisx = twiddle(x,i,order);
        o[ --thiscounts[((unsigned char *)&thisx)[RADIX_BYTE]] ] = i+1;
    }

    if (radix_xsuballoc < maxgrpn) {                                            // TO DO: centralize this alloc
        // The largest group according to the first non-skipped radix, so could be big (if radix is needed on first column)
        // TO DO: could include extra bits to divide the first radix up more. Often the MSD has groups in just 0-4 out of 256.
        // free'd at the end of forder once we're done calling iradix repetitively
        radix_xsub = (double *)realloc(radix_xsub, maxgrpn*sizeof(double));  // realloc(NULL) == malloc
        if (!radix_xsub) Error("Failed to realloc working memory %d*8bytes (xsub in dradix), radix=%d", maxgrpn, radix);
        radix_xsuballoc = maxgrpn;
    }
    
    alloc_otmp(maxgrpn);   // TO DO: leave to forder and remove these calls?
    alloc_xtmp(maxgrpn);
    
    nextradix = radix-1;
    while (nextradix>=0 && skip[nextradix]) nextradix--;
    if (thiscounts[0] != 0) Error("Logical error. thiscounts[0]=%d but should have been decremented to 0. dradix=%d", thiscounts[0], radix);
    thiscounts[256] = n;
    itmp = 0;
    for (i=1; itmp<n && i<=256; i++) {
        if (thiscounts[i] == 0) continue;
        thisgrpn = thiscounts[i] - itmp;  // undo cummulate; i.e. diff
        if (thisgrpn == 1 || nextradix==-1) {
            push(thisgrpn);
        } else {
            if (colSize==4) { // ready for merging in iradix ...
                error("Not yet used, still using iradix instead");
                for (j=0; j<thisgrpn; j++) ((int *)radix_xsub)[j] = twiddle(x, o[itmp+j]-1, order);  
            } else for (j=0; j<thisgrpn; j++) ((unsigned long long *)radix_xsub)[j] = twiddle(x, o[itmp+j]-1, order); // this is why this xsub here can't be the same memory as xsub in forder
            dradix_r(radix_xsub, o+itmp, thisgrpn, nextradix); // changes xsub and o by reference recursively.
        }
        itmp = thiscounts[i];
        thiscounts[i] = 0;
    }
    if (nalast == 0)                                                 // nalast = 1, -1 are both taken care already.
        for (i=0; i<n; i++) o[i] = is_nan(x, o[i]-1) ? 0 : o[i];     // nalast = 0 is dealt with separately as it just sets o to 0 
                                                                     // at those indices where x is NA. x[o[i]-1] because x is not 
                                                                     // modified by reference unlike iinsert or iradix_r

}

static void dinsert(unsigned long long *x, int *o, int n)
// orders both x and o by reference in-place. Fast for small vectors, low overhead.
// don't be tempted to binsearch backwards here, have to shift anyway; many memmove would have overhead and do the same thing
// 'dinsert' will not be called when nalast = 0 and o[0] = -1.
{
    int i, j, otmp, tt;
    unsigned long long xtmp;
    for (i=1; i<n; i++) {
        xtmp = x[i];
        if (xtmp < x[i-1]) {
            j = i-1;
            otmp = o[i];
            while (j>=0 && xtmp < x[j]) { x[j+1] = x[j]; o[j+1] = o[j]; j--; }
            x[j+1] = xtmp;
            o[j+1] = otmp;
        }
    }
    tt = 0;
    for (i=1; i<n; i++) if (x[i]==x[i-1]) tt++; else { push(tt+1); tt=0; }
    push(tt+1);
}

static void dradix_r(unsigned char *xsub, int *osub, int n, int radix)
    /* xsub is a recursive offset into xsub working memory above in dradix, reordered by reference.
       osub is a an offset into the main answer o, reordered by reference.
       dradix iterates 7,6,5,4,3,2,1,0 */
{
    int i, j, itmp, thisgrpn, nextradix;
    unsigned int *thiscounts;
    unsigned char *p;
    if (n < 200) {
        /* 200 is guess based on limited testing. Needs calibrate(). Was 50 
        based on sum(1:50)=1275 worst -vs- 256 cummulate + 256 memset + 
        allowance since reverse order is unlikely */
        dinsert((void *)xsub, osub, n);                                         // order=1 here because it's already taken care of in iradix
        return;
    }
    thiscounts = radixcounts[radix];
    p = xsub + RADIX_BYTE;
    for (i=0; i<n; i++) {
        thiscounts[*p]++;
        p += colSize;
    }
    itmp = thiscounts[0];
    for (i=1; itmp<n && i<256; i++)
        if (thiscounts[i]) thiscounts[i] = (itmp += thiscounts[i]);             // don't cummulate through 0s, important below
    p = xsub + (n-1)*colSize;
    if (colSize == 4) {
        error("Not yet used, still using iradix instead");
        for (i=n-1; i>=0; i--) {
            j = --thiscounts[*(p+RADIX_BYTE)];
            otmp[j] = osub[i];
            ((int *)xtmp)[j] = *(int *)p;
            p -= colSize;
        }
    } else { 
        for (i=n-1; i>=0; i--) {
            j = --thiscounts[*(p+RADIX_BYTE)];
            otmp[j] = osub[i];
            ((unsigned long long *)xtmp)[j] = *(unsigned long long *)p;
            p -= colSize;
        }
    }
    memcpy(osub, otmp, n*sizeof(int));
    memcpy(xsub, xtmp, n*colSize);
    
    nextradix = radix-1;
    while (nextradix>=0 && skip[nextradix]) nextradix--;
    // TO DO:  If nextradix==-1 and no further columns from forder,  we're done. We have o. Remember to memset thiscounts before returning.
    
    if (thiscounts[0] != 0) Error("Logical error. thiscounts[0]=%d but should have been decremented to 0. radix=%d", thiscounts[0], radix);    
    thiscounts[256] = n;
    itmp = 0;
    for (i=1; itmp<n && i<=256; i++) {
        if (thiscounts[i] == 0) continue;
        thisgrpn = thiscounts[i] - itmp;  // undo cummulate; i.e. diff
        if (thisgrpn == 1 || nextradix==-1) {
            push(thisgrpn);
        } else {
            dradix_r(xsub + itmp*colSize, osub+itmp, thisgrpn, nextradix);
        }
        itmp = thiscounts[i];
        thiscounts[i] = 0;
    }
}

// TO DO?: dcount. Find step size, then range = (max-min)/step and proceed as icount. Many fixed precision floats (such as prices)
// may be suitable. Fixed precision such as 1.10, 1.15, 1.20, 1.25, 1.30 ... do use all bits so dradix skipping may not help.

static int *cradix_counts = NULL;
static int cradix_counts_alloc = 0;
static int maxlen = 1;
static SEXP *cradix_xtmp = NULL;
static int cradix_xtmp_alloc = 0;

int StrCmp2(SEXP x, SEXP y) {    // same as StrCmp but also takes into account 'na.last' argument.
    if (x == y) return 0;                   // same cached pointer (including NA_STRING==NA_STRING)
    if (x == NA_STRING) return nalast;      // if x=NA, nalast=1 ? then x > y else x < y (Note: nalast == 0 is already taken care of in 'csorted', won't be 0 here)
    if (y == NA_STRING) return -nalast;     // if y=NA, nalast=1 ? then y > x
    return order*strcmp(CHAR(ENC2UTF8(x)), CHAR(ENC2UTF8(y)));  // same as explanation in StrCmp
}

int StrCmp(SEXP x, SEXP y)            // also used by bmerge and chmatch
{
    if (x == y) return 0;             // same cached pointer (including NA_STRING==NA_STRING)
    if (x == NA_STRING) return -1;    // x<y
    if (y == NA_STRING) return 1;     // x>y
    return strcmp(CHAR(ENC2UTF8(x)), CHAR(ENC2UTF8(y))); // ENC2UTF8 handles encoding issues by converting all marked non-utf8 encodings alone to utf8 first. The function could be wrapped in the first if-statement already instead of at the last stage, but this is to ensure that all-ascii cases are handled with maximum efficiency.
    // This seems to fix the issues as far as I've checked. Will revisit if necessary.
    
    // OLD COMMENT: can return 0 here for the same string in known and unknown encodings, good if the unknown string is in that encoding but not if not ordering is ascii only (C locale). TO DO: revisit and allow user to change to strcoll, and take account of Encoding. see comments in bmerge().  10k calls of strcmp = 0.37s, 10k calls of strcoll = 4.7s. See ?Comparison, ?Encoding, Scollate in R internals.

}

// TO DO: check that all unknown encodings are ascii; i.e. no non-ascii unknowns are present, and that either Latin1
//        or UTF-8 is used by user, not both. Then error if not. If ok, then can proceed with byte level. ascii is never marked known by R, but non-ascii (i.e. knowable encoding) could be marked unknown.
//        does R internals have is_ascii function exported?  If not, simple enough.

static void cradix_r(SEXP *xsub, int n, int radix)
// xsub is a unique set of CHARSXP, to be ordered by reference
// First time, radix==0, and xsub==x. Then recursively moves SEXP together for L1 cache efficiency.
// Quite different to iradix because 
//   1) x is known to be unique so fits in cache (wide random access not an issue)
//   2) they're variable length character strings
//   3) no need to maintain o.  Just simply reorder x. No grps or push.
// Fortunately, UTF sorts in the same order if treated as ASCII, so we can simplify by doing it by bytes.  TO DO: confirm
// a forwards (MSD) radix for efficiency, although more complicated
// This part has nothing to do with truelength. The truelength stuff is to do with finding the unique strings.
// We may be able to improve CHARSXP derefencing by submitting patch to R to make R's string cache contiguous
// but would likely be difficult. If we strxfrm, then it'll then be contiguous and compact then anyway.
{
    int i, j, itmp, *thiscounts, thisgrpn=0, thisx=0;
    SEXP stmp;
    
    // TO DO?: chmatch to existing sorted vector, then grow it.
    // TO DO?: if (n<N_SMALL=200) insert sort, then loop through groups via ==
    if (n<=1) return;
    if (n==2) {
        if (StrCmp(xsub[1],xsub[0])<0) { stmp=xsub[0]; xsub[0]=xsub[1]; xsub[1]=stmp; }
        return;
    }
    // TO DO: if (n<50) cinsert (continuing from radix offset into CHAR) or using StrCmp. But 256 is narrow, so quick and not too much an issue.
    
    thiscounts = cradix_counts + radix*256;
    for (i=0; i<n; i++) {
        thisx = xsub[i]==NA_STRING ? 0 : (radix<LENGTH(xsub[i]) ? (unsigned char)(CHAR(xsub[i])[radix]) : 1);
        thiscounts[ thisx ]++;   // 0 for NA,  1 for ""
    }
    if (thiscounts[thisx] == n && radix < maxlen-1) {   // this also catches when subx has shorter strings than the rest, thiscounts[0]==n and we'll recurse very quickly through to the overall maxlen with no 256 overhead each time
        cradix_r(xsub, n, radix+1);
        thiscounts[thisx] = 0;  // the rest must be 0 already, save the memset
        return;
    }
    itmp = thiscounts[0];
    for (i=1; i<256; i++)
        if (thiscounts[i]) thiscounts[i] = (itmp += thiscounts[i]);  // don't cummulate through 0s, important below
    for (i=n-1; i>=0; i--) {
        thisx = xsub[i]==NA_STRING ? 0 : (radix<LENGTH(xsub[i]) ? (unsigned char)(CHAR(xsub[i])[radix]) : 1);
        j = --thiscounts[thisx];
        cradix_xtmp[j] = xsub[i];
    }
    memcpy(xsub, cradix_xtmp, n*sizeof(SEXP));
    if (radix == maxlen-1) {
        memset(thiscounts, 0, 256*sizeof(int)); 
        return;
    }
    if (thiscounts[0] != 0) Error("Logical error. counts[0]=%d in cradix but should have been decremented to 0. radix=%d", thiscounts[0], radix);
    itmp = 0;
    for (i=1;i<256;i++) {
        if (thiscounts[i] == 0) continue;
        thisgrpn = thiscounts[i] - itmp;  // undo cummulate; i.e. diff
        cradix_r(xsub+itmp, thisgrpn, radix+1);
        itmp = thiscounts[i];
        thiscounts[i] = 0;  // set to 0 now since we're here, saves memset afterwards. Important to clear! Also more portable for machines where 0 isn't all bits 0 (?!)
    }
    if (itmp<n-1) cradix_r(xsub+itmp, n-itmp, radix+1);  // final group
}

static SEXP *ustr = NULL;
static int ustr_alloc = 0, ustr_n = 0;

static void cgroup(SEXP *x, int *o, int n)
// As icount :
//   Places the ordering into o directly, overwriting whatever was there
//   Doesn't change x
//   Pushes group sizes onto stack
// Only run when sortStr==FALSE. Basically a counting sort, in first appearance order, directly.
// Since it doesn't sort the strings, the name is cgroup.
// there is no _pre for this.  ustr created and cleared each time.
{
    SEXP s;
    int i, k, cumsum;
    // savetl_init() is called once at the start of forder
    if (ustr_n != 0) Error("Internal error. ustr isn't empty when starting cgroup: ustr_n=%d, ustr_alloc=%d", ustr_n, ustr_alloc);
    for(i=0; i<n; i++) {
        s = x[i];
        if (TRUELENGTH(s)<0) {                   // this case first as it's the most frequent
            SET_TRUELENGTH(s, TRUELENGTH(s)-1);  // use negative counts so as to detect R's own (positive) usage of tl on CHARSXP
            continue;
        }
        if (TRUELENGTH(s)>0) {  // Save any of R's own usage of tl (assumed positive, so we can both count and save in one scan), to restore
            savetl(s);          // afterwards. From R 2.14.0, tl is initialized to 0, prior to that it was random so this step saved too much.
            SET_TRUELENGTH(s,0);
        }
        if (ustr_alloc<=ustr_n) {
            ustr_alloc = (ustr_alloc == 0) ? 10000 : ustr_alloc*2;  // 10000 = 78k of 8byte pointers. Small initial guess, negligible time to alloc. 
            if (ustr_alloc>n) ustr_alloc = n;
            ustr = realloc(ustr, ustr_alloc * sizeof(SEXP));
            if (ustr == NULL) Error("Unable to realloc %d * %d bytes in cgroup", ustr_alloc, sizeof(SEXP));
        }
        SET_TRUELENGTH(s, -1); 
        ustr[ustr_n++] = s;
    }
    // TO DO: the same string in different encodings will be considered different here. Sweep through ustr and merge counts where equal (sort needed therefore, unfortunately?, only if there are any marked encodings present)
    cumsum = 0;
    for(i=0; i<ustr_n; i++) {                                    // 0.000
        push(-TRUELENGTH(ustr[i]));                            
        SET_TRUELENGTH(ustr[i], cumsum += -TRUELENGTH(ustr[i]));
    }
    int *target = (o[0] != -1) ? newo : o;
    for(i=n-1; i>=0; i--) {                     
        s = x[i];                                            // 0.400 (page fetches on string cache)
        SET_TRUELENGTH(s, k = TRUELENGTH(s)-1);
        target[k] = i+1;                                     // 0.800 (random access to o)
    }
    for(i=0; i<ustr_n; i++) SET_TRUELENGTH(ustr[i],0);     // The cummulate meant counts are left non zero, so reset for next time (0.00s).
    ustr_n = 0;
}

static int *csort_otmp=NULL, csort_otmp_alloc=0;
static void alloc_csort_otmp(int n) {
    if (csort_otmp_alloc >= n) return;
    csort_otmp = (int *)realloc(csort_otmp, n * sizeof(int));
    if (csort_otmp == NULL) Error("Failed to allocate working memory for csort_otmp. Requested %d * %d bytes", n, sizeof(int));
    csort_otmp_alloc = n;
}

static void csort(SEXP *x, int *o, int n)
/* 
   As icount :
    Places the ordering into o directly, overwriting whatever was there
    Doesn't change x
    Pushes group sizes onto stack
   Requires csort_pre() to have created and sorted ustr already 
*/
{
    int i;
    /* can't use otmp, since iradix might be called here and that uses otmp (and xtmp).
       alloc_csort_otmp(n) is called from forder for either n=nrow if 1st column, 
       or n=maxgrpn if onwards columns */
    for(i=0; i<n; i++) csort_otmp[i] = (x[i] == NA_STRING) ? NA_INTEGER : -TRUELENGTH(x[i]);
    if (nalast == 0 && n == 2) {                        // special case for nalast==0. n==1 is handled inside forder. at least 1 will be NA here
        if (o[0] == -1) for (i=0; i<n; i++) o[i] = i+1;    // else use o from caller directly (not 1st column)
        for (int i=0; i<n; i++) if (csort_otmp[i] == NA_INTEGER) o[i] = 0;
        push(1); push(1);
        return; 
    }
    if (n < N_SMALL && nalast != 0) {                                    // TO DO: calibrate() N_SMALL=200
        if (o[0] == -1) for (i=0; i<n; i++) o[i] = i+1;    // else use o from caller directly (not 1st column)
        for (int i=0; i<n; i++) csort_otmp[i] = icheck(csort_otmp[i]);
        iinsert(csort_otmp, o, n);
    } else {
        setRange(csort_otmp, n);
        if (range == NA_INTEGER) Error("Internal error. csort's otmp contains all-NA");
        int *target = (o[0] != -1) ? newo : o; 
        if (range <= N_RANGE) // && range<n)            // TO DO: calibrate(). radix was faster (9.2s "range<=10000" instead of 11.6s 
            icount(csort_otmp, target, n);       // "range<=N_RANGE && range<n") for run(7) where range=N_RANGE n=10000000
        else iradix(csort_otmp, target, n);
    }
    // all i* push onto stack. Using their counts may be faster here than thrashing SEXP fetches over several passes as cgroup does
    // (but cgroup needs that to keep orginal order, and cgroup saves the sort in csort_pre).
}

static void csort_pre(SEXP *x, int n)
// Finds ustr and sorts it.
// Runs once for each column (if sortStr==TRUE), then ustr is used by csort within each group
// ustr is grown on each character column, to save sorting the same strings again if several columns contain the same strings
{
    SEXP s;
    int i, old_un, new_un;
    // savetl_init() is called once at the start of forder
    old_un = ustr_n;
    for(i=0; i<n; i++) {
        s = x[i];
        if (TRUELENGTH(s)<0) continue;   // this case first as it's the most frequent. Already in ustr, this negative is its ordering.
        if (TRUELENGTH(s)>0) {  // Save any of R's own usage of tl (assumed positive, so we can both count and save in one scan), to restore
            savetl(s);          // afterwards. From R 2.14.0, tl is initialized to 0, prior to that it was random so this step saved too much.
            SET_TRUELENGTH(s,0);
        }
        if (ustr_alloc<=ustr_n) {
            ustr_alloc = (ustr_alloc == 0) ? 10000 : ustr_alloc*2;  // 10000 = 78k of 8byte pointers. Small initial guess, negligible time to alloc. 
            if (ustr_alloc > old_un+n) ustr_alloc = old_un + n;
            ustr = realloc(ustr, ustr_alloc * sizeof(SEXP));
            if (ustr==NULL) Error("Failed to realloc ustr. Requested %d * %d bytes", ustr_alloc, sizeof(SEXP));
        }
        SET_TRUELENGTH(s, -1);  // this -1 will become its ordering later below
        ustr[ustr_n++] = s;
        if (s!=NA_STRING && LENGTH(s)>maxlen) maxlen=LENGTH(s);  // length on CHARSXP is the nchar of char * (excluding \0), and treats marked encodings as if ascii.
    }
    new_un = ustr_n;
    if (new_un == old_un) return;  // No new strings observed, seen them all before in previous column. ustr already sufficient.
    // If we ever make ustr permanently held by data.table, we'll just need to make the final loop to set -i-1 before returning here.
    // sort ustr.  TO DO: just sort new ones and merge them in.
    // These allocs are here, to save them being in the recursive cradix_r()
    if (cradix_counts_alloc < maxlen) {
        cradix_counts_alloc = maxlen + 10;   // +10 to save too many reallocs
        cradix_counts = (int *)realloc(cradix_counts, cradix_counts_alloc * 256 * sizeof(int) );  // stack of counts
        if (!cradix_counts) Error("Failed to alloc cradix_counts");
        memset(cradix_counts, 0, cradix_counts_alloc * 256 * sizeof(int));
    }
    if (cradix_xtmp_alloc < ustr_n) {
        cradix_xtmp = (SEXP *)realloc( cradix_xtmp,  ustr_n * sizeof(SEXP) );  // TO DO: Reuse the one we have in forder. Does it need to be n length?
        if (!cradix_xtmp) Error("Failed to alloc cradix_tmp");
        cradix_xtmp_alloc = ustr_n;
    }
    cradix_r(ustr, ustr_n, 0);  // sorts ustr in-place by reference
    for(i=0; i<ustr_n; i++)     // save ordering in the CHARSXP. negative so as to distinguish with R's own usage.
        SET_TRUELENGTH(ustr[i], -i-1);
}

// functions to test vectors for sortedness: isorted, dsorted and csorted
// base:is.unsorted uses any(is.na(x)) at R level (inefficient), and returns NA in the presence of any NA.
// Hence, here we deal with NAs in C and return true if NAs are all at the beginning (what we need in data.table).
// We also return -1 if x is sorted in _strictly_ reverse order; a common case we optimize in forder.
// If a vector is in decreasing order *with ties*, then an in-place reverse (no sort) would result in instability of ties (TO DO).
// For use by forder only, which now returns NULL if already sorted (hence no need for separate is.sorted).
// TO DO: test in big steps first to return faster if unsortedness is at the end (a common case of rbind'ing data to end)
// These are all sequential access to x, so very quick and cache efficient.

static int isorted(int *x, int n)                           // order = 1 is ascending and order=-1 is descending
{                                                           // also takes care of na.last argument with check through 'icheck'
                                                            // Relies on NA_INTEGER==INT_MIN, checked in init.c
    int i=1,j=0;
    if (nalast == 0) {                                      // when nalast = NA, 
        for (int k=0; k<n; k++) if (x[k] != NA_INTEGER) j++;
        if (j == 0) { push(n); return(-2); }                // all NAs ? return special value to replace all o's values with '0'
        if (j != n) return(0);                              // any NAs ? return 0 = unsorted and leave it to sort routines to replace o's with 0's
    }                                                       // no NAs  ? continue to check the rest of isorted - the same routine as usual
    if (n<=1) { push(n); return(1); }
    if (icheck(x[1]) < icheck(x[0])) {
        i = 2;
        while (i<n && icheck(x[i]) < icheck(x[i-1])) i++;
        if (i==n) { mpush(1,n); return(-1);}                // strictly opposite to expected 'order', no ties; 
                                                            // e.g. no more than one NA at the beginning/end (for order=-1/1)
        else return(0);
    }
    int old = gsngrp[flip];
    int tt = 1;
    for (i=1; i<n; i++) {
        if (icheck(x[i]) < icheck(x[i-1])) { gsngrp[flip] = old; return(0); }
        if (x[i]==x[i-1]) tt++; else { push(tt); tt=1; }
    }
    push(tt);
    return(1);                                              // same as 'order', NAs at the beginning for order=1, at end for order=-1, possibly with ties
}

static int dsorted(double *x, int n)                        // order=1 is ascending and -1 is descending
{                                                           // also accounts for nalast=0 (=NA), =1 (TRUE), -1 (FALSE) (in twiddle)
    int i=1,j=0;
    unsigned long long prev, this;
    if (nalast == 0) {                                      // when nalast = NA, 
        for (int k=0; k<n; k++) if (!is_nan(x, k)) j++;
        if (j == 0) { push(n); return(-2); }                // all NAs ? return special value to replace all o's values with '0'
        if (j != n) return(0);                              // any NAs ? return 0 = unsorted and leave it to sort routines to replace o's with 0's
    }                                                       // no NAs  ? continue to check the rest of isorted - the same routine as usual
    if (n<=1) { push(n); return(1); }
    prev = twiddle(x,0,order);
    this = twiddle(x,1,order);
    if (this < prev) {
        i = 2;
        prev=this;
        while (i<n && (this=twiddle(x,i,order)) < prev) {i++; prev=this; }
        if (i==n) { mpush(1,n); return(-1);}                // strictly opposite of expected 'order', no ties; 
                                                            // e.g. no more than one NA at the beginning/end (for order=-1/1)
        else return(0);                                     // TO DO: improve to be stable for ties in reverse
    }
    int old = gsngrp[flip];
    int tt = 1;
    for (i=1; i<n; i++) {
        this = twiddle(x,i,order);                          // TO DO: once we get past -Inf, NA and NaN at the bottom,  and +Inf at the top, 
                                                            //        the middle only need be twiddled for tolerance (worth it?)
        if (this < prev) { gsngrp[flip] = old; return(0); }
        if (this==prev) tt++; else { push(tt); tt=1; }
        prev = this;
    }
    push(tt);
    return(1);                                              // exactly as expected in 'order' (1=increasing, -1=decreasing), possibly with ties
}

static int csorted(SEXP *x, int n)                          // order=1 is ascending and -1 is descending
{                                                           // also accounts for nalast=0 (=NA), =1 (TRUE), -1 (FALSE)
    int i=1, j=0, tmp;
    if (nalast == 0) {                                      // when nalast = NA, 
        for (int k=0; k<n; k++) if (x[k] != NA_STRING) j++;
        if (j == 0) { push(n); return(-2); }                // all NAs ? return special value to replace all o's values with '0'
        if (j != n) return(0);                              // any NAs ? return 0 = unsorted and leave it to sort routines to replace o's with 0's
    }                                                       // no NAs  ? continue to check the rest of isorted - the same routine as usual
    if (n<=1) { push(n); return(1); }
    if (StrCmp2(x[1],x[0])<0) {
        i = 2;
        while (i<n && StrCmp2(x[i],x[i-1])<0) i++;
        if (i==n) { mpush(1,n); return(-1);}                // strictly opposite of expected 'order', no ties; 
                                                            // e.g. no more than one NA at the beginning/end (for order=-1/1)
        else return(0);
    }
    int old = gsngrp[flip];
    int tt = 1;
    for (i=1; i<n; i++) {
        tmp = StrCmp2(x[i],x[i-1]);
        if (tmp < 0) { gsngrp[flip] = old; return(0); }
        if (tmp == 0) tt++; else { push(tt); tt=1; }
    }
    push(tt);
    return(1);                                              // exactly as expected in 'order', possibly with ties 
}

static void isort(int *x, int *o, int n)
{
    if (n<=2) {
        if (nalast == 0 && n == 2) {                        // nalast = 0 and n == 2 (check bottom of this file for explanation)
            if (o[0]==-1) { o[0]=1; o[1]=2; }
            for (int i=0; i<n; i++) if (x[i] == NA_INTEGER) o[i] = 0; 
            push(1); push(1);
            return;
        } else Error("Internal error: isort received n=%d. isorted should have dealt with this (e.g. as a reverse sorted vector) already",n);
    }
    if (n<N_SMALL && o[0] != -1 && nalast != 0) {                 // see comment above in iradix_r on N_SMALL=200.
        /* if not o[0] then can't just populate with 1:n here, since x is changed by ref too (so would need to be copied). */
        /* pushes inside too. Changes x and o by reference, so not suitable  in first column when o hasn't been populated yet 
           and x is the actual column in DT (hence check on o[0]). */
        if (order != 1 || nalast != -1)                     // so that default case, i.e., order=1, nalast=FALSE will not be affected (ex: `setkey`)
            for (int i=0; i<n; i++) x[i] = icheck(x[i]);
        iinsert(x, o, n);
    } else {
        /* Tighter range (e.g. copes better with a few abormally large values in some groups), but also, when setRange was once at 
           colum level that caused an extra scan of (long) x first. 10,000 calls to setRange takes just 0.04s i.e. negligible. */
        setRange(x, n);
        if (range==NA_INTEGER) Error("Internal error: isort passed all-NA. isorted should have caught this before this point");
        int *target = (o[0] != -1) ? newo : o;
        if (range<=N_RANGE && range<=n) {                   // was range<10000 for subgroups, but 1e5 for the first column, 
            icount(x, target, n);                           // tried to generalise here.  1e4 rather than 1e5 here because iterated
        } else {                                            // was (thisgrpn < 200 || range > 20000) then radix
            iradix(x, target, n);                           // a short vector with large range can bite icount when iterated (BLOCK 4 and 6)
        }
    }
    // TO DO: add calibrate() to init.c
}

static void dsort(double *x, int *o, int n)
{
    if (n <= 2) {                                           // nalast = 0 and n == 2 (check bottom of this file for explanation)
        if (nalast == 0 && n == 2) {                        // don't have to twiddle here.. at least one will be NA and 'n' WILL BE 2.
            if (o[0]==-1) { o[0]=1; o[1]=2; }
            for (int i=0; i<n; i++) if (is_nan(x, i)) o[i] = 0;
            push(1); push(1);
            return;
        } Error("Internal error: dsort received n=%d. dsorted should have dealt with this (e.g. as a reverse sorted vector) already",n);
    }
    if (n<N_SMALL && o[0] != -1 && nalast != 0) {                                    // see comment above in iradix_r re N_SMALL=200,  and isort for o[0]
        for (int i=0; i<n; i++) ((unsigned long long *)x)[i] = twiddle(x,i,order);   // have to twiddle here anyways, can't speed up default case like in isort
        dinsert((unsigned long long *)x, o, n);
    } else {
        dradix((unsigned char *)x, (o[0] != -1) ? newo : o, n);
    }
}

SEXP forder(SEXP DT, SEXP by, SEXP retGrp, SEXP sortStrArg, SEXP orderArg, SEXP naArg)
// sortStr TRUE from setkey, FALSE from by=
{
    int i, j, k, grp, ngrp, tmp, *osub, thisgrpn, n, col;
    Rboolean isSorted = TRUE;
    SEXP x, class;
    void *xd;
#ifdef TIMING_ON
    memset(tblock, 0, NBLOCK*sizeof(clock_t));
    memset(nblock, 0, NBLOCK*sizeof(int));
    clock_t tstart;  // local variable to mask the global tstart for ease when timing sub funs
#endif
    TBEG()
    
    if (isNewList(DT)) {
        if (!length(DT)) error("DT is an empty list() of 0 columns");
        if (!isInteger(by) || !length(by)) error("DT has %d columns but 'by' is either not integer or length 0", length(DT));  // seq_along(x) at R level
        n = length(VECTOR_ELT(DT,0));
        for (i=0; i<LENGTH(by); i++) {
            if (INTEGER(by)[i] < 1 || INTEGER(by)[i] > length(DT)) 
                error("'by' value %d out of range [1,%d]", INTEGER(by)[i], length(DT));
            if ( n != length(VECTOR_ELT(DT, INTEGER(by)[i]-1)) )
                error("Column %d is length %d which differs from length of column 1 (%d)\n", INTEGER(by)[i], length(VECTOR_ELT(DT, INTEGER(by)[i]-1)), n);
        }
        x = VECTOR_ELT(DT,INTEGER(by)[0]-1);
    } else {
        if (!isNull(by)) error("Input is a single vector but 'by' is not NULL");
        n = length(DT);
        x = DT;
    }
    if (!isLogical(retGrp) || LENGTH(retGrp)!=1 || INTEGER(retGrp)[0]==NA_LOGICAL) error("retGrp must be TRUE or FALSE");
    if (!isLogical(sortStrArg) || LENGTH(sortStrArg)!=1 || INTEGER(sortStrArg)[0]==NA_LOGICAL ) error("sortStr must be TRUE or FALSE");
    if (!isLogical(naArg) || LENGTH(naArg) != 1) error("na.last must be logical TRUE, FALSE or NA of length 1");
    sortStr = LOGICAL(sortStrArg)[0];
    // static global set on top...
    nalast = (LOGICAL(naArg)[0] == NA_LOGICAL) ? 0 : (LOGICAL(naArg)[0] == TRUE) ? 1 : -1; // 1=TRUE, -1=FALSE, 0=NA
    gsmaxalloc = n;  // upper limit for stack size (all size 1 groups). We'll detect and avoid that limit, but if just one non-1 group (say 2), that can't be avoided.
    
    // TODO: check for 'orderArg'
    
    SEXP ans = PROTECT(allocVector(INTSXP, n)); // once for the result, needs to be length n.
    int *o = INTEGER(ans);                      // TO DO: save allocation if NULL is returned (isSorted==TRUE)
    o[0] = -1;                                  // so [i|c|d]sort know they can populate o directly with no working memory needed to reorder existing order
                                                // had to repace this from '0' to '-1' because 'nalast = 0' replace 'o[.]' with 0 values.
    xd = DATAPTR(x);
    stackgrps = length(by)>1 || LOGICAL(retGrp)[0];
    savetl_init();   // from now on use Error not error.

    order = INTEGER(orderArg)[0];
    switch(TYPEOF(x)) {
    case INTSXP : case LGLSXP :
        tmp = isorted(xd, n); break;
    case REALSXP :
        class = getAttrib(x, R_ClassSymbol);
        if (isString(class) && STRING_ELT(class, 0) == char_integer64) {
            twiddle = &i64twiddle;
            is_nan  = &i64nan; // see explanation under `is_nan` as to why we need this
        } else {
            twiddle = &dtwiddle;
            is_nan  = &dnan;
        }
        tmp = dsorted(xd, n); break;
    case STRSXP :
        tmp = csorted(xd, n); break;
    default :
        Error("First column being ordered is type '%s', not yet supported", type2char(TYPEOF(x)));
    }
    if (tmp) {                                  // -1 or 1. NEW: or -2 in case of nalast == 0 and all NAs
        if (tmp == 1) {                         // same as expected in 'order' (1 = increasing, -1 = decreasing)
            isSorted = TRUE;
            for (i=0; i<n; i++) o[i] = i+1;     // TO DO: we don't need this if returning NULL? Save it? Unlikely huge gain, though.
        } else if (tmp == -1) {                 // -1 (or -n for result of strcmp), strictly opposite to expected 'order'
            isSorted = FALSE;
            for (i=0; i<n; i++) o[i] = n-i;
        } else if (nalast == 0 && tmp == -2) {  // happens only when nalast=NA/0. Means all NAs, replace with 0's therefore!
            isSorted = FALSE;
            for (i=0; i<n; i++) o[i] = 0;
        }
    } else {
        isSorted = FALSE;
        switch(TYPEOF(x)) {
        case INTSXP : case LGLSXP :
            isort(xd, o, n); break;
        case REALSXP :
            dsort(xd, o, n); break;
        case STRSXP :
            if (sortStr) { csort_pre(xd, n); alloc_csort_otmp(n); csort(xd, o, n); }
            else cgroup(xd, o, n);
            break;
        default :
            Error("Internal error: previous default should have caught unsupported type");
        }
    }
    TEND(0)
    
    int maxgrpn = gsmax[flip];   // biggest group in the first column
    void *xsub = NULL;           // local to forder
    int (*f)(); void (*g)();
    
    if (length(by)>1 && gsngrp[flip]<n) {
        xsub = (void *)malloc(maxgrpn * sizeof(double));    // double is the largest type, 8
        if (xsub==NULL) Error("Couldn't allocate xsub in forder, requested %d * %d bytes.", maxgrpn, sizeof(double));
        newo = (int *)malloc(maxgrpn * sizeof(int));        // global variable, used by isort, dsort, sort and cgroup
        if (newo==NULL) Error("Couldn't allocate newo in forder, requested %d * %d bytes.", maxgrpn, sizeof(int));
    }
    TEND(1)  // should be negligible time to malloc even large blocks, but time it anyway to be sure
    
    for (col=2; col<=length(by); col++) {
        x = VECTOR_ELT(DT,INTEGER(by)[col-1]-1);
        xd = DATAPTR(x);
        ngrp = gsngrp[flip];
        if (ngrp == n && nalast != 0) break;
        flipflop();
        stackgrps = col!=LENGTH(by) || LOGICAL(retGrp)[0];
        order = INTEGER(orderArg)[col-1];
        switch(TYPEOF(x)) {
        case INTSXP : case LGLSXP :
            f = &isorted; g = &isort; break;
        case REALSXP :
            class = getAttrib(x, R_ClassSymbol);
            if (isString(class) && STRING_ELT(class, 0) == char_integer64) {
                twiddle = &i64twiddle;
                is_nan  = &i64nan;
            } else {
                twiddle = &dtwiddle;
                is_nan  = &dnan;
            }
            f = &dsorted; g = &dsort; break;
        case STRSXP :
            f = &csorted;
            if (sortStr) { csort_pre(xd, n); alloc_csort_otmp(gsmax[1-flip]); g = &csort; }
            else g = &cgroup; // no increasing/decreasing order required if sortStr = FALSE, just a dummy argument
            break;
        default:
           Error("Column %d of 'by' (%d) is type '%s', not yet supported", col, INTEGER(by)[col-1], type2char(TYPEOF(x)));
        }
        size_t size = SIZEOF(x);
        if (size!=4 && size!=8) Error("Column %d of 'by' is supported type '%s' but size (%d) isn't 4 or 8\n", col, type2char(TYPEOF(x)), size);
        // sizes of int and double are checked to be 4 and 8 respectively in init.c
        i = 0;
        for (grp=0; grp<ngrp; grp++) {
            thisgrpn = gs[1-flip][grp];
            if (thisgrpn == 1) {
                if (nalast==0) {                    // this edge case had to be taken care of here.. (see the bottom of this file for more explanation)
                    switch(TYPEOF(x)) {
                    case INTSXP :
                        if (INTEGER(x)[o[i]-1] == NA_INTEGER) { isSorted=FALSE; o[i] = 0; } break;
                    case LGLSXP :
                        if (LOGICAL(x)[o[i]-1] == NA_LOGICAL) { isSorted=FALSE; o[i] = 0; } break;
                    case REALSXP :
                        if (ISNAN(REAL(x)[o[i]-1])) { isSorted=FALSE; o[i] = 0; } break;
                    case STRSXP :
                        if (STRING_ELT(x, o[i]-1) == NA_STRING) { isSorted=FALSE; o[i] = 0; } break;
                    default :
                        Error("Internal error: previous default should have caught unsupported type");
                    }
                }
                i++; push(1); continue;
            }
            TBEG()
            osub = o+i;
            // ** TO DO **: if isSorted,  we can just point xsub into x directly. If (*f)() returns 0, though, will have to copy x at that point
            //        When doing this,  xsub could be allocated at that point for the first time.
            if (size==4)
                for (j=0; j<thisgrpn; j++) ((int *)xsub)[j] = ((int *)xd)[o[i++]-1];
            else
                for (j=0; j<thisgrpn; j++) ((double *)xsub)[j] = ((double *)xd)[o[i++]-1];
            TEND(2)
            
            // continue;  // BASELINE short circuit timing point. Up to here is the cost of creating xsub.
            tmp = (*f)(xsub, thisgrpn);     // [i|d|c]sorted(); very low cost, sequential
            TEND(3)
            
            if (tmp) {
                // *sorted will have already push()'d the groups
                if (tmp==-1) {
                    isSorted = FALSE;
                    for (k=0;k<thisgrpn/2;k++) {        // reverse the order in-place using no function call or working memory 
                        tmp = osub[k];                  // isorted only returns -1 for _strictly_ decreasing order, otherwise ties wouldn't be stable
                        osub[k] = osub[thisgrpn-1-k];
                        osub[thisgrpn-1-k] = tmp;
                    }
                    TEND(4)
                } else if (nalast == 0 && tmp==-2) {    // all NAs, replace osub[.] with 0s.
                    isSorted = FALSE;
                    for (k=0; k<thisgrpn; k++) osub[k] = 0;
                }
                continue;
            }
            isSorted = FALSE;
            newo[0] = -1;                               // nalast=NA will result in newo[0] = 0. So had to change to -1.
            (*g)(xsub, osub, thisgrpn);                 // may update osub directly, or if not will put the result in global newo
                                                        
            TEND(5)
            
            if (newo[0] != -1) {
                if (nalast != 0) for (j=0; j<thisgrpn; j++) ((int *)xsub)[j] = osub[ newo[j]-1 ];           // reuse xsub to reorder osub
                else for (j=0; j<thisgrpn; j++) ((int *)xsub)[j] = (newo[j] == 0) ? 0 : osub[ newo[j]-1 ];  // final nalast case to handle!
                memcpy(osub, xsub, thisgrpn*sizeof(int));            
            }
            TEND(6)
        }
    }
#ifdef TIMING_ON
    for (i=0; i<NBLOCK; i++) {
        Rprintf("Timing block %d = %8.3f   %8d\n", i, 1.0*tblock[i]/CLOCKS_PER_SEC, nblock[i]);
        if (i==12) Rprintf("\n");
    }
    Rprintf("Found %d groups and maxgrpn=%d\n", gsngrp[flip], gsmax[flip]);
#endif
    
    if (!sortStr && ustr_n!=0) Error("Internal error: at the end of forder sortStr==FALSE but ustr_n!=0 [%d]", ustr_n);
    for(int i=0; i<ustr_n; i++)
        SET_TRUELENGTH(ustr[i],0);
    maxlen = 1;  // reset global. Minimum needed to count "" and NA
    ustr_n = 0;
    savetl_end();
    free(ustr);                ustr=NULL;          ustr_alloc=0;
    
    if (isSorted) {
        UNPROTECT(1);  // The existing o vector, which we may save in future, if in future we only create when isSorted becomes FALSE
        ans = PROTECT(allocVector(INTSXP, 0));  // Can't attach attributes to NULL
    }
    if (LOGICAL(retGrp)[0]) {
        ngrp = gsngrp[flip];
        setAttrib(ans, install("starts"), x = allocVector(INTSXP, ngrp));
        //if (isSorted || LOGICAL(sort)[0])
            for (INTEGER(x)[0]=1, i=1; i<ngrp; i++) INTEGER(x)[i] = INTEGER(x)[i-1] + gs[flip][i-1];
        //else {
            // it's not sorted already and we want to keep original group order
        //    cumsum = 0;
        //    for (i=0; i<ngrp; i++) { INTEGER(x)[i] = o[i+cumsum]; cumsum+=gs[flip][i]; }
        //    isort(INTEGER(x), ngrp);
        //}
        setAttrib(ans, install("maxgrpn"), ScalarInteger(gsmax[flip]));
    }
    
    gsfree();
    free(radix_xsub);          radix_xsub=NULL;    radix_xsuballoc=0;
    free(xsub); free(newo);    xsub=newo=NULL;
    free(xtmp);                xtmp=NULL;          xtmp_alloc=0;
    free(otmp);                otmp=NULL;          otmp_alloc=0;
    free(csort_otmp);          csort_otmp=NULL;    csort_otmp_alloc=0;

    free(cradix_counts);       cradix_counts=NULL; cradix_counts_alloc=0;
    free(cradix_xtmp);         cradix_xtmp=NULL;   cradix_xtmp_alloc=0;   // TO DO: use xtmp already got
    
    UNPROTECT(1);
    return( ans );
}

// TODO: implement 'order' argument to 'fsorted'
// Not touching 'fsorted' for now for "decreasing order". Passing '1' as the value of 'order' argument (checks only ascending order as before).
SEXP fsorted(SEXP x)
{
    // Just checks if ordered and returns FALSE early if not (and don't return ordering if so, unlike forder).
    int tmp,n;
    void *xd;
    n = length(x);
    if (n <= 1) return(ScalarLogical(TRUE));
    if (!isVectorAtomic(x)) Error("is.sorted (R level) and fsorted (C level) only to be used on vectors. If needed on a list/data.table, you'll need the order anyway if not sorted, so use if (length(o<-forder(...))) for efficiency in one step, or equivalent at C level");
    xd = DATAPTR(x);
    stackgrps = FALSE;
    order = 1;
    switch(TYPEOF(x)) {
    case INTSXP : case LGLSXP :
        tmp = isorted(xd, n); break;
    case REALSXP :
        tmp = dsorted(xd, n); break;
    case STRSXP :
        tmp = csorted(xd, n); break;
    default :
        Error("type '%s' is not yet supported", type2char(TYPEOF(x)));
    }
    return(ScalarLogical( tmp==1 ? TRUE : FALSE ));
}

SEXP isOrderedSubset(SEXP x, SEXP nrow)
// specialized for use in [.data.table only
// Ignores 0s but heeds NAs and any out-of-range (which result in NA)
{
    int i=0, last, this;
    if (!length(x)) return(ScalarLogical(TRUE));
    if (!isInteger(x)) error("x has non-0 length but isn't an integer vector");
    if (!isInteger(nrow) || LENGTH(nrow)!=1 || INTEGER(nrow)[0]<0) error("nrow must be integer vector length 1 and >=0");
    if (LENGTH(x)<=1) return(ScalarLogical(TRUE));
    while (i<LENGTH(x) && INTEGER(x)[i]==0) i++;
    if (i==LENGTH(x)) return(ScalarLogical(TRUE));
    last = INTEGER(x)[i];  // the first non-0
    i++;
    for (; i<LENGTH(x); i++) {
        this = INTEGER(x)[i];
        if (this == 0) continue;
        if (this < last || this < 0 || this > INTEGER(nrow)[0])
            return(ScalarLogical(FALSE));
        last = this;
    }
    return(ScalarLogical(TRUE));
}


/*
    // LSD approach ...
    
    Rboolean firstTime = TRUE;
    for (radix=0; radix<4; radix++) {
        if (skip[radix]) continue;
        tmp = ord; ord = ans; ans = tmp;     // flip-flop
        shift = radix * 8;
        thiscounts = counts[radix];
        for (i=1;i<256;i++) thiscounts[i] += thiscounts[i-1];
        tt = clock();
        if (firstTime) {
            for(i = n-1; i >= 0; i--) {
                thisx = x[i] >> shift & 0xFF;
                ans[--thiscounts[thisx]] = i+1;
            }
        } else {
            for(i = n-1; i >= 0; i--) {
                thisx = x[ord[i]-1] >> shift & 0xFF;
                ans[--thiscounts[thisx]] = ord[i];
            }
        }
        memset(thiscounts, 0, 256*sizeof(unsigned int));
        t[6+radix] += clock()-tt;
        firstTime = FALSE;
    }
    if (firstTime) Error("Internal error: x is one repeated number so isorted should have skipped iradix, but iradix has been called and skipped all 4 radix");
}
*/

SEXP isReallyReal(SEXP x) {
    int n, i=0;
    SEXP ans;
    if (!isReal(x))
        error("x must be of type double.");
    n = length(x);
    ans = PROTECT(allocVector(LGLSXP, 1));
    while (i<n &&
          ( ISNA(REAL(x)[i]) ||
          ( R_FINITE(REAL(x)[i]) && REAL(x)[i] == (int)(REAL(x)[i])))) {
        i++;
    }
    LOGICAL(ans)[0] = (i<n);
    UNPROTECT(1);
    return(ans);
}

void pbin(unsigned long long n)
// trace utility for dev to be used in gdb, since couldn't get stdlib::atoi() to link
{
    int sofar = 0;
    for(int shift=sizeof(long long)*8-1; shift>=0; shift--)
    {
       if (n >> shift & 1)
         Rprintf("1");
       else
         Rprintf("0");
       if (++sofar == 1 || sofar == 12) Rprintf(" ");
    }
    Rprintf("\n");
}

SEXP binary(SEXP x)
// base::intToBits is close, but why does it print the result as "00 00 00 00" (raw) rather than ("0000") bits, seems odd.
{
    char buffer[69];
    int j;
    if (!isReal(x)) error("x must be type 'double'");
    SEXP ans = PROTECT(allocVector(STRSXP, LENGTH(x)));
    for (int i=0; i<LENGTH(x); i++) {
        u.d = REAL(x)[i];
        j = 0;
        for(int bit=64; bit>=1; bit--)
        {
           buffer[j++] = '0' + (u.ull >> (bit-1) & 1);         
           if (bit==64 || bit==53 || bit==17 || bit==9) buffer[j++]=' ';
           //       ^sign      ^exponent  ^last 2 byte rounding
        }
        SET_STRING_ELT(ans, i, mkCharLen(buffer,68));  // 64 bits + 4 spaces
    }
    UNPROTECT(1);
    return ans;
}


/*
Note on challenges implementing `nalast=NA` so that things are under the same perspective when looked at later. To go to the point, there are four challenges in implementing `nalast=NA` with the current form of forder.c (which is excellent!):

    1) Handling the i|d|csorted routines for nalast=NA:
       The current implementation of nalast=NA in *sorted routine is that if all values are NAs, then it returns -2, a different value, so that all values of o/osub can be replacedw with 0. If any values are NA, then we assume that this vector is unsorted and return 0 to be taken care of by the corresponding sort routine. Now, this may very well be questioned/improved. For ex: we could detect if the vector is sorted and also if the first value is NA, and if so, get the group size of these NAs and replace that many indices with 0. This is not yet done, but could very well be implemented later. If there are no NAs in the vector, then things are the same.

    2) Handling cases where n==1 and n==2:
       The current sorting routines don't take care of n<=2, and rightly so, as they can be dealt with directly. But it's not the case for nalast=NA because, when n==2 and all are NAs, point 1 will handle it. But when n==1 and it is NA and the column is not the first column to order upon, with the current implementation, "thisgrpn" is checked for "== 1" and if so, just pushes that group size and continues.. But we've to replace it with 0 in our case. So, nalast==0 is checked for inside that if statement and if so, if the value is NA is checked for all types and if so, replaced with 0. For n==2 and "any" is NA, we've to introduce a special case for this in each sort routine to take care of and not result in an error (which tells this should be already taken care of in *sorted).

    3) o[0] and newo[0] = -1:
       Since nalast already populates NAs with 0 values in o, we can't use the old value of 0 to check if it's already populated or not. Therefore this has been changed to -1. 

    4) default cases are untouched as much as possible:
       In order to not compromise in speed for default case (`setkey`), iinsert and dinsert are not touched at all, which means, we've to make sure that they are not run when n<200 and nalast==0, as they don't replace o[x=NA] with 0. And, 'icount', 'iradix', 'dradix' are modified to take care of nalast=NA, to my knowledge.

Ends here.
------------------------------------------
This is just for my reference regarding the cases to take care of when checking for nalast=NA.
n=1     is NA   done.
n=1    not NA   done.
n=2  both NAs   done.
n=2    any NA   done.
n=2     no NA   done.
n>2   all NAs   done.
n>2    any NA   done.
n>2     no NA   done.

*/

/*
Added 2nd August 2014 - Testing different fixes for #743
When u.d = 0 or -0, we've to set it 0, so that the sign bit in -0 doesn't bite us later.

# Vector to benchmark on:
set.seed(45L)
x = 1*sample(-1e5:1e5, 1e8, TRUE)
require(data.table)

# code to run the benchmark:
system.time(data.table:::forderv(x))
system.time(data.table:::forderv(x))
system.time(data.table:::forderv(x))

# A. Timing on code before fix:
   user  system elapsed 
 11.992   1.063  13.113 
 11.968   1.095  13.154 
 11.943   1.106  13.141 

# B. Timing on fix using: u.ull <<= (!u.d); after the first line where we get 'u.d = order*...'
system.time(data.table:::forderv(x))
 12.640   1.095  13.814 
 12.629   1.129  13.836 
 12.634   1.135  13.836 

# C. Timing on current solution:
system.time(data.table:::forderv(x))
 12.208   1.134  13.426 
 12.233   1.148  13.445 
 12.223   1.154  13.548 

B is ~.7 sec slower whereas C is ~.3-.4 seconds slower. This should be okay on 100 million rows, I think!
*/