1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122

ToDo's (see also TODO.sp):
Table of contents:
1) efficiency/memory
2) interface
3) documentation
4) installation
5) bugs
6) others
1) efficiency/memory
 use a random sigma value of 64 bits by default
 try the mpn/generic/{sb,dc,mu}_bdiv_qr.c functions in GMP >= 4.3.0 for REDC
 the conversion from NTT primes to mpz_t in function mpzspv_to_mpzv()
(file mpzspv.c) is quadratic. A faster conversion is possible with a
product tree (already done for the mpz_t > NTT conversion).
 even worse, mpzspm_init seems to be cubic in the input size (because the CRT
algorithm used is quadratic in sp_num). We should use a subquadratic CRT.
 the "Reducing G * H" step is faster in NTT than with KS. This is probably
due to the fact that some transforms are cached in the NTT mode.
 the "Reducing G * H" step can be improved as follows: first compute
D = GH*I mod (x^d+1) where d = deg(F), and I = 1/F mod (x^d+1); then
compute E = D*F mod (x^d1); finally compute T = (GHE)/2 mod (x^d+1).
T equals the Montgomery product GH/(x^d+1) mod F.
See the paper "Fast convolution meets Montgomery" by Preda Mihailescu
(Mathematics of Computation).
 slowdown in stage 1 with REDC between a 58672digit number and a
58688digit number [reported by Christophe.CLAVIER@gemalto.com, 29 Aug 2007]
(((2003663613*2^1950002)/(2*23*173*3863))/1954173900202379)/3612632846010637
((2003663613*2^1950002)/(2*23*173*3863))/1954173900202379
with B1=1000 on an Opteron (44.2s for c58672, 67.5s for c58688).
The culprit seems to be the REDC routine in mpmod.c: indeed, in case the
modulus has n limbs, but the most significant one has only a few bits,
the product (called x in REDC) has only 2n1 limbs, and we never call
Mulders's short product in ecm_redc_n (however the elsecode using full
products seem faster in that case).
For c58672, if one replaces if (xn == 2 * n) in mpmod.c/REDC by
if (xn >= 2 * n  1), the time of stage 1 grows from 44s to 64s, whereas
ecm_redc_n should be faster...
This problem is still present in 6.2, ecm_redc_n should be better tuned,
in particular the choice k=0.75*n in ecm_mul_lo_n() is far from optimal.
 in BrentSuyama's extension, the evaluation of a polynomial of degree k
over N consecutive values is currently done using a O(k N) algorithm
[table of differences]. One can do O(N/k M(k)), cf Section 3 from
"Linear recurrences with polynomial coefficients and application to
integer factorization and CartierManin operator", by Alin Bostan,
Pierrick Gaudry and Eric Schost, SIAM journal on computing,
vol. 36, no. 6, pp. 1777  1806, 2007.
It is not clear if this result also applies to ECM, but at least it
should word for P1 and P+1.
 why restrict the use of mpn_mul_fft to Fermat numbers? We could use it
for any cofactor of 2^(n*BITS_PER_MP_LIMB)+1, as long as
mpn_fft_next_size (n, mpn_fft_best_k (n, S1 == S2)) == n.
 use mpres in step 2 (Target: 7.0)
 write a mpn version of add3 and duplicate
 rewrite entire mpmod.c to be based on mpn_* functions, not mpz_*
 take relative speed of multiplying/squaring into account in PRAC
(DN: couldn't get any significant speed increase)
 use/implement a mpn_mul_hi_n routine for use in mpn_REDC
 use mpn_addmul_2, mpn_addmul_4 in the basecase REDC [for machines
where it exists]. ASM code should perhaps be moved into GMP.
 try McLaughlin's algorithm for Montgomery's modular multiplication
(http://www.ams.org/mcom/000000000/S0025571803015436/home.html)
 consider Colin Percival's generalized DWT for multiplication modulo
k*a^n+b, where k*a*b is highly composite. May belong to GMP rather than
GMPECM.
 implement assembly code (redc.asm) for other architectures
 allow composite d2, or better use the S1+S2 idea from the P+1 algorithm
of Montgomery and Kruppa.
 init mpz_t's with correct amount of memory allocated to avoid reallocs.
Check for reallocs with GMP's memory interface routines. (Partly done.)
 try sliding window multiplication for ECM stage 1 (Target: 7.0)
 choose Brent/Suyama polynomial according to B2/k and not B2!
 Adjust estimated memory to take into account treefile and NTT
(done but improvement possible)
 when GWNUM is used, lower the default B2 (James Wanless, 17 Mar 2006,
james at grok.ltd.uk)
 implement enhanced standard continuation? With graph cover algorithm?
 parallel/distributed stage 2?
 add curve selection for torsion group of order 8 or 16, see Montgomery's
thesis (request of PeterLawrence Montgomery)
 Torbj"orn Granlund suggested faster code for mpn_mod_1(), used extensively
in NTT. See
http://lists.gforge.inria.fr/pipermail/ecmdiscuss/2008May/003365.html
2) interface
 from Mark Rodenkirch <rogue@wi.rr.com> 08 April 2011: print messages like
"Step 1: 1500000/100000000" with a commandline option (or with v)
http://lists.gforge.inria.fr/pipermail/ecmdiscuss/2011April/004088.html
 with resume, print %time for THIS RUN instead of total run?
[suggested by SleepHound <sleephound@yahoo.com>]
Add CPUTIME=... in the save file, to take into account the total cpu time
spend so far (in seconds). George Woltman agrees for that change. It won't
hurt prime95/mprime > will be added for his next version.
 when resuming, print the *initial* x0 for P1/P+1?
 [from Jakub Pawlewicz <pan@mimuw.edu.pl>] add an option stage1time t
to tell the step 1 time, when done by another program. PZ: or better
have it in resume file? (Target: 6.1. Command line option done)
3) documentation
4) installation
 check for __builtin_constant_p and __builtin_expect at configure time
 [suggested by Peter Montgomery] add the possibility to compile a "fat"
binary, which automatically selects the best mulredc assembly code
depending on the cpuid [see TODO.fat]
 [suggested by Thomas Kunz, who did port GMPECM to the PS3, i.e., to the
Cell architecture]: several changes to make it easier to port GMPECM to
specific architectures. Cf TODO.kunz.
5) bugs
6) others
 add primality proving of factors/cofactors? Maybe link Pari for this?
 add point counting algorithm? SEA implementation exists for Pari/GP,
use that?
 let user specify previous factoring work, compute distribution of
candidate factors, compute probability of/est. time to finding a factor
with given parameters.
 rewrite in C++? Lots of work, but would make parts of the code much
cleaner.
