The approach is to start with an extremely minimal script, generate the C code, then modify the C code bit by bit to make it GPU-accelerated.
I'll start with "nlse_minimal.xmds", which solves the nonlinear Schrödinger equation in 1D using fixed-step RK4 (so no adaptive stepping) with an EX operator, no dependent vectors in the integration, no autovectorization, etc.
The main integration loop does the RK4 step, the basis transforms, and the sampling. So I first need to add the boilerplate CUDA headers and library code, then replace pieces of that main loop with GPU code one piece at a time.
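As a rough picture of what a single piece of that loop looks like on the CPU, here is a minimal sketch of one fixed-step RK4 update. It is not the code XMDS generates: the names rk4_step and rhs are made up for illustration, and the right-hand side keeps only a local nonlinear term, d(psi)/dt = -i |psi|^2 psi, with the dispersion part (the job of the EX operator) left out and the sign convention assumed.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Hypothetical right-hand side: local nonlinear term only,
// d(psi)/dt = -i |psi|^2 psi. Dispersion is deliberately omitted.
static cplx rhs(cplx psi) {
    // std::norm returns the squared magnitude |psi|^2.
    return cplx(0.0, -1.0) * std::norm(psi) * psi;
}

// One classical fixed-step RK4 update, applied independently at each grid point.
void rk4_step(std::vector<cplx>& psi, double dt) {
    assert(dt > 0.0);  // fixed, positive step size
    for (std::size_t j = 0; j < psi.size(); ++j) {
        const cplx y  = psi[j];
        const cplx k1 = rhs(y);
        const cplx k2 = rhs(y + 0.5 * dt * k1);
        const cplx k3 = rhs(y + 0.5 * dt * k2);
        const cplx k4 = rhs(y + dt * k3);
        psi[j] = y + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4);
    }
}
```

Because each grid point is updated independently here, a loop of this shape is the kind of piece that could plausibly be moved to the GPU first, with one thread per grid point. The basis transforms couple grid points and would need separate treatment.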
Currently I intend to use the CUDA API, but it will be worth checking out more platform-independent options such as ROCm and OpenCL so we can target AMD and Intel hardware as well.
Once the C code works using some GPU offloading, and gives the same results, we can change the Cheetah templating to generate that C code.
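In practice "the same results" probably has to mean "the same to within a tolerance", since a GPU may fuse or reorder floating-point operations. A minimal sketch of such a check follows. The function name results_match and the default tolerances are assumptions for illustration, not part of XMDS, and it assumes the sampled output has already been read into plain arrays of doubles.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: true if two sampled outputs agree element by element
// within a combined absolute/relative tolerance.
bool results_match(const std::vector<double>& cpu,
                   const std::vector<double>& gpu,
                   double rel_tol = 1e-12,
                   double abs_tol = 1e-14) {
    assert(rel_tol >= 0.0 && abs_tol >= 0.0);
    if (cpu.size() != gpu.size()) return false;
    for (std::size_t i = 0; i < cpu.size(); ++i) {
        const double diff  = std::fabs(cpu[i] - gpu[i]);
        const double scale = std::max(std::fabs(cpu[i]), std::fabs(gpu[i]));
        // Written as a negated <= so that a NaN in either input fails the check.
        if (!(diff <= abs_tol + rel_tol * scale)) return false;
    }
    return true;
}
```

The tolerances would need tuning against the actual simulation: errors accumulate over many RK4 steps, so the values above are only a starting guess.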
Finally, we can introduce progressively more complex script options and try to get GPU offload working in more and more situations.
Any modifications I make to the standard C code generated by XMDS will be preceded and followed, respectively, by the comment lines
// Begin GPU development
// End GPU development
This is a high-level overview to track progress. Details can be found in the notes.txt file. At regular points I will back up the current C file as a checkpoint. These will be called nlse_minimal_01.cc, nlse_minimal_02.cc, etc.
Note that it is useful to run "export PYTHONHASHSEED=0" prior to development to ensure the generated C code is the same each time. Now that we've moved to Python 3, hash randomization is on by default, so the iteration order of hash-based containers can differ from run to run, and running xmds2 on a script twice can produce different C code each time.
---------------
nlse_minimal.xmds was created; associated C file is nlse_minimal.cc.