This file is processed with Doxygen. See docs/html/index.html for the formatted version.

/**

********************************************************************************
@mainpage MAGMA Users' Guide

Univ. of Tennessee, Knoxville \n
Univ. of California, Berkeley \n
Univ. of Colorado, Denver
@date October 2020

The goal of the MAGMA project is to create a new generation of linear algebra
libraries that achieves the fastest possible time to an accurate solution on
heterogeneous architectures, starting with current multicore + multi-GPU
systems. To address the complex challenges stemming from these systems'
heterogeneity, massive parallelism, and the gap between compute speed and CPU-GPU
communication speed, MAGMA's research is based on the idea that optimal
software solutions will themselves have to hybridize, combining the strengths of
different algorithms within a single framework. Building on this idea, the goal
is to design linear algebra algorithms and frameworks for hybrid multicore and
multi-GPU systems that can enable applications to fully exploit the power that
each of the hybrid components offers.

Designed to be similar to LAPACK in functionality, data storage, and interface,
the MAGMA library allows scientists to easily port their existing software
components from LAPACK to MAGMA, to take advantage of the new hybrid
architectures.
MAGMA users do not have to know CUDA in order to use the library.

There are two types of LAPACK-style interfaces. The first one, referred
to as the *CPU interface*, takes the input and produces the result in the CPU's
memory. The second, referred to as the *GPU interface*, takes the input and
produces the result in the GPU's memory. In both cases, a hybrid CPU/GPU
algorithm is used. Also included is MAGMA BLAS, a set of routines complementary to CUBLAS.


********************************************************************************
@page authors Collaborators

- A. Abdelfattah (UT Knoxville)
- H. Anzt        (UT Knoxville)
- M. Baboulin    (U Paris-Sud)
- C. Cao         (UT Knoxville)
- J. Demmel      (UC Berkeley)
- T. Dong        (UT Knoxville)
- J. Dongarra    (UT Knoxville)
- P. Du          (UT Knoxville)
- M. Faverge     (INRIA)
- M. Gates       (UT Knoxville)
- A. Haidar      (UT Knoxville)
- M. Horton      (UT Knoxville)
- J. Kurzak      (UT Knoxville)
- J. Langou      (UC Denver)
- H. Ltaief      (KAUST)
- P. Luszczek    (UT Knoxville)
- T. Mary        (IRIT, Team APO)
- R. Nath        (UT Knoxville)
- R. Solca       (ETH Zurich)
- S. Tomov       (UT Knoxville)
- V. Volkov      (UC Berkeley)
- I. Yamazaki    (UT Knoxville)


********************************************************************************
@page installing    Installing MAGMA

First, create a `make.inc` file, using one of the examples as a template.
Set environment variables for where external packages are installed,
either in your `.cshrc/.bashrc` file, or in the `make.inc` file itself.

#### CUDA

All the `make.inc` files assume `$CUDADIR` is set in your environment.
For bash (sh), put in `~/.bashrc` (with your system's path):

    export CUDADIR=/usr/local/cuda

For csh/tcsh, put in `~/.cshrc`:

    setenv CUDADIR /usr/local/cuda

MAGMA is tested with CUDA >= 7.5. Some functionality requires a newer version.

#### Intel MKL

The MKL `make.inc` files assume `$MKLROOT` is set in your environment. To set it,
for bash (sh), put in ~/.bashrc (with your system's path):

    source /opt/intel/bin/compilervars.sh intel64

For csh/tcsh, put in ~/.cshrc:

    source /opt/intel/bin/compilervars.csh intel64

MAGMA is tested with MKL 11.3.3 (2016), both LP64 and ILP64;
other versions may work.

#### AMD ACML

The ACML `make.inc` file assumes `$ACMLDIR` is set in your environment.
For bash (sh), put in ~/.bashrc (with your system's path):

    export ACMLDIR=/opt/acml-5.3.1

For csh/tcsh, put in ~/.cshrc:

    setenv ACMLDIR  /opt/acml-5.3.1

MAGMA is tested with ACML 5.3.1; other versions may work.
See comments in `make.inc.acml` regarding ACML 4;
a couple of testers fail to compile with ACML 4.

#### ATLAS

The ATLAS `make.inc` file assumes `$ATLASDIR` and `$LAPACKDIR` are set in your environment.
If LAPACK is not installed, install it from http://www.netlib.org/lapack/.
For bash (sh), put in ~/.bashrc (with your system's path):

    export ATLASDIR=/opt/atlas
    export LAPACKDIR=/opt/LAPACK

For csh/tcsh, put in ~/.cshrc:

    setenv ATLASDIR  /opt/atlas
    setenv LAPACKDIR /opt/LAPACK

#### OpenBLAS

The OpenBLAS `make.inc` file assumes `$OPENBLASDIR` is set in your environment.
For bash (sh), put in ~/.bashrc (with your system's path):

    export OPENBLASDIR=/opt/openblas

For csh/tcsh, put in ~/.cshrc:

    setenv OPENBLASDIR /opt/openblas

Some bugs exist with OpenBLAS 0.2.19; see BUGS.txt.

#### MacOS Accelerate (previously Veclib)

Unfortunately, the MacOS Accelerate framework uses an old ABI for BLAS and
LAPACK, where single precision functions -- such as `sdot`, `cdot`, `slange`,
and `clange` -- return a double precision result. This makes them incompatible
with our C/C++ headers and with the Fortran code used in our testers. The fix is
to substitute reference implementations of these functions, found in
`magma/blas_fix`. Setting `blas_fix = 1` in `make.inc` will compile these into
`magma/lib/libblas_fix.a`, with which your application should link.


Linking to BLAS
--------------------------------------------------------------------------------
Depending on the Fortran compiler used for your BLAS and LAPACK libraries,
the linking convention is one of:

- Add underscore, so `gemm()` in Fortran becomes `gemm_()` in C.
- Uppercase,      so `gemm()` in Fortran becomes `GEMM() ` in C.
- No change,      so `gemm()` in Fortran stays   `gemm() ` in C.

Set `-DADD_`, `-DUPCASE`, or `-DNOCHANGE`, respectively, in all FLAGS in your
`make.inc` file to select the appropriate one. Use `nm` to examine your BLAS
library:

    acml-5.3.1/gfortran64_mp/lib> nm libacml_mp.a | grep -i 'T.*dgemm'
    0000000000000000 T dgemm
    00000000000004e0 T dgemm_

In this case, it shows that either `-DADD_ (dgemm_)` or `-DNOCHANGE (dgemm)`
should work. The default in all make.inc files is `-DADD_`.


Compile-time options
--------------------------------------------------------------------------------
Several compiler defines, below, affect how MAGMA is compiled and
might have a large performance impact. These are set in `make.inc` files
using the `-D` compiler flag, e.g., `-DMAGMA_WITH_MKL` in CFLAGS.

- `MAGMA_WITH_MKL`

    If linked with MKL, allows MAGMA to get MKL's version and
    set MKL's number of threads.

- `MAGMA_WITH_ACML`

    If linked with ACML 5 or later, allows MAGMA to get ACML's version.
    ACML's number of threads are set via OpenMP.

- `MAGMA_NO_V1`

    Disables MAGMA v1.x compatibility. Skips compiling non-queue versions
    of MAGMA BLAS routines, and simplifies magma_init().

- `MAGMA_NOAFFINITY`

    Disables thread affinity, available in glibc 2.6 and later.

- `BATCH_DISABLE_CHECKING`

    For batched routines, disables the info_array that contains errors.
    For example, for Cholesky factorization if you are sure your matrix is
    SPD and want better performance, you can compile with this flag.

- `BATCH_DISABLE_CLEANUP`

    For batched routines, disables the cleanup code.
    For example, {sy|he}rk called with "lower" may then also write data into
    the upper triangular portion of the matrix.

- `BATCHED_DISABLE_PARCPU`

    In the testing directory, disables the parallel implementation of the
    batched computation on CPU. Can be used to compare a naive versus a
    parallelized CPU batched computation.


Run-time options
--------------------------------------------------------------------------------
These variables control MAGMA, BLAS, and LAPACK run-time behavior.

- `$MAGMA_NUM_GPUS`

For multi-GPU functions, set `$MAGMA_NUM_GPUS` to the number of GPUs to use.

- `$OMP_NUM_THREADS`
- `$MKL_NUM_THREADS`
- `$VECLIB_MAXIMUM_THREADS`

    For multi-core BLAS libraries, set `$OMP_NUM_THREADS` or `$MKL_NUM_THREADS`
    or `$VECLIB_MAXIMUM_THREADS` to the number of CPU threads, depending on your
    BLAS library. See the documentation for your BLAS and LAPACK libraries.


Building without Fortran
--------------------------------------------------------------------------------
If you do not have a Fortran compiler, comment out `FORT` in `make.inc`.
MAGMA's Fortran 90 interface and Fortran testers will not be built. Also, many
testers will not be able to check their results -- they will print an error
message, e.g.:

    magma/testing> ./testing_dgehrd -N 100 -c
    ...
    Cannot check results: dhst01_ unavailable, since there was no Fortran compiler.
      100     ---   (  ---  )      0.70 (   0.00)   0.00e+00        0.00e+00   ok


Building shared libraries
--------------------------------------------------------------------------------
By default, all `make.inc` files (except ATLAS) add the `-fPIC` option to CFLAGS,
FFLAGS, F90FLAGS, and NVCCFLAGS, required for building a shared library. Note in
NVCCFLAGS that `-fPIC` is passed via the `-Xcompiler` option. Running:

    make

or

    make lib
    make test
    make sparse-lib
    make sparse-test

will create shared libraries:

    lib/libmagma.so
    lib/libmagma_sparse.so

and static libraries:

    lib/libmagma.a
    lib/libmagma_sparse.a

and testing drivers in `testing` and `sparse-iter/testing`.

The current exception is ATLAS (`make.inc.atlas`), which in our installation is a
static library, thus requiring MAGMA to also be built as a static library.


Building static libraries
--------------------------------------------------------------------------------
Static libraries are always built along with the shared libraries above.
Alternatively, comment out `FPIC` in your `make.inc` file to compile only a
static library. Then, running:

    make

will create static libraries:

    lib/libmagma.a
    lib/libmagma_sparse.a

and testing drivers in `testing` and `sparse-iter/testing`.


Installation
--------------------------------------------------------------------------------
To install libraries and include files in a given prefix, run:

    make install prefix=/usr/local/magma

The default prefix is `/usr/local/magma`. You can also set `prefix` in `make.inc`.
This installs
MAGMA libraries in `${prefix}/lib`,
MAGMA header files in `${prefix}/include`, and
`${prefix}/lib/pkgconfig/magma.pc` for `pkg-config`.


Tuning
--------------------------------------------------------------------------------
You can modify the blocking factors for the algorithms of
interest in `control/get_nb.cpp`.

Performance results are included in `results/vA.B.C/cudaX.Y-zzz/*.txt`
for MAGMA version A.B.C, CUDA version X.Y, and GPU zzz.


********************************************************************************
@page testing       Running tests

The testing directory includes tests of most MAGMA functions. These are
useful as examples, though they contain additional testing features that
your application may not need or would implement differently. The \ref example "example"
directory has a simple example without all this additional framework.

The \ref run_tests.py script runs a standard set of test sizes and
options. If output is to a file, a summary is printed to stderr.
Currently, nearly all the tests pass with `--tol 100`. With the default
tolerance, usually 30, some tests will report spurious failures, where the result
is actually okay but just slightly above the accuracy tolerance check.

The \ref run_summarize.py script post-processes tester output to find
errors, failed tests, suspicious tests that were just a little above the
accuracy threshold, and known failures (see BUGS.txt).


********************************************************************************
@page example       Example

The `example` directory shows a simple, standalone example. This shows how to
use MAGMA, apart from the MAGMA Makefiles and other special framework that
we've developed for the tests.
You must edit `example/Makefile` to reflect your `make.inc`,
or use `pkg-config`, as described in `example/README.txt`.


********************************************************************************
@page routines      Overview

The interface for MAGMA is similar to LAPACK, to facilitate porting existing
codes. Many routines have the same base names and the same arguments as LAPACK.
In some cases, MAGMA needs larger workspaces or some additional arguments in
order to implement an efficient algorithm.

There are several classes of routines in MAGMA:

1. \ref driver -- Solve an entire problem.

2. \ref comp   -- Solve one piece of a problem.

3. \ref blas   -- Basic Linear Algebra Subroutines.
   These form the basis for linear algebra algorithms.

4. \ref aux    --
   Additional BLAS-like routines, many originally defined in LAPACK.

5. \ref util   --
   Additional routines, many specific to GPU programming.

A brief summary of routines is given here.
Full descriptions of individual routines are given in the Modules section.

To start, in C or C++, include the magma_v2.h header.

    #include <magma_v2.h>

Driver & computational routines have a `magma_` prefix. These are
generally hybrid CPU/GPU algorithms. A suffix indicates in what memory the
matrix starts and ends, not where the computation is done.

Suffix       |  Example            |  Description
-----------  |  -----------        |  -----------
none         |  magma_dgetrf       |  hybrid CPU/GPU          routine where the matrix is initially in CPU host memory.
_m           |  magma_dgetrf_m     |  hybrid CPU/multiple-GPU routine where the matrix is initially in CPU host memory.
_gpu         |  magma_dgetrf_gpu   |  hybrid CPU/GPU          routine where the matrix is initially in GPU device memory.
_mgpu        |  magma_dgetrf_mgpu  |  hybrid CPU/multiple-GPU routine where the matrix is distributed across multiple GPUs' device memories.
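
For illustration, here is a minimal sketch (not taken from the MAGMA distribution;
check magma_v2.h for the exact prototypes) that calls the CPU-interface LU
factorization, where the matrix starts and ends in CPU host memory:

    #include <stdlib.h>
    #include <magma_v2.h>

    int main()
    {
        magma_init();

        magma_int_t m = 1000, n = 1000, lda = m, info = 0;
        double      *A    = (double*)      malloc( lda*n * sizeof(double) );
        magma_int_t *ipiv = (magma_int_t*) malloc( m     * sizeof(magma_int_t) );

        // ... fill A (column-major, lda >= m) with data ...

        // hybrid CPU/GPU LU factorization; no suffix, so A is in CPU host memory
        magma_dgetrf( m, n, A, lda, ipiv, &info );
        if (info != 0) {
            // handle error; see the Errors section
        }

        free( ipiv );
        free( A );
        magma_finalize();
        return 0;
    }

The GPU-interface version, magma_dgetrf_gpu, takes the same arguments except that
the matrix pointer refers to GPU device memory (with leading dimension ldda).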

In general, MAGMA follows LAPACK's naming conventions.
The base name of each routine has a
one letter precision (occasionally two letters),
two letter matrix type,
and usually a 2-3 letter routine name. For example,
DGETRF is D (double-precision), GE (general matrix), TRF (triangular factorization).

Precision    |  Description
-----------  |  -----------
s            |  single real precision (float)
d            |  double real precision (double)
c            |  single-complex precision (magmaFloatComplex)
z            |  double-complex precision (magmaDoubleComplex)
sc           |  single-complex input with single precision result (e.g., scnrm2)
dz           |  double-complex input with double precision result (e.g., dznrm2)
ds           |  mixed-precision algorithm (double and single, e.g., dsgesv)
zc           |  mixed-precision algorithm (double-complex and single-complex, e.g., zcgesv)

Matrix type  |  Description
-----------  |  -----------
ge           |  general matrix
sy           |  symmetric matrix, can be real or complex
he           |  Hermitian (complex) matrix
po           |  positive definite, symmetric (real) or Hermitian (complex) matrix
tr           |  triangular matrix
or           |  orthogonal (real) matrix
un           |  unitary (complex) matrix

Driver routines  {#driver}
=================================
Driver routines solve an entire problem.

Name                            |  Description
-----------                     |  -----------
\ref magma_gesv  "gesv"         |  solve linear system, AX = B, A is general (non-symmetric)
\ref magma_posv  "posv"         |  solve linear system, AX = B, A is symmetric/Hermitian positive definite
\ref magma_hesv  "hesv"         |  solve linear system, AX = B, A is symmetric/Hermitian indefinite
\ref magma_gels  "gels"         |  least squares solve, AX = B, A is rectangular
\ref magma_geev  "geev"         |  non-symmetric eigenvalue solver, AX = X Lambda
\ref magma_heev  "syev/heev"    |  symmetric eigenvalue solver, AX = X Lambda
\ref magma_heev  "syevd/heevd"  |  symmetric eigenvalue solver, AX = X Lambda, using divide & conquer
\ref magma_hegv  "sygvd/hegvd"  |  symmetric generalized eigenvalue solver, AX = BX Lambda
\ref magma_gesvd "gesvd"        |  singular value decomposition (SVD), A = U Sigma V^H
\ref magma_gesvd "gesdd"        |  singular value decomposition (SVD), A = U Sigma V^H, using divide & conquer

Computational routines  {#comp}
=================================
Computational routines solve one piece of a problem. Typically, driver
routines call several computational routines to solve the entire problem.
Here, curly braces { } group similar routines.
Starred * routines are not yet implemented in MAGMA.

Name                                                                                |  Description
-----------                                                                         |  -----------
. **Triangular factorizations**                                                     |  **Description**
\ref magma_getrf "getrf", \ref magma_potrf "potrf", \ref magma_hetrf "hetrf"        |  triangular factorization (LU, Cholesky, Indefinite)
\ref magma_getrs "getrs", \ref magma_potrs "potrs", \ref magma_hetrs "hetrs"        |  triangular forward and back solve
\ref magma_getri "getri", \ref magma_potri "potri"                                  |  triangular inverse
\ref magma_getf2 "getf2", \ref magma_potf2 "potf2"                                  |  triangular panel factorization (BLAS-2)
. **Orthogonal factorizations**                                                     |  **Description**
ge{\ref magma_geqrf "qrf", \ref magma_geqlf "qlf",  \ref magma_gelqf "lqf",  rqf*}  |  QR, QL, LQ, RQ factorization
\ref magma_geqp3 "geqp3"                                                            |  QR with column pivoting (BLAS-3)
or{\ref magma_unmqr "mqr", \ref magma_unmql "mql",  \ref magma_unmlq "mlq",  mrq*}  |  multiply by Q after factorization (real)
un{\ref magma_unmqr "mqr", \ref magma_unmql "mql",  \ref magma_unmlq "mlq",  mrq*}  |  multiply by Q after factorization (complex)
or{\ref magma_ungqr "gqr", gql*, glq*, grq*}                                        |  generate Q after factorization (real)
un{\ref magma_ungqr "gqr", gql*, glq*, grq*}                                        |  generate Q after factorization (complex)
. **Eigenvalue & SVD**                                                              |  **Description**
\ref magma_gehrd "gehrd"                                                            |  Hessenberg  reduction (in geev)
\ref magma_hetrd "sytrd/hetrd"                                                      |  tridiagonal reduction (in syev, heev)
\ref magma_gebrd "gebrd"                                                            |  bidiagonal  reduction (in gesvd)

There are many other computational routines that are mostly internal to
MAGMA and LAPACK, and not commonly called by end users.

BLAS routines  {#blas}
=================================
BLAS routines follow a similar naming scheme: precision, matrix type (for
level 2 & 3), routine name.
For BLAS routines, the **magma_ prefix** indicates a wrapper around CUBLAS
(e.g., magma_zgemm calls cublasZgemm), while the **magmablas_ prefix** indicates
our own MAGMA implementation (e.g., magmablas_zgemm). All MAGMA BLAS routines
are GPU native and take the matrix in GPU memory.
The descriptions here are simplified, omitting scalars (alpha & beta) and
transposes.
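
As a sketch (assuming the argument lists below match the magma_v2.h prototypes;
variable names are illustrative), both variants of dgemm take device matrices and
a queue:

    // C = A*B on the GPU; dA (m-by-k), dB (k-by-n), dC (m-by-n) are already in device memory
    void gemm_sketch( magma_int_t m, magma_int_t n, magma_int_t k,
                      magmaDouble_const_ptr dA, magma_int_t ldda,
                      magmaDouble_const_ptr dB, magma_int_t lddb,
                      magmaDouble_ptr       dC, magma_int_t lddc )
    {
        magma_device_t dev;
        magma_queue_t  queue;
        magma_getdevice( &dev );
        magma_queue_create( dev, &queue );

        // magma_ prefix: wrapper around cublasDgemm
        magma_dgemm( MagmaNoTrans, MagmaNoTrans, m, n, k,
                     1.0, dA, ldda, dB, lddb,
                     0.0, dC, lddc, queue );

        // magmablas_ prefix: MAGMA's own GPU implementation of the same operation
        magmablas_dgemm( MagmaNoTrans, MagmaNoTrans, m, n, k,
                         1.0, dA, ldda, dB, lddb,
                         0.0, dC, lddc, queue );

        magma_queue_sync( queue );
        magma_queue_destroy( queue );
    }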


BLAS-1: vector operations
---------------------------------
These do O(n) operations on O(n) data and are memory-bound.

Name                     |  Description
-----------              |  -----------
\ref magma_copy  "copy"  |  copy vector, y = x
\ref magma_scal  "scal"  |  scale vector, y = alpha*y
\ref magma_swap  "swap"  |  swap two vectors, y <---> x
\ref magma_axpy  "axpy"  |  y = alpha*x + y
\ref magma_nrm2  "nrm2"  |  vector 2-norm
\ref magma_iamax "amax"  |  vector max-norm
\ref magma_asum  "asum"  |  vector one-norm
\ref magma__dot  "dot"   |  dot product (real), x^T y
\ref magma__dot  "dotu"  |  dot product (complex), unconjugated, x^T y
\ref magma__dot  "dotc"  |  dot product (complex), conjugated,   x^H y

BLAS-2: matrix-vector operations
---------------------------------
These do O(n^2) operations on O(n^2) data and are memory-bound.

Name                                             |  Description
-----------                                      |  -----------
\ref magma_gemv  "gemv"                          |  general matrix-vector product, y = A*x
\ref magma_symv  "symv", \ref magma_hemv "hemv"  |  symmetric/Hermitian matrix-vector product, y = A*x
\ref magma_syr   "syr",  \ref magma_her  "her"   |  symmetric/Hermitian rank-1 update, A = A + x*x^H
\ref magma_syr2  "syr2", \ref magma_her2 "her2"  |  symmetric/Hermitian rank-2 update, A = A + x*y^H + y*x^H
\ref magma_trmv  "trmv"                          |  triangular matrix-vector product, y = A*x
\ref magma_trsv  "trsv"                          |  triangular solve, one right-hand side (RHS), solve Ax = b

BLAS-3: matrix-matrix operations
---------------------------------
These do O(n^3) operations on O(n^2) data and are compute-bound.
Level 3 BLAS are significantly more efficient than the memory-bound level 1 and level 2 BLAS.

Name                                                |  Description
-----------                                         |  -----------
\ref magma_gemm  "gemm"                             |  general matrix-matrix multiply, C = C + A*B
\ref magma_symm  "symm",  \ref magma_hemm  "hemm"   |  symmetric/Hermitian matrix-matrix multiply, C = C + A*B, A is symmetric
\ref magma_syrk  "syrk",  \ref magma_herk  "herk"   |  symmetric/Hermitian rank-k update, C = C + A*A^H, C is symmetric
\ref magma_syr2k "syr2k", \ref magma_her2k "her2k"  |  symmetric/Hermitian rank-2k update, C = C + A*B^H + B*A^H, C is symmetric
\ref magma_trmm  "trmm"                             |  triangular matrix-matrix multiply, B = A*B or B*A, A is triangular
\ref magma_trsm  "trsm"                             |  triangular solve, multiple RHS, solve A*X = B or X*A = B, A is triangular

Auxiliary routines  {#aux}
=================================
Additional BLAS-like routines, many originally defined in LAPACK.
These follow a similar naming scheme: precision, then "la", then the routine name.
MAGMA implements these common ones on the GPU, plus adds a few such as symmetrize and transpose.

For auxiliary routines, the **magmablas_ prefix** indicates
our own MAGMA implementation (e.g., magmablas_zlaswp). All MAGMA auxiliary routines
are GPU native and take the matrix in GPU memory.

Name                                |  Description
-----------                         |  -----------
\ref magma_geadd "geadd"            |  add general matrices (like axpy), B = alpha*A + B
\ref magma_laswp "laswp"            |  swap rows (in getrf)
\ref magma_laset "laset"            |  set matrix to constant
\ref magma_lacpy "lacpy"            |  copy matrix
\ref magma_lascl "lascl"            |  scale matrix
\ref magma_lange "lange"            |  norm, general matrix
\ref magma_lanhe "lansy/lanhe"      |  norm, symmetric/Hermitian matrix
\ref magma_lantr "lantr"            |  norm, triangular matrix
\ref magma_lag2  "lag2"             |  convert general    matrix from one precision to another (e.g., dlag2s is double to single)
\ref magma_lat2  "lat2"             |  convert triangular matrix from one precision to another
\ref magma_larf  "larf"             |  apply    Householder elementary reflector
\ref magma_larfg "larfg"            |  generate Householder elementary reflector
\ref magma_larfb "larfb"            |  apply      block Householder elementary reflector
\ref magma_larft "larft"            |  form T for block Householder elementary reflector
\ref magma_symmetrize "symmetrize"  |  copy lower triangle to upper triangle, or vice-versa
\ref magma_transpose  "transpose"   |  transpose matrix

Utility routines  {#util}
=================================

Memory Allocation
---------------------------------
MAGMA can use regular CPU memory allocated with malloc or new, but it may
achieve better performance using aligned and, especially, pinned memory.
There are typed versions of these (e.g., magma_zmalloc) that avoid the need
to cast and use sizeof, and un-typed versions (e.g., magma_malloc) that
are more flexible but require a (void**) cast and multiplying the number of
elements by sizeof.

Name                  |  Description
-----------           |  -----------
magma_*malloc_cpu()    |  allocate CPU memory that is aligned for better performance & reproducibility
magma_*malloc_pinned() |  allocate CPU memory that is pinned (page-locked)
magma_*malloc()        |  allocate GPU memory
magma_free_cpu()       |  free CPU memory allocated with magma_*malloc_cpu
magma_free_pinned()    |  free CPU memory allocated with magma_*malloc_pinned
magma_free()           |  free GPU memory allocated with magma_*malloc

where * is one of the four precisions, s d c z, or i for magma_int_t, or none
for an un-typed version.
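
As a sketch of how these calls fit together (assuming magma_init() has already
been called; names are illustrative and allocation return codes are omitted):

    void alloc_sketch( magma_int_t n )
    {
        magma_int_t lda  = n;
        magma_int_t ldda = magma_roundup( n, 32 );   // padded leading dimension for the GPU

        double *A, *work, *B;
        magmaDouble_ptr dA;

        // typed versions take an element count -- no sizeof, no cast
        magma_dmalloc_cpu(    &A,    lda*n  );   // aligned CPU host memory
        magma_dmalloc_pinned( &work, lda*n  );   // pinned (page-locked) CPU host memory
        magma_dmalloc(        &dA,   ldda*n );   // GPU device memory

        // un-typed version takes a size in bytes and a (void**) cast
        magma_malloc_cpu( (void**) &B, lda*n * sizeof(double) );

        magma_free_cpu( A );
        magma_free_cpu( B );
        magma_free_pinned( work );
        magma_free( dA );
    }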


Communication
---------------------------------
The names of the communication routines are from the CPU's point of view.

Name                               |  Description
-----------                        |  -----------
\ref magma_setmatrix  "setmatrix"  |  send matrix to GPU
\ref magma_setvector  "setvector"  |  send vector to GPU
\ref magma_getmatrix  "getmatrix"  |  get  matrix from GPU
\ref magma_getvector  "getvector"  |  get  vector from GPU
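
For example (a sketch: A is an m-by-n host matrix with leading dimension lda, dA a
device matrix with leading dimension ldda, and queue a magma_queue_t created as in
the earlier sketches; in MAGMA v2 these transfers take a queue argument):

    magma_dsetmatrix( m, n, A, lda, dA, ldda, queue );   // host -> device ("set")
    // ... operate on dA with GPU-interface routines ...
    magma_dgetmatrix( m, n, dA, ldda, A, lda, queue );   // device -> host ("get")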


Data types & complex numbers {#types}
=================================

Integers
---------------------------------
MAGMA uses **magma_int_t** for integers. Normally, this is mapped to the C/C++ int type.
Most systems today use the LP64 convention, meaning long and pointers are
64-bit, while int is 32-bit.

MAGMA also supports the ILP64 convention as an alternative, where int, long, and
pointers are all 64-bit. To use this, we typedef magma_int_t to be long long. To
use ILP64, define MAGMA_ILP64 or MKL_ILP64 when compiling, and link with an
ILP64 BLAS and LAPACK library; see make.inc.mkl-ilp64 for an example.

Complex numbers
---------------------------------
MAGMA supports complex numbers. Unfortunately, there is not a single standard
for how to implement complex numbers in C/C++. Fortunately, most implementations
are identical on a binary level, so you can freely cast from one to another.
The MAGMA types
are: **magmaFloatComplex**,  which in CUDA MAGMA is a typedef of cuFloatComplex,
and  **magmaDoubleComplex**, which in CUDA MAGMA is a typedef of cuDoubleComplex.

For C, we provide macros to manipulate complex numbers.
For C++ support, include the magma_operators.h header, which provides
overloaded C++ operators and functions.

C macro                |  C++ operator  |  Description
-----------            |  -----------   |  -----------
c = MAGMA_*_MAKE(r,i)  |                |  create complex number from real & imaginary parts
r = MAGMA_*_REAL(a)    |  r = real(a)   |  return real part
i = MAGMA_*_IMAG(a)    |  i = imag(a)   |  return imaginary part
c = MAGMA_*_NEGATE(a)  |  c = -a;       |  negate
c = MAGMA_*_ADD(a,b)   |  c = a + b;    |  add
c = MAGMA_*_SUB(a,b)   |  c = a - b;    |  subtract
c = MAGMA_*_MUL(a,b)   |  c = a * b;    |  multiply
c = MAGMA_*_DIV(a,b)   |  c = a / b;    |  divide
c = MAGMA_*_CNJG(a)    |  c = conj(a)   |  conjugate
r = MAGMA_*_ABS(a)     |  r = fabs(a)   |  2-norm, sqrt( real(a)^2 + imag(a)^2 )
r = MAGMA_*_ABS1(a)    |  r = abs1(a)   |  1-norm, abs(real(a)) + abs(imag(a))
. **Constants**        |                |  **Description**
c = MAGMA_*_ZERO       |                |  zero
c = MAGMA_*_ONE        |                |  one
c = MAGMA_*_NAN        |                |  not-a-number (e.g., 0/0)
c = MAGMA_*_INF        |                |  infinity (e.g., 1/0, overflow)

where * is one of the four precisions, S D C Z.
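
For example, in double-complex precision (the C++ forms require including
magma_operators.h):

    magmaDoubleComplex a = MAGMA_Z_MAKE( 1.0, -2.0 );   // a = 1 - 2i
    magmaDoubleComplex b = MAGMA_Z_MAKE( 3.0,  0.5 );

    // C style
    magmaDoubleComplex c = MAGMA_Z_MUL( a, b );
    double r = MAGMA_Z_REAL( c );

    // C++ style, using the overloaded operators
    magmaDoubleComplex d = a * b;
    double s = real( d );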


Conventions for variables {#variables}
=================================

Here are general guidelines for variable names; there are of course
exceptions to these.

- Uppercase letters indicate matrices: A, B, C, X.

- Lowercase letters indicate vectors: b, x, y, z.

- "d" prefix indicates matrix or vector on GPU device: dA, dB, dC, dX; db, dx, dy, dz.

- Greek letter names indicate scalars: alpha, beta.

- m, n, k are matrix dimensions.

Typically, the order of arguments is:
- options (uplo, etc.)
- matrix sizes (m, n, k, etc.),
- input matrices & vectors (A, lda, x, incx, etc.)
- output matrices & vectors
- workspaces (work, lwork, etc.)
- info error code

LAPACK and MAGMA use column-major matrices. For matrix X with dimension
(lda,n), element X(i, j) is X[ i + j*lda ].
For symmetric, Hermitian, and triangular matrices, only the lower or upper
triangle is accessed, as specified by the uplo argument; the other triangle is
ignored.

lda is the leading dimension of matrix A; similarly ldb for B, ldda for dA, etc.
It should immediately follow the matrix pointer in the argument list.
The leading dimension can be the number of rows; or, if A is a sub-matrix of
a larger parent matrix, lda is the leading dimension (i.e., the number of rows)
of the parent matrix.

On the GPU, it is often beneficial to round the leading dimension up to a
multiple of 32, to provide better performance. This aligns memory reads so they
are coalesced. This is provided by the magma_roundup function:

    ldda = magma_roundup( m, 32 );

The formula ((m + 31)/32)*32 also works, relying on truncating integer division,
but the roundup function is clearer to use.
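
For example (a sketch; the device matrix is then allocated with the padded leading
dimension):

    magma_int_t m = 1000, n = 500;
    magma_int_t ldda = magma_roundup( m, 32 );   // 1000 -> 1024
    magmaDouble_ptr dA;
    magma_dmalloc( &dA, ldda*n );
    // element dA(i,j) is dA[ i + j*ldda ]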

On the CPU, it is often beneficial to ensure that the leading dimension
is **not** a multiple of the page size (often 4 KiB), to minimize TLB misses.

For vectors, incx is the increment or stride between elements of vector x. In
all cases, incx != 0. In most cases, if incx < 0, then the vector is indexed in
reverse order, for instance, using Matlab notation,

    incx =  1   means   x( 1 : 1 : n     )
    incx =  2   means   x( 1 : 2 : 2*n-1 )

while

    incx = -1   means   x( n     : -1 : 1 )
    incx = -2   means   x( 2*n-1 : -2 : 1 )

For several routines (amax, amin, asum, nrm2, scal), the order is irrelevant,
so negative incx are not allowed; incx > 0.


Constants {#constants}
=================================

MAGMA defines a few constant parameters, such as `MagmaTrans, MagmaNoTrans`,
that are equivalent to CBLAS and LAPACK parameters. The
naming and numbering of these parameters follow those of
[CBLAS from Netlib](http://www.netlib.org/blas/blast-forum/cblas.tgz), the
[C Interface to LAPACK from Netlib](http://www.netlib.org/lapack/lapwrapc/), and
[PLASMA](http://icl.utk.edu/plasma/).

MAGMA includes functions, `lapack_xyz_const()`, which take MAGMA's integer
constants and return LAPACK's string constants, where `xyz` is a MAGMA type such
as `uplo`, `trans`, etc. From the standpoint of LAPACK, only the first letter of
each string is significant. Nevertheless, the functions return meaningful
strings, such as "No transpose", "Transpose", "Upper", "Lower",
etc. Similarly, there are functions to go from MAGMA's integer constants to
CBLAS, OpenCL's clBLAS, and CUDA's cuBLAS integer constants.

There are also functions, `magma_xyz_const()`, to go in the opposite direction,
from LAPACK's string constants to MAGMA's integer constants.
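
For instance (a sketch using the converters listed in the tables below):

    magma_trans_t trans = magma_trans_const( 'N' );      // 'N' -> MagmaNoTrans
    magma_uplo_t  uplo  = magma_uplo_const( 'L' );       // 'L' -> MagmaLower

    const char* s = lapack_trans_const( MagmaTrans );    // "Transpose"
    const char* u = lapack_const( MagmaLower );          // "Lower"
    char        c = lapacke_const( MagmaTrans );         // 'T'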


The most common constants are those defined for BLAS routines:

- enum { MagmaNoTrans, MagmaTrans, MagmaConjTrans } magma_trans_t

  Whether a matrix is not transposed, transposed, or conjugate-transposed.
  For a real matrix, Trans and ConjTrans have the same meaning.

- enum { MagmaLower, MagmaUpper, MagmaFull } magma_uplo_t

  Whether the lower or upper triangle of a matrix is given, or the full matrix.

- enum { MagmaLeft, MagmaRight } magma_side_t

  Whether the matrix is on the left or right.

- enum { MagmaUnit, MagmaNonUnit } magma_diag_t

  Whether the diagonal is assumed to be unit (all ones) or not.

Additional constants for specific routines are defined in the documentation for
the routines.

Because MAGMA, CBLAS, LAPACK, CUBLAS, and clBlas use potentially different
constants, converters between them are provided.

These convert LAPACK constants to MAGMA constants.
Note that the meaning of LAPACK constants depends on the context:
'N' can mean False, NoTrans, NonUnit, NoVec, etc.
Here, curly braces { } group similar constants.

.              | Function            |  .             |  Description
-----------    | ----                | ----           |  -----------
magma_bool_t   | magma_bool_const    | ( character )  |  Map 'N', 'Y'      \n to MagmaFalse, MagmaTrue
magma_order_t  | magma_order_const   | ( character )  |  Map 'R', 'C'      \n to MagmaRowMajor, MagmaColMajor
magma_trans_t  | magma_trans_const   | ( character )  |  Map 'N', 'T', 'C' \n to MagmaNoTrans, MagmaTrans, MagmaConjTrans
magma_uplo_t   | magma_uplo_const    | ( character )  |  Map 'L', 'U'      \n to MagmaLower, MagmaUpper
magma_diag_t   | magma_diag_const    | ( character )  |  Map 'N', 'U'      \n to MagmaNonUnit, MagmaUnit
magma_side_t   | magma_side_const    | ( character )  |  Map 'L', 'R'      \n to MagmaLeft, MagmaRight
magma_norm_t   | magma_norm_const    | ( character )  |  Map 'O', '1', '2', 'F', 'E', 'I', 'M' \n to Magma{One, Two, Frobenius, Inf, Max}Norm
magma_dist_t   | magma_dist_const    | ( character )  |  Map 'U', 'S', 'N' \n to MagmaDist{Uniform, Symmetric, Normal}
magma_vec_t    | magma_vec_const     | ( character )  |  Map 'V', 'N', 'I', 'A', 'S', 'O' \n to MagmaVec, Magma{No, I, All, Some, Overwrite}Vec
magma_range_t  | magma_range_const   | ( character )  |  Map 'A', 'V', 'I' \n to MagmaRange{All, V, I}
magma_vect_t   | magma_vect_const    | ( character )  |  Map 'Q', 'P'      \n to MagmaQ, MagmaP
magma_direct_t | magma_direct_const  | ( character )  |  Map 'F', 'B'      \n to MagmaForward, MagmaBackward
magma_storev_t | magma_storev_const  | ( character )  |  Map 'C', 'R'      \n to MagmaColumnwise, MagmaRowwise


These do the inverse map, converting MAGMA to LAPACK constants.
From the standpoint of LAPACK, only the first letter of
each string is significant. Nevertheless, the functions return meaningful
strings, such as "No transpose", "Transpose".
Substitute `lapacke` for `lapack` to get a version that returns a single char instead of a string (const char*).

.           | Function             | .                   |  Description
----------- | ----                 | ----                |  -----------
const char* | lapack_bool_const    | ( magma_bool_t   )  |  Inverse of magma_bool_const()
const char* | lapack_order_const   | ( magma_order_t  )  |  Inverse of magma_order_const()
const char* | lapack_trans_const   | ( magma_trans_t  )  |  Inverse of magma_trans_const()
const char* | lapack_uplo_const    | ( magma_uplo_t   )  |  Inverse of magma_uplo_const()
const char* | lapack_diag_const    | ( magma_diag_t   )  |  Inverse of magma_diag_const()
const char* | lapack_side_const    | ( magma_side_t   )  |  Inverse of magma_side_const()
const char* | lapack_norm_const    | ( magma_norm_t   )  |  Inverse of magma_norm_const()
const char* | lapack_dist_const    | ( magma_dist_t   )  |  Inverse of magma_dist_const()
const char* | lapack_vec_const     | ( magma_vec_t    )  |  Inverse of magma_vec_const()
const char* | lapack_range_const   | ( magma_range_t  )  |  Inverse of magma_range_const()
const char* | lapack_vect_const    | ( magma_vect_t   )  |  Inverse of magma_vect_const()
const char* | lapack_direct_const  | ( magma_direct_t )  |  Inverse of magma_direct_const()
const char* | lapack_storev_const  | ( magma_storev_t )  |  Inverse of magma_storev_const()
const char* | lapack_const         | ( constant )        |  Map any MAGMA constant, Magma*, to an LAPACK string constant
char        | lapacke_const        | ( constant )        |  Map any MAGMA constant, Magma*, to an LAPACKE character


To convert MAGMA to Nvidia's CUBLAS constants:

.                 | Function           | .          |  Description
-----------       | ----               | ----       |  -----------
cublasOperation_t | cublas_trans_const | ( trans )  |  Map MagmaNoTrans, MagmaTrans, MagmaConjTrans \n to CUBLAS_OP_N, CUBLAS_OP_T, CUBLAS_OP_C
cublasFillMode_t  | cublas_uplo_const  | ( uplo  )  |  Map MagmaLower,   MagmaUpper \n to CUBLAS_FILL_MODE_LOWER, CUBLAS_FILL_MODE_UPPER
cublasDiagType_t  | cublas_diag_const  | ( diag  )  |  Map MagmaNonUnit, MagmaUnit  \n to CUBLAS_DIAG_NON_UNIT,   CUBLAS_DIAG_UNIT
cublasSideMode_t  | cublas_side_const  | ( side  )  |  Map MagmaLeft,    MagmaRight \n to CUBLAS_SIDE_LEFT,       CUBLAS_SIDE_RIGHT


To convert MAGMA to AMD's clBlas constants:

.                 | Function           | .          |  Description
-----------       | ----               | ----       |  -----------
clblasOrder       | clblas_order_const | ( order )  |  Map MagmaRowMajor, MagmaColMajor \n to clAmdBlasRowMajor, clAmdBlasColumnMajor
clblasTranspose   | clblas_trans_const | ( trans )  |  Map MagmaNoTrans,  MagmaTrans, MagmaConjTrans \n to clAmdBlasNoTrans, clAmdBlasTrans, clAmdBlasConjTrans
clblasUplo        | clblas_uplo_const  | ( uplo  )  |  Map MagmaLower,    MagmaUpper    \n to clAmdBlasLower,    clAmdBlasUpper
clblasDiag        | clblas_diag_const  | ( diag  )  |  Map MagmaNonUnit,  MagmaUnit     \n to clAmdBlasNonUnit,  clAmdBlasUnit
clblasSide        | clblas_side_const  | ( side  )  |  Map MagmaLeft,     MagmaRight    \n to clAmdBlasLeft,     clAmdBlasRight


To convert MAGMA to CBLAS constants:

.                    | Function           | .          |  Description
-----------          | ----               | ----       |  -----------
enum CBLAS_ORDER     | cblas_order_const  | ( order )  |  Map MagmaRowMajor, MagmaColMajor \n to CblasRowMajor, CblasColMajor
enum CBLAS_TRANSPOSE | cblas_trans_const  | ( trans )  |  Map MagmaNoTrans,  MagmaTrans, MagmaConjTrans \n to CblasNoTrans, CblasTrans, CblasConjTrans
enum CBLAS_UPLO      | cblas_uplo_const   | ( uplo  )  |  Map MagmaLower,    MagmaUpper    \n to CblasLower,    CblasUpper
enum CBLAS_DIAG      | cblas_diag_const   | ( diag  )  |  Map MagmaNonUnit,  MagmaUnit     \n to CblasNonUnit,  CblasUnit
enum CBLAS_SIDE      | cblas_side_const   | ( side  )  |  Map MagmaLeft,     MagmaRight    \n to CblasLeft,     CblasRight


Errors {#errors}
=================================

Driver and computational routines, and a few BLAS/auxiliary routines, currently
return errors both as a return value and in the info argument. The return value
and info should always be identical. In general, the meaning is as given in this
table. Predefined error codes are large negative numbers. Using the symbolic
constants below is preferred, but the numeric values can be found in
include/magma_types.h.

Info                       |  Description
-----------                |  -----------
info = 0 (MAGMA_SUCCESS)   |  Successful exit
info < 0, but small        |  For info = -i, the i-th argument had an illegal value
info > 0                   |  Function-specific error such as singular matrix
MAGMA_ERR_DEVICE_ALLOC     |  Could not allocate GPU device memory
MAGMA_ERR_HOST_ALLOC       |  Could not allocate CPU host memory
MAGMA_ERR_ILLEGAL_VALUE    |  An argument had an illegal value (deprecated; instead it should return -i to say the i-th argument was bad)
MAGMA_ERR_INVALID_PTR      |  Can't free pointer
MAGMA_ERR_NOT_IMPLEMENTED  |  Function or option not implemented
MAGMA_ERR_NOT_SUPPORTED    |  Function or option not supported on the current architecture

magma_xerbla() is called to report errors (mostly bad arguments) to the user.

magma_strerror() returns a string description of an error code.
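
A typical check, as a sketch (assumes stdio.h and the variables from the earlier
getrf sketch):

    magma_int_t info = 0;
    magma_dgetrf( m, n, A, lda, ipiv, &info );   // the return value and info are identical
    if (info != 0) {
        // info = -i          : the i-th argument had an illegal value
        // info > 0           : routine-specific; for getrf, U(info,info) is exactly zero
        // info = MAGMA_ERR_* : e.g., a failed device or host allocation
        printf( "magma_dgetrf failed: info = %lld (%s)\n",
                (long long) info, magma_strerror( info ) );
    }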


Methodology {#methodology}
=================================

One-sided matrix factorizations
---------------------------------

The one-sided LU, Cholesky, and QR factorizations form a basis for solving
linear systems. A general recommendation is to use
LU for general n-by-n matrices, Cholesky for symmetric/Hermitian positive
definite (SPD) matrices, and QR for solving least squares problems,

    min || A x - b ||

for general m-by-n, m > n matrices.

We use hybrid algorithms where the computation is split between the GPU
and the CPU. In general, for the one-sided factorizations,
the panels are factored on the CPU and the trailing sub-matrix updates
on the GPU. Look-ahead techniques are used to overlap the CPU and GPU
work and some communications.

In both the CPU and GPU interfaces the matrix to be factored resides in the
GPU memory, and CPU-GPU transfers are associated only with the panels.
The resulting matrix is accumulated (on the CPU or GPU, according to the
interface) as the computation proceeds, as a byproduct of the algorithm, rather
than sending the entire matrix when needed. In the CPU interface, the original
transfer of the matrix to the GPU is overlapped with the factorization of the
first panel. In this sense the CPU and GPU interfaces, although similar,
are not derivatives of each other as they have different communication patterns.

Although the solution step requires O(n) times fewer floating point operations
than the factorization, it is still very important to optimize it.
Solving a triangular system of equations can be very slow because
the computation is bandwidth limited and not naturally parallel.
Various approaches have been proposed in the past. We use an approach
where diagonal blocks of A are explicitly inverted and used in a block
algorithm. This results in a high performance, numerically stable algorithm,
especially when used with triangular matrices coming from numerically stable
factorization algorithms (e.g., as in LAPACK and MAGMA).

For instances when the GPU's single precision performance is much higher than
its double precision performance, MAGMA provides a second set of solvers,
based on the mixed precision iterative refinement technique.
The solvers are again based on the corresponding LU, QR, and Cholesky
factorizations, and are designed to solve linear problems to double
precision accuracy, but at a speed characteristic of the much
faster single precision computations.
The idea is to use single precision for the bulk of the computation,
namely the factorization step, and then use that factorization
as a preconditioner in a simple iterative refinement process in double
precision arithmetic. This often results in the desired high performance
and high accuracy solvers.


Two-sided matrix factorizations
---------------------------------

As the one-sided matrix factorizations are the basis for various linear
solvers, the two-sided matrix factorizations are the basis for eigen-solvers,
and therefore form an important class of dense linear algebra routines.
It has traditionally been more difficult to achieve high performance in the
two-sided factorizations. The reason is that the two-sided
factorizations involve large matrix-vector products, which are memory bound,
and as the gap between compute and communication power keeps increasing,
these memory-bound operations become an increasingly difficult bottleneck to
handle. GPUs, though, offer an attractive possibility to accelerate them.
Indeed, having high memory bandwidth (e.g., 10 times larger than current CPU
bus bandwidths), GPUs can accelerate matrix-vector products significantly
(10 to 30 times). Here, the panel factorization itself is hybrid, while the
trailing matrix update is performed on the GPU.

*/