Block Size
---------------------------
- Transpose
  * Q6600 ran fast at 200 for 10k matrices.  didn't check higher
  * Pentium-M degrades performance slightly for 2k matrix.
  * Selected 60 as a compromise between Q6600 and pentium-m performance.
- Block Cholesky
  * 20 seems to be optimal, 60 slows it down a lot.


Matrix Size optimization and Decompositions/Linear Solvers
---------------------------

Once a new instance of a Linear Solver or Decomposition has been created no more dynamics adjustments should
happen for the input matrix size.  This is done to simply the code and testing for correctness by reducing
the number of permutations.


Block Cholesky Decomposition
---------------------------
The DMatrixRMaj based CholeskyDecompositionBlock class is easier to tune and offers equivalent
performance to BlockMatrix64F based BlockCholeskyOuter class. CholeskyDecompositionBlock achieves
optimal performance with munch smaller block sizes, but its performance actually degrades when
larger blocks that are optimal for BlockMatrix64F algorithms are used.  CholeskyDecompositionBlock
also works directly with DMatrixRMaj and avoids the need to convert matrix types.

Block matrix solver is much faster than the row-major solver.

However selecting the default class for Cholesky decomposition is not obvious because using
CholeskyDecompositionBlock would require an additional tuning parameter making the library
even more difficult to use.

Block QR Decomposition
--------------------------

Saving the W matrix instead of recomputing it each time was tried.  For smaller matrices it has
a noticeable speed improvement when solving.  Anything over 2k it seems to be negligible and almost double 
the memory required.  Block QR Decomposition is only used on larger matrices so there is no point in
saving W for later reuse when solving.  Based on profiling results W can save about 5% of runtime when solving
a system.

Block Matrix Multiply
---------------------------
- Block matrix multiplication does have fewer cache misses.
- Converting from row major to block and multiplying causes too many cache misses.
- Two different types of block matrix were created.
  * single continuous array
  * N*M arrays
- After making the code ugly and barely readable they had comparable performance to multReorder(),
  when multiplying two block matrices together.
- The code currently committed and that resides in experimental has
  not been optimized as much, but is readable.

Unrolled Matrix Multiplication
---------------------------
- Tried to unroll either a row or column in either of the inputs
- Does result in improve performance of small square matrices
- Does not always translate to tall or wide matrices
  * can be much slower than other orders
- Did not integrate into library because of added complexity

Combine matrix for DenseMatrix64
---------------------------
combine() is a function in SimpleMatrix that combines two matrices together and grows if needed.
No equivalent is directly provided in CommonOps since it is horribly memory inefficient.  Why
use DenseMatrix64 if you are going to do that.

QR Decomposition: Column Major vs. Transpose
---------------------------
Two variants of QR decomposition currently exist in the code.  One converts the input matrix
into a 2D array in a column major format and the other creates a transposed matrix.  QR
decomposition has fewer cache misses when internally it reformats matrices like this.

The column major 2D array format is about 10% faster than the transpose algorithm due
to less array traversal overhead.  However, it requires specialized code and increases
maintenance overhead.  It will also not benefit from improvements to more common operations.

Deleting the 2D array format was being considered for sake of simplifying the code base.
However, it is actually an idea format for QR with column pivots and results in simpler
faster code.  So it was decided to keep both variants.