== Sample CUDA application for shared memory bank conflicts ==

Transposes an N x N square matrix of float elements in global memory
and generates an output matrix in global memory.

Defines two versions of the CUDA kernel:
transposeCoalesced       : Coalesced global memory transpose with shared memory bank conflicts
transposeNoBankConflicts : Coalesced global memory transpose with reduced shared memory bank conflicts

Compiling the code:
==================
> nvcc -lineinfo -gencode arch=compute_70,code=sm_70 -gencode arch=compute_72,code=sm_72 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_87,code=sm_87 -gencode arch=compute_89,code=sm_89 sharedBankConflicts.cu -o sharedBankConflicts

Command line arguments (all are optional):
==========================================
1) Integer value selecting the kernel. If not specified, 1 is used.
   1: Use transposeCoalesced() kernel
   2: Use transposeNoBankConflicts() kernel
2) Matrix size. Must be greater than or equal to the tile size
   (TILE_DIM, defined in source file "sharedBankConflicts.cu")
   and must be an integral multiple of the tile size.
   Default value: DEFAULT_MATRIX_SIZE (defined in source file "sharedBankConflicts.cu")
3) String value selecting the cache configuration. Can be one of "none", "shared", "l1", "equal".
   Default value: "none"
   Refer to the CUDA Runtime API cudaFuncSetCacheConfig() documentation for details of cache configuration.

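The argument rules above can be sketched as a small host-side check. This is
only an illustration, not the parsing code from "sharedBankConflicts.cu": the
values given to TILE_DIM and DEFAULT_MATRIX_SIZE are assumptions (the real
values are defined in the source file), and Options, isValidMatrixSize(),
isValidCacheConfig() and parseOptions() are hypothetical names.

```cpp
#include <cstdlib>
#include <string>

// Assumed values for illustration only; the real ones live in
// sharedBankConflicts.cu.
constexpr int TILE_DIM = 32;
constexpr int DEFAULT_MATRIX_SIZE = 4096;

// Hypothetical container for the three optional arguments.
struct Options {
    int kernel = 1;                        // 1: transposeCoalesced, 2: transposeNoBankConflicts
    int matrixSize = DEFAULT_MATRIX_SIZE;  // N
    std::string cacheConfig = "none";      // "none", "shared", "l1" or "equal"
};

// Argument 2: N must be >= TILE_DIM and an integral multiple of TILE_DIM.
bool isValidMatrixSize(int n) {
    return n >= TILE_DIM && n % TILE_DIM == 0;
}

// Argument 3: only the four documented strings are accepted.
bool isValidCacheConfig(const std::string& s) {
    return s == "none" || s == "shared" || s == "l1" || s == "equal";
}

// Fills opts from argv, keeping the defaults for missing arguments.
// Returns false as soon as one argument breaks a documented rule.
bool parseOptions(int argc, const char* const* argv, Options& opts) {
    if (argc > 1) {
        opts.kernel = std::atoi(argv[1]);
        if (opts.kernel != 1 && opts.kernel != 2) return false;
    }
    if (argc > 2) {
        opts.matrixSize = std::atoi(argv[2]);
        if (!isValidMatrixSize(opts.matrixSize)) return false;
    }
    if (argc > 3) {
        opts.cacheConfig = argv[3];
        if (!isValidCacheConfig(opts.cacheConfig)) return false;
    }
    return true;
}
```

With the assumed TILE_DIM of 32, a matrix size of 1024 is accepted and 1000 is
rejected, since 1000 is not a multiple of 32.
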
Sample usage:
============
- Run with default arguments - transposeCoalesced() kernel and default value of N
  > ./sharedBankConflicts

- Run with the transposeNoBankConflicts() kernel and default value of N
  > ./sharedBankConflicts 2

- Run with the transposeCoalesced() kernel and N=1024
  > ./sharedBankConflicts 1 1024

- Run with the transposeNoBankConflicts() kernel with N=1024 and cache config option "l1"
  (to prefer larger L1 cache and smaller shared memory)
  > ./sharedBankConflicts 2 1024 l1

Profiling the sample using Nsight Compute command line
======================================================
- Profile transposeCoalesced() - the initial version of the kernel
  > ncu --set full --import-source on -o transposeCoalesced.ncu-rep ./sharedBankConflicts 1

- Profile transposeNoBankConflicts() - the updated version of the kernel
  > ncu --set full --import-source on -o transposeNoBankConflicts.ncu-rep ./sharedBankConflicts 2

The profiler report files for the sample are also provided and they can be opened
in the Nsight Compute UI using the "File->Open" menu option.
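
When comparing the two reports, it can help to know roughly what to expect.
The sketch below is a host-side model, not code from the sample. It assumes
that shared memory has 32 banks of 4-byte words, that the tile is 32 floats
wide, and that transposeNoBankConflicts() pads each tile row by one float
(a common technique; the kernel source is not reproduced here, so treat this
as an assumption). columnAccessConflictDegree() is a hypothetical helper name.

```cpp
#include <algorithm>
#include <array>

constexpr int kNumBanks = 32;  // assumed: 32 banks, one 4-byte word each
constexpr int kWarpSize = 32;  // threads per warp

// Models a warp in which thread t reads tile[t][column] from a float tile
// whose rows are rowStride floats apart. Returns the largest number of
// threads that land in the same bank (1 means conflict-free).
int columnAccessConflictDegree(int rowStride, int column) {
    std::array<int, kNumBanks> hits{};  // per-bank access count, zero-initialized
    for (int t = 0; t < kWarpSize; ++t) {
        int wordIndex = t * rowStride + column;  // offset in 4-byte words
        ++hits[wordIndex % kNumBanks];           // consecutive words map to consecutive banks
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

Under these assumptions, columnAccessConflictDegree(32, 0) returns 32 (every
thread in the warp hits the same bank), while columnAccessConflictDegree(33, 0)
returns 1 (the one-float padding spreads the column across all banks).
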