Question about BlockSize
My understanding is that as long as the registers and shared memory available on the multiprocessor are not exceeded, the code should run faster when more threads per block are used.
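To make that assumption concrete, here is a rough sketch of the resource check I have in mind. The per-multiprocessor limits (65536 registers, 1024 resident threads, 16 resident blocks) and the registers-per-thread values are assumptions for illustration only, not values measured from this code, and the model ignores shared memory and register allocation granularity.

```python
# Simplified occupancy model (illustrative assumptions, not measured values).
# Assumed per-multiprocessor limits: 65536 registers, 1024 resident threads,
# 16 resident blocks. Shared memory and allocation granularity are ignored.

def resident_blocks_per_sm(threads_per_block, regs_per_thread,
                           regs_per_sm=65536, max_threads_per_sm=1024,
                           max_blocks_per_sm=16):
    """Number of blocks that fit on one multiprocessor at the same time."""
    by_threads = max_threads_per_sm // threads_per_block
    by_regs = regs_per_sm // (regs_per_thread * threads_per_block)
    return min(by_threads, by_regs, max_blocks_per_sm)


def occupancy(threads_per_block, regs_per_thread, max_threads_per_sm=1024):
    """Fraction of the multiprocessor's thread slots that are filled."""
    blocks = resident_blocks_per_sm(threads_per_block, regs_per_thread,
                                    max_threads_per_sm=max_threads_per_sm)
    return blocks * threads_per_block / max_threads_per_sm


if __name__ == "__main__":
    # Hypothetical kernels using 64 and 80 registers per thread.
    for regs in (64, 80):
        for block in (128, 1024):
            print(regs, block,
                  resident_blocks_per_sm(block, regs),
                  occupancy(block, regs))
```

Under these assumed numbers, a kernel with 64 registers per thread fills the multiprocessor equally well with 128 or 1024 threads per block, while one with 80 registers per thread cannot fit a single 1024-thread block at all. So in this model a larger block only helps if the per-thread register use stays low enough.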
So I ran a simple test with -blocksize:0 and -blocksize:1 on example/main/06_Wavemaker.
The picture above is the run.out for -blocksize:0, which sets 128 threads per block,
and this one shows -blocksize:1, which sets 1024 threads per block.
My device is a 2080 Ti in WDDM mode, with CUDA 10.1 and sm=70.
So my questions are:
1. Why does the code run slower when using -blocksize:1?
2. What does the Time/Sec column mean? I am fairly sure it does not mean the physical time.