Question about BlockSize

I would expect that, as long as the registers (and shared memory) available on the multiprocessor are not exceeded, the code should run faster when we use more threads per block.

So I ran a simple test with -blocksize:0 and -blocksize:1 on example/main/06_Wavemaker.


The picture above is the run.out for -blocksize:0, which sets 128 threads per block,

and this one shows -blocksize:1, with 1024 threads per block.

My device is a 2080 Ti in WDDM mode, CUDA 10.1, sm=70.

So, the questions are:

1. Why does the code run slower when using -blocksize:1?

2. What does the column Time/Sec mean? I am at least sure that it is not the physical time.

Thanks

Comments

  • Hmm, I know this is not a CUDA developer forum, but maybe someone knows.

    How should I understand the capacity and number of registers per SM? For example, suppose I declare 2 bool values in a kernel function, and we know that a register holds 32 bits. In that situation, do I use 16 bits of one register, or 2 registers?

  • Why does the code run slower when using -blocksize:1?

    The optimal block size depends on several factors, and option 1 only tries to maximize occupancy, which does not always improve performance. The default option usually gives the best performance or close to it.

    What does the column Time/Sec mean? I am at least sure that it is not the physical time.

    This column shows the execution time needed to simulate 1 second of physical time.
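
To make the occupancy point a bit more concrete, here is a rough host-side sketch of how resident threads per SM could be estimated for a given block size. The helper name residentThreads, the struct SmLimits and the register counts used in the note below are made up for illustration and are not taken from DualSPHysics. The per-SM limits are the values usually quoted for compute capability 7.5 and should be treated as assumptions to check against your own device. Allocation granularity is ignored, so this is only an upper-bound estimate.

```cpp
#include <algorithm>

// Per-SM limits. Assumed values for compute capability 7.5 (Turing);
// verify them for your own GPU before relying on the numbers.
struct SmLimits {
    int maxThreadsPerSm = 1024;   // resident threads per SM
    int maxBlocksPerSm  = 16;     // resident blocks per SM
    int regsPerSm       = 65536;  // 32-bit registers per SM
    int smemPerSm       = 65536;  // bytes usable as shared memory
};

// Hypothetical helper: estimate how many threads can be resident on one SM.
// blockSize and regsPerThread must be > 0; smemPerBlock is in bytes (0 = none).
// Register and shared-memory allocation granularity is ignored.
int residentThreads(const SmLimits& sm, int blockSize,
                    int regsPerThread, int smemPerBlock) {
    // Blocks allowed by the thread limit, the register file and shared memory.
    int byThreads = sm.maxThreadsPerSm / blockSize;
    int byRegs    = sm.regsPerSm / (regsPerThread * blockSize);
    int bySmem    = smemPerBlock > 0 ? sm.smemPerSm / smemPerBlock
                                     : sm.maxBlocksPerSm;
    // The tightest limit decides how many blocks fit at once.
    int blocks = std::min({sm.maxBlocksPerSm, byThreads, byRegs, bySmem});
    return blocks * blockSize;
}

// Occupancy as a fraction of the SM's maximum resident threads.
double occupancy(const SmLimits& sm, int blockSize,
                 int regsPerThread, int smemPerBlock) {
    return static_cast<double>(
               residentThreads(sm, blockSize, regsPerThread, smemPerBlock)) /
           sm.maxThreadsPerSm;
}
```

With these assumed limits and an invented kernel using 40 registers per thread, both residentThreads(SmLimits{}, 128, 40, 0) and residentThreads(SmLimits{}, 1024, 40, 0) give 1024, so the larger block brings no extra occupancy. With 72 registers per thread, 128 threads per block still keeps 896 threads resident, while a 1024-thread block does not fit at all (the result is 0). Such a kernel would only launch with 1024 threads if the compiler were told to cap it at 64 registers per thread, which can cause spills to local memory. That is one plausible way a larger block size ends up slower, though the real reason in this test case would need profiling.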
