Extremely slow cudaMemcpy after my code modification
I have made some modifications to Interactionforces. Since then, the simulation runs very slowly, so I profiled it with nvprof. The results are below (just 5 steps).
As you can see, cudaMemcpy takes a lot of time, which is strange. While debugging, I found that in each loop, the first cudaMemcpy after Interactionforces is very slow; the copies after it are normal. That first copy is in ReduMaxFloat and copies a single float (the maximum value of ViscDt) from device to host.
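One likely explanation relies on standard CUDA behavior: kernel launches are asynchronous, and a plain cudaMemcpy into pageable host memory blocks the host until all earlier work on the stream has finished. The first cudaMemcpy after Interactionforces would then absorb the kernel's execution time, and nvprof's API trace would report that waiting as cudaMemcpy time. The sketch below is a hypothetical test, not code from the original project. `heavyKernel`, `Timing` and `measure` are made-up names, and `heavyKernel` only stands in for a slow force kernel. It times the same one-float copy with and without an explicit `cudaDeviceSynchronize` before it.

```cuda
#include <chrono>
#include <cuda_runtime.h>

// Hypothetical stand-in for a heavy force kernel:
// each thread runs a long chain of dependent arithmetic.
__global__ void heavyKernel(float* out, int n, int iters) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  float v = static_cast<float>(i);
  for (int k = 0; k < iters; ++k) v = v * 0.999f + 0.001f;
  out[i] = v;
}

// Host-side wall-clock durations (milliseconds) of each phase.
struct Timing {
  float launchMs;  // kernel launch call (returns before the kernel finishes)
  float syncMs;    // explicit cudaDeviceSynchronize (about 0 if skipped)
  float copyMs;    // cudaMemcpy of a single float, device to host
};

static float msBetween(std::chrono::steady_clock::time_point a,
                       std::chrono::steady_clock::time_point b) {
  return std::chrono::duration<float, std::milli>(b - a).count();
}

// Launches the kernel, optionally waits for it, then copies one float back.
// Error checking is omitted for brevity.
Timing measure(bool syncFirst, int n, int iters) {
  cudaFree(0);  // force CUDA context creation outside the timed region
  float* dOut = nullptr;
  cudaMalloc(&dOut, n * sizeof(float));

  using Clock = std::chrono::steady_clock;
  auto t0 = Clock::now();
  heavyKernel<<<(n + 255) / 256, 256>>>(dOut, n, iters);  // asynchronous
  auto t1 = Clock::now();
  if (syncFirst) cudaDeviceSynchronize();  // wait for the kernel here instead
  auto t2 = Clock::now();
  float host = 0.0f;
  // Blocking copy: it cannot start until the kernel has finished.
  cudaMemcpy(&host, dOut, sizeof(float), cudaMemcpyDeviceToHost);
  auto t3 = Clock::now();

  cudaFree(dOut);
  return Timing{msBetween(t0, t1), msBetween(t1, t2), msBetween(t2, t3)};
}
```

If this explanation holds, calling `measure(false, 1 << 20, 20000)` should show a large `copyMs`. With `measure(true, ...)`, the same time should move into `syncMs` and leave `copyMs` small, even though the kernel is identical. In that case the slowdown comes from the modified kernel itself, and nvprof's GPU kernel summary is the place to look rather than the copy.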
I also checked the original code, version 4.4.09 (two-phase dam break case, one step). It shows the same behavior, but there the delay is acceptable.
I think I only changed the computation itself and nothing else.
Thanks a lot for any advice.