GPU performance evaluation

Following the discussion here: forums.dual.sphysics.org/discussion/1846

A few of us (@Asalih3d), users of DSPH, wanted to evaluate the performance of different hardware setups. This would help people decide whether they should upgrade their hardware.

As a first step, we chose to run the DamBreak example, as it is simple.

Here is the XML case definition I've run.

It is set with dp=0.01; I've run tests with dp from 0.01 down to 0.005 in steps of 0.001.
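
For anyone who wants to reproduce the sweep, here is a minimal sketch of how it can be automated. The file name, XML path and binary names are assumptions based on the standard DualSPHysics 5.0 DamBreak example layout (CaseDambreak_Def.xml, GenCase_win64, DualSPHysics5.0_win64); adjust them to your installation.

```python
# Minimal sketch of the dp sweep; file and binary names are assumptions
# based on the standard DualSPHysics 5.0 DamBreak example layout.
import subprocess
import time
import xml.etree.ElementTree as ET

for dp in (0.010, 0.009, 0.008, 0.007, 0.006, 0.005):
    # dp is the 'dp' attribute of <casedef><geometry><definition>.
    tree = ET.parse("CaseDambreak_Def.xml")
    tree.getroot().find("casedef/geometry/definition").set("dp", str(dp))
    tree.write("CaseDambreak_Def.xml")

    # Regenerate the particles, then run the solver on the GPU.
    subprocess.run(["GenCase_win64", "CaseDambreak_Def", "CaseDambreak"],
                   check=True)
    t0 = time.perf_counter()
    subprocess.run(["DualSPHysics5.0_win64", "-gpu", "CaseDambreak",
                    "CaseDambreak_out"], check=True)
    print(f"dp={dp}: {time.perf_counter() - t0:.1f} s wall-clock")
```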

Here are the results (3x GTX 1080 Ti + 2x GTX 1060).

Clock speeds are not the spec ones, but the ones I read under load with the GPU-Z tool (don't forget to "pre-heat" your GPU, as the clock speed shown by the sensors varies at the beginning). I did not detect any thermal throttling.

I've looked at the theoretical performance delta between two hardware setups and the actual performance delta (averaged over all simulations).

The performance of the three 1080 Ti cards is consistent across the board (different manufacturers and different CPUs).

Comparing the 1080 Ti and the 1060, the theoretical and actual deltas differ (the 1060 performs 12% better than expected).
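
To make the comparison concrete, here is a small sketch of the calculation. The CUDA core counts are the published single-precision figures (3584 for the 1080 Ti, 1280 for the 6 GB 1060); the clocks and runtimes below are placeholders to be replaced with your own GPU-Z readings and measured times.

```python
# Sketch of the theoretical-vs-actual delta. Core counts are published
# FP32 figures; clocks and runtimes are placeholders, not my data.
CORES = {"GTX 1080 Ti": 3584, "GTX 1060 6GB": 1280}

def perf(card: str, clock_mhz: float) -> float:
    """The 'perf' metric: number of CUDA cores times clock speed."""
    return CORES[card] * clock_mhz

perf_fast = perf("GTX 1080 Ti", 1900.0)    # placeholder under-load clock
perf_slow = perf("GTX 1060 6GB", 1850.0)   # placeholder under-load clock
runtime_fast, runtime_slow = 100.0, 260.0  # placeholder runtimes (s)

theoretical = perf_fast / perf_slow
actual = runtime_slow / runtime_fast  # faster card => shorter runtime
print(f"theoretical delta x{theoretical:.2f}, actual delta x{actual:.2f}")
print(f"slower card vs expectation: {theoretical / actual - 1:+.0%}")
```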

Feel free to crunch my numbers differently and discuss.

It would be very beneficial if other people could run the same test, especially with the RTX series (2000 and 3000...).

Kind regards.

Comments

  • Awesome!

    I hope to get results for my weaker GPU over the weekend to compare with. If anyone has access to the 2000 and 3000 series and wants to share their results, that would also be quite nice. For now, don't worry too much about the exact version; just make sure it is at least version 5.

    For now I think we should keep results here in this thread, and later on perhaps migrate them to a GitHub repository. Unfortunately I am quite busy at the moment.

    Kind regards

  • What is the perf row in the spreadsheet?

  • Hi @jonnilehtiranta,

    Perf stands for performance: it is the product of the number of CUDA cores and the clock speed, as in the sketch in the opening post.

  • Are you running with double precision or single precision? Some people are not aware that the number of cores quoted for each card is actually the number of single-precision cores. The number of double-precision cores, and hence double-precision performance, is dramatically lower on consumer-grade (GTX, RTX) cards. This article explains why: https://arrayfire.com/explaining-fp64-performance-on-gpus/

    That's why I bought a Titan Black, which still has a competitive double-precision price/performance ratio when set to double-precision mode (see the sketch at the end of this comment).

    The article also points out that the double-precision performance of consumer-grade AMD graphics cards is dramatically better than Nvidia's. Any chance of DualSPHysics being ported to AMD graphics cards? This would be a popular move, as getting high double-precision performance out of Nvidia costs an arm and a leg: not only are their professional cards much more expensive, but they also lack graphical output, forcing you to use a motherboard that can handle two graphics devices simultaneously, with two separate PCIe channels to maintain the data bandwidth to the card doing the heavy processing. Unless you buy a Titan Black.
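
    To make the FP64 gap concrete, here is a back-of-the-envelope sketch. The core counts, approximate boost clocks and FP64:FP32 ratios are the commonly quoted spec figures (1/32 for consumer Pascal, 1/3 for the Titan Black with double-precision mode enabled), not measurements, so treat the output as ballpark only.

    ```python
    # Back-of-the-envelope FP32/FP64 throughput from quoted spec figures.
    cards = {
        # name: (FP32 CUDA cores, approx boost clock GHz, FP64:FP32 ratio)
        "GTX 1080 Ti": (3584, 1.58, 1 / 32),
        "GTX 1060 6GB": (1280, 1.71, 1 / 32),
        "Titan Black (DP mode)": (2880, 0.89, 1 / 3),
    }

    for name, (cores, ghz, fp64_ratio) in cards.items():
        fp32_tflops = 2 * cores * ghz / 1000  # 2 FLOPs per FMA per cycle
        fp64_tflops = fp32_tflops * fp64_ratio
        print(f"{name:22s} FP32 ~{fp32_tflops:5.1f} TFLOPS, "
              f"FP64 ~{fp64_tflops:4.2f} TFLOPS")
    ```

    Despite being two generations older, the Titan Black in double-precision mode comes out several times ahead of a 1080 Ti in raw FP64 throughput.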

    @MikeHersee As far as I understand, since DualSPHysics 5.0, positions are always stored in double precision, while velocity and density are always in single precision (https://forums.dual.sphysics.org/discussion/1812/). I wondered what the compute-time penalty of this choice is, but in the end I did not test it (also because you'd need to run the older DSPH 4.4 for comparison, and DSPH is refactored in places from time to time, so the comparison would not be very sound).

    Long story short, my intuition is that the double-precision units are now always in play: their throughput is a bottleneck to consider when buying a GPU card.

    Reinforcing your message: for example, the conceptual map of a Pascal streaming multiprocessor below shows the single-precision cores in light green and the double-precision cores in yellow. They are distinct processing units, and there are half as many of them; no surprise, then, that the double-precision compute throughput is half the single-precision one. (Note that this 1:2 ratio holds for the Tesla-class GP100 chip; consumer Pascal cards such as the GTX 10-series carry far fewer FP64 units, around a 1:32 ratio.)


  • @MikeHersee @sph_tudelft_nl

    Thanks to both of you for your input! I started this thread for the exact reasons you mention: it is not easy to predict GPU performance for DSPH (with its mix of single and double precision).

    The Titan Black seems like a nice deal, but it is very hard to find nowadays.

    Would you be willing to run the standard DamBreak example (it is fast) on v5.0 and share your GPU model and results, so we can shed some light on this subject and help people make sound investments when starting to use DSPH?

    Thanks

  • @jmdalonso can include here some figures on the performance of the different GPUs he has been testing lately.

  • @TPouzol It is a very relevant topic, and it is even part of broader assessments done for an "audit" of DualSPHysics. I have already made a note of your wish and am glad to contribute; please bear with me while I cross other items off my to-do list.

    In any analysis, consider that the dam break consists, roughly, of two stages.

    In the first stage, the water column collapses, hits the wall and the jet rebounds: the flow is compact and essentially irrotational, so the time step stays large and, perhaps just as importantly, finding neighbours is fast.

    In the second stage, there is spraying, splashing and mixing; droplets accelerate under gravity, time steps go down, and, again perhaps just as importantly, finding neighbours is more demanding.

    So the computation effectively runs at two speeds. This is relevant if you compare performance measured with one test case rather than another.

    Even a difference in tank design between two dam breaks will influence this two-stage partition; how much remains to be seen. For example, the dam break used to test the modified boundary conditions is a different one (namely the SPHERIC benchmark, https://www.spheric-sph.org/tests/test-2). The one used earlier as a flagship test case is yet another experiment, for which only the measured speed of the leading edge was available, if I am not mistaken.

    For the sake of SPHERIC standardisation, I would suggest running these performance tests with the SPHERIC benchmark dam break (and applying the old dynamic boundary conditions, which seem to be a stable, non-beta modelling feature).

  • @TPouzol Late addition to the above.

    I failed to mention that there is a third, final stage, in which the water settles down, which is slow.

    If the SPHERIC dam-break standard gains traction as a performance benchmark, note that it is a longish simulation (some 6 seconds of physical time). Besides using the dynamic boundary conditions, for economy of time I would suggest simulating only the first 2 seconds, where the action is (a sketch of how to cap the simulated time follows below). I imagine this should be sufficient to exercise all parts of the code: the water hits the front face of the container around 0.4 s, the end wall around 0.6 s and the lee face of the container around 1.1 s.
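
    If it helps, capping the simulated time is a one-line change in the case definition. A minimal sketch, assuming the standard TimeMax parameter key used in the DualSPHysics example XMLs (adjust the file name to your case):

    ```python
    # Sketch: limit the simulation to the first 2 s of physical time.
    # Assumes the standard <parameter key="TimeMax" .../> entry.
    import xml.etree.ElementTree as ET

    tree = ET.parse("CaseDambreak_Def.xml")  # your case definition file
    tree.getroot().find(".//parameter[@key='TimeMax']").set("value", "2.0")
    tree.write("CaseDambreak_Def.xml")
    ```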

    You can see a simulation of those 2 seconds under those conditions here: https://youtu.be/xJxMPtGogB0. The default playback speed is 5x; you can slow it down further with the playback controls if you like.

  • @sph_tudelft_nl

    Again, thanks for your very valuable contribution. I also think that testing for performance is not that easy, since there are many factors to take into account. I'm open to any suggestion regarding the test case (even an imperfect one). The standard dam break from SPHERIC is a good idea! My main concern is to make sure that anyone can easily launch a performance test; this would help gather more data points and reduce the risk of people testing different configurations without even noticing.

    I think we should provide an all-in-one folder containing the test case, which people can download, run on their hardware and then share the results. A little bit like the examples in DSPH (maybe it could even be added by the devs to the main download, if it is interesting for them too?).
