NVIDIA, DirectX 12, and Asynchronous Compute: Don’t Panic Yet

Ashes of the Singularity: the game to bring NVIDIA cards to their knees?

Monday was a terrifying day to browse the web as the owner of an NVIDIA graphics card. News hit early this week that the company’s latest series of Maxwell GPUs, the GTX 900-series, could have a design flaw that compromises performance compared to AMD graphics cards when performing asynchronous compute in DirectX 12.

In short: A few weeks ago, Oxide Games released a benchmark demo of an upcoming game called Ashes of the Singularity, the first public benchmark for DirectX 12, the newest version of Microsoft’s popular gaming API. Many Ashes benchmark reviews found that while NVIDIA graphics cards ran the game quite well with DirectX 11, AMD cards showed an enormous performance jump when upgrading to DX 12. NVIDIA cards, on the other hand, showed no performance improvements with DX 12, and in some cases actually took a slight hit compared to running the game with DX 11.

The Ashes benchmark resulted in a great deal of debate and speculation online over the past few weeks. An early rumor that was quickly repeated was that NVIDIA’s current generation of GTX cards does not support asynchronous compute. AMD’s current line of graphics cards, however, does support asynchronous compute/shading.

While DX 11 did not allow for asynchronous computing/shading, DX 12 does. Hence, with the asynchronous shading potential “unlocked” under DX 12, AMD cards can see significant performance boosts, while NVIDIA cards may suffer when attempting the same thing.
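To make this more concrete, here is a minimal sketch of how DX 12 exposes the feature, assuming an already-initialized ID3D12Device (the function and variable names are our own, not from Ashes or any vendor sample). DX 12 lets an engine create a dedicated compute queue alongside the usual graphics (“direct”) queue; whether the two workloads actually overlap on the GPU is then up to the hardware and driver, which is exactly what the controversy is about.

    // Minimal sketch: creating separate graphics and compute queues in DX 12.
    // Submitting compute command lists to their own queue is what opens the
    // door to asynchronous compute; it does not guarantee the GPU overlaps it.
    #include <d3d12.h>
    #include <wrl/client.h>
    using Microsoft::WRL::ComPtr;

    void CreateGraphicsAndComputeQueues(ID3D12Device* device,
                                        ComPtr<ID3D12CommandQueue>& graphicsQueue,
                                        ComPtr<ID3D12CommandQueue>& computeQueue)
    {
        // The "direct" queue accepts both graphics and compute command lists.
        D3D12_COMMAND_QUEUE_DESC gfxDesc = {};
        gfxDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;
        device->CreateCommandQueue(&gfxDesc, IID_PPV_ARGS(&graphicsQueue));

        // A separate compute-only queue; work submitted here may run
        // concurrently with rendering if the hardware and driver allow it.
        D3D12_COMMAND_QUEUE_DESC computeDesc = {};
        computeDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
        device->CreateCommandQueue(&computeDesc, IID_PPV_ARGS(&computeQueue));
    }

Synchronization between the two queues is then handled with fences (ID3D12Fence), which is where most of the real-world engine work lives.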

Reddit user SilverforceG wrote up a nice overview of the controversy on r/pcgaming, and even included a simple “explain it like I’m 5” summary.

In theory, GPUs that support asynchronous shading (AMD) should see significant performance gains in DX 12 when dealing with dynamic lighting, shadows and global illumination in games. GPUs that do not support asynchronous shading would not.

The news has prompted many new 900-series owners to lament their purchases and, in some cases, contact retailers to ask about a refund.

But is this really a death sentence for NVIDIA cards? Should you toss out your brand new GTX 980 Ti and replace it with an R9 290 from a garage sale? Maybe not quite yet.


[edit: This has changed a lot since the original article, thanks to some helpful comments below pointing out my initial misunderstanding of the tool]

A user in the Beyond3D forum created a little tool to test latency of different cards while performing graphics and compute operations. You can see results from the tool here: http://nubleh.github.io/async/scatter.html#6

Fury X

Above is a graph of Fury X results. The blue line shows the amount of time taken for the pure graphics part of the work to be done. The red line shows the amount of time for the pure compute part of the work to be done. The green line shows the total time for both workloads combined. Since the green line is not simply the red time plus the blue time, this indicates that asynchronous compute is working, because some of the compute work can be done at the same time as the graphics work.

980Ti

The 980Ti graph, however, shows something different: the green line *is* the sum of the red and blue lines. This means that for some reason the 980Ti isn’t able to do the compute work at the same time as the graphics work. It would appear that asynchronous compute isn’t working as advertised on NVIDIA cards. (The “steps” aren’t important for the question of whether asynchronous compute is working or not; the important part is whether or not green = red + blue.)
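To make the green = red + blue reasoning concrete, here is a toy sketch (our own illustration, not part of the Beyond3D tool) that applies the same check to a set of measured times: if the combined run is noticeably shorter than graphics time plus compute time, some overlap happened; if it is roughly equal, the two workloads were serialized.

    // Toy check of the "green vs. red + blue" reasoning. The numbers in main()
    // are made up for illustration; real values would come from the tool.
    #include <cstdio>

    bool looksOverlapped(double graphicsMs, double computeMs, double combinedMs)
    {
        // Allow 5% slack for measurement noise before declaring overlap.
        const double serializedMs = graphicsMs + computeMs;
        return combinedMs < 0.95 * serializedMs;
    }

    int main()
    {
        // Hypothetical Fury X-like result: combined time well under the sum.
        std::printf("Card A: %s\n",
                    looksOverlapped(10.0, 40.0, 41.0) ? "overlap (async working)"
                                                      : "serialized (no async)");

        // Hypothetical 980 Ti-like result: combined time equals the sum.
        std::printf("Card B: %s\n",
                    looksOverlapped(10.0, 20.0, 30.0) ? "overlap (async working)"
                                                      : "serialized (no async)");
        return 0;
    }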


What does this mean for current NVIDIA and AMD cards?

AMD graphics cards have an advantage in at least one DX 12 game, Ashes of the Singularity. They may have more advantages in other DX12 games. Completely writing off NVIDIA, however, is just silly.

NVIDIA GPUs will continue to perform well in DX 12. Ashes is one game that makes significant use of asynchronous shading, but we’ve yet to see any other real-world DX 12 benchmarks. We don’t know how well other upcoming games will use asynchronous shading, if they use it at all.

AMD’s Mantle API (and the Vulkan API derived from it) already uses some of the features coming in DX 12, and has supported asynchronous shading for some time. While a couple of benchmarks show enormous performance gains using Mantle over DX 11 (60%+ in some extreme cases), most real-world performance benefits on balanced gaming PCs are more in the 5-10% range. Very nice, but not earth-shattering.

This controversy reminds us of a similar sky-is-falling event experienced by AMD owners a few years ago: when NVIDIA acquired PhysX and made GPU-accelerated physics an NVIDIA-exclusive feature, gamers went nuts, and AMD owners felt like they had drawn the short stick in the GPU wars. In retrospect, of course, we know that AMD owners never really suffered much by missing out on PhysX. And how often do you see PhysX hyped today?

We are not here to root for NVIDIA over AMD, or vice-versa. In fact, it would be nice to see AMD catch up on their lagging GPU sales, as we don’t want to see either company achieve a monopoly in the graphics card space. We just want to deal with the facts, not the hype.

In the long run, developers will no doubt make more use of asynchronous shading; they already are on consoles. But it will take years for that to make its way into PC games in any big way, because developers need to learn to use the new features and will still want to support older PC hardware.

Our PC hardware recommendations will continue to be based on what works well now, and what will likely work well in the future. That includes considerations related to real-world gaming performance, acoustics, thermals, reliability and build quality. Sacrificing that viewpoint based on speculation about what may or may not happen in the future would be irresponsible.

  • If I missed anything, please let me know!

    • Kwee

      It would be better to delete this article because all of what you are saying is wrong. Async Compute/Shader is not working on Maxwell, because Async Compute means graphics + compute running together, not added render time. On Nvidia cards this is not the case. Nvidia uses a software scheduler in Mixed Mode. And don’t call me a fanboy, I’ve got a GTX 960.

  • Torque

    Good article. Keep us updated on this one!

  • Chad Thundercock

    10/10 article. Keep up the good work.

  • For anyone concerned about NVIDIA bias, James wrote this on a MacBook, and I edited it on my PC with an R9 290X that I bought myself. No NVIDIA involvement of any kind, just a desire to keep things sane and logical!

    • Huy

      This article needs a revision, from one of the most respected b3d members:

      https://forum.beyond3d.com/posts/1869776/

      “This benchmark is valid for testing async compute latency. This is important for some GPGPU applications.
      It is important to notice that this benchmark doesn’t even try to measure async compute performance (GPU throughput).”

      The reason is covered here: https://forum.beyond3d.com/posts/1869700/

    • MancVandaL

      “sane and logical!” Good luck with that. PC Gamers aren’t known for any of this.

      • Jammy

        Speak for yourself, sweetness.

        Then youtube the console tournaments. Check out the sportsmanship (or should I also say sportswomanship, because PC gamers are PC -_-‘ ).
        xD can’t stop laughing making this comment.
        Also, sent from my MacBook Bro :p

  • Huy

    There seems to be a failure to understand what Async Compute is or isn’t. The program at b3d is not meant to be a tool to compare performance. Its purpose is to test whether Async Compute (parallel execution of graphics & compute) is functional or not, not how good the architecture is at it.

    This is from Sebbi, a very well known programmer at b3d, explaining why it’s wrong to use it as a performance metric, period:
    https://forum.beyond3d.com/posts/1869700/

    “Benchmarking thread groups that are under 256 threads on GCN is not going to lead into any meaningful results, as you would (almost) never use smaller thread groups in real (optimized) applications. I would suspect a performance bug if a kernel thread count doesn’t belong to {256, 384, 512}. Single lane thread groups result in less than 1% of meaningful work on GCN. Why would you run code like this on a GPU (instead of using the CPU)? Not a good test case at all. No GPU is optimized for this case.”

    This is the likely reason for Maxwell’s “Async Compute Support” and also the implications:
    https://www.reddit.com/r/pcgaming/comments/3jfgs9/maxwell_does_support_async_compute_but_with_a/

    • Matthew Zehner

      Thanks for letting us know! That’s a lot of helpful information.

      • Huy

        There continue to be people who misrepresent the purpose of the program at b3d; they are using it as a benchmark tool and spreading even more confusion.

        It’s quite clear the creator made it to test function, not performance. It’s also clear from other programmers there that it cannot be used as a performance comparison, for one simple reason: a 50ms compute latency on GCN would mean that games with any usage of compute could not go beyond 20 fps (1000ms/50ms). Since there have been many games with DirectCompute that run much faster than that, it’s clear that GCN doesn’t have a 50ms latency for compute. It’s just that this software is not optimized to test performance.

        • You’re right, I didn’t take the time to understand the tool well enough. I’ll re-write that part.

    • Langebein

      Good that someone is pointing this out. That chart has misled a lot of people, I think.

      It shows NVidia doing things a lot faster, and then it becomes hard to explain that even the biggest queue in the benchmark (128 commands) is tiny compared to a real-world compute/shader program, and that the result is completely dominated by other factors.

      There is a bit of that in this article too, like “Again, using CPUs as an example, there’s a reason that Intel’s mainstream CPUs are still only 2-4 cores”. Yes there is, but GPUs have a four-digit number of cores; why is that? Real-world fragment shaders do not have the concurrency issues that most general programming problems often have. If they do, and cannot be scaled up to the 1000+ compute cores that GPUs have (Fury X has like 4096 now?), then they simply cannot be run in a real-time application.

  • Rene

    It seems that the person who wrote this didn’t get what asynchronous computing is, so let me explain: in the Beyond3D benchmark, what they test is whether the GPUs can do compute and graphics at the same time, not how many compute queues they handle or how fast the compute part is, as shown above. In conclusion, the Nvidia cards can perform the compute part better, but not at the same time as graphics, which AMD can, and AMD is faster once combined. Check and read everything!

  • Blockchains

    Async compute does appear to be functioning on AMD hardware, but the compute elements of the test are extremely slow for some reason, which is probably something specific to the architecture and to how the programmer who wrote the test was testing against a Kepler card. On the nVidia side, since the results are stacked (the async tests essentially result in compute + graphics without any overlap), this suggests the GPU completely switches context between the two and does not benefit at all from async compute.

    They still need to work on the test, of course, but the suggestion that async compute works on AMD hardware and does not on nVidia hardware seems to be correct. The bigger question here, however, is not whether nVidia is actually able to use it, but rather whether or not they’d significantly benefit from it relative to AMD. AMD seems to be the one that will benefit from it the most, as on paper their hardware is extremely fast, but in real-world scenarios there appears to be a bottleneck somewhere (beyond simply draw call efficiency). If that bottleneck can be alleviated through the currently unused ACE units (or rather, if the CUs can be better saturated), current users with AMD hardware may see a significant speedup in some upcoming DX12 games that use it appropriately.

    • Huy

      Indeed, the program is not to be used as a benchmark; one cannot say “oooh, nv graph, lower = better”. That wasn’t its design goal:
      https://forum.beyond3d.com/posts/1869700/

      • I did indeed get thrown off by the latency measurement and missed the actual point of the tool. I have rewritten that part. Thanks for letting me know!

  • Petrus Laine

    You are reading the results completely wrong; they are showing that async shaders are NOT working. The actual way to read the numbers is to see how the async time compares to graphics time + compute time. On nvidia, they are more or less equal, so async doesn’t save any time, which means it’s not working. On AMD, async hides the graphics time, sometimes mostly, sometimes completely, which shows async shaders are working as intended.

    • You are correct. I’ve updated the article. Thanks for pointing this out!

    • Abram Carroll

      Why would it improve time? Nvidia runs a tight hardware pipe. You can see slight improvements here and there. If there is idle time, they downclock the GPU so they can later upclock (boost). Nvidia already has tech to use idle time. AMD doesn’t. AMD is going to get more from it.

      Context switching is faster with ACEs, so this gives AMD an advantage with async in high draw call situations. AMD still takes a hit, though, and the draw calls are simply from lazy/sloppy programming.

  • Arton Haziri

    “We don’t know how well other upcoming games will use asynchronous shading, if they use it at all.” See how ignorant Nvidia fanboys really are. Thing is, you got burned for paying a fortune for a GPU that can’t do async, and you hope some driver is going to fix what you lack in hardware. DX12 is the future, and rest assured they are going to use async compute. I know it’s hard to admit defeat from AMD; just suck it up, sell your incapable card and get yourself a Radeon GPU.

    • Matthew Zehner

      I like my Nvidia GPU just fine, but it’s true that AMD makes some great ones as well! Thanks for your opinion.

  • Katana Man

    Awesome, my 290x lives on against a three times more expensive 980ti with 3 billion more transistors, LOL!!!!