Background
It has been a long journey.
I remember the first time I ever used a dual processor system. I fondly recall it as if it were a religious experience. It was a dual Socket 370 system running two hacked-up, overclocked Celeron 300A processors, as best I can recall. I never went back to single-core computing.
The software responsiveness and multitasking ability were truly breathtaking. I’d originally built the system not really knowing what to expect in terms of how it would perform. I was more enamored with the idea of building a system with not one but two CPUs – chips that were never meant to be overclocked that far, let alone run in a dual-processor config. It worked – but just how well it worked floored me.
That’s been my experience with AMD Threadripper, and with EPYC CPUs in server contexts, though with one contrast: that’s how those CPUs were designed – to be awesome, but not too expensive.
Instead of the move to two sockets, it’s the move to multiple CPU dies in one package, on a single socket. Chiplets, as they have come to be called.
Threadripper offers up to 64 PCIe lanes; EPYC offers up to 128 PCIe lanes. Both offer substantial memory capacity, expandability and a lot of bells and whistles that, previously, were either unattainable or so expensive that the cost/benefit calculation was way off.
Like those budget Celeron processors, these Threadripper CPUs are punching way above their price class.
It’s past January 1, 2019. I was at Computex in June of 2018, where Intel demonstrated a Socket 3647 enthusiast-class CPU with 28 cores running at 5 GHz – an “i7” or “i9” version of their flagship Xeon processor, but with much higher clocks than its Xeon counterpart. At that event, Intel promised a December 2018 launch, but December has come and gone.
AMD had also just announced the Threadripper 2990WX – a desktop CPU with 32 cores and 64 threads – that handily outperformed the existing 28-core Intel Xeon Platinum 8180. The Xeon CPU cost about $10,000 US while the Threadripper 2990WX debuted for just $1,700. Expensive for a CPU, but a steal in that cost/benefit calculation.
There was just one fly in the ointment – performance regressions for some programs.
Some performance quirks were attributable to the core-to-memory configuration of Threadripper. The CPU can operate in Uniform Memory Access (UMA) mode, but that means higher peak memory latencies, which some applications are sensitive to. The default seems to be NUMA – Non-Uniform Memory Access.
As media outlets (including Level1) began to test the CPU, it showed odd performance quirks in certain workloads – well beyond what could be explained by the slightly higher latency of one memory mode versus the other.
In other workloads, however, that 32-core monster would just absolutely shred.
A lot of people started to compare the 16-core Threadripper 2950X to the 32-core Threadripper 2990WX and found that the performance delta wasn’t enough to justify the $800ish price difference between these two CPUs.
Some applications regressed – notably Adobe Premiere, Indigo Renderer, Blender, 7zip and others.
A lot of games also had really weird performance anomalies on Nvidia graphics cards. But it turned out that Nvidia’s video card drivers just weren’t ready for 64 threads. Once identified, that problem was fixed with a software patch within a few weeks.
Issues with other applications were patched, too, but other issues have lingered. One such lingering issue is with Indigo, another 3D rendering program.
When we did our testing, it certainly looked like the memory configuration could be the culprit, but that explanation didn’t sit well with me. It always bugged me, and it was something I tinkered with on and off (until now, at least).
Our Approach
The design that AMD has adopted with the 2990WX means that there are four active 8-core/16-thread dies in one physical processor package, in one socket. Each die supports up to two channels of memory, but on Threadripper only four memory channels are wired up on current motherboards.
That means two dies (sixteen cores, total) in the Threadripper 2990WX package must get information from memory physically attached to the other two dies. This happens over Infinity Fabric and adds a bit of latency.
For EPYC server CPUs, however, each die is wired to 2 or 4 DIMM slots, but always in a dual-channel configuration. Four dies times two channels equals 8 channels of memory.
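(For the curious: Windows exposes this topology through its documented NUMA APIs. Here is a minimal sketch, not part of our tooling, that prints each node and its processor mask; on a 2990WX in NUMA mode you should see four nodes.)

```c
/* Minimal sketch: enumerate NUMA nodes and their processor masks via
 * documented Win32 calls. On a 2990WX in NUMA mode, four nodes show up,
 * even though only two of the dies have memory attached. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    ULONG highest = 0;
    if (!GetNumaHighestNodeNumber(&highest))
        return 1;

    for (USHORT node = 0; node <= (USHORT)highest; node++) {
        GROUP_AFFINITY ga;
        if (GetNumaNodeProcessorMaskEx(node, &ga))
            printf("node %u: group %u, cpu mask 0x%016llX\n",
                   (unsigned)node, (unsigned)ga.Group,
                   (unsigned long long)ga.Mask);
    }
    return 0;
}
```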
That’s our test system with the Gigabyte MZ01-CE0 – a fully populated 8-channel system.
On and off, as I continued to have new ideas, I would do more testing. This testing led to even more results that didn’t make sense.
Finally, I resolved to get an EPYC 7551 – a 32-core/64-thread monster just like its 2990WX cousin, but with 8 memory channels instead of 4.
If PC World and others were right about the memory bandwidth, these apps would immediately perform great. Much to my surprise, when I tested Indigo I got a score much lower than I expected – on par with the low, regressed scores.
| Indigo Bedroom Scene | Linux | Windows |
| --- | --- | --- |
| Threadripper 2990WX, NUMA, Stock+XMP | ~3.5 | ~1.5 |
| Threadripper 2990WX, NUMA, No SMT | ~2.9 | ~1.0 |
| EPYC 7551, NUMA | ~3.0 | ~1.3 |
| EPYC 7551, UMA | ~3.0 | ~3.0 |
| Threadripper 2990WX, NUMA, Indigo in WSL | ~1.55 | |
| Threadripper 2950X, NUMA, Stock+XMP | ~2.5 | ~2.3 |
| Threadripper 2990WX 7zip, Compression | ~90,000 MIPS | ~41,000 MIPS |
| Threadripper 2990WX 7zip, Compression (with Coreprio fix from this article) | - | ~70,000 MIPS |
The Indigo benchmark is a rendering program, like Cinebench. Cinebench scales very well to more cores and basically represents a best-case scenario. It doesn’t make sense that Cinebench would scale well while Indigo would not.
Programmers and computer scientists have not really caught up to the many-cores-on-a-desktop revolution that is happening, and many pieces of desktop software are not yet implemented in such a way as to scale their workloads well across more than 4 to 8 CPU cores.
What was baffling was that, often, Indigo would perform worse on the 32-core 2990WX than on the 16-core 2950X – and by a significant margin.
When rendering the bedroom scene with Indigo, a well-tuned 2990WX with fast memory will score about 3.5. The EPYC 7551 will score about 3.0 (mainly owing to its lower clock speed). At least on Linux, those are the scores you can expect. (Note that in our video, some of the scores are a touch lower owing to OBS recording the screen.)
On Windows, something else entirely has been happening.
The score is typically 1.0 to 1.5 on both of these CPUs (the EPYC always a bit slower because of its clock speed) – even on the EPYC 7551 with a fully populated 8-channel configuration. Obviously, memory bandwidth isn’t the issue in these cases. And obviously hardware alone isn’t the issue either, because it’s not happening on Linux.
I discovered that removing a single logical CPU from the set the program is allowed to run on (its CPU affinity) via the Windows Task Manager would improve performance to levels similar to what I was seeing on Linux. Even more baffling was the fact that adding the CPU back (resetting the affinity to what it was) would also cure the performance regression in most cases.
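If you’d rather script that trick than click through Task Manager, the same round-trip can be done with the Win32 affinity calls. A minimal sketch, with a placeholder PID and abbreviated error handling:

```c
/* Sketch of the Task Manager affinity trick: drop one CPU from the
 * process affinity mask, then restore the original mask. */
#include <windows.h>

int main(void)
{
    DWORD pid = 9156; /* hypothetical PID of the Indigo process */
    HANDLE h = OpenProcess(PROCESS_QUERY_INFORMATION | PROCESS_SET_INFORMATION,
                           FALSE, pid);
    if (!h)
        return 1;

    DWORD_PTR procMask = 0, sysMask = 0;
    GetProcessAffinityMask(h, &procMask, &sysMask);

    /* Remove one logical CPU (here, CPU 0) from the mask... */
    SetProcessAffinityMask(h, procMask & ~((DWORD_PTR)1));
    Sleep(1000);
    /* ...then put it back. This round-trip is where performance
     * recovered in our testing. */
    SetProcessAffinityMask(h, procMask);

    CloseHandle(h);
    return 0;
}
```

Note that a DWORD_PTR mask covers one 64-CPU processor group, which happens to be exactly the size of a 2990WX with SMT enabled.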
With this one weird trick, Indigo on Windows would come close to, or match, the Linux performance.
Another reason for purchasing the EPYC 7551 was that I could enable UMA mode – Uniform Memory Access. The fact is that latencies between dies on a carrier in a single socket are much, much lower than the socket-to-socket latencies that the load-balancing algorithms in most operating systems were designed around when shuffling processes between cores. (This isn’t possible on a 2990WX because of the aforementioned asymmetry between the dies and the memory connected to them.)
Lo and behold, UMA mode on our single-socket EPYC 7551 test system does not exhibit any of the performance anomalies we’ve been seeing with Indigo, 7zip and other apps.
So what’s going on here?
Before I bought the EPYC system, I had experimented with the Windows Subsystem for Linux (WSL). This is a great feature of Windows 10 that lets you run Linux apps under Windows. As Indigo is a cross-platform rendering program, I resolved to run it under WSL. One technical detail about WSL: it’s really just a bridge that lets GNU utilities talk to the Windows kernel. The Linux kernel isn’t actually there; thread management and scheduling are handled by the Windows kernel.
On bare metal, on the 2990WX, with Linux, I could score 3.5 in the bedroom scene in Indigo. On the WSL instance of Indigo, I got roughly the same crappy performance as Windows in NUMA mode.
This effectively rules out operating system libraries, or something about the application itself, as the cause of the regression (with the possible exception that it’s just using older methods of signaling the operating system about what it’s going to do with its many rendering threads). Since WSL does not pass any information about CPU topology to processes, the kernel must assume responsibility for properly handling a multithreaded workload. Demonstrably, it is not being handled properly.
The other curiosity is that whether Indigo is performing badly, or performing well thanks to my affinity hack, the CPU utilization across all cores and threads looks the same. An analysis of performance counters – memory wait states, I/O wait states, reads/writes per second and many other variables – yielded no real usable insights. In both the “slow” and “fast” cases the CPU really does seem to be busy 100% of the time, but it is obviously not doing the same work.
Finally, I reached out to a friend – Jeremy at Bitsum. He’s done a lot of work on Process Lasso, and I myself am not really a “Windows guy.” Clearly Windows was doing something wrong, but what? And what did it have to do with affinity? “What has to do with affinity but isn’t affinity?” we wondered.
As part of my earlier work, I had reached out to Dr. Ian Cutress of Anandtech to help me crack the Core 0 mystery. His testing was comprehensive, but the results only left me with more questions.
Jeremy and I resolved to gather more data. Clearly something other than the affinity itself was changing when the CPU affinity was changed.
Many had already experimented with the CLI “start” utility, which – by design – sets the NUMA node and CPU affinity before a process is created. But processes started this way didn’t behave the same as what I observed when changing the affinity via Task Manager.
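For reference, that looks something like `start /NODE 0 /AFFINITY 0xFF IndigoBenchmark.exe` (the executable name here is just a placeholder): the process is created with its preferred NUMA node and affinity mask already set, rather than having them changed after the fact.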
Once we’d decided to dump all the information about the Windows processes from the internal Windows structures, the answer became obvious.
Indigo Benchmark.exe PID: 0x000023C4 TID: 0x00002684 ideal_cpu: g0 #3 cpu_mask: FFFFFFFFFFFFFFFF
The Windows internal thread management software tags the threads spawned by Indigo with an “ideal CPU” tag.
Here's the full dump from a "good" run -- and here's the link for the "bad" run.
[ This is a new feature – just run bitd.exe -p <processname> – to help you troubleshoot what might be happening. ]
The CPU_MASK setting follows whatever affinity you specify with the CLI utility “start” (or whatever you change via Task Manager). The ideal_cpu setting, however, will only recommend CPUs from one NUMA node when using the “start” CLI. When setting the affinity via Task Manager, the ideal_cpu is chosen from any NUMA node, not just one.
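You can inspect the ideal_cpu tag yourself through the documented Win32 call for it. A minimal sketch (placeholder PID, abbreviated error handling) that mirrors the “g0 #3” group-and-index notation from the dump above:

```c
/* Sketch: print each thread's "ideal CPU" (processor group + index)
 * for one process, using Toolhelp32 thread enumeration. */
#include <windows.h>
#include <tlhelp32.h>
#include <stdio.h>

int main(void)
{
    DWORD pid = 9156; /* hypothetical PID of the Indigo process */
    HANDLE snap = CreateToolhelp32Snapshot(TH32CS_SNAPTHREAD, 0);
    if (snap == INVALID_HANDLE_VALUE)
        return 1;

    THREADENTRY32 te;
    te.dwSize = sizeof(te);
    for (BOOL ok = Thread32First(snap, &te); ok; ok = Thread32Next(snap, &te)) {
        if (te.th32OwnerProcessID != pid)
            continue;
        HANDLE th = OpenThread(THREAD_QUERY_INFORMATION, FALSE, te.th32ThreadID);
        if (!th)
            continue;
        PROCESSOR_NUMBER pn;
        if (GetThreadIdealProcessorEx(th, &pn))
            printf("TID 0x%08lX ideal_cpu: g%u #%u\n",
                   te.th32ThreadID, (unsigned)pn.Group, (unsigned)pn.Number);
        CloseHandle(th);
    }
    CloseHandle(snap);
    return 0;
}
```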
When only one NUMA node is recommended via the “ideal CPU” tag, the Windows kernel seems to spend half the available CPU time just shuffling threads between cores. That explains the high-CPU-utilization-but-nothing-gets-done aspect of the low performance. It also means it’s a bit tricky to spot apps/threads that are flailing about this way.
Here’s an interesting twist: if you only have one OTHER NUMA node, Windows seems to fall back to allowing the threads to establish themselves on the second NUMA node (the ideal CPU tag is ignored, basically).
This is most likely related to a bugfix from Microsoft for 1- and 2-socket Extreme Core Count (XCC) Xeons, wherein a physical Xeon CPU has two NUMA nodes. In the past (with Xeon v4, and maybe v3), one of those NUMA nodes had no access to I/O devices (but did have access to memory through the ring bus).
If that’s true, then that work-around, meant to make sure this type of process stays on the “ideal CPU” in the same socket, has no idea what to do when there is more than one other NUMA node in the same package to “fail over” to.
In the case of the Threadripper 2990WX, there are three other NUMA nodes in the socket.
As such, that algorithm seems to just aimlessly shuffle threads, and that is one plausible explanation for why the Indigo performance is so much worse on the 2990WX than on the 16-core 2950X.
Jeremy and I put a lot of work into arriving at this conclusion. He has modified the Coreprio utility to look for this specific defect – we’ve termed it “NUMA Disassociation” – and to reset the wrong “ideal_cpu” setting.
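Coreprio’s actual implementation is Bitsum’s, but conceptually, re-seating a thread’s ideal CPU comes down to the documented Win32 call for it. A sketch (the helper name and the group/CPU values are hypothetical):

```c
/* Conceptual sketch only: override a thread's "ideal CPU" tag.
 * Coreprio's real logic for choosing the new value is Bitsum's. */
#include <windows.h>

/* Hypothetical helper: assign the given processor group + index as the
 * thread's ideal CPU. The handle needs THREAD_SET_INFORMATION access. */
static BOOL reset_ideal_cpu(HANDLE thread, WORD group, BYTE cpu)
{
    PROCESSOR_NUMBER want = {0};
    PROCESSOR_NUMBER prev;
    want.Group = group;
    want.Number = cpu;
    return SetThreadIdealProcessorEx(thread, &want, &prev);
}
```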
You can download it for free to try it out here:
https://bitsum.com/portfolio/coreprio/
With this work-around in place we can nearly double the performance of Indigo on the 2990WX, as reported by Anandtech, The Tech Report, and ourselves.
The rumors of a memory bandwidth problem, even with 32 cores (at least in these instances), have been greatly exaggerated.
Note: None of us truly works in a vacuum -- I owe a huge thanks to Jeremy @ Bitsum, Ian @ Anandtech, GIGABYTE and many others in the tech community. I also owe a lot of thanks to our Patrons! And our readers and supporters! So THANK YOU! I couldn't do this kind of work without the support of the community. Let's make more awesome stuff happen every chance we get. Woot, and whatnot. We've still got some room for improvement, but I would guess further fruit is not as relatively low-hanging as this has been. So thanks to my friends, colleagues, and you all!