Disable Global C-States (or how a machine ran for 5 years and suddenly started locking up)

tl;dr: If you have an AM4 processor (mine is a Ryzen 9 5950X) that suddenly starts locking up, disable “Global C-States” in your BIOS.

About two weeks ago, I woke up to a fairly broken network at my house. It didn’t take long to realize the problem was with my Proxmox server. Once I got to the console, I noticed it was frozen. I power cycled it and I was back.

I looked at the logs, but they just stopped dead. Nothing helpful like “hey, your disks are failing”. I made a note, but moved on with life.

I hoped this was just an anomaly, but I was not that lucky. A week or two later, it would happen again, but worse.

Same pattern, I woke up and everything was dead. Looked at the console and it was frozen. Again I power cycled the computer. This time it wouldn’t stay up. Proxmox was up for less than a minute before it locked up. I tried this one or two more times, same thing. I had somewhere to be that morning, so this would have to be continued.

When I got back to it, I started some basic troubleshooting. I opened up the box and re-seated all the RAM and PCIe cards. I ran memtest86+, which completed with no errors. The machine ran for nearly 2 hours without locking up. I booted back into Proxmox, and it locked up almost immediately.

[Image: passing memtest86+ run]

RAM passes, but that doesn’t really mean much here.

I tried some other operating systems, OpenBSD seemed to work. Ubuntu locked up. Gentoo seemed to run okay. The Gentoo LiveUSB stayed up long enough to read both my NVMes, just as a very simple stress test.
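A "read the whole disk" stress test like this could be sketched as below. This is only an illustration, not the command used at the time: the function name `read_whole_device` and the 4 MiB chunk size are made up for the example, and the device paths (for instance `/dev/nvme0n1`) are assumed to be passed on the command line by a user with read access to them.

```python
import sys
import time


def read_whole_device(path, chunk_size=4 * 1024 * 1024):
    """Read `path` from start to end and return the number of bytes read."""
    total = 0
    # buffering=0 gives an unbuffered binary file, so each read() is one OS read.
    with open(path, "rb", buffering=0) as dev:
        while True:
            chunk = dev.read(chunk_size)
            if not chunk:  # empty bytes object means end of device/file
                break
            total += len(chunk)
    return total


if __name__ == "__main__":
    # Hypothetical usage: python readtest.py /dev/nvme0n1 /dev/nvme1n1
    for path in sys.argv[1:]:
        start = time.monotonic()
        total = read_whole_device(path)
        elapsed = time.monotonic() - start
        print(f"{path}: read {total / 2**30:.1f} GiB in {elapsed:.0f} s")
```

A read-only pass like this never writes to the disks, so it is safe to run from a live USB, but it only shows that the machine stays up under sustained I/O. It says nothing about whether the data is intact.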

Back to the Proxmox install ISO… nope. It locks up. It seemed like the problem happens when the video mode switches. I borrowed another card from a friend, no luck. Somewhere in there, I flashed the BIOS to the latest “non-beta” version. Again, no luck.

At this point, the RAM seems good, the video card doesn’t seem to be the problem. Now I’m thinking disks.

I was using a striped ZFS pool (effectively RAID 0) across two NVMes. This is a bit risky, as there’s no redundancy. It’s been fine for about 5 years, but maybe my luck has run out. Somehow, the system stays up long enough for me to reinstall Proxmox using one NVMe and XFS. Nope, locks up again. Okay, let’s remove the NVMes and install to an SSD using XFS. Nope, locks up again after the installation (or maybe it didn’t even finish the installation).

So I no longer think it’s the disks. We’re basically down to the CPU or motherboard. Maybe the power supply? But I wouldn’t think a bad power supply would make it through nearly 2 hours of memtest86+.

I don’t think I’ve ever encountered a bad CPU, so I’m guessing motherboard. I can still get an AM4 motherboard, so I’ll pick one up tomorrow.

In the meantime, I throw Proxmox on a really old machine… so I can get a little bit of infrastructure back online. It’s nice to be able to spin up Proxmox in a pinch without having to worry about license activation.

I buy a new motherboard, but the chipset in my old motherboard is better. All I could get was a B550, where I had an X570.

Sensing this won’t be successful, I go ahead and buy parts to build a Threadripper. Hopefully the completely new box will work, and I can get all my infrastructure back online. If I can salvage the old machine, I’ll build a Proxmox cluster. I used to have a second node, but it got flaky, and with only two Proxmox nodes, losing either one costs you quorum and Proxmox is not happy.
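The quorum arithmetic behind the two-node problem is simple enough to sketch. This assumes the plain case of one vote per node and a strict-majority rule, with no extra quorum device. The helper names `votes_needed` and `has_quorum` are invented for the illustration and are not part of any Proxmox tool.

```python
def votes_needed(total_nodes):
    """Smallest number of votes that is a strict majority of the cluster."""
    return total_nodes // 2 + 1


def has_quorum(total_nodes, nodes_online):
    """True if the nodes still online hold a strict majority of the votes."""
    return nodes_online >= votes_needed(total_nodes)


if __name__ == "__main__":
    # Compare losing one node in a two-node and a three-node cluster.
    for total in (2, 3):
        survives = has_quorum(total, total - 1)
        print(f"{total} nodes, one down: quorum kept = {survives}")
```

With two nodes a majority is two votes, so a single failure takes the cluster below quorum. With three nodes a majority is still two, so one node can drop out.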

With a car load of parts, step one is to try the new AM4 motherboard and let memtest86+ run while I assemble the new computer. I swap out the AM4 motherboards and start memtest86+.

Assembling the Threadripper goes pretty well, except the heat sink installation is a bit tricky. As a bonus, the heatsink came with zero instructions. Anyway, it goes together. Proxmox installs. I can pull in my backups and spin up all my VMs. Crisis over.

Back to the AM4. memtest86+ completes and keeps running. So far so good.

I boot into Proxmox, and after about 9 seconds, it locks up. It might actually be worse. Could it be the CPU? I have a friend with an AM4. He offers to let me try parts from his build to try and isolate the problem.

He brings it over. I swap in his AM4 CPU. It won’t boot, and the power LED flashes quickly. I have ECC RAM and I don’t think all AM4s support that. Ok, I put his RAM in my box. It boots… and then hangs. At this point, I’ve replaced everything except the power supply and the case. Let’s try my friend’s power supply. Nope, it hangs. I’m feeling cursed at this point. Could it be the case? Sure. Let’s pull it out of the case. Nope. I return all my friend’s parts.

At this point, you just start entering keywords into search engines, clicking, scrolling, and trying things one at a time:

  • Flash the BIOS to the absolute latest version, even though it’s a “beta”. Nope.

  • Make a new USB stick, even though this one installed the Threadripper. Nope.

  • Try successively older Proxmox versions. Nope.

At this point, I’m thinking this box might actually be cursed. I’ll do some more random keyword searches tomorrow.

Somehow I stumble on this post:

https://www.overclock.net/threads/ryzen-5950x-system-unstable-when-its-cold-cpu-degradation.1814836/

A few people mention that disabling “Global C-States” solved the problem. Worth a shot. I boot into a Gentoo LiveUSB. It boots and launches KDE. I let it sit for a few hours. Still running. I reinstall Proxmox. It boots. It keeps running for a few hours. I put it back together and put it on the network. It runs overnight. It’s running as I write this days later.
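From a running Linux system, one way to see which idle states the kernel is actually using is the cpuidle directory in sysfs. The sketch below assumes the standard layout (`/sys/devices/system/cpu/cpuN/cpuidle/stateM/` with `name` and `usage` files). The function name `list_idle_states` is made up for the example, and what the list looks like before and after the BIOS change will depend on the board and kernel.

```python
from pathlib import Path


def list_idle_states(cpu=0, sysfs_root="/sys/devices/system/cpu"):
    """Return (state_dir, name, usage_count) for each idle state of one CPU."""
    cpuidle = Path(sysfs_root) / f"cpu{cpu}" / "cpuidle"
    # Sort numerically so that state10 would come after state2.
    state_dirs = sorted(cpuidle.glob("state[0-9]*"),
                        key=lambda p: int(p.name[len("state"):]))
    states = []
    for state_dir in state_dirs:
        name = (state_dir / "name").read_text().strip()
        # "usage" counts how many times this state has been entered.
        usage = int((state_dir / "usage").read_text())
        states.append((state_dir.name, name, usage))
    return states


if __name__ == "__main__":
    for state, name, usage in list_idle_states():
        print(f"{state}: {name} (entered {usage} times)")
```

Comparing this output with the BIOS option enabled and disabled is a cheap way to confirm the setting took effect, instead of waiting hours to see whether the box hangs.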

In conclusion …

I don’t know why a box that ran 24/7 for at least four years suddenly started locking up, or why it would suddenly become necessary to disable Global C-States. In the end, that was the one change that fixed it. Other than that and the BIOS version, nothing changed. My best guess is that a microcode update broke something, but I don’t think those persist across boots. My friend’s CPU would also have had to have the same bad microcode. Maybe some day I’ll figure it out.
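For anyone wanting to chase the microcode theory, the revision currently loaded is visible on x86 Linux in the `microcode` field of `/proc/cpuinfo`. A minimal sketch for pulling it out follows. The function name `microcode_revisions` is invented here, and recording the value across BIOS versions and boots is a suggestion for investigation, not something that was done during the troubleshooting above.

```python
from pathlib import Path


def microcode_revisions(cpuinfo_text):
    """Return the set of distinct 'microcode' values found in cpuinfo text."""
    revisions = set()
    for line in cpuinfo_text.splitlines():
        # Lines look like "key<tab>: value"; partition splits on the first colon.
        key, sep, value = line.partition(":")
        if sep and key.strip() == "microcode":
            revisions.add(value.strip())
    return revisions


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        print(sorted(microcode_revisions(cpuinfo.read_text())))
    else:
        print("/proc/cpuinfo not found (not a Linux system?)")
```

If the revision differs between a boot that hangs and one that does not, that would point at microcode. If it is identical, the theory gets weaker.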