Curious case of memory corruption

Posted: Sun Dec 22, 2024 2:17 pm
by bdsy
Hello

While running an ordinary Linux distro on an ordinary PC, I found that a file cached in memory had been corrupted in an unusual way. I'd really want to know the root cause rather than apply some generic advice in hopes the issue disappears. Perhaps you have some ideas.

Diff between the correct file and the corrupted one:
diff.png
What I see here:
  • A chunk of 4 bytes immediately before the 256-byte chunk described next
  • A chunk of 256 bytes, 4096-aligned. There's a pattern of 8-byte groups:
    • Looks like a little-endian number that increases by 13.
    • Fifth byte is not overwritten.
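The observations above can be checked mechanically. Here is a minimal Python sketch, assuming both copies of the file are available as byte strings. The names `diff_ranges` and `max_gap` are made up for illustration and are not from any tool mentioned in this thread. It lists each run of differing bytes together with its offset inside a 4096-byte page, and `max_gap` lets runs separated by a few unchanged bytes (such as the untouched fifth byte of each group) be reported as one range.

```python
PAGE_SIZE = 4096  # page size assumed from the 4096-byte alignment seen in the diff


def diff_ranges(good: bytes, bad: bytes, max_gap: int = 0):
    """Return (start, length, start % PAGE_SIZE) for each run of differing bytes.

    Runs separated by at most `max_gap` identical bytes are merged into one.
    Only the common prefix of the two inputs is compared.
    """
    ranges = []  # each entry is [start, end_exclusive]
    for i in range(min(len(good), len(bad))):
        if good[i] == bad[i]:
            continue
        if ranges and i - ranges[-1][1] <= max_gap:
            ranges[-1][1] = i + 1  # close enough to the previous run: extend it
        else:
            ranges.append([i, i + 1])  # start a new run
    return [(start, end - start, start % PAGE_SIZE) for start, end in ranges]
```

With `max_gap=1`, each 8-byte group should fold into the surrounding range instead of showing up as two separate runs, which makes the alignment of the whole chunk easier to read off.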
Initially I noticed that something was wrong with the file, calculated its hash, and got a mismatch. I saved the corrupted copy to disk. Along the way the file was read from the memory cache multiple times, so this couldn't have been a one-time error while reading from RAM. Also, I ran Memtest86+ for 19.5 hours (14 passes) and it found nothing. It also couldn't have been a disk fault: the file was stored on an encrypted partition, so a corrupted block would have looked random after decryption rather than showing a regular pattern.

What's probably more important, I found the same issue reported online (in a place I'd rather not link to). In their case, the corruption happened in Windows driver code:
windows-chkimg.txt
So, the issue is not OS-specific. The common factors seem to be an AMD Zen 4 CPU, a B650 chipset and a Gigabyte motherboard. I wonder if the corruption could've been caused by SMM? Or else, what kind of hardware failure can cause these patterns?

I'm planning to write a program that scans the whole RAM for this pattern to at least know the rate at which this occurs and potentially correlate it with other things happening on the PC.
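A rough sketch of what such a scanner might look like, in Python. It works on a bytes buffer (for example a memory dump) rather than on live RAM, since reading physical memory needs kernel support that is out of scope here. It assumes the corrupted chunk starts on a 4096-byte boundary and that the first four bytes of every 8-byte group hold a 32-bit little-endian counter. The diff only shows the pattern roughly, so that layout is a guess, and all names are hypothetical.

```python
import struct

PAGE = 4096   # the corrupted chunk was 4096-aligned
CHUNK = 256   # length of the corrupted chunk
GROUP = 8     # the pattern repeats in 8-byte groups
STEP = 13     # the counter increases by 13 from group to group


def looks_corrupted(chunk: bytes) -> bool:
    """Check whether a 256-byte chunk carries the increasing-counter pattern.

    Assumption: the counter is a little-endian uint32 in bytes 0-3 of each
    group. Bytes 4-7 are ignored because the fifth byte is left untouched
    and the content of the remaining bytes is not known.
    """
    if len(chunk) < CHUNK:
        return False
    values = [struct.unpack_from("<I", chunk, off)[0]
              for off in range(0, CHUNK, GROUP)]
    # Compare modulo 2**32 so a counter that wraps around still matches.
    return all((b - a) & 0xFFFFFFFF == STEP
               for a, b in zip(values, values[1:]))


def scan(data: bytes, base: int = 0):
    """Return the addresses of page-aligned chunks that match the pattern.

    `base` is the address of data[0]; it is assumed to be page-aligned.
    """
    return [base + off
            for off in range(0, len(data) - CHUNK + 1, PAGE)
            if looks_corrupted(data[off:off + CHUNK])]
```

Requiring 31 consecutive differences of exactly 13 should keep false positives very rare on ordinary data, but the check would need adjusting if the counter turns out to sit elsewhere in the group.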

Re: Curious case of memory corruption

Posted: Sun Dec 22, 2024 6:24 pm
by Octocontrabass
bdsy wrote: Sun Dec 22, 2024 2:17 pm
I'd really want to know the root cause rather than apply some generic advice in hopes the issue disappears.
Vendors won't admit there's a problem if they can avoid it, but they'll fix problems anyway. That's why the generic advice like "update your BIOS" and "install components according to the latest version of the user manual and compatibility list" so often fixes problems.
bdsy wrote: Sun Dec 22, 2024 2:17 pm
Also, I ran Memtest86+ for 19.5 hours (14 passes) and it found nothing.
Memory tests don't always catch faulty RAM. Sometimes the fault only occurs with specific access patterns. This corruption doesn't look like faulty RAM, though.
bdsy wrote: Sun Dec 22, 2024 2:17 pm
I wonder if the corruption could've been caused by SMM?
It could be.
bdsy wrote: Sun Dec 22, 2024 2:17 pm
Or else, what kind of hardware failure can cause these patterns?
Any hardware component that can perform DMA can corrupt memory. If that's what's happening, Linux might be able to use the IOMMU to catch the offending device.
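If the IOMMU does block a stray DMA access, the kernel normally reports it in its log. As a small sketch, the Python function below counts such reports per device in a saved copy of the kernel log. It assumes the AMD IOMMU driver's messages contain the text `IO_PAGE_FAULT` and a PCI address such as `0000:01:00.0`. The exact message format differs between kernel versions, so treat the matching as a starting point, and the function name is made up.

```python
import re
from collections import Counter

# PCI address, with or without the domain prefix, e.g. 0000:01:00.0 or 01:00.0
PCI_ADDR = re.compile(
    r"\b(?:[0-9a-fA-F]{4}:)?[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]\b"
)


def iommu_fault_devices(log_text: str) -> Counter:
    """Count IOMMU page-fault log lines per PCI device.

    Assumption: fault lines contain the substring "IO_PAGE_FAULT".
    Lines without a recognizable PCI address are counted as "unknown".
    """
    counts = Counter()
    for line in log_text.splitlines():
        if "IO_PAGE_FAULT" not in line:
            continue
        match = PCI_ADDR.search(line)
        counts[match.group(0) if match else "unknown"] += 1
    return counts
```

A device that shows up here repeatedly would be a good first suspect. An empty result does not clear every device, though, because the IOMMU can only catch accesses outside the memory that a device has been granted.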

Re: Curious case of memory corruption

Posted: Mon Dec 23, 2024 6:31 pm
by bdsy
Thank you for the reply! I'll try to see what I can do with the IOMMU.
Octocontrabass wrote: Sun Dec 22, 2024 6:24 pm
bdsy wrote: Sun Dec 22, 2024 2:17 pm
I'd really want to know the root cause rather than apply some generic advice in hopes the issue disappears.
Vendors won't admit there's a problem if they can avoid it, but they'll fix problems anyway. That's why the generic advice like "update your BIOS" and "install components according to the latest version of the user manual and compatibility list" so often fixes problems.
Now it looks like that's what I'm going to do eventually.