CPU bug makes virtually all chips vulnerable

bluemoon
Member
Posts: 1761
Joined: Wed Dec 01, 2010 3:41 am
Location: Hong Kong

Re: CPU bug makes virtually all chips vulnerable

Post by bluemoon »

~ wrote:But your data will be marked cache-disabled.
Yes, it works by disabling the cache entirely, great that you figured it out. But no, it's impractical: it's slower than running everything in an emulator, which provides even better isolation.
~ wrote:That's the case I've never read in the documents, ...
That doesn't mean it's not possible; it could just be irrelevant. Did the document say you need to turn on the computer?
Korona wrote:Okay, at this point you're legitimately trolling.
OK I give up too.
~
Member
Posts: 1228
Joined: Tue Mar 06, 2007 11:17 am
Libera.chat IRC: ArcheFire

Re: CPU bug makes virtually all chips vulnerable

Post by ~ »

DMA isn't supposed to use the cache either, and it's extremely fast.

The security data we're talking about is a very small data set; a few memory pages would be enough to hold it, with no need to disable caching for code sections. The root data for encryption and the like is very small, and already-encrypted data is unreadable anyway, so the critical data would amount to a few bytes up to a few kilobytes. Even if it were bigger, it would not slow down the machine, since most of everything else that is running, and the documents being displayed, are not that critical or are already on disk.

If a document that critical can be kept out of the spillable cache, it wouldn't slow the computer down much, as code would still be in the fast cache.
Korona
Member
Posts: 1000
Joined: Thu May 17, 2007 1:27 pm

Re: CPU bug makes virtually all chips vulnerable

Post by Korona »

Let me once again dumb-down everything. If this does not work, I won't respond to ~ anymore as doing that does not seem to further the discussion.

The attack still works if you have the following setup:
  • Buffer A, cache-disabled, containing sensitive data
  • Buffer B, cached, containing random data (or all zeros or whatever, it does not matter)
In this situation, buffer A can still be leaked. Do you understand that?
managarm: Microkernel-based OS capable of running a Wayland desktop (Discord: https://discord.gg/7WB6Ur3). My OS-dev projects: [mlibc: Portable C library for managarm, qword, Linux, Sigma, ...] [LAI: AML interpreter] [xbstrap: Build system for OS distributions].
~
Member
Posts: 1228
Joined: Tue Mar 06, 2007 11:17 am
Libera.chat IRC: ArcheFire

Re: CPU bug makes virtually all chips vulnerable

Post by ~ »

Korona wrote:Let me once again dumb-down everything. If this does not work, I won't respond to ~ anymore as doing that does not seem to further the discussion.

The attack still works if you have the following setup:
  • Buffer A, cache-disabled, containing sensitive data
  • Buffer B, cached, containing random data (or all zeros or whatever, it does not matter)
In this situation, buffer A can still be leaked. Do you understand that?
That would suggest that pages marked cache-disabled are loaded into the cache anyway, by bad design, when the CPU should never even consider using the cache for pages marked as non-cacheable.

That would sound like a very wasteful CPU design: caching pages when you have told the machine, at the lowest level possible, that you don't want it to.

But I have yet to read anything that states "pages marked cache-disabled via the page table entries will still be loaded into the cache as determined by different specific CPU models". That would be a critical statement, and nobody here has posted an official URL for it.
Korona
Member
Posts: 1000
Joined: Thu May 17, 2007 1:27 pm

Re: CPU bug makes virtually all chips vulnerable

Post by Korona »

In my example, buffer A is never loaded into the cache. It can still be leaked.

Buffer B does not contain anything interesting. The contents of buffer A and buffer B are completely unrelated. Just the existence of buffer B suffices for the attack (and to leak the uncached buffer A). Do you understand that?
Last edited by Korona on Sun Jan 07, 2018 12:04 pm, edited 1 time in total.
~
Member
Posts: 1228
Joined: Tue Mar 06, 2007 11:17 am
Libera.chat IRC: ArcheFire

Re: CPU bug makes virtually all chips vulnerable

Post by ~ »

I don't know where the data would leak from if it isn't in the cache and it can't be read from the original RAM.

If it's because the addresses can be determined and a low-level driver is then used to read the memory, that would be different. Right now, everything leaked from browsers using only JavaScript leaks because all the data is in pages marked cache-enabled, as OSes do by default; the measure I'm talking about hasn't been used in Chrome or Firefox.

_________________________________
As I understand it, the problem is that data from a program will get loaded in the cache at some point, and then the system will switch to an exploit program which will then read the cache, containing data from many programs, not just itself.

If everyone says that fully disabling the cache stops the problem (or disabling it per page, as a practical solution), then it looks as if the cache is a key component of Meltdown/Spectre, and if it isn't used, the attack will be hampered or impossible to perform.

The best way to find out would be to let programs (for example, a browser with login data, SSL pages, and cookies) mark the pages holding all of their sensitive data/variables as cache-disabled, and then see whether the attack is stopped. Without actually programming and running this solution, we cannot know if it will work.
Last edited by ~ on Sun Jan 07, 2018 12:20 pm, edited 1 time in total.
bluemoon
Member
Posts: 1761
Joined: Wed Dec 01, 2010 3:41 am
Location: Hong Kong

Re: CPU bug makes virtually all chips vulnerable

Post by bluemoon »

It's called a side-channel attack.

The cache is not used to observe the sensitive data itself; it is only used to create an observable side effect: a cache hit or miss from which a bit can be deduced as one or zero.
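
A toy model of that idea, in Python, might look like the following. It does not touch real hardware: the SimCache class, the latency numbers and the threshold are all made-up assumptions, chosen only to show how a hit/miss timing turns into one bit without the cache ever holding the secret.

```python
# Toy model only: latencies and names are illustrative assumptions, not measurements.
HIT_CYCLES = 40      # assumed latency of a load served from the cache
MISS_CYCLES = 300    # assumed latency of a load that has to go to RAM
THRESHOLD = 150      # anything faster than this is treated as a hit


class SimCache:
    """Tracks which line numbers are cached; it never stores any data."""
    LINE = 64

    def __init__(self):
        self.lines = set()

    def flush(self):
        self.lines.clear()

    def load(self, addr):
        """Simulate a load and return its latency in cycles."""
        line = addr // self.LINE
        if line in self.lines:
            return HIT_CYCLES
        self.lines.add(line)
        return MISS_CYCLES


def victim(cache, probe_addr, secret_bit):
    # The victim touches the probe line only when the secret bit is 1.
    if secret_bit:
        cache.load(probe_addr)


def recover_bit(cache, probe_addr, secret_bit):
    cache.flush()                       # start from a known cache state
    victim(cache, probe_addr, secret_bit)
    latency = cache.load(probe_addr)    # the attacker times its own load
    return 1 if latency < THRESHOLD else 0


cache = SimCache()
recovered = [recover_bit(cache, 0x1000, b) for b in (1, 0, 1, 1, 0)]
print(recovered)
```

The attacker only ever sees a latency for an address it is allowed to read; the bit is inferred from whether that latency falls below the threshold.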

Also, there are other potential candidates for creating an observable side effect besides the cache (e.g. power consumption: maybe letting SE run some no-ops vs. a div makes a difference, who knows).
~ wrote:As I understand it
No you don't.
Last edited by bluemoon on Sun Jan 07, 2018 12:21 pm, edited 1 time in total.
Korona
Member
Posts: 1000
Joined: Thu May 17, 2007 1:27 pm

Re: CPU bug makes virtually all chips vulnerable

Post by Korona »

~ wrote:As I understand it, the problem is that data from a program will get loaded in the cache at some point, and then the system will switch to an exploit program which will then read the cache, containing data from many programs, not just itself.
No, that is not the attack. The attack executes

Code: Select all

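movzx eax, byte [A]      ; read one byte of kernel buffer A (this load faults)
shl eax, 6               ; multiply by 64, one cache line on typical x86 CPUs
mov byte [B + eax], 0    ; touch the line of buffer B selected by that byte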
Because buffer A is in the kernel, the first instruction causes a page fault. But the CPU speculatively runs the following instructions, which causes part of buffer B to be loaded into the cache. Now B[64] is in the cache if and only if A[0] = 1, B[128] is in the cache if and only if A[0] = 2, and so on. Buffer A is never loaded into the cache.
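
The sequence described above can be sketched as a small Python simulation. This is a model of the explanation in this post, not an exploit: nothing here executes speculatively or measures real time, and the addresses, latencies and the contents of buffer A are hypothetical. Whether a particular CPU actually forwards data from a cache-disabled page during speculation is hardware-specific, and the model does not settle that question; it only shows that, if the value is forwarded, buffer A never has to be cached for it to leak.

```python
# Toy model of the three-instruction sequence. Every name and number is an
# illustrative assumption; no real speculation or timing takes place.
LINE = 64
HIT_CYCLES, MISS_CYCLES, THRESHOLD = 40, 300, 150

A_BASE = 0xC0000000          # hypothetical kernel address of buffer A
B_BASE = 0x00100000          # hypothetical user address of buffer B
buffer_a = b"key!"           # stand-in for the sensitive data, cache-disabled

cached_lines = set()         # line numbers currently in the simulated cache


def touch_b(offset):
    """Cached access to buffer B; returns a simulated latency in cycles."""
    line = (B_BASE + offset) // LINE
    if line in cached_lines:
        return HIT_CYCLES
    cached_lines.add(line)
    return MISS_CYCLES


def transient_sequence(index):
    """Read a byte of A, shift it left by 6, touch B at that offset.

    Buffer A is cache-disabled, so reading it adds nothing to cached_lines.
    The architectural result is discarded (the load faults); only the
    access to buffer B survives as a side effect.
    """
    value = buffer_a[index]      # value seen only by the dependent instructions
    touch_b(value << 6)          # B[value * 64] is pulled into the cache


def leak_byte(index):
    cached_lines.clear()         # flush buffer B from the cache
    transient_sequence(index)
    for guess in range(256):     # time one line per possible byte value
        if touch_b(guess * LINE) < THRESHOLD:
            return guess
    return None


leaked = bytes(leak_byte(i) for i in range(len(buffer_a)))
a_lines = {(A_BASE + i) // LINE for i in range(len(buffer_a))}
print(leaked, cached_lines.isdisjoint(a_lines))
```

In this model the only lines that ever enter the cache belong to buffer B, yet every byte of buffer A is recovered from which line of B comes back fast.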
~
Member
Posts: 1228
Joined: Tue Mar 06, 2007 11:17 am
Libera.chat IRC: ArcheFire

Re: CPU bug makes virtually all chips vulnerable

Post by ~ »

Can't cache usage be better synchronized, so that at least the most recent data of program A (the kernel) gets erased from the cache by invalidating it before and after switching to another program? That way, while program A is running, no cached memory from other processes is ever visible, and the same holds when program B runs: only the cache contents of the currently running program are ever present. Erasing only the data cache, not the code cache, would be optimal.

If not, the problem would be an attempt to compare values across different privilege levels and getting a result despite a page fault, but as you can see, the cache is always key. Maybe the logic in the CPU that loads cache lines runs, wrongly, before the logic that checks whether the access is legal at the current privilege level.

If the cache is protected, for example by invalidating it, so that no program can see the contents of another, not even for an internal comparison, then no data could be deduced.

A page fault could invalidate the cache for the currently running program (or sometimes the whole cache). The kernel would have to do that, wait for the cache flush to complete, and then switch to another program with a clean cache. It would leave the offending program suspended at the end of the task queue, probably for several cycles, without recording any hardware parameters for it that would let it take bogus timing measurements, and would not return control to it immediately, so that later cache operations cannot be leaked that way.

It also seems to have to do with poor synchronization between use and invalidation of the caches, and with making sure that the current program always has a clean, private cache before and after it is switched in and out.

Even if programs opted into this behavior selectively, it would keep their own cache private from the rest of the programs, which would probably prevent the values that Meltdown/Spectre checks from being determinable.

The time spent waiting for the cache flush after page faults or other errors could be padded so that it is always the same regardless of the condition being probed, at least for programs that opt into that padded delay through a kernel API for cache management, or on faults. That would make any differences less visible.

If the processor doesn't check accesses across privilege levels, the kernel could probably do so in addition to flushing the cache and padding its timing, while leaving the offending task at the end of the task queue for several low-priority cycles. The cache would definitely need to be flushed, and the flush waited on, for a program that causes page faults as in this case.
Korona
Member
Posts: 1000
Joined: Thu May 17, 2007 1:27 pm

Re: CPU bug makes virtually all chips vulnerable

Post by Korona »

Buffer A in the example is not cached at all.
~
Member
Posts: 1228
Joined: Tue Mar 06, 2007 11:17 am
Libera.chat IRC: ArcheFire

Re: CPU bug makes virtually all chips vulnerable

Post by ~ »

But as I said, suppose the kernel flushes the bogus cached data every time a page fault (or any other fault used by the attack) occurs, waits for the cache flush with padded timing, and lets that process sleep for a while. Then the time measured for reading RAM or cache will always be the same, because all cache usage will be synchronized in a way that doesn't allow extracting data through timing: the data cache will be invalidated for the bogus program every time it faults, and is thus no longer usable for timing.

It's really confusing and suspicious, since one would think the kernel's memory management code would and should flush the cache when page faults or more complex faults occur. If that isn't done, it looks like lower-quality memory management code.
Last edited by ~ on Sun Jan 07, 2018 2:06 pm, edited 2 times in total.
DavidCooper
Member
Posts: 1150
Joined: Wed Oct 27, 2010 4:53 pm
Location: Scotland

Re: CPU bug makes virtually all chips vulnerable

Post by DavidCooper »

I don't use memory management in my OS for security (because my plan is for all the code running in the machine to be written by AI in the OS, thereby ensuring that there is never any malicious code present in the first place, but it's still a future solution rather than a current one), so there's a massive gap in my knowledge of this business. Still, I'd like to know how common it is for programs to try to access memory that should be off limits to them, and whether the OS can detect all these attempts to read memory that shouldn't be happening. Have any of you actually collected statistics on this using any OS?

A properly debugged program should never try to touch memory that it isn't allowed to, so might there be a potential solution to the problem there that could avert the need to replace billions of machines? If programs are shut down as soon as they start misbehaving in this way, they could be labelled as potentially malicious and be banned from running not only on the machine that caught them, but on all other machines too: the program only needs to be reported once, and all other machines could be informed by their anti-virus software. This would force companies to debug their code better before releasing it, and it would identify any programs that might be trying to exploit this vulnerability. Perhaps I'm asking the impossible of most human programmers though, so can anyone tell me if they've ever found any commercial app that runs without ever attempting to access memory that it shouldn't be touching?
Help the people of Laos by liking - https://www.facebook.com/TheSBInitiative/?ref=py_c

MSB-OS: http://www.magicschoolbook.com/computing/os-project - direct machine code programming
Korona
Member
Posts: 1000
Joined: Thu May 17, 2007 1:27 pm

Re: CPU bug makes virtually all chips vulnerable

Post by Korona »

You don't need to trigger the page fault. You can suppress it using TSX. You can (probably, I didn't test this) put code that triggers legitimate page faults (to not-yet swapped-in pages) in front of the mov and it will still be speculatively executed.
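
To make the consequence for a flush-on-fault mitigation concrete, here is a toy Python comparison. It assumes, as stated above, that a fault raised inside a TSX transaction aborts the transaction in user space instead of entering the kernel's page-fault handler. The handler, the secret value and the cache model are all hypothetical, and the transient access is reduced to a single set insertion.

```python
# Toy model: a fault-handler flush when the fault is delivered to the kernel
# versus when it is suppressed. All names and values are hypothetical.
SECRET = 0x2A                 # byte the attacker is not allowed to read
cached_lines = set()          # indices of probe lines currently cached


def transient_touch():
    # Side effect of the speculated instructions: the probe line whose
    # index equals the secret byte ends up in the cache.
    cached_lines.add(SECRET)


def kernel_page_fault_handler():
    # The mitigation under discussion: flush the cache on every fault.
    cached_lines.clear()


def attempt(fault_suppressed):
    cached_lines.clear()
    transient_touch()
    if not fault_suppressed:
        kernel_page_fault_handler()   # kernel sees the fault and flushes
    # Otherwise the transaction aborts and the kernel is never entered,
    # so nothing flushes the probe lines before the attacker checks them.
    hits = [line for line in range(256) if line in cached_lines]
    return hits[0] if hits else None


print(attempt(fault_suppressed=False))
print(attempt(fault_suppressed=True))
```

In this model the flush only helps when the handler actually runs; with the fault suppressed, the probe line is still cached when the attacker looks.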
Sik
Member
Posts: 251
Joined: Wed Aug 17, 2016 4:55 am

Re: CPU bug makes virtually all chips vulnerable

Post by Sik »

~ wrote:DMA is supposed not to use the cache as well and it's extremely fast.
It also bypasses the CPU core and accesses consecutive addresses. The latter is literally the one case where not having a cache is irrelevant (consecutive accesses are fast; it's random accesses that are slow, which is why they need to be cached).


Also unrelated: at the rate things are going, who wants to bet that we'll soon find some other exploit that doesn't rely on speculation at all, just on the existence of the cache? (I hope not, but I wouldn't be surprised.) I was thinking about such a scenario. I know it's common for a DSP to separate internal (fast) from external (slow) RAM and to require the code to request the latter explicitly (possibly in parallel with its execution). If the cache itself turns out to be dangerous, we may have to start thinking about how to make a CPU work like that as well (and rewrite just about every program out there).

PS: for the record, I wouldn't be that worried about losing speculation, especially if you treat the CPU mainly as a controller that parses some logic and then gives directions to the rest of the hardware. You're bound to end up with the CPU wasting time waiting for the rest of the hardware to do its job anyway. Losing the speed from cached accesses would be the real killer, as that would reduce performance to a tiny percentage of what it currently is.
~
Member
Posts: 1228
Joined: Tue Mar 06, 2007 11:17 am
Libera.chat IRC: ArcheFire

Re: CPU bug makes virtually all chips vulnerable

Post by ~ »

Korona wrote:You don't need to trigger the page fault. You can suppress it using TSX. You can (probably, I didn't test this) put code that triggers legitimate page faults (to not-yet swapped-in pages) in front of the mov and it will still be speculatively executed.
TSX is probably new enough to be irrelevant, at least for important legacy hardware, and it may be present in only a small share of existing CPUs, so finding newer CPUs without the design flaw would be less critical. But deeming hardware useless just because of this can't be right. In fact, it's starting to educate everyone about paging, memory management, caching, optimizing code, Assembly language, C language, kernel development...

If you invalidate the data cache at the start and end of program switches, at least selectively for some programs, the data cache would be clean, with nothing left over from before the last task switch or from other programs, and would be unusable for this attack.

The OS could still be configured to enable or disable TSX. For example, in an extremely high-security environment such as a remote virtual machine account, that feature could be turned off to force page faults to be handled by the OS. It could also be configured per program, with users deliberately fine-tuning the applications that they, and obviously well-known good programs, know not to be malware. Otherwise, the safest options could be used by default (no TSX), and we would still have the remaining choices of using basic protection mechanisms and better synchronization to prevent the other attack variations.

In some environments TSX could be forcibly disabled even if the program asked the OS to enable it, so that those features are enabled, automatically or manually, only by human users or for long-running, long-installed, long-present binaries as they see more use.

If it's possible to keep the cache clean between processes and flush it on any fault/exception, flushing despite bogus out-of-order execution by using synchronization in the kernel, as well as keeping selected pages cache-disabled for private data, then this won't be so difficult to fix. But it has to be implemented and run to see if it really stops the attack; we can't know without trying these measures.

________________________________________
________________________________________
We would still need to reproduce the bug in order to implement OS code that really solves something, rather than designing it on a suspicion of how things work. We first need to know how the vulnerability actually works, so that we can write code that, instead of trying to patch problems, manages the memory/cache resources well enough that there is nothing to patch.
Last edited by ~ on Sun Jan 07, 2018 7:21 pm, edited 1 time in total.