Including an unused variable corrupts the multiboot kernel
Posted: Mon Apr 08, 2024 7:41 am
by FrankRay78
Hello,
I've encountered a really strange bug in my nascent kernel development. Whilst PatienceOS is a C# bare metal kernel (nb. nothing close to an OS yet), the simplicity of the codebase and its compilation directly to machine code mean that it's not much more than the C barebones tutorial here.
Bootstrap Assembly: src
Linker template: src
Main function: src
Console struct: src
Build script: src
The checked-in code (above) builds and runs fine in QEMU. However, when I add a single line to the console struct (see below), a variable which is declared but never used/referenced, the kernel no longer boots in QEMU. Rather, the screen flashes as if the multiboot has been corrupted somehow.
Code: Select all
private byte foregroundColor = 0x0F;
I'm guessing it's something to do with the packing of the struct (see here) and/or the memory alignment in the linker template, perhaps.
To be honest, I'm a little out of my depth, but I would really appreciate any suggestions as to how I can practically troubleshoot the situation. I'm more interested in learning how to diagnose and fix this than in being handed a silver bullet.
Frank
Re: Including an unused variable corrupts the multiboot kern
Posted: Mon Apr 08, 2024 8:18 am
by iansjack
I'd suggest that you run the kernel under a debugger. I'm not familiar with Windows debuggers, but gdb can run on Windows and works in cooperation with qemu. Ideally you debug in the high-level language, but I'm not familiar enough with C# to know how you could set that up. But the program is simple enough for you to just debug the assembly code directly.
Here's a link to gdb for Windows:
https://rpg.hamsterrepublic.com/ohrrpgce/GDB_on_Windows
and using gdb with qemu:
https://qemu-project.gitlab.io/qemu/system/gdb.html
Learning how to use a debugger is a very good discipline for OS development, and this provides an opportunity to gain that knowledge on a simple system.
I could say that all of this would be much easier if you were using C or Rust with a Linux development machine, but I'm guessing you don't want to hear that.
Re: Including an unused variable corrupts the multiboot kern
Posted: Mon Apr 08, 2024 8:18 am
by MichaelPetch
I don't have an appropriate build environment to build this. Would you be able to make available the kernel.elf file that works (prior to the change) and the kernel.elf that doesn't work? You could put them somewhere in your Github repo.
Re: Including an unused variable corrupts the multiboot kern
Posted: Mon Apr 08, 2024 10:24 am
by FrankRay78
MichaelPetch wrote:I don't have an appropriate build environment to build this. Would you be able to make available the kernel.elf file that works (prior to the change) and the kernel.elf that doesn't work? You could put them somewhere in your Github repo.
Thank you. I have placed both of them here:
https://github.com/FrankRay78/PatienceO ... /Debugging
I load them in QEMU with the following command:
Code: Select all
qemu-system-i386 -kernel <kernel filename>.elf
Re: Including an unused variable corrupts the multiboot kern
Posted: Mon Apr 08, 2024 10:28 am
by FrankRay78
Thank you for the advice and links to the debugger, I will seriously look into this more.
iansjack wrote:
I could say that all of this would be much easier if you were using C or Rust with a Linux development machine, but I'm guessing you don't want to hear that.
Believe me, I really did try to get the toolchain working end to end on Linux. An explanation of my failed attempts is here: Commentary on the build environment. Something to come back to, in the fullness of time.
Re: Including an unused variable corrupts the multiboot kern
Posted: Mon Apr 08, 2024 11:00 am
by MichaelPetch
I ran QEMU with these options to see what exceptions and interrupts were occurring:
Code: Select all
qemu-system-i386 -kernel kernel-notworking.elf -d int -no-reboot -no-shutdown
I saw this:
Code: Select all
0: v=06 e=0000 i=0 cpl=0 IP=0008:00201006 pc=00201006 SP=0010:00207fd4 env->regs[R_EAX]=00000000
EAX=00000000 EBX=00009500 ECX=00207ff0 EDX=00010511
ESI=00000000 EDI=00002000 EBP=00207fe8 ESP=00207fd4
EIP=00201006 EFL=00000006 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0010 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
CS =0008 00000000 ffffffff 00cf9a00 DPL=0 CS32 [-R-]
SS =0010 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
DS =0010 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
FS =0010 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
GS =0010 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
GDT= 000cb2b4 00000027
IDT= 00000000 000003ff
CR0=00000011 CR2=00000000 CR3=00000000 CR4=00000000
DR0=00000000 DR1=00000000 DR2=00000000 DR3=00000000
DR6=ffff0ff0 DR7=00000400
CCS=00000014 CCD=00207fd4 CCO=SUBL
EFER=0000000000000000
v=06 is exception 0x06 (Invalid opcode). When I look at address 0x00201006, where the exception occurred, I see this:
Code: Select all
201006: 0f 57 e4 xorps %xmm4,%xmm4
This is an SSE instruction. I didn't look at your code but I suspect the issue is because SSE instructions are not enabled in the processor before executing this code. I guess the option is to build without SSE instructions (don't know if you can do that with C#) or enable SSE instruction support. You can find code to do that here: https://wiki.osdev.org/SSE
In the working version of the kernel SSE instructions aren't being used. The change you made seems to have prompted some optimizations that include using SSE/SIMD.
Because you don't have an IDT set up with proper exception handlers the processor ends up triple faulting and reboots when it encounters the Invalid Opcode.
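To make reading the `-d int` output a little less manual, here is a small C sketch that pulls the `v=` and `pc=` fields out of a log line like the one above and names the vector. The helper names (`hex_field`, `vector_name`) are hypothetical, the vector table is deliberately partial, and the field layout is assumed from the dump shown in this post rather than taken from any QEMU documentation.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Names for a few x86 exception vectors (partial table, for illustration). */
static const char *vector_name(unsigned v) {
    switch (v) {
    case 0x00: return "Divide Error (#DE)";
    case 0x06: return "Invalid Opcode (#UD)";
    case 0x07: return "Device Not Available (#NM)";
    case 0x08: return "Double Fault (#DF)";
    case 0x0d: return "General Protection (#GP)";
    case 0x0e: return "Page Fault (#PF)";
    default:   return "other/unlisted";
    }
}

/* Read the hex value that follows `key` (e.g. "v=" or "pc=") in a log line.
   Returns 1 on success, 0 if the key is missing or not followed by hex digits. */
static int hex_field(const char *line, const char *key, unsigned *out) {
    assert(line != NULL && key != NULL && out != NULL);
    const char *p = strstr(line, key);
    if (p == NULL) {
        return 0;
    }
    return sscanf(p + strlen(key), "%x", out) == 1;
}
```

Fed the first line of the dump above, `hex_field(line, "v=", &v)` yields 0x06 and `hex_field(line, "pc=", &pc)` yields 0x00201006, which is the address to look up in a disassembly of the ELF file.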
Note: I didn't connect a debugger to determine what was at address 0x00201006. I dumped the contents of the ELF file with this command:
Re: Including an unused variable corrupts the multiboot kern
Posted: Mon Apr 08, 2024 3:37 pm
by FrankRay78
Thank you, MichaelPetch, that's incredibly helpful and very much appreciated. It's amazing seeing what you've done step by step.
For a moment there, I thought the solution would be trivial.
indicates a number of instruction sets can be used:
Code: Select all
x86: base, sse, sse2, sse3, ssse3, sse4.1, sse4.2, avx, avx2, aes, etc
and
indicates the actual CPU can be specified:
Code: Select all
Available CPUs:
x86 486 (alias configured by machine type)
x86 486-v1
x86 Broadwell (alias configured by machine type)
x86 Broadwell-IBRS (alias of Broadwell-v3)
x86 Broadwell-noTSX (alias of Broadwell-v2)
etc
So... I explicitly enabled sse in the compilation, and also set the CPU to pentium3 (which has sse support):
Code: Select all
ilc --targetos windows --targetarch x86 --instruction-set base,sse --verbose kernel.ilexe -g -o kernel.obj --systemmodule kernel --map kernel.map -O
...
qemu-system-i386 -cpu pentium3 -kernel kernel.elf
But alas, the issue still remains.
I'll need to look into this further. I suspect it's either the .Net AOT compiler, ilc, not respecting the command line switch, or my native Windows install of QEMU (which they mark as 'experimental').
Massive progress though, and thanks once again.
Re: Including an unused variable corrupts the multiboot kern
Posted: Mon Apr 08, 2024 3:54 pm
by MichaelPetch
Your compile process is already emitting SSE (causing the issues when you added the extra member to the structure). That is the problem. You want to be able to turn that off (not on). The issue revolves around the fact that GRUB doesn't guarantee anything about whether the processor's SSE support is enabled when transferring control to your kernel. It is likely not enabled even on processors that support SSE.
If you want to enable SSE in your kernel you have to programmatically turn it on. Adding the appropriate code to loader.asm before your kernel main is called is where that should be done.
https://wiki.osdev.org/SSE has code to do that. I haven't tested this (it is based on the Wiki code) but I think the logic is correct:
Code: Select all
_start:
    cli                         ; block interrupts
    mov esp, stack_space        ; set stack pointer
enablesse:
    ; Is SSE supported on this CPU?
    mov eax, 0x1
    cpuid
    test edx, 1<<25
    jnz .sse                    ; If SSE supported enable it.
.nosse:
    ; SSE not supported - do something like print an error and stop
    jmp $
.sse:
    ; now enable SSE and the like
    mov eax, cr0
    and ax, 0xFFFB              ; clear coprocessor emulation CR0.EM
    or ax, 0x2                  ; set coprocessor monitoring CR0.MP
    mov cr0, eax
    mov eax, cr4
    or ax, 3 << 9               ; set CR4.OSFXSR and CR4.OSXMMEXCPT at the same time
    mov cr4, eax
    ; Call Main
    call __managed__Main
    ; Infinite loop
    hlt
    jmp $
If you choose not to disable SSE from your code generator, you will need your kernel to check for SSE *support* and if there is none do something (print an error) and go into an infinite loop informing the user that you need a CPU with SSE support. If there is SSE support in the processor then you need to enable the SSE instruction set.
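For anyone who wants to sanity-check the bit twiddling without booting anything, here is a C sketch that applies the same CR0/CR4 changes to plain integers. The macro and function names are made up for illustration; nothing here touches real control registers, it only models the arithmetic the assembly performs.

```c
#include <assert.h>
#include <stdint.h>

#define CR0_MP         (1u << 1)   /* monitor coprocessor */
#define CR0_EM         (1u << 2)   /* x87 emulation; must be clear to use SSE */
#define CR4_OSFXSR     (1u << 9)   /* OS supports FXSAVE/FXRSTOR */
#define CR4_OSXMMEXCPT (1u << 10)  /* OS handles unmasked SIMD FP exceptions */

/* The assembly sets both CR4 bits with a single `or ax, 3 << 9`. */
static_assert((3u << 9) == (CR4_OSFXSR | CR4_OSXMMEXCPT),
              "3 << 9 covers OSFXSR and OSXMMEXCPT");

/* Model of `and ax, 0xFFFB` followed by `or ax, 0x2` on a CR0 value. */
static uint32_t cr0_for_sse(uint32_t cr0) {
    cr0 &= ~CR0_EM;
    cr0 |= CR0_MP;
    return cr0;
}

/* Model of `or ax, 3 << 9` on a CR4 value. */
static uint32_t cr4_for_sse(uint32_t cr4) {
    return cr4 | CR4_OSFXSR | CR4_OSXMMEXCPT;
}
```

Starting from the values in the QEMU register dump earlier in the thread (CR0=00000011, CR4=00000000), this gives CR0=0x00000013 and CR4=0x00000600.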
Re: Including an unused variable corrupts the multiboot kern
Posted: Mon Apr 08, 2024 4:09 pm
by MichaelPetch
FrankRay78 wrote:Code: Select all
x86: base, sse, sse2, sse3, ssse3, sse4.1, sse4.2, avx, avx2, aes, etc
I assume (just a guess) "base" would be code without SSE. If you can change to that then you may find the code works. From what you are saying SSE code generation could be disabled using `--instruction-set base` (notice I removed SSE). If you can't turn off code generation with SSE instructions you'll have to enable SSE at run time with code similar to what I have in my previous post.
I don't believe the problem here is with QEMU. Use QEMU as you were originally invoking it.
Re: Including an unused variable corrupts the multiboot kern
Posted: Tue Apr 09, 2024 12:03 am
by FrankRay78
Apologies MichaelPetch, it was late and my response was poor.
I did try everything with ilc to prevent the sse code from being emitted. The instruction-set switch with only ‘base’ didn’t work. I trawled through GitHub issues and could not find any documentation on whether this was intended. It was at that point I decided to see if I could force sse to be always on, but ran afoul of what I thought was QEMU not behaving.
Today I plan to log an issue with Microsoft regarding the ‘base’ switch, to confirm whether it should allow sse optimisations, and in the meantime enable sse support in my bootstrapper, as you’ve kindly pointed out. That this startup assembly is required was a gap in my understanding, even though I had been reading about which cpus support which versions of sse.
Update - An issue has been logged with the Microsoft runtime/AOT team, here: ilc.exe is emitting the sse instruction, xorps, with --instruction-set base
Re: Including an unused variable corrupts the multiboot kern
Posted: Tue Apr 09, 2024 3:33 pm
by FrankRay78
The answers given on the above GitHub issue I raised are clear and unambiguous, namely:
Firstly, win-x86 is unsupported. Secondly, the baseline is SSE2.
and also
The support for pre-SSE2 hardware was removed several years back and there is no interest in adding it back. We consider at least SSE, SSE2, CMOV, and CPUID as part of our baseline requirements.
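For reference, CPUID leaf 1 reports these features in EDX: CMOV is bit 15, SSE is bit 25 and SSE2 is bit 26. Below is a minimal C sketch that checks an EDX value against that stated baseline. The macro and function names are hypothetical, and fetching EDX via the `cpuid` instruction is deliberately left out so the snippet is plain portable C.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Feature bits in EDX after CPUID with EAX=1. */
#define CPUID1_EDX_CMOV (1u << 15)
#define CPUID1_EDX_SSE  (1u << 25)
#define CPUID1_EDX_SSE2 (1u << 26)

static_assert(CPUID1_EDX_SSE == 0x02000000u, "SSE is bit 25");

/* True only if every feature named in the baseline is reported. */
static bool meets_baseline(uint32_t edx) {
    const uint32_t need = CPUID1_EDX_CMOV | CPUID1_EDX_SSE | CPUID1_EDX_SSE2;
    return (edx & need) == need;
}
```

A Pentium III class CPU reports SSE but not SSE2, so `meets_baseline` would return false for it.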
Re: Including an unused variable corrupts the multiboot kern
Posted: Tue Apr 09, 2024 6:29 pm
by MichaelPetch
So keep your build as it was before and modify loader.asm with the code I suggested. Hopefully if I haven't screwed anything up that should work. My code changes to loader.asm check if SSE is supported by the CPU. If it isn't supported it just goes into an infinite loop (you could add code to print an error to the display). If SSE is supported then I enable the SSE features. That should allow your kernel code to run even if it uses SSE.
Initializing the x87/FPU to a valid state probably isn't a bad idea either, although that's not currently an issue for you. On some systems, issuing an x87 FPU instruction may also cause an exception if the FPU has not been initialized ahead of time.
Re: Including an unused variable corrupts the multiboot kern
Posted: Tue Apr 09, 2024 8:35 pm
by Octocontrabass
If your compiler always uses SSE2, that means you'll need to save/restore the SSE registers in every kernel entry/exit point instead of only during a context switch. The same applies to any other registers your compiler might use, but most examples you'll see were written with the assumption that the compiler only uses general-purpose registers.
Most Linux distros require i686+SSE2 at minimum, but the Linux kernel (usually) doesn't use SSE registers.
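As a rough illustration of what saving the SSE state involves: `fxsave` and `fxrstor` operate on a 512-byte memory area that must be 16-byte aligned. Below is a C sketch of that area for 32-bit mode, following the layout in the Intel manuals as I understand it. The struct and field names are my own, and the code only describes the buffer; it does not execute `fxsave`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit FXSAVE area: x87 state, MXCSR, then XMM0-XMM7. */
struct fxsave_area {
    _Alignas(16) uint16_t fcw;   /* x87 control word; forces 16-byte alignment */
    uint16_t fsw;                /* x87 status word */
    uint8_t  ftw;                /* abridged x87 tag word */
    uint8_t  reserved0;
    uint16_t fop;                /* last x87 opcode */
    uint32_t fip;                /* x87 instruction pointer offset */
    uint16_t fcs;                /* x87 instruction pointer selector */
    uint16_t reserved1;
    uint32_t fdp;                /* x87 data pointer offset */
    uint16_t fds;                /* x87 data pointer selector */
    uint16_t reserved2;
    uint32_t mxcsr;              /* SSE control/status register */
    uint32_t mxcsr_mask;
    uint8_t  st_mm[8][16];       /* ST0-ST7 / MM0-MM7, 16 bytes each */
    uint8_t  xmm[8][16];         /* XMM0-XMM7 */
    uint8_t  reserved3[224];     /* unused in 32-bit mode */
};

static_assert(sizeof(struct fxsave_area) == 512, "FXSAVE area is 512 bytes");
static_assert(offsetof(struct fxsave_area, mxcsr) == 24, "MXCSR at byte 24");
static_assert(offsetof(struct fxsave_area, xmm) == 160, "XMM0 at byte 160");
```

One such buffer per interrupted context is what each kernel entry path would need to fill and restore if the compiler is free to use XMM registers in kernel code.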
Re: Including an unused variable corrupts the multiboot kern
Posted: Tue Apr 09, 2024 11:37 pm
by iansjack
What happens if you initialize the variables in the constructor rather than in the structure definition? Perhaps C# is using an inbuilt memory move routine (which often uses SSE instructions) when there are multiple initialized variables in a structure definition.
If that is the case then, IMO, C# isn’t a suitable tool for OS development. It would be interesting to know whether the same problem exists if open-source tools, such as mono, are used.
Re: Including an unused variable corrupts the multiboot kern
Posted: Wed Apr 10, 2024 4:26 am
by FrankRay78
Dear MichaelPetch, your suggestion worked and I'm very grateful. Here's the commit: Enable cpu support for sse in bootstrap.
I'm also very inspired to take my OS learning seriously, given how your support has opened my eyes to this truly fascinating subject.
Dear iansjack, I tried the following:
Code: Select all
private byte foregroundColor;

public Console(int width, int height, FrameBuffer frameBuffer, byte foregroundColor = 0x0F)
{
    this.width = width;
    this.height = height;
    this.frameBuffer = frameBuffer;
    this.foregroundColor = foregroundColor;
}
I also tried it without the default value on the constructor parameter. Both versions still result in the sse instruction being emitted.
I don't understand enough about how sse works, nor the memory move comments, and given that the 32-bit AOT compiler isn't officially supported yet, I'm not sure what I can deduce from this. I'll read up some more, and probably inspect the generated IL for the builds with and without sse to see if that sheds any light.