FPU state on multicore processors

rdos · Post by **rdos** » Wed Oct 12, 2011 10:41 am

I just discovered that my FPU state handler will fail on multicore processors, so I need to update it, although I'm not sure how.

Current (single-core) functionality:

1. Every task switch sets the TS flag in the CR0 register (previously, this was done automatically when using hardware task-switching).

2. Exception 7 checks whether the FPU state belongs to the current thread (a single variable called math_tss holds the last owner). If not, an fsave is done to math_tss, followed by an frstor from the new thread, and then math_tss is set to the new thread.

Of course, this logic will malfunction on a multicore processor since there is only a single current FPU thread per system.

However, just extending the logic so that the current FPU thread becomes a per-core variable will not work smoothly. If an FPU thread is moved from one core to another, the new core cannot easily load the state still held by the old core (IPIs would be needed). Saving the state every time a thread might be moved to another core would fix this, but such moves are not easily detected.

Currently, I think I might do it like this:

1. Create a new flag per core, HAS_FPU_STATE, and clear it during processor initialization.
2. Set TS flag during processor initialization.
3. Exception 7 will set HAS_FPU_STATE, load the FPU state for the thread and do a CLTS.
4. The task-switcher will no longer set the TS flag, but will instead check HAS_FPU_STATE. If it is set, it will save the FPU state, clear HAS_FPU_STATE and set TS flag.
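
The four steps above can be sketched as a small C model. This is only an illustration under stated assumptions, not rdos's actual code: `struct core`, `struct thread`, `core_init`, `exception7`, `task_switch` and `fpu_add` are hypothetical names, and plain integers stand in for the FSAVE image and the physical FPU registers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in: a real kernel would keep an FSAVE image here. */
struct thread { int fpu_image; };

struct core {
    bool has_fpu_state;      /* HAS_FPU_STATE flag from the list above */
    bool ts;                 /* models CR0.TS */
    int  fpu_regs;           /* models this core's physical FPU registers */
    struct thread *current;
};

/* Steps 1 and 2: clear the flag and set TS during processor initialisation. */
void core_init(struct core *c)
{
    c->has_fpu_state = false;
    c->ts = true;
    c->fpu_regs = 0;
    c->current = NULL;
}

/* Step 3: exception 7 clears TS, loads the thread's state, sets the flag. */
void exception7(struct core *c)
{
    c->ts = false;                           /* clts */
    c->fpu_regs = c->current->fpu_image;     /* frstor */
    c->has_fpu_state = true;
}

/* Step 4: the switcher saves only if the outgoing thread touched the FPU. */
void task_switch(struct core *c, struct thread *next)
{
    if (c->has_fpu_state) {
        assert(c->current != NULL);
        c->current->fpu_image = c->fpu_regs; /* fsave */
        c->has_fpu_state = false;
        c->ts = true;
    }
    c->current = next;
}

/* Models the current thread executing one FPU instruction. */
void fpu_add(struct core *c, int v)
{
    if (c->ts)
        exception7(c);
    c->fpu_regs += v;
}
```

In this model the state is always written back to the thread before the owning core runs anything else, so a second core can later load it without an IPI, provided the save has completed before the thread becomes runnable elsewhere.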

How have others solved this problem?

Combuster · Post by **Combuster** » Wed Oct 12, 2011 11:15 am

There are two factors to consider:
- what is the cost of moving the FPU state from one core to another
- how often does the FPU state have to switch between cores

If the state swaps over often, it is probably faster to save on each task switch.
If the state swaps over sparingly, it is probably faster to interrupt the remote processor.

Incidentally, the same holds for the processor caches, meaning that every core swap is a performance hit in its own right, and you will want to minimize them as much as possible. For instance, a dumb scheduler will cause a thread to hop cores when it's scheduled with probability (n_cores - 1) / n_cores, i.e. 50% on a dual-core and a whopping 87.5% on a 4x2 Core i7 (8 logical CPUs).
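
The formula can be evaluated directly. The helper below is a hypothetical illustration that assumes the scheduler picks a CPU uniformly at random each time the thread is scheduled; `hop_probability` is not a name from the thread.

```c
#include <assert.h>

/* Probability that a uniformly random CPU choice differs from the
 * CPU the thread last ran on: (n - 1) / n. */
double hop_probability(int n_cpus)
{
    assert(n_cpus >= 1);
    return (double)(n_cpus - 1) / (double)n_cpus;
}
```

For 2 CPUs this gives 0.5, and for 8 logical CPUs it gives 0.875.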

Once you have solved that problem, you can also predict a cross-core FPU switch and have the old core store its state before the new process ever tries to retrieve it (reducing the total waiting time).

rdos · Post by **rdos** » Wed Oct 12, 2011 11:59 am

As per a previous discussion about cores and threads, the probability that any task will switch core is low (less than 10-20%), but it will happen. Therefore, an FPU task MIGHT switch core, which means that this possibility must be accounted for if FPU state is not saved every time a thread has modified it.

There are other factors involved as well:

1. If FPU state might switch between cores, there is a need to make sure that the logic works under all possible conditions. If this is rare, testing the logic will become very hard. Untested FPU saving logic could become a nightmare.

2. It takes time to do checks for FPU state saving in the mainline scheduler, which is bad if these tests are seldom successful.

rdos · Post by **rdos** » Wed Oct 12, 2011 2:49 pm

It seems to work now.

Task-switcher code:

;
;  AX is new thread to load
;
    test fs:ps_flags,PS_FLAG_FPU
    jz load_fpu_ok     ; Thread did not use FPU in last time-slice
;    
    mov bx,fs:ps_math_thread
    cmp ax,bx
    je load_fpu_ok    ; Same thread as previously
;
    push ds
    mov ds,bx
    mov bx,OFFSET p_math_control
    clts
    db 9Bh, 66h, 0DDh, 37h      ;       32-bit fsave [bx]
    pop ds
;
    lock and fs:ps_flags,NOT PS_FLAG_FPU
;
    mov eax,cr0
    or al,8
    mov cr0,eax        ; set TS flag so next FPU operation faults.

load_fpu_ok:

The FPU exception looks like this:

    GetThread
    mov ds,ax
    mov bx,OFFSET p_math_control
    clts
    db 9Bh, 66h, 0DDh, 27h      ;       32-bit frstor [bx]
;
    mov bx,core_data_sel
    mov ds,bx
    mov ds:ps_math_thread,ax
    lock or ds:ps_flags,PS_FLAG_FPU

gravaera · Post by **gravaera** » Wed Oct 12, 2011 3:39 pm

Yo:

Thanks for pointing this out, I hadn't thought of it myself. The solution that came to me is: assuming you have per-logical-cpu run-queues, a thread will rarely ever be rescheduled to a new logical CPU; however, when a thread is to be rescheduled the general thing is to take it off its CPU's run-queue and place it into the "global" scheduler which will select a new CPU for it.

So when you are taking a task off its current CPU's scheduler queues (migration), something similar to:

if (task is running on current cpu)
{
   if (cpu has task's FPU context) {
      save task FPU context;
   }
}
else
{
   Find out which CPU the task is running on;
   Send a message to that CPU to check if it has that task's context and save it if it does;
   // Synch and wait for IPI to finish.
};

migrate the task off its current CPU.

should suffice. Appreciate any comments I get on that, thanks

Combuster · Post by **Combuster** » Thu Oct 13, 2011 7:16 am

This would be the most straightforward approach as far as I'm concerned:

void restore_fpu_state()
{
    if (!task->fpu_live)
    {
        save_current_fpu_state();
        load_fpu_state(task->fpu_state);
    }
    else
    {
        if (current_cpu() == task->fpu_cpu)
        {
            // do nothing
        }
        else
        {
            send_remote_fpu_ipi(); // you can also do this when a thread with a live FPU state is migrated to a different cpu.
            while (task->fpu_live) 
                waste_time();
            save_current_fpu_state();
            load_fpu_state(task->fpu_state);        
        }
    }
    task->fpu_cpu = current_cpu();
    task->fpu_live = true;
}

EDIT: removed platform-specific code.

rdos · Post by **rdos** » Thu Oct 13, 2011 10:04 am

Combuster wrote: This would be the most straightforward approach as far as I'm concerned

You must do "clts" before accessing the FPU, otherwise the code will risk exceptions.

Besides, the FPU save/restore is not really a procedure; to work effectively it must be implemented as an event handler.

Brendan · Post by **Brendan** » Thu Oct 13, 2011 10:47 am

Hi,

At which point does the overhead of doing the "delayed FPU state loading and saving" become so high that it's simply not worth doing it at all? I'd suggest that as soon as IPIs need to be broadcast to all other CPUs (and maybe even just sent to one other CPU), you've gone beyond that point (especially on systems with lots of CPUs). I'd also suggest that if you've minimised thread migration to reduce the chance of IPIs caused by "delayed FPU state loading and saving" (and you've crippled the scheduler's ability to schedule tasks on available CPUs effectively) then you've also gone too far.

With this in mind, the first thing I'd be doing (for multi-CPU) is saying "FPU state is saved during task switches whenever the previous task used it". That avoids all synchronisation, IPI overhead and task migration problems. It also means I'd only be considering "delayed FPU state loading" (and not saving).

When a "device not available" exception occurs you load the FPU state, and set a flag so the scheduler knows that the FPU state needs to be saved when there's a task switch.

The second step would be tracking how often each task uses the FPU. If you detect that a task uses the FPU most of the time, then you can avoid the overhead of a likely "device not available" exception by pre-loading the FPU state during the task switch.
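
One way to track "uses the FPU most of the time" is a short per-task history of recent time slices. The sketch below is an assumption, not something Brendan specified: `struct task`, `record_slice`, `should_preload_fpu` and the threshold of 6 out of the last 8 slices are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define FPU_PRELOAD_MIN 6   /* hypothetical: preload if used in >= 6 of last 8 slices */

struct task {
    unsigned char fpu_history;  /* one bit per recent time slice, newest in bit 0 */
};

/* Called at the end of each time slice with whether the task touched the FPU. */
void record_slice(struct task *t, bool used_fpu)
{
    assert(t != 0);
    t->fpu_history = (unsigned char)((t->fpu_history << 1) | (used_fpu ? 1u : 0u));
}

/* Counts the set bits in an 8-bit value. */
static int popcount8(unsigned char v)
{
    int n = 0;
    while (v) {
        n += v & 1;
        v >>= 1;
    }
    return n;
}

/* True if the scheduler should restore the FPU state eagerly on switch-in. */
bool should_preload_fpu(const struct task *t)
{
    assert(t != 0);
    return popcount8(t->fpu_history) >= FPU_PRELOAD_MIN;
}
```

One caveat with any such heuristic: once the state is pre-loaded and TS is left clear, the "device not available" exception no longer fires, so the kernel cannot tell from that exception whether the task really used the FPU in that slice. A real implementation would need some other signal, or would have to drop back to lazy loading now and then to re-measure.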

The next step would be "delayed FPU state initialisation". When a task is created, set a flag saying "FPU state not initialised", and if/when the task uses the FPU, initialise the FPU state in the "device not available" exception handler.

Of course all of the above applies to MMX (and SSE and AVX) too.

However, for SSE it may be possible to also use the "OSFXSR" flag in CR4 to detect if a task actually uses SSE; and avoid loading and saving SSE state for tasks that only use FPU/MMX and don't use SSE. When a task is created you'd set a flag saying "FPU state not initialised" and another flag saying "SSE state not initialised". When you get the first "device not available" exception you initialise FPU state, set the "FPU state initialised" flag and return (like before); and if/when you get an "invalid opcode" exception you check if it was an SSE instruction, initialise SSE state and set the "SSE state initialised" flag (and also initialise FPU state if it hasn't already been initialised). If SSE state has been initialised, when you switch to the task you'd set TS and OSFXSR (and any "device not available" exception would cause both FPU and SSE to be loaded, rather than just FPU).

After all that comes AVX, where things get messy (but the basic idea behind "avoid loading and saving SSE state" should work for AVX too, just with XGETBV, XSETBV and XCR0 instead of OSFXSR alone).

Cheers,

Brendan

rdos · Post by **rdos** » Sat Oct 15, 2011 2:36 am

Brendan wrote: With this in mind, the first thing I'd be doing (for multi-CPU) is saying "FPU state is saved during task switches whenever the previous task used it". That avoids all synchronisation, IPI overhead and task migration problems. It also means I'd only be considering "delayed FPU state loading" (and not saving).

Exactly. Avoiding IPIs, especially to many CPUs, is essential both for performance and for ease of debugging and getting it to work under all conditions.

Brendan wrote: When a "device not available" exception occurs you load the FPU state, and set a flag so the scheduler knows that the FPU state needs to be saved when there's a task switch.

The second step would be tracking how often each task uses the FPU. If you detect that a task uses the FPU most of the time, then you can avoid the overhead of a likely "device not available" exception by pre-loading the FPU state during the task switch.

This is my current logic. Except that I also check if the next scheduled thread is identical to the one owning the FPU context.

Brendan wrote: The next step would be "delayed FPU state initialisation". When a task is created, set a flag saying "FPU state not initialised", and if/when the task uses the FPU, initialise the FPU state in the "device not available" exception handler.

I'd initialize the state in the thread control block at thread creation time instead. There is no need to execute the "finit" operation in the new task context anyway. Basically, setting tag-register to 0xFFFF will do it.
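
A sketch of what such a pre-built image might look like for the 32-bit protected-mode FSAVE/FRSTOR layout (108 bytes). The struct and function names are hypothetical. rdos only mentions the tag word; the control word value 0x037F is what FINIT sets, and is added here on the assumption that a complete initial image should match FINIT rather than leave the control word at zero.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 32-bit protected-mode FSAVE/FRSTOR image: 7 dwords + 8 x 10 bytes. */
struct fsave_image {
    uint32_t control_word;   /* only the low 16 bits are used */
    uint32_t status_word;    /* only the low 16 bits are used */
    uint32_t tag_word;       /* only the low 16 bits are used */
    uint32_t fip;            /* FPU instruction pointer offset */
    uint32_t fcs_opcode;     /* instruction pointer selector and opcode */
    uint32_t fdp;            /* FPU operand pointer offset */
    uint32_t fds;            /* operand pointer selector */
    uint8_t  st[8][10];      /* ST(0)..ST(7), 80 bits each */
};

/* Fills in an image equivalent to the state after FINIT, without
 * executing any FPU instruction in the new task's context. */
void fpu_image_init(struct fsave_image *img)
{
    assert(sizeof(struct fsave_image) == 108);
    memset(img, 0, sizeof *img);
    img->control_word = 0x037F;  /* all exceptions masked, 64-bit precision, round to nearest */
    img->tag_word     = 0xFFFF;  /* all eight registers tagged empty */
}
```

The first "device not available" exception in the new thread can then simply frstor this image like any other saved state.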

Brendan wrote: Of course all of the above applies to MMX (and SSE and AVX) too.

However, for SSE it may be possible to also use the "OSFXSR" flag in CR4 to detect if a task actually uses SSE; and avoid loading and saving SSE state for tasks that only use FPU/MMX and don't use SSE. When a task is created you'd set a flag saying "FPU state not initialised" and another flag saying "SSE state not initialised". When you get the first "device not available" exception you initialise FPU state, set the "FPU state initialised" flag and return (like before); and if/when you get an "invalid opcode" exception you check if it was an SSE instruction, initialise SSE state and set the "SSE state initialised" flag (and also initialise FPU state if it hasn't already been initialised). If SSE state has been initialised, when you switch to the task you'd set TS and OSFXSR (and any "device not available" exception would cause both FPU and SSE to be loaded, rather than just FPU).

After all that comes AVX, where things get messy (but the basic idea behind "avoid loading and saving SSE state" should work for AVX too, just with XGETBV, XSETBV and XCR0 instead of OSFXSR alone).

Messy. I don't bother with MMX, SSE and AVX yet. They are not used by any tasks yet.

tom9876543 · Post by **tom9876543** » Mon Nov 14, 2011 12:12 pm

db 9Bh, 66h, 0DDh, 27h ; 32-bit frstor [bx]

Please excuse my ignorance, but why was it necessary to manually encode the instruction? Is there a problem with the assembler?

Combuster · Post by **Combuster** » Mon Nov 14, 2011 1:02 pm

He wrote his own tools, with the obvious consequences

rdos · Post by **rdos** » Mon Nov 14, 2011 3:02 pm

tom9876543 wrote:
db 9Bh, 66h, 0DDh, 27h ; 32-bit frstor [bx]
Please excuse my ignorance, but why was it necessary to manually encode the instruction? Is there a problem with the assembler?

It is an artifact from the time when I used TASM to assemble. It couldn't correctly generate some 32-bit instructions, especially not in 16-bit segments. I think WASM can handle this, but I haven't tested as the code was written before I switched to WASM.

OTOH, I wonder how the operand-size (in a 16-bit segment) would be given to frstor? How would this be coded in NASM for instance?

Combuster · Post by **Combuster** » Mon Nov 14, 2011 3:21 pm

You can override the operand and address size with o16/o32 and a16/a32 when the instruction does not imply any such prefix. You get constructs like o32 frstor [bx] and a32 rep stosw.

rdos · Post by **rdos** » Mon Nov 14, 2011 3:40 pm

Combuster wrote: You can override the operand and address size with o16/o32 and a16/a32 when the instruction does not imply any such prefix. You get constructs like o32 frstor [bx] and a32 rep stosw.

Which more or less corresponds to "db 66h" and "db 67h" with TASM/WASM. The only problem is that the instruction coding starts with a "wait", and the "db 66h" should be after the wait. OTOH, I'm not sure if wait really is needed here.
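
For reference, the hand-coded byte sequences from the earlier posts can be taken apart with a small ModRM decoder. This is a hypothetical helper written for illustration (`decode_modrm` and `dd_mem_mnemonic` are not from the thread); it only covers the two DDh forms that appear here.

```c
#include <assert.h>
#include <stdint.h>

struct modrm { unsigned mod, reg, rm; };

/* Splits a ModRM byte into its mod (bits 7-6), reg (bits 5-3)
 * and rm (bits 2-0) fields. */
struct modrm decode_modrm(uint8_t b)
{
    struct modrm m = { (b >> 6) & 3u, (b >> 3) & 7u, b & 7u };
    return m;
}

/* Mnemonic for opcode DDh with a memory operand, limited to the
 * two forms used in this thread. */
const char *dd_mem_mnemonic(unsigned reg)
{
    assert(reg < 8);
    switch (reg) {
    case 4:  return "frstor";
    case 6:  return "fnsave";
    default: return "(other)";
    }
}
```

Decoding 37h gives mod=0, reg=6, rm=7, and 27h gives mod=0, reg=4, rm=7. With 16-bit addressing, mod=0 and rm=7 select [bx], so the sequences are 9Bh (wait), 66h (operand-size override), then DDh /6 (fnsave) or DDh /4 (frstor). As far as the opcode tables go, fsave is simply wait followed by fnsave, while frstor has no wait-prefixed form, so the leading 9Bh in the frstor sequence appears to be a separate wait instruction rather than part of the encoding.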

rdos · Post by **rdos** » Fri Jan 02, 2015 9:36 am

Turns out the logic doesn't work after all, and the reason is that the task can be scheduled on a new core before the old core loads a new thread. That creates really nasty problems. It takes about a week before this bug triggers in the terminal setup. Changed the logic so the FPU state save is done when the ordinary CPU registers are saved on multicore (if the FPU exception has occurred), and re-inserted the "good" logic on single-core so FPU state is only saved when a new thread starts using the FPU.

Started a new week-long test to see if this fixes this really nasty bug.
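
The ordering that the fix relies on can be sketched as follows. This is a simplified single-threaded model with hypothetical names (`struct core`, `struct thread`, `save_outgoing`), not rdos's code: the point is only that the FPU state is written back in the same step that saves the ordinary registers, before the thread becomes visible to other cores.

```c
#include <assert.h>
#include <stdbool.h>

struct thread {
    int  saved_fpu;     /* stands in for the FSAVE image in the thread block */
    bool runnable;      /* when true, other cores' schedulers may pick the thread */
};

struct core {
    int  fpu_regs;      /* models the physical FPU registers */
    bool fpu_used;      /* set by the exception 7 handler */
    bool ts;            /* models CR0.TS */
};

/* Save path for the outgoing thread: the FPU state is written back
 * before the thread is published as runnable. */
void save_outgoing(struct core *c, struct thread *t)
{
    assert(!t->runnable);            /* must not be claimable yet */
    /* ... ordinary CPU registers would be saved here ... */
    if (c->fpu_used) {
        t->saved_fpu = c->fpu_regs;  /* fsave */
        c->fpu_used = false;
        c->ts = true;                /* next FPU use on this core faults */
    }
    t->runnable = true;              /* publish only after the save */
}
```

In the earlier version the save happened later, when the old core loaded its next thread, which left a window in which another core could start the thread with a stale image. On real hardware the final store would also need whatever compiler or memory barrier the kernel uses when publishing a thread to other cores.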

OSDev.org
