author | Nicholas Piggin <npiggin@gmail.com> | 2020-04-27 21:08:03 +1000
---|---|---
committer | Oliver O'Halloran <oohall@gmail.com> | 2020-06-11 12:52:55 +1000
commit | 2cc897067b87d9983250776739c0a4b2e44578f3 (patch) |
tree | f4a79914a69107c2628b781199a1b55bbb8280fb /include/cpu.h |
parent | 449e1052cbe1cd718b57d5f1e07fc6efa8c1f21d (diff) |
fast-reboot: improve fast reboot sequence
The current fast reboot sequence is not as robust as it could be. It
is this:
- Fast reboot CPU stops all other threads with direct control xscoms;
- it disables ME (machine checks become checkstops);
- resets its SPRs (to get HID[HILE] for machine check interrupts) and
overwrites exception vectors with our vectors, with a special fast
reboot sreset vector that fixes endian (because OS owns HILE);
- then the fast reboot CPU enables ME.
At this point the fast reboot CPU can handle machine checks with the
skiboot handler, but no other cores can if the OS had switched HILE
(they'd execute garbled, byte-swapped instructions and crash badly).
- Then all CPUs run various cleanups, XIVE, resync TOD, etc.
- The boot CPU, which is not necessarily the same as the fast reboot
initiator CPU, runs xive_reset.
This is a lot of code to run, including locking and xscoms, while
machine checks are inoperable.
- Finally secondaries are released and everyone sets SPRs and enables
ME.
Secondaries on other cores don't wait for their thread 0 to set shared
SPRs before calling into the normal OPAL secondary code. This is
mostly okay because the boot CPU pauses here until all secondaries
reach their idle code, but it's not nice to release them out of the
fast reboot code in a state with various per-core SPRs in flux.
Fix this by having the fast reboot CPU neither disable ME nor reset
its SPRs up front, because machine checks can still be handled by the
OS. Instead, wait until all CPUs have been called into fast reboot and
are spinning with ME disabled; only then reset SPRs and copy the
remaining exception vectors. At that point skiboot has taken over
machine check handling, and the CPUs enable ME before cleaning up
everything else.
This way, the region with ME disabled and SPRs and exception vectors
in flux is kept absolutely minimal, with no xscoms, no MMIOs, and few
significant memory modifications, and all threads kept closely in step.
There are no windows where a machine check interrupt may execute
garbage due to mismatched HILE on any CPU.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
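
To make the ordering concrete, here is a minimal C sketch of the rendezvous
described above. The helper names (disable_machine_check, enable_machine_check,
reset_core_sprs, copy_exception_vectors), the counter and flag, and the function
itself are assumptions for illustration only, not skiboot's actual code; the
point is only the shape of the minimized ME-disabled window.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical platform hooks: stand-ins for the real MSR[ME], SPR and
 * exception-vector work, not skiboot functions. */
extern void disable_machine_check(void);  /* clear MSR[ME]: MCs now checkstop */
extern void enable_machine_check(void);   /* set MSR[ME] again */
extern void reset_core_sprs(void);        /* HID[HILE] and friends */
extern void copy_exception_vectors(void); /* install skiboot's vectors at 0x0 */

static uint32_t arrived;   /* CPUs now spinning here with ME disabled */
static bool released;      /* initiator's "vectors are ours, go" flag */

static inline void cpu_relax(void)
{
	__asm__ volatile("" ::: "memory"); /* compiler barrier while spinning */
}

void fast_reboot_rendezvous(bool initiator, uint32_t total_cpus)
{
	/* Each CPU keeps ME enabled (the OS still owns machine checks)
	 * right up to this point, then drops into the spin loop. */
	disable_machine_check();
	__atomic_add_fetch(&arrived, 1, __ATOMIC_SEQ_CST);

	if (initiator) {
		/* Wait until *every* CPU is spinning here: no xscoms, no
		 * MMIOs, nothing that could machine check while vectors
		 * and HILE are in flux. */
		while (__atomic_load_n(&arrived, __ATOMIC_SEQ_CST) < total_cpus)
			cpu_relax();
		copy_exception_vectors(); /* skiboot now owns MC handling */
		__atomic_store_n(&released, true, __ATOMIC_SEQ_CST);
	} else {
		while (!__atomic_load_n(&released, __ATOMIC_SEQ_CST))
			cpu_relax();
	}

	reset_core_sprs();      /* all threads do this closely in step */
	enable_machine_check(); /* safe: vectors and HILE now match */

	/* ...then XIVE cleanup, TOD resync, releasing secondaries, etc. */
}
```

The structure keeps nothing but the spin loop inside the window where vectors
and HILE are inconsistent, which is the property the commit message argues for.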
Diffstat (limited to 'include/cpu.h')
-rw-r--r-- | include/cpu.h | 1 |
1 file changed, 1 insertion, 0 deletions
diff --git a/include/cpu.h b/include/cpu.h
index f142a4f..c90b961 100644
--- a/include/cpu.h
+++ b/include/cpu.h
@@ -21,6 +21,7 @@ enum cpu_thread_state {
 	cpu_state_no_cpu = 0,		/* Nothing there */
 	cpu_state_unknown,		/* In PACA, not called in yet */
 	cpu_state_unavailable,		/* Not available */
+	cpu_state_fast_reboot_entry,	/* Called back into OPAL, real mode */
 	cpu_state_present,		/* Assumed to spin in asm entry */
 	cpu_state_active,		/* Secondary called in */
 	cpu_state_os,			/* Under OS control */
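
The hunk only adds the new cpu_state_fast_reboot_entry value. As a hypothetical
illustration of how such a state could back the "wait until all CPUs are called
into fast reboot" step, here is a small poll loop; the struct cpu_thread layout,
the MAX_CPUS bound, and the function are invented for this sketch and are not
skiboot's real data structures.

```c
#include <stdbool.h>

/* The enum mirrors the hunk above; everything else below is illustrative. */
enum cpu_thread_state {
	cpu_state_no_cpu = 0,		/* Nothing there */
	cpu_state_unknown,		/* In PACA, not called in yet */
	cpu_state_unavailable,		/* Not available */
	cpu_state_fast_reboot_entry,	/* Called back into OPAL, real mode */
	cpu_state_present,		/* Assumed to spin in asm entry */
	cpu_state_active,		/* Secondary called in */
	cpu_state_os,			/* Under OS control */
};

#define MAX_CPUS 2048			/* illustrative bound */

struct cpu_thread {
	enum cpu_thread_state state;
	bool present;
};

static struct cpu_thread cpus[MAX_CPUS];

/* Returns true once every present CPU has reported itself back in OPAL
 * for fast reboot, i.e. the rendezvous condition the initiator waits on. */
static bool all_cpus_in_fast_reboot(void)
{
	for (int i = 0; i < MAX_CPUS; i++) {
		if (cpus[i].present &&
		    cpus[i].state != cpu_state_fast_reboot_entry)
			return false;
	}
	return true;
}
```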