author | Nicholas Piggin <npiggin@gmail.com> | 2020-04-27 21:08:03 +1000
---|---|---
committer | Oliver O'Halloran <oohall@gmail.com> | 2020-06-11 12:52:55 +1000
commit | 2cc897067b87d9983250776739c0a4b2e44578f3 (patch) |
tree | f4a79914a69107c2628b781199a1b55bbb8280fb /include/cpu.h |
parent | 449e1052cbe1cd718b57d5f1e07fc6efa8c1f21d (diff) |
fast-reboot: improve fast reboot sequence
The current fast reboot sequence is not as robust as it could be. It
is this:
- Fast reboot CPU stops all other threads with direct control xscoms;
- it disables ME (machine checks become checkstops);
- resets its SPRs (to get HID[HILE] for machine check interrupts) and
overwrites exception vectors with our vectors, with a special fast
reboot sreset vector that fixes endian (because OS owns HILE);
- then the fast reboot CPU enables ME.
At this point the fast reboot CPU can handle machine checks with the
skiboot handler, but no other cores can if the OS had switched HILE
(they'd execute garbled, byte-swapped instructions and crash badly).
- Then all CPUs run various cleanups, XIVE, resync TOD, etc.
- The boot CPU, which is not necessarily the same as the fast reboot
initiator CPU, runs xive_reset.
This is a lot of code to run, including locking and xscoms, while
machine checks are inoperable.
- Finally secondaries are released and everyone sets SPRs and enables
ME.
Secondaries on other cores don't wait for their thread 0 to set shared
SPRs before calling into the normal OPAL secondary code. This is
mostly okay because the boot CPU pauses here until all secondaries
reach their idle code, but it's not nice to release them out of the
fast reboot code in a state with various per-core SPRs in flux.
Fix this by having the fast reboot CPU neither disable ME nor reset
its SPRs up front, because machine checks can still be handled by the
OS. Instead, wait until all CPUs have been called into fast reboot and
are spinning with ME disabled; only then reset SPRs and copy the
remaining exception vectors. At that point skiboot has taken over
machine check handling, and the CPUs enable ME before cleaning up
everything else.
This way, the region with ME disabled and SPRs and exception vectors
in flux is kept absolutely minimal, with no xscoms, no MMIOs, and few
significant memory modifications, and all threads kept closely in step.
There are no windows where a machine check interrupt may execute
garbage due to mismatched HILE on any CPU.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
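
To make the ordering concrete, here is a minimal C sketch of the rendezvous
described above. The helper names (disable_machine_check, enable_machine_check,
reset_core_sprs, copy_exception_vectors), the counter and flag, and the function
itself are assumptions for illustration only, not skiboot's actual code; the
point is only the shape of the minimized ME-disabled window.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical platform hooks: stand-ins for the real MSR[ME], SPR and
 * exception-vector work, not skiboot functions. */
extern void disable_machine_check(void);  /* clear MSR[ME]: MCs now checkstop */
extern void enable_machine_check(void);   /* set MSR[ME] again */
extern void reset_core_sprs(void);        /* HID[HILE] and friends */
extern void copy_exception_vectors(void); /* install skiboot's vectors at 0x0 */

static uint32_t arrived;   /* CPUs now spinning here with ME disabled */
static bool released;      /* initiator's "vectors are ours, go" flag */

static inline void cpu_relax(void)
{
	__asm__ volatile("" ::: "memory"); /* compiler barrier while spinning */
}

void fast_reboot_rendezvous(bool initiator, uint32_t total_cpus)
{
	/* Each CPU keeps ME enabled (the OS still owns machine checks)
	 * right up to this point, then drops into the spin loop. */
	disable_machine_check();
	__atomic_add_fetch(&arrived, 1, __ATOMIC_SEQ_CST);

	if (initiator) {
		/* Wait until *every* CPU is spinning here: no xscoms, no
		 * MMIOs, nothing that could machine check while vectors
		 * and HILE are in flux. */
		while (__atomic_load_n(&arrived, __ATOMIC_SEQ_CST) < total_cpus)
			cpu_relax();
		copy_exception_vectors(); /* skiboot now owns MC handling */
		__atomic_store_n(&released, true, __ATOMIC_SEQ_CST);
	} else {
		while (!__atomic_load_n(&released, __ATOMIC_SEQ_CST))
			cpu_relax();
	}

	reset_core_sprs();      /* all threads do this closely in step */
	enable_machine_check(); /* safe: vectors and HILE now match */

	/* ...then XIVE cleanup, TOD resync, releasing secondaries, etc. */
}
```

The structure keeps nothing but the spin loop inside the window where vectors
and HILE are inconsistent, which is the property the commit message argues for.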
Diffstat (limited to 'include/cpu.h')
-rw-r--r-- | include/cpu.h | 1 |
1 file changed, 1 insertion, 0 deletions
diff --git a/include/cpu.h b/include/cpu.h
index f142a4f..c90b961 100644
--- a/include/cpu.h
+++ b/include/cpu.h
@@ -21,6 +21,7 @@ enum cpu_thread_state {
 	cpu_state_no_cpu = 0,		/* Nothing there */
 	cpu_state_unknown,		/* In PACA, not called in yet */
 	cpu_state_unavailable,		/* Not available */
+	cpu_state_fast_reboot_entry,	/* Called back into OPAL, real mode */
 	cpu_state_present,		/* Assumed to spin in asm entry */
 	cpu_state_active,		/* Secondary called in */
 	cpu_state_os,			/* Under OS control */
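
The hunk only adds the new cpu_state_fast_reboot_entry value. As a hypothetical
illustration of how such a state could back the "wait until all CPUs are called
into fast reboot" step, here is a small poll loop; the struct cpu_thread layout,
the MAX_CPUS bound, and the function are invented for this sketch and are not
skiboot's real data structures.

```c
#include <stdbool.h>

/* The enum mirrors the hunk above; everything else below is illustrative. */
enum cpu_thread_state {
	cpu_state_no_cpu = 0,		/* Nothing there */
	cpu_state_unknown,		/* In PACA, not called in yet */
	cpu_state_unavailable,		/* Not available */
	cpu_state_fast_reboot_entry,	/* Called back into OPAL, real mode */
	cpu_state_present,		/* Assumed to spin in asm entry */
	cpu_state_active,		/* Secondary called in */
	cpu_state_os,			/* Under OS control */
};

#define MAX_CPUS 2048			/* illustrative bound */

struct cpu_thread {
	enum cpu_thread_state state;
	bool present;
};

static struct cpu_thread cpus[MAX_CPUS];

/* Returns true once every present CPU has reported itself back in OPAL
 * for fast reboot, i.e. the rendezvous condition the initiator waits on. */
static bool all_cpus_in_fast_reboot(void)
{
	for (int i = 0; i < MAX_CPUS; i++) {
		if (cpus[i].present &&
		    cpus[i].state != cpu_state_fast_reboot_entry)
			return false;
	}
	return true;
}
```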