author    | Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> | 2019-03-22 20:28:00 +0530
committer | oohal <oohal@users.noreply.github.com> | 2019-05-09 17:04:30 +1000
commit    | ee8837ade2a4b886a53cef6fa5055b65e785753f
tree      | 9f4eaa132f88b6b9c8d061cffde82cdaf337ece7
parent    | 65a36d64be2a9ca5af305ff67b2e08ebbef78e8c
Fix hang in pnv_platform_error_reboot path due to TOD failure.
[ Upstream commit 2c71d7032484f4267ca425b5a58c5a6f06a2eaf3 ]
On TOD failure, with the TB stuck, when Linux heads down the
pnv_platform_error_reboot() path due to an unrecoverable HMI event, the
panicking CPU gets stuck in OPAL inside ipmi_queue_msg_sync(). At this point
all the other CPUs are in smp_handle_nmi_ipi(), waiting for the panicking CPU
to proceed. But with the panicking CPU stuck inside OPAL, Linux never
recovers or reboots.
p0 c1 t0
NIA : 0x000000003001dd3c <.time_wait+0x64>
CFAR : 0x000000003001dce4 <.time_wait+0xc>
MSR : 0x9000000002803002
LR : 0x000000003002ecf8 <.ipmi_queue_msg_sync+0xec>
STACK: SP NIA
0x0000000031c236e0 0x0000000031c23760 (big-endian)
0x0000000031c23760 0x000000003002ecf8 <.ipmi_queue_msg_sync+0xec>
0x0000000031c237f0 0x00000000300aa5f8 <.hiomap_queue_msg_sync+0x7c>
0x0000000031c23880 0x00000000300aaadc <.hiomap_window_move+0x150>
0x0000000031c23950 0x00000000300ab1d8 <.ipmi_hiomap_write+0xcc>
0x0000000031c23a90 0x00000000300a7b18 <.blocklevel_raw_write+0xbc>
0x0000000031c23b30 0x00000000300a7c34 <.blocklevel_write+0xfc>
0x0000000031c23bf0 0x0000000030030be0 <.flash_nvram_write+0xd4>
0x0000000031c23c90 0x000000003002c128 <.opal_write_nvram+0xd0>
0x0000000031c23d20 0x00000000300051e4 <opal_entry+0x134>
0xc000001fea6e7870 0xc0000000000a9060 <opal_nvram_write+0x80>
0xc000001fea6e78c0 0xc000000000030b84 <nvram_write_os_partition+0x94>
0xc000001fea6e7960 0xc0000000000310b0 <nvram_pstore_write+0xb0>
0xc000001fea6e7990 0xc0000000004792d4 <pstore_dump+0x1d4>
0xc000001fea6e7ad0 0xc00000000018a570 <kmsg_dump+0x140>
0xc000001fea6e7b40 0xc000000000028e5c <panic_flush_kmsg_end+0x2c>
0xc000001fea6e7b60 0xc0000000000a7168 <pnv_platform_error_reboot+0x68>
0xc000001fea6e7bd0 0xc0000000000ac9b8 <hmi_event_handler+0x1d8>
0xc000001fea6e7c80 0xc00000000012d6c8 <process_one_work+0x1b8>
0xc000001fea6e7d20 0xc00000000012da28 <worker_thread+0x88>
0xc000001fea6e7db0 0xc0000000001366f4 <kthread+0x164>
0xc000001fea6e7e20 0xc00000000000b65c <ret_from_kernel_thread+0x5c>
This is because there is a while loop towards the end of
ipmi_queue_msg_sync() which keeps looping until "sync_msg" no longer matches
"msg". It loops over time_wait_ms() until the exit condition is met. In the
normal scenario the time_wait_ms() calls run the pollers, so the IPMI backend
gets a chance to check the IPMI response and set sync_msg to NULL.

	while (sync_msg == msg)
		time_wait_ms(10);
But when the TB is in a failed state, time_wait_ms()->time_wait_poll()
returns immediately without calling the pollers, and hence we end up looping
forever. This patch fixes the hang by calling opal_run_pollers() in the
TB-failed state as well.
Fixes: 1764f2452 ("opal: Fix hang in time_wait* calls on HMI for TB errors.")
Cc: skiboot-stable@lists.ozlabs.org
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
-rw-r--r-- | core/timebase.c | 10
1 file changed, 10 insertions(+), 0 deletions(-)
diff --git a/core/timebase.c b/core/timebase.c
index ce958d1..d821b33 100644
--- a/core/timebase.c
+++ b/core/timebase.c
@@ -28,6 +28,16 @@ static void time_wait_poll(unsigned long duration)
 	unsigned long period = msecs_to_tb(5);
 
 	if (this_cpu()->tb_invalid) {
+		/*
+		 * Run pollers to allow some backends to process response.
+		 *
+		 * In TOD failure case where TOD is unrecoverable, running
+		 * pollers allows ipmi backend to deal with ipmi response
+		 * from bmc and helps ipmi_queue_msg_sync() to get un-stuck.
+		 * Thus it avoids linux kernel to hang during panic due to
+		 * TOD failure.
+		 */
+		opal_run_pollers();
 		cpu_relax();
 		return;
 	}