Age | Commit message (Collapse) | Author | Files | Lines |
|
[ Upstream commit 9ca8bf1bde56330075634bd3cb601d0f6ee90514 ]
`msg` is valid pointer here. I don't recall why I added assert here :-(
This is not correct. We shouldn't call assert here. Also we are not using
`msg`. Hence convert it to `__unused`.
Fixes: 19d4f98e ('FSP/NVRAM: Handle "get vNVRAM statistics" command')
Cc: skiboot-stable@lists.ozlabs.org # v5.4.x +
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.ibm.com>
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
|
|
[ Upstream commit 2a63db6511b63a75efe820f90bb7972afc2fcdef ]
FSP IPMI driver serializes ipmi messages. It sends message to FSP and waits
for response before sending new message. It works fine as long as we get
response from FSP on time.
If we have inflight ipmi message during FSP R/R, we will not get resonse
from FSP. So if we initiate inband FSP R/R then all subsequent inband ipmi
message gets blocked.
Sequence:
- ipmitool mc reset cold
- <FSP R/R complete>
- ipmitool <any command> <-- gets blocked
This patch clears inflight ipmi messages after FSP R/R complete.
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Cc: skiboot-stable@lists.ozlabs.org
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
|
|
Commit fd6b71fc fixed the situation where ipmi console was open (hvc0) but got
data on different console (hvc1).
During FSP R/R OPAL closes all consoles. After R/R complete FSP requests to
open hvc1 and sends data on this. If hvc1 registration failed or not opened in
host kernel then it will not read data and results in RCU stalls.
Note that this is workaround for older kernel where we don't have separate irq
for each console. Latest kernel works fine without this patch.
CC: stable
CC: Sam Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit c9cc5ef5772ebfbfff978f1c25763d733e45752c)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Commit c8a7535f (FSP/CONSOLE: Workaround for unresponsive ipmi daemon) added
error logging when buffer is full. In some corner cases kernel may call this
function multiple time and we may endup logging error again and again.
This patch fixes it by generating error log only once. I think this is enough
to indicate something went wrong.
Also with previous patch, once console buffer is full, OPAL is returning error
to payload from fsp_console_write_buffer_space(). So payload will never call
fsp_console_write(). Hence move error logging logic to right place.
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit d798c276b4da7559970702c2f31b12549c92741e)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Kernel calls fsp_console_write_buffer_space() to check console buffer space
availability. If there is enough buffer space to write data, then kernel will
call fsp_console_write() to write actual data.
In some extreme corner cases (like one explained in commit c8a7535f)
console becomes full and this function returns 0 to kernel (or space available
in console buffer < next incoming data size). Kernel will continue retrying
until it gets enough space. So we will start seeing RCU stalls.
This patch keeps track of previous available space. If previous space is same
as current means not enough space in console buffer to write incoming data.
It may be due to very high console write operation and slow response from FSP
-OR- FSP has stopped processing data (ex: because of ipmi daemon died). At this
point we will start timer with timeout of SER_BUFFER_OUT_TIMEOUT (10 secs).
If situation is not improved within 10 seconds means something went bad. Lets
return OPAL_RESOURCE so that kernel can drop console write and continue.
CC: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
CC: Stewart Smith <stewart@linux.vnet.ibm.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
[stewart: reset timeout in fsp_console_write() path]
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit 6557a728385c3fbcef297d2c61c7d93bc539f8bb)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Presently we are not closing SOL and FW console sessions during R/R. Host will
continue to write to SOL buffer during FSP R/R. If there is heavy console write
operation happening during FSP R/R (like running `top` command inside console),
then at some point console buffer becomes full. fsp_console_write_buffer_space()
returns 0 (or less than required space to write data) to host. While one thread
is busy writing to console, if some other threads tries to write data to console
we may see RCU stalls (like below) in kernel.
kernel call trace:
------------------
[ 2082.828363] INFO: rcu_sched detected stalls on CPUs/tasks: { 32} (detected by 16, t=6002 jiffies, g=23154, c=23153, q=254769)
[ 2082.828365] Task dump for CPU 32:
[ 2082.828368] kworker/32:3 R running task 0 4637 2 0x00000884
[ 2082.828375] Workqueue: events dump_work_fn
[ 2082.828376] Call Trace:
[ 2082.828382] [c000000f1633fa00] [c00000000013b6b0] console_unlock+0x570/0x600 (unreliable)
[ 2082.828384] [c000000f1633fae0] [c00000000013ba34] vprintk_emit+0x2f4/0x5c0
[ 2082.828389] [c000000f1633fb60] [c00000000099e644] printk+0x84/0x98
[ 2082.828391] [c000000f1633fb90] [c0000000000851a8] dump_work_fn+0x238/0x250
[ 2082.828394] [c000000f1633fc60] [c0000000000ecb98] process_one_work+0x198/0x4b0
[ 2082.828396] [c000000f1633fcf0] [c0000000000ed3dc] worker_thread+0x18c/0x5a0
[ 2082.828399] [c000000f1633fd80] [c0000000000f4650] kthread+0x110/0x130
[ 2082.828403] [c000000f1633fe30] [c000000000009674] ret_from_kernel_thread+0x5c/0x68
Hence lets close SOL (and FW console) during FSP R/R.
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit 9d1755179112071652bb4a317f9006da630ce25d)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Presently OPAL sends associate/unassociate MBOX command for all
FSP serial console (like below OPAL message). We have to check
console is available or not before sending this message.
OPAL log:
-------
[ 5013.227994012,7] FSP: Reassociating HVSI console 1
[ 5013.227997540,7] FSP: Reassociating HVSI console 2
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit 175e406ac29435017c5bd08b2c45d93b2c5a7669)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Commit 42d5d047 fixed scenario where DPO has been initiated, but FSP went
into reset before the CEC power down came in. But this is generic issue
that can happen in normal shutdown path as well.
Hence disable PSI link as soon as we detect FSP impending R/R.
CC: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
CC: Stewart Smith <stewart@linux.vnet.ibm.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit a4788a49f004a91bb8ca015336abf9ae119fbc52)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
It turns out GCC7 adds a useful warning and does fancy things like
parsing your comments to work out that you intended to do the fallthrough.
There's a few places where we don't match the regex. Fix them, as it's
harmless to do so.
Found by building on Fedora Rawhide in Travis.
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit e58aeb1ca304835d65beed98db7f118d3c154cd4)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
FSP sends MBOX command (cmd : 0xEB, subcmd : 0x05, mod : 0x00) to get vNVRAM
statistics. OPAL doesn't maintain any such statistics. Hence return
FSP_STATUS_INVALID_SUBCMD.
Sample OPAL log:
[16944.384670488,3] FSP: Unhandled message eb0500
[16944.474110465,3] FSP: Unhandled message eb0500
[16945.111280784,3] FSP: Unhandled message eb0500
[16945.293393485,3] FSP: Unhandled message eb0500
With this patch, I don't think FSP will ever call "free vNVRAM" MBOX command.
But to be safer side lets return FSP_STATUS_INVALID_SUBCMD for this MBOX
command as well.
Reported-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Tested-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit 19d4f98e9483e4c1cae1d5a59491d8ab4f9a6e7f)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
OPAL sends MBOX message to FSP and updates message state from fsp_msg_queued
-> fsp_msg_sent. fsp_sync_msg() queues message and waits until we get response
from FSP. During FSP R/R we move outstanding MBOX messages from msgq to rr_queue
including inflight message (fsp_reset_cmdclass()). But we are not resetting
inflight message state.
In extreme croner case where we sent message to FSP via fsp_sync_msg() path
and FSP R/R happens before getting respose from FSP, then we will endup waiting
in fsp_sync_msg() until everything becomes normal.
This patch adds fsp_in_rr() check to fsp_sync_msg() and return error to caller
if FSP is in R/R.
CC: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit c74e88e8614de0a82cba5c30812d5aa39db747a9)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
We use TCE mapped area to write data to console. Console header
(fsp_serbuf_hdr) is modified by both FSP and OPAL (OPAL updates
next_in pointer in fsp_serbuf_hdr and FSP updates next_out pointer).
Kernel makes opal_console_write() OPAL call to write data to console.
OPAL write data to TCE mapped area and sends MBOX command to FSP.
If our console becomes full and we have data to write to console,
we keep on waiting until FSP reads data.
In some corner cases, where FSP is active but not responding to
console MBOX message (due to buggy IPMI) and we have heavy console
write happening from kernel, then eventually our console buffer
becomes full. At this point OPAL starts sending OPAL_BUSY_EVENT to
kernel. Kernel will keep on retrying. This is creating kernel soft
lockups. In some extreme case when every CPU is trying to write to
console, user will not be able to ssh and thinks system is hang.
If we reset FSP or restart IPMI daemon on FSP, system recovers and
everything becomes normal.
This patch adds workaround to above issue by returning OPAL_HARDWARE
when cosole is full. Side effect of this patch is, we may endup dropping
latest console data. But better to drop console data than system hang.
Alternative approach is to drop old data from console buffer, make space
for new data. But in normal condition only FSP can update 'next_out'
pointer and if we touch that pointer, it may introduce some other
race conditions. Hence we decided to just new console write request.
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Acked-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit c8a7535f3539c79955645e6b3714b367a994b1e9)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
For timed out FSP messages, we set message status as "fsp_msg_timeout".
But most FSP driver users (like surviellance) are ignoring this field.
They always look for FSP returned status value in callback function
(second byte in word1). So we endup treating timed out message as success
response from FSP.
Sample output:
[69902.432509048,7] SURV: Sending the heartbeat command to FSP
[70023.226860117,4] FSP: Response from FSP timed out, word0 = d66a00d7, word1 = 0 state: 3
....
[70023.226901445,7] SURV: Received heartbeat acknowledge from FSP
[70023.226903251,3] FSP: fsp_trigger_reset() entry
Here SURV code thought it got valid response from FSP. But actually we didn't
receive response from FSP.
This patch fixes above issue by updating status field in response structure.
CC: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit 4cef4d8d6000936b1a4e1065bf69ee2edd3fcc1f)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Presently we print word0 and word1 in error log. word0 contains
sequence number and command class. One has to understand word0
format to identify command class.
Lets explicitly print command class, sub command etc.
CC: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit 807a3acc8fd66af1e1c6e7154aa5029c9b91bb3b)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Now that we are using fsp_in_rr() to detect FSP reset/reload, fsp_in_reset
become redundant. Lets remove this local variable.
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit a34369631e6d85c26966eb0b8d5e4c44bcf96c7c)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
fsp_opal_rtc_write() checks FSP status before queueing message to FSP. But if
FSP R/R starts before getting response to queued message then we will continue
to return OPAL_BUSY_EVENT to host. In some extreme condition host may
experience hang. Once FSP is back we will repost message, get response from FSP
and return OPAL_SUCCES to host.
This patch caches new values and returns OPAL_SUCCESS if FSP R/R is happening.
And once FSP is back we will send cached value to FSP.
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit f4757fbfcf616365c74b1aa6508b2ab27480cdd0)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Currently fsp-rtc reads/writes the cached RTC TOD on an fsp
reset. Use latest fsp_in_rr() function to properly read the cached rtc
value when fsp reset initiated by the hir.
Below is the kernel trace when we set hw clock, when hir process starts.
[ 1727.775824] NMI watchdog: BUG: soft lockup - CPU#57 stuck for 23s! [hwclock:7688]
[ 1727.775856] Modules linked in: vmx_crypto ibmpowernv ipmi_powernv uio_pdrv_genirq ipmi_devintf powernv_op_panel uio ipmi_msghandler powernv_rng leds_powernv ip_tables x_tables autofs4 ses enclosure scsi_transport_sas crc32c_vpmsum lpfc ipr tg3 scsi_transport_fc
[ 1727.775883] CPU: 57 PID: 7688 Comm: hwclock Not tainted 4.10.0-14-generic #16-Ubuntu
[ 1727.775883] task: c000000fdfdc8400 task.stack: c000000fdfef4000
[ 1727.775884] NIP: c00000000090540c LR: c0000000000846f4 CTR: 000000003006dd70
[ 1727.775885] REGS: c000000fdfef79a0 TRAP: 0901 Not tainted (4.10.0-14-generic)
[ 1727.775886] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>
[ 1727.775889] CR: 28024442 XER: 20000000
[ 1727.775890] CFAR: c00000000008472c SOFTE: 1
GPR00: 0000000030005128 c000000fdfef7c20 c00000000144c900 fffffffffffffff4
GPR04: 0000000028024442 c00000000090540c 9000000000009033 0000000000000000
GPR08: 0000000000000000 0000000031fc4000 c000000000084710 9000000000001003
GPR12: c0000000000846e8 c00000000fba0100
[ 1727.775897] NIP [c00000000090540c] opal_set_rtc_time+0x4c/0xb0
[ 1727.775899] LR [c0000000000846f4] opal_return+0xc/0x48
[ 1727.775899] Call Trace:
[ 1727.775900] [c000000fdfef7c20] [c00000000090540c] opal_set_rtc_time+0x4c/0xb0 (unreliable)
[ 1727.775901] [c000000fdfef7c60] [c000000000900828] rtc_set_time+0xb8/0x1b0
[ 1727.775903] [c000000fdfef7ca0] [c000000000902364] rtc_dev_ioctl+0x454/0x630
[ 1727.775904] [c000000fdfef7d40] [c00000000035b1f4] do_vfs_ioctl+0xd4/0x8c0
[ 1727.775906] [c000000fdfef7de0] [c00000000035bab4] SyS_ioctl+0xd4/0xf0
[ 1727.775907] [c000000fdfef7e30] [c00000000000b184] system_call+0x38/0xe0
[ 1727.775908] Instruction dump:
[ 1727.775909] f821ffc1 39200000 7c832378 91210028 38a10020 39200000 38810028 f9210020
[ 1727.775911] 4bfffe6d e8810020 80610028 4b77f61d <60000000> 7c7f1b78 3860000a 2fbffff4
This is found when executing the testcase
https://github.com/open-power/op-test-framework/blob/master/testcases/fspresetReload.py
With this fix ran fsp hir torture testcase in the above test
which is working fine.
Signed-off-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Reviewed-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit 447ccc4de529f001271fd4dfd78401bc4c90832e)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
CC: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit 740d00b1036188c6e248418fb0a13faf14723e7a)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Trigging a Host Initiated Reset (when the host detects the FSP has gone
out to lunch and should be rebooted), would cause "Unknown Command" messages
to appear in the OPAL log.
This patch implements those messages
How to trigger FSP RR(HIR):
$ putmemproc 300000f8 0x00000000deadbeef
s1 k0:n0:s0:p00
ecmd_ppc putmemproc 300000f8 0x00000000deadbeef
Log showing unknown command:
/ # cat /sys/firmware/opal/msglog | grep -i ,3
[ 110.232114723,3] FSP: fsp_trigger_reset() entry
[ 188.431793837,3] FSP #0: Link down, starting R&R
[ 464.109239162,3] FSP #0: Got XUP with no pending message !
[ 466.340598554,3] FSP-DPO: Unknown command 0xce0900
[ 466.340600126,3] FSP: Unhandled message ce0900
The message we need to handle is "Get PLID after host initiated FipS
reset/reload". When the FSP comes back from HIR, it asks "hey, so, which
error log explains why you rebooted me?". So, we tell it.
Reported-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit f3a5741408a11be6992cf8779f2eae10b08c020a)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
During FSP R/R, the FSP is inaccessible and will lose state. Messages to the
FSP are generally queued for sending later.
It does seem like the FSP fails to process any subseuqent messages of certain
classes (SP info -- ipmi) if it receives queued mbox messages it isn't expecting.
In certain other cases (sensors), the FSP driver returns a default code (async
completion) even though there is no known bound from the time of this error
return to the actual data being available. The kernel driver keeps waiting
leading to soft-lockup on the host side.
Mitigate both these (known) cases by returning OPAL_BUSY so the host driver
knows to retry later.
With this change, the sensors command works fine when the FSP comes back.
This version also resolves the remaining IPMI issues
Signed-off-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Tested-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit 4940b8148640c06e139aec8c6d0370af7dd3b184)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
If we were to terminate in a poller, we'd call op_display() which
called pollers which hit the recursive poller warning, which ended
in not much fun at all.
This patch will skip the running of pollers and instead run
the FSP poller to set the op-panel display before attn.
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit 9fcb109218b1374a8caa3cac62e83fbedb1f7f2f)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
elog_reject_head() routine makes the state 'elog_read_from_fsp_head_state'
either 'ELOG_STATE_REJECTED' or 'ELOG_STATE_NONE' depending on the current
state of 'elog_read_from_fsp_head_state'.
We can remove this elog_reject_head() from 'opal_kexec_elog_notify()' as just
after that it is called inside 'fsp_opal_resend_pending_logs()'. So, it is
redundant inside opal_kexec_elog_notify() routine.
Signed-off-by: Mukesh Ojha <mukesh02@linux.vnet.ibm.com>
Reviewed-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
We use 'elog_enabled' flag to check whether host OS is ready to receive
error log or not. This is nothing to do with reading error log from
service processor.
This patch is to remove the check and keep this 'elog_enabled' free from
FSP specific code and move it into core/errorlog.c in later upcoming patches.
With this changes, in some corner cases we may endup reading same error
log twice from FSP. It happens as we call 'elog_reject_head' inside
'fsp_opal_resend_pending_logs' which makes the state either
'ELOG_STATE_REJECTED' or 'ELOG_STATE_NONE'. So, a call to
'fsp_elog_check_and_fetch_head' routine ends up reading the error
log from FSP which was already read. This case happens twice in a reboot
as whenever 'fsp_opal_resend_pending_logs' gets called.
So, we can ignore it.
Signed-off-by: Mukesh Ojha <mukesh02@linux.vnet.ibm.com>
Reviewed-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Some modifications related to typo errors, alignment, case letter mismatch to add
more clarity to the code.
Signed-off-by: Mukesh Ojha <mukesh02@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
[stewart@linux.vnet.ibm.com: unlock before return (suggested by Mahesh/Andrew),
disable only on non-cancelling fsp codeupdate call (suggested by Vasant)]
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
This is an experimental patch that implements "Fast reboot" on P8
machines.
The basic idea is that when the OS calls OPAL reboot, we gather all
the threads in the system using a combination of patching the reset
vector and soft-resetting them, then cleanup a few bits of hardware
(we do re-probe PCIe for example), and reload & restart the bootloader.
For Trusted Boot, this means we *add* measurements to the TPM, so you
will get *different* PCR values as compared to a full IPL. This makes
sense as if you want to be sure you are running something known then,
well, do a full IPL as soft reset should never be trusted to clear any
malicious code.
This is very experimental and needs a lot of testing and also auditing
code for other bits of HW that might need to be cleaned up.
BenH TODO: I also need to check if we are properly PERST'ing PCI devices.
This is partially based on old code I had to do that on P7. I only
support it on P8 though as there are issues with the PSI interrupts
on P7 that cannot be reliably solved.
Even though this should be considered somewhat experimental, we've had
a lot of success on a variety of machines. Dozens/hundreds of reboots
across Tuleta, Garrison and Habanero.
Currently, we've hidden it behind a NVRAM config option, which *is*
liable to change in the future (to ensure that only those who know
what they're doing enable it)
You can enable the experimental support via nvram option:
nvram -p ibm,skiboot --update-config experimental-fast-reset=feeling-lucky
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[stewart@linux.vnet.ibm.com: hide behind nvram option, include Mambo fixes
from Mikey]
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Allocate an irq number for each hvc console and set its interrupt-parent
property so that Linux can use the opal irqchip instead of the
OPAL_EVENT_CONSOLE_INPUT interface.
Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Presently we set timeout value as soon as we add elog to queue. If
we have multiple elogs to write, it doesn't consider queue wait time.
Instead set timeout value when we are actually sending elog to FSP.
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
OPAL gets elog notification from service processor which contains log
information. Once we get notification we start reading log data and
change elog state to ELOG_STATE_FETCHING.
Hence we don't need ELOG_STATE_FETCHED_INFO state. Lets remove this
variable.
Also in some places we have used this state after sending event information
to host. Replace such usage with better state (ELOG_STATE_HOST_INFO).
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
I was only ever set to true
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
This issue is one of the corner case, which is related to recent change
went upstream and only observed in the petitboot prompt, where we see
only one error log instead of getting all error log in
/sys/firmware/opal/elog.
Below is snippet of the code, where elog module in the kernel
initialised.
{
..
...
rc = request_threaded_irq(irq, NULL, elog_event, =<=======
IRQF_TRIGGER_HIGH | IRQF_ONESHOT, "opal-elog", NULL); |
if (rc) { |
pr_err("%s: Can't request OPAL event irq (%d)\n", |
__func__, rc); |
return rc; |
} |
/* We are now ready to pull error logs from opal. */ |
if (opal_check_token(OPAL_ELOG_RESEND)) |
opal_resend_pending_logs(); =<=======
}
Scenario:
While elog_enabled is true, OPAL_EVENT_ERROR_LOG_AVAIL will be set from
OPAL, whenever it has error logs that are waiting to be fetched from the
kernel.
Race occurs between the code arrowed above, as soon as kernel registers
error log handler, it sees OPAL_EVENT_ERROR_LOG_AVAIL is set, so it
schedule the handler. Which makes 'opal_get_elog_size'(kernel) call on
the error log set the state from ELOG_STATE_FETCHED_DATA to
ELOG_STATE_FETCHED_INFO and clears OPAL_EVENT_ERROR_LOG_AVAIL. During
the same time 'opal_resend_pending_logs'(kernel) call which will set the
state machine from ELOG_STATE_FETCHED_INFO to ELOG_STATE_NONE in OPAL.
Because of that, read call from the kernel, which was to be made after
the 'opal_get_elog_size' ends up failing. But, the elog kobject was
created for the particular error log.
Further in the resend routine in the OPAL, we make opal_commit_elog_in_host()
call that sets OPAL_EVENT_ERROR_LOG_AVAIL. So, Kernel again makes
'opal_get_elog_size' which results in getting the error log info of the
same error log which was fetched earlier. It also changes the state
machine to ELOG_STATE_FETCHED_INFO and clears OPAL_EVENT_ERROR_LOG_AVAIL.
Below is the snippet from the elog_event registered handler call
{
...
...
/* we may get notified twice, let's handle
* that gracefully and not create two conflicting
* entries.
*/
if (kset_find_obj(elog_kset, name))
return IRQ_HANDLED;
...
...
}
In the kernel, we search kobject for the error log whether it already
exist. So kobject is found and it returns without reading error log
data.
So, this patch makes the flag which was true during initialisation
to false. And that solves the race.
Signed-off-by: Mukesh Ojha <mukesh02@linux.vnet.ibm.com>
Reviewed-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
We have used TCE_MASK value (4095) instead of TCE_PSIZE (4096) to align memory
source address. In some corner cases (like source memory size = 4097) we may
endup doing wrong mapping and corrupting part of SYSDUMP.
This patch uses ALIGN_UP macro with TCE_PSIZE value for alignining memory.
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Fix resend logic in opal_resend_pending_logs, so that it actually
restarts sending remaining logs.
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
In some corner cases host may send acknowledgement without
reading actual data (fsp_opal_elog_info -> fsp_opal_elog_ack).
Because of this elog_read_from_fsp_head_state may be stuck in
wrong state (ELOG_STATE_HOST_INFO) and not able to send remaining
ELOG's to host. Hence reset ELOG state and start sending remaining
ELOG's.
Also in normal case we will ACK the logs which are already processed
(elog_read_processed). Hence rearrange the code such that we go
through elog_read_processed first.
Finally return OPAL_PARAMETER if we are not able to find ELOG ID.
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
[stewart@linux.vnet.ibm.com: spelling fix]
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Chances of elog_read_pending inconsistent state is very very
less. Just to be on safer side, disable notification if list
is not in consistent state.
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Reviewed-by: Mukesh Ojha <mukesh02@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
ELOG enables event notification once new log is available. And this
will be disabled after host completes reading logs (it has to complete
both fsp_opal_elog_info and fsp_opal_elog_read).
Ideally we should disable notification as soon as host consumes event
(after fsp_opal_elog_info). Also if host fails to call fsp_opal_elog_read
(ex: situations like duplicate event), then we endup keeping notification
forever.
This patch introduces new ELOG state (ELOG_STATE_HOST_INFO). As soon
as host consumes event elog will move to this new state so that event
notification is disabled.
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
We use elog notifier to notify logs from multiple sources (FSP generated
logs - fsp-elog-read.c and OPAL generated logs - fsp-elog-write.c).
OPAL generated logs sets elog event bit whenever it has new logs to send
to host. But it relies on fsp-elog-read.c to disable the event bit..which
is wrong!
This patch creates common function to enable/disable event notification.
It will enable event notification if any of the source is ready to send
error log to host and disables notification once it completes sending
all errors to host.
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
ELOG enables event notification once new log is available. And this
will be disabled after host completes reading logs (it has to complete
both fsp_opal_elog_info and fsp_opal_elog_read).
In some corner cases like kexec, host may endup reading same ELOG id twice
(calling fsp_opal_elog_info twice because of resend request). Host finds it
as duplicate and it will not read actual log (fsp_opal_elog_read()). In such
situations we fails to disable event notification :-(
Scenario :
OPAL Host
-------------------------------------
OPAL_EVENT_ELOG_AVAIL --> kexec
OPAL_EVENT_ELOG_AVAIL --> elog client registered
<-- read ELOG (id=x)
<-- resend elog (opal_resend_pending_logs())
resend all ELOG --> read ELOG (id=x) -- Duplicate ELOG !
bhoom!!
kernel call trace:
------------------
[ 28.055923] CPU: 10 PID: 20 Comm: irq/29-opal-elo Not tainted 4.4.0-24-generic #43-Ubuntu
[ 28.056012] task: c0000000ef982a20 ti: c0000000efa38000 task.ti: c0000000efa38000
[ 28.056100] NIP: c000000008010a24 LR: c000000008010a24 CTR: 0000000030033758
[ 28.056188] REGS: c0000000efa3b9c0 TRAP: 0901 Not tainted (4.4.0-24-generic)
[ 28.056274] MSR: 9000000100009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 22000844 XER: 20000000
[ 28.056499] CFAR: c000000008009958 SOFTE: 1
GPR00: c000000008131e8c c0000000efa3bc40 c0000000095b4200 0000000000000900
GPR04: c0000000094a63c8 0000000000000001 9000000100009033 0000000000000062
GPR08: 0000000000000000 0000000000000000 c0000000ef960400 9000000100001003
GPR12: c00000000806de48 c00000000fb45f00
[ 28.057042] NIP [c000000008010a24] arch_local_irq_restore+0x74/0x90
[ 28.057117] LR [c000000008010a24] arch_local_irq_restore+0x74/0x90
[ 28.057189] Call Trace:
[ 28.057221] [c0000000efa3bc40] [c0000000f108a980] 0xc0000000f108a980 (unreliable)
[ 28.057326] [c0000000efa3bc60] [c000000008131e8c] irq_finalize_oneshot.part.2+0xbc/0x250
[ 28.057429] [c0000000efa3bcb0] [c000000008132170] irq_thread_fn+0x80/0xa0
[ 28.057519] [c0000000efa3bcf0] [c00000000813263c] irq_thread+0x1ac/0x280
[ 28.057609] [c0000000efa3bd80] [c0000000080e61e0] kthread+0x110/0x130
[ 28.057698] [c0000000efa3be30] [c000000008009538] ret_from_kernel_thread+0x5c/0xa4
[ 28.057799] Instruction dump:
[ 28.057844] 994d02ca 2fa30000 409e0024 e92d0020 61298000 7d210164 38210020 e8010010
[ 28.057995] 7c0803a6 4e800020 60420000 4bff17ad <60000000> 4bffffe4 60420000 e92d0020
This patch adds kexec notifier client. It will disable event notification
during kexec. Once host is ready to receive ELOG's again it will call
fsp_opal_resend_pending_logs(). This call re-enables ELOG notication.
It will fix above issue. I will add follow up patch to improve event state.
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Linux kernels from v4.1 onwards will try to request an irq for each hvc
console using OPAL_EVENT_CONSOLE_INPUT, however because the IRQF_SHARED
flag is not set any console after the first will fail. If there is data
on one of these failed consoles OPAL will set OPAL_EVENT_CONSOLE_INPUT
every time fsp_console_read is called, leading to RCU stalls in the
kernel.
As a workaround for unpatched kernels, cease setting
OPAL_EVENT_CONSOLE_INPUT for consoles that we have noticed are not being
read.
Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
We don't need to validate msg->resp message as its always
allocated.
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Reviewed-by: Mukesh Ojha <mukesh02@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
fsp_allocmsg() returns true even if msg->resp memory allocation fails.
Validate msg->resp memory allocation as well.
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Reviewed-by: Mukesh Ojha <mukesh02@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
In the function __opal_write_oppanel() coverity complains about an out
of bounds array access. While the pointer is never actually dereferenced,
this isn't immediately obvious from the code. Additionally the number and
length of the lines on the operator panel display are hard coded into
the function. While we are here we might as well move these into a #define
statement.
Rework the code in __opal_write_oppanel() where the message is copied into
the buffer so that coverity won't complain about an out of bounds array
access and so that it is line number and length agnostic (now relying on
the #defined values).
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Reviewed-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Only the major ones at this point in time.
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
These errors are essentially assert()s - something has gone wrong and
it's likely because of a bug somewhere. Things we should *never* it
regards to inconsistency, so have FWTS throw warnings on them.
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
This changes trigger_attn() to also enable attn via HID0, so callers
don't have to do it themselves.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Signed-off-by: Philippe Bergheaud <felix@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|