|
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
We use a TCE-mapped area to write data to the console. The console header
(fsp_serbuf_hdr) is modified by both FSP and OPAL (OPAL updates the
next_in pointer in fsp_serbuf_hdr and FSP updates the next_out pointer).
The kernel makes the opal_console_write() OPAL call to write data to the
console. OPAL writes the data to the TCE-mapped area and sends an MBOX
command to FSP.
If our console becomes full and we have data to write to it,
we keep waiting until FSP reads data.
In some corner cases, where FSP is active but not responding to
console MBOX messages (due to a buggy IPMI) and the kernel is writing
heavily to the console, our console buffer eventually
becomes full. At this point OPAL starts returning OPAL_BUSY_EVENT to the
kernel, and the kernel keeps retrying. This causes kernel soft
lockups. In the extreme case where every CPU is trying to write to the
console, the user cannot ssh in and concludes that the system has hung.
If we reset FSP or restart the IPMI daemon on FSP, the system recovers and
everything returns to normal.
This patch works around the issue by returning OPAL_HARDWARE
when the console is full. The side effect is that we may end up dropping
the latest console data, but dropping console data is better than a hung
system.
An alternative approach is to drop old data from the console buffer to make
space for new data. But under normal conditions only FSP can update the
'next_out' pointer, and if we touch that pointer we may introduce other
race conditions. Hence we decided to just drop the new console write request.
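The behaviour described above can be sketched roughly as follows. This is only an illustration, not the skiboot code: the `sk_` names, the tiny buffer size and the numeric return values are assumptions made for the example. The point is that a full buffer makes the write fail immediately, and the writer never touches `next_out`.

```c
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for OPAL return codes; the values are illustrative only. */
enum { SK_SUCCESS = 0, SK_HARDWARE = -6 };

#define SK_BUF_SIZE 8u  /* hypothetical, tiny so that "full" is easy to reach */

struct sk_serbuf {
    uint32_t next_in;   /* advanced by the writer (OPAL in the text above) */
    uint32_t next_out;  /* advanced only by the reader (the FSP) */
    uint8_t  data[SK_BUF_SIZE];
};

/* Free bytes; one slot stays unused so that full and empty differ. */
static uint32_t sk_space(const struct sk_serbuf *sb)
{
    return (sb->next_out + SK_BUF_SIZE - sb->next_in - 1) % SK_BUF_SIZE;
}

/* Write as much as fits; fail at once when there is no space at all. */
static int sk_console_write(struct sk_serbuf *sb, const uint8_t *buf, size_t *len)
{
    uint32_t space = sk_space(sb);
    size_t n = *len;

    if (space == 0) {
        *len = 0;
        return SK_HARDWARE;  /* new data is dropped, next_out is left alone */
    }
    if (n > space)
        n = space;
    for (size_t i = 0; i < n; i++) {
        sb->data[sb->next_in] = buf[i];
        sb->next_in = (sb->next_in + 1) % SK_BUF_SIZE;
    }
    *len = n;
    return SK_SUCCESS;
}
```

Because the caller gets a hard error rather than a busy code, it has no reason to spin on the call while the reader is unresponsive.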
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Acked-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit c8a7535f3539c79955645e6b3714b367a994b1e9)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
For timed out FSP messages, we set the message status to "fsp_msg_timeout".
But most FSP driver users (like surveillance) ignore this field.
They always look at the FSP-returned status value in the callback function
(second byte in word1). So we end up treating a timed out message as a
success response from FSP.
Sample output:
[69902.432509048,7] SURV: Sending the heartbeat command to FSP
[70023.226860117,4] FSP: Response from FSP timed out, word0 = d66a00d7, word1 = 0 state: 3
....
[70023.226901445,7] SURV: Received heartbeat acknowledge from FSP
[70023.226903251,3] FSP: fsp_trigger_reset() entry
Here the SURV code thought it got a valid response from FSP, but we never
actually received one.
This patch fixes the above issue by updating the status field in the response structure.
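A minimal sketch of the idea, under stated assumptions: "second byte in word1" is taken to mean bits 8..15 of word1, and `SK_STATUS_TIMEOUT` is a made-up non-zero value. Neither is confirmed by this log, and none of the `sk_` names exist in skiboot.

```c
#include <stdint.h>

/* Hypothetical status value used to mark a timed-out response. */
#define SK_STATUS_TIMEOUT 0xffu

/* Assumption: the status byte lives in bits 8..15 of word1. */
static uint8_t sk_resp_status(uint32_t word1)
{
    return (uint8_t)((word1 >> 8) & 0xffu);
}

/* Overwrite only the status byte, leaving the rest of word1 intact. */
static uint32_t sk_mark_timed_out(uint32_t word1)
{
    return (word1 & ~0xff00u) | ((uint32_t)SK_STATUS_TIMEOUT << 8);
}

/* What a callback that only checks the status byte concludes. */
static int sk_callback_sees_success(uint32_t word1)
{
    return sk_resp_status(word1) == 0;
}
```

With word1 = 0, as in the sample output, the callback sees success unless the timeout path rewrites the status byte first.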
CC: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit 4cef4d8d6000936b1a4e1065bf69ee2edd3fcc1f)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Presently we print word0 and word1 in the error log. word0 contains the
sequence number and command class. One has to understand the word0
format to identify the command class.
Let's explicitly print the command class, sub command etc.
CC: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit 807a3acc8fd66af1e1c6e7154aa5029c9b91bb3b)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Now that we are using fsp_in_rr() to detect FSP reset/reload, fsp_in_reset
has become redundant. Let's remove this local variable.
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit a34369631e6d85c26966eb0b8d5e4c44bcf96c7c)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
fsp_opal_rtc_write() checks FSP status before queueing a message to FSP. But if
FSP R/R starts before we get a response to the queued message, we will continue
to return OPAL_BUSY_EVENT to the host. In some extreme conditions the host may
experience a hang. Once FSP is back we repost the message, get a response from
FSP and return OPAL_SUCCESS to the host.
This patch caches the new value and returns OPAL_SUCCESS if FSP R/R is happening.
Once FSP is back we send the cached value to FSP.
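As a rough sketch of the caching approach (assumed names and a simulated FSP, not the actual fsp-rtc code), a write during R/R is recorded and reported as successful, then flushed when the FSP returns:

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-in for an OPAL return code; the value is illustrative only. */
enum { SK_SUCCESS = 0 };

struct sk_rtc {
    bool     fsp_in_rr;      /* stand-in for what fsp_in_rr() would report */
    bool     write_pending;  /* a cached value still has to reach the FSP */
    uint64_t cached_tod;     /* last value the host asked us to set */
    uint64_t fsp_tod;        /* what the (simulated) FSP currently holds */
};

/* Host-facing write: never blocks on an FSP that is in reset/reload. */
static int sk_rtc_write(struct sk_rtc *r, uint64_t tod)
{
    r->cached_tod = tod;
    if (r->fsp_in_rr) {
        r->write_pending = true;  /* defer, but tell the host it worked */
        return SK_SUCCESS;
    }
    r->fsp_tod = tod;
    return SK_SUCCESS;
}

/* Called when the FSP is back: push the deferred value, if any. */
static void sk_rtc_rr_complete(struct sk_rtc *r)
{
    r->fsp_in_rr = false;
    if (r->write_pending) {
        r->fsp_tod = r->cached_tod;
        r->write_pending = false;
    }
}
```

The trade-off in this sketch is that the host is told the write succeeded before the FSP has actually seen it.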
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit f4757fbfcf616365c74b1aa6508b2ab27480cdd0)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Currently fsp-rtc reads/writes the cached RTC TOD on an FSP
reset. Use the newer fsp_in_rr() function so that the cached RTC
value is also read properly when the FSP reset is initiated by the host (HIR).
Below is the kernel trace seen when setting the hardware clock as the HIR process starts.
[ 1727.775824] NMI watchdog: BUG: soft lockup - CPU#57 stuck for 23s! [hwclock:7688]
[ 1727.775856] Modules linked in: vmx_crypto ibmpowernv ipmi_powernv uio_pdrv_genirq ipmi_devintf powernv_op_panel uio ipmi_msghandler powernv_rng leds_powernv ip_tables x_tables autofs4 ses enclosure scsi_transport_sas crc32c_vpmsum lpfc ipr tg3 scsi_transport_fc
[ 1727.775883] CPU: 57 PID: 7688 Comm: hwclock Not tainted 4.10.0-14-generic #16-Ubuntu
[ 1727.775883] task: c000000fdfdc8400 task.stack: c000000fdfef4000
[ 1727.775884] NIP: c00000000090540c LR: c0000000000846f4 CTR: 000000003006dd70
[ 1727.775885] REGS: c000000fdfef79a0 TRAP: 0901 Not tainted (4.10.0-14-generic)
[ 1727.775886] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>
[ 1727.775889] CR: 28024442 XER: 20000000
[ 1727.775890] CFAR: c00000000008472c SOFTE: 1
GPR00: 0000000030005128 c000000fdfef7c20 c00000000144c900 fffffffffffffff4
GPR04: 0000000028024442 c00000000090540c 9000000000009033 0000000000000000
GPR08: 0000000000000000 0000000031fc4000 c000000000084710 9000000000001003
GPR12: c0000000000846e8 c00000000fba0100
[ 1727.775897] NIP [c00000000090540c] opal_set_rtc_time+0x4c/0xb0
[ 1727.775899] LR [c0000000000846f4] opal_return+0xc/0x48
[ 1727.775899] Call Trace:
[ 1727.775900] [c000000fdfef7c20] [c00000000090540c] opal_set_rtc_time+0x4c/0xb0 (unreliable)
[ 1727.775901] [c000000fdfef7c60] [c000000000900828] rtc_set_time+0xb8/0x1b0
[ 1727.775903] [c000000fdfef7ca0] [c000000000902364] rtc_dev_ioctl+0x454/0x630
[ 1727.775904] [c000000fdfef7d40] [c00000000035b1f4] do_vfs_ioctl+0xd4/0x8c0
[ 1727.775906] [c000000fdfef7de0] [c00000000035bab4] SyS_ioctl+0xd4/0xf0
[ 1727.775907] [c000000fdfef7e30] [c00000000000b184] system_call+0x38/0xe0
[ 1727.775908] Instruction dump:
[ 1727.775909] f821ffc1 39200000 7c832378 91210028 38a10020 39200000 38810028 f9210020
[ 1727.775911] 4bfffe6d e8810020 80610028 4b77f61d <60000000> 7c7f1b78 3860000a 2fbffff4
This was found when executing the testcase
https://github.com/open-power/op-test-framework/blob/master/testcases/fspresetReload.py
With this fix, the FSP HIR torture testcase in the above test
runs fine.
Signed-off-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Reviewed-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit 447ccc4de529f001271fd4dfd78401bc4c90832e)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
CC: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit 740d00b1036188c6e248418fb0a13faf14723e7a)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Triggering a Host Initiated Reset (when the host detects the FSP has gone
out to lunch and should be rebooted) would cause "Unknown Command" messages
to appear in the OPAL log.
This patch handles those messages.
How to trigger FSP RR(HIR):
$ putmemproc 300000f8 0x00000000deadbeef
s1 k0:n0:s0:p00
ecmd_ppc putmemproc 300000f8 0x00000000deadbeef
Log showing unknown command:
/ # cat /sys/firmware/opal/msglog | grep -i ,3
[ 110.232114723,3] FSP: fsp_trigger_reset() entry
[ 188.431793837,3] FSP #0: Link down, starting R&R
[ 464.109239162,3] FSP #0: Got XUP with no pending message !
[ 466.340598554,3] FSP-DPO: Unknown command 0xce0900
[ 466.340600126,3] FSP: Unhandled message ce0900
The message we need to handle is "Get PLID after host initiated FipS
reset/reload". When the FSP comes back from HIR, it asks "hey, so, which
error log explains why you rebooted me?". So, we tell it.
Reported-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit f3a5741408a11be6992cf8779f2eae10b08c020a)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
When interacting with an I2C master the p8-i2c driver (common to p9)
acquires a per-master lock which it holds for the duration of its
interaction with the master. Unfortunately, when
p8_i2c_check_initial_status() detects that the master is busy with
another transaction it drops the lock and returns OPAL_BUSY. This is
contrary to the driver's locking strategy which requires that the
caller acquire and drop the lock. This leads to a crash due to the
double unlock(), which skiboot treats as fatal.
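The locking rule can be illustrated with a small sketch. Everything here is a simplified assumption (a boolean in place of a real lock, `sk_` names, a counter in place of the fatal abort). The helper runs with the lock held and returns with it still held, so only the caller unlocks.

```c
#include <stdbool.h>

/* Local stand-ins for OPAL return codes; the values are illustrative only. */
enum { SK_SUCCESS = 0, SK_BUSY = -2 };

struct sk_master {
    bool locked;
    bool busy;           /* master is in the middle of another transaction */
    int  unlock_errors;  /* unlocks of a lock that was not held */
};

static void sk_lock(struct sk_master *m)
{
    m->locked = true;
}

static void sk_unlock(struct sk_master *m)
{
    if (!m->locked)
        m->unlock_errors++;  /* the text says skiboot treats this as fatal */
    m->locked = false;
}

/* Helper: called with the lock held, returns with the lock still held. */
static int sk_check_initial_status(struct sk_master *m)
{
    if (m->busy)
        return SK_BUSY;  /* no unlock here, the caller owns the lock */
    return SK_SUCCESS;
}

/* Caller: the only place where the lock is taken and dropped. */
static int sk_start_request(struct sk_master *m)
{
    int rc;

    sk_lock(m);
    rc = sk_check_initial_status(m);
    sk_unlock(m);
    return rc;
}
```

If the helper also called `sk_unlock()` on the busy path, the caller's unlock would hit an already released lock, which is the double unlock described above.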
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit bb192fd55ffb20d619101c5e3e1f4fd24f844d11)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Long existing typo of r5 rather than r6, meaning we were storing CTR
instead of LR.
Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit d55194c5d9ada77eee2c9a69814708304f34d334)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
When saving the CTR and LR registers the skiboot exception handlers use the
'stw' instruction which only saves the lower 32 bits of the register. Given
these are both 64 bit registers this leads to some strange register dumps,
for example:
***********************************************
Unexpected exception 200 !
SRR0 : 0000000030016968 SRR1 : 9000000000201000
HSRR0: 0000000000000180 HSRR1: 9000000000001000
LR : 3003438830823f50 CTR : 3003438800000018
CFAR : 00000000300168fc
CR : 40004208 XER: 00000000
In this dump the upper 32 bits of LR and CTR are actually stack gunk
which obscures the underlying issue.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Acked-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit 70bc370883330c8b1076555c126647a3cdf88706)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
During FSP R/R, the FSP is inaccessible and will lose state. Messages to the
FSP are generally queued for sending later.
It does seem like the FSP fails to process any subsequent messages of certain
classes (SP info -- ipmi) if it receives queued mbox messages it isn't expecting.
In certain other cases (sensors), the FSP driver returns a default code (async
completion) even though there is no known bound from the time of this error
return to the actual data being available. The kernel driver keeps waiting
leading to soft-lockup on the host side.
Mitigate both these (known) cases by returning OPAL_BUSY so the host driver
knows to retry later.
With this change, the sensors command works fine when the FSP comes back.
This version also resolves the remaining IPMI issues.
Signed-off-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Tested-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit 4940b8148640c06e139aec8c6d0370af7dd3b184)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit a6d5bc107e76123440d60a05698c151084604180)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Depending on how it was built, gcc will use the canary from a global
(works for us) or from the TLS (doesn't work for us and accesses
random stuff instead).
Fixing that would be tricky. There are talks of adding a gcc option
to force use of globals, but in the meantime, disable the stack
protector.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[stewart@linux.vnet.ibm.com: add -fno-stack-protector]
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit fe6f1f982b562ba855bb68fb51545f104078f546)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit d7ffce9096d5a23ee4ff309910983d823e953bd2)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
We use r13 for our own stuff, make sure it's properly fixed
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit d45b9bc4f98dfeac3ce6ee906948b56944f6aa6b)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Those are called by the interrupts core and thus skip the locking
implicit in the PCI opal calls.
However IODA table access can be racy, so make sure we lock the PHB.
Signed-off-by: Benjamin Herrenschmidt
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit 55af871041a4b09e53013671450980bdb36f91e3)
[stewart@linux.vnet.ibm.com: backport to phb3_lock()]
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
OPAL_CLOSED is returned when a fundamental reset is issued on a
PHB that doesn't have subordinate devices (root port excluded).
The kernel raises an error message, which is unnecessary. This patch
returns OPAL_SUCCESS for this case to avoid the error message.
The code change included here has been upstream since skiboot-5.3.0
because of the commits below, so this is only applied to
stable releases equal to or earlier than skiboot-5.2.5.
commit 3d3303734d1 ("hw/p7ioc: Support PHB slot")
commit e1922cba179 ("hw/phb3: Support PHB slot")
Cc: stable # 5.2.5-
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
During a complete reset, when we get a timeout waiting for pending
transaction in state PHB3_STATE_CRESET_WAIT_CQ, we mark the PHB as broken
and return OPAL_PARAMETER.
Change the return code to OPAL_HARDWARE which is way more sensible, and set
the state to PHB3_STATE_FENCED so that the kernel can retry the complete
reset.
Reported-by: Pradipta Ghosh <pradghos@in.ibm.com>
Suggested-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
When we're doing a complete reset, after we complete the ETU reset and wait
for the PHB to return, we need to do a fundamental reset.
When we do the fundamental reset, we poll for a link up. This isn't always
successful on the first attempt. In phb3_sm_link_poll(), if we time out
while waiting for the link to come up, we call phb3_retry_state() to reset
p->state back to p->retry_state and poll again. On the second poll, we
clear the retry state so we don't retry again.
However, when we do the fundamental reset as part of a complete reset, we
don't explicitly set the retry state. This means that we only retry if
there wasn't an earlier fundamental reset that had to retry and thus
cleared the retry state. This reduces the reliability of the complete reset
process.
In phb3_sm_complete_reset(), when in state PHB3_STATE_CRESET_FRESET,
set the retry state to PHB3_STATE_FRESET_START, as is done in
phb3_fundamental_reset().
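A simplified sketch of the retry mechanism follows. The state names and helpers are invented for illustration, and the retry state is cleared at the moment it is consumed rather than on the second poll. It shows why the retry state has to be armed whenever a fundamental reset is started:

```c
/* Invented state names; the real PHB3 state machine has many more. */
enum { SK_NO_RETRY = -1, SK_FRESET_START, SK_LINK_POLL, SK_FAILED };

struct sk_phb {
    int state;
    int retry_state;  /* where to go back to on a timeout, or SK_NO_RETRY */
};

/* Go back to the armed retry state once; returns 1 if a retry happened. */
static int sk_retry_state(struct sk_phb *p)
{
    if (p->retry_state == SK_NO_RETRY)
        return 0;
    p->state = p->retry_state;
    p->retry_state = SK_NO_RETRY;  /* only one retry per arming */
    return 1;
}

/* Link poll timed out: retry if armed, otherwise give up. */
static void sk_link_poll_timeout(struct sk_phb *p)
{
    if (!sk_retry_state(p))
        p->state = SK_FAILED;
}

/* Entering the fundamental-reset step of a complete reset. */
static void sk_enter_creset_freset(struct sk_phb *p)
{
    p->retry_state = SK_FRESET_START;  /* the step the text says was missing */
    p->state = SK_LINK_POLL;
}
```

Without the assignment in `sk_enter_creset_freset()`, whether a retry happens depends on whatever an earlier reset left in `retry_state`.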
Reported-by: Pradipta Ghosh <pradghos@in.ibm.com>
Suggested-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Cc: Uma Krishnan <ukrishn@linux.vnet.ibm.com>
Cc: Matthew Ochs <mrochs@linux.vnet.ibm.com>
Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
In some error conditions, we could spiral out of control on this
and spend all of our time printing the exact same backtrace.
Limit it to 16 times, because 16 is a nice number.
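The limit amounts to a simple counter, roughly like the sketch below. The names are hypothetical and the printing itself is replaced by a second counter so the example stays self-contained.

```c
#define SK_MAX_BACKTRACES 16

static int sk_backtrace_attempts;  /* how often a backtrace was requested */
static int sk_backtraces_printed;  /* how often one was actually emitted */

/* Returns 1 if the backtrace was emitted, 0 if it was suppressed. */
static int sk_report_backtrace(void)
{
    sk_backtrace_attempts++;
    if (sk_backtraces_printed >= SK_MAX_BACKTRACES)
        return 0;
    sk_backtraces_printed++;  /* stand-in for the real printing */
    return 1;
}
```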
Cc: stable
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Reviewed-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit b6a729e118f42dae88ebf70a09a7e2aa4f788fdc)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
If we were to terminate in a poller, we'd call op_display() which
called pollers which hit the recursive poller warning, which ended
in not much fun at all.
This patch will skip the running of pollers and instead run
the FSP poller to set the op-panel display before attn.
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit 9fcb109218b1374a8caa3cac62e83fbedb1f7f2f)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
There is a chance that another interrupt wakes a CPU at the 0x100
vector just when an HMI for a TOD error is also pending. In this rare race
condition, if the CPU has woken up from a tb_loss power saving mode, it will
invoke the OPAL call to resync the TB. Since TOD is already in error state,
the TB resync will time out, leaving TFMR bit 18 set to '1'. (TFMR[18]=1 means
TB is prepared to receive a new value from TOD. Once the new value is
received this bit gets reset to '0', otherwise TB stays in the waiting
state). When the HMI is delivered, it may find all TFMR errors already
cleared but fail to restore TB since TFMR bit 18 is already set.
This leads to HMI recovery failure, causing a kernel crash.
This patch fixes this by clearing TB errors if TFMR[18] is set to 1.
This makes sure that TB is in a clean state before the TB restore process starts.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit 026b9a13bf8d61a7e72721d59961b40cbc98b410)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
The capi initialization sequence was revised in a circumvention
document when a 'link down' error was converted from fatal to Endpoint
Recoverable. Other, non-capi, register setup was corrected even before
the initial open-source release of skiboot, but a few capi-related
registers were not updated then, so this patch fixes it.
The point is that a link-down error detected by the UTL logic will
lead to an AIB fence, so that the CAPP unit can detect the error.
Signed-off-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit e36f4f219b642c6c5032208fca7191fbd75fe1a3)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Below are the in-memory console log messages observed at error level (PR_ERROR):
[54460318,3] HBRT: Mem region 'ibm,homer-image' not found !
[54465404,3] HBRT: Mem region 'ibm,homer-image' not found !
[54470372,3] HBRT: Mem region 'ibm,homer-image' not found !
[54475369,3] HBRT: Mem region 'ibm,homer-image' not found !
[11540917382,3] NVRAM: Layout appears sane
[11694529822,3] OPAL: Trying a CPU re-init with flags: 0x2
[61291003267,3] OPAL: Trying a CPU re-init with flags: 0x1
[61394005956,3] OPAL: Trying a CPU re-init with flags: 0x2
Lower the log level of the "Mem region not found" messages to PR_WARNING and of the remaining messages to PR_INFO:
[54811683,4] HBRT: Mem region 'ibm,homer-image' not found !
[10923382751,6] NVRAM: Layout appears sane
[55533988976,6] OPAL: Trying a CPU re-init with flags: 0x1
Signed-off-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit 341daa8104af3231b908e6fcffeedb5e47b33990)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
flash_find_subpartition() accepts a pointer to a boolean variable
indicating ecc for a region of flash and passes the pointer directly
to flash_read_corrected() which actually only wants the value. This
has always worked probably because there has always been ECC on
sub partitions.
How there aren't any warnings triggered by this condition escapes me.
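One possible explanation for the missing warning, offered as a guess rather than a finding about this code: C converts any scalar, including a pointer, to `bool` implicitly, and a non-null pointer always converts to true. The hypothetical functions below show the effect.

```c
#include <stdbool.h>

/* Stand-in for a reader that wants the ECC flag by value. */
static int sk_read_corrected(bool ecc)
{
    return ecc ? 1 : 0;  /* 1: ECC path taken, 0: raw path taken */
}

/* Buggy shape: forwards the pointer, which converts to true if non-null. */
static int sk_find_subpartition_buggy(bool *ecc)
{
    return sk_read_corrected(ecc);
}

/* Fixed shape: forwards the value the pointer refers to. */
static int sk_find_subpartition_fixed(bool *ecc)
{
    return sk_read_corrected(*ecc);
}
```

As long as every caller passes a pointer to `true`, both versions behave the same, which matches the remark that this "has always worked".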
Fixes: 6c26bc7 ("libflash: move ffs_flash_read into libflash")
Signed-off-by: Cyril Bur <cyril.bur@au1.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit 81a538a678edf666568ca4adffe074b3dbce6dc3)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Fix resend logic in opal_resend_pending_logs, so that it actually
restarts sending remaining logs.
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit a6d4a7884e95cb9c918b8a217c11e46b01218358)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
In some corner cases the host may send an acknowledgement without
reading the actual data (fsp_opal_elog_info -> fsp_opal_elog_ack).
Because of this elog_read_from_fsp_head_state may be stuck in the
wrong state (ELOG_STATE_HOST_INFO) and unable to send the remaining
ELOGs to the host. Hence reset the ELOG state and start sending the
remaining ELOGs.
Also, in the normal case we ACK logs which are already processed
(elog_read_processed). Hence rearrange the code so that we go
through elog_read_processed first.
Finally, return OPAL_PARAMETER if we are not able to find the ELOG ID.
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
[stewart@linux.vnet.ibm.com: spelling fix]
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit e7c8cba4ad773055f390632c2996d3242b633bf4)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
The chances of elog_read_pending being in an inconsistent state are
very low. Just to be on the safe side, disable notification if the list
is not in a consistent state.
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Reviewed-by: Mukesh Ojha <mukesh02@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit 1fb10de164d3ca034193df81c1f5d007aec37781)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
ELOG enables event notification once a new log is available. It is
disabled only after the host completes reading the log (it has to complete
both fsp_opal_elog_info and fsp_opal_elog_read).
Ideally we should disable notification as soon as the host consumes the event
(after fsp_opal_elog_info). Also, if the host fails to call fsp_opal_elog_read
(e.g. in situations like a duplicate event), we end up keeping the notification
enabled forever.
This patch introduces a new ELOG state (ELOG_STATE_HOST_INFO). As soon
as the host consumes the event, the elog moves to this new state so that
event notification is disabled.
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit cec5750a4a86ff3f69e1d8817eda023f4d40c492)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
We use the elog notifier to notify logs from multiple sources (FSP-generated
logs - fsp-elog-read.c and OPAL-generated logs - fsp-elog-write.c).
The OPAL-generated log code sets the elog event bit whenever it has new logs
to send to the host. But it relies on fsp-elog-read.c to disable the event bit,
which is wrong!
This patch creates a common function to enable/disable event notification.
It enables event notification if any of the sources is ready to send an
error log to the host and disables notification once all errors have been
sent to the host.
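One way to picture such a common function is sketched below. The bitmask, the `sk_` names and the boolean event bit are all assumptions made for illustration. Each source reports whether it has logs pending, and the event bit is derived from the union.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical identifiers for the two log sources named in the text. */
enum sk_elog_src {
    SK_SRC_FSP  = 1u << 0,  /* FSP-generated logs */
    SK_SRC_OPAL = 1u << 1,  /* OPAL-generated logs */
};

static uint32_t sk_pending_sources;  /* sources that still have logs to send */
static bool     sk_event_bit;        /* stand-in for the host-visible event */

/* Single entry point: no source clears the bit on behalf of another. */
static void sk_update_notification(uint32_t src, bool has_logs)
{
    if (has_logs)
        sk_pending_sources |= src;
    else
        sk_pending_sources &= ~src;
    sk_event_bit = (sk_pending_sources != 0);
}
```

The event stays raised until every source has reported that it is done.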
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit ec366ad4e2e871096fa4c614ad7e89f5bb6f884f)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
ELOG enables event notification once a new log is available. It is
disabled only after the host completes reading the log (it has to complete
both fsp_opal_elog_info and fsp_opal_elog_read).
In some corner cases like kexec, the host may end up reading the same ELOG id
twice (calling fsp_opal_elog_info twice because of a resend request). The host
sees it as a duplicate and does not read the actual log (fsp_opal_elog_read()).
In such situations we fail to disable event notification :-(
Scenario :
OPAL Host
-------------------------------------
OPAL_EVENT_ELOG_AVAIL --> kexec
OPAL_EVENT_ELOG_AVAIL --> elog client registered
<-- read ELOG (id=x)
<-- resend elog (opal_resend_pending_logs())
resend all ELOG --> read ELOG (id=x) -- Duplicate ELOG !
bhoom!!
kernel call trace:
------------------
[ 28.055923] CPU: 10 PID: 20 Comm: irq/29-opal-elo Not tainted 4.4.0-24-generic #43-Ubuntu
[ 28.056012] task: c0000000ef982a20 ti: c0000000efa38000 task.ti: c0000000efa38000
[ 28.056100] NIP: c000000008010a24 LR: c000000008010a24 CTR: 0000000030033758
[ 28.056188] REGS: c0000000efa3b9c0 TRAP: 0901 Not tainted (4.4.0-24-generic)
[ 28.056274] MSR: 9000000100009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 22000844 XER: 20000000
[ 28.056499] CFAR: c000000008009958 SOFTE: 1
GPR00: c000000008131e8c c0000000efa3bc40 c0000000095b4200 0000000000000900
GPR04: c0000000094a63c8 0000000000000001 9000000100009033 0000000000000062
GPR08: 0000000000000000 0000000000000000 c0000000ef960400 9000000100001003
GPR12: c00000000806de48 c00000000fb45f00
[ 28.057042] NIP [c000000008010a24] arch_local_irq_restore+0x74/0x90
[ 28.057117] LR [c000000008010a24] arch_local_irq_restore+0x74/0x90
[ 28.057189] Call Trace:
[ 28.057221] [c0000000efa3bc40] [c0000000f108a980] 0xc0000000f108a980 (unreliable)
[ 28.057326] [c0000000efa3bc60] [c000000008131e8c] irq_finalize_oneshot.part.2+0xbc/0x250
[ 28.057429] [c0000000efa3bcb0] [c000000008132170] irq_thread_fn+0x80/0xa0
[ 28.057519] [c0000000efa3bcf0] [c00000000813263c] irq_thread+0x1ac/0x280
[ 28.057609] [c0000000efa3bd80] [c0000000080e61e0] kthread+0x110/0x130
[ 28.057698] [c0000000efa3be30] [c000000008009538] ret_from_kernel_thread+0x5c/0xa4
[ 28.057799] Instruction dump:
[ 28.057844] 994d02ca 2fa30000 409e0024 e92d0020 61298000 7d210164 38210020 e8010010
[ 28.057995] 7c0803a6 4e800020 60420000 4bff17ad <60000000> 4bffffe4 60420000 e92d0020
This patch adds a kexec notifier client. It disables event notification
during kexec. Once the host is ready to receive ELOGs again it calls
fsp_opal_resend_pending_logs(), which re-enables ELOG notification.
This fixes the above issue. I will add a follow-up patch to improve event state.
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit d2ae07fd97bb9408456279cec799f72cb78680a6)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
The XSCOM engine blocks after querying the FIR of any
sleeping core. This causes subsequent XSCOM operations to hang
forever because the XSCOM engine is continuously busy. Reset the XSCOM
engine after querying the FIR of any sleeping core.
Signed-off-by: Vipin K Parashar <vipin@linux.vnet.ibm.com>
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit 15cec493804ff14e6246eb1b65e9d0c7cb469a81)
Signed-off-by: Vipin K Parashar <vipin@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
OPAL retries XSCOM read/write operations forever until they succeed.
This can cause XSCOM ops to hang forever when the XSCOM engine remains
busy for some reason. Retry XSCOM operations only
XSCOM_BUSY_MAX_RETRIES times instead of retrying forever.
Also add logic to reset the XSCOM engine after XSCOM_BUSY_RESET_THRESHOLD
retries to unblock it when it remains busy.
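The retry policy can be sketched against a simulated engine as follows. The constants are deliberately small placeholder values, and the engine model, the `sk_` names and the final error code are assumptions rather than the skiboot implementation.

```c
/* Local stand-ins for OPAL return codes; the values are illustrative only. */
enum { SK_SUCCESS = 0, SK_BUSY = -2, SK_HARDWARE = -6 };

/* Placeholder limits, kept tiny so the behaviour is easy to follow. */
#define SK_BUSY_MAX_RETRIES     6
#define SK_BUSY_RESET_THRESHOLD 3

/* Simulated engine: stuck == 1 is cured by a reset, stuck == 2 never is. */
struct sk_engine {
    int stuck;
    int attempts;
    int resets;
};

static int sk_try_once(struct sk_engine *e)
{
    e->attempts++;
    return e->stuck ? SK_BUSY : SK_SUCCESS;
}

static void sk_reset(struct sk_engine *e)
{
    e->resets++;
    if (e->stuck == 1)
        e->stuck = 0;
}

/* Bounded retry: reset once at the threshold, give up after the maximum. */
static int sk_xscom_op(struct sk_engine *e)
{
    for (int retries = 0; retries <= SK_BUSY_MAX_RETRIES; retries++) {
        int rc = sk_try_once(e);

        if (rc != SK_BUSY)
            return rc;
        if (retries == SK_BUSY_RESET_THRESHOLD)
            sk_reset(e);
    }
    return SK_HARDWARE;  /* assumed error code for "still busy" */
}
```

The reset threshold sits below the retry limit so that a reset gets a chance to clear the condition before the operation is abandoned.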
Cc: stable # 9c2d82394fd2 ("xscom: Return OPAL_WRONG_STATE on XSCOM ops..")
Signed-off-by: Vipin K Parashar <vipin@linux.vnet.ibm.com>
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit e761222593a1ae932cddbc81239b6a7cd98ddb70)
Signed-off-by: Vipin K Parashar <vipin@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
xscom_read and xscom_write return OPAL_SUCCESS if they worked, and
OPAL_HARDWARE if they didn't. This doesn't provide information about why
the operation failed, such as if the CPU happens to be asleep.
This is specifically useful in error scanning, so if every CPU is being
scanned for errors, sleeping CPUs likely aren't the cause of failures.
So, return OPAL_WRONG_STATE in xscom_read and xscom_write if the CPU is
sleeping.
Signed-off-by: Russell Currey <ruscur@russell.cc>
Reviewed-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit 9c2d82394fd2303847cac4a665dee62556ca528a)
Signed-off-by: Vipin K Parashar <vipin@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Backport of user visible typo fixes
partial cherry picked from 4c95b5e04e3c4f72e4005574f67cd6e365d3276f
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Fixes: f46c1e506d199332b0f9741278c8ec35b3e39135
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit 348dacfaca9f139db2603f5c2e78d87e21938ca6)
|
|
On PCI Express, devices need to know their own bus number in order
to provide the correct source identification (aka RID) in upstream
packets they might send, such as error messages or DMAs.
However, while devices know (and hard wire) their own device and
function number, they know nothing about bus numbers by default; those
are decoded by bridges for routing. All they know is that if their
parent bridge sends a "type 0" configuration access, they should decode
it provided the device and function numbers match.
The PCIe spec thus defines that when a device receives such a configuration
access and it's a write, it should "capture" the bus number in the source
field of the packet, and re-use it as the originator bus number of all
subsequent outgoing requests.
In order to ensure that a device has this bus number firmly established
before it's likely to send error packets upstream, we should thus do a
dummy configuration write to it as soon as possible after probing.
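To make the RID dependency concrete, here is a small model of a device that only learns its bus number from a configuration write. The struct and function names are invented for the example. The bus/device/function packing is the conventional 8/5/3-bit layout.

```c
#include <stdint.h>

/* Conventional RID layout: bus in bits 15..8, device 7..3, function 2..0. */
static uint16_t sk_pcie_rid(uint8_t bus, uint8_t dev, uint8_t fn)
{
    return (uint16_t)((bus << 8) | ((dev & 0x1f) << 3) | (fn & 0x7));
}

/* Hypothetical device model: dev/fn are hard wired, the bus is learned. */
struct sk_dev {
    uint8_t dev;
    uint8_t fn;
    uint8_t captured_bus;  /* 0 until the first type 0 config write */
};

/* A type 0 configuration write lets the device capture its bus number. */
static void sk_cfg_write_type0(struct sk_dev *d, uint8_t bus)
{
    d->captured_bus = bus;
}

/* RID the device would put in upstream packets such as error messages. */
static uint16_t sk_requester_id(const struct sk_dev *d)
{
    return sk_pcie_rid(d->captured_bus, d->dev, d->fn);
}
```

Until the first configuration write arrives, the model reports bus 0 in its RID, which is why an early dummy write is useful.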
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
[stewart@linux.vnet.ibm.com: fix Evolution broken patch, write vdid rather than &vdid as per Gavin suggestion]
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit f46c1e506d199332b0f9741278c8ec35b3e39135)
|
|
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
When we mask an interrupt, we may race with another interrupt coming
in from the hardware. If this occurs, the P and/or Q bit may end up
being set but we never EOI/clear them. This could result in a lost
interrupt or the next interrupt that comes in after re-enabling never
being presented.
This patch ensures that when masking an interrupt, any pending P/Q
bits are cleared.
This fixes a bug seen with some CAPI workloads which have lots of
interrupt masking at the same time as high interrupt load. The fix is
not specific to CAPI though.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Tested-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
When we EOI we need to clear the present (P) bit in the Interrupt
Vector Cache (IVC). We must clear P ensuring that any additional
interrupts that come in aren't lost while also maintaining coherency
with the Interrupt Vector Table (IVT).
To do this, the hardware provides a conditional update bit in the
IVC. This bit ensures that generation counts between the IVT and the
IVC updates are synchronised.
Unfortunately we never set the bit to conditionally update the P
bit in the IVC based on the generation count. Also, we didn't set
what we wanted the new generation count to be if the update was
successful.
This patch fixes both of these. It also reworks and documents
the code so that mortals may eventually be able to understand this
process.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Tested-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
Tested-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Problem Description:
During FSP termination/reset, the FSP received an mbox command from OPAL for
"Fetching platform management function data". As the FSP was in termination
state, the DMAE operation failed to write memory data to the hypervisor,
so the FSP sent an mbox command with response status 0x24 to OPAL.
OPAL committed a predictive log with SRC BB822411 and sent back
response status 0xFE. FSP IPMI cannot interpret this failure at the
host, so IPMI logs an error.
Fix: When OPAL receives the bad response 0x24 from the FSP
due to a DMAE error, commit an informational log and return response status
SUCCESS; for all other bad status responses, commit a predictive log.
Signed-off-by: Mamatha Inamdar <mamatha4@linux.vnet.ibm.com>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
The current code sends a partial HMI event (4 * 64 bits instead of
5 * 64 bits) to the host. The last 64 bits contain the chip id/PIR info for
reporting checkstop events. This bug affects only checkstop events.
Host console output without this patch:
[ 305.628283] Fatal Hypervisor Maintenance interrupt [Not recovered]
[ 305.628341] Error detail: Malfunction Alert
[ 305.628388] HMER: 8040000000000000
[ 305.628423] CPU PIR: 00000000
[ 305.628458] [Unit: VSU] Logic core check stop
Host console output with this patch:
[ 200.122883] Fatal Hypervisor Maintenance interrupt [Not recovered]
[ 200.122941] Error detail: Malfunction Alert
[ 200.122986] HMER: 8040000000000000
[ 200.123021] CPU PIR: 000008e8
[ 200.123055] [Unit: VSU] Logic core check stop
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Commit a5299ba2 dropped non-severe events from logging to the BMC, but I forgot
to release the error log structure.
Fixes: a5299ba2 (IPMI: Only log events that needs attention)
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|