aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2017-08-18skiboot 5.1.20 release notesskiboot-5.1.20Stewart Smith1-0/+173
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-08-18FSP/CONSOLE: Workaround for unresponsive ipmi daemonVasant Hegde2-1/+21
We use TCE mapped area to write data to console. Console header (fsp_serbuf_hdr) is modified by both FSP and OPAL (OPAL updates next_in pointer in fsp_serbuf_hdr and FSP updates next_out pointer). Kernel makes opal_console_write() OPAL call to write data to console. OPAL write data to TCE mapped area and sends MBOX command to FSP. If our console becomes full and we have data to write to console, we keep on waiting until FSP reads data. In some corner cases, where FSP is active but not responding to console MBOX message (due to buggy IPMI) and we have heavy console write happening from kernel, then eventually our console buffer becomes full. At this point OPAL starts sending OPAL_BUSY_EVENT to kernel. Kernel will keep on retrying. This is creating kernel soft lockups. In some extreme case when every CPU is trying to write to console, user will not be able to ssh and thinks system is hang. If we reset FSP or restart IPMI daemon on FSP, system recovers and everything becomes normal. This patch adds workaround to above issue by returning OPAL_HARDWARE when cosole is full. Side effect of this patch is, we may endup dropping latest console data. But better to drop console data than system hang. Alternative approach is to drop old data from console buffer, make space for new data. But in normal condition only FSP can update 'next_out' pointer and if we touch that pointer, it may introduce some other race conditions. Hence we decided to just new console write request. Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Acked-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> (cherry picked from commit c8a7535f3539c79955645e6b3714b367a994b1e9) Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-08-18FSP: Set status field in response message for timed out messageVasant Hegde1-1/+4
For timed out FSP messages, we set message status as "fsp_msg_timeout". But most FSP driver users (like surviellance) are ignoring this field. They always look for FSP returned status value in callback function (second byte in word1). So we endup treating timed out message as success response from FSP. Sample output: [69902.432509048,7] SURV: Sending the heartbeat command to FSP [70023.226860117,4] FSP: Response from FSP timed out, word0 = d66a00d7, word1 = 0 state: 3 .... [70023.226901445,7] SURV: Received heartbeat acknowledge from FSP [70023.226903251,3] FSP: fsp_trigger_reset() entry Here SURV code thought it got valid response from FSP. But actually we didn't receive response from FSP. This patch fixes above issue by updating status field in response structure. CC: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Acked-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> (cherry picked from commit 4cef4d8d6000936b1a4e1065bf69ee2edd3fcc1f) Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-08-18FSP: Improve timeout messageVasant Hegde1-4/+5
Presently we print word0 and word1 in error log. word0 contains sequence number and command class. One has to understand word0 format to identify command class. Lets explicitly print command class, sub command etc. CC: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Acked-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> (cherry picked from commit 807a3acc8fd66af1e1c6e7154aa5029c9b91bb3b) Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-08-18FSP/RTC: Remove local fsp_in_reset variableVasant Hegde1-10/+0
Now that we are using fsp_in_rr() to detect FSP reset/reload, fsp_in_reset become redundant. Lets remove this local variable. Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Acked-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> (cherry picked from commit a34369631e6d85c26966eb0b8d5e4c44bcf96c7c) Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-08-18FSP/RTC: Fix possible FSP R/R issue in rtc write pathVasant Hegde1-9/+11
fsp_opal_rtc_write() checks FSP status before queueing message to FSP. But if FSP R/R starts before getting response to queued message then we will continue to return OPAL_BUSY_EVENT to host. In some extreme condition host may experience hang. Once FSP is back we will repost message, get response from FSP and return OPAL_SUCCES to host. This patch caches new values and returns OPAL_SUCCESS if FSP R/R is happening. And once FSP is back we will send cached value to FSP. Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> (cherry picked from commit f4757fbfcf616365c74b1aa6508b2ab27480cdd0) Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-08-18hw/fsp/rtc: read/write cached rtc tod on fsp hir.ppaidipe@linux.vnet.ibm.com1-2/+2
Currently fsp-rtc reads/writes the cached RTC TOD on an fsp reset. Use latest fsp_in_rr() function to properly read the cached rtc value when fsp reset initiated by the hir. Below is the kernel trace when we set hw clock, when hir process starts. [ 1727.775824] NMI watchdog: BUG: soft lockup - CPU#57 stuck for 23s! [hwclock:7688] [ 1727.775856] Modules linked in: vmx_crypto ibmpowernv ipmi_powernv uio_pdrv_genirq ipmi_devintf powernv_op_panel uio ipmi_msghandler powernv_rng leds_powernv ip_tables x_tables autofs4 ses enclosure scsi_transport_sas crc32c_vpmsum lpfc ipr tg3 scsi_transport_fc [ 1727.775883] CPU: 57 PID: 7688 Comm: hwclock Not tainted 4.10.0-14-generic #16-Ubuntu [ 1727.775883] task: c000000fdfdc8400 task.stack: c000000fdfef4000 [ 1727.775884] NIP: c00000000090540c LR: c0000000000846f4 CTR: 000000003006dd70 [ 1727.775885] REGS: c000000fdfef79a0 TRAP: 0901 Not tainted (4.10.0-14-generic) [ 1727.775886] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> [ 1727.775889] CR: 28024442 XER: 20000000 [ 1727.775890] CFAR: c00000000008472c SOFTE: 1 GPR00: 0000000030005128 c000000fdfef7c20 c00000000144c900 fffffffffffffff4 GPR04: 0000000028024442 c00000000090540c 9000000000009033 0000000000000000 GPR08: 0000000000000000 0000000031fc4000 c000000000084710 9000000000001003 GPR12: c0000000000846e8 c00000000fba0100 [ 1727.775897] NIP [c00000000090540c] opal_set_rtc_time+0x4c/0xb0 [ 1727.775899] LR [c0000000000846f4] opal_return+0xc/0x48 [ 1727.775899] Call Trace: [ 1727.775900] [c000000fdfef7c20] [c00000000090540c] opal_set_rtc_time+0x4c/0xb0 (unreliable) [ 1727.775901] [c000000fdfef7c60] [c000000000900828] rtc_set_time+0xb8/0x1b0 [ 1727.775903] [c000000fdfef7ca0] [c000000000902364] rtc_dev_ioctl+0x454/0x630 [ 1727.775904] [c000000fdfef7d40] [c00000000035b1f4] do_vfs_ioctl+0xd4/0x8c0 [ 1727.775906] [c000000fdfef7de0] [c00000000035bab4] SyS_ioctl+0xd4/0xf0 [ 1727.775907] [c000000fdfef7e30] [c00000000000b184] system_call+0x38/0xe0 [ 1727.775908] Instruction dump: [ 1727.775909] f821ffc1 39200000 7c832378 91210028 38a10020 39200000 38810028 f9210020 [ 1727.775911] 4bfffe6d e8810020 80610028 4b77f61d <60000000> 7c7f1b78 3860000a 2fbffff4 This is found when executing the testcase https://github.com/open-power/op-test-framework/blob/master/testcases/fspresetReload.py With this fix ran fsp hir torture testcase in the above test which is working fine. Signed-off-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com> Acked-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> Reviewed-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> (cherry picked from commit 447ccc4de529f001271fd4dfd78401bc4c90832e) Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-08-18FSP/CHIPTOD: Return false in error pathVasant Hegde1-0/+2
CC: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> (cherry picked from commit 740d00b1036188c6e248418fb0a13faf14723e7a) Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-08-18FSP: Notify FSP of Platform Log ID after Host Initiated Reset ReloadStewart Smith6-34/+84
Trigging a Host Initiated Reset (when the host detects the FSP has gone out to lunch and should be rebooted), would cause "Unknown Command" messages to appear in the OPAL log. This patch implements those messages How to trigger FSP RR(HIR): $ putmemproc 300000f8 0x00000000deadbeef s1 k0:n0:s0:p00 ecmd_ppc putmemproc 300000f8 0x00000000deadbeef Log showing unknown command: / # cat /sys/firmware/opal/msglog | grep -i ,3 [ 110.232114723,3] FSP: fsp_trigger_reset() entry [ 188.431793837,3] FSP #0: Link down, starting R&R [ 464.109239162,3] FSP #0: Got XUP with no pending message ! [ 466.340598554,3] FSP-DPO: Unknown command 0xce0900 [ 466.340600126,3] FSP: Unhandled message ce0900 The message we need to handle is "Get PLID after host initiated FipS reset/reload". When the FSP comes back from HIR, it asks "hey, so, which error log explains why you rebooted me?". So, we tell it. Reported-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> (cherry picked from commit f3a5741408a11be6992cf8779f2eae10b08c020a) Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-08-18hw/i2c: Fix early lock dropOliver O'Halloran1-2/+0
When interacting with an I2C master the p8-i2c driver (common to p9) aquires a per-master lock which it holds for the duration of it's interaction with the master. Unfortunately, when p8_i2c_check_initial_status() detects that the master is busy with another transaction it drops the lock and returns OPAL_BUSY. This is contrary to the driver's locking strategy which requires that the caller aquire and drop the lock. This leads to a crash due to the double unlock(), which skiboot treats as fatal. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> (cherry picked from commit bb192fd55ffb20d619101c5e3e1f4fd24f844d11) Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-08-18head.S: store LR rather than CTR when trying to store LRStewart Smith1-1/+1
Long existing typo of r5 rather than r6, meaning we were storing CTR instead of LR. Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> (cherry picked from commit d55194c5d9ada77eee2c9a69814708304f34d334) Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-08-18head.S: store all of LR and CTROliver O'Halloran1-2/+2
When saving the CTR and LR registers the skiboot exception handlers use the 'stw' instruction which only saves the lower 32 bits of the register. Given these are both 64 bit registers this leads to some strange register dumps, for example: *********************************************** Unexpected exception 200 ! SRR0 : 0000000030016968 SRR1 : 9000000000201000 HSRR0: 0000000000000180 HSRR1: 9000000000001000 LR : 3003438830823f50 CTR : 3003438800000018 CFAR : 00000000300168fc CR : 40004208 XER: 00000000 In this dump the upper 32 bits of LR and CTR are actually stack gunk which obscures the underlying issue. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Acked-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> (cherry picked from commit 70bc370883330c8b1076555c126647a3cdf88706) Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-08-18hw/fsp: Do not queue SP and SPCN class messages during reset/reloadAnanth N Mavinakayanahalli4-1/+33
During FSP R/R, the FSP is inaccessible and will lose state. Messages to the FSP are generally queued for sending later. It does seem like the FSP fails to process any subseuqent messages of certain classes (SP info -- ipmi) if it receives queued mbox messages it isn't expecting. In certain other cases (sensors), the FSP driver returns a default code (async completion) even though there is no known bound from the time of this error return to the actual data being available. The kernel driver keeps waiting leading to soft-lockup on the host side. Mitigate both these (known) cases by returning OPAL_BUSY so the host driver knows to retry later. With this change, the sensors command works fine when the FSP comes back. This version also resolves the remaining IPMI issues Signed-off-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> Tested-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> (cherry picked from commit 4940b8148640c06e139aec8c6d0370af7dd3b184) Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> (cherry picked from commit a6d5bc107e76123440d60a05698c151084604180) Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-01-16skiboot 5.1.19 release notesskiboot-5.1.19Stewart Smith1-0/+46
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2016-12-22Makefile: Disable stack protector due to gcc problemsBenjamin Herrenschmidt1-2/+8
Depending on how it was built, gcc will use the canary from a global (works for us) or from the TLS (doesn't work for us and accesses random stuff instead). Fixing that would be tricky. There are talks of adding a gcc option to force use of globals, but in the meantime, disable the stack protector Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> [stewart@linux.vnet.ibm.com: add -fno-stack-protector] Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> (cherry picked from commit fe6f1f982b562ba855bb68fb51545f104078f546) Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2016-12-22stack: Don't recurse into __stack_chk_failBenjamin Herrenschmidt1-2/+7
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> (cherry picked from commit d7ffce9096d5a23ee4ff309910983d823e953bd2) Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2016-12-22Makefile: Use -ffixed-r13Benjamin Herrenschmidt1-0/+1
We use r13 for our own stuff, make sure it's properly fixed Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> (cherry picked from commit d45b9bc4f98dfeac3ce6ee906948b56944f6aa6b) Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2016-12-22phb3: Lock the PHB on set_xive callbacksBenjamin Herrenschmidt1-0/+8
Those are called by the interrupts core and thus skip the locking implicit in the PCI opal calls. However IODA table access can be racy, so make sure we lock the PHB. Signed-off-by: Benjamin Herrenschmidt Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> (cherry picked from commit 55af871041a4b09e53013671450980bdb36f91e3) [stewart@linux.vnet.ibm.com: backport to phb3_lock()] Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2016-12-21hw/{phb3, p7ioc}: Return success for freset on empty PHBGavin Shan2-2/+2
OPAL_CLOSED is returned when fundamental reset is issued on the PHB who doesn't have subordinate devices (root port excluded). The kernel raises an error message, which is unnecessary. This returns OPAL_SUCCESS for this case to avoid the error message. The code change included in this has been upstream because of below commits since skiboot-5.3.0. So this is only applied to stable releases equal or ealier than skiboot-5.2.5. commit 3d3303734d1 ("hw/p7ioc: Support PHB slot) commit e1922cba179 ("hw/phb3: Support PHB slot") Cc: stable # 5.2.5- Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2016-12-13hw/phb3: fix error handling in complete resetAndrew Donnellan1-3/+2
During a complete reset, when we get a timeout waiting for pending transaction in state PHB3_STATE_CRESET_WAIT_CQ, we mark the PHB as broken and return OPAL_PARAMETER. Change the return code to OPAL_HARDWARE which is way more sensible, and set the state to PHB3_STATE_FENCED so that the kernel can retry the complete reset. Reported-by: Pradipta Ghosh <pradghos@in.ibm.com> Suggested-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Acked-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2016-12-02hw/phb3: set PHB retry state correctly when fresetting during a cresetAndrew Donnellan1-0/+1
When we're doing a complete reset, after we complete the ETU reset and wait for the PHB to return, we need to do a fundamental reset. When we do the fundamental reset, we poll for a link up. This isn't always successful on the first attempt. In phb3_sm_link_poll(), if we time out while waiting for the link to come up, we call phb3_retry_state() to reset p->state back to p->retry_state and poll again. On the second poll, we clear the retry state so we don't retry again. However, when we do the fundamental reset as part of a complete reset, we don't explicitly set the retry state. This means that we only retry if there wasn't an earlier fundamental reset that had to retry and thus cleared the retry state. This reduces the reliability of the complete reset process. In phb3_sm_complete_reset(), when in state PHB3_STATE_CRESET_FRESET, set the retry state to PHB3_STATE_FRESET_START, as is done in phb3_fundamental_reset(). Reported-by: Pradipta Ghosh <pradghos@in.ibm.com> Suggested-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Cc: Uma Krishnan <ukrishn@linux.vnet.ibm.com> Cc: Matthew Ochs <mrochs@linux.vnet.ibm.com> Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2016-11-24Limit number of "Poller recursion detected" errors to displayStewart Smith1-1/+5
In some error conditions, we could spiral out of control on this and spend all of our time printing the exact same backtrace. Limit it to 16 times, because 16 is a nice number. Cc: stable Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> Reviewed-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> (cherry picked from commit b6a729e118f42dae88ebf70a09a7e2aa4f788fdc) Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2016-11-24fsp: Don't recurse pollers in ibm_fsp_terminateStewart Smith3-1/+45
If we were to terminate in a poller, we'd call op_display() which called pollers which hit the recursive poller warning, which ended in not much fun at all. This patch will skip the running of pollers and instead run the FSP poller to set the op-panel display before attn. Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> (cherry picked from commit 9fcb109218b1374a8caa3cac62e83fbedb1f7f2f) Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2016-08-26Add skiboot-5.1.18 release notesskiboot-5.1.18Stewart Smith1-0/+31
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2016-08-25opal/hmi: Fix a TOD HMI failure during a race condition.Mahesh Salgaonkar1-0/+7
There are chances where another interrupt can wake a CPU in 0x100 vector just when HMI for TOD error is also pending. In such a rare race condition if CPU has woken up with tb_loss power saving mode, it will invoke opal call to resync the TB. Since TOD is already in error state, resync TB will timeout leaving TFMR bit 18 set to '1'. (TFMR[18]=1 means TB is prepared to receive new value from TOD. Once the new value is received this bit gets reset to '0', otherwise TB would stay in waiting state). When HMI is delivered, it may find all TFMR errors are already cleared but would fail to restore TB since TFMR bit 18 is already set. This leads to HMI recovery failure causing a kernel crash. This patch fixes this by clearing of TB errors if TFMR[18] is set to 1. This makes sure that TB is in clean state before TB restore process starts. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> (cherry picked from commit 026b9a13bf8d61a7e72721d59961b40cbc98b410) Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2016-08-24hw/phb3: Update capi initialization sequenceFrederic Barrat1-1/+2
The capi initialization sequence was revised in a circumvention document when a 'link down' error was converted from fatal to Endpoint Recoverable. Other, non-capi, register setup was corrected even before the initial open-source release of skiboot, but a few capi-related registers were not updated then, so this patch fixes it. The point is that a link-down error detected by the UTL logic will lead to an AIB fence, so that the CAPP unit can detect the error. Signed-off-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> (cherry picked from commit e36f4f219b642c6c5032208fca7191fbd75fe1a3) Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2016-08-02log_level: Reduce the in memory console log_level to lower priorityPridhiviraj Paidipeddi3-3/+3
Below are the in-memory console log messages observed with error level(PR_ERROR) [54460318,3] HBRT: Mem region 'ibm,homer-image' not found ! [54465404,3] HBRT: Mem region 'ibm,homer-image' not found ! [54470372,3] HBRT: Mem region 'ibm,homer-image' not found ! [54475369,3] HBRT: Mem region 'ibm,homer-image' not found ! [11540917382,3] NVRAM: Layout appears sane [11694529822,3] OPAL: Trying a CPU re-init with flags: 0x2 [61291003267,3] OPAL: Trying a CPU re-init with flags: 0x1 [61394005956,3] OPAL: Trying a CPU re-init with flags: 0x2 Lowering the log level of mem region not found messages to PR_WARNING and remaining messages to PR_INFO level [54811683,4] HBRT: Mem region 'ibm,homer-image' not found ! [10923382751,6] NVRAM: Layout appears sane [55533988976,6] OPAL: Trying a CPU re-init with flags: 0x1 Signed-off-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> (cherry picked from commit 341daa8104af3231b908e6fcffeedb5e47b33990) Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2016-07-28core/flash: Fix passing pointer instead of valueCyril Bur1-1/+1
flash_find_subpartition() accepts a pointer to a boolean variable indicating ecc for a region of flash and passes the pointer directly to flash_read_corrected() which actually only wants the value. This has always worked probably because there has always been ECC on sub partitions. How there aren't any warnings triggered by this condition escapes me. Fixes: 6c26bc7 ("libflash: move ffs_flash_read into libflash") Signed-off-by: Cyril Bur <cyril.bur@au1.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> (cherry picked from commit 81a538a678edf666568ca4adffe074b3dbce6dc3) Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2016-07-21FSP/ELOG: Fix OPAL generated elog resend logicVasant Hegde1-5/+0
Fix resend logic in opal_resend_pending_logs, so that it actually restarts sending remaining logs. Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> (cherry picked from commit a6d4a7884e95cb9c918b8a217c11e46b01218358) Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2016-07-21FSP/ELOG: Fix possible event notifier hangsVasant Hegde2-3/+18
In some corner cases host may send acknowledgement without reading actual data (fsp_opal_elog_info -> fsp_opal_elog_ack). Because of this elog_read_from_fsp_head_state may be stuck in wrong state (ELOG_STATE_HOST_INFO) and not able to send remaining ELOG's to host. Hence reset ELOG state and start sending remaining ELOG's. Also in normal case we will ACK the logs which are already processed (elog_read_processed). Hence rearrange the code such that we go through elog_read_processed first. Finally return OPAL_PARAMETER if we are not able to find ELOG ID. Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> [stewart@linux.vnet.ibm.com: spelling fix] Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> (cherry picked from commit e7c8cba4ad773055f390632c2996d3242b633bf4) Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2016-07-21FSP/ELOG: Disable event notification if list is not consistentVasant Hegde1-0/+2
Chances of elog_read_pending inconsistent state is very very less. Just to be on safer side, disable notification if list is not in consistent state. Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Reviewed-by: Mukesh Ojha <mukesh02@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> (cherry picked from commit 1fb10de164d3ca034193df81c1f5d007aec37781) Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2016-07-21FSP/ELOG: Improve elog event statesVasant Hegde2-1/+3
ELOG enables event notification once new log is available. And this will be disabled after host completes reading logs (it has to complete both fsp_opal_elog_info and fsp_opal_elog_read). Ideally we should disable notification as soon as host consumes event (after fsp_opal_elog_info). Also if host fails to call fsp_opal_elog_read (ex: situations like duplicate event), then we endup keeping notification forever. This patch introduces new ELOG state (ELOG_STATE_HOST_INFO). As soon as host consumes event elog will move to this new state so that event notification is disabled. Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> (cherry picked from commit cec5750a4a86ff3f69e1d8817eda023f4d40c492) Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2016-07-21FSP/ELOG: Fix OPAL generated elog event notificationVasant Hegde3-18/+34
We use elog notifier to notify logs from multiple sources (FSP generated logs - fsp-elog-read.c and OPAL generated logs - fsp-elog-write.c). OPAL generated logs sets elog event bit whenever it has new logs to send to host. But it relies on fsp-elog-read.c to disable the event bit..which is wrong! This patch creates common function to enable/disable event notification. It will enable event notification if any of the source is ready to send error log to host and disables notification once it completes sending all errors to host. Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> (cherry picked from commit ec366ad4e2e871096fa4c614ad7e89f5bb6f884f) Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2016-07-21FSP/ELOG: Disable event notification during kexecVasant Hegde1-13/+33
ELOG enables event notification once new log is available. And this will be disabled after host completes reading logs (it has to complete both fsp_opal_elog_info and fsp_opal_elog_read). In some corner cases like kexec, host may endup reading same ELOG id twice (calling fsp_opal_elog_info twice because of resend request). Host finds it as duplicate and it will not read actual log (fsp_opal_elog_read()). In such situations we fails to disable event notification :-( Scenario : OPAL Host ------------------------------------- OPAL_EVENT_ELOG_AVAIL --> kexec OPAL_EVENT_ELOG_AVAIL --> elog client registered <-- read ELOG (id=x) <-- resend elog (opal_resend_pending_logs()) resend all ELOG --> read ELOG (id=x) -- Duplicate ELOG ! bhoom!! kernel call trace: ------------------ [ 28.055923] CPU: 10 PID: 20 Comm: irq/29-opal-elo Not tainted 4.4.0-24-generic #43-Ubuntu [ 28.056012] task: c0000000ef982a20 ti: c0000000efa38000 task.ti: c0000000efa38000 [ 28.056100] NIP: c000000008010a24 LR: c000000008010a24 CTR: 0000000030033758 [ 28.056188] REGS: c0000000efa3b9c0 TRAP: 0901 Not tainted (4.4.0-24-generic) [ 28.056274] MSR: 9000000100009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 22000844 XER: 20000000 [ 28.056499] CFAR: c000000008009958 SOFTE: 1 GPR00: c000000008131e8c c0000000efa3bc40 c0000000095b4200 0000000000000900 GPR04: c0000000094a63c8 0000000000000001 9000000100009033 0000000000000062 GPR08: 0000000000000000 0000000000000000 c0000000ef960400 9000000100001003 GPR12: c00000000806de48 c00000000fb45f00 [ 28.057042] NIP [c000000008010a24] arch_local_irq_restore+0x74/0x90 [ 28.057117] LR [c000000008010a24] arch_local_irq_restore+0x74/0x90 [ 28.057189] Call Trace: [ 28.057221] [c0000000efa3bc40] [c0000000f108a980] 0xc0000000f108a980 (unreliable) [ 28.057326] [c0000000efa3bc60] [c000000008131e8c] irq_finalize_oneshot.part.2+0xbc/0x250 [ 28.057429] [c0000000efa3bcb0] [c000000008132170] irq_thread_fn+0x80/0xa0 [ 28.057519] [c0000000efa3bcf0] [c00000000813263c] irq_thread+0x1ac/0x280 [ 28.057609] [c0000000efa3bd80] [c0000000080e61e0] kthread+0x110/0x130 [ 28.057698] [c0000000efa3be30] [c000000008009538] ret_from_kernel_thread+0x5c/0xa4 [ 28.057799] Instruction dump: [ 28.057844] 994d02ca 2fa30000 409e0024 e92d0020 61298000 7d210164 38210020 e8010010 [ 28.057995] 7c0803a6 4e800020 60420000 4bff17ad <60000000> 4bffffe4 60420000 e92d0020 This patch adds kexec notifier client. It will disable event notification during kexec. Once host is ready to receive ELOG's again it will call fsp_opal_resend_pending_logs(). This call re-enables ELOG notication. It will fix above issue. I will add follow up patch to improve event state. Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> (cherry picked from commit d2ae07fd97bb9408456279cec799f72cb78680a6) Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2016-07-21Add skiboot-5.1.17 release notesskiboot-5.1.17Stewart Smith1-0/+19
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2016-07-21hw/xscom: Reset XSCOM engine after querying sleeping core FIRVipin K Parashar1-2/+2
XSCOM engine blocks subsequently after querying FIR of any sleeping core. This causes subsequent XSCOM opertions to hang forever due to XSCOM engine being continuously busy. Reset XSCOM engine after querying FIR of any sleeping core. Signed-off-by: Vipin K Parashar <vipin@linux.vnet.ibm.com> Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> (cherry picked from commit 15cec493804ff14e6246eb1b65e9d0c7cb469a81) Signed-off-by: Vipin K Parashar <vipin@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2016-07-21hw/xscom: Reset XSCOM engine after finite number of retries when busyVipin K Parashar3-17/+64
OPAL retries XSCOM read/write operations forever till it succeeds. This can cause XSCOM ops to hang forever when XSCOM engine remains busy for some reason. Changed it to retry XSCOM operations only XSCOM_BUSY_MAX_RETRIES number of times instead of retrying forever. Also added logic to reset XSCOM engine after XSCOM_BUSY_RESET_THRESHOLD number of retries to unblock it when it remains busy. Cc: stable # 9c2d82394fd2 ("xscom: Return OPAL_WRONG_STATE on XSCOM ops..") Signed-off-by: Vipin K Parashar <vipin@linux.vnet.ibm.com> Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> (cherry picked from commit e761222593a1ae932cddbc81239b6a7cd98ddb70) Signed-off-by: Vipin K Parashar <vipin@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2016-07-21xscom: Return OPAL_WRONG_STATE on XSCOM ops if CPU is asleepRussell Currey2-11/+23
xscom_read and xscom_write return OPAL_SUCCESS if they worked, and OPAL_HARDWARE if they didn't. This doesn't provide information about why the operation failed, such as if the CPU happens to be asleep. This is specifically useful in error scanning, so if every CPU is being scanned for errors, sleeping CPUs likely aren't the cause of failures. So, return OPAL_WRONG_STATE in xscom_read and xscom_write if the CPU is sleeping. Signed-off-by: Russell Currey <ruscur@russell.cc> Reviewed-by: Alistair Popple <alistair@popple.id.au> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> (cherry picked from commit 9c2d82394fd2303847cac4a665dee62556ca528a) Signed-off-by: Vipin K Parashar <vipin@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2016-06-20Fix typosStewart Smith8-12/+12
Backport of user visible typo fixes partial cherry picked from 4c95b5e04e3c4f72e4005574f67cd6e365d3276f Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2016-06-09Fixup whitespace and build warning/errorStewart Smith1-4/+4
Fixes: f46c1e506d199332b0f9741278c8ec35b3e39135 Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> (cherry picked from commit 348dacfaca9f139db2603f5c2e78d87e21938ca6)
2016-06-09pci: Do a dummy config write to devices to establish bus numberBenjamin Herrenschmidt1-0/+7
On PCI Express, devices need to know their own bus number in order to provide the correct source identification (aka RID) in upstream packets they might send, such as error messages or DMAs. However while devices know (and hard wire) their own device and function number, they know nothing about bus numbers by default, those are decoded by bridges for routing. All they know is that if their parent bridge sends a "type 0" configuration access, they should decode it provided the device and function numbers match. The PCIe spec thus defines that when a device receive such a configuration access and it's a write, it should "capture" the bus number in the source field of the packet, and re-use as the originator bus number of all subsequent outgoing requests. In order to ensure that a device has this bus number firmly established before it's likely to send error packets upstream, we should thus do a dummy configuration write to it as soon as possible after probing. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com> [stewart@linux.vnet.ibm.com: fix Evolution broken patch, write vdid rather than &vdid as per Gavin suggestion] Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> (cherry picked from commit f46c1e506d199332b0f9741278c8ec35b3e39135)
2016-04-29Add skiboot-5.1.16 release notesskiboot-5.1.16Stewart Smith1-0/+52
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2016-04-27hw/phb3: Ensure PQ bits are cleared in the IVC when masking IRQMichael Neuling1-6/+26
When we mask an interrupt, we may race with another interrupt coming in from the hardware. If this occurs, the P and/or Q bit may end up being set but we never EOI/clear them. This could result in a lost interrupt or the next interrupt that comes in after re-enabling never being presented. This patch ensures that when masking an interrupt, any pending P/Q bits are cleared. This fixes a bug seen with some CAPI workloads which have lots of interrupt masking at the same time as high interrupt load. The fix is not specific to CAPI though. Signed-off-by: Michael Neuling <mikey@neuling.org> Tested-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2016-04-27hw/phb3: Fix potential race in EOIMichael Neuling1-5/+29
When we EOI we need to clear the present (P) bit in the Interrupt Vector Cache (IVC). We must clear P ensuring that any additional interrupts that come in aren't lost while also maintaining coherency with the Interrupt Vector Table (IVT). To do this, the hardware provides a conditional update bit in the IVC. This bit ensures that generation counts between the IVT and the IVC updates are synchronised. Unfortunately we never set this the bit to conditionally update the P bit in the IVC based on the generation count. Also, we didn't set what we wanted the new generation count to be if the update was successful. This patch fixes sets both of these. It also reworks and documents the code so that mortals may eventually be able to understand this process. Signed-off-by: Michael Neuling <mikey@neuling.org> Tested-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com> Tested-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2016-04-06OPAL:Handle mbox response with bad status:0x24 during FSP terminationMamatha2-3/+15
Problem Description: During FSP termination/reset, FSP received mbox command from OPAL for "Fetching platform management function data". As FSP is in termination state DMAE operation failed to write memory data to hypervisor, so FSP sent mbox command with response status as 0x24 to OPAL and OPAL committed a predictive log with SRC BB822411 and sent back response status as 0xFE, which FSP IPMI will not understand the failure at the Host and IPMI will log the error. Fix:This patch is to fix when OPAL receives a bad response from FSP 0x24 due to DMAE error, commit informational log and return response status as SUCCESS and for all other bad status response commit predictive log. Signed-off-by: Mamatha Inamdar <mamatha4@linux.vnet.ibm.com> Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2016-04-01hmi: Fix a bug where partial hmi event was reported to host.Mahesh Salgaonkar1-7/+8
The current code sends partial hmi event (4 * 64bits instead of 5 * 64bits) to host. The last 64 bits contains chip id/pir info for reporting checkstop events. This bug affects only checkstop events. Host console o/p without this patch: [ 305.628283] Fatal Hypervisor Maintenance interrupt [Not recovered] [ 305.628341] Error detail: Malfunction Alert [ 305.628388] HMER: 8040000000000000 [ 305.628423] CPU PIR: 00000000 [ 305.628458] [Unit: VSU] Logic core check stop Host console o/p with this patch: [ 200.122883] Fatal Hypervisor Maintenance interrupt [Not recovered] [ 200.122941] Error detail: Malfunction Alert [ 200.122986] HMER: 8040000000000000 [ 200.123021] CPU PIR: 000008e8 [ 200.123055] [Unit: VSU] Logic core check stop Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2016-03-16Add skiboot-5.1.15 release notesskiboot-5.1.15Stewart Smith1-0/+9
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2016-03-11ipmi-sel: Fix memory leak in error pathVasant Hegde1-0/+1
Commit a5299ba2 dropped non-severe event from logging to BMC, but I forgot to releaes the error log structure. Fixes: a5299ba2 (IPMI: Only log events that needs attention) Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2016-03-09add skiboot-5.1.14 release notesskiboot-5.1.14Stewart Smith1-0/+20
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2016-02-23fsp: fix spelling of "advertise" in log messageAndrew Donnellan1-1/+1
Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>