aboutsummaryrefslogtreecommitdiff
path: root/core/hmi.c
AgeCommit message (Collapse)AuthorFilesLines
2018-10-16opal/hmi: Wakeup the cpu before reading core_firVaibhav Jain1-6/+15
When stop state 5 is enabled, reading the core_fir during an HMI can result in a xscom read error with xscom_read() returning an OPAL_XSCOM_PARTIAL_GOOD error code and core_fir value of all FFs. At present this return error code is not handled in decode_core_fir() hence the invalid core_fir value is sent to the kernel where it interprets it as a FATAL hmi causing a system check-stop. This can be prevented by forcing the core to wake-up using before reading the core_fir. Hence this patch wraps the call to read_core_fir() within calls to dctl_set_special_wakeup() and dctl_clear_special_wakeup(). Suggested-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com> Signed-off-by: Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> Acked-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Reviewed-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-09-27opal/hmi: Handle early HMIs on thread0 when secondaries are still in OPAL.Mahesh Salgaonkar1-0/+49
When primary thread receives a CORE level HMI for timer facility errors while secondaries are still in OPAL, thread 0 ends up in rendez-vous waiting for secondaries to get into hmi handling. This is because OPAL runs with MSR(EE=0) and hence HMIs are delayed on secondary threads until they are given to Linux OS. Fix this by adding a check for secondary state and force them in hmi handling by queuing job on secondary threads. I have tested this by injecting HDEC parity error very early during Linux kernel boot. Recovery works fine for non-TB errors. But if TB is bad at this very eary stage we already doomed. Without this patch we see: [ 285.046347408,7] OPAL: Start CPU 0x0843 (PIR 0x0843) -> 0x000000000000a83c [ 285.051160609,7] OPAL: Start CPU 0x0844 (PIR 0x0844) -> 0x000000000000a83c [ 285.055359021,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000 [ 285.055361439,7] HMI: [Loc: U78D3.ND1.WZS004A-P1-C48]: P:8 C:17 T:0: TFMR(2e12002870e14000) Timer Facility Error [ 286.232183823,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 1 (sptr=0000ccc1) [ 287.409002056,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 2 (sptr=0000ccc1) [ 289.073820164,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 3 (sptr=0000ccc1) [ 290.250638683,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 1 (sptr=0000ccc2) [ 291.427456821,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 2 (sptr=0000ccc2) [ 293.092274807,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 3 (sptr=0000ccc2) [ 294.269092904,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 1 (sptr=0000ccc3) [ 295.445910944,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 2 (sptr=0000ccc3) [ 297.110728970,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 3 (sptr=0000ccc3) After this patch: [ 259.401719351,7] OPAL: Start CPU 0x0841 (PIR 0x0841) -> 0x000000000000a83c [ 259.406259572,7] OPAL: Start CPU 0x0842 (PIR 0x0842) -> 0x000000000000a83c [ 259.410615534,7] OPAL: Start CPU 0x0843 (PIR 0x0843) -> 0x000000000000a83c [ 259.415444519,7] OPAL: Start CPU 0x0844 (PIR 0x0844) -> 0x000000000000a83c [ 259.419641401,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000 [ 259.419644124,7] HMI: [Loc: U78D3.ND1.WZS004A-P1-C48]: P:8 C:17 T:0: TFMR(2e12002870e04000) Timer Facility Error [ 259.419650678,7] HMI: Sending hmi job to thread 1 [ 259.419652744,7] HMI: Sending hmi job to thread 2 [ 259.419653051,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000 [ 259.419654725,7] HMI: Sending hmi job to thread 3 [ 259.419654916,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000 [ 259.419658025,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000 [ 259.419658406,7] HMI: [Loc: U78D3.ND1.WZS004A-P1-C48]: P:8 C:17 T:2: TFMR(2e12002870e04000) Timer Facility Error [ 259.419663095,7] HMI: [Loc: U78D3.ND1.WZS004A-P1-C48]: P:8 C:17 T:3: TFMR(2e12002870e04000) Timer Facility Error [ 259.419655234,7] HMI: [Loc: U78D3.ND1.WZS004A-P1-C48]: P:8 C:17 T:1: TFMR(2e12002870e04000) Timer Facility Error [ 259.425109779,7] OPAL: Start CPU 0x0845 (PIR 0x0845) -> 0x000000000000a83c [ 259.429870681,7] OPAL: Start CPU 0x0846 (PIR 0x0846) -> 0x000000000000a83c [ 259.434549250,7] OPAL: Start CPU 0x0847 (PIR 0x0847) -> 0x000000000000a83c Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-09-17opal/hmi: Ignore debug trigger inject core FIR.Mahesh Salgaonkar1-1/+0
Core FIR[60] is a side effect of the work around for the CI Vector Load issue in DD2.1. Usually this gets delivered as HMI with HMER[17] where Linux already ignores it. But it looks like in some cases we may happen to see CORE_FIR[60] while we are already in Malfunction Alert HMI (HMER[0]) due to other reasons e.g. CAPI recovery or NPU xstop. If that happens then just ignore it instead of crashing kernel as not recoverable. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Reviewed-by: Vaibhav Jain <vaibhav@linux.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-08-13opal/hmi: Catch NPU2 HMIs for opencapiFrederic Barrat1-5/+10
HMIs for NPU2 are filtered with the 'compatible' string of the PHB, so add opencapi to the mix. Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-06-27hw/npu2, core/hmi: Use NPU instead of NPU2 as log message prefixAndrew Donnellan1-3/+3
The NPU2{DBG,INF,ERR} macros use "NPU%d" as a prefix to identify messages relating to a particular NPU. It's slightly confusing to have per-NPU messages prefixed with "NPU0" or "NPU1" and NPU-generic messages prefixed with "NPU2". On some future system we could potentially have a NPU #2 in which case it'd be really confusing. Use NPU rather than NPU2 for NPU-generic log messages. There's no risk of confusion with the original npu.c code since that's only for P8. Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Acked-by: Reza Arbab <arbab@linux.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-06-04opal/hmi: Display correct chip id while printing NPU FIRs.Mahesh Salgaonkar1-4/+4
HMIs for NPU xstops are broadcasted to all chips. All cores on all the chips receive HMI. HMI handler correctly identifies and extracts the NPU FIR details from affected chip, but while printing FIR data it prints chip id and location code details of this_cpu()->chip_id which may not be correct. This patch fixes this issue. CC: stable # v6.0+ Fixes: 7bcbc78c ("Add location code to NPU2 HMI logging") Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> [stewart: add fixes and cc stable] Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-05-15Add location code to NPU2 HMI loggingBalbir Singh1-8/+14
The current HMI error message does not specifiy where the HMI error occured. The original error message was NPU: FIR#0 FIR 0x0080100000000000 mask 0x009a48180f01ffff The enhanced error message is NPU2: [Loc: UOPWR.0000000-Node0-Proc0] P:0 FIR#0 FIR 0x0000100000000000 mask 0x009a48180f03ffff Signed-off-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-05-09hmi: Fix clearing HMER on debug triggerMichael Neuling1-0/+1
In the recent patch: eddff9bf40 hmi: Clear unknown debug trigger I rebased the code from an older skiboot before the HMI rework. When I did this, I missed the handled flag. Without this the HMER is not cleared properly and the HMI keeps happening. This properly sets the handled flag and hence clears the HMER bit. Signed-off-by: Michael Neuling <mikey@neuling.org> Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-05-04hmi: Clear unknown debug triggerRyan Grimm1-0/+10
On some systems, seeing hangs like this when Linux starts: [ 170.027252763,5] OCC: All Chip Rdy after 0 ms [ 170.062930145,5] INIT: Starting kernel at 0x20011000, fdt at 0x30ae0530 366247 bytes) [ 171.238270428,5] OPAL: Switch to little-endian OS If you look at the in memory skiboot console (or do 'nvram -p ibm,skiboot --update-config log-level-driver=7') we see the console get spammed with: [ 5209.109790675,7] HMI: Received HMI interrupt: HMER = 0x0000400000000000 [ 5209.109792716,7] HMI: Received HMI interrupt: HMER = 0x0000400000000000 [ 5209.109794695,7] HMI: Received HMI interrupt: HMER = 0x0000400000000000 [ 5209.109796689,7] HMI: Received HMI interrupt: HMER = 0x0000400000000000 We're taking the debug trigger (bit 17) early on, before the hmi_debug_trigger function in the kernel is set up. This clears the HMI in Skiboot and reports to the kernel instead of bringing down the machine. Signed-off-by: Ryan Grimm <grimm@linux.vnet.ibm.com> Signed-off-by: Michael Neuling <mikey@neuling.org> Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-05-04core/hmi: assign flags=0 in case nothing set by handle_hmi_exceptionStewart Smith1-1/+1
Practically speaking, I don't think you'd *currently* hit this. Found with Clang's scan-build. Signed-off-by: Stewart Smith <stewart@linux.ibm.com> Acked-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-04-24opal/hmi: Generate one event per core for processor recovery.Mahesh Salgaonkar1-3/+3
Processor recovery is per core error. All threads on that core receive HMI. All threads don't need to generate HMI event for same error. Let thread 0 only generate the event. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-04-24opal:hmi: Add missing processor recovery reason string.Mahesh Salgaonkar1-0/+1
With this patch now we see reason string printed for CORE_WOF[43] bit. [ 477.352234986,7] HMI: [Loc: U78D3.001.WZS004A-P1-C48]: P:8 C:22 T:3: Processor recovery occurred. [ 477.352240742,7] HMI: Core WOF = 0x0000000000100000 recovered error: [ 477.352242181,7] HMI: PC - Thread hang recovery Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-04-17opal/hmi: Generate hmi event for recovered HDEC parity error.Mahesh Salgaonkar1-3/+2
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-04-17opal/hmi: check thread 0 tfmr to validate latched tfmr errors.Mahesh Salgaonkar1-19/+42
Due to P9 errata, HDEC parity and TB residue errors are latched for non-zero threads 1-3 even if they are cleared. But these are not latched on thread 0. Hence, use xscom SCOMC/SCOMD to read thread 0 tfmr value and ignore them on non-zero threads if they are not present on thread 0. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-04-17opal/hmi: Print additional debug information in rendezvous.Mahesh Salgaonkar1-2/+4
Helps in debugging... Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-04-17opal/hmi: Fix handling of TFMR parity/corrupt error.Mahesh Salgaonkar1-5/+4
While testing TFMR parity/corrupt error it has been observed that HMIs are delivered twice for this error - First time HMI is delivered with HMER[4,5]=1 and TFMR[60]=1. - Second time HMI is delivered with HMER[4,5]=1 and TFMR[60]=0 with valid TB. On second HMI we end up throwing below error message even though TB is in valid state. "HMI: TB invalid without core error reported" This patch fixes this issue by ignoring HMER[5] and checking only for TFMR[60] before setting this_cpu()->tb_invalid to true. Suggested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-04-17opal/hmi: Fix soft lockups during TOD errorsMahesh Salgaonkar1-1/+15
There are some TOD errors which do not affect working of TOD and TB. They stay in valid state. Hence we don't need rendez vous for TOD errors that does not affect TB working. TOD errors that affects TOD/TB will report a global error on TFMR[44] alongwith bit 51, and they will go in rendez vous path as expected. But the TOD errors that does not affect TB register sets only TFMR bit 51. The TFMR bit 51 is cleared when any single thread clears the TOD error. Once cleared, the bit 51 is reflected to all the cores on that chip. Any thread that reads the TFMR register after the error is cleared will see TFMR bit 51 reset. Hence the threads that see TFMR[51]=1, falls through rendez-vous path and threads that see TFMR[51]=0, returns doing nothing. This ends up in a soft lockups in host kernel. This patch fixes this issue by not considering TOD interrupt (TFMR[51]) as a core-global error and hence avoiding rendez-vous path completely. Instead threads that see TFMR[51]=1 will now take different path that just do the TOD error recovery. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-04-17opal/hmi: Do not send HMI event if no errors are found.Mahesh Salgaonkar1-8/+13
For TOD errors, all the cores in the chip get HMIs. Any one thread from any core can fix the issue and TFMR will have error conditions cleared. Rest of the threads need take any action if TOD errors are already cleared. Hence thread 0 of every core should get a fresh copy of TFMR before going ahead recovery path. Initialize recover = -1, so that if no errors found that thread need not send a HMI event to linux. This helps in stop flooding host with hmi event by every thread even there are no errors found. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-04-17opal/hmi: Initialize the hmi event with old value of HMER.Mahesh Salgaonkar1-3/+6
Do this before we check for TFAC errors. Otherwise the event at host console shows no error reported in HMER register. Without this patch the console event show HMER with all zeros [ 216.753417] Severe Hypervisor Maintenance interrupt [Recovered] [ 216.753498] Error detail: Timer facility experienced an error [ 216.753509] HMER: 0000000000000000 [ 216.753518] TFMR: 3c12000870e04000 After this patch it shows old HMER values on host console: [ 2237.652533] Severe Hypervisor Maintenance interrupt [Recovered] [ 2237.652651] Error detail: Timer facility experienced an error [ 2237.652766] HMER: 0840000000000000 [ 2237.652837] TFMR: 3c12000870e04000 Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-04-17opal/hmi: Rework HMI handling of TFAC errorsBenjamin Herrenschmidt1-292/+231
This patch reworks the HMI handling for TFAC errors by introducing 4 rendez-vous points improve the thread synchronization while handling timebase errors that requires all thread to clear dirty data from TB/HDEC register before clearing the errors. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-04-17opal/hmi: Don't bother passing HMER to pre-recovery cleanupBenjamin Herrenschmidt1-14/+6
The test for TFAC error is now redundant so we remove it and remove the HMER argument. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-04-17opal/hmi: Move timer related error handling to a separate functionBenjamin Herrenschmidt1-48/+58
Currently no functional change. This is a first step to completely rewriting how these things are handled. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-04-17opal/hmi: Add a new opal_handle_hmi2 that returns direct info to LinuxBenjamin Herrenschmidt1-45/+82
It returns a 64-bit flags mask currently set to provide info about which timer facilities were lost, and whether an event was generated. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-04-17opal/hmi: Remove races in clearing HMERBenjamin Herrenschmidt1-10/+12
Writing to HMER acts as an "AND". The current code writes back the value we originally read with the bits we handled cleared. This is racy, if a new bit gets set in HW after the original read, we'll end up clearing it without handling it. Instead, use an all 1's mask with only the bit handled cleared. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-04-17opal/hmi: Don't re-read HMER multiple timesBenjamin Herrenschmidt1-21/+14
We want to make sure all reporting and actions are based upon the same snapshot of HMER in case bits get added by HW while we are in OPAL. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-03-27NPU2: dump NPU2 registers on npu2 HMIStewart Smith1-2/+73
Due to the nature of debugging npu2 issues, folk are wanting the full list of NPU2 registers dumped when there's a problem. We have to list out each register as traversing the range triggers FIR bits that confuse PRD. Suggested-by: Ryan Black <rblack@us.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2018-03-27Revert "NPU2 HMIs: dump out a *LOT* of npu2 registers for debugging"Stewart Smith1-37/+1
This reverts commit fbdc91e693fc3103f7e2a65054ed32bfb26a2e17. We don't need this as we need to do it a different way, with a explicit set of registers as otherwise we trip other random FIR bits and everything becomes even more terrible. I suggest alcohol. Cc: stable Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2018-03-01core/hmi: report processor recovery reason from core FIR bits on P9Nicholas Piggin1-3/+59
When an error is encountered that causes processor recovery, HMI is generated if the recovery was successful. The reason is recorded in the core FIR, which gets copied into the WOF. In this case dump the WOF register and an error string into the OPAL msglog. A broken init setting led to HMIs reported in Linux as: [ 3.591547] Harmless Hypervisor Maintenance interrupt [Recovered] [ 3.591648] Error detail: Processor Recovery done [ 3.591714] HMER: 2040000000000000 This patch would have been useful because it tells us exactly that the problem is in the d-side ERAT: [ 414.489690798,7] HMI: Received HMI interrupt: HMER = 0x2040000000000000 [ 414.489693339,7] HMI: [Loc: UOPWR.0000000-Node0-Proc0]: P:0 C:1 T:1: Processor recovery occurred. [ 414.489699837,7] HMI: Core WOF = 0x0000000410000000 recovered error: [ 414.489701543,7] HMI: LSU - SRAM (DCACHE parity, etc) [ 414.489702341,7] HMI: LSU - ERAT multi hit In future it will be good to unify this reporting, so Linux could print something more useful. Until then, this gives some good data. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2018-02-28NPU2 HMIs: dump out a *LOT* of npu2 registers for debuggingStewart Smith1-1/+37
This is not the way we want to end up doing this. This is a hack to make folk happy and not require crondump to debug nvidia/npu2 issues. Cc: stable Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2018-02-08hw/npu2: Implement logging HMI actionsBalbir Singh1-1/+82
Log HMI errors as step 1. OS will need to deduce and interpret the HMI event. Signed-off-by: Balbir Singh <bsingharora@gmail.com> Acked-by: Alistair Popple <alistair@popple.id.au> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-12-13core/hmi: Display chip location code while displaying core FIR.Mahesh Salgaonkar1-1/+4
No functionality change. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-12-13core/hmi: Do not display FIR details if none of the bits are set.Mahesh Salgaonkar1-0/+3
So that we don't flood OPAL console logs with information that is not useful. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-12-13opal/hmi: HMI logging with location code info.Mahesh Salgaonkar1-4/+36
Add few HMI debug prints with location code info few additional info. No functionality change. With this patch the log messages will look like: [210612.175196744,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000 [210612.175200449,7] HMI: [Loc: UOPWR.1302LFA-Node0-Proc1]: P:8 C:16 T:1: TFMR(2d12000870e04020) Timer Facility Error [210660.259689526,7] HMI: Received HMI interrupt: HMER = 0x2040000000000000 [210660.259695649,7] HMI: [Loc: UOPWR.1302LFA-Node0-Proc0]: P:0 C:16 T:1: Processor recovery Done. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-12-13core/hmi: Use pr_fmt macro for tagging log messagesMahesh Salgaonkar1-16/+19
No functionality changes. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-10-23opal/hmi: Workaround Power9 hw logic bug for couple of TFMR TB errors.Mahesh Salgaonkar1-1/+27
Add a workaround for a HW logic bug in Power9 where TB residue and HDEC parity errors cleared by one thread aren't visible to other threads of same core. The TB reside and HDEC parity error are reported through TFMR bit 45 and 26 respectively. If any of the thread from the core clears the TFMR bit 26 and 45, only thread 0 is able to see that errors are cleared but rest of the threads 1, 2 and 3 do not see those as cleared. This causes TB error recovery to fail for TB residue and HDEC parity errors. TFMR is per core register and any changes made by a one thread should be visible by other threads of the same core. On TB residue error (TFMR bit 45), TB goes into invalid state. Hence avoid handling/clearing TB residue error if TB is valid and running. Use TFMR bit 41 to check validity of TB state. For HDEC parity error (TFMR bit 26), check for other errors on TFMR register and ignore the pre-recovery for HDEC parity error. If TFMR has any other TB error bits set alongwith HDEC parity error we can safely ignore handling of HDEC parity error. Also, while clearing HDEC parity error bit from TFMR, allow only thread 0 to clear it. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-10-23opal/hmi: Fix TB reside and HDEC parity error recovery for power9Mahesh Salgaonkar1-3/+102
On TB/HDEC errors, all 4 threads on the affected receives HMI. On power9, every thread on the core has its own copy of TB/HDEC and hence every thread has to clear the dirty data from its own TB/HDEC register before we clear tb errors through TFMR[24]. The HMI recovery would fail even if one thread do not cleanup the respective TB/HDEC register. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-08-15core/hmi: Fix use of uninitialised value (CID 147808)Cyril Bur1-3/+3
A rework of where some of the xscom regs are for POWER9 has resulted in a scope issue where the same line attempts to simultaneously reference a variable by the same name in global and function scope. Change the value read by xscom_read to *_val Fixes: CID 147808 Fixes: bda5e0ea Fix scom addresses for power9 nx checkstop hmi handling. Signed-off-by: Cyril Bur <cyril.bur@au1.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-07-07Fix scom addresses for power9 nx checkstop hmi handling.Mahesh Salgaonkar1-7/+20
Scom addresses for NX status, DMA & ENGINE FIR and PBI FIR has changed for Power9. Fixup thoes while handling nx checkstop for Power9. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-07-07Fix scom addresses for power9 core checkstop hmi handling.Mahesh Salgaonkar1-5/+56
Scom addresses for CORE FIR (Fault Isolation Register) and Malfunction Alert Register has changed for Power9. Fixup those while handling core checkstop for Power9. Without this change HMI handler fails to check for correct reason for core checkstop on Power9. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-06-19capi: Handle HMI eventsChristophe Lombard1-83/+43
Find the CAPP on the chip associated with the HMI event for PHB4. The recovery mode (re-initialization of the capp, resume of functional operations) is only available with P9 DD2. A new patch will be provided to support this feature. Signed-off-by: Christophe Lombard <clombard@linux.vnet.ibm.com> Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-06-19capi: Move phb3 capp registers to specialized filesChristophe Lombard1-1/+1
The definitions of the CAPP registers for PHB3 are moved in a specific file. The updated file capp.h will be used for the common functionalities about the CAPP for PHB3 and PHB4. Signed-off-by: Christophe Lombard <clombard@linux.vnet.ibm.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-06-06Convert important polling loops to spin at lowest SMT priorityNicholas Piggin1-1/+3
The pattern of calling cpu_relax() inside a polling loop does not suit the powerpc SMT priority instructions. Prefrred is to set a low priority then spin until break condition is reached, then restore priority. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> [stewart@linux.vnet.ibm.com: fixup lpc-uart wait_tx_room() and unit test] Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-03-16hmi: Print CAPP FIR information when handling CAPP malfunction alertsAndrew Donnellan1-0/+29
When diagnosing or debugging CAPP errors, it's rather useful to have the CAPP FIR, which often provides very helpful information. Print the CAPP FIR to the log when we handle a Malfunction Alert HMI for a CAPP error. Cc: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-03-16hmi: refactor malfunction alert handlingAndrew Donnellan1-46/+42
The logic in decode_malfunction() is rather tricky to follow. Refactor the code to make it easier to follow. No functional change. Cc: Russell Currey <ruscur@russell.cc> Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Cc: Ryan Grimm <grimm@linux.vnet.ibm.com> Cc: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Cc: Daniel Axtens <dja@axtens.net> Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2016-10-25fast-reboot: disable on FSP code update or unrecoverable HMIStewart Smith1-0/+2
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> [stewart@linux.vnet.ibm.com: unlock before return (suggested by Mahesh/Andrew), disable only on non-cancelling fsp codeupdate call (suggested by Vasant)] Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2016-08-10hmi: Clean up NPU FIR debug messagesRussell Currey1-3/+4
With the skiboot log set to debug, the FIR (and related registers) were logged all in the same message. It was too much for one line, didn't clarify if the numbers were in hex, and didn't show leading zeroes. So, split it into two lines, with leading zeroes and a "0x" prefix. Signed-off-by: Russell Currey <ruscur@russell.cc> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2016-06-21nvlink: Print error message when NPU is fencedRussell Currey1-1/+1
NPU fences aren't recoverable, and as such, would require user intervention to have a working system again. The fence will be picked up by the kernel through EEH, but this doesn't happen until the NPU is used for something. So, let's print a message so it's obvious when this happens. A helper function was added to reduce duplication. This also enables code in skiboot to un-fence a NPU, which is useful to NPU developers but very stupid otherwise. Signed-off-by: Russell Currey <ruscur@russell.cc> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2016-05-03PCI: Move PHB lock to generic layerGavin Shan1-3/+2
All kinds of PHBs are maintaining a spinlock. At mean while, the spinlock is acquired or released by backends for phb_ops->lock() or phb_ops->unlock(). There're no difference of the logic on all kinds of PHBs. So it's reasonable to maintain the lock in the generic layer (struct phb). This moves lock from specific PHB to generic one. The spinlock is initialized when the generic PHB is registered in pci_register_phb(). Also, two inline functions phb_{lock, unlock}() are introduced to acquire/release it. No logical changes introduced. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2016-04-27hmi: Recover both CAPP units on Naples after malfunction alertPhilippe Bergheaud1-12/+35
Naples has two capp units. Probe both units to identify the card in error. Use the xscom register offset to operate on the right unit. Signed-off-by: Philippe Bergheaud <felix@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2016-04-01Merge branch 'skiboot-5.1.x' into skiboot-5.2.x - partial HMI fixStewart Smith1-7/+8