aboutsummaryrefslogtreecommitdiff
path: root/core/hmi.c
AgeCommit message (Collapse)AuthorFilesLines
2023-07-11pau: Fix a build error with GCC 13.1.1Vaibhav Jain1-1/+1
Compiling hmi.c with GCC 13.1.1 results in following error core/hmi.c:860:25: error: ‘pau’ may be used uninitialized [-Werror=maybe-uninitialized] 860 | pau_opencapi_dump_scoms(pau); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~ Fix this by initializing variable 'pau' with NULL when defined in npu_fir_errors(). Variable 'pau' is only assigned/referenced in switch-case blocks when phb_type == phb_type_pau_opencapi. Hence this patch shouldn't introduce any behavioral changes. Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com> Signed-off-by: Reza Arbab <arbab@linux.ibm.com>
2023-03-28opal/hmi: Recover from unusual HMI with no TB error reported.Mahesh Salgaonkar1-4/+36
Timer facility HMIs are reported with TB error reason set in the TFMR register. With help from reason set in TFMR register, OPAL hmi handler carry out appropriate recovery procedure to successfully recover from Timer facility HMIs. However, On p10, in a very rare situation when core is waking up from stop2 or higher stop state, timer facility goes into error state due to Missing step, causing an HMI with no error reason set in TFMR register other than TFMR[41]=0 (tb_valid) and TFMR[28:31]=9 (tbst_encoded). Ideally, "Missing step" error should be reported in TFMR[44]=1. It looks like in this rare case, while generating HMI, HW fails to sync up the TFMR register with the core which is waking up from stop2. Hence, in absence of proper error reason, OPAL fails to recover from this unusual HMI, resulting in system panic/crash. In order to recover from this HMI it needs to reset the core level error "Missing step". Handle this as special case by treating this as TFMR corrupt error (TFMR[60]) which will then force reset core level errors including Missing step. Reported-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.ibm.com> Signed-off-by: Reza Arbab <arbab@linux.ibm.com>
2021-10-19pau: hmi scom dumpChristophe Lombard1-144/+132
This patch add a new function to dump PAU registers when a HMI has been raised and an OpenCAPI link has been hit by an error. For each register, the scom address and the register value are printed. The hmi.c has been redesigned in order to support the new PHB/PCIEX type (PAU OpenCapi). Now, the *npu* functions support NPU and PAU units of P8, P9 and P10 chips. Signed-off-by: Christophe Lombard <clombard@linux.vnet.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2021-08-06hw/phb5: Add initial supportJordan Niethe1-0/+4
The PHB5 logic on P10 is pretty close to the P9's version. So we keep our base phb4 implementation and just add the few changes within if statements. Signed-off-by: Jordan Niethe <jpn@ozlabs.au.ibm.com> [clg: misc cleanups and fixes ] Signed-off-by: Cédric Le Goater <clg@kaod.org> [Fixed compilation issue - Vasant] Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> [Nick: Unify PHB4/PHB5 drivers ] Signed-off-by: Nicholas Piggin <npiggin@gmail.com> [Mikey: set default lane eq settings for phb5] Signed-off-by: Michael Neuling <mikey@neuling.org> [FB: squash commits + small cleanup ] Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2021-08-06Initial POWER10 enablementNicholas Piggin1-11/+210
Co-authored-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Co-authored-by: Vaidyanathan Srinivasan <svaidy@linux.ibm.com> Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.ibm.com> Co-authored-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Michael Neuling <mikey@neuling.org> Co-authored-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Co-authored-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Co-authored-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2020-03-12Re-license IBM written files as Apache 2.0 OR GPLv2+Stewart Smith1-1/+1
SPDX makes it a simpler diff. I have audited the commit history of each file to ensure that they are exclusively authored by IBM and thus we have the right to relicense. The motivation behind this is twofold: 1) We want to enable experiments with coreboot, which is GPLv2 licensed 2) An upcoming firmware component wants to incorporate code from skiboot and code from the Linux kernel, which is GPLv2 licensed. I have gone through the IBM internal way of gaining approval for this. The following files are not exclusively authored by IBM, so are *not* included in this update (I will be seeking approval from contributors): core/direct-controls.c core/flash.c core/pcie-slot.c external/common/arch_flash_unknown.c external/common/rules.mk external/gard/Makefile external/gard/rules.mk external/opal-prd/Makefile external/pflash/Makefile external/xscom-utils/Makefile hdata/vpd.c hw/dts.c hw/ipmi/ipmi-watchdog.c hw/phb4.c include/cpu.h include/phb4.h include/platform.h libflash/libffs.c libstb/mbedtls/sha512.c libstb/mbedtls/sha512.h platforms/astbmc/barreleye.c platforms/astbmc/garrison.c platforms/astbmc/mihawk.c platforms/astbmc/nicole.c platforms/astbmc/p8dnu.c platforms/astbmc/p8dtu.c platforms/astbmc/p9dsu.c platforms/astbmc/vesnin.c platforms/rhesus/ec/config.h platforms/rhesus/ec/gpio.h platforms/rhesus/gpio.c platforms/rhesus/rhesus.c platforms/astbmc/talos.c platforms/astbmc/romulus.c Signed-off-by: Stewart Smith <stewart@linux.ibm.com> [oliver: fixed up the drift] Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
2019-12-16hmi: endian conversionsNicholas Piggin1-10/+10
Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
2019-07-26npu2: Add checks to npu2-only codepathsReza Arbab1-0/+4
To prepare for npu3, add a few checks in codepaths that are only for npu2. Compare against PVR_TYPE_P9, as npu3 will be in systems of PVR_TYPE_P9P (or greater). Alternatively, check for dt compatibility with "ibm,power9-npu" because npu3 will use "ibm,power9-npu3". Signed-off-by: Reza Arbab <arbab@linux.ibm.com> Reviewed-by: Andrew Donnellan <ajd@linux.ibm.com> Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
2019-07-26SPDX-ify all skiboot codeStewart Smith1-13/+4
Use Software Package Data Exchange (SPDX) to indicate license for each file that is unique to skiboot. At the same time, ensure the (C) who and years are correct. See https://spdx.org/ Signed-off-by: Stewart Smith <stewart@linux.ibm.com> [oliver: Added a few missing files] Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
2019-06-04opal/hmi: Report NPU2 checkstop reasonFrederic Barrat1-0/+44
The NPU2 is currently not passing any information to linux to explain the cause of an HMI. NPU2 has three Fault Isolation Registers and over 30 of those FIR bits are configured to raise an HMI by default. We won't be able to fit all possible state in the 32-bit xstop_reason field of the HMI event, but we can still try to encode up to 4 HMI reasons. Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com> Reviewed-by: Andrew Donnellan <ajd@linux.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2019-06-03opal-msg: Pass parameter size to _opal_queue_msg()Vasant Hegde1-3/+3
Currently _opal_queue_msg() takes number of parameters. So far this was fine as opal_queue_msg() was supporting only fixed number of parameters (8 * 8 bytes). Soon we are going to introduce variable size parameter. Hence num_params -> params_size. Cc: Jeremy Kerr <jk@ozlabs.org> Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Cc: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Acked-by Jeremy Kerr <jk@ozlabs.org> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2019-05-15xscom: move more register definitions into processor-specific includesNicholas Piggin1-25/+1
Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2019-05-15nvram: Flag dangerous NVRAM optionsMichael Neuling1-1/+1
Most nvram options used by skiboot are just for debug or testing for regressions. They should never be used long term. We've hit a number of issues in testing and the field where nvram options have been set "temporarily" but haven't been properly cleared after, resulting in crashes or real bugs being masked. This patch marks most nvram options used by skiboot as dangerous and prints a chicken to remind users of the problem. Signed-off-by: Michael Neuling <mikey@neuling.org> Reviewed-by: Samuel Mendoza-Jonas <sam@mendozajonas.com> Acked-By: Alistair Popple <alistair@popple.id.au> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2019-04-17opal/hmi: Initialize the hmi event with old value of TFMR.Mahesh Salgaonkar1-1/+3
Do this before we fix TFAC errors. Otherwise the event at host console shows no thread error reported in TFMR register. Without this patch the console event show TFMR with no thread error: (DEC parity error TFMR[59] injection) [ 53.737572] Severe Hypervisor Maintenance interrupt [Recovered] [ 53.737596] Error detail: Timer facility experienced an error [ 53.737611] HMER: 0840000000000000 [ 53.737621] TFMR: 3212000870e04000 After this patch it shows old TFMR value on host console: [ 2302.267271] Severe Hypervisor Maintenance interrupt [Recovered] [ 2302.267305] Error detail: Timer facility experienced an error [ 2302.267320] HMER: 0840000000000000 [ 2302.267330] TFMR: 3212000870e14010 Fixes: 674f7696f ("opal/hmi: Rework HMI handling of TFAC errors") Cc: skiboot-stable@lists.ozlabs.org Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2019-04-09opal/hmi: Never trust a cow!Frederic Barrat1-1/+1
With opencapi, it's fairly common to trigger HMIs during AFU development on the FPGA, by not replying in time to an NPU command, for example. So shift the blame reported by that cow to avoid crowding my mailbox. Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2019-04-09hw/npu2: Dump (more) npu2 registers on link error and HMIsFrederic Barrat1-57/+1
We were already logging some NPU registers during an HMI. This patch cleans up a bit how it is done and separates what is global from what is specific to nvlink or opencapi. Since we can now receive an error interrupt when an opencapi link goes down unexpectedly, we also dump the NPU state but we limit it to the registers of the brick which hit the error. The list of registers to dump was worked out with the hw team to allow for proper debugging. For each register, we print the name as found in the NPU workbook, the scom address and the register value. Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2019-03-05opal/hmi: set a flag to inform OS that TOD/TB has failed.Mahesh Salgaonkar1-1/+3
Set a flag to indicate OS about TOD/TB failure as part of new opal_handle_hmi2 handler. This flag then can be used by OS to make sure functions depending on TB value (e.g. udelay()) are aware of TB not ticking. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2019-03-05opal/hmi: Fix double unlock of hmi lock in failure path.Mahesh Salgaonkar1-5/+1
unlock once and goto error_out. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-10-16opal/hmi: Wakeup the cpu before reading core_firVaibhav Jain1-6/+15
When stop state 5 is enabled, reading the core_fir during an HMI can result in a xscom read error with xscom_read() returning an OPAL_XSCOM_PARTIAL_GOOD error code and core_fir value of all FFs. At present this return error code is not handled in decode_core_fir() hence the invalid core_fir value is sent to the kernel where it interprets it as a FATAL hmi causing a system check-stop. This can be prevented by forcing the core to wake-up using before reading the core_fir. Hence this patch wraps the call to read_core_fir() within calls to dctl_set_special_wakeup() and dctl_clear_special_wakeup(). Suggested-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com> Signed-off-by: Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> Acked-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Reviewed-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-09-27opal/hmi: Handle early HMIs on thread0 when secondaries are still in OPAL.Mahesh Salgaonkar1-0/+49
When primary thread receives a CORE level HMI for timer facility errors while secondaries are still in OPAL, thread 0 ends up in rendez-vous waiting for secondaries to get into hmi handling. This is because OPAL runs with MSR(EE=0) and hence HMIs are delayed on secondary threads until they are given to Linux OS. Fix this by adding a check for secondary state and force them in hmi handling by queuing job on secondary threads. I have tested this by injecting HDEC parity error very early during Linux kernel boot. Recovery works fine for non-TB errors. But if TB is bad at this very eary stage we already doomed. Without this patch we see: [ 285.046347408,7] OPAL: Start CPU 0x0843 (PIR 0x0843) -> 0x000000000000a83c [ 285.051160609,7] OPAL: Start CPU 0x0844 (PIR 0x0844) -> 0x000000000000a83c [ 285.055359021,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000 [ 285.055361439,7] HMI: [Loc: U78D3.ND1.WZS004A-P1-C48]: P:8 C:17 T:0: TFMR(2e12002870e14000) Timer Facility Error [ 286.232183823,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 1 (sptr=0000ccc1) [ 287.409002056,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 2 (sptr=0000ccc1) [ 289.073820164,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 3 (sptr=0000ccc1) [ 290.250638683,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 1 (sptr=0000ccc2) [ 291.427456821,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 2 (sptr=0000ccc2) [ 293.092274807,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 3 (sptr=0000ccc2) [ 294.269092904,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 1 (sptr=0000ccc3) [ 295.445910944,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 2 (sptr=0000ccc3) [ 297.110728970,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 3 (sptr=0000ccc3) After this patch: [ 259.401719351,7] OPAL: Start CPU 0x0841 (PIR 0x0841) -> 0x000000000000a83c [ 259.406259572,7] OPAL: Start CPU 0x0842 (PIR 0x0842) -> 0x000000000000a83c [ 259.410615534,7] OPAL: Start CPU 0x0843 (PIR 0x0843) -> 0x000000000000a83c [ 259.415444519,7] OPAL: Start CPU 0x0844 (PIR 0x0844) -> 0x000000000000a83c [ 259.419641401,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000 [ 259.419644124,7] HMI: [Loc: U78D3.ND1.WZS004A-P1-C48]: P:8 C:17 T:0: TFMR(2e12002870e04000) Timer Facility Error [ 259.419650678,7] HMI: Sending hmi job to thread 1 [ 259.419652744,7] HMI: Sending hmi job to thread 2 [ 259.419653051,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000 [ 259.419654725,7] HMI: Sending hmi job to thread 3 [ 259.419654916,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000 [ 259.419658025,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000 [ 259.419658406,7] HMI: [Loc: U78D3.ND1.WZS004A-P1-C48]: P:8 C:17 T:2: TFMR(2e12002870e04000) Timer Facility Error [ 259.419663095,7] HMI: [Loc: U78D3.ND1.WZS004A-P1-C48]: P:8 C:17 T:3: TFMR(2e12002870e04000) Timer Facility Error [ 259.419655234,7] HMI: [Loc: U78D3.ND1.WZS004A-P1-C48]: P:8 C:17 T:1: TFMR(2e12002870e04000) Timer Facility Error [ 259.425109779,7] OPAL: Start CPU 0x0845 (PIR 0x0845) -> 0x000000000000a83c [ 259.429870681,7] OPAL: Start CPU 0x0846 (PIR 0x0846) -> 0x000000000000a83c [ 259.434549250,7] OPAL: Start CPU 0x0847 (PIR 0x0847) -> 0x000000000000a83c Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-09-17opal/hmi: Ignore debug trigger inject core FIR.Mahesh Salgaonkar1-1/+0
Core FIR[60] is a side effect of the work around for the CI Vector Load issue in DD2.1. Usually this gets delivered as HMI with HMER[17] where Linux already ignores it. But it looks like in some cases we may happen to see CORE_FIR[60] while we are already in Malfunction Alert HMI (HMER[0]) due to other reasons e.g. CAPI recovery or NPU xstop. If that happens then just ignore it instead of crashing kernel as not recoverable. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Reviewed-by: Vaibhav Jain <vaibhav@linux.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-08-13opal/hmi: Catch NPU2 HMIs for opencapiFrederic Barrat1-5/+10
HMIs for NPU2 are filtered with the 'compatible' string of the PHB, so add opencapi to the mix. Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-06-27hw/npu2, core/hmi: Use NPU instead of NPU2 as log message prefixAndrew Donnellan1-3/+3
The NPU2{DBG,INF,ERR} macros use "NPU%d" as a prefix to identify messages relating to a particular NPU. It's slightly confusing to have per-NPU messages prefixed with "NPU0" or "NPU1" and NPU-generic messages prefixed with "NPU2". On some future system we could potentially have a NPU #2 in which case it'd be really confusing. Use NPU rather than NPU2 for NPU-generic log messages. There's no risk of confusion with the original npu.c code since that's only for P8. Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Acked-by: Reza Arbab <arbab@linux.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-06-04opal/hmi: Display correct chip id while printing NPU FIRs.Mahesh Salgaonkar1-4/+4
HMIs for NPU xstops are broadcasted to all chips. All cores on all the chips receive HMI. HMI handler correctly identifies and extracts the NPU FIR details from affected chip, but while printing FIR data it prints chip id and location code details of this_cpu()->chip_id which may not be correct. This patch fixes this issue. CC: stable # v6.0+ Fixes: 7bcbc78c ("Add location code to NPU2 HMI logging") Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> [stewart: add fixes and cc stable] Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-05-15Add location code to NPU2 HMI loggingBalbir Singh1-8/+14
The current HMI error message does not specifiy where the HMI error occured. The original error message was NPU: FIR#0 FIR 0x0080100000000000 mask 0x009a48180f01ffff The enhanced error message is NPU2: [Loc: UOPWR.0000000-Node0-Proc0] P:0 FIR#0 FIR 0x0000100000000000 mask 0x009a48180f03ffff Signed-off-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-05-09hmi: Fix clearing HMER on debug triggerMichael Neuling1-0/+1
In the recent patch: eddff9bf40 hmi: Clear unknown debug trigger I rebased the code from an older skiboot before the HMI rework. When I did this, I missed the handled flag. Without this the HMER is not cleared properly and the HMI keeps happening. This properly sets the handled flag and hence clears the HMER bit. Signed-off-by: Michael Neuling <mikey@neuling.org> Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-05-04hmi: Clear unknown debug triggerRyan Grimm1-0/+10
On some systems, seeing hangs like this when Linux starts: [ 170.027252763,5] OCC: All Chip Rdy after 0 ms [ 170.062930145,5] INIT: Starting kernel at 0x20011000, fdt at 0x30ae0530 366247 bytes) [ 171.238270428,5] OPAL: Switch to little-endian OS If you look at the in memory skiboot console (or do 'nvram -p ibm,skiboot --update-config log-level-driver=7') we see the console get spammed with: [ 5209.109790675,7] HMI: Received HMI interrupt: HMER = 0x0000400000000000 [ 5209.109792716,7] HMI: Received HMI interrupt: HMER = 0x0000400000000000 [ 5209.109794695,7] HMI: Received HMI interrupt: HMER = 0x0000400000000000 [ 5209.109796689,7] HMI: Received HMI interrupt: HMER = 0x0000400000000000 We're taking the debug trigger (bit 17) early on, before the hmi_debug_trigger function in the kernel is set up. This clears the HMI in Skiboot and reports to the kernel instead of bringing down the machine. Signed-off-by: Ryan Grimm <grimm@linux.vnet.ibm.com> Signed-off-by: Michael Neuling <mikey@neuling.org> Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-05-04core/hmi: assign flags=0 in case nothing set by handle_hmi_exceptionStewart Smith1-1/+1
Practically speaking, I don't think you'd *currently* hit this. Found with Clang's scan-build. Signed-off-by: Stewart Smith <stewart@linux.ibm.com> Acked-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-04-24opal/hmi: Generate one event per core for processor recovery.Mahesh Salgaonkar1-3/+3
Processor recovery is per core error. All threads on that core receive HMI. All threads don't need to generate HMI event for same error. Let thread 0 only generate the event. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-04-24opal:hmi: Add missing processor recovery reason string.Mahesh Salgaonkar1-0/+1
With this patch now we see reason string printed for CORE_WOF[43] bit. [ 477.352234986,7] HMI: [Loc: U78D3.001.WZS004A-P1-C48]: P:8 C:22 T:3: Processor recovery occurred. [ 477.352240742,7] HMI: Core WOF = 0x0000000000100000 recovered error: [ 477.352242181,7] HMI: PC - Thread hang recovery Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-04-17opal/hmi: Generate hmi event for recovered HDEC parity error.Mahesh Salgaonkar1-3/+2
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-04-17opal/hmi: check thread 0 tfmr to validate latched tfmr errors.Mahesh Salgaonkar1-19/+42
Due to P9 errata, HDEC parity and TB residue errors are latched for non-zero threads 1-3 even if they are cleared. But these are not latched on thread 0. Hence, use xscom SCOMC/SCOMD to read thread 0 tfmr value and ignore them on non-zero threads if they are not present on thread 0. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-04-17opal/hmi: Print additional debug information in rendezvous.Mahesh Salgaonkar1-2/+4
Helps in debugging... Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-04-17opal/hmi: Fix handling of TFMR parity/corrupt error.Mahesh Salgaonkar1-5/+4
While testing TFMR parity/corrupt error it has been observed that HMIs are delivered twice for this error - First time HMI is delivered with HMER[4,5]=1 and TFMR[60]=1. - Second time HMI is delivered with HMER[4,5]=1 and TFMR[60]=0 with valid TB. On second HMI we end up throwing below error message even though TB is in valid state. "HMI: TB invalid without core error reported" This patch fixes this issue by ignoring HMER[5] and checking only for TFMR[60] before setting this_cpu()->tb_invalid to true. Suggested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-04-17opal/hmi: Fix soft lockups during TOD errorsMahesh Salgaonkar1-1/+15
There are some TOD errors which do not affect working of TOD and TB. They stay in valid state. Hence we don't need rendez vous for TOD errors that does not affect TB working. TOD errors that affects TOD/TB will report a global error on TFMR[44] alongwith bit 51, and they will go in rendez vous path as expected. But the TOD errors that does not affect TB register sets only TFMR bit 51. The TFMR bit 51 is cleared when any single thread clears the TOD error. Once cleared, the bit 51 is reflected to all the cores on that chip. Any thread that reads the TFMR register after the error is cleared will see TFMR bit 51 reset. Hence the threads that see TFMR[51]=1, falls through rendez-vous path and threads that see TFMR[51]=0, returns doing nothing. This ends up in a soft lockups in host kernel. This patch fixes this issue by not considering TOD interrupt (TFMR[51]) as a core-global error and hence avoiding rendez-vous path completely. Instead threads that see TFMR[51]=1 will now take different path that just do the TOD error recovery. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-04-17opal/hmi: Do not send HMI event if no errors are found.Mahesh Salgaonkar1-8/+13
For TOD errors, all the cores in the chip get HMIs. Any one thread from any core can fix the issue and TFMR will have error conditions cleared. Rest of the threads need take any action if TOD errors are already cleared. Hence thread 0 of every core should get a fresh copy of TFMR before going ahead recovery path. Initialize recover = -1, so that if no errors found that thread need not send a HMI event to linux. This helps in stop flooding host with hmi event by every thread even there are no errors found. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-04-17opal/hmi: Initialize the hmi event with old value of HMER.Mahesh Salgaonkar1-3/+6
Do this before we check for TFAC errors. Otherwise the event at host console shows no error reported in HMER register. Without this patch the console event show HMER with all zeros [ 216.753417] Severe Hypervisor Maintenance interrupt [Recovered] [ 216.753498] Error detail: Timer facility experienced an error [ 216.753509] HMER: 0000000000000000 [ 216.753518] TFMR: 3c12000870e04000 After this patch it shows old HMER values on host console: [ 2237.652533] Severe Hypervisor Maintenance interrupt [Recovered] [ 2237.652651] Error detail: Timer facility experienced an error [ 2237.652766] HMER: 0840000000000000 [ 2237.652837] TFMR: 3c12000870e04000 Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-04-17opal/hmi: Rework HMI handling of TFAC errorsBenjamin Herrenschmidt1-292/+231
This patch reworks the HMI handling for TFAC errors by introducing 4 rendez-vous points improve the thread synchronization while handling timebase errors that requires all thread to clear dirty data from TB/HDEC register before clearing the errors. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-04-17opal/hmi: Don't bother passing HMER to pre-recovery cleanupBenjamin Herrenschmidt1-14/+6
The test for TFAC error is now redundant so we remove it and remove the HMER argument. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-04-17opal/hmi: Move timer related error handling to a separate functionBenjamin Herrenschmidt1-48/+58
Currently no functional change. This is a first step to completely rewriting how these things are handled. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-04-17opal/hmi: Add a new opal_handle_hmi2 that returns direct info to LinuxBenjamin Herrenschmidt1-45/+82
It returns a 64-bit flags mask currently set to provide info about which timer facilities were lost, and whether an event was generated. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-04-17opal/hmi: Remove races in clearing HMERBenjamin Herrenschmidt1-10/+12
Writing to HMER acts as an "AND". The current code writes back the value we originally read with the bits we handled cleared. This is racy, if a new bit gets set in HW after the original read, we'll end up clearing it without handling it. Instead, use an all 1's mask with only the bit handled cleared. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-04-17opal/hmi: Don't re-read HMER multiple timesBenjamin Herrenschmidt1-21/+14
We want to make sure all reporting and actions are based upon the same snapshot of HMER in case bits get added by HW while we are in OPAL. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-03-27NPU2: dump NPU2 registers on npu2 HMIStewart Smith1-2/+73
Due to the nature of debugging npu2 issues, folk are wanting the full list of NPU2 registers dumped when there's a problem. We have to list out each register as traversing the range triggers FIR bits that confuse PRD. Suggested-by: Ryan Black <rblack@us.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2018-03-27Revert "NPU2 HMIs: dump out a *LOT* of npu2 registers for debugging"Stewart Smith1-37/+1
This reverts commit fbdc91e693fc3103f7e2a65054ed32bfb26a2e17. We don't need this as we need to do it a different way, with a explicit set of registers as otherwise we trip other random FIR bits and everything becomes even more terrible. I suggest alcohol. Cc: stable Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2018-03-01core/hmi: report processor recovery reason from core FIR bits on P9Nicholas Piggin1-3/+59
When an error is encountered that causes processor recovery, HMI is generated if the recovery was successful. The reason is recorded in the core FIR, which gets copied into the WOF. In this case dump the WOF register and an error string into the OPAL msglog. A broken init setting led to HMIs reported in Linux as: [ 3.591547] Harmless Hypervisor Maintenance interrupt [Recovered] [ 3.591648] Error detail: Processor Recovery done [ 3.591714] HMER: 2040000000000000 This patch would have been useful because it tells us exactly that the problem is in the d-side ERAT: [ 414.489690798,7] HMI: Received HMI interrupt: HMER = 0x2040000000000000 [ 414.489693339,7] HMI: [Loc: UOPWR.0000000-Node0-Proc0]: P:0 C:1 T:1: Processor recovery occurred. [ 414.489699837,7] HMI: Core WOF = 0x0000000410000000 recovered error: [ 414.489701543,7] HMI: LSU - SRAM (DCACHE parity, etc) [ 414.489702341,7] HMI: LSU - ERAT multi hit In future it will be good to unify this reporting, so Linux could print something more useful. Until then, this gives some good data. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2018-02-28NPU2 HMIs: dump out a *LOT* of npu2 registers for debuggingStewart Smith1-1/+37
This is not the way we want to end up doing this. This is a hack to make folk happy and not require crondump to debug nvidia/npu2 issues. Cc: stable Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2018-02-08hw/npu2: Implement logging HMI actionsBalbir Singh1-1/+82
Log HMI errors as step 1. OS will need to deduce and interpret the HMI event. Signed-off-by: Balbir Singh <bsingharora@gmail.com> Acked-by: Alistair Popple <alistair@popple.id.au> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-12-13core/hmi: Display chip location code while displaying core FIR.Mahesh Salgaonkar1-1/+4
No functionality change. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-12-13core/hmi: Do not display FIR details if none of the bits are set.Mahesh Salgaonkar1-0/+3
So that we don't flood OPAL console logs with information that is not useful. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>