aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2021-09-30skiboot v6.0.24 release notesv6.0.24skiboot-6.0.xFrederic Barrat1-0/+22
Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
2021-09-30SBE: Account cancelled timer requestVasant Hegde1-0/+3
[ Upstream commit b44c7594523d20945179e497c45ec9007981ac75 ] Currently we are not accounting cancelled timer request. So in some corner cases we may schedule new timer request with new-timer-value > inflight-timer-value. Lets explicit check new_target value with inflight timer value. Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
2021-09-30SBE: Rate limit timer requestsVasant Hegde1-0/+22
[ Upstream commit 2e654443050acdd4deffdbb44723a847ca11e6b2 ] We schedule timer and wait for `timer expiry` interrupt from SBE. If we get new timer request which is lesser than inflight timer expiry value we can update timer (essentially sending new timer chip-op and SBE takes care of stoping inflight timer and scheduling new one). SBE runs at much slower speed than host CPU. If we do continuous timer update like below then SBE will be busy with handling PSU side timer message and will not get time to handle FIFO side requests. send timer chip-op -> Got ACK -> send timer chip-op Hence this patch limits number of continuous timer update and we will restart sending timer request as soon as we get timer expiry interrupt. Rate limit value (2) is suggested by SBE team. With this patch: If our timer requests are : 2ms, 1500us, 1000us and 800us (and requests are coming after sending each message) We will schedule timer for 2ms and then update timer for 1500us and 1000us (These update happens after getting ACK interrupt from SBE) We will not send 800us request. At 1000us we get `timer expiry` and we are good to send next timer requests (At this stage both 1000us and 800us timeout happens. We will schedule next timer request with timeout value 500us (1500-1000)). Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
2021-09-30SBE: Check timer state before scheduling timerVasant Hegde1-2/+4
[ Upstream commit 47ab3a92298e72e44b9477a02b1312a09272a54a ] Timer flow: - OPAL sends timer chip-op to SBE and waits for ACK - Until we get ACK interrupt from SBE we will not schedule any new timer - Once we get ACK either we wait for timer expiry -OR- schedule new one if new-timer-request < inflight-timer-timeout value. - If we get new timer request while processing current one p9_sbe_update_timer_expiry code sets `has_new_target` and we schedule it in ACK path (p9_sbe_timer_resp()). p9_sbe_timer_resp() is callback handler and its called without lock. It does not check whether timer message is busy or not (timer_ctrl_msg). So in theory we may hit below scenario and corrupt msg_list. CPU 1 -> Timer ACK (callback handler) -- its not holding any lock CPU 2 -> Grabbed sbe_timer_lock -> scheduled timer --> done CPU 3 -> p9_sbe_update_timer_expiry() -> see timer is busy -> sets has_new_timer -> done CPU 1 -> gets chance to grab sbe_timer_lock -> saw has_new_timer -> Called p9_sbe_timer_schedule() --> List corrupted ! This patch adds timer message busy check in p9_sbe_timer_resp(). Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
2021-09-30phb4: Disable TCE cache line bufferFrederic Barrat2-0/+2
[ Upstream commit 15b93a301509ba7813343540e25b47ba395674b9 ] This patch implements a circumvention for HW557787. It disables the TCE cache line buffer as, under heavy loads, there's a possibility of an entry being re-allocated incorrectly. Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> [FB]: adjust due to branch supporting DD1 Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
2020-04-07skiboot v6.0.23 release notesv6.0.23Vasant Hegde1-0/+17
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2020-03-27npu2: Clear fence on all bricksAlexey Kardashevskiy1-5/+12
[ Upstream commit 9be9a77a8352aee0bb74ac0d79f55e1238f76285 ] A bug in the NVidia driver can cause an UR HMI which fences bricks (links). At the moment we clear fence status only for bricks of a specific devices, however this does not appear to be enough and we need to clear fences for all bricks. This is ok as we do not allow using GPUs individually anyway. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Acked-by: Reza Arbab <arbab@linux.ibm.com> Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2020-03-27npu2: Clear fence state for a brick being resetAlexey Kardashevskiy1-0/+8
[ Upstream commit d496bb141c978a6dc8a106b3d92e5fc1ad0f8663 ] Resetting a GPU before resetting an NVLink leads to occasional HMIs which fence some bricks and prevent the "reset_ntl" procedure from succeeding at the "reset_ntl_release" step - the host system requires reboot; there may be other cases like this as well. This adds clearing of the fence bit in NPU.MISC.FENCE_STATE for the NVLink which we are about to reset. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2020-03-27skiboot v6.0.22 release notesv6.0.22Vasant Hegde1-0/+21
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2020-03-27Bump allowed stack frame size for unit tests/host programsStewart Smith1-1/+6
[ Upstream commit 11374f20e8f1c179571077292dd2b3011dea4ae2 ] We tend to have a lot more things inlined when building unit tests, so let's just up the -Wframe-larger-than to avoid hitting it. This time, it was noticed in travis-ci with the ubuntu:latest image. Signed-off-by: Stewart Smith <stewart@linux.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2020-03-26errorlog: Increase the severity of abnormal reboot eventsVasant Hegde1-1/+1
[ Upstream commit daf9215c85f910043c472984683948baaf18da39 ] Currently Linux will usually call opal_cec_reboot2() in response to unrecoverable HMIs and other serious hardware errors. OPAL handles platform errors by sending an error log to the BMC / FSP and triggering a software checkstop. Sending error logs to the BMC / FSP is normally an async operation, but in this path we need to ensure that error logs are sent out before the xstop is triggered. The easiest way to do that is to escalate the severity of the generated error log from "abnormal reboot" to "panic" since we force panic logs to be send synchronusly. It's also a more accurate description of what's happening. CC: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Reviewed-by: Klaus Heinrich Kiwi <klaus@linux.vnet.ibm.com> [oliver: commit message] Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2020-03-26eSEL: Make sure PANIC logs are sent to BMC before calling assertVasant Hegde1-2/+15
[ Upstream commit 033e797cb0d77b151ab6e47d8fb7666a09641107 ] eSEL logs are split into multiple smaller chunks and sent to BMC. We use ipmi_queue_msg_sync() interface for sending OPAL_ERROR_PANIC severity events to BMC. But callback handler (ipmi_cmd_done()) clears 'sync_msg' after getting response to first chunk as its not aware that we have more data to send. So in assert()/checkstop path we may endup checkstoping system before error log is sent to BMC completely. We will miss useful error log. This patch introduces new wait loop in ipmi_elog_commit(). It will wait until error log is sent to BMC. I think this is safe because even if something goes wrong (like BMC reset) we will hit timeout and eventually we will come out of this loop. Alternatively we can add additional check in ipmi_cmd_done() path. But I don't wanted to make this path aware of message type. Reviewed-by: Klaus Heinrich Kiwi <klaus@linux.vnet.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2020-03-26core/ipmi: Fix use-after-freeVasant Hegde1-3/+11
[ Upstream commit d75e82dbfbb9443efeb3f9a5921ac23605aab469 ] Commit f01cd77 introduced backend poller() for ipmi message. But in some corner cases its possible that we endup calling poller() after freeing ipmi message. Thread 1 : ipmi_queue_msg_sync() Waiting for ipmi sync message to complete Thread 2 : bt_poll() -> ipmi_cmd_done() -> callback handler -> free message Oliver hit this issue during fast-reboot test with skiboot DEBUG build. In debug build we poision the memory after free. That helped us to catch this issue. [ 460.295570781,3] *********************************************** [ 460.295773157,3] Fatal MCE at 0000000030035cb4 .ipmi_queue_msg_sync+0x110 MSR 9000000000201002 [ 460.295887496,3] CFAR : 0000000030035ce8 MSR : 9000000000000000 [ 460.295956419,3] SRR0 : 0000000030035cb4 SRR1 : 9000000000201002 [ 460.296035015,3] HSRR0: 0000000030012624 HSRR1: 9000000002803002 [ 460.296102413,3] DSISR: 00000008 DAR : 99999999999999d1 [ 460.296169710,3] LR : 0000000030035ce4 CTR : 0000000030002880 [ 460.296248482,3] CR : 28002422 XER : 20040000 [ 460.296336621,3] GPR00: 0000000030035ce4 GPR16: 00000000301d36d8 [ 460.296415449,3] GPR01: 0000000031c133d0 GPR17: 00000000300f5cd8 [ 460.296482811,3] GPR02: 0000000030142700 GPR18: 0000000030407ff0 [ 460.296550265,3] GPR03: 0000000000000100 GPR19: 0000000000000000 [ 460.296629041,3] GPR04: 0000000028002424 GPR20: 0000000000000000 [ 460.296696369,3] GPR05: 0000000020040000 GPR21: 0000000030121d73 [ 460.296820977,3] GPR06: c000001fffffd480 GPR22: 0000000030121dd2 [ 460.296888226,3] GPR07: c000001fffffd480 GPR23: 0000000030613400 [ 460.296978218,3] GPR08: 0000000000000001 GPR24: 0000000000000001 [ 460.297056871,3] GPR09: 9999999999999999 GPR25: 0000000031c13960 [ 460.297124647,3] GPR10: 0000000000000000 GPR26: 0000000000000004 [ 460.297203811,3] GPR11: 0000000000000000 GPR27: 0000000000000003 [ 460.297271250,3] GPR12: 0000000028002424 GPR28: 0000000030613400 [ 460.297339026,3] GPR13: 0000000031c10000 GPR29: 0000000030406b50 [ 460.297417605,3] GPR14: 00000000300f58f8 GPR30: 0000000030406b40 [ 460.297485176,3] GPR15: 00000000300f58d8 GPR31: 00000000309249c8 Reported-by: Oliver O'Halloran <oohall@gmail.com> Fixes: f01cd77 (ipmi: ensure forward progress on ipmi_queue_msg_sync()) Cc: skiboot-stable@lists.ozlabs.org # v6.3+ Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2020-03-26ipmi: ensure forward progress on ipmi_queue_msg_sync()Stewart Smith4-1/+28
[ Upstream commit f01cd777adb16cbab93215d26159aa1c4606112c ] BT responses are handled using a timer doing the polling. To hope to get an answer to an IPMI synchronous message, the timer needs to run. We can't just check all timers though as there may be a timer that wants a lock that's held by a code path calling ipmi_queue_msg_sync(), and if we did enforce that as a requirement, it's a pretty subtle API that is asking to be broken. So, if we just run a poll function to crank anything that the IPMI backend needs, then we should be fine. This issue shows up very quickly under QEMU when loading the first flash resource with the IPMI HIOMAP backend. Reported-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> Reviewed-by: Andrew Jeffery <andrew@aj.id.au> Reviewed-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2020-01-07skiboot v6.0.21 release notesv6.0.21Vasant Hegde1-0/+15
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2020-01-07hdata/vpd: fix printing (char*)0x00Stewart Smith1-4/+5
[ Upstream commit ba977f2e4406f9de318afcdf5d666e77585ef269 ] GCC9 now catches this bug: In file included from hdata/vpd.c:17: In function ‘vpd_vini_parse’, inlined from ‘vpd_data_parse’ at hdata/vpd.c:416:3: /home/stewart/skiboot/include/skiboot.h:93:31: error: ‘%s’ directive argument is null [-Werror=format-overflow=] 93 | #define prlog(l, f, ...) do { _prlog(l, pr_fmt(f), ##__VA_ARGS__); } while(0) | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ hdata/vpd.c:390:5: note: in expansion of macro ‘prlog’ 390 | prlog(PR_WARNING, | ^~~~~ hdata/vpd.c: In function ‘vpd_data_parse’: hdata/vpd.c:391:46: note: format string is defined here 391 | "VPD: CCIN desc not available for: %s\n", | ^~ cc1: all warnings being treated as errors Signed-off-by: Stewart Smith <stewart@linux.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2020-01-07libflash/ecc: Fix compilation warningVasant Hegde1-1/+1
[ Upstream commit b570fb646289b34750d14518c77036dc14c732e3 ] We are hitting below warning on gcc9. gcc -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mcpu=power8 -mtune=power8 -fasynchronous-unwind-tables -fstack-clash-protection -O2 -Wall -Werror -Wno-stringop-truncation -I. -c libflash/ecc.c -o libflash-ecc.o libflash/ecc.c: In function 'memcpy_to_ecc_unaligned': libflash/ecc.c:419:24: error: taking address of packed member of 'struct ecc64' may result in an unaligned pointer value [-Werror=address-of-packed-member] 419 | memcpy(inc_uint64_by(&ecc_word.data, alignment), src, bytes_wanted); | ^~~~~~~~~~~~~~ libflash/ecc.c:448:24: error: taking address of packed member of 'struct ecc64' may result in an unaligned pointer value [-Werror=address-of-packed-member] 448 | memcpy(inc_uint64_by(&ecc_word.data, len), inc_ecc64_by(dst, len), | ^~~~~~~~~~~~~~ cc1: all warnings being treated as errors Fixes: https://github.com/open-power/skiboot/issues/218 Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2020-01-07errorlog: Prevent alignment error building with gcc9.Michal Suchanek1-1/+1
[ Upstream commit 6080c106e797ea8375ac164e8f53de3308d42abb ] Fixes this build error: [ 52s] hw/fsp/fsp-elog-write.c: In function 'opal_elog_read': [ 52s] hw/fsp/fsp-elog-write.c:213:12: error: taking address of packed member of 'struct errorlog' may result in an unaligned pointer value [-Werror=address-of-packed-member] [ 52s] 213 | list_del(&log_data->link); [ 52s] | ^~~~~~~~~~~~~~~ Fixes: https://github.com/open-power/skiboot/issues/247 Signed-off-by: Michal Suchanek <msuchanek@suse.de> Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2020-01-07struct p9_sbe_msg doesn't need to be packedStewart Smith1-1/+1
[ Upstream commit ef691db3533742d9dd6eed1a311472a7c52be94b ] Only the reg member is sent anywhere (via xscom_write), so the structure does not need to be packed. Fixes GCC9 build problem: hw/sbe-p9.c: In function ‘p9_sbe_msg_send’: hw/sbe-p9.c:270:9: error: taking address of packed member of ‘struct p9_sbe_msg’ may result in an unaligned pointer value [-Werror=address-of-packed-member] 270 | data = &msg->reg[0]; | ^~~~~~~~~~~~ Signed-off-by: Stewart Smith <stewart@linux.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-12-10npu2/hw-procedures: Remove assertion from check_credits()Reza Arbab1-9/+6
[ Upstream commit 24664b48642845d620e225111bf6184f3c102f60 ] The RX clock mux in the NVLink PHY can glitch, which will manifest in hard to diagnose behavior--at best, a checkstop during the first link traffic. The only reliable way we found to detect this was by checking for a discrepancy in the credits we expect to receive during link training. Since the time the check was added, we've found that * Commit ac6f1599ff33 ("npu2: hw-procedures: Add phy_rx_clock_sel()") does work around the original glitch. * Asserting is too harsh. Before root cause was established, it was thought this could have been a manufacturing defect and we wanted to loudly fail hardware acceptance boot cycle tests. * It seems there is a valid situation in which credits are off from the expected value. During GPU hot reset, a CPU prefetch across the link can affect the credit count before we check. Given all of the above, remove the assert(). Cc: stable # 6.0.x Signed-off-by: Reza Arbab <arbab@linux.ibm.com> Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-05-09skiboot v6.0.20 release notesv6.0.20Vasant Hegde1-0/+204
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-05-09core/flash: Retry requests as necessary in flash_load_resource()Andrew Jeffery1-2/+31
[ Upstream commit cccf5d79de07844cf095b8f45146b92944d15c2e ] We would like to successfully boot if we have a dependency on the BMC for flash even if the BMC is not current ready to service flash requests. On the assumption that it will become ready, retry for several minutes to cover a BMC reboot cycle and *eventually* rather than *immediately* crash out with: [ 269.549748] reboot: Restarting system [ 390.297462587,5] OPAL: Reboot request... [ 390.297737995,5] RESET: Initiating fast reboot 1... [ 391.074707590,5] Clearing unused memory: [ 391.075198880,5] PCI: Clearing all devices... [ 391.075201618,7] Clearing region 201ffe000000-201fff800000 [ 391.086235699,5] PCI: Resetting PHBs and training links... [ 391.254089525,3] FFS: Error 17 reading flash header [ 391.254159668,3] FLASH: Can't open ffs handle: 17 [ 392.307245135,5] PCI: Probing slots... [ 392.363723191,5] PCI Summary: ... [ 393.423255262,5] OCC: All Chip Rdy after 0 ms [ 393.453092828,5] INIT: Starting kernel at 0x20000000, fdt at 0x30800a88 390645 bytes [ 393.453202605,0] FATAL: Kernel is zeros, can't execute! [ 393.453247064,0] Assert fail: core/init.c:593:0 [ 393.453289682,0] Aborting! CPU 0040 Backtrace: S: 0000000031e03ca0 R: 000000003001af60 ._abort+0x4c S: 0000000031e03d20 R: 000000003001afdc .assert_fail+0x34 S: 0000000031e03da0 R: 00000000300146d8 .load_and_boot_kernel+0xb30 S: 0000000031e03e70 R: 0000000030026cf0 .fast_reboot_entry+0x39c S: 0000000031e03f00 R: 0000000030002a4c fast_reset_entry+0x2c --- OPAL boot --- The OPAL flash API hooks directly into the blocklevel layer, so there's no delay for e.g. the host kernel, just for asynchronously loaded resources during boot. Signed-off-by: Andrew Jeffery <andrew@aj.id.au> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-05-09pci/iov: Remove skiboot VF trackingOliver O'Halloran4-305/+1
[ Upstream commit 22057f868f3b2b1fd02647a738f6da0858b5eb6c ] This feature was added a few years ago in response to a request to make the MaxPayloadSize (MPS) field of a Virtual Function match the MPS of the Physical Function that hosts it. The SR-IOV specification states the the MPS field of the VF is "ResvP". This indicates the VF will use whatever MPS is configured on the PF and that the field should be treated as a reserved field in the config space of the VF. In other words, a SR-IOV spec compliant VF should always return zero in the MPS field. Adding hacks in OPAL to make it non-zero is... misguided at best. Additionally, there is a bug in the way pci_device structures are handled by VFs that results in a crash on fast-reboot that occurs if VFs are enabled and then disabled prior to rebooting. This patch fixes the bug by removing the code entirely. This patch has no impact on SR-IOV support on the host operating system. Cc: Sergey Miroshnichenko <s.miroshnichenko@yadro.com> Cc: skiboot-stable@lists.ozlabs.org Tested-by: Santwana Samantray <santwana.samantray@in.ibm.com> Tested-by: Satheesh Rajendran <satheera@in.ibm.com> [oliver: added tested-bys] Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-05-09hw/xscom: Enable sw xstop by default on p9Oliver O'Halloran1-24/+2
[ Upstream commit af5a3ee925d11f4e4e5276ccd5c6ec20b2d2df9f ] This was disabled at some point during bringup to make life easier for the lab folks trying to debug NVLink issues. This hack really should have never made it out into the wild though, so we now have the following situation occuring in the field: 1) A bad happens 2) The host kernel recieves an unrecoverable HMI and calls into OPAL to request a platform reboot. 3) OPAL rejects the reboot attempt and returns to the kernel with OPAL_PARAMETER. 4) Kernel panics and attempts to kexec into a kdump kernel. A side effect of the HMI seems to be CPUs becoming stuck which results in the initialisation of the kdump kernel taking a extremely long time (6+ hours). It's also been observed that after performing a dump the kdump kernel then crashes itself because OPAL has ended up in a bad state as a side effect of the HMI. All up, it's not very good so re-enable the software checkstop by default. If people still want to turn it off they can using the nvram override. Cc: skiboot-stable@lists.ozlabs.org Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Acked-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-05-09opal/hmi: Initialize the hmi event with old value of TFMR.Mahesh Salgaonkar1-1/+3
[ Upstream commit 5f339b4b5d805d8d6bb50e11674dca01255402b4 ] Do this before we fix TFAC errors. Otherwise the event at host console shows no thread error reported in TFMR register. Without this patch the console event show TFMR with no thread error: (DEC parity error TFMR[59] injection) [ 53.737572] Severe Hypervisor Maintenance interrupt [Recovered] [ 53.737596] Error detail: Timer facility experienced an error [ 53.737611] HMER: 0840000000000000 [ 53.737621] TFMR: 3212000870e04000 After this patch it shows old TFMR value on host console: [ 2302.267271] Severe Hypervisor Maintenance interrupt [Recovered] [ 2302.267305] Error detail: Timer facility experienced an error [ 2302.267320] HMER: 0840000000000000 [ 2302.267330] TFMR: 3212000870e14010 Fixes: 674f7696f ("opal/hmi: Rework HMI handling of TFAC errors") Cc: skiboot-stable@lists.ozlabs.org Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-05-09test-ipmi-hiomap: Add read-one-byte testVasant Hegde1-0/+40
[ Upstream commit 857f046d3ab00ec12dcb06ddabfed6bdfe00a819 ] Add test case to read: - 1 byte - 1 block and 1 byte data Cc: Andrew Jeffery <andrew@aj.id.au> Cc: skiboot-stable@lists.ozlabs.org Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Reviewed-by: Andrew Jeffery <andrew@aj.id.au> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-05-09test-ipmi-hiomap: Fix lpc-read-successVasant Hegde1-1/+3
[ Upstream commit 9fd1495fba7a549c6cc1df239b3035750285ed61 ] Cc: Andrew Jeffery <andrew@aj.id.au> Cc: skiboot-stable@lists.ozlabs.org Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Reviewed-by: Andrew Jeffery <andrew@aj.id.au> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-05-09test-ipmi-hiomap: Add write-one-byte testVasant Hegde1-0/+38
[ Upstream commit 9facf33360549ce3ab82b4d1c4478332735cd16e ] Add test case to write: - 1 byte - 1 block and 1 byte data Cc: Andrew Jeffery <andrew@aj.id.au> Cc: skiboot-stable@lists.ozlabs.org Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Reviewed-by: Andrew Jeffery <andrew@aj.id.au> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-05-09test-ipmi-hiomap: Assert if size is zeroVasant Hegde1-1/+2
[ Upstream commit dfa935ebaa5438bfd2b4623743a54686fecce353 ] Cc: Andrew Jeffery <andrew@aj.id.au> Cc: skiboot-stable@lists.ozlabs.org Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Reviewed-by: Andrew Jeffery <andrew@aj.id.au> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-05-09libflash/ipmi-hiomap: Fix blocks count issueVasant Hegde1-3/+18
[ Upstream commit 7f291166283f0f02e75787023326279036feebe1 ] We convert data size to block count and pass block count to BMC. If data size is not block aligned then we endup sending block count less than actual data. BMC will write partial data to flash memory. Sample log : [ 594.388458416,7] HIOMAP: Marked flash dirty at 0x42010 for 8 [ 594.398756487,7] HIOMAP: Flushed writes [ 594.409596439,7] HIOMAP: Marked flash dirty at 0x42018 for 3970 [ 594.419897507,7] HIOMAP: Flushed writes In this case HIOMAP sent data with block count=0 and hence BMC didn't flush data to flash. Lets fix this issue by adjusting block count before sending it to BMC. Cc: Andrew Jeffery <andrew@aj.id.au> Cc: skiboot-stable@lists.ozlabs.org Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Reviewed-by: Andrew Jeffery <andrew@aj.id.au> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-05-09Fix hang in pnv_platform_error_reboot path due to TOD failure.Mahesh Salgaonkar1-0/+10
[ Upstream commit 2c71d7032484f4267ca425b5a58c5a6f06a2eaf3 ] On TOD failure, with TB stuck, when linux heads down to pnv_platform_error_reboot() path due to unrecoverable hmi event, the panic cpu gets stuck in OPAL inside ipmi_queue_msg_sync(). At this time, rest all other cpus are in smp_handle_nmi_ipi() waiting for panic cpu to proceed. But with panic cpu stuck inside OPAL, linux never recovers/reboot. p0 c1 t0 NIA : 0x000000003001dd3c <.time_wait+0x64> CFAR : 0x000000003001dce4 <.time_wait+0xc> MSR : 0x9000000002803002 LR : 0x000000003002ecf8 <.ipmi_queue_msg_sync+0xec> STACK: SP NIA 0x0000000031c236e0 0x0000000031c23760 (big-endian) 0x0000000031c23760 0x000000003002ecf8 <.ipmi_queue_msg_sync+0xec> 0x0000000031c237f0 0x00000000300aa5f8 <.hiomap_queue_msg_sync+0x7c> 0x0000000031c23880 0x00000000300aaadc <.hiomap_window_move+0x150> 0x0000000031c23950 0x00000000300ab1d8 <.ipmi_hiomap_write+0xcc> 0x0000000031c23a90 0x00000000300a7b18 <.blocklevel_raw_write+0xbc> 0x0000000031c23b30 0x00000000300a7c34 <.blocklevel_write+0xfc> 0x0000000031c23bf0 0x0000000030030be0 <.flash_nvram_write+0xd4> 0x0000000031c23c90 0x000000003002c128 <.opal_write_nvram+0xd0> 0x0000000031c23d20 0x00000000300051e4 <opal_entry+0x134> 0xc000001fea6e7870 0xc0000000000a9060 <opal_nvram_write+0x80> 0xc000001fea6e78c0 0xc000000000030b84 <nvram_write_os_partition+0x94> 0xc000001fea6e7960 0xc0000000000310b0 <nvram_pstore_write+0xb0> 0xc000001fea6e7990 0xc0000000004792d4 <pstore_dump+0x1d4> 0xc000001fea6e7ad0 0xc00000000018a570 <kmsg_dump+0x140> 0xc000001fea6e7b40 0xc000000000028e5c <panic_flush_kmsg_end+0x2c> 0xc000001fea6e7b60 0xc0000000000a7168 <pnv_platform_error_reboot+0x68> 0xc000001fea6e7bd0 0xc0000000000ac9b8 <hmi_event_handler+0x1d8> 0xc000001fea6e7c80 0xc00000000012d6c8 <process_one_work+0x1b8> 0xc000001fea6e7d20 0xc00000000012da28 <worker_thread+0x88> 0xc000001fea6e7db0 0xc0000000001366f4 <kthread+0x164> 0xc000001fea6e7e20 0xc00000000000b65c <ret_from_kernel_thread+0x5c> This is because, there is a while loop towards the end of ipmi_queue_msg_sync() which keeps looping until "sync_msg" does not match with "msg". It loops over time_wait_ms() until exit condition is met. In normal scenario time_wait_ms() calls run pollers so that ipmi backend gets a chance to check ipmi response and set sync_msg to NULL. while (sync_msg == msg) time_wait_ms(10); But in the event when TB is in failed state time_wait_ms()->time_wait_poll() returns immediately without calling pollers and hence we end up looping forever. This patch fixes this hang by calling opal_run_pollers() in TB failed state as well. Fixes: 1764f2452 ("opal: Fix hang in time_wait* calls on HMI for TB errors.") Cc: skiboot-stable@lists.ozlabs.org Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-05-09core/ipmi: Print correct netfn valueVasant Hegde1-1/+1
[ Upstream commit 08de14aaa937ad703182302470c3525bb328b6de ] Fixes: 7516e382 (core/ipmi: Improve error message) Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-05-09core/lock: don't set bust_locks on lock errorNicholas Piggin1-2/+0
[ Upstream commit 527286706ab1861774742402c4e69e1b3b6f6cde ] bust_locks is a big hammer that guarantees a mess if it's set while all other threads are not stopped. I propose removing this in the lock error paths. In debugging the previous deadlock false positive, none of the error messages printed, and the in-memory console was totally garbled due to lack of locking. I think it's generally better for debugging and system integrity to keep locks held when lock errors occur. Lock busting should be used carefully, just to allow messages to be printed out or machine to be restarted, probably when the whole system is single-threaded. Skiboot is slowly working toward that being feasible with co-operative debug APIs between firmware and host, but for the time being, difficult lock crashes are better not to corrupt everything by busting locks. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-03-18skiboot v6.0.19 release notesv6.0.19Vasant Hegde1-0/+37
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-03-18p9dsu: Undo slot label name changesDeb McLemore1-16/+16
[ Upstream commit c470806a2e5e25894aae772fc04069bbe35a21e4 ] During some code updates the slot labels were updated to reflect the phb layout, however expectations were that the slot labels be aligned with the riser card slots and not the system planar slots. [stewart: The tale of how we got here is long and varied and not at all clear. The first ESS systems went out with a skiboot v5.9.8 with additional SuperMicro patches. It was probably a slot table, but who knows, we don't have the code so can't check. It's possible it was all coming in through HDAT instead). The op-build tree (thus the exact patches) shipped on systems that work correct seems to not be around anywhere anymore (if it ever was). It was only in skiboot v6.0 that a slot table made it in, and, of course, only having remote machines in random configs, including possibly with riser cards from Briggs&Stratton rather than the ones destined for this system, doesn't make for verifying this at all. It also doesn't help that *consistently* there is *never* any review on slot tables, and we've had things be wrong in the past. Combine this with not upstream Hostboot patches.] Cc: skiboot-stable@lists.ozlabs.org Cc: Benjamin Mashak <mashak@us.ibm.com> Cc: Michael Lim <youhour@us.ibm.com> Fixes: 64a16ae05bb2 ("p9dsu: Fix slot labels for p9dsu2u") Fixes: 87517c8737b9 ("p9dsu: Fix p9dsu slot tables") Fixes: 31231ed300f2 ("p9dsu: Fix p9dsu default variant") Signed-off-by: Deb McLemore <debmc@linux.ibm.com> [stewart: added more detailed explanation, cc stable] Signed-off-by: Stewart Smith <stewart@linux.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-03-18p9dsu: Fix slot labels for p9dsu2uDeb McLemore1-5/+5
[ Upstream commit 64a16ae05bb28c075506efb10b569b833680068d ] Update the slot labels for the p9dsu2u tables. Signed-off-by: Deb McLemore <debmc@linux.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-03-05skiboot v6.0.18 release notesv6.0.18Vasant Hegde1-0/+184
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-03-05hw/bt: Do not disable ipmi message retry during OPAL bootVasant Hegde1-1/+1
[ Upstream commit c0ab7b45db3dc44daf001f61324bd1418091dede ] Currently OPAL doesn't know whether BMC is functioning or not. If BMC is down (like BMC reboot), then we keep on retry sending message to BMC. So in some corner cases we may hit hard lockup issue in kernel. Ideally we should avoid using synchronous path as much as possible. But for now commit 01f977c3 added option to disable message retry in synchronous. But this fix is not required during boot. Hence lets disable IPMI message retry during OPAL boot. Fixes: 01f977c3 (hw/bt: Add backend interface to disable ipmi message) Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-03-05core/ipmi: Add ipmi sync messages to top of the listVasant Hegde1-1/+1
[ Upstream commit 968c30905d7a61d777606a5f5c7949027564efd8 ] In ipmi_queue_msg_sync() path OPAL will wait until it gets response from BMC. If we do not get response ontime we may endup in kernel hardlockups. Hence lets add sync messages to top of the queue. This will reduces the chance of hardlockups. Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-03-05hw/bt: Introduce separate list for synchronous messagesVasant Hegde1-45/+63
[ Upstream commit 61978c2c54d082b6f29f492e898e59d6198cfb5f ] BT send logic always sends top of bt message list to BMC. Once BMC reads the message, it clears the interrupt and bt_idle() becomes true. bt_add_ipmi_msg_head() adds message to top of the list. If bt message list is not empty then: - if bt_idle() is true then we will endup sending message to BMC before getting response from BMC for inflight message. Looks like on some BMC implementation this results in message timeout. - else we endup starting message timer without actually sending message to BMC.. which is not correct. This patch introduces separate list to track synchronous messages. bt_add_ipmi_msg_head() will add messages to tail of this new list. We will always process this queue before processing normal queue. Finally this patch introduces new variable (inflight_bt_msg) to track inflight message. This will point to current inflight message. Suggested-by: Oliver O'Halloran <oohall@gmail.com> Suggested-by: Stewart Smith <stewart@linux.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2019-03-05powercap: occ: Fix the powercapping range allowed for userShilpasri G Bhat2-9/+30
[ Upstream commit a96739c6c1cdf02521cab703733cef8cb54debbf ] OCC provides two limits for minimum powercap. One being hard powercap minimum which is guaranteed by OCC and the other one is a soft powercap minimum which is lesser than hard-min and may or may not be asserted due to various power-thermal reasons. So to allow the users to access the entire powercap range, this patch exports soft powercap minimum as the "powercap-min" DT property. And it also adds a new DT property called "powercap-hard-min" to export the hard-min powercap limit. Fixes: c6aabe3f2eb5("powercap: occ: Add a generic powercap framework") Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com> Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-03-05test-ipmi-hiomap: Add erase-one-block-twice testAndrew Jeffery1-0/+61
[ Upstream commit a3777e58990f0b540ab0acbf89c999cde09e2f64 ] Cc: stable Signed-off-by: Andrew Jeffery <andrew@aj.id.au> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-03-05test-ipmi-hiomap: Add write-one-block-twice testAndrew Jeffery1-0/+65
[ Upstream commit ed35a7d04bde4d3811cfcd87ff316be04ac6bdcf ] Cc: stable Signed-off-by: Andrew Jeffery <andrew@aj.id.au> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-03-05test-ipmi-hiomap: Add read-one-block-twice testAndrew Jeffery1-0/+33
[ Upstream commit b94bb54f6569af09e2b866871acc192862670126 ] Cc: stable Signed-off-by: Andrew Jeffery <andrew@aj.id.au> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-03-05test-ipmi-hiomap: Add protocol-recovery-get-flash-info-failure testAndrew Jeffery1-0/+195
[ Upstream commit 9398b84fad0eaeb0dbfffb2f1f2c88a42ab5cc1e ] Cc: stable Signed-off-by: Andrew Jeffery <andrew@aj.id.au> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-03-05test-ipmi-hiomap: Add protocol-recovery-failure-get-info testAndrew Jeffery1-0/+191
[ Upstream commit ce81b936413139b7952f858d530a7a170e1becaf ] Cc: stable Signed-off-by: Andrew Jeffery <andrew@aj.id.au> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-03-05test-ipmi-hiomap: Add protocol-recovery-failure-ack testAndrew Jeffery1-0/+170
[ Upstream commit d0c798252521eb3c81f8f3883408d51ff9dc8ad0 ] Cc: stable Signed-off-by: Andrew Jeffery <andrew@aj.id.au> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-03-05test-ipmi-hiomap: Add erase-malformed testsAndrew Jeffery1-0/+92
[ Upstream commit 7735cc3546098c2970765d4d1c216895c651c8ad ] Cc: stable Signed-off-by: Andrew Jeffery <andrew@aj.id.au> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-03-05test-ipmi-hiomap: Add flush-malformed testsAndrew Jeffery1-0/+95
[ Upstream commit 985c7a26bcb105f39c57b97675ca05e952df4914 ] Cc: stable Signed-off-by: Andrew Jeffery <andrew@aj.id.au> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-03-05test-ipmi-hiomap: Add mark-dirty-malformed testsAndrew Jeffery1-0/+101
[ Upstream commit c7b8293d867ad2456d3946e06346a9994b13c94d ] Cc: stable Signed-off-by: Andrew Jeffery <andrew@aj.id.au> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>