Age | Commit message | Author | Files | Lines
2020-06-04 | skiboot v6.3.5 release notes [v6.3.5] [skiboot-6.3.x] | Vasant Hegde | 1 | -0/+17
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2020-06-04 | core/ipmi: Fix use-after-free | Vasant Hegde | 1 | -3/+11
[ Upstream commit d75e82dbfbb9443efeb3f9a5921ac23605aab469 ]

Commit f01cd77 introduced a backend poller() for ipmi messages. But in some corner cases it's possible that we end up calling poller() after freeing the ipmi message.

Thread 1 : ipmi_queue_msg_sync() Waiting for ipmi sync message to complete
Thread 2 : bt_poll() -> ipmi_cmd_done() -> callback handler -> free message

Oliver hit this issue during a fast-reboot test with a skiboot DEBUG build. In a debug build we poison the memory after free. That helped us to catch this issue.

[ 460.295570781,3] ***********************************************
[ 460.295773157,3] Fatal MCE at 0000000030035cb4 .ipmi_queue_msg_sync+0x110 MSR 9000000000201002
[ 460.295887496,3] CFAR : 0000000030035ce8 MSR : 9000000000000000
[ 460.295956419,3] SRR0 : 0000000030035cb4 SRR1 : 9000000000201002
[ 460.296035015,3] HSRR0: 0000000030012624 HSRR1: 9000000002803002
[ 460.296102413,3] DSISR: 00000008 DAR : 99999999999999d1
[ 460.296169710,3] LR : 0000000030035ce4 CTR : 0000000030002880
[ 460.296248482,3] CR : 28002422 XER : 20040000
[ 460.296336621,3] GPR00: 0000000030035ce4 GPR16: 00000000301d36d8
[ 460.296415449,3] GPR01: 0000000031c133d0 GPR17: 00000000300f5cd8
[ 460.296482811,3] GPR02: 0000000030142700 GPR18: 0000000030407ff0
[ 460.296550265,3] GPR03: 0000000000000100 GPR19: 0000000000000000
[ 460.296629041,3] GPR04: 0000000028002424 GPR20: 0000000000000000
[ 460.296696369,3] GPR05: 0000000020040000 GPR21: 0000000030121d73
[ 460.296820977,3] GPR06: c000001fffffd480 GPR22: 0000000030121dd2
[ 460.296888226,3] GPR07: c000001fffffd480 GPR23: 0000000030613400
[ 460.296978218,3] GPR08: 0000000000000001 GPR24: 0000000000000001
[ 460.297056871,3] GPR09: 9999999999999999 GPR25: 0000000031c13960
[ 460.297124647,3] GPR10: 0000000000000000 GPR26: 0000000000000004
[ 460.297203811,3] GPR11: 0000000000000000 GPR27: 0000000000000003
[ 460.297271250,3] GPR12: 0000000028002424 GPR28: 0000000030613400
[ 460.297339026,3] GPR13: 0000000031c10000 GPR29: 0000000030406b50
[ 460.297417605,3] GPR14: 00000000300f58f8 GPR30: 0000000030406b40
[ 460.297485176,3] GPR15: 00000000300f58d8 GPR31: 00000000309249c8

Reported-by: Oliver O'Halloran <oohall@gmail.com>
Fixes: f01cd77 (ipmi: ensure forward progress on ipmi_queue_msg_sync())
Cc: skiboot-stable@lists.ozlabs.org # v6.3+
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
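The hazard described in this commit (a waiter that keeps reading a message after the completion path has freed it) can be illustrated with a small single-threaded sketch. This is not the skiboot fix itself. Every name here (`struct msg`, `fake_poll`, `queue_and_wait`, `in_flight`) is hypothetical, and the "second thread" is simulated by a poll function that frees the message on its third call. The point is only that the waiter copies what it needs out of the message before the message can be freed, and afterwards consults separate state rather than the message.

```c
#include <assert.h>
#include <stdlib.h>

typedef void (*poll_fn)(void);

struct backend {
	poll_fn poll;
};

struct msg {
	const struct backend *backend;
};

/* Message the waiter is blocked on, or NULL once it has completed. */
static struct msg *in_flight;
static int polls_run;

/* Stand-in for the completion path: on the third poll the callback
 * frees the message, as the callback handler in the commit does. */
static void fake_poll(void)
{
	polls_run++;
	if (polls_run == 3) {
		free(in_flight);
		in_flight = NULL;
	}
}

static const struct backend fake_backend = { .poll = fake_poll };

/* Returns the number of poll iterations needed for completion. */
static int queue_and_wait(struct msg *m)
{
	/* Cache the poll pointer while m is still guaranteed valid. */
	poll_fn poll = m->backend->poll;
	int iterations = 0;

	in_flight = m;
	/* From here on m may be freed at any time: never dereference it. */
	while (in_flight != NULL) {
		poll();
		iterations++;
	}
	return iterations;
}
```

Reading `m->backend->poll` inside the loop instead would touch freed (and, in a debug build, poisoned) memory on the iteration after completion.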
2020-06-04 | uart: Drop console write data if BMC becomes unresponsive | Vasant Hegde | 1 | -26/+74
[ Upstream commit 6bf21350da32776aac8ba75bf48933854647bd7e ] If the BMC becomes unresponsive (e.g. during a BMC reboot) during a console write then we may get stuck in uart_wait_tx_room(). This causes the CPU to get stuck in OPAL, which results in kernel lockups and in some cases the host becomes unresponsive. This patch introduces a timeout option. If a UART operation doesn't complete within a predefined time then it drops the write data and returns. Note that this patch fixes both the OPAL internal console and the console write APIs. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> [Various fixes on top of Nick's proposal to have single timer - Vasant] Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
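The drop-on-timeout behaviour described above can be sketched as follows. This is a minimal model, not the skiboot UART driver: `struct fake_uart`, its simulated millisecond clock, and the functions `wait_tx_room` and `console_write` are all assumptions made for illustration. The real code uses a timer and hardware status registers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical UART model with a simulated clock. */
struct fake_uart {
	unsigned long now_ms;       /* simulated current time */
	unsigned long ready_at_ms;  /* time at which TX room appears */
	char last_byte;             /* last byte "transmitted" */
};

static bool tx_has_room(const struct fake_uart *u)
{
	return u->now_ms >= u->ready_at_ms;
}

/* Wait for TX room, but give up once timeout_ms has elapsed. */
static bool wait_tx_room(struct fake_uart *u, unsigned long timeout_ms)
{
	unsigned long deadline = u->now_ms + timeout_ms;

	while (!tx_has_room(u)) {
		if (u->now_ms >= deadline)
			return false;   /* the other end is not draining */
		u->now_ms++;            /* stand-in for a short delay */
	}
	return true;
}

/* Returns the number of bytes written; the remainder is dropped. */
static size_t console_write(struct fake_uart *u, const char *buf,
			    size_t len, unsigned long timeout_ms)
{
	size_t written = 0;

	while (written < len) {
		if (!wait_tx_room(u, timeout_ms))
			break;          /* drop the rest instead of spinning */
		u->last_byte = buf[written++];
	}
	return written;
}
```

The caller gets a short count back instead of being stuck forever, which is the property the commit is after.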
2019-10-03 | skiboot v6.3.4 release notes [v6.3.4] | Vasant Hegde | 1 | -0/+29
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-10-03 | hw/phb4: Prevent register accesses when in reset | Oliver O'Halloran | 2 | -0/+11
[ Upstream commit b310e8f79e6817e18bd0e3c606da50a00b425ef0 ] While the ETU is in reset we cannot access any of the PHB registers. If a PHB register is accessed via the XSCOM indirect interface then we'll cause an ETU reset error which may prevent the PHB from being re-initialised once the reset is lifted. Prevent register accesses while in reset by adding a flag that is set while the ETU reset bit is high and checking that flag in the XSCOM (ASB) backdoor register access path. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
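The flag-guarded access path described above might look roughly like the sketch below. The structure `struct fake_phb`, the `in_reset` field name, and the choice to report a refused access through a boolean return value are all assumptions for illustration; the commit message does not say what the real accessors return when the access is blocked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FAKE_PHB_NREGS 4

/* Hypothetical PHB model: a few registers plus a reset flag. */
struct fake_phb {
	bool in_reset;                   /* set while the reset bit is high */
	uint64_t regs[FAKE_PHB_NREGS];
};

/* Called wherever the reset bit is raised or lowered. */
static void phb_set_reset(struct fake_phb *p, bool asserted)
{
	p->in_reset = asserted;
}

/* Backdoor read: refuse to touch the hardware while in reset. */
static bool phb_backdoor_read(const struct fake_phb *p, unsigned int idx,
			      uint64_t *val)
{
	if (p->in_reset || idx >= FAKE_PHB_NREGS)
		return false;
	*val = p->regs[idx];
	return true;
}
```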
2019-10-03 | core/platform: Actually disable fast-reboot on P8 | Oliver O'Halloran | 1 | -3/+5
[ Upstream commit 923b5a5342a7a37bd376327e35c7fcb98138d41c ] There was an attempt. It was not successful. Fixes: 14f709b8eeda ("Disable fast-reset for POWER8") Cc: skiboot-stable@lists.ozlabs.org Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Reviewed-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-10-03 | xive: fix return value of opal_xive_allocate_irq() | Cédric Le Goater | 1 | -1/+1
[ Upstream commit e97391ae2bb5a146a5041453f9185326654264d9 ] When the maximum number of interrupts per chip is reached, xive_try_allocate_irq() returns an internal XIVE error: XIVE_ALLOC_NO_SPACE. But its value 0xffffffff is interpreted as a positive value by its caller opal_xive_allocate_irq() and not as an error. opal_xive_allocate_irq() returns this value to Linux which also considers 0xffffffff as a valid interrupt number and tries to get the interrupt characteristics using opal_xive_get_irq_info(). This OPAL call finally fails, leading to all sorts of errors on the host, which is not prepared for such a scenario. The code impacted is the IPI setup and both XIVE KVM devices. Fix by returning OPAL_RESOURCE from xive_try_allocate_irq() which is consistent with the other errors returned by this routine. This fixes the behavior in opal_xive_allocate_irq() and in Linux. A workaround could be introduced in Linux to consider 0xffffffff as an OPAL_RESOURCE value. This assumption is valid with the current XIVE IRQ number encoding. Fixes: 07946e68f47a ("xive: Add interrupt allocator") Reported-by: Greg Kurz <groug@kaod.org> Signed-off-by: Cédric Le Goater <clg@kaod.org> [oliver: Added fixes tag] Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
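Why 0xffffffff slips past a "negative means error" check can be shown in a few lines of C. The sketch below assumes the sentinel is held in a 32-bit unsigned value and the OPAL call returns a signed 64-bit value; the function names and the `ERR_RESOURCE` constant are hypothetical stand-ins, not the skiboot definitions.

```c
#include <assert.h>
#include <stdint.h>

#define ALLOC_NO_SPACE 0xffffffffu  /* internal sentinel, as in the text above */
#define ERR_RESOURCE   (-10)        /* hypothetical negative error code */

/* Buggy variant: the allocator reports exhaustion with the sentinel. */
static uint32_t try_allocate_buggy(int exhausted)
{
	return exhausted ? ALLOC_NO_SPACE : 42u;
}

static int64_t allocate_irq_buggy(int exhausted)
{
	/* Zero-extends: 0xffffffff becomes 4294967295, not -1. */
	int64_t irq = try_allocate_buggy(exhausted);

	if (irq < 0)        /* never true for the sentinel */
		return irq;
	return irq;         /* caller sees a "valid" interrupt number */
}

/* Fixed variant: the allocator returns a real negative error code. */
static int64_t try_allocate_fixed(int exhausted)
{
	return exhausted ? ERR_RESOURCE : 42;
}

static int64_t allocate_irq_fixed(int exhausted)
{
	return try_allocate_fixed(exhausted);
}
```

With the fixed variant the error survives the widening to 64 bits, so a caller that tests for a negative return value rejects it.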
2019-10-03 | hw/phb4: Use standard MIN/MAX macro definitions | Jordan Niethe | 1 | -6/+3
[ Upstream commit 41f6c806091627dff980d6d0d96f04f19517394d ] The max() macro definition incorrectly returns the minimum value. The max() macro is used to ensure that PERST has been asserted for 250ms and that we wait 100ms for the ETU logic in the CRESET_START PHB4 PCI slot state. However, by returning the minimum value there is no guarantee that either of these requirements is met. Correct macro definitions for MIN and MAX are already provided in skiboot.h. Remove the redundant/incorrect versions here and switch to using the standard ones. Fixes: 70edcbb4b39d ("hw/phb4: Skip FRESET PERST when coming from CRESET") Signed-off-by: Jordan Niethe <jniethe5@gmail.com> Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
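The effect of a max() that returns the minimum is easy to see with the two delays named above. The macros below are simplified stand-ins written for this sketch (the exact text of the removed macro and of the skiboot.h versions is not shown in the commit message), and the two helper functions are hypothetical.

```c
#include <assert.h>

/* Hypothetical reconstruction of the bug: named "max", picks the smaller. */
#define BROKEN_MAX(a, b) ((a) < (b) ? (a) : (b))

/* Simplified correct version, standing in for the one in skiboot.h. */
#define MAX(a, b)        ((a) > (b) ? (a) : (b))

/* Both delays must be honoured, so the wait is the larger of the two. */
static int wait_ms_broken(int perst_ms, int etu_ms)
{
	return BROKEN_MAX(perst_ms, etu_ms);
}

static int wait_ms_fixed(int perst_ms, int etu_ms)
{
	return MAX(perst_ms, etu_ms);
}
```

With 250ms for PERST and 100ms for the ETU, the broken version waits 100ms, which violates the PERST requirement; the fixed version waits 250ms, which satisfies both.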
2019-10-03 | doc/requirements.txt: pin docutils at 0.14 | Stewart Smith | 1 | -0/+2
[ Upstream commit 8995ad6165b4a1518f3efd5a48ca5a8cd0be6b4b ]

docutils is a dependency for sphinx. The recently released 0.15 version throws a syntax error like so:

+ cd doc
+ make html
sphinx-build -b html -d _build/doctrees . _build/html
Traceback (most recent call last):
  File "/usr/bin/sphinx-build", line 6, in <module>
    from sphinx.cmd.build import main
  File "/usr/lib64/python2.7/site-packages/sphinx/cmd/build.py", line 20, in <module>
    from docutils.utils import SystemMessage
  File "/usr/lib/python2.7/site-packages/docutils/utils/__init__.py", line 21, in <module>
    import docutils.io
  File "/usr/lib/python2.7/site-packages/docutils/io.py", line 348
    (self.destination.mode, mode)), file=self._stderr)
    ^
SyntaxError: invalid syntax
make: *** [Makefile:53: html] Error 1

Obviously, this isn't ideal, so let's pin our version to one that actually works.

Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-08-06 | skiboot v6.3.3 release notes [v6.3.3] | Vasant Hegde | 1 | -0/+73
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-08-06 | struct p9_sbe_msg doesn't need to be packed | Stewart Smith | 1 | -1/+1
[ Upstream commit ef691db3533742d9dd6eed1a311472a7c52be94b ]

Only the reg member is sent anywhere (via xscom_write), so the structure does not need to be packed. Fixes GCC9 build problem:

hw/sbe-p9.c: In function ‘p9_sbe_msg_send’:
hw/sbe-p9.c:270:9: error: taking address of packed member of ‘struct p9_sbe_msg’ may result in an unaligned pointer value [-Werror=address-of-packed-member]
  270 |  data = &msg->reg[0];
      |         ^~~~~~~~~~~~

Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
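The warning exists because a packed structure has an alignment of 1, so a pointer to one of its wider members may be misaligned. The sketch below assumes GCC or Clang (`__attribute__((packed))` and C11 `_Alignof`). The two structs are hypothetical and only loosely modelled on the `reg` array mentioned above. It shows the two usual ways out: drop the attribute when the layout does not need it, which is what this commit does, or copy the data out with memcpy() instead of taking a member's address.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical message layouts, differing only in the packed attribute. */
struct packed_msg {
	uint64_t reg[4];
} __attribute__((packed));

struct plain_msg {
	uint64_t reg[4];
};

/* Takes an ordinary, naturally aligned uint64_t pointer. */
static uint64_t sum_regs(const uint64_t *regs, size_t n)
{
	uint64_t sum = 0;

	for (size_t i = 0; i < n; i++)
		sum += regs[i];
	return sum;
}

/* For the packed variant, copy into an aligned buffer instead of
 * passing the address of a packed member. reg is the only member,
 * so it starts at offset 0 of the structure. */
static uint64_t sum_packed(const struct packed_msg *m)
{
	uint64_t copy[4];

	memcpy(copy, m, sizeof(copy));
	return sum_regs(copy, 4);
}
```

With the plain structure, `sum_regs(msg.reg, 4)` can be called directly and no warning is raised, because the member is guaranteed to be aligned.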
2019-08-06 | hdata/vpd: fix printing (char*)0x00 | Stewart Smith | 1 | -4/+5
[ Upstream commit ba977f2e4406f9de318afcdf5d666e77585ef269 ]

GCC9 now catches this bug:

In file included from hdata/vpd.c:17:
In function ‘vpd_vini_parse’,
    inlined from ‘vpd_data_parse’ at hdata/vpd.c:416:3:
/home/stewart/skiboot/include/skiboot.h:93:31: error: ‘%s’ directive argument is null [-Werror=format-overflow=]
   93 | #define prlog(l, f, ...) do { _prlog(l, pr_fmt(f), ##__VA_ARGS__); } while(0)
      |                               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
hdata/vpd.c:390:5: note: in expansion of macro ‘prlog’
  390 |     prlog(PR_WARNING,
      |     ^~~~~
hdata/vpd.c: In function ‘vpd_data_parse’:
hdata/vpd.c:391:46: note: format string is defined here
  391 |           "VPD: CCIN desc not available for: %s\n",
      |                                              ^~
cc1: all warnings being treated as errors

Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
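Passing a null pointer to a `%s` conversion is undefined behaviour in C, which is what the compiler is flagging here. One common way to avoid it, shown below as a sketch rather than as the change this commit made, is to substitute a placeholder string before formatting. The helper names and the "(unknown)" placeholder are assumptions; only the message text is taken from the compiler output above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns s, or fallback when s is NULL, so "%s" never sees a null. */
static const char *str_or(const char *s, const char *fallback)
{
	return s ? s : fallback;
}

/* Hypothetical formatter for the warning message quoted above. */
static int format_missing_desc(char *out, size_t len, const char *ccin)
{
	return snprintf(out, len, "VPD: CCIN desc not available for: %s",
			str_or(ccin, "(unknown)"));
}
```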
2019-08-06 | errorlog: Prevent alignment error building with gcc9. | Michal Suchanek | 1 | -1/+1
[ Upstream commit 6080c106e797ea8375ac164e8f53de3308d42abb ]

Fixes this build error:

[ 52s] hw/fsp/fsp-elog-write.c: In function 'opal_elog_read':
[ 52s] hw/fsp/fsp-elog-write.c:213:12: error: taking address of packed member of 'struct errorlog' may result in an unaligned pointer value [-Werror=address-of-packed-member]
[ 52s]   213 |   list_del(&log_data->link);
[ 52s]       |            ^~~~~~~~~~~~~~~

Fixes: https://github.com/open-power/skiboot/issues/247
Signed-off-by: Michal Suchanek <msuchanek@suse.de>
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-08-06 | Support BMC IPMI heartbeat command | Andrew Geissler | 1 | -0/+14
[ Upstream commit 2554cac82da530acfcb1a575c571e760de92dde4 ]

A few years ago, the OpenBMC code added support for a "heartbeat" command to send to the host. This command is used after the BMC is reset to check if the host is running. Support was never added to the host side, however, so currently when the BMC sends this command, this appears in the host console:

IPMI: unknown OEM SEL command ff received

There is no response needed by the host (other than the low level acknowledge of the command, which already occurs). This commit handles the command so the error is no longer printed (it does nothing with the command though, since no action is needed).

Here's the tested output of this patch in the host console (with debug enabled):

IPMI: BMC issued heartbeat command: 00

Signed-off-by: Andrew Geissler <geissonator@yahoo.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-08-06 | Add: add mihawk platform file | joy_chu | 2 | -1/+268
[ Upstream commit 9570730f5f8506126cc56efb591cfe370b8c63e2 ] Signed-off-by: joy_chu <joy_chu@wistron.com> Acked-by: Stewart Smith <stewart@linux.ibm.com> [oliver: use SPDX for license; removed whitespace error] Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-07-01 | skiboot v6.3.2 release notes [v6.3.2] | Vasant Hegde | 1 | -0/+219
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-07-01 | npu2: Purge cache when resetting a GPU | Reza Arbab | 1 | -0/+6
[ Upstream commit d4f2f77377dab27eaf792aa089090bcdd953a5cc ] After putting all of a GPU's links in reset, do a cache purge in case we have CPU cache lines belonging to the now-inaccessible GPU memory. Fixes: 68d11e4460ec ("npu2: Reset NVLinks when resetting a GPU") Cc: skiboot-stable@lists.ozlabs.org Signed-off-by: Reza Arbab <arbab@linux.ibm.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-07-01 | npu2: Reset NVLinks when resetting a GPU | Alexey Kardashevskiy | 1 | -0/+55
[ Upstream commit 68d11e4460ecaaa7f6253f836d787a1582266074 ] Resetting a V100 GPU brings its NVLinks down and if an NPU tries using those, an HMI occurs. We were lucky not to observe this as bare metal does not normally reset a GPU and, when passed through, GPUs usually come before NPUs in the QEMU command line or Libvirt XML, and because of that NPUs are naturally reset first. However, a simple change of the device order brings HMIs. This defines a bus control filter for a PCI slot with a GPU with NVLinks so when the host system issues a secondary bus reset to the slot, it resets the associated NVLinks. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-06-30 | hw/phb4: Assert Link Disable bit after ETU init | Oliver O'Halloran | 1 | -0/+6
[ Upstream commit 02a683bf09d94757a72eec00e602c5609aa8d754 ] The cursed RAID card in ozrom1 has a bug where it ignores PERST being asserted. The PCIe Base spec is a little vague about what happens while PERST is asserted, but it does clearly specify that when PERST is de-asserted the Link Training and Status State Machine (LTSSM) of a device should return to the initial state (Detect) defined in the spec and the link training process should restart. This bug was worked around in 9078f8268922 ("phb4: Delay training till after PERST is deasserted") by setting the link disable bit at the start of the FRESET process and clearing it after PERST was de-asserted. Although this fixed the bug, the patch offered no explanation of why the fix worked. In b8b4c79d4419 ("hw/phb4: Factor out PERST control") the link disable workaround was moved into phb4_assert_perst(). This is always called in the CRESET case, but a following patch resulted in assert_perst() not being called if phb4_freset() was entered following a CRESET since p->skip_perst was set in the CRESET handler. This is bad since a side-effect of the CRESET is that the Link Disable bit is cleared. This, combined with the RAID card ignoring PERST, results in the PCIe link being trained by the PHB while we're waiting out the 100ms ETU reset time.
If we hack skiboot to print a DLP trace after returning from phb4_init_hw() we get:

PHB#0001[0:1]: Initialization complete
PHB#0001[0:1]: TRACE:0x0000102101000000 0ms presence GEN1:x16:polling
PHB#0001[0:1]: TRACE:0x0000001101000000 23ms GEN1:x16:detect
PHB#0001[0:1]: TRACE:0x0000102101000000 23ms presence GEN1:x16:polling
PHB#0001[0:1]: TRACE:0x0000183101000000 29ms training GEN1:x16:config
PHB#0001[0:1]: TRACE:0x00001c5881000000 30ms training GEN1:x08:recovery
PHB#0001[0:1]: TRACE:0x00001c5883000000 30ms training GEN3:x08:recovery
PHB#0001[0:1]: TRACE:0x0000144883000000 33ms presence GEN3:x08:L0
PHB#0001[0:1]: TRACE:0x0000154883000000 33ms trained GEN3:x08:L0
PHB#0001[0:1]: CRESET: wait_time = 100
PHB#0001[0:1]: FRESET: Starts
PHB#0001[0:1]: FRESET: Prepare for link down
PHB#0001[0:1]: FRESET: Assert skipped
PHB#0001[0:1]: FRESET: Deassert
PHB#0001[0:1]: TRACE:0x0000154883000000 0ms trained GEN3:x08:L0
PHB#0001[0:1]: TRACE: Reached target state
PHB#0001[0:1]: LINK: Start polling
PHB#0001[0:1]: LINK: Electrical link detected
PHB#0001[0:1]: LINK: Link is up
PHB#0001[0:1]: LINK: Went down waiting for stabilty
PHB#0001[0:1]: LINK: DLP train control: 0x0000105101000000
PHB#0001[0:1]: CRESET: Starts

What has happened here is that the link is trained to 8x Gen3 33ms after we return from phb4_init_hw(), before we've waited the 100ms that we normally wait after re-initialising the ETU. When we "deassert" PERST later on in the FRESET handler the link is in the L0 (normal) state. At this point we try to read from the Vendor/Device ID register to verify that the link is stable and immediately get a PHB fence due to a PCIe Completion Timeout. Skiboot attempts to recover by doing another CRESET, but this will encounter the same issue.

This patch fixes the problem by setting the Link Disable bit (by calling phb4_assert_perst()) immediately after we return from phb4_init_hw().
This prevents the link from being trained while PERST is asserted which seems to avoid the Completion Timeout. With the patch applied we get:

PHB#0001[0:1]: Initialization complete
PHB#0001[0:1]: TRACE:0x0000102101000000 0ms presence GEN1:x16:polling
PHB#0001[0:1]: TRACE:0x0000001101000000 23ms GEN1:x16:detect
PHB#0001[0:1]: TRACE:0x0000102101000000 23ms presence GEN1:x16:polling
PHB#0001[0:1]: TRACE:0x0000909101000000 29ms presence GEN1:x16:disabled
PHB#0001[0:1]: CRESET: wait_time = 100
PHB#0001[0:1]: FRESET: Starts
PHB#0001[0:1]: FRESET: Prepare for link down
PHB#0001[0:1]: FRESET: Assert skipped
PHB#0001[0:1]: FRESET: Deassert
PHB#0001[0:1]: TRACE:0x0000001101000000 0ms GEN1:x16:detect
PHB#0001[0:1]: TRACE:0x0000102101000000 0ms presence GEN1:x16:polling
PHB#0001[0:1]: TRACE:0x0000001101000000 24ms GEN1:x16:detect
PHB#0001[0:1]: TRACE:0x0000102101000000 36ms presence GEN1:x16:polling
PHB#0001[0:1]: TRACE:0x0000183101000000 97ms training GEN1:x16:config
PHB#0001[0:1]: TRACE:0x00001c5881000000 97ms training GEN1:x08:recovery
PHB#0001[0:1]: TRACE:0x00001c5883000000 97ms training GEN3:x08:recovery
PHB#0001[0:1]: TRACE:0x0000144883000000 99ms presence GEN3:x08:L0
PHB#0001[0:1]: TRACE: Reached target state
PHB#0001[0:1]: LINK: Start polling
PHB#0001[0:1]: LINK: Electrical link detected
PHB#0001[0:1]: LINK: Link is up
PHB#0001[0:1]: LINK: Link is stable
PHB#0001[0:1]: LINK: Card [9005:028c] Optimal Retry:disabled
PHB#0001[0:1]: LINK: Speed Train:GEN3 PHB:GEN4 DEV:GEN3
PHB#0001[0:1]: LINK: Width Train:x08 PHB:x08 DEV:x08
PHB#0001[0:1]: LINK: RX Errors Now:0 Max:8 Lane:0x0000

Cc: Michael Neuling <mikey@neuling.org>
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-06-08 | npu2: Reset PID wildcard and refcounter when mapped to LPID | Alexey Kardashevskiy | 1 | -0/+7
[ Upstream commit 7c977c734e1c4d3be9a036a075798530d352d8e3 ] Since 105d80f85b "npu2: Use unfiltered mode in XTS tables" we do not register every PID in the XTS table so the table has one entry per LPID. Then we added a reference counter to keep track of the entry use when switching a GPU between the host and guest systems (the "Fixes:" tag below). The POWERNV platform setup creates such entries and references them at boot time when initializing IOMMUs and only removes them when a GPU is passed through to a guest. This creates a problem as POWERNV boots via kexec and no dereferencing happens; the XTS table state remains undefined. So when the host kernel boots, skiboot thinks there are valid XTS entries and does not update the XTS table which breaks ATS. This adds the reference counter and the XTS entry reset when a GPU is assigned to LPID and we cannot rely on the kernel to clean that up. Fixes: ba1d95a1d460 ("npu2: Add XTS_BDF_MAP wildcard refcount") Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Tested-by: Reza Arbab <arbab@linux.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-06-08 | hw/phb4: Use read/write_reg in assert_perst | Oliver O'Halloran | 1 | -2/+2
[ Upstream commit 771497098efded8d3a2c0688bab1c1d48d093443 ] While the PHB is fenced we can't use the MMIO interface to access PHB registers. While processing a complete reset we inject a PHB fence to isolate the PHB from the rest of the system because the PHB won't respond to MMIOs from the rest of the system while being reset. We assert PERST after the fence has been erected which requires us to use the XSCOM indirect interface to access the PHB registers rather than the MMIO interface. Previously we did that when asserting PERST in the CRESET path. However, in b8b4c79d4419 ("hw/phb4: Factor out PERST control") this was re-written to use the raw in_be64() accessor. This means that CRESET would not be asserted in the reset path. On some Mellanox cards this would prevent them from re-loading their firmware when the system was fast-reset. This patch fixes the problem by replacing the raw {in|out}_be64() accessors with the phb4_{read|write}_reg() functions. Reported-by: Carol L Soto <clsoto@us.ibm.com> Fixes: b8b4c79d4419 ("hw/phb4: Factor out PERST control") Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Tested-by: Carol L Soto <clsoto@us.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-06-08 | opal-prd: Fix prd message size issue | Vasant Hegde | 1 | -4/+23
[ Upstream commit 9cae036fafea468219892406a846639f2715854d ]

If the prd message buffer is too small then the read_prd_msg() call fails with the error below, and the caller does not reallocate a sufficiently large buffer. It is also hard to guess the size.

sample log:
-----------
Mar 28 03:31:43 zz24p1 opal-prd: FW: error reading from firmware: alloc 32 rc -1: Invalid argument
Mar 28 03:31:43 zz24p1 opal-prd: FW: error reading from firmware: alloc 32 rc -1: Invalid argument
Mar 28 03:31:43 zz24p1 opal-prd: FW: error reading from firmware: alloc 32 rc -1: Invalid argument
....

Let's use the `opal-msg-size` device tree property to allocate memory for the prd message.

Cc: Skiboot Stable <skiboot-stable@lists.ozlabs.org>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-06-08 | npu2: Fix clearing the FIR bits | Alexey Kardashevskiy | 1 | -1/+1
[ Upstream commit 5ed3884f8d04db684e2288a7db011c7b59f1501e ] FIR registers are SCOM-only so they cannot be accessed with the indirect write, and yet we use SCOM-based addresses for these; fix this. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-By: Alistair Popple <alistair@popple.id.au> Reviewed-by: Andrew Donnellan <ajd@linux.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-06-08 | opal-gard: Account for ECC size when clearing partition | Oliver O'Halloran | 1 | -1/+2
[ Upstream commit 27d1ef2ebabb07a1958b94049cac3d90a101c3d5 ] When 'opal-gard clear all' is run, it works by erasing the GUARD partition then using blocklevel_smart_write() to write nothing to the partition. This second write call is needed because we rely on libflash to set the ECC bits appropriately when the partition contained ECCed data. The API for this is a little odd, with the caller specifying how much actual data to write, and libflash writing size + size/8 bytes since there is one additional ECC byte for every eight bytes of data. We currently do not account for the extra space consumed by the ECC data in reset_partition(), which is used to handle the 'clear all' command. This results in the partition following the GUARD partition being partially overwritten when the command is used. This patch fixes the problem by reducing the length we would normally write by the number of ECC bytes required. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Tested-by: Joel Stanley <joel@jms.id.au> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
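The size arithmetic in this commit message (one ECC byte per eight data bytes, so `size` bytes of data occupy `size + size/8` bytes of flash) can be worked through with two small helpers. Both function names are hypothetical and this is not the libflash API. The sketch only shows how much data fits in a partition once the ECC overhead is counted, and how far a naive full-length write overruns.

```c
#include <assert.h>
#include <stdint.h>

/* Flash bytes consumed when writing data_len bytes of ECC-protected
 * data: one extra ECC byte for every eight data bytes. */
static uint64_t ecc_flash_bytes(uint64_t data_len)
{
	return data_len + data_len / 8;
}

/* Largest data length (in whole 8-byte words) whose ECC-expanded
 * size still fits in a partition of part_len bytes. Each 8 data
 * bytes cost 9 bytes of flash. */
static uint64_t ecc_max_data_len(uint64_t part_len)
{
	return (part_len / 9) * 8;
}
```

For a 20480-byte partition, writing 20480 data bytes would consume 23040 bytes of flash and spill 2560 bytes into the next partition, whereas the largest safe data length is 18200 bytes, which expands to 20475 bytes on flash.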
2019-06-08 | nvram: Flag dangerous NVRAM options | Michael Neuling | 15 | -34/+78
[ Upstream commit 5beda3c6fe5b72aac95b4c13746ae598dfd64c01 ] Most nvram options used by skiboot are just for debug or testing for regressions. They should never be used long term. We've hit a number of issues in testing and the field where nvram options have been set "temporarily" but haven't been properly cleared after, resulting in crashes or real bugs being masked. This patch marks most nvram options used by skiboot as dangerous and prints a chicken to remind users of the problem. Signed-off-by: Michael Neuling <mikey@neuling.org> Reviewed-by: Samuel Mendoza-Jonas <sam@mendozajonas.com> Acked-By: Alistair Popple <alistair@popple.id.au> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-06-08 | devicetree: Don't set path to dtc in makefile | Joel Stanley | 1 | -1/+1
[ Upstream commit c8b5e8a95caf029ffe73ea18769fdd7f2da48ab4 ] By setting the path we fail to build under buildroot which has its own set of host tools in PATH, but not at /usr/bin. Keep the variable so it can be set if need be but default to whatever 'dtc' is in the user's path. Signed-off-by: Joel Stanley <joel@jms.id.au> Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-05-10 | skiboot v6.3.1 release notes [v6.3.1] | Vasant Hegde | 1 | -0/+60
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-05-10 | doc/bmc: Document SBE validation on P8 platforms | Samuel Mendoza-Jonas | 1 | -0/+27
[ Upstream commit 5e8a373ebe4dea501245e1103de9ca3abc7ab976 ] Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com> Reviewed-by: Andrew Jeffery <andrew@aj.id.au> Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-05-10 | platforms/astbmc: Check for SBE validation step | Samuel Mendoza-Jonas | 10 | -3/+196
[ Upstream commit 1bc63b896405ccea4584d764a28d01858e81efc3 ] On some POWER8 astbmc systems an update to the SBE requires pausing at runtime to ensure integrity of the SBE. If this is required the BMC will set a chassis boot option IPMI flag using the OEM parameter 0x62. If Skiboot sees this flag is set it waits until the SBE update is complete and the flag is cleared. Unfortunately the mystery operation that validates the SBE also leaves it in a bad state and unable to be used for timer operations. To work around this the flag is checked as soon as possible (i.e. when IPMI and the console are set up), and once complete the system is rebooted. Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com> Reviewed-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Reviewed-by: Andrew Jeffery <andrew@aj.id.au> Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-05-10 | include/ipmi: Fix incorrect chassis commands | Samuel Mendoza-Jonas | 1 | -7/+7
[ Upstream commit bc2b1de3beb2ee7904d936b10c8a57cd220d8ddc ] These commands are listed in the order they appear in the IPMI specification but with the wrong values - correct them! Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com> Reviewed-by: Andrew Jeffery <andrew@aj.id.au> Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-05-10 | ipmi: ensure forward progress on ipmi_queue_msg_sync() | Stewart Smith | 4 | -1/+28
[ Upstream commit f01cd777adb16cbab93215d26159aa1c4606112c ] BT responses are handled using a timer doing the polling. To hope to get an answer to an IPMI synchronous message, the timer needs to run. We can't just check all timers though as there may be a timer that wants a lock that's held by a code path calling ipmi_queue_msg_sync(), and if we did enforce that as a requirement, it's a pretty subtle API that is asking to be broken. So, if we just run a poll function to crank anything that the IPMI backend needs, then we should be fine. This issue shows up very quickly under QEMU when loading the first flash resource with the IPMI HIOMAP backend. Reported-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> Reviewed-by: Andrew Jeffery <andrew@aj.id.au> Reviewed-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-05-10 | pci/iov: Remove skiboot VF tracking | Oliver O'Halloran | 4 | -305/+1
[ Upstream commit 22057f868f3b2b1fd02647a738f6da0858b5eb6c ] This feature was added a few years ago in response to a request to make the MaxPayloadSize (MPS) field of a Virtual Function match the MPS of the Physical Function that hosts it. The SR-IOV specification states that the MPS field of the VF is "ResvP". This indicates the VF will use whatever MPS is configured on the PF and that the field should be treated as a reserved field in the config space of the VF. In other words, a SR-IOV spec compliant VF should always return zero in the MPS field. Adding hacks in OPAL to make it non-zero is... misguided at best. Additionally, there is a bug in the way pci_device structures are handled by VFs that results in a crash on fast-reboot that occurs if VFs are enabled and then disabled prior to rebooting. This patch fixes the bug by removing the code entirely. This patch has no impact on SR-IOV support on the host operating system. Cc: Sergey Miroshnichenko <s.miroshnichenko@yadro.com> Cc: skiboot-stable@lists.ozlabs.org Tested-by: Santwana Samantray <santwana.samantray@in.ibm.com> Tested-by: Satheesh Rajendran <satheera@in.ibm.com> [oliver: added tested-bys] Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
2019-05-03 | skiboot v6.3 release notes [v6.3] | Stewart Smith | 1 | -0/+1275
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2019-05-03 | Disable fast-reset for POWER8 | Stewart Smith | 1 | -2/+9
There is a bug with fast-reset when CPU cores are busy, which can be reproduced by running `stress` and then trying `reboot -ff` (this is what the op-test test cases FastRebootHostStress and FastRebootHostStressTorture do). What happens is the cores lock up, which isn't the best thing in the world when you want them to start executing instructions again. A workaround is to use instruction ramming, which while greatly increasing the reliability of fast-reset on p8, doesn't make it perfect. Instruction ramming is what pdbg was modified to do in order to have the sreset functionality work reliably on p8. pdbg patches: https://patchwork.ozlabs.org/project/pdbg/list/?series=96593&state=* Fixes: https://github.com/open-power/skiboot/issues/185 Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2019-05-03 | pci: Try harder to add meaningful ibm,loc-code | Stewart Smith | 1 | -0/+15
We keep the existing logic of looking to the parent for the slot-label or slot-location-code, but if all that fails we now look directly for the slot-location-code (as this should give us the correct loc code for things directly under the PHB), and otherwise we just look for a loc-code.

The applicable bit of PAPR here is:

R1–12.1–1. Each instance of a hardware entity (FRU) has a platform unique location code and any node in the OF device tree that describes a part of a hardware entity must include the “ibm,loc-code” property with a value that represents the location code for that hardware entity.

which we weren't really fully obeying at any recent (ever?) point in time. Now we should do okay, at least for PCI.

Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2019-05-02 | skiboot v6.3-rc3 release notes [v6.3-rc3] | Stewart Smith | 1 | -0/+228
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2019-05-02 | Mark all partitions except full PNOR and boot kernel firmware read only | Timothy Pearson | 1 | -0/+7
FFS partitions don't always align on erase blocks. Mark any partitions not known to align on erase blocks as read only to prevent silent corruption of adjacent partitions during erase / write from the host. Signed-off-by: Timothy Pearson <tpearson@raptorengineering.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2019-05-02Expose PNOR Flash partitions to host MTD driver via devicetreeTimothy Pearson2-12/+65
This makes it possible for the host to directly address each partition without requiring each application to directly parse the FFS headers. This has been in use for some time already to allow BOOTKERNFW partition updates from the host. Signed-off-by: Timothy Pearson <tpearson@raptorengineering.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2019-05-02Write boot progress to LPC ports 81 and 82Stewart Smith2-2/+102
There's a thought to write more extensive boot progress codes to LPC ports 81 and 82 to supplement/replace any reliance on port 80. We want to still emit port 80 for platforms like Zaius and Barreleye that have the physical display. Ports 81 and 82 can be monitored by a BMC though. Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2019-05-02Write boot progress to LPC port 80hStewart Smith6-3/+195
This is an adaptation of what we currently do for op_display() on FSP machines, inventing an encoding for what we can write into the single byte at LPC port 80h. Port 80h is often used on x86 systems to indicate boot progress/status and dates back a decent amount of time. Since a byte isn't exactly very expressive for everything that can go on (and wrong) during boot, it's all about compromise. Some systems (such as Zaius/Barreleye G2) have a physical dual 7 segment display that shows these codes. So far, this has only been driven by hostboot (see hostboot commit 90ec2e65314c). Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2019-05-02Remove Talos DT match from Romulus fileTimothy Pearson1-2/+1
Signed-off-by: Timothy Pearson <tpearson@raptorengineering.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2019-05-02Copy and convert Romulus descriptors to TalosTimothy Pearson2-1/+88
Talos II has some hardware differences from Romulus, therefore we cannot guarantee Talos II == Romulus in skiboot. Copy and slightly modify the Romulus files for Talos II. Signed-off-by: Timothy Pearson <tpearson@raptorengineering.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2019-05-02hw/phb4: Fix references to PHB3Oliver O'Halloran1-2/+2
Currently most of the functionality of phb4_lsi_attributes() is disabled when we have #defined DISABLE_ERR_INTS. This is the default behaviour and #undefing the constant results in skiboot not compiling because the code was not updated when it was copied across from PHB3. This patch fixes the problem by changing the names to the phb4 versions. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2019-05-02npu2: Disable Probe-to-Invalid-Return-Modified-or-Owned snarfing by defaultAlexey Kardashevskiy2-13/+57
V100 GPUs are known to violate the NVLink2 protocol in some cases (one is when memory was accessed by the CPU and then by the GPU using so-called block linear mapping) and issue double probes to the NPU, which can cope with this problem only if CONFIG_ENABLE_SNARF_CPM ("disable/enable Probe.I.MO snarfing a cp_m") is not set in the CQ_SM Misc Config register #0. If the bit is set (which is the case today), the NPU issues a machine check stop. The snarfing feature is designed to detect 2 probes in flight and combine them into one. This adds a new "opal-npu2-snarf-cpm" nvram variable which controls CONFIG_ENABLE_SNARF_CPM for all NVLinks to prevent the machine check stop from happening. This disables snarfing by default as otherwise a broken GPU driver can crash the entire box even when a GPU is passed through to a guest. This provides a dial to allow regression tests (which might be useful on bare metal). To enable snarfing, the user needs to run: sudo nvram -p ibm,skiboot --update-config opal-npu2-snarf-cpm=enable and reboot the host system. While at it, define macros for register names as well to avoid touching the same lines over and over again. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
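The default-off behaviour described above amounts to something like the following sketch. It is not the code from the patch: the function names are hypothetical, the value string is taken from the command in the message, and `EXAMPLE_SNARF_CPM_BIT` is a placeholder because the message does not give the real bit position in the CQ_SM Misc Config register.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Placeholder only: the real CONFIG_ENABLE_SNARF_CPM bit is not stated. */
#define EXAMPLE_SNARF_CPM_BIT (1ULL << 0)

/*
 * Snarfing is requested only when the nvram variable is present and set
 * to "enable". A missing variable (NULL) or any other value keeps the
 * safe default of snarfing disabled.
 */
static bool snarf_cpm_requested(const char *nvram_value)
{
	return nvram_value != NULL && strcmp(nvram_value, "enable") == 0;
}

/* Return the register value with the snarf bit set or cleared to match. */
static uint64_t apply_snarf_cpm(uint64_t reg, const char *nvram_value)
{
	if (snarf_cpm_requested(nvram_value))
		return reg | EXAMPLE_SNARF_CPM_BIT;
	return reg & ~EXAMPLE_SNARF_CPM_BIT;
}
```

Treating anything other than an explicit "enable" as "disabled" means a typo in the nvram setting fails towards the configuration that cannot checkstop the machine.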
2019-05-02core/init: LPC isn't just P8 (fix comment)Stewart Smith1-1/+1
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2019-05-02doc: Add (most) nvram debugging optionsStewart Smith6-1/+166
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2019-04-29hw/npu2: Show name of opencapi error interruptsFrederic Barrat1-2/+5
Add the name of which error interrupt is received. Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2019-04-29core/pci: Use PHB io-base-location by default for PHB slotsOliver O'Halloran1-0/+9
On witherspoon only the GPU slots and the three pluggable PCI slots (SLOT0, 1, 2) have platform defined slot names. For builtin devices such as the SATA controller or the PLX switch that fans out to the GPU slots we have no location codes, which some people consider an issue. This patch addresses the problem by making the ibm,slot-location-code for the root port device default to the ibm,io-base-location-code which is typically the location code for the system itself. e.g. pciex@600c3c0100000/ibm,loc-code "UOPWR.0000000-Node0-Proc0" pciex@600c3c0100000/pci@0/ibm,loc-code "UOPWR.0000000-Node0-Proc0" pciex@600c3c0100000/pci@0/usb-xhci@0/ibm,loc-code "UOPWR.0000000-Node0" The PHB node, and the root complex nodes have a loc code of the processor they are attached to, while the usb-xhci device under the root port has a location code of the system itself. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2019-04-29hw/phb4: Read ibm,loc-code from PBCQ nodeOliver O'Halloran1-2/+2
On P9 the PBCQs are subdivided by stacks which implement the PCI Express logic. When phb4 was forked from phb3 most of the properties that were in the pbcq node moved into the stack node, but ibm,loc-code was not one of them. This patch fixes the phb4 init sequence to read the base location code from the PBCQ node (parent of the stack node) rather than the stack node itself. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2019-04-17hw/xscom: P9P rather than P9Stewart Smith1-1/+1
Fixes: 2c8f96534a978bb4cac3e4b7dd393a9cc4926555 Signed-off-by: Stewart Smith <stewart@linux.ibm.com>