Age | Commit message (Collapse) | Author | Files | Lines |
|
FSP sends MBOX command (cmd : 0xEB, subcmd : 0x05, mod : 0x00) to get vNVRAM
statistics. OPAL doesn't maintain any such statistics. Hence return
FSP_STATUS_INVALID_SUBCMD.
Sample OPAL log:
[16944.384670488,3] FSP: Unhandled message eb0500
[16944.474110465,3] FSP: Unhandled message eb0500
[16945.111280784,3] FSP: Unhandled message eb0500
[16945.293393485,3] FSP: Unhandled message eb0500
With this patch, I don't think FSP will ever call "free vNVRAM" MBOX command.
But to be safer side lets return FSP_STATUS_INVALID_SUBCMD for this MBOX
command as well.
Reported-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Tested-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
xscom_read/write operations returns CHIPLET_OFFLINE when chiplet is offline.
Some multicast xscom_read/write requests from HBRT results in xscom operation
on offline chiplet(s) and printing below warnings in OPAL console.
[ 135.036327572,3] XSCOM: Read failed, ret = -14
[ 135.092689829,3] XSCOM: Read failed, ret = -14
This results in unnecessary bugs. Hence remove error message for multicast
SCOM operations.
Suggested-by: Daniel M Crowell <dcrowell@us.ibm.com>
Tested-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
OPAL logs messages for every IPMI request from host. Sometime OPAL console
is filled with only these messages. This path is pretty stable now and
we have enough logs to cover bad path. Hence lets convert these debug
message to trace/info message.
[ 1356.423958816,7] opal_ipmi_recv(cmd: 0xf0 netfn: 0x3b resp_size: 0x02)
[ 1356.430774496,7] opal_ipmi_send(cmd: 0xf0 netfn: 0x3a len: 0x3b)
[ 1356.430797392,7] BT: seq 0x20 netfn 0x3a cmd 0xf0: Message sent to host
[ 1356.431668496,7] BT: seq 0x20 netfn 0x3a cmd 0xf0: IPMI MSG done
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Update fsp_lid_map to include CAPP ucode lids for phb4-chipid ==
0x200d1 and phb4-chipid == 0x201d1 that corresponds to P9 DD-2.0 &
DD-2.1 chips respectively.
Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
When using the aspeed SUART, we see a condition where the UART sends
continuous character timeout interrupts. This change adds a (heavily
commented) dummy read from the RBR to clear the interrupt condition on
init.
Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Recently, a link_retries counter was added in pci/phb4 in order
Skiboot can retry to train a link some times - default number of
attempts to retrain a link is 3.
Happens that, if during a regular boot process we exhaust the
link retries and fail to train a PHB, the variable link_retries
is stuck in 0. If a kdump happens later, a PHB reset procedure is
triggered by Linux and, since we have a decrement-and-test in this
variable, we end up setting it to -1; it's unsigned, hence we get
an overflow.
This patch fixes the issue by reassigning the default value to
link_retries in every IODA purge.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
The single port version of the ConnectX-5 has a different device ID 0x1017.
Updated descriptions to match pciutils database.
Signed-off-by: John Walthour <jwalthour@us.ibm.com>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Silences sparse warning:
hw/phb4.c:XX:20: warning: symbol 'retry_whitelist' was not declared. Should it be static?
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Mask the PSL credit timeout error in CAPP FIR Mask register
bit(46). As per the h/w team this error is now deprecated and shouldn't
cause any fir-action for P9.
Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
Acked-by: Christophe Lombard <clombard@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Add pm idle support to POWER9. IPIs are implemented with doorbells.
POWER9 can use the EC=ESL=0 (lite) stop when sreset is not available.
EC=ESL=1 state with RL=3 is enabled when we have a sreset wakeup.
Deep idle states are not implemented.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
OOC/gpe nest microcode maintains the list of individual nest units
supported. Sync the recent updates to the UAV with nest_pmus array.
For reference occ/gpr microcode link for the UAV:
https://github.com/open-power/occ/blob/master/src/occ_gpe1/gpe1_24x7.h
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
[stewart@linux.vnet.ibm.com: add in reference to ucode]
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
We added degraded link retries in:
3f936bae97 phb4: Retrain link if degraded
but forgot to update the documentation.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
In this recent fix:
8b4c7a3cef phb4: Mask RXE_ARB: DEC Stage Valid Error
We worked around a problem but the workaround wasn't complete.
Now that we have full documentation and details on the issue, we have
additional registers we need to change inits on.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Currently we treat a timeout as a hard failure and will automatically
fail any transations that hit their timeout. This results in
unnecessarily failing I2C requests if interrupts are dropped, etc.
Although these are bad things that we should log we can handle them
better by checking the actual hardware status and completing the
transation if there are no real errors. This patch reworks the timeout
handling to check the status and continue the transaction if it can.
if it can while logging an error if it detects a timeout due to a
dropped interrupt.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
CAPP recovery is initiated when a CAPP Machine Check is detected.
The capp recovery procedure is initiated via a Hypervisor Maintenance
interrupt (HMI).
CAPP Machine Check may arise from either an error that results in a PHB
freeze or from an internal CAPP error with CAPP checkstop FIR action.
An error that causes a PHB freeze will result in the link down signal
being asserted. The system continues running and the CAPP and PSL will
be re-initialized.
Tests performed on some of the old/new hardware.
Signed-off-by: Christophe Lombard <clombard@linux.vnet.ibm.com>
Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Binding GPU to emulated NPU PCI devices is done using the slot labels
since the NPU devices do not have a patching slot node we need to
copy the label in here.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
This needs to be in the PCI device node so the speed of the NVLink
can be passed to the GPU driver.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
These two prints just end up filling the skiboot logs on any machine
that's been booted for more than a few hours.
They have never been useful, so make them trace level.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Change the inits to mask out the RXE ARB: DEC Stage Valid Error (bit
370. This has been a fatal error but should be informational only.
This update will be in the next version of the phb4 workbook.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
When a core enters stop4, it does not loose decrementer and time base.
Hence removing flags OPAL_PM_DEC_STOP and OPAL_PM_TIMEBASE_STOP.
Signed-off-by: Akshay Adiga <akshay.adiga@linux.vnet.ibm.com>
Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Use a common variable has_wakeup_engine instead of has_slw to tell if
the
a) SLW image is populated in case of power8
b) CME image is populated in case of power9
Currently we expect CME to be loaded if homer address is known ( except
for simulators)
Signed-off-by: Akshay Adiga <akshay.adiga@linux.vnet.ibm.com>
Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Make a stop api call using libpore to restore HRMOR register. HRMOR needs
to be cleared so that when thread exits stop, they arrives at linux
system_reset vector (0x100).
Signed-off-by: Akshay Adiga <akshay.adiga@linux.vnet.ibm.com>
Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
This OPAL call is made from Linux to OPAL to configure values in
various SPRs after wakeup from a deep idle state.
Signed-off-by: Akshay Adiga <akshay.adiga@linux.vnet.ibm.com>
Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
This reverts commit 0a2710381f34e6b4c03cff1fa76bc1b74f280ecd.
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
You can use the NVRAM override for DD2.00 screened parts.
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Set a few new values in the PHY_RESET procedure, as specified by our
updated programming guide documentation.
Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
I'm seeing an infinite loop while hot unplugging
a CPU. This is a workaround till we do the right
things for p9. May be a candidate for backporting
The messages I see in an infinite loop are:
[ 740.250192896,3] LIBPORE: Core ID = 20 is not within valid range of [0;15]
[ 740.250230176,3] SLW: Failed to set spr for CPU 51
When trying to hotunplug core id 20. For now the
patch just skips calling p8_pore* on p9 machines.
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
In the recent change:
3f936bae97 phb4: Retrain link if degraded
We retrain if the link is degraded. We do 3 retries to get an optimal
link.
Unfortunately if the last retry fails, we mark the PHB as bad and
don't use it. Hence that PHB is lost even though it actually trained
(just degraded).
This fixes the problem by printing an error message (as below) but
still marking the PHB as good.
[ 7.179320404,3] PHB#0005[0:5]: LINK: Link degraded
[ 8.387346665,3] PHB#0005[0:5]: LINK: Link degraded
[ 10.078409137,3] PHB#0005[0:5]: LINK: Link degraded
[ 11.281477269,3] PHB#0005[0:5]: LINK: Link degraded
[ 11.283123885,3] PHB#0005[0:5]: LINK: Degraded but no more retries
Signed-off-by: Michael Neuling <mikey@neuling.org>
Acked-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
On P9 Scale Out (Nimbus) DD2.0 and Scale in (Cumulus) DD1.0 (and
below) the PCIe PHY can lockup causing training issues. This can cause
a degradation in speed or width in ~5% of training cases (depending on
the card). This is fixed in later chip revisions. This issue can also
cause PCIe links to not train at all, but this case is already
handled.
This patch checks if the PCIe link has trained optimally and if not,
does a full PHB reset (to fix the PHY lockup) and retrain.
One complication is some devices are known to train degraded unless
device specific configuration is performed. Because of this, we only
retrain when the device is in a whitelist. All devices in the current
whitelist have been testing on a P9DSU/Boston, ZZ and Witherspoon.
We always gather information on the link and print it in the logs even
if the card is not in the whitelist.
For testing purposes, there's an nvram to retry all PCIe cards and all
P9 chips when a degraded link is detected. The new option is
'pci-retry-all=true' which can be set using:
nvram -p ibm,skiboot --update-config pci-retry-all=true
This option may increase the boot time if used on a badly behaving
card.
Signed-off-by: Michael Neuling <mikey@neuling.org>
[stewart@linux.vnet.ibm.com: fix Cumulus VERS_MAJ r.e. Mikey mail]
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Make link retries a #define rather than open coding it in the PHB4
init code.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Split phb4_get_link_state() into a new function so that it can be
reused to get info on the speed and width of the link.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Move nvram read to the PHB4 init code so that's it's only read once,
rather than every time we go though PHB reset.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
This code was never used (since retries is set to 0), it's not very
useful and it makes the code harder to read. So lets just remove it.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
The HW only supported limited access sizes.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Provide a way to test recoverable data link interrupts via a new
vendor capability byte.
Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Acked-By: Alistair Popple <alistair@popple.id.au>
====== v2 -> v3: ======
* Corrected name of NPU RING (no 2). [Andrew Donnellan]
* Corrected spelling of device. [Andrew Donnellan]
hw/npu2.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Allow the NPU2 to trigger "recoverable data link" interrupts.
Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Acked-By: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Acked-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
When requested via OPAL_XIVE_ANY_CHIP, we need to try all
chips. We first try the current one (on which the caller
sits) and if that fails, we iterate all chips until the
allocation succeeds.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Instead of trying to "pull" everything and clear VT (which didn't
work and caused some FIRs to be set), instead just clear and then
set the PTER thread enable bit. This has the side effect of
completely resetting the corresponding thread context.
This fixes the spurrious XIVE FIRs reported by PRD and fircheck
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
This is high overhead so we don't enable it by default even
in debug builds, it's also a bit messy, but it allowed me to
detect and debug a locking issue earlier so it can be useful.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
We normally allocate IPIs from 0x10. Make that 0x1000 on debug
builds to limit the chances of overlapping with Linux interrupt
numbers which makes debugging code that confuses them easier.
Also add a warning in emulation if we get an interrupt in the
queue whose number is below the gap.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Thankfully the missing locking only affects debug code and
init code that doesn't run concurrently. Also adds a DEBUG
option that checks the lock is properly held.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Cosmetic fix.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Without this, we sometimes don't observe from a CPU the
values written to the ENDs or NVTs via the cache watch.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
This runs 1000 iterations exercising the cache watch and scrub
facilities on VPs and ENDs at boot. This exposes a HW bug with
the scrub which will be worked around in a subsequent patch.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
If this fails, print a bit more info about it.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
This adds debug code to check that the initial updates of
in-memory VPs and EQs via the cache watch and cache scrub
facilities has worked properly.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
We don't use them and we hijack the VP field with their
configuration to store the EQ reference, so make sure the
kernel or guest can't turn them back on by doing MMIO
writes to ACK#
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
That doesn't work, the HW doesn't implement it in the cache
watch facility anyway.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|