Age | Commit message (Collapse) | Author | Files | Lines |
|
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
As an immediate mitigator for a current hardware glitch, add a procedure
that can be used to validate NTL credit values. This will be called as a
safeguard to check that link training succeeded.
Assert that things are exactly as we expect, because if they aren't, the
system will experience a catastrophic failure shortly after the start of
link traffic.
Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Acked-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit a752f2d908d1fd3d95cedf2b952729995ce53234)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Revise the NPU2DEV{DBG,INF,ERR} logging macros to include the device's
bdfn. It's useful to know exactly which link we're referring to.
For instance, instead of
[ 234.044921238,6] NPU6: Starting procedure reset_ntl
[ 234.048578101,6] NPU6: Starting procedure reset_ntl
[ 234.051049676,6] NPU6: Starting procedure reset_ntl
[ 234.053503542,6] NPU6: Starting procedure reset_ntl
[ 234.057182864,6] NPU6: Starting procedure reset_ntl
[ 234.059666137,6] NPU6: Starting procedure reset_ntl
we'll get
[ 234.044921238,6] NPU6:0:0.0 Starting procedure reset_ntl
[ 234.048578101,6] NPU6:0:0.1 Starting procedure reset_ntl
[ 234.051049676,6] NPU6:0:0.2 Starting procedure reset_ntl
[ 234.053503542,6] NPU6:0:1.0 Starting procedure reset_ntl
[ 234.057182864,6] NPU6:0:1.1 Starting procedure reset_ntl
[ 234.059666137,6] NPU6:0:1.2 Starting procedure reset_ntl
Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Acked-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit 1c5417ec1898dbdd072495913d7fbc657e570e6f)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
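A minimal sketch of how such a log prefix could be built, assuming the usual 16-bit PCI bdfn layout (bus in bits 15:8, device in bits 7:3, function in bits 2:0) and hexadecimal fields. The helper name `npu_log_prefix` is hypothetical and is not the macro changed by the patch.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper (not from the patch): build a prefix of the form
 * "NPU<phb>:<bus>:<dev>.<fn>" from a 16-bit bdfn.
 * Assumed layout: bus = bits 15:8, device = bits 7:3, function = bits 2:0.
 */
static int npu_log_prefix(char *buf, size_t len,
			  unsigned int phb_id, uint16_t bdfn)
{
	return snprintf(buf, len, "NPU%u:%x:%x.%x",
			phb_id,
			(unsigned int)((bdfn >> 8) & 0xff),  /* bus */
			(unsigned int)((bdfn >> 3) & 0x1f),  /* device */
			(unsigned int)(bdfn & 0x7));         /* function */
}
```

With this layout, a bdfn of 0x0009 on PHB 6 decodes to bus 0, device 1, function 1.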
|
|
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Fix cut and paste from phb3. The sizes have changed now that we have GEN4,
so the check here needs to change too.
Without this we end up with the default settings (all '7') rather
than what's in HDAT.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit d957e9278994831831f0bfa7d13433da4b36b4de)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
These aren't copied currently but should be.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit 4e74625dc9f1e1d080128203dc410ff0027c4fcf)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
The M32 BAR is the PHB4 region used to map all the non-prefetchable
or 32-bit device BARs. It's supposed to have its segments remapped
via the MDT and Linux relies on that to assign them individual PE#.
However, we weren't configuring that properly and instead used the
mode where PE# == segment#, thus causing EEH to freeze the wrong
device or PE#.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit 4002ea166fde4b4e44f6571027c60c6b75df5c33)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
A PE number can be up to 9 bits, so using a uint8_t won't fly.
That was causing errors on config accesses to freeze the
wrong PE.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit a1cd5529a84ccf3c88e12a864705e4de93908d8e)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
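The truncation can be illustrated with a small sketch. The helper names are hypothetical. The point is only that a 9-bit value does not fit in 8 bits, so the top bit is silently dropped.

```c
#include <stdint.h>

/* A 9-bit PE number can take values 0..511. */
#define PE_NUM_BITS	9
#define PE_NUM_MAX	((1u << PE_NUM_BITS) - 1)

/* Hypothetical: storing a PE number in 8 bits drops bit 8. */
static uint8_t pe_as_u8(unsigned int pe)
{
	return (uint8_t)pe;
}

/* Hypothetical: a 16-bit type holds the full 9-bit value. */
static uint16_t pe_as_u16(unsigned int pe)
{
	return (uint16_t)pe;
}
```

For example, PE 0x105 stored in a uint8_t becomes 0x05, so an error would be attributed to PE 5 instead of PE 261.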
|
|
There are three different ways we configure the MCD and memory map.
1) Old way (current way)
Skiboot configures the MCD and puts GPUs at 4TB and below
2) New way with MCD
Hostboot configures the MCD and skiboot puts GPU at 4TB and above
3) New way without MCD
No one configures the MCD and skiboot puts GPU at 4TB and below
The patch keeps option 1 and adds options 2 and 3.
The different configurations are detected using certain scoms (see
patch).
Option 1 will go away eventually as it's a configuration that can
cause xstops or data integrity problems. We are keeping it around to
support existing hostboot.
Option 2 supports only 4 GPUs and 512GB of memory per socket.
Option 3 supports 6 GPUs and 4TB of memory but may have some
performance impact.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit c76636f3d73fbbc3a1f56ca085eb80f9e56d0411)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
This map is soon to be replaced, but we are going to keep it around
for a little while so that we support older hostboot firmware.
Rename it for now.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit 74d9a50ac2a79d39b07bbe7d7e0bbb72f7f5b026)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Pull out MCD writing code into npu2_mcd_init()
No functional change.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit 75371796ac595a4ce2f1b6bd254f5f3ad7416a96)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
This refactors the BAR setting code to make it clearer and handle a
larger range of BAR addresses. This is needed as we are about to move
the GPU to a physical address that is currently not supported by this
code.
This change derives group and chip sections of the BAR from the base
address rather than the chip_id now. mem sel is also derived from the
base address, rather than assuming 0.
No functional change.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit 3c0408ded5a14e110c8a418be305ac20714cb32d)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
This code is replicated, so let's put it in a function. Also add some cleanups.
No functional change.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit 4f4bf83128c1d944782f02b238e632ed8d2451af)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
On P9 the I2C master is shared with the OCC. Currently the watermark
values are set once at init time which is bad for two reasons:
a) We don't take the OCC master lock before setting it, which
may cause issues if the OCC is currently using the master.
b) The OCC might change the watermark levels and we need to reset
them.
Change this so that we set the watermark value when a new transaction
is started rather than at init time.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit d0f06269ed3cd3e09a9b04c5f70cb3d53a77a689)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
New init value from HW folks for the fence enable register.
This clears bit 17 (CFG Write Error CA or UR response) and bit 22 (MMIO Write
DAT_ERR Indication) and sets bit 21 (MMIO CFG Pending Error)
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit 560eb231d6bf36442cb467b6d63e2c5342c34dd8)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
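The bit changes described above can be sketched as follows, assuming a 64-bit register and IBM bit numbering (bit 0 is the most significant bit). The function name is hypothetical, and the actual init value from the patch is not reproduced here.

```c
#include <stdint.h>

/* IBM bit numbering: bit 0 is the most significant bit of a 64-bit word. */
#define PPC_BIT(bit)	(0x8000000000000000ULL >> (bit))

/*
 * Hypothetical helper: apply the described change to an existing
 * fence enable value. Clears bits 17 and 22, sets bit 21.
 */
static uint64_t update_fence_enable(uint64_t old)
{
	uint64_t val = old;

	val &= ~(PPC_BIT(17) | PPC_BIT(22));	/* CFG write error, MMIO write DAT_ERR */
	val |= PPC_BIT(21);			/* MMIO CFG pending error */
	return val;
}
```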
|
|
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Change the implementation of reset_ntl to match the latest programming
guide documentation.
Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit 30ea08acc2538869229bcdeb0ec79eedc5557e94)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Change the RX clk mux control to be done by software instead of HW. This
avoids glitches caused by changing the mux setting.
Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Reviewed-By: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit ac6f1599ff330fa602b3c9557a08f31f1158a55f)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Add a 4-byte version of npu2_write_mask().
Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Reviewed-by: Alistair Popple <alistair@popple.id.au>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit d6f2505b15422e3c63932a67278ebdbca67047d5)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
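A masked write is a read-modify-write that only changes the bits selected by a mask. The following is a sketch of the 32-bit arithmetic only. The function name is hypothetical, and the register read and write around it are omitted.

```c
#include <stdint.h>

/*
 * Hypothetical helper: compute the new 4-byte register value for a
 * masked write. Bits set in 'mask' are taken from 'val', all other
 * bits keep their value from 'old'.
 */
static uint32_t write_mask_4b(uint32_t old, uint32_t val, uint32_t mask)
{
	return (old & ~mask) | (val & mask);
}
```

For instance, writing 0x00001100 under mask 0x0000ff00 into 0xaabbccdd yields 0xaabb11dd.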
|
|
We could never clear "unconditional notify" and "escalate"
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit 7c2a76705674a6594462b859d1ac5affcde96593)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
This updates some inits based on information from the HW
designers. This includes enabling some new DD2.0 features
that we don't yet exploit.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit 484d26fd6e65b00f746f852bccb460fef7b695e0)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Commit fd6b71fc fixed the situation where the ipmi console was open (hvc0) but
data arrived on a different console (hvc1).
During FSP R/R, OPAL closes all consoles. After R/R completes, the FSP requests to
open hvc1 and sends data on it. If hvc1 registration failed, or hvc1 is not opened in
the host kernel, the data is never read, which results in RCU stalls.
Note that this is a workaround for older kernels that don't have a separate irq
for each console. The latest kernel works fine without this patch.
CC: stable
CC: Sam Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
POWER7/8 use DSCR=0. POWER9 preferred value has "stride-N" enabled.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
This patch reworks the way timeouts are set so that rather than imposing
a hard deadline based on the transaction length it uses a
kick-the-can-down-the-road approach where the timeout will be reset each
time data is written to or received from the master. This fits better
with the actual failure modes that timeouts are designed to handle, such
as unusually slow or broken devices.
Additionally, this patch moves all the special case detection out of the
timeout handler. This helps to improve the robustness of the driver and
prepares for a more substantial rework of the driver as a whole later on.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
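The difference from a fixed deadline can be sketched as below. The structure and function names are hypothetical and time is an abstract tick count. This is an illustration of the approach, not the driver's code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-transfer timer state. */
struct xfer_timer {
	uint64_t deadline;	/* tick at which the transfer is declared dead */
	uint64_t window;	/* allowed idle time between data events */
};

/* Call whenever data is written to or received from the master. */
static void xfer_kick(struct xfer_timer *t, uint64_t now)
{
	t->deadline = now + t->window;
}

/* True once no progress has been made for longer than 'window'. */
static bool xfer_timed_out(const struct xfer_timer *t, uint64_t now)
{
	return now > t->deadline;
}
```

A slow device that keeps making progress never hits the deadline, while a device that stops responding times out one window after its last data event.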
|
|
0679f61244b "fast-reset: by default (if possible)" broke NPU: the
NV links do not get enabled after reboot.
This disables fast reboot for NPU machines until a better solution is found.
Suggested-by: Andrew Donnellan <andonnel@au1.ibm.com>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Workaround on P9: PRD does operations it *knows* will fail with this
error to work around a hardware issue where accesses via the PIB
(FSI or OCC) work as expected but accesses via the ADU (what xscom goes
through) do not. The chip logic will always return all FFs if there
is any error on the scom.
Suggested-by: Daniel M Crowell <dcrowell@us.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Acked-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Add a workaround for a HW logic bug in Power9 where TB residue and HDEC
parity errors cleared by one thread aren't visible to other threads of the same
core. The TB residue and HDEC parity errors are reported through TFMR bits 45
and 26 respectively. If any thread of the core clears TFMR bits
26 and 45, only thread 0 is able to see that the errors are cleared; the
other threads (1, 2 and 3) do not see them as cleared. This causes TB error
recovery to fail for TB residue and HDEC parity errors. TFMR is a per-core
register, and any change made by one thread should be visible to the other
threads of the same core.
On a TB residue error (TFMR bit 45), the TB goes into an invalid state. Hence avoid
handling/clearing the TB residue error if the TB is valid and running. Use TFMR bit 41
to check the validity of the TB state.
For an HDEC parity error (TFMR bit 26), check for other errors in the TFMR register
and ignore the pre-recovery for the HDEC parity error. If TFMR has any other
TB error bits set along with the HDEC parity error, we can safely ignore handling
of the HDEC parity error. Also, while clearing the HDEC parity error bit from TFMR,
allow only thread 0 to clear it.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
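The decisions described above can be sketched as three predicates. This assumes IBM bit numbering and that TFMR bit 41 being set means the TB is valid. The macro and function names are hypothetical, and the set of "other TB error bits" is passed in by the caller because the message does not list them.

```c
#include <stdbool.h>
#include <stdint.h>

#define PPC_BIT(bit)		(0x8000000000000000ULL >> (bit))

/* Bit positions as given in the commit message; names are hypothetical. */
#define TFMR_HDEC_PARITY_ERR	PPC_BIT(26)
#define TFMR_TB_VALID		PPC_BIT(41)	/* assumed: set means TB is valid */
#define TFMR_TB_RESIDUE_ERR	PPC_BIT(45)

/* Skip a (possibly stale) residue error if the TB is valid and running. */
static bool should_handle_tb_residue(uint64_t tfmr)
{
	return (tfmr & TFMR_TB_RESIDUE_ERR) && !(tfmr & TFMR_TB_VALID);
}

/* Skip HDEC parity pre-recovery when any other TB error bit is also set. */
static bool should_handle_hdec_parity(uint64_t tfmr, uint64_t other_tb_errs)
{
	if (!(tfmr & TFMR_HDEC_PARITY_ERR))
		return false;
	return !(tfmr & other_tb_errs);
}

/* Only thread 0 of the core is allowed to clear the HDEC parity bit. */
static bool may_clear_hdec_parity(unsigned int thread_index)
{
	return thread_index == 0;
}
```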
|
|
On TB/HDEC errors, all 4 threads on the affected core receive an HMI. On POWER9,
every thread on the core has its own copy of TB/HDEC and hence every thread
has to clear the dirty data from its own TB/HDEC register before we clear TB
errors through TFMR[24]. The HMI recovery would fail if even one thread
does not clean up its TB/HDEC register.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Freeze events such as MMIO loads can cause the PHB to lose its
limited powerbus credits. If all credits are used, a further MMIO
will cause a checkstop.
To work around this, we escalate the troublesome freeze events to a
fence. The fence will cause a full PHB reset which resets the powerbus
credits and avoids the checkstop.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
We are going to reuse this so move it earlier. No functional change
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
New inits based on next PHB4 workbook. Increases some timeouts to
avoid some spurious error conditions.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Linux EEH flow is somewhat broken. It saves the PCIe config space of
the PHB on boot, which it then uses to restore on EEH recovery. It
does this to restore MMIO bars and some other pieces.
Unfortunately this save is done before any drivers are bound to
devices under the PHB. A number of other things are configured in the
PHB after drivers start, hence some configuration space settings
aren't saved correctly. These include bus master and MMIO bits in the
command register.
Linux tried to hack around this in this Linux commit:
bf898ec5cb powerpc/eeh: Enable PCI_COMMAND_MASTER for PCI bridges
This sets the bus master bit but ignores the MMIO bit.
Hence we lose MMIO after a full PHB reset. This causes the next MMIO
access to the device to fail and for us to perform a PE freeze
recovery, which still doesn't set the MMIO bit and hence we still
fail.
This patch works around the problem by forcing MMIO on during
phb4_root_port_init().
With this we can recover from a PHB fence event on POWER9.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
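For reference, the two command register bits involved can be sketched as below. The bit values are the standard PCI command register bits (memory space enable is bit 1, bus master enable is bit 2), named here as in Linux. The helper function is hypothetical and does not reproduce phb4_root_port_init().

```c
#include <stdint.h>

/* Standard PCI command register bits. */
#define PCI_COMMAND_MEMORY	0x0002	/* respond to memory space (MMIO) */
#define PCI_COMMAND_MASTER	0x0004	/* enable bus mastering */

/*
 * Hypothetical helper: force the MMIO enable bit on in a command
 * register value, leaving every other bit untouched.
 */
static uint16_t force_mmio_enable(uint16_t cmd)
{
	return cmd | PCI_COMMAND_MEMORY;
}
```

A restored command value with only the bus master bit set (0x0004) becomes 0x0006, so MMIO decoding stays enabled after the reset.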
|
|
Use phb4_ioda_sel() in phb4_read_phb_status() rather than re-implementing it.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Log root complex accesses and print the BDFN on device access.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
This is old unused code from phb3 so just remove it.
No functional change
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
No functional change.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
No functional change.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
If we hit this message we'll retry and fix the problem. If we run out
of retries and can't fix the problem, we'll still print a log message
at error level indicating a problem.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Reported-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
In this fix:
62ac7631ae phb4: Fix PCIe GEN4 on DD2.1 and above
We fixed DD2.1 GEN4 but broke DD2.00 as GEN3.
This fixes DD2.00 back to GEN3. This time for sure!
Signed-off-by: Michael Neuling <mikey@neuling.org>
Tested-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
|
|
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
In this change:
eef0e197ab PHB4: Default to PCIe GEN3 on POWER9 DD2.00
We clamped DD2.00 parts to GEN3 but unfortunately this change also
applied to DD2.1 and above.
This fixes the clamp to only apply to DD2.00.
This also cleans up the documentation and printing.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
'special_wakeup_count' is incremented on successfully asserting
special wakeup, so it is never zero when we come to clear it. Checking
for 'special_wakeup_count' being zero therefore means we never clear the
special wakeup. Fix this by checking that 'special_wakeup_count' is 1
in dctl_clear_special_wakeup().
Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
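The counting logic can be sketched as follows. The structure and function names are hypothetical and the hardware access is reduced to a flag. This illustrates why the last user sees a count of 1, not 0, and is not the code of dctl_clear_special_wakeup().

```c
#include <stdbool.h>

/* Hypothetical per-core state. */
struct core_state {
	unsigned int special_wakeup_count;	/* number of active users */
	bool hw_wakeup_asserted;		/* stands in for the HW register */
};

static void set_special_wakeup(struct core_state *c)
{
	if (c->special_wakeup_count == 0)
		c->hw_wakeup_asserted = true;	/* first user asserts */
	c->special_wakeup_count++;
}

static void clear_special_wakeup(struct core_state *c)
{
	if (c->special_wakeup_count == 0)
		return;				/* nothing to clear */
	/*
	 * The count was incremented on assert, so the last user sees 1
	 * here. A test for 0 at this point would never be true.
	 */
	if (c->special_wakeup_count == 1)
		c->hw_wakeup_asserted = false;
	c->special_wakeup_count--;
}
```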
|
|
Some instances have been observed where the special wakeup assert
times out. The current timeout is too short for deeper sleep states.
Hostboot uses 100ms, so match that.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Use dt_find_by_name_addr() instead of dt_find_by_name(). That way we
can avoid unnecessary memory allocation/cleanup.
CC: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Presently we traverse the SLCA structure to create various FRU nodes under the /vpd
node. We assumed that children are always contiguous. They happened to be
contiguous on P8, where this worked fine, but not on P9 systems. So we ended up
populating duplicate nodes under the wrong parent, and also failed to populate some
of the nodes.
Unfortunately there is no way to reach all the children of a given parent
from parent node :-( Hence we have to rework vpd creation logic.
This patch goes through all the SLCA entries serially and creates vpd node.
Assumptions:
- SLCA index is always serial (0..n)
- When we traverse serially parent entry comes before child
- Redundant resources are always consecutive
- Populate node if SLCA has 'installed' and 'VPD collected' bit set
CC: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Reviewed-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|