Age | Commit message (Collapse) | Author | Files | Lines |
|
Invalid accesses from the GPU can cause a specific PE to be frozen by the
NPU. Add an interrupt handler which reports the frozen PE to the operating
system via as an EEH event.
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Add a mechanism to enable/disable sw checkstop by looking at nvram option
opal-sw-xstop=<enable/disable>.
For now this patch disables the sw checkstop trigger unless explicitly
enabled through nvram option 'opal-sw-xstop=enable'i for p9. This will allow
an opportunity to get host kernel in panic path or xmon for unrecoverable
HMIs or MCE, to be able to debug the issue effectively.
To enable sw checkstop in opal issue following command:
# nvram -p ibm,skiboot --update-config opal-sw-xstop=enable
NOTE: This is a workaround patch to disable sw checkstop by default to gain
control in host kernel for better checkstop debugging. Once we have most of
the checkstop issues stabilized/resolved, revisit this patch to enable sw
checkstop by default.
For p8 platform it will remain enabled by default unless explicitly disabled.
To disable sw checkstop on p8 issue following command:
# nvram -p ibm,skiboot --update-config opal-sw-xstop=disable
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Fixes: 35c66b8ce5a27ad3312806e8bde9148a5e5b5df8
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Add p9_stop_api for EVENT_MASK and PDBAR scoms. These scoms are lost on
wakeup from stop11.
Signed-off-by: Akshay Adiga <akshay.adiga@linux.vnet.ibm.com>
Reviewed-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
While waking up from stop11, we want NCU_DARN_BAR to have enable bit set.
Without this stop_api call, the value restored is without enable bit set.
We loose NCU_SPEC_BAR when the quad goes into stop11, stop_api will
restore while waking up from stop11.
Signed-off-by: Akshay Adiga <akshay.adiga@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
All init time p9_stop_api calls have been isolated to slw_late_init. If
p9_stop_api fails, then the deep states can be excluded from device tree.
For p9_stop_api called after device-tree for cpuidle is created ,
has_deep_states will be used to check if this call is even required.
Signed-off-by: Akshay Adiga <akshay.adiga@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Patch adds a global variable which indicates if the deep states are enabled
through stop-enabled-bits. Only applies to POWER9.
Signed-off-by: Akshay Adiga <akshay.adiga@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Move MAMBO simulator checks to slw_init.
Signed-off-by: Akshay Adiga <akshay.adiga@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Patch introduces wakeup_engine_state which replaces a bool
has_wakeup_engine. wakeup_engine_state can have 3 states :
- WAKEUP_ENGINE_PRESENT : When everything is good.
- WAKEUP_ENGINE_NOT_PRESENT : When wakeup_engine is not correctly detected.
- WAKEUP_ENGINE_FAILED : If any operation on wakeup_engine failed.
Signed-off-by: Akshay Adiga <akshay.adiga@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Patch adds the following changes :
- Moves slw image sanity check to a seperate function called
slw_image_check_p{8,9}()
- Move has_wakeup_engine to global scope, so that it can be set by other
functions
- Code which uses wakeup_engine will only be called if sanity check passes.
Signed-off-by: Akshay Adiga <akshay.adiga@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
This patch seperates code which deals with wakeup_engine from one which
doesn't. Init functions for power8 and power9 are split into chip_init
and late_init. slw_late_init_p?() contains wakeup_engine related code.
Signed-off-by: Akshay Adiga <akshay.adiga@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Create cpuidle device-tree after slw_init(), so that we can stop the
deeper states from being added , when wakeup engine is not present or
failed.
Signed-off-by: Akshay Adiga <akshay.adiga@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
This patch introduces a new function phb4_dump_app_err_regs() that
dumps CAPP error registers in case the PEC nestfir register indicates
that the fence was due to a CAPP error (BIT-24).
Contents of these registers are helpful in diagnosing CAPP
issues. Registers that are dumped in phb4_dump_app_err_regs() are:
* CAPP FIR Register
* CAPP APC Master Error Report Register
* CAPP Snoop Error Report Register
* CAPP Transport Error Report Register
* CAPP TLBI Error Report Register
* CAPP Error Status and Control Register
Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Reviewed-by: Christophe Lombard<clombard@linux.vnet.ibm.com>
Acked-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Adding stop5 idle state with rough residency and latency numbers.
Signed-off-by: Akshay Adiga <akshay.adiga@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Currently the hw/lpc-mbox layer keeps a pointer for the currently
inflight message for the duration of the mbox call. This creates
problems when messages timeout, is that pointer still valid, what can we
do with it. The memory is owned by the caller but if the caller has
declared a timeout, it may have freed that memory.
Another problem is locking. This patch also locks around sending and
receiving to avoid races with timeouts and possible resends. There was
some locking previously which was likely insufficient - definitely too
hard to be sure is correct
All this is made much easier with the previous rework which moves
sequence number allocation and verification into lpc-mbox rather than
the caller.
Signed-off-by: Cyril Bur <cyril.bur@au1.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Currently when mbox-flash decides that a message times out the driver
has no way of knowing to drop the message and will continue waiting for
a response indefinitely preventing more messages from ever being sent.
This is a problem if the BMC crashes or has some other issue where it
won't ever respond to our outstanding message.
This patch provides a method for mbox-flash to tell the driver how long
it should wait before it no longer needs to care about the response.
Signed-off-by: Cyril Bur <cyril.bur@au1.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Signed-off-by: Cyril Bur <cyril.bur@au1.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Kudos to Hugo Landau who reported this in:
https://github.com/open-power/skiboot/issues/142
Reported-by: Hugo Landau <hlandau@devever.net>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Useful for debugging.
Sample output:
[29155.157050283,7] PRD: Unsupported prd message type : 0xc
CC: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Acked-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
This patch adds support to handle OCC load/start event from FSP/PRD.
During IPL we send a success directly to FSP without invoking any HBRT
load routines on recieving OCC load mbox message from FSP. At runtime
we forward this event to host opal-prd.
This patch provides support for invoking OCC load/start HBRT routines
like load_pm_complex() and start_pm_complex() from opal-prd.
Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
This patch handles OCC_RESET runtime events in host opal-prd and also
provides support for calling 'hostinterface->wakeup()' which is
required for doing the reset operation.
Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
Acked-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
In P9 HBRT sends error logs to FSP via firmware_request interface.
This patch adds support to parse error log and send it to FSP.
CC: Jeremy Kerr <jk@ozlabs.org>
CC: Daniel M Crowell <dcrowell@us.ibm.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Acked-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Don't add counter type of sensors to device-tree as they don't
fit into hwmon sensor interface.
Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Some HostBoot versions leave those as checkstop, they are harmless
and can sometimes occur during normal operations.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Tested-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
The current workaround for the scrub bug described in
__xive_cache_scrub() has an issue in that it can leave
dirty invalid entries in the cache.
When cleaning up EQs or VPs during reset, if we then
remove the underlying indirect page for these entries,
the XIVE will checkstop when trying to flush them out
of the cache.
This replaces the existing workaround with a new pair of
workarounds for VPs and EQs:
- The VP one does the dummy watch on another entry than
the one we scrubbed (which does the job of pushing old
stores out) using an entry that is known to be backed by
a permanent indirect page.
- The EQ one switches to a more efficient workaround
which consists of doing a non-side-effect ESB load from
the EQ's ESe control bits.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
This is bogus, we don't support them. (Thankfully the callers
didn't actually try to use this on escalation interrupts).
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Tested-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Removing the valid bit means a FIR will trip if it's accessed
inadvertently. Under some circumstances, the XIVE will speculatively
access an IVE for a masked interrupt and trip it. So make sure that
freed entries are still marked valid (but masked).
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Tested-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Currently we reset the timebase value to (almost) zero when
synchronising the timebase of each chip to the Chip TOD network which
results in this:
[ 42.374813167,5] CPU: All 80 processors called in...
[ 2.222791151,5] FLASH: Found system flash: Macronix MXxxL51235F id:0
[ 2.222977933,5] BT: Interface initialized, IO 0x00e4
This patch modifies the chiptod_init() process to use the current
timebase value rather than resetting it to zero. This results in the
timestamps remaining contigious from the start of hostboot until
the petikernel starts. e.g.
[ 70.188811484,5] CPU: All 144 processors called in...
[ 72.458004252,5] FLASH: Found system flash: id:0
[ 72.458147358,5] BT: Interface initialized, IO 0x00e4
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Due to a hardware issue where core responding to scom was delayed due to
thread reconfiguration, leaves the SCOM logic in a state where the
subsequent scom to that core can get errors. This is affected for Core
PC scom registers in the range of 20010A80-20010ABF
The solution is if a xscom timeout occurs to one of Core PC scom registers
in the range of 20010A80-20010ABF, a clearing scom write is done to
0x20010800 with data of '0x00000000' which will also get a timeout but
clears the scom logic errors. After the clearing write is done the original
scom operation can be retried.
The scom timeout is reported as status 0x4 (Invalid address) in HMER[21-23].
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
So caller of xscom_reset() does not have to bother about adding a delay
separately. Instead caller can control whether to add a delay or not using
second argument to xscom_reset().
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Currently we have a mismatch between the NCU and PCI timers for MMIO
accesses. The PCI timers must be lower than the NCU timers otherwise
it may cause checkstops.
This changes PCI timeouts controlled by skiboot to 33-50ms. It should
be forwards and backwards compatible with expected hostboot changes to
the NCU timer.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
a. Surveillance response times out and OPAL triggers a HIR
b. Before the HIR process kicks in, OPAL gets a PSI interrupt indicating link down
c. HIR process continues and OPAL tries to write to DRCR; PSI link inactive => xstop
OPAL should confirm that the FSP is not already in reset in the HIR path.
[V2] Handle the case where a second reset is triggered due to the two resets
happening in succession.
Signed-off-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Tested-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
Reviewed-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
disable_unavailable_units() checks whether the ucode
is in the running state before enabling the nest units
in the device tree. From a recent debug, it is found
that on some system boot, ucode is not loaded and
running in all the chips in the system. And this
caused a fail in OPAL_IMC_COUNTERS_STOP call where
we check for ucode state on each chip. Bug here is
that disable_unavailable_units() checks the state
of the ucode only in boot cpu chip. Patch adds a
condition in disable_unavailable_units() to check
for the ucode state in all the chip before enabling
the nest units in the device tree node.
Fixes: f98d59958db19 ('skiboot: Find the IMC DTB')
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
[stewart: clarify with comment and better variable name]
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Add mambo direct controls to stop threads, which is required for
reliable fast-reboot. Enable direct controls by default on mambo.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
This is an initial fast reboot implementation for p9 which has only been
tested on the Witherspoon platform, and without the use of NPUs, NX/VAS,
etc.
This has worked reasonably well so far, with no failures in about 100
reboots. It is hidden behind the traditional fast-reboot experimental
nvram option, until more platforms and configurations are tested.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
cmpxchg will be used in a subsequent change, and this reduces the
amount of asm code.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[stewart: fix some ifdef __TEST__ foo to ensure unittests work]
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
This patch fixes the logging of incorrect SCOM
register names.
Signed-off-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
dtc complains about missing reg property when a DT node is having a
unit name or address but no reg property.
/ibm,opal/sensors/vrm-in@c00004 has a unit name, but no reg property
/ibm,opal/sensors/gpu-in@c0001f has a unit name, but no reg property
/ibm,opal/sensor-groups/occ-js@1c00040 has a unit name, but no reg property
This patch fixes these warnings for new occ inband sensors and also for
sensor-groups by adding necessary properties.
Signed-off-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
dtc complains about missing reg property when a DT node is having a
unit name or address but no reg property.
Example warning for core dts sensor:
/ibm,opal/sensors/core-temp@5c has a unit name, but no reg property
/ibm,opal/sensors/core-temp@804 has a unit name, but no reg property
This patch fixes this by adding necessary properties.
Signed-off-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
[stewart: use handle as register rather than chip id]
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
dtc complains about missing reg property when a DT node is having a
unit name or address but no reg property.
/ibm,opal/power-mgt/psr/cpu-to-gpu@0 has a unit name, but no reg property
/ibm,opal/power-mgt/psr/cpu-to-gpu@100 has a unit name, but no reg property
This patch fixes this by adding necessary properties.
Signed-off-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
imc_dt_update_nest_node() adds a "imc_nest_chip" property
to the "exports" node (under opal_node) to view nest counter
region. This comes handy when debugging ucode runtime
errors (like counter data update or control block update
so on...). And current code enables the property only if
the microcode is in running state at system boot. To aid
the debug of ucode not running/starting issues at boot,
enable the addition of "imc_nest_chip" property always.
Fixes: 167e65d570a7c ('skiboot/hw/imc: Add nest_memory region to "exports" node')
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Currently in NX, only write xscom config failures are tracing.
Add trace statements for read xscom config failures too.
No functional changes.
Signed-off-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
commit ba4d46fdd9eb ("console: Set log level from nvram") wants to read
from NVRAM rather early. This works fine on BMC based systems as
nvram_init() is actually synchronous. This is not true for FSP systems
and it turns out that the query for the console log level simply
queries blank nvram.
The simple fix is to wait for the NVRAM read to complete before
performing any query. Unfortunately it turns out that the fsp-nvram
code does not inform the generic NVRAM layer when the read is complete,
rather, it must be prompted to do so.
This patch addresses both these problems. This patch adds a check before
the first read of the NVRAM (for the console log level) that the read
has completed. The fsp-nvram code has been updated to inform the generic
layer as soon as the read completes.
The old prompt to the fsp-nvram code has been removed but a check to
ensure that the NVRAM has been loaded remains. It is conservative but
if the NVRAM is not done loading before the host is booted it will not
have an nvram device-tree node which means it won't be able to access
the NVRAM at all, ever, even after the NVRAM has loaded.
Signed-off-by: Cyril Bur <cyril.bur@au1.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
The NX rng BAR is used by each core to source random numbers for the
DARN instruction. Currently we configure each core to use the NX rng of
the chip that it exists on. Unfortunately, the NX can be deconfigured by
hostboot and in this case we need to use the NX of a different chip.
This patch moves the BAR assignments for the NX into the normal nx-rng
init path. This lets us check if the normal (chip local) NX is active
when configuring which NX a core should use so that we can fallback
gracefully.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Currently our GEN3 lane equalisation settings are set to 0x77. Change
this to 0x54. This change will allow us to train at GEN3 in a shorter
time and more consistently.
This setting gives us a TX preset 0x4 and RX hint 0x5. This gives a
boost in gain for high frequency signaling. It allows the most optimal
continuous time linear equalizers (CTLE) for the remote receiver port
and de-emphasis and pre-shoot for the remote transmitter port.
Machine Readable Workbooks (MRW) are moving to this new value also.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
These init changes for phb4 from the HW team.
Link down are now endpoint recoverable (ERC) rather than PHB fatal
errors.
BLIF Completion Timeout Error now generate an interrupt rather than
causing freeze events.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Additionally, warn if we find an enabled one that isn't one
of the firmware built-in queues.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
If an allocated VP is left valid at xive_reset() or Linux tries
to free a valid (enabled) VP block, print errors. The former happens
occasionally if kdump'ing while KVM is running so keep it as a debug
message. The latter is a programming error in Linux so use a an
error log level.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|