Age | Commit message (Collapse) | Author | Files | Lines |
|
Fixes: 8f433d6cd4f92b4f878e5ddc414e2800a2fb7140
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
This removes below unnecessary message in phb3_sm_fundamental_reset()
as there already has on subsequent message indicating the situation.
Performing PERST...
Also, this decreases the outputing level of all messages in this
function to DEBUG.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
When issuing fundamental reset on below IPR adapter that seats
behind root complex, there is 50% possibility that the link
fails to come up after the reset. In that case, the adapter's
config space is blocked and it's not usable.
host# lspci -ns 0004:01:00.0
0004:01:00.0 0104: 1014:034a (rev 01)
host# lspci -s 0004:01:00.0
0004:01:00.0 RAID bus controller: IBM PCI-E IPR SAS Adapter
(ASIC) (rev 01)
This introduces another PHB3 state (PHB3_STATE_FRESET_START)
allowing to redo fundamental reset if the link doesn't come up
in time at the first attempt, to improve the robustness of PHB's
fundamental reset. If the link comes up after the first reset,
the 2nd reset won't be issued at all.
Reported-by: Paul Nguyen <nguyenp@us.ibm.com>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
This bug has originated since day 1 (of public release), what was going on
was that we were incorrectly using PSI_DMA_LOC_COD_BUF as the *address*
to write to for the FSP to read rather than using that purely as the
TCE table.
What we *should* have been doing (and this patch now does), is allocating
some (aligned) memory and using it.
With this patch, we no longer write over some poor random memory location
that could be being used by the host OS for something important, for example,
in the (internal) bug report of this, it was futex_hash_bucket in Linux
being replaced with our structure for replying to FSP_CMD_GET_LED_LIST (which
is around 4kb) and Linux doesn't like it when you replace a bunch of lock
data structures with essentially garbage.
Since this is FSP LED code specific, this only affects FSP based systems.
Reported-by: Dionysius d. Bell <belldi@us.ibm.com>
Reviewed-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Tag skiboot-5.1.6
|
|
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
Fixes: 55ae15b
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
This removes below unnecessary message in phb3_sm_fundamental_reset()
as there already has on subsequent message indicating the situation.
Performing PERST...
Also, this decreases the outputing level of all messages in this
function to DEBUG.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
When issuing fundamental reset on below IPR adapter that seats
behind root complex, there is 50% possibility that the link
fails to come up after the reset. In that case, the adapter's
config space is blocked and it's not usable.
host# lspci -ns 0004:01:00.0
0004:01:00.0 0104: 1014:034a (rev 01)
host# lspci -s 0004:01:00.0
0004:01:00.0 RAID bus controller: IBM PCI-E IPR SAS Adapter
(ASIC) (rev 01)
This introduces another PHB3 state (PHB3_STATE_FRESET_START)
allowing to redo fundamental reset if the link doesn't come up
in time at the first attempt, to improve the robustness of PHB's
fundamental reset. If the link comes up after the first reset,
the 2nd reset won't be issued at all.
Reported-by: Paul Nguyen <nguyenp@us.ibm.com>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
When deciding if a BT message has timed out we should first check for
a message response. This will ensure that messages will not time out
if there was a delay calling the pollers.
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
In root causing a bug on AST BMC Alistair found that pollers weren't
being run for around 3800ms.
This was due to a wonderful accident that's probably about a year or more
old where:
In cpu_wait_job we have:
unsigned long ticks = usecs_to_tb(5);
...
time_wait(ticks);
While in time_wait(), deciding on if to run pollers:
unsigned long period = msecs_to_tb(5);
...
if (remaining >= period) {
Obviously, this means we never run pollers. Not ideal.
This patch ensures we run pollers every 5ms in cpu_wait_job() as well
as displaying how long we waited for a job if that wait was >1second.
Reported-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Some BMC firmware versions don't ship pflash.
Support PFLASH_TO_COPY environment variable to a pflash binary
built for the BMC that will be copied over and used to pflash
the partition or whole pnor.
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
It works just like P8, we copy the code for now rather than make
it somewhat common due to our locking differences and to limit
the risk close to release. We can refactor later.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
We didn't pass the right "is_write" argument for writes and
the string used for logging was somewhat confusing.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
When struct phb3::has_link is set to true, the downstream link of
root port is up, not down. This fixes the incorrect comments.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Our normal sequence for a soft power action (IPMI 'power soft' or
'power cycle') involve receiving a SEL from the BMC, sending a message
to Linux's opal platform support which instructs the host OS to shut
down, and finally the host will request OPAL to cut power.
When the host is not yet up we will send the message to /dev/null, and
no action will be taken. This patches changes that behaviour to perform
the action immediately if we know how.
Signed-off-by: Joel Stanley <joel@jms.id.au>
[stewart@linux.vnet.ibm.com: modify checking of OPAL_BOOT_COMPLETE flag, typo]
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
This tells us when we've entered the host. First use case is knowing if
we can can rely on host communication working, such as receiving and
acting on an opal_msg.
Signed-off-by: Joel Stanley <joel@jms.id.au>
Reviewed-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
[stewart@linux.vnet.ibm.com: use real bit field rather than C bitfield]
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
We've seen various IPMI timeouts during testing (mainly hit
by petitboot) but it seems that 5 seconds is the magic value
that matches everywhere.
This echoes what we use in petitboot, so at least being consistent
with ourselves is a good idea.
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
We disallow to inject error to reserved PE#, which is 255 instead
of 0 on PHB3. Otherwise, error OPAL_PARAM is returned when injecting
error to PE#0.
This fixes above issue by checking against the correct PE number 255.
Reported-by: Pradeep Ramanna <pramann2@in.ibm.com>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
In the event of a lot of OCC events (or many CPU cores), we could
send many OCC messages to the host, which if it wasn't calling
opal_get_msg really often, would cause skiboot to malloc() additional
messages until we ran out of skiboot heap and things didn't end up
being much fun.
When running certain hardware exercisers, they seem to steal all time
from Linux being able to call opal_get_msg, causing these to queue up
and get "opalmsg: No available node in the free list, allocating" warnings
followed by tonnes of backtraces of failing memory allocations.
|
|
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
SLW image)
Memory regions in skiboot have an interesting life cycle. First, we get
a bunch from the initial device tree or hdat specifying some existing
reserved ranges (as well as adding some of our own if they're missing)
but we also get ranges for the entirety of RAM.
The idea is that we can do node local allocations for per node resources
(which we do) and then, just prior to booting linux, we copy the reserved
memory regions to expose to linux along with a set of reserver regions
to cover the node local allocations.
The problem was that mem_range_is_reserved() was wanting subtle different
semantics for memory region type than region_is_reserved() provided.
That is, we were overriding the meaning of REGION_SKIBOOT_HEAP to mean both
"this is reserved by skiboot" *and* "this is a memory region that covers
all of memory and will be shrunk to cover just the memory we have allocated
for it just before we boot the payload (linux)".
So what would happen is we would ask "hey, is the memory holding the SLW
image reserved?" and we'd get the answer of "yes" but referring to the memory
region that covers the entirety of memory in a NUMA node, *not* meaning
our intent of "this will be reserved when we start linux".
To fix this, introduce a new memory region type REGION_MEMORY. This has
the semantics of a memory region that covers a block of memory that we can
allocate from (using local_alloc) and that the part that was allocated
will be passed to linux as reserved, but that the entire range will not
be reserved.
So our new semantics are:
- region_is_reservable() is true if the region *MAY* be reserved
(i.e. is the regions that cover the whole of memory OR is explicitly reserved)
- region_is_reserved() is true if the region *WILL* be reserved
(i.e. is explicitly reserved)
This way we check that the SLW image is explicitly reserved and if it isn't,
we reserve it.
Fixes: 58033e44
Acked-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Chaning MPS on PCI upstream bridge might cause error bits set on
downstream endpoints when system boots into Linux as below case
shows:
host# lspci -vvs 0001:06:00.0
0001:06:00.0 Ethernet controller: Broadcom Corporation \
NetXtreme II BCM57810 10 Gigabit Ethernet (rev 10)
:
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
:
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
This clears those error bits in AER and PCIe capability after MPS
is changed. With the patch applied, no more error bits are seen.
Reported-by: John Walthour <jwalthour@us.ibm.com>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Currently, most astbmc platforms do their own call to prd_init(), but
garrison is out-of-sync.
This change moves the prd_init call to astbmc_early_init, so we don't
need to enable it on every platform.
Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Replace all printf's with prlog in core/hmi.c.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
On TOD/TB errors timebase register stops/freezes until HMI error recovery
gets TOD/TB back into running state. However, while HMI recovery is in
progress there are chances where some code path may invoke time_wait*()
calls which depends on running TB value. In an event of TB not moving,
time_wait* calls would keep looping resulting into a hang on that CPU.
On OpenPower systems we are seeing system hang on TOD/TB errors. The hang
is seen inside OPAL HMI handler while invoking prlog/perror(). The reason
is, on OpenPower systems prlog/perror() depends on LPC UART console
driver to flush log messages to the console. UART read/write calls invoke
time_wait_nopoll() inside opb_[read|write]() functions. When TB is in
stopped state this causes a hang in prlog/perror() calls.
This patch fixes this issue by modifying time_wait_[no]poll() to check
for TB validity and return immediately.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Slot names courtesy of Sertac
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Copy/paste bug ... we were displaying the same message for
all error sources.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Recent HostBoot & SBE firmware provide a HW timer facility that can
be used to implement OPAL timers and thus limit the reliance on the
Linux heartbeat.
This implements support for it. The side effect is that i2c from Centaurs
is now usable.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[stewart@linux.vnet.ibm.com: fix run-timer unit test]
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
The caller usually has it and it avoids additional mftb() which
can be expensive.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[stewart@linux.vnet.ibm.com: fix run-timer unit test]
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
The patch fixes an illegal access to the memory which has been
freed.
Fixes Coverity defect # 101858
Signed-off-by: Neelesh Gupta <neelegup@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Nothing critical, no functional changes.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
The locking code is obviously correct and it's never shown up in
a profile - so it's likely fine for a while yet.
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
We unlikely need this as ASM until somebody finds it to be a problem.
So removing the FIXME so that it doesn't show up when grepping for
FIXMEs.
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
This should be a strlen() comparison instead of comparing the
truthiness of an array.
This should be safe as everywhere does populate label, if only with
an empty string.
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Just bailing out this early in boot is perfectly acceptable.
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
If we fail to allocate memory at this point in boot, we should just
assert, there's really no coming back from not being able to reserve
our reserved memory.
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Presently abort() function is not working on BMC based machine. System
hangs after abort/assert call. We have to reboot machine from BMC (IPMI
command or BMC console).
This patch introduces attention functionality for BMC based machine.
It logs eSEL event that contains OPAL version, file info and backtrace.
And calls cec_reboot... which takes care of rebooting host.
Note:
- This patch uses ipmi_queue_msg() instead of ipmi_queue_msg_sync() as
we are having some issues with sync path. This will resolved once we
sort out [1].
- This patch calls cec_reboot to reboot machine after logging eSEL event.
It queues IPMI message and bt_poll() should be working until we pass
reboot IPMI message to BMC. Hence we have while loop with time_wait_ms().
Alternatively we can use xscom_trigger_xstop().. but it will stop
immediately and eSEL logging fails.
[1] https://lists.ozlabs.org/pipermail/skiboot/2015-August/001824.html
Sample eSEL output after assert call:
------------------------------------
[hegdevasant@hegdevasant bin]$ strings fir01bmc.150820.120511.eSel.binary
BB821410
AT8335-GTA000000000000
AT8335-GTA000000000000UD
ATDESC
OPAL version : skiboot-5.1.1-44-geae3999-hegdevasant-dirty-bb31bfd
File info : core/init.c:463:0
CPU 0060 Backtrace:
S: 0000000031d83bc0 R: 000000003006086c .ipmi_terminate+0x110
S: 0000000031d83c60 R: 0000000030017f90 ._abort+0x80
S: 0000000031d83ce0 R: 0000000030017fd8 .assert_fail+0x34
S: 0000000031d83d60 R: 0000000030013dcc .load_and_boot_kernel+0x784
S: 0000000031d83e30 R: 000000003001437c .main_cpu_entry+0x57c
S: 0000000031d83f00 R: 0000000030002544 boot_entry+0x194
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Current abort() call works fine on FSP based system. We need different
mechanism on BMC based machine. Hence introduce platform hook for
terminate call.
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|