Age | Commit message (Collapse) | Author | Files | Lines |
|
Commit fd6b71fc fixed the situation where ipmi console was open (hvc0) but got
data on different console (hvc1).
During FSP R/R OPAL closes all consoles. After R/R complete FSP requests to
open hvc1 and sends data on this. If hvc1 registration failed or not opened in
host kernel then it will not read data and results in RCU stalls.
Note that this is workaround for older kernel where we don't have separate irq
for each console. Latest kernel works fine without this patch.
CC: stable
CC: Sam Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit c9cc5ef5772ebfbfff978f1c25763d733e45752c)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Commit c8a7535f (FSP/CONSOLE: Workaround for unresponsive ipmi daemon) added
error logging when buffer is full. In some corner cases kernel may call this
function multiple time and we may endup logging error again and again.
This patch fixes it by generating error log only once. I think this is enough
to indicate something went wrong.
Also with previous patch, once console buffer is full, OPAL is returning error
to payload from fsp_console_write_buffer_space(). So payload will never call
fsp_console_write(). Hence move error logging logic to right place.
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit d798c276b4da7559970702c2f31b12549c92741e)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Kernel calls fsp_console_write_buffer_space() to check console buffer space
availability. If there is enough buffer space to write data, then kernel will
call fsp_console_write() to write actual data.
In some extreme corner cases (like one explained in commit c8a7535f)
console becomes full and this function returns 0 to kernel (or space available
in console buffer < next incoming data size). Kernel will continue retrying
until it gets enough space. So we will start seeing RCU stalls.
This patch keeps track of previous available space. If previous space is same
as current means not enough space in console buffer to write incoming data.
It may be due to very high console write operation and slow response from FSP
-OR- FSP has stopped processing data (ex: because of ipmi daemon died). At this
point we will start timer with timeout of SER_BUFFER_OUT_TIMEOUT (10 secs).
If situation is not improved within 10 seconds means something went bad. Lets
return OPAL_RESOURCE so that kernel can drop console write and continue.
CC: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
CC: Stewart Smith <stewart@linux.vnet.ibm.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
[stewart: reset timeout in fsp_console_write() path]
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit 6557a728385c3fbcef297d2c61c7d93bc539f8bb)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Presently we are not closing SOL and FW console sessions during R/R. Host will
continue to write to SOL buffer during FSP R/R. If there is heavy console write
operation happening during FSP R/R (like running `top` command inside console),
then at some point console buffer becomes full. fsp_console_write_buffer_space()
returns 0 (or less than required space to write data) to host. While one thread
is busy writing to console, if some other threads tries to write data to console
we may see RCU stalls (like below) in kernel.
kernel call trace:
------------------
[ 2082.828363] INFO: rcu_sched detected stalls on CPUs/tasks: { 32} (detected by 16, t=6002 jiffies, g=23154, c=23153, q=254769)
[ 2082.828365] Task dump for CPU 32:
[ 2082.828368] kworker/32:3 R running task 0 4637 2 0x00000884
[ 2082.828375] Workqueue: events dump_work_fn
[ 2082.828376] Call Trace:
[ 2082.828382] [c000000f1633fa00] [c00000000013b6b0] console_unlock+0x570/0x600 (unreliable)
[ 2082.828384] [c000000f1633fae0] [c00000000013ba34] vprintk_emit+0x2f4/0x5c0
[ 2082.828389] [c000000f1633fb60] [c00000000099e644] printk+0x84/0x98
[ 2082.828391] [c000000f1633fb90] [c0000000000851a8] dump_work_fn+0x238/0x250
[ 2082.828394] [c000000f1633fc60] [c0000000000ecb98] process_one_work+0x198/0x4b0
[ 2082.828396] [c000000f1633fcf0] [c0000000000ed3dc] worker_thread+0x18c/0x5a0
[ 2082.828399] [c000000f1633fd80] [c0000000000f4650] kthread+0x110/0x130
[ 2082.828403] [c000000f1633fe30] [c000000000009674] ret_from_kernel_thread+0x5c/0x68
Hence lets close SOL (and FW console) during FSP R/R.
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit 9d1755179112071652bb4a317f9006da630ce25d)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Presently OPAL sends associate/unassociate MBOX command for all
FSP serial console (like below OPAL message). We have to check
console is available or not before sending this message.
OPAL log:
-------
[ 5013.227994012,7] FSP: Reassociating HVSI console 1
[ 5013.227997540,7] FSP: Reassociating HVSI console 2
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit 175e406ac29435017c5bd08b2c45d93b2c5a7669)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
We use TCE mapped area to write data to console. Console header
(fsp_serbuf_hdr) is modified by both FSP and OPAL (OPAL updates
next_in pointer in fsp_serbuf_hdr and FSP updates next_out pointer).
Kernel makes opal_console_write() OPAL call to write data to console.
OPAL write data to TCE mapped area and sends MBOX command to FSP.
If our console becomes full and we have data to write to console,
we keep on waiting until FSP reads data.
In some corner cases, where FSP is active but not responding to
console MBOX message (due to buggy IPMI) and we have heavy console
write happening from kernel, then eventually our console buffer
becomes full. At this point OPAL starts sending OPAL_BUSY_EVENT to
kernel. Kernel will keep on retrying. This is creating kernel soft
lockups. In some extreme case when every CPU is trying to write to
console, user will not be able to ssh and thinks system is hang.
If we reset FSP or restart IPMI daemon on FSP, system recovers and
everything becomes normal.
This patch adds workaround to above issue by returning OPAL_HARDWARE
when cosole is full. Side effect of this patch is, we may endup dropping
latest console data. But better to drop console data than system hang.
Alternative approach is to drop old data from console buffer, make space
for new data. But in normal condition only FSP can update 'next_out'
pointer and if we touch that pointer, it may introduce some other
race conditions. Hence we decided to just new console write request.
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Acked-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit c8a7535f3539c79955645e6b3714b367a994b1e9)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
This is an experimental patch that implements "Fast reboot" on P8
machines.
The basic idea is that when the OS calls OPAL reboot, we gather all
the threads in the system using a combination of patching the reset
vector and soft-resetting them, then cleanup a few bits of hardware
(we do re-probe PCIe for example), and reload & restart the bootloader.
For Trusted Boot, this means we *add* measurements to the TPM, so you
will get *different* PCR values as compared to a full IPL. This makes
sense as if you want to be sure you are running something known then,
well, do a full IPL as soft reset should never be trusted to clear any
malicious code.
This is very experimental and needs a lot of testing and also auditing
code for other bits of HW that might need to be cleaned up.
BenH TODO: I also need to check if we are properly PERST'ing PCI devices.
This is partially based on old code I had to do that on P7. I only
support it on P8 though as there are issues with the PSI interrupts
on P7 that cannot be reliably solved.
Even though this should be considered somewhat experimental, we've had
a lot of success on a variety of machines. Dozens/hundreds of reboots
across Tuleta, Garrison and Habanero.
Currently, we've hidden it behind a NVRAM config option, which *is*
liable to change in the future (to ensure that only those who know
what they're doing enable it)
You can enable the experimental support via nvram option:
nvram -p ibm,skiboot --update-config experimental-fast-reset=feeling-lucky
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[stewart@linux.vnet.ibm.com: hide behind nvram option, include Mambo fixes
from Mikey]
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Allocate an irq number for each hvc console and set its interrupt-parent
property so that Linux can use the opal irqchip instead of the
OPAL_EVENT_CONSOLE_INPUT interface.
Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Linux kernels from v4.1 onwards will try to request an irq for each hvc
console using OPAL_EVENT_CONSOLE_INPUT, however because the IRQF_SHARED
flag is not set any console after the first will fail. If there is data
on one of these failed consoles OPAL will set OPAL_EVENT_CONSOLE_INPUT
every time fsp_console_read is called, leading to RCU stalls in the
kernel.
As a workaround for unpatched kernels, cease setting
OPAL_EVENT_CONSOLE_INPUT for consoles that we have noticed are not being
read.
Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
NULL terminate and truncate size of copy into char buffer to the right size.
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Now that opal.h includes opal-api.h, there are a bunch of files that
include both but don't need to.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
This is probably not the best collection of things in the world,
but it means that opal.h is much closer to being directly usable
by an OS.
This triggers a bunch of #include fixes throughout the tree.
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Fix Wunused-result
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
For the most part, PR_DEBUG.
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
The patch provides the in-band support for reading the 'console-select'
system parameter. It also adds the console support to honour the system
param for switching the console type in P8 systems.
Tested-by: Neelesh Gupta <neelegup@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Jeremy Kerr <jeremy.kerr@au.ibm.com>
|
|
When setting the flag in a lock that indicates that it's on the console
path, we need to take and release that lock to ensure that any other
processor that might have taken it before the flag was set has released
it, otherwise the lock might still be held without the console count
properly incremented, which can cause it to go negative or cause the
deadlock that we mean to avoid by that to still occur.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
We modify write() (adding console_write()) which calls down to
a modified __flush_console() which can now decide if it's flushing
the added console contents to the console drivers or not.
A future patch may add support for changing PR_NOTICE to some other level
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
It's not a great idea to try to call fsp_sync_msg from an FSP callback
since the pollers won't run if we already have the poll lock (we might
do something about that at some point but not just yet)
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
We can be called from a poller, calling the poller again has no
effect and we might just end up dead locking. There is no reason
to be synchronous anyway.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Otherwise we don't handle surveillance and PSI link monitoring
This should fix cases of surveillance timeouts during things
like code update such as BZ109939
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|