|
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
This issue is a corner case related to a recent change that went
upstream. It is only observed at the petitboot prompt, where we see
only one error log in /sys/firmware/opal/elog instead of all of the
error logs.
Below is a snippet of the kernel code where the elog module is
initialised.
{
..
...
rc = request_threaded_irq(irq, NULL, elog_event, =<=======
IRQF_TRIGGER_HIGH | IRQF_ONESHOT, "opal-elog", NULL); |
if (rc) { |
pr_err("%s: Can't request OPAL event irq (%d)\n", |
__func__, rc); |
return rc; |
} |
/* We are now ready to pull error logs from opal. */ |
if (opal_check_token(OPAL_ELOG_RESEND)) |
opal_resend_pending_logs(); =<=======
}
Scenario:
While elog_enabled is true, OPAL sets OPAL_EVENT_ERROR_LOG_AVAIL
whenever it has error logs waiting to be fetched by the kernel.
The race occurs between the two arrowed lines above. As soon as the
kernel registers the error log handler, it sees that
OPAL_EVENT_ERROR_LOG_AVAIL is set, so it schedules the handler. The
handler makes the 'opal_get_elog_size' call (kernel), which moves the
error log state from ELOG_STATE_FETCHED_DATA to ELOG_STATE_FETCHED_INFO
and clears OPAL_EVENT_ERROR_LOG_AVAIL. At the same time, the
'opal_resend_pending_logs' call (kernel) moves the state machine in
OPAL from ELOG_STATE_FETCHED_INFO to ELOG_STATE_NONE. Because of that,
the read call that the kernel makes after 'opal_get_elog_size' ends up
failing, even though the elog kobject has already been created for that
error log.
Further on, the resend routine in OPAL calls opal_commit_elog_in_host(),
which sets OPAL_EVENT_ERROR_LOG_AVAIL. So the kernel calls
'opal_get_elog_size' again and gets the info of the same error log that
was fetched earlier. This also moves the state machine to
ELOG_STATE_FETCHED_INFO and clears OPAL_EVENT_ERROR_LOG_AVAIL.
Below is a snippet from the registered elog_event handler:
{
...
...
/* we may get notified twice, let's handle
* that gracefully and not create two conflicting
* entries.
*/
if (kset_find_obj(elog_kset, name))
return IRQ_HANDLED;
...
...
}
In the kernel, we check whether a kobject already exists for the error
log. The kobject is found, so the handler returns without reading the
error log data.
So, this patch changes the flag that was set to true during
initialisation to false, which resolves the race.
Signed-off-by: Mukesh Ojha <mukesh02@linux.vnet.ibm.com>
Reviewed-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit 218f4ae791c6f66532579d06a0bfe45e56bb3c4e)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Confirmed with Alistair on IRC, and earlier discussions with Russell.
Basically, I was a bit of an idiot and didn't think hard enough before
adding the FWTS annotation.
Without this patch, you get spurious FirmWare Test Suite (FWTS) warnings
about NVLink not working on machines that aren't fully populated with
GPUs.
Fixes: 00e3e275344a42f6a682be72c88c015df87a0e28
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit a339d4779a6a382c0e197177b0142d62e26a6416)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Fixes: 5fc07eaa (Produce XZ compressed skiboot.lid as part of build)
CC: Stewart Smith <stewart@linux.vnet.ibm.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit 742b7226124b140f129d181c07d23dbe36c18ea3)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
With the skiboot log set to debug, the FIR (and related registers) were
logged all in the same message. It was too much for one line, didn't
clarify if the numbers were in hex, and didn't show leading zeroes.
So, split it into two lines, with leading zeroes and a "0x" prefix.
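As a rough illustration of the formatting described above, a register value can be printed with a "0x" prefix and leading zeroes as follows. This is only a sketch: `format_reg` is a hypothetical helper, not a skiboot function, and the real patch uses skiboot's own logging calls.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format one 64-bit register as "<name>: 0x<16 hex
 * digits>". The "0x" prefix makes the base explicit and the %016 width
 * keeps leading zeroes, so every value lines up in the log.
 */
static int format_reg(char *buf, size_t len, const char *name, uint64_t val)
{
	return snprintf(buf, len, "%s: 0x%016" PRIx64, name, val);
}
```

Calling the helper once per register gives one short line per value, rather than one long line holding all of them.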
Signed-off-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit 4eabfa056562e144c1a011bf4159387337023659)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
If we take an unknown exception at boot time we attempt to put the
exception vector in the back trace. The result looks like this (when
we take an 0x700):
S: 0000000031e838a0 R: 000000003001365c .backtrace+0x38
S: 0000000031e83930 R: 00000000300186cc ._abort+0x4c
S: 0000000031e839b0 R: 0000000030023a78 .exception_entry+0x114
S: 0000000031e83a40 R: 0000000000001f04 * +0x1f04
S: 0000000031e83c10 R: 0000000000000700 * +0x700
S: 0000000031e83e30 R: 0000000030014444 .main_cpu_entry+0x444
S: 0000000031e83f00 R: 000000003000259c boot_entry+0x19c
We overwrite the link address in the current stack frame with the
exception vector (i.e. 0x700 in the above example). Unfortunately this
overwrites the location that caused the exception, which is much more
useful information when debugging the problem.
This patch stops writing the link register into the current stack
frame, so the back trace now looks like this:
S: 0000000031da38a0 R: 000000003001365c .backtrace+0x38
S: 0000000031da3930 R: 00000000300186cc ._abort+0x4c
S: 0000000031da39b0 R: 0000000030023a78 .exception_entry+0x114
S: 0000000031da3a40 R: 0000000000001f00 * +0x1f00
S: 0000000031da3c10 R: 00000000300323f8 .psi_init+0x1f4
S: 0000000031da3e30 R: 0000000030014444 .main_cpu_entry+0x444
S: 0000000031da3f00 R: 000000003000259c boot_entry+0x19c
This loses the exception vector from the back trace, but this
information is already available in the exception dump just above it.
Suggested by benh.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit e8c3f4ce21c24eee58489149769e84315d4d647d)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Below are the in-memory console log messages observed at error level (PR_ERROR):
[54460318,3] HBRT: Mem region 'ibm,homer-image' not found !
[54465404,3] HBRT: Mem region 'ibm,homer-image' not found !
[54470372,3] HBRT: Mem region 'ibm,homer-image' not found !
[54475369,3] HBRT: Mem region 'ibm,homer-image' not found !
[11540917382,3] NVRAM: Layout appears sane
[11694529822,3] OPAL: Trying a CPU re-init with flags: 0x2
[61291003267,3] OPAL: Trying a CPU re-init with flags: 0x1
[61394005956,3] OPAL: Trying a CPU re-init with flags: 0x2
Lower the log level of the 'Mem region not found' messages to PR_WARNING and of the remaining messages to PR_INFO:
[54811683,4] HBRT: Mem region 'ibm,homer-image' not found !
[10923382751,6] NVRAM: Layout appears sane
[55533988976,6] OPAL: Trying a CPU re-init with flags: 0x1
Signed-off-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit 341daa8104af3231b908e6fcffeedb5e47b33990)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
Someone was a bit too keen with the cleanups last time. Restore the
ability for pflash to build in non-shared mode.
Fixes: c327eddd9b29 (pflash: Clean up makefiles and resolve build race)
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit fd599965f723330da5ec55519c20cdb6aa2b3a2d)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Someone was a bit too keen with the cleanups last time. Restore the
ability for pflash to build in non-shared mode.
Fixes: c327eddd9b29 (pflash: Clean up makefiles and resolve build race)
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
flash_find_subpartition() accepts a pointer to a boolean variable
indicating ecc for a region of flash and passes the pointer directly
to flash_read_corrected() which actually only wants the value. This
has always worked probably because there has always been ECC on
sub partitions.
How there aren't any warnings triggered by this condition escapes me.
Fixes: 6c26bc7 ("libflash: move ffs_flash_read into libflash")
Signed-off-by: Cyril Bur <cyril.bur@au1.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Fix Radix Tree Size (RTS) encoding as per ISA 3.0. This is controlled via a
SIM_CTRL1 bit in mambo.
In Linux we recently changed to this encoding, so we no longer boot.
The associated Linux commit is:
commit b23d9c5b9c83c05e013aa52460f12a8365062cf4
Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Date: Fri Jun 17 11:40:36 2016 +0530
powerpc/mm/radix: Update Radix tree size as per ISA 3.0
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
I accidentally built myself a cross-toolchain with the musl libc. It does
not support on_exit which we use to clean up in pflash.
Instead use atexit, which is supported by uclibc, musl and glibc.
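A minimal sketch of the portable pattern, assuming the cleanup state is reachable through a global (an atexit handler takes no arguments, unlike an on_exit handler). The names `cleanup`, `cleanup_ran` and `register_cleanup` are hypothetical and are not taken from pflash.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for whatever state the cleanup needs. */
static int cleanup_ran;

/* atexit() handlers have the signature void (*)(void). */
static void cleanup(void)
{
	cleanup_ran = 1;
}

/* Returns 0 on success, following atexit()'s own return convention. */
static int register_cleanup(void)
{
	return atexit(cleanup);
}
```

Because atexit is part of ISO C, this should build against any conforming libc, whereas on_exit is an extension that not every libc provides.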
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Replace the two-letter SRC component macro names with more meaningful
component names to make them more legible.
E.g:
OPAL_XS => OPAL_SRC_COMPONENT_XSCOM
Signed-off-by: Mukesh Ojha <mukesh02@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
For a 1004 slot mapping bit 6 (0x40) of the P0 field represents the
pwr_ctl bit. This code previously accessed the wrong field (power_ctl)
which is a single bit which corresponds to the 1005 mapping (which is
the wrong mapping), performed a bitwise and with 0x40 (which will always
be 0), and then compared to 1 (which will also always be 0).
Fix this to access the byte struct member, bitwise and with 0x40 to mask
the power_ctl bit, and double negate to guarantee 0 or 1 result.
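The difference between the two forms can be sketched as below. This is a simplified model, not the platform code: the function and macro names are hypothetical, and the struct layout is reduced to plain arguments.

```c
#include <assert.h>
#include <stdint.h>

/* Bit 6 of the P0 byte in a 1004 mapping, per the description above. */
#define P0_PWR_CTL_MASK 0x40

/*
 * Shape of the bug: a single-bit field can only hold 0 or 1, so ANDing
 * it with 0x40 always yields 0, and comparing that with 1 is always false.
 */
static int pwr_ctl_buggy(unsigned int one_bit_field)
{
	return (one_bit_field & P0_PWR_CTL_MASK) == 1;
}

/* Shape of the fix: mask the whole byte, then !! collapses it to 0 or 1. */
static int pwr_ctl_fixed(uint8_t p0_byte)
{
	return !!(p0_byte & P0_PWR_CTL_MASK);
}
```

The double negation matters because the masked value is 0x40, not 1, when the bit is set.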
Fixes: Coverity Bug #97820
Fixes: 6884fe63 ("platforms/ibm-fsp: Support PCI slot")
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Acked-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
We used the TCE_MASK value (4095) instead of TCE_PSIZE (4096) to align the memory
source address. In some corner cases (such as source memory size = 4097) we may
end up doing a wrong mapping and corrupting part of SYSDUMP.
This patch uses the ALIGN_UP macro with the TCE_PSIZE value for aligning memory.
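One way this kind of bug can show up is sketched below. It assumes an ALIGN_UP macro of the usual power-of-two form; the macro body and the two wrapper functions are assumptions for illustration, and the original code may have applied the mask differently.

```c
#include <assert.h>
#include <stdint.h>

#define TCE_PSIZE	0x1000ULL		/* 4096, from the text above */
#define TCE_MASK	(TCE_PSIZE - 1)		/* 4095, from the text above */

/* Assumed definition; only valid when 'a' is a power of two. */
#define ALIGN_UP(v, a)	(((v) + (a) - 1) & ~((a) - 1))

/* Wrong: 4095 is not a power of two, so the result is not page aligned. */
static uint64_t align_with_mask(uint64_t v)
{
	return ALIGN_UP(v, TCE_MASK);
}

/* Right: rounds up to the next 4096-byte boundary. */
static uint64_t align_with_psize(uint64_t v)
{
	return ALIGN_UP(v, TCE_PSIZE);
}
```

For the value 4097 mentioned above, aligning with TCE_PSIZE gives 8192, while aligning with TCE_MASK leaves a value that is not a multiple of the page size.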
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
We have core information under /cpus node and processor chip VPD data
(like SN, LN, etc) under /vpd directory. Presently we don't have any
property to relate cores and chip information. This patch adds chip-id
information for processor chip nodes under vpd.
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Be more verbose (at debug level) when formatting the NVRAM; this can
help catch errors at other levels of the stack.
Signed-off-by: Cyril Bur <cyril.bur@au1.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Signed-off-by: Cyril Bur <cyril.bur@au1.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
The pflash build process has regressed since it was last fixed in
6c21c4ffaf82.
This patch resolves that issue and performs some cleanups:
- Remove duplicated rules. Patches had moved rules into common files,
but forgotten to remove them from the pflash makefiles.
- Make assignments simply expanded variables where possible. From the
make manual:
Functions referenced in the definition will be executed every time
the variable is expanded. This makes make run slower; worse, it
causes the wildcard and shell functions to give unpredictable
results because you cannot easily control when they are called, or
even how many times.
To avoid all the problems and inconveniences of recursively
expanded variables, there is another flavor: simply expanded
variables.
- set the 'shared' target as a dependency of the libflash objects. This
was the final piece to resolve the race condition.
The failed build could be reproduced by doing a `git clean -f -x` and
then running the following:
$ make -j 32 CROSS_COMPILE=arm-linux-gnueabi- SKIBOOT_VERSION=5.2.4
PFLASH_VERSION=5.2.4 V=1 -C external/pflash all LINKAGE=dynamic
make: Entering directory '/home/joel/dev/skiboot/external/pflash'
ln -sf ../../libflash ./libflash
ln -sf ../../ccan ./ccan
ln -sf ../common ./common
cc -O2 -Wall -I. -c pflash.c -o pflash.o
cc -O2 -Wall -I. -c progress.c -o progress.o
make -C ../shared
make[1]: Entering directory '/home/joel/dev/skiboot/external/shared'
ln -sf ../../hw/ast-bmc/ast-sf-ctrl.c common/ast-sf-ctrl.c
ln -sf ../../include/ast.h common/ast.h
ln -sf arch_flash_arm_io.h common/io.h
cc -O2 -Wall -I. -c common/arch_flash_common.c -o
common-arch_flash_common.o
cc -O2 -Wall -I. -c common/arch_flash_arm.c -o common-arch_flash_arm.o
cc -O2 -Wall -I. -c common/ast-sf-ctrl.c -o common-ast-sf-ctrl.o
cc -O2 -Wall -I. -c version.c -o version.o
ld -r common-arch_flash_common.o common-arch_flash_arm.o
common-ast-sf-ctrl.o -o common-arch_flash.o
ln -sf ../../libflash ./libflash
ln -sf ../../ccan ./ccan
ln -sf ../common ./common
make[1]: *** No rule to make target 'libflash/file.c', needed by
'libflash-file.o'. Stop.
make[1]: *** Waiting for unfinished jobs....
make[1]: Leaving directory '/home/joel/dev/skiboot/external/shared'
rules.mk:25: recipe for target
'../shared/libflash.so.skiboot-5.2.4-1-g9f13f64c322f-joel-dirty-d5873ce'
failed
make: ***
[../shared/libflash.so.skiboot-5.2.4-1-g9f13f64c322f-joel-dirty-d5873ce]
Error 2
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit c327eddd9b291a0e6e54001fa3b1e547bad3fca2)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
The pflash build process has regressed since it was last fixed in
6c21c4ffaf82.
This patch resolves that issue and performs some cleanups:
- Remove duplicated rules. Patches had moved rules into common files,
but forgotten to remove them from the pflash makefiles.
- Make assignments simply expanded variables where possible. From the
make manual:
Functions referenced in the definition will be executed every time
the variable is expanded. This makes make run slower; worse, it
causes the wildcard and shell functions to give unpredictable
results because you cannot easily control when they are called, or
even how many times.
To avoid all the problems and inconveniences of recursively
expanded variables, there is another flavor: simply expanded
variables.
- set the 'shared' target as a dependency of the libflash objects. This
was the final piece to resolve the race condition.
The failed build could be reproduced by doing a `git clean -f -x` and
then running the following:
$ make -j 32 CROSS_COMPILE=arm-linux-gnueabi- SKIBOOT_VERSION=5.2.4
PFLASH_VERSION=5.2.4 V=1 -C external/pflash all LINKAGE=dynamic
make: Entering directory '/home/joel/dev/skiboot/external/pflash'
ln -sf ../../libflash ./libflash
ln -sf ../../ccan ./ccan
ln -sf ../common ./common
cc -O2 -Wall -I. -c pflash.c -o pflash.o
cc -O2 -Wall -I. -c progress.c -o progress.o
make -C ../shared
make[1]: Entering directory '/home/joel/dev/skiboot/external/shared'
ln -sf ../../hw/ast-bmc/ast-sf-ctrl.c common/ast-sf-ctrl.c
ln -sf ../../include/ast.h common/ast.h
ln -sf arch_flash_arm_io.h common/io.h
cc -O2 -Wall -I. -c common/arch_flash_common.c -o
common-arch_flash_common.o
cc -O2 -Wall -I. -c common/arch_flash_arm.c -o common-arch_flash_arm.o
cc -O2 -Wall -I. -c common/ast-sf-ctrl.c -o common-ast-sf-ctrl.o
cc -O2 -Wall -I. -c version.c -o version.o
ld -r common-arch_flash_common.o common-arch_flash_arm.o
common-ast-sf-ctrl.o -o common-arch_flash.o
ln -sf ../../libflash ./libflash
ln -sf ../../ccan ./ccan
ln -sf ../common ./common
make[1]: *** No rule to make target 'libflash/file.c', needed by
'libflash-file.o'. Stop.
make[1]: *** Waiting for unfinished jobs....
make[1]: Leaving directory '/home/joel/dev/skiboot/external/shared'
rules.mk:25: recipe for target
'../shared/libflash.so.skiboot-5.2.4-1-g9f13f64c322f-joel-dirty-d5873ce'
failed
make: ***
[../shared/libflash.so.skiboot-5.2.4-1-g9f13f64c322f-joel-dirty-d5873ce]
Error 2
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
It was left uninitialized which could cause issues later on
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Save everyone a trip to asciitable.com. I realise everyone has probably
memorised \n and friends, but THAT'S NOT THE POINT.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
When enabling CAPI in DMA mode, set the AIB TX command credits for channel
2 (DMA read) to 28, rather than 1. This significantly improves DMA read
performance in CAPI DMA mode.
Fixes: 5477148a439f ("phb3: Add support for CAPP DMA mode")
Reported-by: John Walthour <jwalthour@us.ibm.com>
Reported-by: Ricardo Mata <ricmata@us.ibm.com>
Reported-by: Michael Perez <perezma@us.ibm.com>
Cc: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
The hello world kernel fails to correctly set r3 before making the
shutdown opal call. On FSP machines only shutdown types 0 and 1 are
recognised as valid shutdown types. If any other type is specified
(in r3) the call is rejected with an OPAL_PARAMETER error and the
machine will continue running.
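The validation described above can be modelled roughly as follows. This is a sketch only: `check_shutdown_type` is a hypothetical name, `type` stands for the value passed in r3, and the numeric return codes are assumed for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define OPAL_SUCCESS	0
#define OPAL_PARAMETER	(-1)	/* numeric value assumed for illustration */

/*
 * Hypothetical model of the check on FSP machines: only shutdown
 * types 0 and 1 are accepted, anything else is rejected.
 */
static int64_t check_shutdown_type(uint64_t type)
{
	if (type > 1)
		return OPAL_PARAMETER;
	return OPAL_SUCCESS;
}
```

A caller that leaves r3 unset passes whatever value happens to be in the register, which is why the call may be rejected and the machine keeps running.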
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
When the indent option is used with json dump, extra whitespace
precedes the newline in the JSON separator formatting.
Remove the extra whitespace when the indent option is used,
to allow clean patch application.
Signed-off-by: Deb McLemore <debmc@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
The console is very slow when using Skiboot with Mambo.
This adds a heartbeat timer as a platform quirk so that the console is
refreshed more quickly. This results in Skiboot doing the right thing
without requiring custom settings in skiboot.tcl files.
Signed-off-by: Chris Smart <chris@distroguy.com>
Acked-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
The timer code currently has a default and a special check for FSP
machines or those with the SLW timer facility.
This patch adds support for a platform quirk to set the timer.
Signed-off-by: Chris Smart <chris@distroguy.com>
Acked-by: Michael Neuling <mikey@neuling.org>
[stewart@linux.vnet.ibm.com: fix whitespace issue]
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
HEARTBEAT_DEFAULT_MS sets the default heartbeat timeout, however this
was not actually used as the default. The default was ten times quicker
than this (HEARTBEAT_DEFAULT_MS / 10) while HEARTBEAT_DEFAULT_MS was
actually used as a special case for FSP machines or those with SLW
timer facility.
This patch makes the default use HEARTBEAT_DEFAULT_MS and makes FSP
machines, or those with the SLW timer facility, run 10 times slower
(HEARTBEAT_DEFAULT_MS * 10). This will also now match the existing
in-line comment.
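The selection logic after this change can be sketched as below. The function name, its boolean parameter and the value given to HEARTBEAT_DEFAULT_MS are assumptions for illustration; only the relationship between the two cases comes from the description above.

```c
#include <assert.h>
#include <stdbool.h>

#define HEARTBEAT_DEFAULT_MS	200	/* illustrative value, assumed */

/*
 * Hypothetical model of the heartbeat choice after the patch: the
 * default really is HEARTBEAT_DEFAULT_MS, and machines with an FSP or
 * the SLW timer facility poll ten times less often.
 */
static int heartbeat_ms(bool has_fsp_or_slw_timer)
{
	if (has_fsp_or_slw_timer)
		return HEARTBEAT_DEFAULT_MS * 10;
	return HEARTBEAT_DEFAULT_MS;
}
```

Before the patch the two branches were effectively HEARTBEAT_DEFAULT_MS and HEARTBEAT_DEFAULT_MS / 10, so the constant's name did not match the value used by default.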
Signed-off-by: Chris Smart <chris@distroguy.com>
Acked-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
Fix resend logic in opal_resend_pending_logs, so that it actually
restarts sending remaining logs.
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit a6d4a7884e95cb9c918b8a217c11e46b01218358)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
In some corner cases the host may send an acknowledgement without
reading the actual data (fsp_opal_elog_info -> fsp_opal_elog_ack).
Because of this, elog_read_from_fsp_head_state may be stuck in the
wrong state (ELOG_STATE_HOST_INFO) and unable to send the remaining
ELOGs to the host. Hence reset the ELOG state and start sending the
remaining ELOGs.
Also, in the normal case we ACK logs that are already processed
(elog_read_processed). Hence rearrange the code so that we go
through elog_read_processed first.
Finally, return OPAL_PARAMETER if we are not able to find the ELOG ID.
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
[stewart@linux.vnet.ibm.com: spelling fix]
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit e7c8cba4ad773055f390632c2996d3242b633bf4)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
The chance of elog_read_pending being in an inconsistent state is very
low. Just to be on the safe side, disable notification if the list
is not in a consistent state.
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Reviewed-by: Mukesh Ojha <mukesh02@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit 1fb10de164d3ca034193df81c1f5d007aec37781)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
ELOG enables event notification once a new log is available. This
is disabled after the host completes reading logs (it has to complete
both fsp_opal_elog_info and fsp_opal_elog_read).
Ideally we should disable notification as soon as the host consumes the event
(after fsp_opal_elog_info). Also, if the host fails to call fsp_opal_elog_read
(e.g. in situations like a duplicate event), we end up keeping notification
enabled forever.
This patch introduces a new ELOG state (ELOG_STATE_HOST_INFO). As soon
as the host consumes the event, elog moves to this new state so that event
notification is disabled.
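The effect of the extra state can be modelled with a small state machine. This is a deliberately simplified sketch: only ELOG_STATE_NONE and ELOG_STATE_HOST_INFO are names taken from these commit messages, while ELOG_STATE_LOG_READY, `notify_host` and the three functions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum elog_state {
	ELOG_STATE_NONE,
	ELOG_STATE_LOG_READY,	/* hypothetical: a log is waiting for the host */
	ELOG_STATE_HOST_INFO,	/* host has consumed the event */
};

static enum elog_state state = ELOG_STATE_NONE;
static bool notify_host;	/* stands in for the event notification */

/* A new log becomes available: enable notification. */
static void log_arrived(void)
{
	state = ELOG_STATE_LOG_READY;
	notify_host = true;
}

/* Models the host calling fsp_opal_elog_info(): disable notification now. */
static void host_got_info(void)
{
	if (state != ELOG_STATE_LOG_READY)
		return;
	state = ELOG_STATE_HOST_INFO;
	notify_host = false;
}

/* Models the host calling fsp_opal_elog_read(): the log is done. */
static void host_read_log(void)
{
	if (state == ELOG_STATE_HOST_INFO)
		state = ELOG_STATE_NONE;
}
```

In this model notification is already off once the host has the log info, so a host that never performs the read no longer leaves the event raised.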
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit cec5750a4a86ff3f69e1d8817eda023f4d40c492)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
We use the elog notifier to notify logs from multiple sources (FSP generated
logs - fsp-elog-read.c and OPAL generated logs - fsp-elog-write.c).
The OPAL generated logs code sets the elog event bit whenever it has new logs
to send to the host. But it relies on fsp-elog-read.c to disable the event bit,
which is wrong!
This patch creates a common function to enable/disable event notification.
It enables event notification if any of the sources is ready to send an
error log to the host, and disables notification once all errors have been
sent to the host.
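A common enable/disable function of this kind could look roughly like the sketch below, which tracks pending sources in a bitmask. All names here (`elog_source`, `pending_sources`, `event_bit`, `elog_set_pending`) are hypothetical and are not the identifiers used in skiboot.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical source identifiers, one bit per log producer. */
enum elog_source {
	ELOG_SRC_FSP  = 1 << 0,	/* FSP generated logs */
	ELOG_SRC_OPAL = 1 << 1,	/* OPAL generated logs */
};

static unsigned int pending_sources;
static bool event_bit;		/* stands in for the elog event bit */

/*
 * Single place that owns the event bit: it is set while at least one
 * source has logs pending and cleared only when none do.
 */
static void elog_set_pending(enum elog_source src, bool pending)
{
	if (pending)
		pending_sources |= (unsigned int)src;
	else
		pending_sources &= ~(unsigned int)src;

	event_bit = (pending_sources != 0);
}
```

With one owner for the bit, neither source can clear the notification while the other still has logs to deliver.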
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit ec366ad4e2e871096fa4c614ad7e89f5bb6f884f)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
ELOG enables event notification once a new log is available. This
is disabled after the host completes reading logs (it has to complete
both fsp_opal_elog_info and fsp_opal_elog_read).
In some corner cases like kexec, the host may end up reading the same ELOG id twice
(calling fsp_opal_elog_info twice because of a resend request). The host detects the
duplicate and does not read the actual log (fsp_opal_elog_read()). In such
situations we fail to disable event notification :-(
Scenario :
OPAL Host
-------------------------------------
OPAL_EVENT_ELOG_AVAIL --> kexec
OPAL_EVENT_ELOG_AVAIL --> elog client registered
<-- read ELOG (id=x)
<-- resend elog (opal_resend_pending_logs())
resend all ELOG --> read ELOG (id=x) -- Duplicate ELOG !
bhoom!!
kernel call trace:
------------------
[ 28.055923] CPU: 10 PID: 20 Comm: irq/29-opal-elo Not tainted 4.4.0-24-generic #43-Ubuntu
[ 28.056012] task: c0000000ef982a20 ti: c0000000efa38000 task.ti: c0000000efa38000
[ 28.056100] NIP: c000000008010a24 LR: c000000008010a24 CTR: 0000000030033758
[ 28.056188] REGS: c0000000efa3b9c0 TRAP: 0901 Not tainted (4.4.0-24-generic)
[ 28.056274] MSR: 9000000100009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 22000844 XER: 20000000
[ 28.056499] CFAR: c000000008009958 SOFTE: 1
GPR00: c000000008131e8c c0000000efa3bc40 c0000000095b4200 0000000000000900
GPR04: c0000000094a63c8 0000000000000001 9000000100009033 0000000000000062
GPR08: 0000000000000000 0000000000000000 c0000000ef960400 9000000100001003
GPR12: c00000000806de48 c00000000fb45f00
[ 28.057042] NIP [c000000008010a24] arch_local_irq_restore+0x74/0x90
[ 28.057117] LR [c000000008010a24] arch_local_irq_restore+0x74/0x90
[ 28.057189] Call Trace:
[ 28.057221] [c0000000efa3bc40] [c0000000f108a980] 0xc0000000f108a980 (unreliable)
[ 28.057326] [c0000000efa3bc60] [c000000008131e8c] irq_finalize_oneshot.part.2+0xbc/0x250
[ 28.057429] [c0000000efa3bcb0] [c000000008132170] irq_thread_fn+0x80/0xa0
[ 28.057519] [c0000000efa3bcf0] [c00000000813263c] irq_thread+0x1ac/0x280
[ 28.057609] [c0000000efa3bd80] [c0000000080e61e0] kthread+0x110/0x130
[ 28.057698] [c0000000efa3be30] [c000000008009538] ret_from_kernel_thread+0x5c/0xa4
[ 28.057799] Instruction dump:
[ 28.057844] 994d02ca 2fa30000 409e0024 e92d0020 61298000 7d210164 38210020 e8010010
[ 28.057995] 7c0803a6 4e800020 60420000 4bff17ad <60000000> 4bffffe4 60420000 e92d0020
This patch adds a kexec notifier client. It disables event notification
during kexec. Once the host is ready to receive ELOGs again it calls
fsp_opal_resend_pending_logs(), which re-enables ELOG notification.
This fixes the above issue. I will add a follow-up patch to improve event state.
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit d2ae07fd97bb9408456279cec799f72cb78680a6)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
Fix resend logic in opal_resend_pending_logs, so that it actually
restarts sending remaining logs.
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
In some corner cases the host may send an acknowledgement without
reading the actual data (fsp_opal_elog_info -> fsp_opal_elog_ack).
Because of this, elog_read_from_fsp_head_state may be stuck in the
wrong state (ELOG_STATE_HOST_INFO) and unable to send the remaining
ELOGs to the host. Hence reset the ELOG state and start sending the
remaining ELOGs.
Also, in the normal case we ACK logs that are already processed
(elog_read_processed). Hence rearrange the code so that we go
through elog_read_processed first.
Finally, return OPAL_PARAMETER if we are not able to find the ELOG ID.
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
[stewart@linux.vnet.ibm.com: spelling fix]
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
The chance of elog_read_pending being in an inconsistent state is very
low. Just to be on the safe side, disable notification if the list
is not in a consistent state.
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Reviewed-by: Mukesh Ojha <mukesh02@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
ELOG enables event notification once a new log is available. This
is disabled after the host completes reading logs (it has to complete
both fsp_opal_elog_info and fsp_opal_elog_read).
Ideally we should disable notification as soon as the host consumes the event
(after fsp_opal_elog_info). Also, if the host fails to call fsp_opal_elog_read
(e.g. in situations like a duplicate event), we end up keeping notification
enabled forever.
This patch introduces a new ELOG state (ELOG_STATE_HOST_INFO). As soon
as the host consumes the event, elog moves to this new state so that event
notification is disabled.
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
We use the elog notifier to notify logs from multiple sources (FSP generated
logs - fsp-elog-read.c and OPAL generated logs - fsp-elog-write.c).
The OPAL generated logs code sets the elog event bit whenever it has new logs
to send to the host. But it relies on fsp-elog-read.c to disable the event bit,
which is wrong!
This patch creates a common function to enable/disable event notification.
It enables event notification if any of the sources is ready to send an
error log to the host, and disables notification once all errors have been
sent to the host.
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
ELOG enables event notification once a new log is available. This
is disabled after the host completes reading logs (it has to complete
both fsp_opal_elog_info and fsp_opal_elog_read).
In some corner cases like kexec, the host may end up reading the same ELOG id twice
(calling fsp_opal_elog_info twice because of a resend request). The host detects the
duplicate and does not read the actual log (fsp_opal_elog_read()). In such
situations we fail to disable event notification :-(
Scenario :
OPAL Host
-------------------------------------
OPAL_EVENT_ELOG_AVAIL --> kexec
OPAL_EVENT_ELOG_AVAIL --> elog client registered
<-- read ELOG (id=x)
<-- resend elog (opal_resend_pending_logs())
resend all ELOG --> read ELOG (id=x) -- Duplicate ELOG !
bhoom!!
kernel call trace:
------------------
[ 28.055923] CPU: 10 PID: 20 Comm: irq/29-opal-elo Not tainted 4.4.0-24-generic #43-Ubuntu
[ 28.056012] task: c0000000ef982a20 ti: c0000000efa38000 task.ti: c0000000efa38000
[ 28.056100] NIP: c000000008010a24 LR: c000000008010a24 CTR: 0000000030033758
[ 28.056188] REGS: c0000000efa3b9c0 TRAP: 0901 Not tainted (4.4.0-24-generic)
[ 28.056274] MSR: 9000000100009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 22000844 XER: 20000000
[ 28.056499] CFAR: c000000008009958 SOFTE: 1
GPR00: c000000008131e8c c0000000efa3bc40 c0000000095b4200 0000000000000900
GPR04: c0000000094a63c8 0000000000000001 9000000100009033 0000000000000062
GPR08: 0000000000000000 0000000000000000 c0000000ef960400 9000000100001003
GPR12: c00000000806de48 c00000000fb45f00
[ 28.057042] NIP [c000000008010a24] arch_local_irq_restore+0x74/0x90
[ 28.057117] LR [c000000008010a24] arch_local_irq_restore+0x74/0x90
[ 28.057189] Call Trace:
[ 28.057221] [c0000000efa3bc40] [c0000000f108a980] 0xc0000000f108a980 (unreliable)
[ 28.057326] [c0000000efa3bc60] [c000000008131e8c] irq_finalize_oneshot.part.2+0xbc/0x250
[ 28.057429] [c0000000efa3bcb0] [c000000008132170] irq_thread_fn+0x80/0xa0
[ 28.057519] [c0000000efa3bcf0] [c00000000813263c] irq_thread+0x1ac/0x280
[ 28.057609] [c0000000efa3bd80] [c0000000080e61e0] kthread+0x110/0x130
[ 28.057698] [c0000000efa3be30] [c000000008009538] ret_from_kernel_thread+0x5c/0xa4
[ 28.057799] Instruction dump:
[ 28.057844] 994d02ca 2fa30000 409e0024 e92d0020 61298000 7d210164 38210020 e8010010
[ 28.057995] 7c0803a6 4e800020 60420000 4bff17ad <60000000> 4bffffe4 60420000 e92d0020
This patch adds a kexec notifier client. It disables event notification
during kexec. Once the host is ready to receive ELOGs again it calls
fsp_opal_resend_pending_logs(), which re-enables ELOG notification.
This fixes the above issue. I will add a follow-up patch to improve event state.
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|