author    | Stewart Smith <stewart@linux.vnet.ibm.com> | 2017-10-02 12:08:25 +1100
committer | Stewart Smith <stewart@linux.vnet.ibm.com> | 2017-10-11 16:45:34 +1100
commit    | 696d378d7b7295366e115e89a785640bf72a5043
tree      | 83cfb7042a91d5d7489684f1b23ba44294663537 /doc
parent    | e363cd66debb6a83e64bdd3bbdbf0eff501443a8
fsp: return OPAL_BUSY_EVENT on failure sending FSP_CMD_POWERDOWN_NORM
We had a race condition between FSP Reset/Reload and powering down
the system from the host:
Roughly:
FSP                            Host
---                            ----
Power on
                               Power on
(inject EPOW)
(trigger FSP R/R)
                               Processes EPOW event, starts shutting down
                               calls OPAL_CEC_POWER_DOWN
(is still in R/R)
                               gets OPAL_INTERNAL_ERROR, spins in opal_poll_events
(FSP comes back)
                               spinning in opal_poll_events
(thinks host is running)
The call to OPAL_CEC_POWER_DOWN is only made once, as the reset/reload
error path for fsp_sync_msg() is to return -1, which means we give
the OS OPAL_INTERNAL_ERROR. That is fine, except that our own API
docs allow returning OPAL_BUSY when trying again later may be
successful, and we're ambiguous as to whether the OS should retry
on OPAL_INTERNAL_ERROR.
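For context, the pre-fix error path looked roughly like the sketch below. The function and helper names (ibm_fsp_cec_power_down, fsp_mkmsg, fsp_sync_msg, FSP_CMD_POWERDOWN_NORM) follow the skiboot FSP platform code, but this is a simplified illustration under those assumptions, not the verbatim source (which sits outside the doc-only diff shown here):

    /*
     * Simplified sketch (assumed names, not verbatim skiboot source) of the
     * pre-fix behaviour: a failure from fsp_sync_msg() -- which also occurs
     * while the FSP is mid Reset/Reload -- was surfaced as OPAL_INTERNAL_ERROR.
     */
    static int64_t ibm_fsp_cec_power_down(uint64_t request)
    {
        /* Ask the FSP for a normal power down */
        if (fsp_sync_msg(fsp_mkmsg(FSP_CMD_POWERDOWN_NORM, 1, request), true))
            return OPAL_INTERNAL_ERROR;

        return OPAL_SUCCESS;
    }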
For reference, the Linux code looks like this:
>static void __noreturn pnv_power_off(void)
>{
> long rc = OPAL_BUSY;
>
> pnv_prepare_going_down();
>
> while (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT) {
> rc = opal_cec_power_down(0);
> if (rc == OPAL_BUSY_EVENT)
> opal_poll_events(NULL);
> else
> mdelay(10);
> }
> for (;;)
> opal_poll_events(NULL);
>}
Which means that *practically* our only option is to return OPAL_BUSY
or OPAL_BUSY_EVENT.
We choose OPAL_BUSY_EVENT for FSP systems as we do want to ensure we're
running pollers to communicate with the FSP and do the final bits of
Reset/Reload handling before we power off the system.
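The platform-side change itself is not in the doc/ subtree shown by this diffstat; under the same assumed names as the sketch above, it amounts to reporting OPAL_BUSY_EVENT when the message cannot be sent, so the OS keeps running the pollers and the FSP Reset/Reload can finish:

    /*
     * Post-fix sketch (assumed names, not verbatim source): if the power-down
     * message cannot be sent (e.g. FSP mid Reset/Reload), ask the OS to poll
     * and retry instead of reporting an unrecoverable error.
     */
    static int64_t ibm_fsp_cec_power_down(uint64_t request)
    {
        if (fsp_sync_msg(fsp_mkmsg(FSP_CMD_POWERDOWN_NORM, 1, request), true))
            return OPAL_BUSY_EVENT;

        return OPAL_SUCCESS;
    }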
Additionally, we really should update our documentation to point out all
of these return codes and what action an OS should take for each.
CC: stable
Reported-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Reviewed-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Diffstat (limited to 'doc')
-rw-r--r-- | doc/opal-api/opal-cec-power-down-5.rst | 18
-rw-r--r-- | doc/opal-api/return-codes.rst          |  6
2 files changed, 20 insertions, 4 deletions
diff --git a/doc/opal-api/opal-cec-power-down-5.rst b/doc/opal-api/opal-cec-power-down-5.rst
index 6daea3d..bdcb84e 100644
--- a/doc/opal-api/opal-cec-power-down-5.rst
+++ b/doc/opal-api/opal-cec-power-down-5.rst
@@ -24,16 +24,28 @@ Return Values
 -------------
 
 ``OPAL_SUCCESS``
-  the power down was updated successful
+  the power down request was successful.
+  This may/may not result in immediate power down. An OS should
+  spin in a loop after getting `OPAL_SUCCESS` as it is likely that there
+  will be a delay before instructions stop being executed.
 
 ``OPAL_BUSY``
-  unable to power down, try again later
+  unable to power down, try again later.
+
+``OPAL_BUSY_EVENT``
+  Unable to power down, call `opal_run_pollers` and try again.
 
 ``OPAL_PARAMETER``
   a parameter was incorrect
 
 ``OPAL_INTERNAL_ERROR``
-  hal code sent incorrect data to hardware device
+  Something went wrong, and waiting and trying again is unlikely to be
+  successful. Although, considering that in a shutdown code path, there's
+  unlikely to be any other valid option to take, retrying is perfectly valid.
+
+  In older OPAL versions (prior to skiboot v5.9), on IBM FSP systems, this
+  return code was returned erroneously instead of OPAL_BUSY_EVENT during an
+  FSP Reset/Reload.
 
 ``OPAL_UNSUPPORTED``
   this platform does not support being powered off.
diff --git a/doc/opal-api/return-codes.rst b/doc/opal-api/return-codes.rst
index 03ea5c1..3ea4a3d 100644
--- a/doc/opal-api/return-codes.rst
+++ b/doc/opal-api/return-codes.rst
@@ -40,7 +40,8 @@ OPAL_BUSY
 
    #define OPAL_BUSY -2
 
-Try again later.
+Try again later. Related to `OPAL_BUSY_EVENT`, but `OPAL_BUSY` indicates that the
+caller need not call `OPAL_POLL_EVENTS` itself. **TODO** Clarify current situation.
 
 OPAL_PARTIAL
 ------------
@@ -126,6 +127,9 @@ OPAL_BUSY_EVENT
 
    #define OPAL_BUSY_EVENT -12
 
+The same as `OPAL_BUSY` but signals that the OS should call `OPAL_POLL_EVENTS` as
+that may be required to get into a state where the call will succeed.
+
 OPAL_HARDWARE_FROZEN
 --------------------
 ::