Age | Commit message | Author | Files | Lines
2017-10-31 | skiboot 5.9 release notes [v5.9] | Stewart Smith | 1 | -0/+1181
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-10-30 | FSP/CONSOLE: remove redundant flush_all_input() call in fsp_console_reset() | Vasant Hegde | 1 | -2/+0
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-10-30 | FSP/CONSOLE: Disable notification on unresponsive consoles | Vasant Hegde | 1 | -3/+5
Commit fd6b71fc fixed the situation where the ipmi console was open (hvc0) but data arrived on a different console (hvc1). During FSP R/R, OPAL closes all consoles. After R/R completes, the FSP requests to open hvc1 and sends data on it. If hvc1 registration failed, or the console was not opened in the host kernel, then the data is never read, which results in RCU stalls. Note that this is a workaround for older kernels where we don't have a separate irq for each console. The latest kernel works fine without this patch.
CC: stable
CC: Sam Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-10-29 | asm/head: initialize preferred DSCR value | Nicholas Piggin | 2 | -3/+20
POWER7/8 use DSCR=0. POWER9 preferred value has "stride-N" enabled.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-10-29 | p8-i2c: Further timeout reworks | Oliver O'Halloran | 1 | -91/+75
This patch reworks the way timeouts are set so that, rather than imposing a hard deadline based on the transaction length, it uses a kick-the-can-down-the-road approach where the timeout is reset each time data is written to or received from the master. This fits better with the actual failure modes that timeouts are designed to handle, such as unusually slow or broken devices. Additionally, this patch moves all the special case detection out of the timeout handler. This should help improve the robustness of the driver and prepare for a more substantial rework of the driver as a whole later on.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
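The "kick-the-can-down-the-road" timeout described above can be sketched roughly as follows. This is only an illustration, not the p8-i2c driver code: the structure, the function names and the abstract time unit are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-transfer state; not the real p8-i2c structures. */
struct xfer_timeout {
	uint64_t deadline;	/* absolute time at which we give up */
	uint64_t window;	/* how long we tolerate no progress */
};

/* Arm the timeout when the transfer starts. */
static void xfer_start(struct xfer_timeout *t, uint64_t now, uint64_t window)
{
	assert(window > 0);
	t->window = window;
	t->deadline = now + window;
}

/* Call whenever data is written to or received from the master:
 * the deadline is pushed out by one full window. */
static void xfer_progress(struct xfer_timeout *t, uint64_t now)
{
	t->deadline = now + t->window;
}

/* True once a full window has passed without any progress. */
static bool xfer_timed_out(const struct xfer_timeout *t, uint64_t now)
{
	return now >= t->deadline;
}
```

With a window of 10 ticks, a transfer that makes progress at tick 9 is not considered timed out until tick 19, however long the whole transaction takes.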
2017-10-29 | npu: Fix broken fast reset | Alexey Kardashevskiy | 1 | -0/+3
0679f61244b "fast-reset: by default (if possible)" broke NPU: the NV links no longer get enabled after reboot. This disables fast reboot for NPU machines until a better solution is found.
Suggested-by: Andrew Donnellan <andonnel@au1.ibm.com>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-10-29 | Suppress XSCOM chiplet-offline errors on P9 | Stewart Smith | 1 | -1/+21
Workaround on P9: PRD does operations it *knows* will fail with this error to work around a hardware issue where accesses via the PIB (FSI or OCC) work as expected, but accesses via the ADU (what xscom goes through) do not. The chip logic will always return all FFs if there is any error on the scom.
Suggested-by: Daniel M Crowell <dcrowell@us.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Acked-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-10-23 | skiboot 5.9-rc5 release notes [v5.9-rc5] | Stewart Smith | 1 | -0/+75
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-10-23 | opal/hmi: Workaround Power9 hw logic bug for couple of TFMR TB errors. | Mahesh Salgaonkar | 2 | -1/+54
Add a workaround for a HW logic bug in Power9 where TB residue and HDEC parity errors cleared by one thread aren't visible to other threads of the same core. The TB residue and HDEC parity errors are reported through TFMR bits 45 and 26 respectively. If any thread of the core clears TFMR bits 26 and 45, only thread 0 is able to see that the errors are cleared; threads 1, 2 and 3 do not see them as cleared. This causes TB error recovery to fail for TB residue and HDEC parity errors. TFMR is a per-core register, and any change made by one thread should be visible to the other threads of the same core.

On a TB residue error (TFMR bit 45), the TB goes into an invalid state. Hence avoid handling/clearing the TB residue error if the TB is valid and running. Use TFMR bit 41 to check the validity of the TB state.

For an HDEC parity error (TFMR bit 26), check for other errors in the TFMR register and ignore the pre-recovery for the HDEC parity error. If TFMR has any other TB error bits set along with the HDEC parity error, we can safely ignore handling of the HDEC parity error. Also, while clearing the HDEC parity error bit from TFMR, allow only thread 0 to clear it.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
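The decision rules in this message can be illustrated with a small sketch. The bit positions (26, 41, 45) come from the message above and are interpreted in IBM bit numbering (bit 0 is the most significant bit of the 64-bit register). The macro and function names, and the caller-supplied mask of "other TB error bits", are hypothetical and are not taken from the skiboot source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* IBM bit numbering: bit 0 is the most significant bit. */
#define PPC_BIT(n)		(1ULL << (63 - (n)))

/* Bit positions as given in the commit message; names are illustrative. */
#define TFMR_HDEC_PARITY	PPC_BIT(26)
#define TFMR_TB_VALID		PPC_BIT(41)
#define TFMR_TB_RESIDUE		PPC_BIT(45)

/* Handle a TB residue error only if the TB is not valid and running. */
static bool should_handle_tb_residue(uint64_t tfmr)
{
	return (tfmr & TFMR_TB_RESIDUE) && !(tfmr & TFMR_TB_VALID);
}

/* Do HDEC parity pre-recovery only when no other TB error bit is set.
 * 'other_tb_errors' is a mask supplied by the caller (an assumption). */
static bool should_prerecover_hdec_parity(uint64_t tfmr,
					  uint64_t other_tb_errors)
{
	return (tfmr & TFMR_HDEC_PARITY) && !(tfmr & other_tb_errors);
}

/* Only thread 0 of the core may clear the HDEC parity error bit. */
static bool may_clear_hdec_parity(unsigned int thread_index)
{
	return thread_index == 0;
}
```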
2017-10-23 | opal/hmi: Fix TB reside and HDEC parity error recovery for power9 | Mahesh Salgaonkar | 1 | -3/+102
On TB/HDEC errors, all 4 threads on the affected core receive an HMI. On power9, every thread on the core has its own copy of TB/HDEC, and hence every thread has to clear the dirty data from its own TB/HDEC register before we clear TB errors through TFMR[24]. The HMI recovery will fail if even one thread does not clean up its TB/HDEC register.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-10-23 | phb4: Escalate freeze to fence to avoid checkstop | Michael Neuling | 2 | -6/+49
Freeze events such as MMIO loads can cause the PHB to lose its limited powerbus credits. If all credits are used, a further MMIO will cause a checkstop. To work around this, we escalate the troublesome freeze events to a fence. The fence will cause a full PHB reset, which resets the powerbus credits and avoids the checkstop.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-10-23 | phb4: Move code to find frozen PE earlier | Michael Neuling | 1 | -14/+13
We are going to reuse this so move it earlier. No functional change.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-10-23 | phb4: Update some init registers | Michael Neuling | 1 | -3/+3
New inits based on next PHB4 workbook. Increases some timeouts to avoid some spurious error conditions.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-10-23 | phb4: Enable PHB MMIO in phb4_root_port_init() | Michael Neuling | 1 | -2/+2
Linux EEH flow is somewhat broken. It saves the PCIe config space of the PHB on boot, which it then uses to restore on EEH recovery. It does this to restore MMIO bars and some other pieces. Unfortunately this save is done before any drivers are bound to devices under the PHB. A number of other things are configured in the PHB after drivers start, hence some configuration space settings aren't saved correctly. These include the bus master and MMIO bits in the command register.

Linux tried to hack around this in this linux commit:
    bf898ec5cb powerpc/eeh: Enable PCI_COMMAND_MASTER for PCI bridges
This sets the bus master bit but ignores the MMIO bit. Hence we lose MMIO after a full PHB reset. This causes the next MMIO access to the device to fail and for us to perform a PE freeze recovery, which still doesn't set the MMIO bit, and hence we still fail.

This works around this by forcing MMIO on during phb4_root_port_init(). With this we can recover from a PHB fence event on POWER9.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
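As a rough illustration of the failure and the workaround: the PCI command register uses bit 1 for memory space (MMIO) enable and bit 2 for bus master enable. The sketch below assumes a restored command value that has only the bus master bit set, and shows the effect of forcing the MMIO bit on. The macro and function names are hypothetical, not the identifiers used in skiboot or Linux.

```c
#include <assert.h>
#include <stdint.h>

/* PCI command register bits (standard PCI layout; names are illustrative). */
#define CMD_MEM_EN		0x0002	/* bit 1: memory space (MMIO) enable */
#define CMD_BUS_MASTER_EN	0x0004	/* bit 2: bus master enable */

/* Hypothetical helper: whatever command value was restored,
 * make sure MMIO decoding is enabled on the root port. */
static uint16_t root_port_force_mmio(uint16_t restored_cmd)
{
	return (uint16_t)(restored_cmd | CMD_MEM_EN);
}
```

Starting from a restored value of `CMD_BUS_MASTER_EN` alone (the situation the message describes after EEH recovery), the helper yields `0x0006`, with both bus mastering and MMIO enabled.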
2017-10-23 | phb4: Use phb4_ioda_sel() more | Michael Neuling | 1 | -7/+2
Use phb4_ioda_sel() in phb4_read_phb_status() rather than re-implementing it.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-10-23 | phb4: Improve config space logging | Michael Neuling | 1 | -7/+14
Log root complex accesses and print BDFN on device access.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-10-23 | phb4: Remove unused code | Michael Neuling | 1 | -10/+0
This is old unused code from phb3 so just remove it. No functional change.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-10-23 | phb4: Move code around to avoid indenting | Michael Neuling | 1 | -53/+51
No functional change.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-10-23 | phb4: Update comment | Michael Neuling | 1 | -1/+1
No functional change.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-10-23 | phb4: Reduce link degraded message log level to debug | Michael Neuling | 1 | -1/+1
If we hit this message we'll retry and fix the problem. If we run out of retries and can't fix the problem, we'll still print a log message at error level indicating a problem.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Reported-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-10-23 | phb4: Fix GEN3 for DD2.00 | Michael Neuling | 1 | -1/+1
In this fix:
    62ac7631ae phb4: Fix PCIe GEN4 on DD2.1 and above
We fixed DD2.1 GEN4 but broke DD2.00 as GEN3. This fixes DD2.00 back to GEN3. This time for sure!
Signed-off-by: Michael Neuling <mikey@neuling.org>
Tested-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
2017-10-19 | skiboot 5.9-rc4 release notes [v5.9-rc4] | Stewart Smith | 1 | -0/+47
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-10-18 | occ-sensors : Add OCC inband sensor region to exports | Shilpasri G Bhat | 1 | -1/+13
Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-10-18 | phb4: Fix PCIe GEN4 on DD2.1 and above | Michael Neuling | 1 | -4/+3
In this change:
    eef0e197ab PHB4: Default to PCIe GEN3 on POWER9 DD2.00
We clamped DD2.00 parts to GEN3 but unfortunately this change also applies to DD2.1 and above. This fixes this to only apply to DD2.00. This also cleans up the documentation and printing.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-10-18 | core: direct-controls: Fix clearing of special wakeup | Shilpasri G Bhat | 1 | -1/+4
'special_wakeup_count' is incremented on successfully asserting special wakeup. So we will never clear the special wakeup if we check for 'special_wakeup_count' being zero. Fix this issue by checking whether 'special_wakeup_count' is 1 in dctl_clear_special_wakeup().
Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
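The counting issue can be seen in a simplified reference-count model. This is a sketch only: the structure and function names are hypothetical, and the real dctl_clear_special_wakeup() also deals with hardware access and error codes that are not modelled here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-core state standing in for the real structures. */
struct core_model {
	unsigned int special_wakeup_count;	/* number of active users */
	bool hw_asserted;			/* models the hardware state */
};

/* The count is incremented once the assert has succeeded. */
static void model_set_special_wakeup(struct core_model *c)
{
	if (c->special_wakeup_count == 0)
		c->hw_asserted = true;
	c->special_wakeup_count++;
}

/* The last user sees a count of 1 (not 0) on entry, so that is the
 * value that must trigger the hardware clear. */
static void model_clear_special_wakeup(struct core_model *c)
{
	if (c->special_wakeup_count == 0)
		return;			/* nothing was asserted */
	if (c->special_wakeup_count == 1)
		c->hw_asserted = false;
	c->special_wakeup_count--;
}
```

Had the clear path tested for a count of zero before touching the hardware, a caller that had successfully asserted (count at least 1) would never reach the clear.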
2017-10-18 | core/direct-controls: increase special wakeup timeout on POWER9 | Nicholas Piggin | 1 | -3/+6
Some instances have been observed where the special wakeup assert times out. The current timeout is too short for deeper sleep states. Hostboot uses 100ms, so match that.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-10-18 | skiboot 5.9-rc3 release notes [v5.9-rc3] | Stewart Smith | 1 | -0/+42
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-10-16 | hdata/vpd: Improve vpd node find logic | Vasant Hegde | 1 | -14/+2
Use dt_find_by_name_addr() instead of dt_find_by_name(). That way we can avoid unnecessary memory allocation/cleanup.
CC: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-10-16 | hdata/vpd: Rework vpd node creation logic | Vasant Hegde | 5 | -399/+363
Presently we traverse the SLCA structure to create various FRU nodes under the /vpd node. We assumed that children are always contiguous. They happened to be contiguous on P8, where this worked fine, but it failed on P9 systems: we ended up populating duplicate nodes under the wrong parent, and also failed to populate some of the nodes.

Unfortunately there is no way to reach all the children of a given parent from the parent node :-( Hence we have to rework the vpd creation logic. This patch goes through all the SLCA entries serially and creates vpd nodes.

Assumptions:
  - SLCA index is always serial (0..n)
  - When we traverse serially, the parent entry comes before its children
  - Redundant resources are always consecutive
  - Populate a node if the SLCA entry has the 'installed' and 'VPD collected' bits set
CC: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Reviewed-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-10-16 | Revert "npu2: Add vendor cap for IRQ testing" | Alistair Popple | 1 | -28/+0
This reverts commit 9817c9e29b6fe00daa3a0e4420e69a97c90eb373, which seems to break setting the PCI dev flag and the link number in the PCIe vendor specific config space. This leads to the device driver attempting to re-init the DL when it shouldn't, which can cause HMIs.
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-10-16 | hw/imc: Fix the pvr (sub_id) for IMC Catalog load | Madhavan Srinivasan | 1 | -2/+2
Currently the IMC catalog carries multiple dtbs in the pnor partition, one for each power9 major version. The system pvr value (pvr_type and pvr_major version) is used as the sub-id to load the right dtb from the partition. Since the minor version of the pvr is not used, mask it out.
Reported-by: Shriya <shriyak@linux.vnet.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-10-16 | cpu: Add OPAL_REINIT_CPUS_TM_SUSPEND_DISABLED | Michael Ellerman | 3 | -0/+20
Add a new CPU reinit flag, "TM Suspend Disabled", which requests that CPUs be configured so that TM (Transactional Memory) suspend mode is disabled. Currently this always fails, because skiboot has no way to query the state. A future hostboot change will add a mechanism for skiboot to determine the status and return an appropriate error code.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-10-16 | skiboot 5.9-rc2 release notes [v5.9-rc2] | Stewart Smith | 1 | -0/+246
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-10-16 | opal-prd: Fix memory leak | Vasant Hegde | 1 | -0/+1
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Acked-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-10-15 | hdata/i2c: update the list of known i2c devs | Claudio Carvalho | 1 | -4/+33
This updates the list of known i2c devices - as of HDAT spec v10.5e - so that they can be properly identified during the hdat parsing.
Signed-off-by: Claudio Carvalho <cclaudio@linux.vnet.ibm.com>
Reviewed-by: Oliver O'Halloran <oohal@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-10-15 | hdata/i2c: log unknown i2c devices | Claudio Carvalho | 1 | -4/+17
An i2c device is unknown if either the i2c device list is outdated or the device is marked as unknown (0xFF) in the hdat. This logs both cases.
Signed-off-by: Claudio Carvalho <cclaudio@linux.vnet.ibm.com>
Reviewed-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-10-15 | hdata/i2c: add __packed to the host_i2c_hdr structure | Claudio Carvalho | 1 | -1/+1
This adds __packed to the host_i2c_hdr structure since it defines an offset that refers to the beginning of the structure.
Fixes: 41dc3eb4495c451a405974570f604622a3f829ef
Signed-off-by: Claudio Carvalho <cclaudio@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
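Why packing matters for a header that carries offsets can be shown with a toy structure. The layout below is invented for illustration and is not the real host_i2c_hdr; the sketch also assumes that `__packed` expands to the GCC/Clang `__attribute__((packed))` extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: __packed is shorthand for the GCC/Clang attribute. */
#define __packed __attribute__((packed))

/* Hypothetical header layout, NOT the real host_i2c_hdr. Without
 * packing, the compiler may insert padding after 'version' so that
 * 'offset' is naturally aligned. */
struct toy_hdr_padded {
	uint8_t  version;
	uint32_t offset;	/* offset measured from the start of the header */
};

/* With packing, fields sit exactly where the firmware data puts them. */
struct toy_hdr_packed {
	uint8_t  version;
	uint32_t offset;
} __packed;

/* Byte position of 'offset' in the packed layout. */
static size_t toy_packed_offset_pos(void)
{
	return offsetof(struct toy_hdr_packed, offset);
}
```

In the packed variant `offset` sits at byte 1 and the structure is 5 bytes long, so values read from a firmware-provided blob line up with the field positions; the padded variant's layout depends on the ABI.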
2017-10-15 | opal/cpu: Mark the core as bad while disabling threads of the core. | Mahesh Salgaonkar | 1 | -0/+10
If any core fails to sync its TB during chipTOD initialization, all the threads of that core are disabled. But this does not make the Linux kernel ignore the core/cpus. It crashes while bringing them up, with the backtrace below:

    [ 38.883898] kexec_core: Starting new kernel
    cpu 0x0: Vector: 300 (Data Access) at [c0000003f277b730]
    pc: c0000000001b9890: internal_create_group+0x30/0x304
    lr: c0000000001b9880: internal_create_group+0x20/0x304
    sp: c0000003f277b9b0
    msr: 900000000280b033
    dar: 40
    dsisr: 40000000
    current = 0xc0000003f9f41000
    paca = 0xc00000000fe00000 softe: 0 irq_happened: 0x01
    pid = 2572, comm = kexec
    Linux version 4.13.2-openpower1 (jenkins@p89) (gcc version 6.4.0 (Buildroot 2017.08-00006-g319c6e1)) #1 SMP Wed Sep 20 05:42:11 UTC 2017
    enter ? for help
    [c0000003f277b9b0] c0000000008a8780 (unreliable)
    [c0000003f277ba50] c00000000041c3ac topology_add_dev+0x2c/0x40
    [c0000003f277ba70] c00000000006b078 cpuhp_invoke_callback+0x88/0x170
    [c0000003f277bac0] c00000000006b22c cpuhp_up_callbacks+0x54/0xb8
    [c0000003f277bb10] c00000000006bc68 cpu_up+0x11c/0x168
    [c0000003f277bbc0] c00000000002f0e0 default_machine_kexec+0x1fc/0x274
    [c0000003f277bc50] c00000000002e2d8 machine_kexec+0x50/0x58
    [c0000003f277bc70] c0000000000de4e8 kernel_kexec+0x98/0xb4
    [c0000003f277bce0] c00000000008b0f0 SyS_reboot+0x1c8/0x1f4
    [c0000003f277be30] c00000000000b118 system_call+0x58/0x6c

Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-10-15 | doc: Update VPD, ECID documentation | Vasant Hegde | 2 | -4/+16
Recently we added `ecid`, `wafer-id` and `wafer-location` properties under the xscom node. Let's document these properties. Also update the VPD documentation.
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-10-15 | hw/imc: pause microcode at boot | Madhavan Srinivasan | 1 | -0/+24
IMC nest counters have both in-band (ucode access) and out-of-band access. Since not all nest counter configurations are supported by the ucode, out-of-band tools are used to characterize the other configurations. So it is preferable to pause the nest microcode at boot to aid the nest out-of-band tools. If the ucode is not paused and the OS does not have IMC driver support, then out-of-band tools will race with the ucode and end up getting undesirable values. This patch checks and pauses the ucode at boot.

OPAL provides APIs to control IMC counters. OPAL_IMC_COUNTERS_INIT is used to initialize these counters at boot. The OPAL_IMC_COUNTERS_START and OPAL_IMC_COUNTERS_STOP API calls should be used to start and pause these IMC engines. `doc/opal-api/opal-imc-counters.rst` details the OPAL APIs and their usage.
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-10-15 | hw/imc: Use ARRAY_SIZE instead of static macro | Madhavan Srinivasan | 2 | -3/+1
disable_unavailable_units() loops through the nest_pmus array to filter out the unsupported nest units from the imc catalog dtb. The current code uses a static macro ('MAX_NEST_UNITS') for the array limit; use ARRAY_SIZE instead. This avoids having to update the static macro whenever the nest_pmus array is updated.
Fixes: 712837cedca06 ('skiboot/imc: Update the nest_pmus array with occ/gpe microcode uav updates')
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
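The benefit of ARRAY_SIZE over a hand-maintained limit can be sketched as below. The array contents and the helper function are made up for illustration; only the ARRAY_SIZE idiom itself is the point.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Common C idiom: element count derived from the array itself. */
#define ARRAY_SIZE(a)	(sizeof(a) / sizeof((a)[0]))

/* Hypothetical unit names; the real nest_pmus array differs. */
static const char *toy_units[] = {
	"unit-a",
	"unit-b",
	"unit-c",
};

/* Loop bound follows the array automatically when entries are added
 * or removed, so there is no separate MAX_* macro to keep in sync. */
static int toy_find_unit(const char *name)
{
	size_t i;

	for (i = 0; i < ARRAY_SIZE(toy_units); i++) {
		if (strcmp(toy_units[i], name) == 0)
			return (int)i;
	}
	return -1;
}
```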
2017-10-15 | xive: Fix VP free block group mode false-positive parameter check | Nicholas Piggin | 1 | -1/+3
The check to ensure the buddy allocation idx is aligned to its allocation order was not taking into account the allocation split. This would result in opal_xive_free_vp_block failures despite giving the same value as returned by opal_xive_alloc_vp_block.

E.g., starting then stopping 4 KVM guests gives the following pattern in the host:

    opal_xive_alloc_vp_block(5)=0x45000020
    opal_xive_alloc_vp_block(5)=0x45000040
    opal_xive_alloc_vp_block(5)=0x45000060
    opal_xive_alloc_vp_block(5)=0x45000080
    opal_xive_free_vp_block(0x45000020)=-1
    opal_xive_free_vp_block(0x45000040)=0
    opal_xive_free_vp_block(0x45000060)=-1
    opal_xive_free_vp_block(0x45000080)=0

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
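One way to picture the false positive is sketched below. It rests on assumptions that are not stated in the message: that the low bits of the returned value are the VP index, and that one of the order bits is consumed by splitting the allocation across chips, so the per-chip buddy index is the VP index shifted right by one. The function names and the `chip_bits` parameter are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Alignment check that ignores the split: the per-chip buddy index is
 * compared against the full allocation order. */
static bool idx_aligned_ignoring_split(uint32_t idx, uint32_t order)
{
	return (idx & ((1u << order) - 1)) == 0;
}

/* Alignment check that accounts for the split: each chip only holds
 * 2^(order - chip_bits) entries of the block. 'chip_bits' is an
 * assumed parameter for this sketch. */
static bool idx_aligned_with_split(uint32_t idx, uint32_t order,
				   uint32_t chip_bits)
{
	assert(order >= chip_bits);
	return (idx & ((1u << (order - chip_bits)) - 1)) == 0;
}
```

Under these assumptions the four blocks in the log map to per-chip indexes 0x10, 0x20, 0x30 and 0x40. With order 5 the first check rejects 0x10 and 0x30 and accepts 0x20 and 0x40, which matches the -1, 0, -1, 0 pattern above, while the second check with `chip_bits` of 1 accepts all four.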
2017-10-15 | doc: clarify locking and async of OPAL_SENSOR_READ | Stewart Smith | 1 | -1/+6
Reported-by: Robert Lippert <rlippert@google.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-10-15 | hw/p8-i2c: Fix deadlock in p9_i2c_bus_owner_change | Anton Blanchard | 1 | -1/+1
When debugging a system where Linux was taking soft lockup errors, I noticed two CPUs were stuck in OPAL:

    CPU0:
      lock
      p8_i2c_recover
      opal_handle_interrupt

    CPU1:
      sync_timer
      cancel_timer
      p9_i2c_bus_owner_change
      occ_p9_interrupt
      xive_source_interrupt
      opal_handle_interrupt

p8_i2c_recover() is a timer, and is stuck trying to take master->lock. p9_i2c_bus_owner_change() has taken master->lock, but then is stuck waiting for all timers to complete. We deadlock.

Fix this by using cancel_timer_async(), as suggested by Oliver.
Fixes: 201fd50f208d ("hw/p8-i2c: Fix OCC locking")
Suggested-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-10-11 | skiboot 5.4.8 release notes | Stewart Smith | 1 | -0/+158
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
(cherry picked from commit 43290f90e46d632ed5a314292c317e6f813c3b74)
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-10-11 | FSP/CONSOLE: Limit number of error logging | Vasant Hegde | 1 | -8/+13
Commit c8a7535f (FSP/CONSOLE: Workaround for unresponsive ipmi daemon) added error logging when the buffer is full. In some corner cases the kernel may call this function multiple times, and we may end up logging the error again and again. This patch fixes it by generating the error log only once. I think this is enough to indicate something went wrong.

Also, with the previous patch, once the console buffer is full OPAL returns an error to the payload from fsp_console_write_buffer_space(), so the payload will never call fsp_console_write(). Hence move the error logging logic to the right place.
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-10-11 | FSP/CONSOLE: Fix fsp_console_write_buffer_space() call | Vasant Hegde | 1 | -1/+35
The kernel calls fsp_console_write_buffer_space() to check console buffer space availability. If there is enough buffer space to write data, then the kernel will call fsp_console_write() to write the actual data. In some extreme corner cases (like the one explained in commit c8a7535f) the console becomes full and this function returns 0 to the kernel (or the space available in the console buffer is less than the next incoming data size). The kernel will continue retrying until it gets enough space, so we will start seeing RCU stalls.

This patch keeps track of the previously available space. If the previous space is the same as the current space, there is not enough space in the console buffer to write incoming data. It may be due to a very high rate of console writes and slow response from the FSP -OR- the FSP has stopped processing data (ex: because the ipmi daemon died). At this point we start a timer with a timeout of SER_BUFFER_OUT_TIMEOUT (10 secs). If the situation has not improved within 10 seconds, something went bad. Let's return OPAL_RESOURCE so that the kernel can drop the console write and continue.
CC: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
CC: Stewart Smith <stewart@linux.vnet.ibm.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
[stewart: reset timeout in fsp_console_write() path]
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
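The stall detection described in this message might be modelled as follows. This is a sketch under assumptions, not the skiboot implementation: the structure and function names are invented, time is passed in as plain seconds rather than read from the timebase, and the numeric value used for OPAL_RESOURCE is illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define OPAL_SUCCESS		0
#define OPAL_RESOURCE		(-10)	/* illustrative value for this sketch */
#define SER_BUFFER_OUT_TIMEOUT	10	/* seconds, as stated in the message */

/* Hypothetical console state. */
struct toy_console {
	int64_t  prev_space;	/* space reported on the previous call */
	uint64_t deadline;	/* 0 means no stall timer is running */
};

/* Model of the buffer-space query: 'space_now' is the currently free
 * space, 'now' the current time in seconds. */
static int64_t toy_write_buffer_space(struct toy_console *c,
				      int64_t space_now, uint64_t now)
{
	if (space_now != c->prev_space) {
		/* Free space changed, so the FSP is draining data. */
		c->prev_space = space_now;
		c->deadline = 0;
		return OPAL_SUCCESS;
	}
	if (c->deadline == 0) {
		/* No change since last call: start the stall timer. */
		c->deadline = now + SER_BUFFER_OUT_TIMEOUT;
		return OPAL_SUCCESS;
	}
	if (now <= c->deadline)
		return OPAL_SUCCESS;
	/* Stalled for the whole timeout: tell the caller to give up. */
	return OPAL_RESOURCE;
}

/* Model of the write path resetting the timer, per the maintainer note. */
static void toy_console_wrote(struct toy_console *c)
{
	c->deadline = 0;
}
```

In this model a caller polling an unchanged buffer sees OPAL_SUCCESS until the 10 second window has elapsed, then OPAL_RESOURCE, and goes back to OPAL_SUCCESS as soon as the reported space changes or a write resets the timer.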
2017-10-11 | FSP/CONSOLE: Close SOL session during R/R | Vasant Hegde | 1 | -3/+0
Presently we are not closing SOL and FW console sessions during R/R. The host will continue to write to the SOL buffer during FSP R/R. If there are heavy console writes happening during FSP R/R (like running the `top` command inside the console), then at some point the console buffer becomes full. fsp_console_write_buffer_space() returns 0 (or less than the space required to write data) to the host. While one thread is busy writing to the console, if some other thread tries to write data to the console we may see RCU stalls (like below) in the kernel.

kernel call trace:
------------------
    [ 2082.828363] INFO: rcu_sched detected stalls on CPUs/tasks: { 32} (detected by 16, t=6002 jiffies, g=23154, c=23153, q=254769)
    [ 2082.828365] Task dump for CPU 32:
    [ 2082.828368] kworker/32:3 R running task 0 4637 2 0x00000884
    [ 2082.828375] Workqueue: events dump_work_fn
    [ 2082.828376] Call Trace:
    [ 2082.828382] [c000000f1633fa00] [c00000000013b6b0] console_unlock+0x570/0x600 (unreliable)
    [ 2082.828384] [c000000f1633fae0] [c00000000013ba34] vprintk_emit+0x2f4/0x5c0
    [ 2082.828389] [c000000f1633fb60] [c00000000099e644] printk+0x84/0x98
    [ 2082.828391] [c000000f1633fb90] [c0000000000851a8] dump_work_fn+0x238/0x250
    [ 2082.828394] [c000000f1633fc60] [c0000000000ecb98] process_one_work+0x198/0x4b0
    [ 2082.828396] [c000000f1633fcf0] [c0000000000ed3dc] worker_thread+0x18c/0x5a0
    [ 2082.828399] [c000000f1633fd80] [c0000000000f4650] kthread+0x110/0x130
    [ 2082.828403] [c000000f1633fe30] [c000000000009674] ret_from_kernel_thread+0x5c/0x68

Hence let's close SOL (and FW console) during FSP R/R.
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-10-11 | FSP/CONSOLE: Do not associate unavailable console | Vasant Hegde | 1 | -0/+11
Presently OPAL sends the associate/unassociate MBOX command for all FSP serial consoles (see the OPAL messages below). We have to check whether the console is available before sending this message.

OPAL log:
-------
    [ 5013.227994012,7] FSP: Reassociating HVSI console 1
    [ 5013.227997540,7] FSP: Reassociating HVSI console 2

Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
2017-10-11 | FSP: Disable PSI link whenever FSP tells OPAL about impending R/R | Vasant Hegde | 2 | -18/+8
Commit 42d5d047 fixed the scenario where DPO has been initiated, but the FSP went into reset before the CEC power down came in. But this is a generic issue that can happen in the normal shutdown path as well. Hence disable the PSI link as soon as we detect an impending FSP R/R.
CC: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
CC: Stewart Smith <stewart@linux.vnet.ibm.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>