Age  Commit message  Author  Files  Lines
2018-10-12  skiboot 6.0.9 release notes  v6.0.9  Stewart Smith  1  -0/+139
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-10-10  opal/hmi: Ignore debug trigger inject core FIR.  Mahesh Salgaonkar  1  -1/+0
[ Upstream commit 1317448ddd1a872e93a6f421ba5cd5d9b3b6ea7a ] Core FIR[60] is a side effect of the work around for the CI Vector Load issue in DD2.1. Usually this gets delivered as HMI with HMER[17] where Linux already ignores it. But it looks like in some cases we may happen to see CORE_FIR[60] while we are already in Malfunction Alert HMI (HMER[0]) due to other reasons e.g. CAPI recovery or NPU xstop. If that happens then just ignore it instead of crashing kernel as not recoverable. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Reviewed-by: Vaibhav Jain <vaibhav@linux.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-10-10  opal/hmi: Handle early HMIs on thread0 when secondaries are still in OPAL.  Mahesh Salgaonkar  1  -0/+49
[ Upstream commit c884f2d0cb921131737df99ed3aad9f5a2d2945f ] When the primary thread receives a CORE level HMI for timer facility errors while secondaries are still in OPAL, thread 0 ends up in rendez-vous waiting for secondaries to get into HMI handling. This is because OPAL runs with MSR(EE=0) and hence HMIs are delayed on secondary threads until they are given to the Linux OS. Fix this by adding a check for secondary state and forcing them into HMI handling by queuing a job on the secondary threads. I have tested this by injecting an HDEC parity error very early during Linux kernel boot. Recovery works fine for non-TB errors. But if TB is bad at this very early stage we are already doomed.
Without this patch we see:
[ 285.046347408,7] OPAL: Start CPU 0x0843 (PIR 0x0843) -> 0x000000000000a83c
[ 285.051160609,7] OPAL: Start CPU 0x0844 (PIR 0x0844) -> 0x000000000000a83c
[ 285.055359021,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
[ 285.055361439,7] HMI: [Loc: U78D3.ND1.WZS004A-P1-C48]: P:8 C:17 T:0: TFMR(2e12002870e14000) Timer Facility Error
[ 286.232183823,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 1 (sptr=0000ccc1)
[ 287.409002056,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 2 (sptr=0000ccc1)
[ 289.073820164,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 3 (sptr=0000ccc1)
[ 290.250638683,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 1 (sptr=0000ccc2)
[ 291.427456821,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 2 (sptr=0000ccc2)
[ 293.092274807,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 3 (sptr=0000ccc2)
[ 294.269092904,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 1 (sptr=0000ccc3)
[ 295.445910944,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 2 (sptr=0000ccc3)
[ 297.110728970,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 3 (sptr=0000ccc3)
After this patch:
[ 259.401719351,7] OPAL: Start CPU 0x0841 (PIR 0x0841) -> 0x000000000000a83c
[ 259.406259572,7] OPAL: Start CPU 0x0842 (PIR 0x0842) -> 0x000000000000a83c
[ 259.410615534,7] OPAL: Start CPU 0x0843 (PIR 0x0843) -> 0x000000000000a83c
[ 259.415444519,7] OPAL: Start CPU 0x0844 (PIR 0x0844) -> 0x000000000000a83c
[ 259.419641401,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
[ 259.419644124,7] HMI: [Loc: U78D3.ND1.WZS004A-P1-C48]: P:8 C:17 T:0: TFMR(2e12002870e04000) Timer Facility Error
[ 259.419650678,7] HMI: Sending hmi job to thread 1
[ 259.419652744,7] HMI: Sending hmi job to thread 2
[ 259.419653051,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
[ 259.419654725,7] HMI: Sending hmi job to thread 3
[ 259.419654916,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
[ 259.419658025,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
[ 259.419658406,7] HMI: [Loc: U78D3.ND1.WZS004A-P1-C48]: P:8 C:17 T:2: TFMR(2e12002870e04000) Timer Facility Error
[ 259.419663095,7] HMI: [Loc: U78D3.ND1.WZS004A-P1-C48]: P:8 C:17 T:3: TFMR(2e12002870e04000) Timer Facility Error
[ 259.419655234,7] HMI: [Loc: U78D3.ND1.WZS004A-P1-C48]: P:8 C:17 T:1: TFMR(2e12002870e04000) Timer Facility Error
[ 259.425109779,7] OPAL: Start CPU 0x0845 (PIR 0x0845) -> 0x000000000000a83c
[ 259.429870681,7] OPAL: Start CPU 0x0846 (PIR 0x0846) -> 0x000000000000a83c
[ 259.434549250,7] OPAL: Start CPU 0x0847 (PIR 0x0847) -> 0x000000000000a83c
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-09-20  hw/bt.c: quieten all the noisy BT/IPMI messages  Stewart Smith  1  -4/+4
[ Upstream commit 8f650b6d55b4060cca7b8a2fa2850bc73890b179 ] Suggested-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Yeah-boiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiied-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-09-18  npu2: Use correct kill type for TCE invalidation  Alexey Kardashevskiy  1  -1/+1
[ Upstream commit 8a2b6d51b77172d5ff81aa412ff7aa97f57d4f90 ] kill_type is enum of OPAL_PCI_TCE_KILL_PAGES, OPAL_PCI_TCE_KILL_PE, OPAL_PCI_TCE_KILL_ALL and phb4_tce_kill() gets it right but npu2_tce_kill() uses OPAL_PCI_TCE_KILL which is an OPAL API token. This fixes an obvious mistype. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-09-18  hw/npu2-opencapi: Fix setting of supported OpenCAPI templates  Andrew Donnellan  1  -2/+2
[ Upstream commit 34ceb75f282952b40b615558f947c3fee533b1d4 ] In opal_npu_tl_set(), we made a typo that means the OPAL_NPU_TL_SET call may not clear the enable bits for templates that were previously enabled but are now disabled. Fix the typo so we clear NPU2_OTL_CONFIG1_TX_TEMP2_EN as well as TEMP{1,3}_EN. Reported-by: Tyler Seredynski <tseredynski@gmail.com> Fixes: cd8b82a8e83ed ("npu2-opencapi: Add OpenCAPI OPAL API calls") Cc: stable Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Reviewed-by: Frederic Barrat <fbarrat@linux.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
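The class of bug described above, where one enable bit is missing from the clear mask, can be sketched as follows. This is a minimal illustration, not the actual opal_npu_tl_set() code: the EX_ macro names and bit positions are hypothetical and do not reflect the real NPU2_OTL_CONFIG1 layout.

```c
#include <stdint.h>

/* Hypothetical bit positions, chosen only for illustration. */
#define EX_TX_TEMP1_EN		(1ull << 1)
#define EX_TX_TEMP2_EN		(1ull << 2)
#define EX_TX_TEMP3_EN		(1ull << 3)
#define EX_TX_TEMP_ALL_EN	(EX_TX_TEMP1_EN | EX_TX_TEMP2_EN | EX_TX_TEMP3_EN)

/*
 * Return reg with the template enable bits replaced by `wanted`.
 * Every enable bit is cleared first, so a template that was enabled
 * earlier but is absent from `wanted` ends up disabled.
 */
uint64_t ex_set_templates(uint64_t reg, uint64_t wanted)
{
	reg &= ~EX_TX_TEMP_ALL_EN;		/* clear all three enables */
	reg |= wanted & EX_TX_TEMP_ALL_EN;	/* set only the requested ones */
	return reg;
}
```

If the clear mask omitted one of the three bits, a template enabled by an earlier call would stay enabled no matter what the later call requested.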
2018-09-13  phb4: Workaround PHB errata with CFG write UR/CA errors  Benjamin Herrenschmidt  1  -1/+5
[ Upstream commit 9a83ab711ea3c76919f311cb1c78e051ae59c808 ] If the PHB encounters a UR or CA status on a CFG write, it will incorrectly freeze the wrong PE. Instead of using the PE# specified in the CONFIG_ADDRESS register, it will use the PE# of whatever MMIO occurred last. Work around this by disabling freeze on such errors. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Tested-By: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-09-13  phb4: Handle allocation errors in phb4_eeh_dump_regs()  Benjamin Herrenschmidt  1  -0/+4
[ Upstream commit 0a087154ca4f6759ad1e25c0b3933a9e6caeb456 ] If the zalloc fails (and it can be a rather large allocation), we will overwrite memory at 0 instead of failing. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
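The failure mode above is a missing NULL check after a zeroing allocation. A minimal sketch of the guarded pattern follows, using standard calloc() as a stand-in for skiboot's zalloc(); `struct ex_diag_buf`, `EX_NUM_REGS` and `ex_dump_regs()` are hypothetical names and this is not the actual phb4_eeh_dump_regs() code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define EX_NUM_REGS 4096	/* hypothetical size of the register dump */

struct ex_diag_buf {
	uint64_t regs[EX_NUM_REGS];
};

/*
 * Copy n register values into a freshly allocated, zeroed buffer.
 * Returns NULL when the allocation fails rather than writing through
 * a NULL pointer, which in firmware would scribble over address 0.
 */
struct ex_diag_buf *ex_dump_regs(const uint64_t *src, size_t n)
{
	struct ex_diag_buf *buf;

	assert(src != NULL && n <= EX_NUM_REGS);

	buf = calloc(1, sizeof(*buf));
	if (!buf)
		return NULL;	/* bail out before touching the buffer */

	for (size_t i = 0; i < n; i++)
		buf->regs[i] = src[i];

	return buf;
}
```

The caller is then responsible for treating a NULL return as "no diagnostics available" and for freeing the buffer otherwise.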
2018-09-13  phb4: Don't try to access non-existent PEST entries  Benjamin Herrenschmidt  1  -3/+3
[ Upstream commit cfecc3960c00ea9a9871c2358d8710c5d2c6539b ] In a POWER9 chip, some PHB4s have 256 PEs, some have 512. Currently, the diagnostics code retrieves 512 unconditionally, which is wrong and causes us to report bogus values for the "high" PEs on the small PHBs. Use the actual number of implemented PEs instead. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-09-13  phb4: Don't probe a PHB if its garded  Vaibhav Jain  1  -2/+11
[ Upstream commit 1520d6a1e3aaec74228d213083b68da70729121a ] Presently phb4_probe_stack() causes an exception while trying to probe a PHB if it's garded. This causes skiboot to go into a reboot loop with the following exception log:
***********************************************
Fatal MCE at 000000003006ecd4 .probe_phb4+0x570
CFAR : 00000000300b98a0
<snip>
Aborting!
CPU 0018 Backtrace:
S: 0000000031cc37e0 R: 000000003001a51c ._abort+0x4c
S: 0000000031cc3860 R: 0000000030028170 .exception_entry+0x180
S: 0000000031cc3a40 R: 0000000000001f10 *
S: 0000000031cc3c20 R: 000000003006ecb0 .probe_phb4+0x54c
S: 0000000031cc3e30 R: 0000000030014ca4 .main_cpu_entry+0x5b0
S: 0000000031cc3f00 R: 0000000030002700 boot_entry+0x1b8
This happens because phb4_probe_stack() ignores all xscom read/write errors when enabling the PHB BARs and then tries to perform an MMIO read of the PHB Version register, which causes the fatal MCE. We fix this by skipping the PHB probe if the first xscom_write() to populate the PHB BAR register fails, which indicates that there is something wrong with the PHB. Cc: stable Fixes: dc21b4db3a2e('hw/phb4: Add initial support') Reviewed-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-08-16  skiboot 6.0.8 release notes  v6.0.8  Stewart Smith  1  -0/+67
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-08-16  i2c: Ensure ordering between i2c_request_send() and completion  Benjamin Herrenschmidt  1  -0/+3
i2c_request_send() loops waiting for a flag "uc.done" set by the completion routine, and then looks for a result code also set by that same completion. There is no synchronization and the completion can happen on another processor, so we need memory barriers to order the stores to uc and the reads from uc such that uc.done is stored last and tested first. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> (cherry picked from commit ef79d0370737130256168d20a9bf40f06001af88) Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
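The "stored last, tested first" ordering can be sketched with C11 atomics. Note that this swaps in release/acquire operations from <stdatomic.h> in place of the explicit barrier primitives the patch uses, and that `struct ex_sync_userdata` and both function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical state shared between the waiter and the completion. */
struct ex_sync_userdata {
	int rc;			/* result code, written by the completion */
	atomic_bool done;	/* stored last by the completion, tested first by the waiter */
};

/* Completion side: publish the result, then set the flag. */
void ex_request_complete(struct ex_sync_userdata *ud, int rc)
{
	assert(ud);
	ud->rc = rc;
	/* Release: the store to rc becomes visible before done reads as true. */
	atomic_store_explicit(&ud->done, true, memory_order_release);
}

/* Waiter side: returns true and fills *rc only once completion is seen. */
bool ex_request_poll(struct ex_sync_userdata *ud, int *rc)
{
	assert(ud && rc);
	/* Acquire: the read of rc below cannot be reordered before this load. */
	if (!atomic_load_explicit(&ud->done, memory_order_acquire))
		return false;
	*rc = ud->rc;
	return true;
}
```

A waiter would call ex_request_poll() in a loop. Without the release/acquire pair, another processor could observe the flag as set while still reading a stale result code.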
2018-08-16  i2c: Fix multiple-enqueue of the same request on NACK  Benjamin Herrenschmidt  1  -4/+3
i2c_request_send() will retry the request if the error is a NAK, however it forgets to clear the "ud.done" flag. It will thus loop again and try to re-enqueue the same request causing internal request list corruption. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> (cherry picked from commit f737777b34382d5293901c4a5040b1fad05294a0) Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-08-16  phb4: Disable 32-bit MSI in capi mode  Frederic Barrat  1  -0/+9
If a capi device does a DMA write targeting an address lower than 4GB, it does so through a 32-bit operation, per the PCI spec. In capi mode, the first TVE entry is configured in bypass mode, so the address is valid. But with any (bad) luck, the address could be 0xFFFFxxxx, thus looking like a 32-bit MSI. We currently enable both 32-bit and 64-bit MSIs, so the PHB will interpret the DMA write as a MSI, which very likely results in an EEH (MSI with a bad payload size). We can fix it by disabling 32-bit MSI when switching the PHB to capi mode. Capi devices are 64-bit. Cc: stable Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com> Reviewed-by: Christophe Lombard <clombard@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> (cherry picked from commit 3b9bc869a4fee22c99a4d24ba87ce938d46b11f4) Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
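The address collision described above can be illustrated with a small predicate. This is a sketch of the reasoning only: the function names are hypothetical, and the check simply encodes the 0xFFFFxxxx range mentioned in the message rather than the PHB's actual decode logic.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if the address lies in the 0xFFFFxxxx range, i.e. below 4GB
 * with the top 16 bits of the low word all set. */
bool ex_looks_like_msi32(uint64_t dma_addr)
{
	return (dma_addr >> 16) == 0xFFFFull;
}

/* A DMA write is only misread as an MSI while 32-bit MSIs are enabled. */
bool ex_treated_as_msi(uint64_t dma_addr, bool msi32_enabled)
{
	return msi32_enabled && ex_looks_like_msi32(dma_addr);
}
```

With 32-bit MSIs disabled, ex_treated_as_msi() is false for every address, which is the effect the patch relies on in capi mode.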
2018-08-16  capp: Fix the capp recovery timeout comparison  Vaibhav Jain  1  -1/+1
The current capp recovery timeout control loop in do_capp_recovery_scoms() uses a wrong comparison for the return value of tb_compare(). This may cause do_capp_recovery_scoms() to report a timeout earlier than the stipulated 168ms. The patch fixes this by updating the loop timeout control branch in do_capp_recovery_scoms() to use the correct enum tb_cmpval. Cc: Stable #6.0+ Fixes: 09b853cae0aa0("capi: Poll Err/Status register during CAPP recovery") Reported-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com> Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Reviewed-by: Christophe Lombard <clombard@linux.vnet.ibm.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> (cherry picked from commit ec954f764efe064d7fc99e8a21a0ebdb7b8a3c91) Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
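The message does not show which comparison was wrong, so the following is only a sketch of how a three-way timebase compare is typically turned into a timeout test. The enum values and function names are hypothetical stand-ins for enum tb_cmpval and tb_compare(), and wrap-around of the timebase is ignored.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical three-way result of comparing two timebase values. */
enum ex_cmpval {
	EX_A_BEFORE_B = -1,
	EX_A_EQUAL_B  = 0,
	EX_A_AFTER_B  = 1,
};

enum ex_cmpval ex_tb_compare(uint64_t a, uint64_t b)
{
	if (a == b)
		return EX_A_EQUAL_B;
	return a < b ? EX_A_BEFORE_B : EX_A_AFTER_B;
}

/* The deadline has passed only when "now" is strictly after "end". */
bool ex_timed_out(uint64_t now, uint64_t end)
{
	return ex_tb_compare(now, end) == EX_A_AFTER_B;
}
```

Testing the result against the wrong enumerator, or treating any non-zero result as "expired", would report a timeout before the deadline, which matches the symptom described.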
2018-08-14  phb4/capp: Update DMA read engines set in APC_FSM_READ_MASK based on link-width  Vaibhav Jain  1  -4/+18
Commit 47c09cdfe7a3("phb4/capp: Calculate STQ/DMA read engines based on link-width for PEC") updated the CAPP init sequence by calculating the needed STQ/DMA-read engines based on link width and populating it in the XPEC_NEST_CAPP_CNTL register. This however needs to be synchronized with the value set in the CAPP APC FSM Read Machine Mask Register. Hence this patch updates phb4_init_capp_regs() to calculate the link width of the stack on PEC2 and populate the same values as previously populated in the PEC CAPP_CNTL register. Cc: stable # v5.7+ Fixes: 47c09cdfe7a3("phb4/capp: Calculate STQ/DMA read engines based on link-width for PEC") Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> (cherry picked from commit ef9caad57e59ffc1a9ee44d38a161f624993b67b) Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-08-14  core/cpu: Call memset with proper cpu_thread offset  Vasant Hegde  1  -1/+1
"cpu_thread *t + value" vs "(void *)t + val" Fixes: cfe9d441 (core/cpu: Prevent clobbering of stack guard for boot-cpu) CC: stable <skiboot@lists.ozlabs.org> # v6.0+ CC: Vaibhav Jain <vaibhav@linux.vnet.ibm.com> CC: Nicholas Piggin <npiggin@gmail.com> CC: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Acked-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Reviewed-by: Vaibhav Jain<vaibhav@linux.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> (cherry picked from commit 15880d514e1f27e4380eaaf0b7de5ac90d35da66) Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-08-03  skiboot 6.0.7 release notes  v6.0.7  Stewart Smith  1  -0/+20
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-08-02  xive: Disable block tracker  Benjamin Herrenschmidt  1  -2/+4
Due to some HW errata, the block tracking facility (performance optimisation for large systems) should be disabled on Nimbus chips. Disable it unconditionally for now. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Reviewed-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> (cherry picked from commit 7db7c9f652295a47b7fed0fb62787ab795216a18) Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-07-19  skiboot 6.0.6 release notes  v6.0.6  Stewart Smith  1  -0/+51
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-07-19  doc: Add a man page for OPAL_PCI_SET_PHB_CAPI_MODE  Vaibhav Jain  1  -0/+74
We add a man page describing the opal call OPAL_PCI_SET_PHB_CAPI_MODE used for activating/deactivating CAPP attached to a PEC for CAPI 1 & 2. Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> [stewart: nitpicks that Andrew pointed out in review] Signed-off-by: Stewart Smith <stewart@linux.ibm.com> (cherry picked from commit 5690c5a8980faf9e528df65dd95535e21c2c868f) Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-07-19  phb4: Reallocate PEC2 DMA-Read engines to improve GPU-Direct bandwidth  Vaibhav Jain  2  -3/+39
We reallocate additional 16/8 DMA-Read engines allocated to stack0/1 on PEC2 respectively. This is needed to improve the bandwidth available to the Mellanox CX5 adapter when trying to read GPU memory (GPU-Direct). If the kernel cxl driver indicates a request to allocate the maximum possible DMA read engines when calling enable_capi_mode() and the card is attached to a PEC2/stack0 slot then we assume it's a Mellanox CX5 adapter. We then allocate 16/8 extra DMA read engines to stack0 and stack1 respectively on PEC2. This is done by populating the XPEC_PCI_PRDSTKOVR and XPEC_NEST_READ_STACK_OVERRIDE as suggested by the h/w team. Signed-off-by: Christophe Lombard <clombard@linux.vnet.ibm.com> Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> (cherry picked from commit 3754dba77ef5a4d72dc579e789c0a7b06af02160) Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-07-18  phb4: Disable nodal scoped DMA accesses when PB pump mode is enabled  Alistair Popple  2  -0/+13
By default when a PCIe device issues a read request via the PHB it is first issued with nodal scope. When accessing GPU memory the NPU does not know at the time of response if the requested memory page is off node or not. Therefore every read of GPU memory by a PHB is retried with larger scope which introduces bandwidth and latency issues. On smaller boxes which have pump mode enabled nodal and group scoped reads are treated the same and both types of request are broadcast to one chip. Therefore we can avoid the retry by disabling nodal scope on the PHB for these boxes. On larger boxes nodal (single chip) and group (multiple chip) scoped reads are treated differently. Therefore we avoid disabling nodal scope on large boxes which have pump mode disabled to avoid all PHB requests being broadcast to multiple chips. Signed-off-by: Alistair Popple <alistair@popple.id.au> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> (cherry picked from commit 68518e542e6f7adfe4e97ac22024970ac2400872) Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-07-18  Move pb_cen_hp_mode_curr register definition to xscom-p9-reg.h  Alistair Popple  3  -2/+5
Currently it is defined in npu2-regs.h but needs to be used by other files as well so move it somewhere generic. Signed-off-by: Alistair Popple <alistair@popple.id.au> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> (cherry picked from commit b8702e2c69638f9cab818e76232af3481935e250) Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-07-17  npu2/hw-procedures: Enable parity and credit overflow checks  Reza Arbab  3  -1/+14
Enable these error checking features by setting the appropriate bits in our one-off initialization of each "NTL Misc Config 2" register. The exception is NDL RX parity checking, which should be disabled during the link training procedures. Signed-off-by: Reza Arbab <arbab@linux.ibm.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> (cherry picked from commit 041d69bb1a7084778d63a846d109c148c7a0009a) Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-07-17  npu2/hw-procedures: Don't open code NPU2_NTL_MISC_CFG2_BRICK_ENABLE  Reza Arbab  2  -6/+8
Name this bit properly. There's a lot more cleanup like this to be done, but I'm catching this one now as part of some related changes. Signed-off-by: Reza Arbab <arbab@linux.ibm.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> (cherry picked from commit c2493fd0ce30dd4204cf4cec2e9c4496201a0cf1) Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-07-11  skiboot 6.0.5 release notes  v6.0.5  Stewart Smith  1  -0/+118
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-07-11  phb4/capp: Calculate STQ/DMA read engines based on link-width for PEC  Vaibhav Jain  2  -9/+33
Presently in CAPI mode the number of STQ/DMA-read engines allocated on PEC2 for CAPP is fixed to 6 and 0-30 respectively irrespective of the PCI link width. These values are only suitable for x8 cards and quickly run out if a x16 card is plugged to a PEC2 attached slot. This usually manifests as CAPP reporting TLBI timeout due to these messages getting stalled due to insufficient STQs. To fix this we update enable_capi_mode() to check if PEC2 chiplet is in x16 mode and if yes then we allocate 4/0-47 STQ/DMA-read engines for the CAPP traffic. Cc: stable # v5.7+ Fixes: 37ea3cfdc852("capi: Enable capi mode for PHB4") Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> (cherry picked from commit 47c09cdfe7a34843387c968ce75cea8dc578ab91) Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-07-11  occ: sensors: Fix the size of the phandle array 'sensors' in DT  Shilpasri G Bhat  1  -2/+2
Fixes: 99505c03f493 ("sensor-groups: occ: Add support to disable/enable sensor group") Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> (cherry picked from commit d6de8fe73b88f92d6a222905e1974ec73777d5e5) Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-07-10  capi: Select the correct IODA table entry for the mbt cache.  Christophe Lombard  1  -9/+9
With the current code, the capi mmio window is not correctly configured in the IODA table entry. The first entry (generally the non-prefetchable BAR) is overwritten. This patch sets the capi window bar at the right place. Signed-off-by: Christophe Lombard <clombard@linux.vnet.ibm.com> Reviewed-by: Vaibhav Jain <vaibhav@linux.ibm.com> Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Acked-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> (cherry picked from commit 98182a960c5ffd53eed139668e686bc5af6e2e5f) Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-07-10  npu2/hw-procedures: Fence bricks via NTL instead of MISC  Reza Arbab  1  -24/+7
There are a couple of places we can set/unset fence for a brick:
1. MISC register: NPU2_MISC_FENCE_STATE
2. NTL register for the brick: NPU2_NTL_MISC_CFG1(ndev)
Recent testing of ATS in combination with GPU reset has exposed a side effect of using (1); if fence is set for all six bricks, it triggers a sticky nmmu latch which prevents the NPU from getting ATR responses. This manifests as a hang in the tests. We have npu2_dev_fence_brick() which uses (1), and only two calls to it. Replace the call which sets fence with a write to (2). Remove the corresponding unset call entirely. It's unneeded because the procedures already do a progression from full fence to half to idle using (2). Signed-off-by: Reza Arbab <arbab@linux.ibm.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> (cherry picked from commit 5ff8763c9b0421d8de0f4346ca211c853d2406d4) Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-07-10  phb4: Delay training till after PERST is deasserted  Michael Neuling  1  -0/+14
This helps some cards train on the second PERST (ie fast-reboot). The reason is not clear why but it helps, so YOLO! Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> (cherry picked from commit 9078f8268922b44c3b0f2cd44f567b9389073142) Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-07-10  phb4: Move training trace logging to next state.  Michael Neuling  1  -2/+2
I'm going to defer training to this state soon, so move the tracing first. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> (cherry picked from commit efc4020a32fbb199c58ada9315d64a175162d066) Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-07-10  phb4: Minimise wait when moving through FRESET states  Michael Neuling  1  -1/+1
We want to get through this as fast as possible so minimise by removing msecs_to_tb() call. Changes number passed from 512 -> 1. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> (cherry picked from commit da05882b8e6e146b5b4121b1e177c4aea47de8f2) Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-06-29  vpd: Add vendor property to processor node  Vasant Hegde  3  -0/+19
Processor FRU vpd doesn't contain vendor detail. We have to parse module VPD to get vendor detail. Reported-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> (cherry picked from commit 861350941f9a3fb76ebcae3e5a32b3cbec929d03) Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-06-29  vpd: Sanitize VPD data  Vasant Hegde  3  -24/+51
On OpenPower systems, the VPD keyword size tells us the maximum size of the data. But they fill the trailing end with spaces (0x20) instead of NUL. Also, the spec doesn't prevent users from having spaces (0x20) within the actual data. This patch discards trailing spaces before populating the device tree. Reported-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> [stewart: fixup make check] Signed-off-by: Stewart Smith <stewart@linux.ibm.com> (cherry picked from commit 77f510d35e8d60faed989496fac2de16663ff332) Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
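Discarding the trailing padding while keeping interior spaces can be sketched as a length computation over the fixed-size field. The function name is hypothetical and this is not the helper the patch adds.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the length of a fixed-size VPD field once trailing spaces
 * (0x20) are discarded. Spaces inside the data are kept as-is.
 */
size_t ex_vpd_trimmed_len(const char *data, size_t len)
{
	assert(data);
	while (len > 0 && data[len - 1] == ' ')
		len--;
	return len;
}
```

A caller would then copy only the first ex_vpd_trimmed_len() bytes into the property and append the terminating NUL itself, since the field is not NUL-terminated.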
2018-06-22  test/qemu: skip qemu test if 'old' qemu without PCR  Stewart Smith  2  -0/+14
3d019581c98153 introduced clearing PCR on reinit cpus, and until (the near future from now) qemu didn't support this register. Signed-off-by: Stewart Smith <stewart@linux.ibm.com> (cherry picked from commit 021f6f39b9bfb60e1cda5432bfe6430cb9adfed7) Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-06-19  core: Add test for PCI quirks  Andrew Jeffery  3  -4/+75
Ensure that quirks are run (or not) for given PCI vendor and device IDs. This tests the quirk infrastructure and the PCI_VENDOR_ID() and PCI_DEVICE_ID() macros, the latter of which was recently found to be broken. Signed-off-by: Andrew Jeffery <andrew@aj.id.au> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> (cherry picked from commit dc24a1fd61e0b3fbceb1027b2c458bde0257fb38) Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-06-19  pci: Fix PCI_DEVICE_ID()  Andrew Jeffery  1  -1/+1
The vendor ID is 16 bits not 8. This error leaves the top of the vendor ID in the bottom bits of the device ID, which resulted in e.g. a failure to run the PCI quirk for the AST VGA device. Fixes: 2b841bf0ef1b ("core/pci: Use cached vendor/device IDs in quirks") Signed-off-by: Andrew Jeffery <andrew@aj.id.au> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> (cherry picked from commit 50dfd067835af53736ec8106b1f0e99339b54a81) Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
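The effect of shifting by the wrong width can be shown with a pair of extraction macros. This sketch assumes the cached ID word holds the vendor ID in bits 0-15 and the device ID in bits 16-31; the EX_ macro names are hypothetical and are not skiboot's definitions.

```c
#include <stdint.h>

/* Assumed layout: vendor ID in bits 0-15, device ID in bits 16-31. */
#define EX_VENDOR_ID(x)		((uint16_t)((x) & 0xffff))
#define EX_DEVICE_ID(x)		((uint16_t)(((x) >> 16) & 0xffff))

/* Shifting by only 8 leaves the top byte of the vendor ID in the result. */
#define EX_DEVICE_ID_SHIFT8(x)	((uint16_t)(((x) >> 8) & 0xffff))
```

For a word built from vendor 0x1a03 and device 0x2000, the 8-bit shift yields 0x001a instead of 0x2000, so a quirk keyed on the real device ID never matches.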
2018-06-19  NX: Add NX coprocessor init opal call  Haren Myneni  5  -1/+106
The read offset (4:11) in the Receive FIFO control register is incremented by the FIFO size whenever a CRB is read by NX. But the index in RxFIFO has to match the corresponding entry in the FIFO maintained by VAS in the kernel. The VAS entry is reset to 0 when opening the receive window during driver initialization. So when NX842 is reloaded, or on a kexec boot, there is a possibility of a mismatch between the RxFIFO control register and the VAS entries in the kernel. It could cause CRB failure / timeout from NX. This patch adds the nx_coproc_init opal call for the kernel to initialize readOffset (4:11) and Queued (15:23) in the RxFIFO control register. Fixes: 3b3c5962f432 ("NX: Add P9 NX support for 842 compression engine") CC: stable # v5.8+ Signed-off-by: Haren Myneni <haren@us.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> (cherry picked from commit 56026a13292453b072ad3cc9adf3dee960077f38) Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-06-05  opal/hmi: Display correct chip id while printing NPU FIRs.  Mahesh Salgaonkar  1  -4/+4
HMIs for NPU xstops are broadcasted to all chips. All cores on all the chips receive HMI. HMI handler correctly identifies and extracts the NPU FIR details from affected chip, but while printing FIR data it prints chip id and location code details of this_cpu()->chip_id which may not be correct. This patch fixes this issue. CC: stable # v6.0+ Fixes: 7bcbc78c ("Add location code to NPU2 HMI logging") Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> [stewart: add fixes and cc stable] Signed-off-by: Stewart Smith <stewart@linux.ibm.com> (cherry picked from commit fa82d360a73a5e52131d8a19e9fdb9d6e9c2eeb9) Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-05-28  skiboot 6.0.4 release notes  v6.0.4  Stewart Smith  1  -0/+55
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-05-28  SLW: Remove stop1_lite and stop2_lite  Akshay Adiga  1  -28/+8
stop1_lite has been removed since it adds no additional benefit over stop0_lite. stop2_lite has been removed since currently it adds minimal benefit over stop2. However, the benefit is eclipsed by the time required to ungate the clocks. Moreover, Lite states don't give up the SMT resources, which can potentially have a performance impact on sibling threads. Since current OSs (Linux) aren't smart enough to make good decisions with these stop states, we're (temporarily) removing them from what we expose to the OS, the idea being to bring them back in a new DT representation so that only an OS that knows what to do will do things with them. Signed-off-by: Akshay Adiga <akshay.adiga@linux.vnet.ibm.com> Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> [stewart: add to explanation] Signed-off-by: Stewart Smith <stewart@linux.ibm.com> (cherry picked from commit 34e9c3c1edb3eed02f428f9cbf97d99b3db43d4d) Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-05-28  opal-prd: Do not error out on first failure for soft/hard offline.  Mahesh Salgaonkar  1  -3/+3
The memory errors (CEs and UEs) that are detected as part of background memory scrubbing are reported by PRD asynchronously to opal-prd along with the affected memory ranges. hservice_memory_error() converts these ranges into page granularity before hooking them up to the soft/hard offline-ing infrastructure. But the current implementation of hservice_memory_error() does not hook up all the pages to soft/hard offline-ing if any of the page offline actions fails. For example, hard offline can fail for: - Pages that are not part of the buddy managed pool. - Pages that are reserved by the kernel using memblock_reserved() - Pages that are in use by the kernel. But for the pages that are in use by a user space application, the hard offline marks the page as hwpoison, sends a SIGBUS signal to kill the affected application as a recovery action and returns success. Hence, it is possible that some of the pages in that memory range are in use by an application or free. By stopping on the first error we lose the opportunity to hwpoison the subsequent pages which may be free or in use by an application. This patch fixes this issue. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> (cherry picked from commit e9ee7c7d357160a704c8248a1787124f94df8c54) Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
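The change in behaviour amounts to recording a failure and carrying on rather than returning from the loop. A minimal sketch under stated assumptions: the callback type, `ex_offline_range()` and the demo callback are all hypothetical, and the real hservice_memory_error() is not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical callback: returns 0 on success, non-zero on failure. */
typedef int (*ex_offline_fn)(uint64_t page_addr);

/*
 * Try to offline every page in [start, end), stepping by page_size.
 * A failure on one page is remembered but does not stop the loop.
 * Returns 0 if every page succeeded, otherwise the last error seen.
 */
int ex_offline_range(uint64_t start, uint64_t end, uint64_t page_size,
		     ex_offline_fn offline)
{
	int rc = 0;

	assert(page_size > 0 && offline);

	for (uint64_t addr = start; addr < end; addr += page_size) {
		int ret = offline(addr);

		if (ret)
			rc = ret;	/* record the error, then keep going */
	}
	return rc;
}

/* Demo callback for illustration: counts calls and fails on one page. */
unsigned int ex_demo_calls;

int ex_demo_offline(uint64_t addr)
{
	ex_demo_calls++;
	return addr == 0x11000 ? -1 : 0;
}
```

With the demo callback, a three-page range still visits all three pages even though the second one fails, and the caller still learns that something went wrong from the return value.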
2018-05-23  skiboot 6.0.3 release notes  v6.0.3  Stewart Smith  1  -0/+53
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-05-23  p8-i2c: Remove force reset  Oliver O'Halloran  1  -135/+38
Force reset was added as an attempt to work around some issues with TPM devices locking up their I2C bus. In that particular case the problem was that the device would hold the SCL line down permanently due to a device firmware bug. The force reset doesn't actually do anything to alleviate the situation here, it just happens to reset the internal master state enough to make the I2C driver appear to work until something tries to access the bus again. On P9 systems with secure boot enabled there is the added problem of the "diagnostic mode" not being supported on I2C masters A, B, C and D. Diagnostic mode allows the SCL and SDA lines to be driven directly by software. Without this, force reset is impossible to implement. This patch removes the force reset functionality entirely since: a) it doesn't do what it's supposed to, and b) it's butt ugly code Additionally, turn p8_i2c_reset_engine() into p8_i2c_reset_port(). There's no need to reset every port on a master in response to an error that occurred on a specific port. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> (cherry picked from commit 49656a181133013d0b436db8052e23895ad4ff11) Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-05-23  libstb/i2c-driver: Bump max timeout  Oliver O'Halloran  1  -1/+2
We have observed some TPMs clock stretching the I2C bus for significant amounts of time when processing commands. The same TPMs also have errata that can result in permanently locking up a bus in response to an I2C transaction they don't understand. Use an excessively long timeout to prevent this in the field. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> (cherry picked from commit 81d52fb22cc95cc1a8fc1001dbac361843da2662) Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-05-23  hdata: Add TPM timeout workaround  Oliver O'Halloran  1  -0/+10
Set the default timeout for any bus containing a TPM to one second. This is needed to work around a bug in the firmware of certain TPMs that will clock stretch the I2C port for up to a second. Additionally, when the TPM is clock stretching it responds to a STOP condition on the bus by bricking itself. Clearing this error requires a hard power cycle of the system since the TPM is powered by standby power. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> (cherry picked from commit 3668dc88a1bdfc087ab7d329eabde5ac0086f9bf) Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-05-23  p8-i2c: Allow a per-port default timeout  Oliver O'Halloran  1  -7/+13
Add support for setting a default timeout for the I2C port to the device-tree. This is consumed by skiboot. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com> (cherry picked from commit ac6059026442f0da98293f800aa002271d579097) Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
2018-05-18  skiboot 6.0.2 release notes  v6.0.2  Stewart Smith  1  -0/+23
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>