author     Alexey Kardashevskiy <aik@ozlabs.ru>    2019-08-19 16:17:48 +1000
committer  Oliver O'Halloran <oohall@gmail.com>    2019-08-23 16:50:47 +1000
commit     2a0455ba0f7784b2d7e9e3915fd30f815afd2ae1 (patch)
tree       b37925ca8bbeefc49e632b1d8524de24d5aab097 /hw/npu2.c
parent     e2018d2a3d46491dc2abd758c67c1937910b3a67 (diff)
npu2: Invalidate entire TCE cache if many entries requested
It turned out that invalidating entries in the NPU TCE cache is so slow that it
becomes visible when running a 30+GB guest with a GPU+NVLink2 passed
through; a 100GB guest takes about 20s to map all 100GB.
This change falls through to invalidating the entire cache when 128 or more
TCEs are requested to be invalidated, which reduces the 20s above to
less than 1s. The KVM change [1] is required to see this difference.
The threshold of 128 was chosen in an attempt not to affect performance much,
as it is not clear how expensive it is to populate the TCE cache again;
all we know for sure is that mapping the guest produces invalidation
requests of 512 TCEs each.
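
As a rough illustration of the thresholding idea, here is a minimal standalone
C sketch. kill_one_tce(), kill_all_tce() and TCE_KILL_THRESHOLD are
hypothetical stand-ins invented for this sketch, not skiboot's actual helpers;
the real code writes the NPU2_ATS_TCE_KILL register directly, as the diff
below shows.

#include <stdint.h>

/*
 * Hypothetical stand-ins for the MMIO writes the real code performs
 * via npu2_write() on the NPU2_ATS_TCE_KILL register.
 */
void kill_one_tce(uint64_t pe_number, uint64_t dma_addr);
void kill_all_tce(void);

/*
 * Above this many entries a single whole-cache flush is assumed to be
 * cheaper than per-entry kills. 128 mirrors the threshold this patch
 * picks; guest mappings arrive in bursts of 512 TCEs, so they always
 * take the flush path.
 */
#define TCE_KILL_THRESHOLD	128

static void invalidate_tce_range(uint64_t pe_number, uint64_t dma_addr,
				 uint64_t tce_size, uint64_t npages)
{
	if (npages >= TCE_KILL_THRESHOLD) {
		/* One flush instead of npages MMIO writes. */
		kill_all_tce();
		return;
	}

	while (npages--) {
		kill_one_tce(pe_number, dma_addr);
		dma_addr += tce_size;
	}
}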
Note TCE cache invalidation in PHB4 is faster and does not require
the same workaround.
[1] KVM: PPC: vfio/spapr_tce: Split out TCE invalidation from TCE updates
https://patchwork.ozlabs.org/patch/1149003/
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Alistair Popple <alistair@popple.id.au>
Diffstat (limited to 'hw/npu2.c')
-rw-r--r--  hw/npu2.c  17
1 file changed, 12 insertions, 5 deletions
@@ -1257,12 +1257,19 @@ static int64_t npu2_tce_kill(struct phb *phb, uint32_t kill_type,
 			return OPAL_PARAMETER;
 		}
 
-		while (npages--) {
-			val = SETFIELD(NPU2_ATS_TCE_KILL_PENUM, dma_addr, pe_number);
-			npu2_write(npu, NPU2_ATS_TCE_KILL, NPU2_ATS_TCE_KILL_ONE | val);
-			dma_addr += tce_size;
+		if (npages < 128) {
+			while (npages--) {
+				val = SETFIELD(NPU2_ATS_TCE_KILL_PENUM, dma_addr, pe_number);
+				npu2_write(npu, NPU2_ATS_TCE_KILL, NPU2_ATS_TCE_KILL_ONE | val);
+				dma_addr += tce_size;
+			}
+			break;
 		}
-		break;
+		/*
+		 * For too many TCEs do not bother with the loop above and simply
+		 * flush everything, going to be lot faster.
+		 */
+		/* Fall through */
 	case OPAL_PCI_TCE_KILL_PE:
 		/*
 		 * NPU2 doesn't support killing a PE so fall through
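
For context, the fall-through above lands in the kill-all handling further
down the same switch. A sketch of that path, assuming a NPU2_ATS_TCE_KILL_ALL
flag analogous to the NPU2_ATS_TCE_KILL_ONE flag used above (reconstructed for
illustration, not quoted from npu2.c):

	case OPAL_PCI_TCE_KILL_PE:
		/*
		 * NPU2 doesn't support killing a PE so fall through
		 * and invalidate the whole cache instead.
		 */
	case OPAL_PCI_TCE_KILL_ALL:
		/* A single MMIO write flushes the entire TCE cache. */
		npu2_write(npu, NPU2_ATS_TCE_KILL, NPU2_ATS_TCE_KILL_ALL);
		break;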