From 78722ed0b826644ae240e3c0bbb6bdde02dfe7e1 Mon Sep 17 00:00:00 2001 From: "Emilio G. Cota" Date: Wed, 26 Jul 2017 20:15:41 -0400 Subject: translate-all: make l1_map lockless MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Groundwork for supporting parallel TCG generation. We never remove entries from the radix tree, so we can use cmpxchg to implement lockless insertions. Reviewed-by: Richard Henderson Reviewed-by: Alex Bennée Signed-off-by: Emilio G. Cota Signed-off-by: Richard Henderson --- docs/devel/multi-thread-tcg.txt | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'docs/devel') diff --git a/docs/devel/multi-thread-tcg.txt b/docs/devel/multi-thread-tcg.txt index a99b456..faf8918 100644 --- a/docs/devel/multi-thread-tcg.txt +++ b/docs/devel/multi-thread-tcg.txt @@ -134,8 +134,8 @@ tb_set_jmp_target() code. Modification to the linked lists that allow searching for linked pages are done under the protect of the tb_lock(). -The global page table is protected by the tb_lock() in system-mode and -mmap_lock() in linux-user mode. +The global page table is a lockless radix tree; cmpxchg is used +to atomically insert new elements. The lookup caches are updated atomically and the lookup hash uses QHT which is designed for concurrent safe lookup. -- cgit v1.1 From 95590e24af11236ef334f6bc3e2b71404a790ddb Mon Sep 17 00:00:00 2001 From: "Emilio G. Cota" Date: Tue, 1 Aug 2017 15:40:16 -0400 Subject: translate-all: discard TB when tb_link_page returns an existing matching TB Use the recently-gained QHT feature of returning the matching TB if it already exists. This allows us to get rid of the lookup we perform right after acquiring tb_lock. Suggested-by: Richard Henderson Reviewed-by: Richard Henderson Signed-off-by: Emilio G. Cota Signed-off-by: Richard Henderson --- docs/devel/multi-thread-tcg.txt | 3 +++ 1 file changed, 3 insertions(+) (limited to 'docs/devel') diff --git a/docs/devel/multi-thread-tcg.txt b/docs/devel/multi-thread-tcg.txt index faf8918..faf09c6 100644 --- a/docs/devel/multi-thread-tcg.txt +++ b/docs/devel/multi-thread-tcg.txt @@ -140,6 +140,9 @@ to atomically insert new elements. The lookup caches are updated atomically and the lookup hash uses QHT which is designed for concurrent safe lookup. +Parallel code generation is supported. QHT is used at insertion time +as the synchronization point across threads, thereby ensuring that we only +keep track of a single TranslationBlock for each guest code block. Memory maps and TLBs -------------------- -- cgit v1.1 From 194125e3ebd553acb02aaf3797a4f0387493fe94 Mon Sep 17 00:00:00 2001 From: "Emilio G. Cota" Date: Wed, 2 Aug 2017 20:34:06 -0400 Subject: translate-all: protect TB jumps with a per-destination-TB lock MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit This applies to both user-mode and !user-mode emulation. Instead of relying on a global lock, protect the list of incoming jumps with tb->jmp_lock. This lock also protects tb->cflags, so update all tb->cflags readers outside tb->jmp_lock to use atomic reads via tb_cflags(). In order to find the destination TB (and therefore its jmp_lock) from the origin TB, we introduce tb->jmp_dest[]. I considered not using a linked list of jumps, which simplifies code and makes the struct smaller. However, it unnecessarily increases memory usage, which results in a performance decrease. See for instance these numbers booting+shutting down debian-arm: Time (s) Rel. err (%) Abs. err (s) Rel. slowdown (%) ------------------------------------------------------------------------------ before 20.88 0.74 0.154512 0. after 20.81 0.38 0.079078 -0.33524904 GTree 21.02 0.28 0.058856 0.67049808 GHashTable + xxhash 21.63 1.08 0.233604 3.5919540 Using a hash table or a binary tree to keep track of the jumps doesn't really pay off, not only due to the increased memory usage, but also because most TBs have only 0 or 1 jumps to them. The maximum number of jumps when booting debian-arm that I measured is 35, but as we can see in the histogram below a TB with that many incoming jumps is extremely rare; the average TB has 0.80 incoming jumps. n_jumps: 379208; avg jumps/tb: 0.801099 dist: [0.0,1.0)|▄█▁▁▁▁▁▁▁▁▁▁▁ ▁▁▁▁▁▁ ▁▁▁ ▁▁▁ ▁|[34.0,35.0] Reviewed-by: Richard Henderson Signed-off-by: Emilio G. Cota Signed-off-by: Richard Henderson --- docs/devel/multi-thread-tcg.txt | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) (limited to 'docs/devel') diff --git a/docs/devel/multi-thread-tcg.txt b/docs/devel/multi-thread-tcg.txt index faf09c6..df83445 100644 --- a/docs/devel/multi-thread-tcg.txt +++ b/docs/devel/multi-thread-tcg.txt @@ -131,8 +131,10 @@ DESIGN REQUIREMENT: Safely handle invalidation of TBs The direct jump themselves are updated atomically by the TCG tb_set_jmp_target() code. Modification to the linked lists that allow -searching for linked pages are done under the protect of the -tb_lock(). +searching for linked pages are done under the protection of tb->jmp_lock, +where tb is the destination block of a jump. Each origin block keeps a +pointer to its destinations so that the appropriate lock can be acquired before +iterating over a jump list. The global page table is a lockless radix tree; cmpxchg is used to atomically insert new elements. -- cgit v1.1 From 0ac20318ce16f4de288969b2007ef5a654176058 Mon Sep 17 00:00:00 2001 From: "Emilio G. Cota" Date: Fri, 4 Aug 2017 23:46:31 -0400 Subject: tcg: remove tb_lock MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Use mmap_lock in user-mode to protect TCG state and the page descriptors. In !user-mode, each vCPU has its own TCG state, so no locks needed. Per-page locks are used to protect the page descriptors. Per-TB locks are used in both modes to protect TB jumps. Some notes: - tb_lock is removed from notdirty_mem_write by passing a locked page_collection to tb_invalidate_phys_page_fast. - tcg_tb_lookup/remove/insert/etc have their own internal lock(s), so there is no need to further serialize access to them. - do_tb_flush is run in a safe async context, meaning no other vCPU threads are running. Therefore acquiring mmap_lock there is just to please tools such as thread sanitizer. - Not visible in the diff, but tb_invalidate_phys_page already has an assert_memory_lock. - cpu_io_recompile is !user-only, so no mmap_lock there. - Added mmap_unlock()'s before all siglongjmp's that could be called in user-mode while mmap_lock is held. + Added an assert for !have_mmap_lock() after returning from the longjmp in cpu_exec, just like we do in cpu_exec_step_atomic. Performance numbers before/after: Host: AMD Opteron(tm) Processor 6376 ubuntu 17.04 ppc64 bootup+shutdown time 700 +-+--+----+------+------------+-----------+------------*--+-+ | + + + + + *B | | before ***B*** ** * | |tb lock removal ###D### *** | 600 +-+ *** +-+ | ** # | | *B* #D | | *** * ## | 500 +-+ *** ### +-+ | * *** ### | | *B* # ## | | ** * #D# | 400 +-+ ** ## +-+ | ** ### | | ** ## | | ** # ## | 300 +-+ * B* #D# +-+ | B *** ### | | * ** #### | | * *** ### | 200 +-+ B *B #D# +-+ | #B* * ## # | | #* ## | | + D##D# + + + + | 100 +-+--+----+------+------------+-----------+------------+--+-+ 1 8 16 Guest CPUs 48 64 png: https://imgur.com/HwmBHXe debian jessie aarch64 bootup+shutdown time 90 +-+--+-----+-----+------------+------------+------------+--+-+ | + + + + + + | | before ***B*** B | 80 +tb lock removal ###D### **D +-+ | **### | | **## | 70 +-+ ** # +-+ | ** ## | | ** # | 60 +-+ *B ## +-+ | ** ## | | *** #D | 50 +-+ *** ## +-+ | * ** ### | | **B* ### | 40 +-+ **** # ## +-+ | **** #D# | | ***B** ### | 30 +-+ B***B** #### +-+ | B * * # ### | | B ###D# | 20 +-+ D ##D## +-+ | D# | | + + + + + + | 10 +-+--+-----+-----+------------+------------+------------+--+-+ 1 8 16 Guest CPUs 48 64 png: https://imgur.com/iGpGFtv The gains are high for 4-8 CPUs. Beyond that point, however, unrelated lock contention significantly hurts scalability. Reviewed-by: Richard Henderson Reviewed-by: Alex Bennée Signed-off-by: Emilio G. Cota Signed-off-by: Richard Henderson --- docs/devel/multi-thread-tcg.txt | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) (limited to 'docs/devel') diff --git a/docs/devel/multi-thread-tcg.txt b/docs/devel/multi-thread-tcg.txt index df83445..06530be 100644 --- a/docs/devel/multi-thread-tcg.txt +++ b/docs/devel/multi-thread-tcg.txt @@ -61,6 +61,7 @@ have their block-to-block jumps patched. Global TCG State ---------------- +### User-mode emulation We need to protect the entire code generation cycle including any post generation patching of the translated code. This also implies a shared translation buffer which contains code running on all cores. Any @@ -75,9 +76,11 @@ patching. (Current solution) -Mainly as part of the linux-user work all code generation is -serialised with a tb_lock(). For the SoftMMU tb_lock() also takes the -place of mmap_lock() in linux-user. +Code generation is serialised with mmap_lock(). + +### !User-mode emulation +Each vCPU has its own TCG context and associated TCG region, thereby +requiring no locking. Translation Blocks ------------------ @@ -195,7 +198,7 @@ work as "safe work" and exiting the cpu run loop. This ensure by the time execution restarts all flush operations have completed. TLB flag updates are all done atomically and are also protected by the -tb_lock() which is used by the functions that update the TLB in bulk. +corresponding page lock. (Known limitation) -- cgit v1.1