From 78722ed0b826644ae240e3c0bbb6bdde02dfe7e1 Mon Sep 17 00:00:00 2001
From: "Emilio G. Cota" <cota@braap.org>
Date: Wed, 26 Jul 2017 20:15:41 -0400
Subject: translate-all: make l1_map lockless
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Groundwork for supporting parallel TCG generation.

We never remove entries from the radix tree, so we can use cmpxchg
to implement lockless insertions.

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 docs/devel/multi-thread-tcg.txt | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

(limited to 'docs/devel')

diff --git a/docs/devel/multi-thread-tcg.txt b/docs/devel/multi-thread-tcg.txt
index a99b456..faf8918 100644
--- a/docs/devel/multi-thread-tcg.txt
+++ b/docs/devel/multi-thread-tcg.txt
@@ -134,8 +134,8 @@ tb_set_jmp_target() code. Modification to the linked lists that allow
 searching for linked pages are done under the protect of the
 tb_lock().
 
-The global page table is protected by the tb_lock() in system-mode and
-mmap_lock() in linux-user mode.
+The global page table is a lockless radix tree; cmpxchg is used
+to atomically insert new elements.
 
 The lookup caches are updated atomically and the lookup hash uses QHT
 which is designed for concurrent safe lookup.
-- 
cgit v1.1


From 95590e24af11236ef334f6bc3e2b71404a790ddb Mon Sep 17 00:00:00 2001
From: "Emilio G. Cota" <cota@braap.org>
Date: Tue, 1 Aug 2017 15:40:16 -0400
Subject: translate-all: discard TB when tb_link_page returns an existing
 matching TB

Use the recently-gained QHT feature of returning the matching TB if it
already exists. This allows us to get rid of the lookup we perform
right after acquiring tb_lock.

Suggested-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 docs/devel/multi-thread-tcg.txt | 3 +++
 1 file changed, 3 insertions(+)

(limited to 'docs/devel')

diff --git a/docs/devel/multi-thread-tcg.txt b/docs/devel/multi-thread-tcg.txt
index faf8918..faf09c6 100644
--- a/docs/devel/multi-thread-tcg.txt
+++ b/docs/devel/multi-thread-tcg.txt
@@ -140,6 +140,9 @@ to atomically insert new elements.
 The lookup caches are updated atomically and the lookup hash uses QHT
 which is designed for concurrent safe lookup.
 
+Parallel code generation is supported. QHT is used at insertion time
+as the synchronization point across threads, thereby ensuring that we only
+keep track of a single TranslationBlock for each guest code block.
 
 Memory maps and TLBs
 --------------------
-- 
cgit v1.1


From 194125e3ebd553acb02aaf3797a4f0387493fe94 Mon Sep 17 00:00:00 2001
From: "Emilio G. Cota" <cota@braap.org>
Date: Wed, 2 Aug 2017 20:34:06 -0400
Subject: translate-all: protect TB jumps with a per-destination-TB lock
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This applies to both user-mode and !user-mode emulation.

Instead of relying on a global lock, protect the list of incoming
jumps with tb->jmp_lock. This lock also protects tb->cflags,
so update all tb->cflags readers outside tb->jmp_lock to use
atomic reads via tb_cflags().

In order to find the destination TB (and therefore its jmp_lock)
from the origin TB, we introduce tb->jmp_dest[].

I considered not using a linked list of jumps, which simplifies
code and makes the struct smaller. However, it unnecessarily increases
memory usage, which results in a performance decrease. See for
instance these numbers booting+shutting down debian-arm:
                      Time (s)  Rel. err (%)  Abs. err (s)  Rel. slowdown (%)
------------------------------------------------------------------------------
 before                  20.88          0.74      0.154512                 0.
 after                   20.81          0.38      0.079078        -0.33524904
 GTree                   21.02          0.28      0.058856         0.67049808
 GHashTable + xxhash     21.63          1.08      0.233604          3.5919540

Using a hash table or a binary tree to keep track of the jumps
doesn't really pay off, not only due to the increased memory usage,
but also because most TBs have only 0 or 1 jumps to them. The maximum
number of jumps when booting debian-arm that I measured is 35, but
as we can see in the histogram below a TB with that many incoming jumps
is extremely rare; the average TB has 0.80 incoming jumps.

n_jumps: 379208; avg jumps/tb: 0.801099
dist: [0.0,1.0)|▄█▁▁▁▁▁▁▁▁▁▁▁ ▁▁▁▁▁▁ ▁▁▁  ▁▁▁     ▁|[34.0,35.0]

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 docs/devel/multi-thread-tcg.txt | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

(limited to 'docs/devel')

diff --git a/docs/devel/multi-thread-tcg.txt b/docs/devel/multi-thread-tcg.txt
index faf09c6..df83445 100644
--- a/docs/devel/multi-thread-tcg.txt
+++ b/docs/devel/multi-thread-tcg.txt
@@ -131,8 +131,10 @@ DESIGN REQUIREMENT: Safely handle invalidation of TBs
 
 The direct jump themselves are updated atomically by the TCG
 tb_set_jmp_target() code. Modification to the linked lists that allow
-searching for linked pages are done under the protect of the
-tb_lock().
+searching for linked pages are done under the protection of tb->jmp_lock,
+where tb is the destination block of a jump. Each origin block keeps a
+pointer to its destinations so that the appropriate lock can be acquired before
+iterating over a jump list.
 
 The global page table is a lockless radix tree; cmpxchg is used
 to atomically insert new elements.
-- 
cgit v1.1


From 0ac20318ce16f4de288969b2007ef5a654176058 Mon Sep 17 00:00:00 2001
From: "Emilio G. Cota" <cota@braap.org>
Date: Fri, 4 Aug 2017 23:46:31 -0400
Subject: tcg: remove tb_lock
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Use mmap_lock in user-mode to protect TCG state and the page descriptors.
In !user-mode, each vCPU has its own TCG state, so no locks needed.
Per-page locks are used to protect the page descriptors.

Per-TB locks are used in both modes to protect TB jumps.

Some notes:

- tb_lock is removed from notdirty_mem_write by passing a
  locked page_collection to tb_invalidate_phys_page_fast.

- tcg_tb_lookup/remove/insert/etc have their own internal lock(s),
  so there is no need to further serialize access to them.

- do_tb_flush is run in a safe async context, meaning no other
  vCPU threads are running. Therefore acquiring mmap_lock there
  is just to please tools such as thread sanitizer.

- Not visible in the diff, but tb_invalidate_phys_page already
  has an assert_memory_lock.

- cpu_io_recompile is !user-only, so no mmap_lock there.

- Added mmap_unlock()'s before all siglongjmp's that could
  be called in user-mode while mmap_lock is held.
  + Added an assert for !have_mmap_lock() after returning from
    the longjmp in cpu_exec, just like we do in cpu_exec_step_atomic.

Performance numbers before/after:

Host: AMD Opteron(tm) Processor 6376

                 ubuntu 17.04 ppc64 bootup+shutdown time

  700 +-+--+----+------+------------+-----------+------------*--+-+
      |    +    +      +            +           +           *B    |
      |         before ***B***                            ** *    |
      |tb lock removal ###D###                         ***        |
  600 +-+                                           ***         +-+
      |                                           **         #    |
      |                                        *B*          #D    |
      |                                     *** *         ##      |
  500 +-+                                ***           ###      +-+
      |                             * ***           ###           |
      |                            *B*          # ##              |
      |                          ** *          #D#                |
  400 +-+                      **            ##                 +-+
      |                      **           ###                     |
      |                    **           ##                        |
      |                  **         # ##                          |
  300 +-+  *           B*          #D#                          +-+
      |    B         ***        ###                               |
      |    *       **       ####                                  |
      |     *   ***      ###                                      |
  200 +-+   B  *B     #D#                                       +-+
      |     #B* *   ## #                                          |
      |     #*    ##                                              |
      |    + D##D#     +            +           +            +    |
  100 +-+--+----+------+------------+-----------+------------+--+-+
           1    8      16      Guest CPUs       48           64
  png: https://imgur.com/HwmBHXe

              debian jessie aarch64 bootup+shutdown time

  90 +-+--+-----+-----+------------+------------+------------+--+-+
     |    +     +     +            +            +            +    |
     |         before ***B***                                B    |
  80 +tb lock removal ###D###                              **D  +-+
     |                                                   **###    |
     |                                                 **##       |
  70 +-+                                             ** #       +-+
     |                                             ** ##          |
     |                                           **  #            |
  60 +-+                                       *B  ##           +-+
     |                                       **  ##               |
     |                                    ***  #D                 |
  50 +-+                               ***   ##                 +-+
     |                             * **   ###                     |
     |                           **B*  ###                        |
  40 +-+                     ****  # ##                         +-+
     |                   ****     #D#                             |
     |             ***B**      ###                                |
  30 +-+    B***B**        ####                                 +-+
     |    B *   *     # ###                                       |
     |     B       ###D#                                          |
  20 +-+   D  ##D##                                             +-+
     |      D#                                                    |
     |    +     +     +            +            +            +    |
  10 +-+--+-----+-----+------------+------------+------------+--+-+
          1     8     16      Guest CPUs        48           64
  png: https://imgur.com/iGpGFtv

The gains are high for 4-8 CPUs. Beyond that point, however, unrelated
lock contention significantly hurts scalability.

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 docs/devel/multi-thread-tcg.txt | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

(limited to 'docs/devel')

diff --git a/docs/devel/multi-thread-tcg.txt b/docs/devel/multi-thread-tcg.txt
index df83445..06530be 100644
--- a/docs/devel/multi-thread-tcg.txt
+++ b/docs/devel/multi-thread-tcg.txt
@@ -61,6 +61,7 @@ have their block-to-block jumps patched.
 Global TCG State
 ----------------
 
+### User-mode emulation
 We need to protect the entire code generation cycle including any post
 generation patching of the translated code. This also implies a shared
 translation buffer which contains code running on all cores. Any
@@ -75,9 +76,11 @@ patching.
 
 (Current solution)
 
-Mainly as part of the linux-user work all code generation is
-serialised with a tb_lock(). For the SoftMMU tb_lock() also takes the
-place of mmap_lock() in linux-user.
+Code generation is serialised with mmap_lock().
+
+### !User-mode emulation
+Each vCPU has its own TCG context and associated TCG region, thereby
+requiring no locking.
 
 Translation Blocks
 ------------------
@@ -195,7 +198,7 @@ work as "safe work" and exiting the cpu run loop. This ensure by the
 time execution restarts all flush operations have completed.
 
 TLB flag updates are all done atomically and are also protected by the
-tb_lock() which is used by the functions that update the TLB in bulk.
+corresponding page lock.
 
 (Known limitation)
 
-- 
cgit v1.1