1 files changed, 884 insertions, 0 deletions
diff --git a/libgo/go/math/big/natdiv.go b/libgo/go/math/big/natdiv.go
new file mode 100644
index 0000000..882bb6d
--- /dev/null
+++ b/libgo/go/math/big/natdiv.go
@@ -0,0 +1,884 @@
+// Copyright 2009 The Go Authors. All rights reserved.
+// Use of this source code is governed by a BSD-style
+// license that can be found in the LICENSE file.
+
+/*
+
+Multi-precision division. Here be dragons.
+
+Given u and v, where u is n+m digits, and v is n digits (with no leading zeros),
+the goal is to return quo, rem such that u = quo*v + rem, where 0 ≤ rem < v.
+That is, quo = ⌊u/v⌋ where ⌊x⌋ denotes the floor (truncation to integer) of x,
+and rem = u - quo·v.
+
+
+Long Division
+
+Division in a computer proceeds the same as long division in elementary school,
+but computers are not as good as schoolchildren at following vague directions,
+so we have to be much more precise about the actual steps and what can happen.
+
+We work from most to least significant digit of the quotient, doing:
+
+ • Guess a digit q, the number of v to subtract from the current
+   section of u to zero out the topmost digit.
+ • Check the guess by multiplying q·v and comparing it against
+   the current section of u, adjusting the guess as needed.
+ • Subtract q·v from the current section of u.
+ • Add q to the corresponding section of the result quo.
+
+When all digits have been processed, the final remainder is left in u
+and returned as rem.
+
+For example, here is a sketch of dividing 5 digits by 3 digits (n=3, m=2).
+
+	                 q₂ q₁ q₀
+	         _________________
+	v₂ v₁ v₀ ) u₄ u₃ u₂ u₁ u₀
+	           ↓  ↓  ↓  |  |
+	          [u₄ u₃ u₂]|  |
+	        - [  q₂·v  ]|  |
+	        ----------- ↓  |
+	          [  rem  | u₁]|
+	        - [    q₁·v   ]|
+	           ----------- ↓
+	             [  rem  | u₀]
+	           - [    q₀·v   ]
+	              ------------
+	                [  rem   ]
+
+Instead of creating new storage for the remainders and copying digits from u
+as indicated by the arrows, we use u's storage directly as both the source
+and destination of the subtractions, so that the remainders overwrite
+successive overlapping sections of u as the division proceeds, using a slice
+of u to identify the current section. This avoids all the copying as well as
+shifting of remainders.
+
+Division of u with n+m digits by v with n digits (in base B) can in general
+produce at most m+1 digits, because:
+
+  • u < B^(n+m)               [B^(n+m) has n+m+1 digits]
+  • v ≥ B^(n-1)               [B^(n-1) is the smallest n-digit number]
+  • u/v < B^(n+m) / B^(n-1)   [divide bounds for u, v]
+  • u/v < B^(m+1)             [simplify]
+
+The first step is special: it takes the top n digits of u and divides them by
+the n digits of v, producing the first quotient digit and an n-digit remainder.
+In the example, q₂ = ⌊u₄u₃u₂ / v⌋.
+
+The first step divides n digits by n digits to ensure that it produces only a
+single digit.
+
+Each subsequent step appends the next digit from u to the remainder and divides
+those n+1 digits by the n digits of v, producing another quotient digit and a
+new n-digit remainder.
+
+Subsequent steps divide n+1 digits by n digits, an operation that in general
+might produce two digits. However, as used in the algorithm, that division is
+guaranteed to produce only a single digit. The dividend is of the form
+rem·B + d, where rem is a remainder from the previous step and d is a single
+digit, so:
+
+ • rem ≤ v - 1                 [rem is a remainder from dividing by v]
+ • rem·B ≤ v·B - B             [multiply by B]
+ • d ≤ B - 1                   [d is a single digit]
+ • rem·B + d ≤ v·B - 1         [add]
+ • rem·B + d < v·B             [change ≤ to <]
+ • (rem·B + d)/v < B           [divide by v]
+
+
+Guess and Check
+
+At each step we need to divide n+1 digits by n digits, but this is for the
+implementation of division by n digits, so we can't just invoke a division
+routine: we _are_ the division routine. Instead, we guess at the answer and
+then check it using multiplication. If the guess is wrong, we correct it.
+
+How can this guessing possibly be efficient? It turns out that the following
+statement (let's call it the Good Guess Guarantee) is true.
+
+If
+
+ • q = ⌊u/v⌋ where u is n+1 digits and v is n digits,
+ • q < B, and
+ • the topmost digit of v = vₙ₋₁ ≥ B/2,
+
+then q̂ = ⌊uₙuₙ₋₁ / vₙ₋₁⌋ satisfies q ≤ q̂ ≤ q+2. (Proof below.)
+
+That is, if we know the answer has only a single digit and we guess an answer
+by ignoring the bottom n-1 digits of u and v, using a 2-by-1-digit division,
+then that guess is at least as large as the correct answer. It is also not
+too much larger: it is off by at most two from the correct answer.
+
+Note that in the first step of the overall division, which is an n-by-n-digit
+division, the 2-by-1 guess uses an implicit uₙ = 0.
+
+Note that using a 2-by-1-digit division here does not mean calling ourselves
+recursively. Instead, we use an efficient direct hardware implementation of
+that operation.
+
+Note that because q is u/v rounded down, q·v must not exceed u: u ≥ q·v.
+If a guess q̂ is too big, it will not satisfy this test. Viewed a different way,
+the remainder r̂ for a given q̂ is u - q̂·v, which must be positive. If it is
+negative, then the guess q̂ is too big.
+
+This gives us a way to compute q. First compute q̂ with 2-by-1-digit division.
+Then, while u < q̂·v, decrement q̂; this loop executes at most twice, because
+q̂ ≤ q+2.
+
+
+Scaling Inputs
+
+The Good Guess Guarantee requires that the top digit of v (vₙ₋₁) be at least B/2.
+For example in base 10, ⌊172/19⌋ = 9, but ⌊18/1⌋ = 18: the guess is wildly off
+because the first digit 1 is smaller than B/2 = 5.
+
+We can ensure that v has a large top digit by multiplying both u and v by the
+right amount. Continuing the example, if we multiply both 172 and 19 by 3, we
+now have ⌊516/57⌋, the leading digit of v is now ≥ 5, and sure enough
+⌊51/5⌋ = 10 is much closer to the correct answer 9. It would be easier here
+to multiply by 4, because that can be done with a shift. Specifically, we can
+always count the number of leading zeros i in the first digit of v and then
+shift both u and v left by i bits.
+
+Having scaled u and v, the value ⌊u/v⌋ is unchanged, but the remainder will
+be scaled: 172 mod 19 is 1, but 516 mod 57 is 3. We have to divide the remainder
+by the scaling factor (shifting right i bits) when we finish.
+
+Note that these shifts happen before and after the entire division algorithm,
+not at each step in the per-digit iteration.
+
+Note the effect of scaling inputs on the size of the possible quotient.
+In the scaled u/v, u can gain a digit from scaling; v never does, because we
+pick the scaling factor to make v's top digit larger but without overflowing.
+If u and v have n+m and n digits after scaling, then:
+
+  • u < B^(n+m)               [B^(n+m) has n+m+1 digits]
+  • v ≥ B^n / 2               [vₙ₋₁ ≥ B/2, so vₙ₋₁·B^(n-1) ≥ B^n/2]
+  • u/v < B^(n+m) / (B^n / 2) [divide bounds for u, v]
+  • u/v < 2 B^m               [simplify]
+
+The quotient can still have m+1 significant digits, but if so the top digit
+must be a 1. This provides a different way to handle the first digit of the
+result: compare the top n digits of u against v and fill in either a 0 or a 1.
+
+
+Refining Guesses
+
+Before we check whether u < q̂·v, we can adjust our guess to change it from
+q̂ = ⌊uₙuₙ₋₁ / vₙ₋₁⌋ into the refined guess ⌊uₙuₙ₋₁uₙ₋₂ / vₙ₋₁vₙ₋₂⌋.
+Although not mentioned above, the Good Guess Guarantee also promises that this
+3-by-2-digit division guess is more precise and at most one away from the real
+answer q. The improvement from the 2-by-1 to the 3-by-2 guess can also be done
+without n-digit math.
+
+If we have a guess q̂ = ⌊uₙuₙ₋₁ / vₙ₋₁⌋ and we want to see if it also equal to
+⌊uₙuₙ₋₁uₙ₋₂ / vₙ₋₁vₙ₋₂⌋, we can use the same check we would for the full division:
+if uₙuₙ₋₁uₙ₋₂ < q̂·vₙ₋₁vₙ₋₂, then the guess is too large and should be reduced.
+
+Checking uₙuₙ₋₁uₙ₋₂ < q̂·vₙ₋₁vₙ₋₂ is the same as uₙuₙ₋₁uₙ₋₂ - q̂·vₙ₋₁vₙ₋₂ < 0,
+and
+
+	uₙuₙ₋₁uₙ₋₂ - q̂·vₙ₋₁vₙ₋₂ = (uₙuₙ₋₁·B + uₙ₋₂) - q̂·(vₙ₋₁·B + vₙ₋₂)
+	                          [splitting off the bottom digit]
+	                      = (uₙuₙ₋₁ - q̂·vₙ₋₁)·B + uₙ₋₂ - q̂·vₙ₋₂
+	                          [regrouping]
+
+The expression (uₙuₙ₋₁ - q̂·vₙ₋₁) is the remainder of uₙuₙ₋₁ / vₙ₋₁.
+If the initial guess returns both q̂ and its remainder r̂, then checking
+whether uₙuₙ₋₁uₙ₋₂ < q̂·vₙ₋₁vₙ₋₂ is the same as checking r̂·B + uₙ₋₂ < q̂·vₙ₋₂.
+
+If we find that r̂·B + uₙ₋₂ < q̂·vₙ₋₂, then we can adjust the guess by
+decrementing q̂ and adding vₙ₋₁ to r̂. We repeat until r̂·B + uₙ₋₂ ≥ q̂·vₙ₋₂.
+(As before, this fixup is only needed at most twice.)
+
+Now that q̂ = ⌊uₙuₙ₋₁uₙ₋₂ / vₙ₋₁vₙ₋₂⌋, as mentioned above it is at most one
+away from the correct q, and we've avoided doing any n-digit math.
+(If we need the new remainder, it can be computed as r̂·B + uₙ₋₂ - q̂·vₙ₋₂.)
+
+The final check u < q̂·v and the possible fixup must be done at full precision.
+For random inputs, a fixup at this step is exceedingly rare: the 3-by-2 guess
+is not often wrong at all. But still we must do the check. Note that since the
+3-by-2 guess is off by at most 1, it can be convenient to perform the final
+u < q̂·v as part of the computation of the remainder r = u - q̂·v. If the
+subtraction underflows, decremeting q̂ and adding one v back to r is enough to
+arrive at the final q, r.
+
+That's the entirety of long division: scale the inputs, and then loop over
+each output position, guessing, checking, and correcting the next output digit.
+
+For a 2n-digit number divided by an n-digit number (the worst size-n case for
+division complexity), this algorithm uses n+1 iterations, each of which must do
+at least the 1-by-n-digit multiplication q̂·v. That's O(n) iterations of
+O(n) time each, so O(n²) time overall.
+
+
+Recursive Division
+
+For very large inputs, it is possible to improve on the O(n²) algorithm.
+Let's call a group of n/2 real digits a (very) “wide digit”. We can run the
+standard long division algorithm explained above over the wide digits instead of
+the actual digits. This will result in many fewer steps, but the math involved in
+each step is more work.
+
+Where basic long division uses a 2-by-1-digit division to guess the initial q̂,
+the new algorithm must use a 2-by-1-wide-digit division, which is of course
+really an n-by-n/2-digit division. That's OK: if we implement n-digit division
+in terms of n/2-digit division, the recursion will terminate when the divisor
+becomes small enough to handle with standard long division or even with the
+2-by-1 hardware instruction.
+
+For example, here is a sketch of dividing 10 digits by 4, proceeding with
+wide digits corresponding to two regular digits. The first step, still special,
+must leave off a (regular) digit, dividing 5 by 4 and producing a 4-digit
+remainder less than v. The middle steps divide 6 digits by 4, guaranteed to
+produce two output digits each (one wide digit) with 4-digit remainders.
+The final step must use what it has: the 4-digit remainder plus one more,
+5 digits to divide by 4.
+
+	                       q₆ q₅ q₄ q₃ q₂ q₁ q₀
+	            _______________________________
+	v₃ v₂ v₁ v₀ ) u₉ u₈ u₇ u₆ u₅ u₄ u₃ u₂ u₁ u₀
+	              ↓  ↓  ↓  ↓  ↓  |  |  |  |  |
+	             [u₉ u₈ u₇ u₆ u₅]|  |  |  |  |
+	           - [    q₆q₅·v    ]|  |  |  |  |
+	           ----------------- ↓  ↓  |  |  |
+	                [    rem    |u₄ u₃]|  |  |
+	              - [     q₄q₃·v      ]|  |  |
+	              -------------------- ↓  ↓  |
+	                      [    rem    |u₂ u₁]|
+	                    - [     q₂q₁·v      ]|
+	                    -------------------- ↓
+	                            [    rem    |u₀]
+	                          - [     q₀·v     ]
+	                          ------------------
+	                               [    rem    ]
+
+An alternative would be to look ahead to how well n/2 divides into n+m and
+adjust the first step to use fewer digits as needed, making the first step
+more special to make the last step not special at all. For example, using the
+same input, we could choose to use only 4 digits in the first step, leaving
+a full wide digit for the last step:
+
+	                       q₆ q₅ q₄ q₃ q₂ q₁ q₀
+	            _______________________________
+	v₃ v₂ v₁ v₀ ) u₉ u₈ u₇ u₆ u₅ u₄ u₃ u₂ u₁ u₀
+	              ↓  ↓  ↓  ↓  |  |  |  |  |  |
+	             [u₉ u₈ u₇ u₆]|  |  |  |  |  |
+	           - [    q₆·v   ]|  |  |  |  |  |
+	           -------------- ↓  ↓  |  |  |  |
+	             [    rem    |u₅ u₄]|  |  |  |
+	           - [     q₅q₄·v      ]|  |  |  |
+	           -------------------- ↓  ↓  |  |
+	                   [    rem    |u₃ u₂]|  |
+	                 - [     q₃q₂·v      ]|  |
+	                 -------------------- ↓  ↓
+	                         [    rem    |u₁ u₀]
+	                       - [     q₁q₀·v      ]
+	                       ---------------------
+	                               [    rem    ]
+
+Today, the code in divRecursiveStep works like the first example. Perhaps in
+the future we will make it work like the alternative, to avoid a special case
+in the final iteration.
+
+Either way, each step is a 3-by-2-wide-digit division approximated first by
+a 2-by-1-wide-digit division, just as we did for regular digits in long division.
+Because the actual answer we want is a 3-by-2-wide-digit division, instead of
+multiplying q̂·v directly during the fixup, we can use the quick refinement
+from long division (an n/2-by-n/2 multiply) to correct q to its actual value
+and also compute the remainder (as mentioned above), and then stop after that,
+never doing a full n-by-n multiply.
+
+Instead of using an n-by-n/2-digit division to produce n/2 digits, we can add
+(not discard) one more real digit, doing an (n+1)-by-(n/2+1)-digit division that
+produces n/2+1 digits. That single extra digit tightens the Good Guess Guarantee
+to q ≤ q̂ ≤ q+1 and lets us drop long division's special treatment of the first
+digit. These benefits are discussed more after the Good Guess Guarantee proof
+below.
+
+
+How Fast is Recursive Division?
+
+For a 2n-by-n-digit division, this algorithm runs a 4-by-2 long division over
+wide digits, producing two wide digits plus a possible leading regular digit 1,
+which can be handled without a recursive call. That is, the algorithm uses two
+full iterations, each using an n-by-n/2-digit division and an n/2-by-n/2-digit
+multiplication, along with a few n-digit additions and subtractions. The standard
+n-by-n-digit multiplication algorithm requires O(n²) time, making the overall
+algorithm require time T(n) where
+
+	T(n) = 2T(n/2) + O(n) + O(n²)
+
+which, by the Bentley-Haken-Saxe theorem, ends up reducing to T(n) = O(n²).
+This is not an improvement over regular long division.
+
+When the number of digits n becomes large enough, Karatsuba's algorithm for
+multiplication can be used instead, which takes O(n^log₂3) = O(n^1.6) time.
+(Karatsuba multiplication is implemented in func karatsuba in nat.go.)
+That makes the overall recursive division algorithm take O(n^1.6) time as well,
+which is an improvement, but again only for large enough numbers.
+
+It is not critical to make sure that every recursion does only two recursive
+calls. While in general the number of recursive calls can change the time
+analysis, in this case doing three calls does not change the analysis:
+
+	T(n) = 3T(n/2) + O(n) + O(n^log₂3)
+
+ends up being T(n) = O(n^log₂3). Because the Karatsuba multiplication taking
+time O(n^log₂3) is itself doing 3 half-sized recursions, doing three for the
+division does not hurt the asymptotic performance. Of course, it is likely
+still faster in practice to do two.
+
+
+Proof of the Good Guess Guarantee
+
+Given numbers x, y, let us break them into the quotients and remainders when
+divided by some scaling factor S, with the added constraints that the quotient
+x/y and the high part of y are both less than some limit T, and that the high
+part of y is at least half as big as T.
+
+	x₁ = ⌊x/S⌋        y₁ = ⌊y/S⌋
+	x₀ = x mod S      y₀ = y mod S
+
+	x  = x₁·S + x₀    0 ≤ x₀ < S    x/y < T
+	y  = y₁·S + y₀    0 ≤ y₀ < S    T/2 ≤ y₁ < T
+
+And consider the two truncated quotients:
+
+	q = ⌊x/y⌋
+	q̂ = ⌊x₁/y₁⌋
+
+We will prove that q ≤ q̂ ≤ q+2.
+
+The guarantee makes no real demands on the scaling factor S: it is simply the
+magnitude of the digits cut from both x and y to produce x₁ and y₁.
+The guarantee makes only limited demands on T: it must be large enough to hold
+the quotient x/y, and y₁ must have roughly the same size.
+
+To apply to the earlier discussion of 2-by-1 guesses in long division,
+we would choose:
+
+	S  = Bⁿ⁻¹
+	T  = B
+	x  = u
+	x₁ = uₙuₙ₋₁
+	x₀ = uₙ₋₂...u₀
+	y  = v
+	y₁ = vₙ₋₁
+	y₀ = vₙ₋₂...u₀
+
+These simpler variables avoid repeating those longer expressions in the proof.
+
+Note also that, by definition, truncating division ⌊x/y⌋ satisfies
+
+	x/y - 1 < ⌊x/y⌋ ≤ x/y.
+
+This fact will be used a few times in the proofs.
+
+Proof that q ≤ q̂:
+
+	q̂·y₁ = ⌊x₁/y₁⌋·y₁                      [by definition, q̂ = ⌊x₁/y₁⌋]
+	     > (x₁/y₁ - 1)·y₁                  [x₁/y₁ - 1 < ⌊x₁/y₁⌋]
+	     = x₁ - y₁                         [distribute y₁]
+
+	So q̂·y₁ > x₁ - y₁.
+	Since q̂·y₁ is an integer, q̂·y₁ ≥ x₁ - y₁ + 1.
+
+	q̂ - q = q̂ - ⌊x/y⌋                      [by definition, q = ⌊x/y⌋]
+	      ≥ q̂ - x/y                        [⌊x/y⌋ < x/y]
+	      = (1/y)·(q̂·y - x)                [factor out 1/y]
+	      ≥ (1/y)·(q̂·y₁·S - x)             [y = y₁·S + y₀ ≥ y₁·S]
+	      ≥ (1/y)·((x₁ - y₁ + 1)·S - x)    [above: q̂·y₁ ≥ x₁ - y₁ + 1]
+	      = (1/y)·(x₁·S - y₁·S + S - x)    [distribute S]
+	      = (1/y)·(S - x₀ - y₁·S)          [-x = -x₁·S - x₀]
+	      > -y₁·S / y                      [x₀ < S, so S - x₀ < 0; drop it]
+	      ≥ -1                             [y₁·S ≤ y]
+
+	So q̂ - q > -1.
+	Since q̂ - q is an integer, q̂ - q ≥ 0, or equivalently q ≤ q̂.
+
+Proof that q̂ ≤ q+2:
+
+	x₁/y₁ - x/y = x₁·S/y₁·S - x/y          [multiply left term by S/S]
+	            ≤ x/y₁·S - x/y             [x₁S ≤ x]
+	            = (x/y)·(y/y₁·S - 1)       [factor out x/y]
+	            = (x/y)·((y - y₁·S)/y₁·S)  [move -1 into y/y₁·S fraction]
+	            = (x/y)·(y₀/y₁·S)          [y - y₁·S = y₀]
+	            = (x/y)·(1/y₁)·(y₀/S)      [factor out 1/y₁]
+	            < (x/y)·(1/y₁)             [y₀ < S, so y₀/S < 1]
+	            ≤ (x/y)·(2/T)              [y₁ ≥ T/2, so 1/y₁ ≤ 2/T]
+	            < T·(2/T)                  [x/y < T]
+	            = 2                        [T·(2/T) = 2]
+
+	So x₁/y₁ - x/y < 2.
+
+	q̂ - q = ⌊x₁/y₁⌋ - q                    [by definition, q̂ = ⌊x₁/y₁⌋]
+	      = ⌊x₁/y₁⌋ - ⌊x/y⌋                [by definition, q = ⌊x/y⌋]
+	      ≤ x₁/y₁ - ⌊x/y⌋                  [⌊x₁/y₁⌋ ≤ x₁/y₁]
+	      < x₁/y₁ - (x/y - 1)              [⌊x/y⌋ > x/y - 1]
+	      = (x₁/y₁ - x/y) + 1              [regrouping]
+	      < 2 + 1                          [above: x₁/y₁ - x/y < 2]
+	      = 3
+
+	So q̂ - q < 3.
+	Since q̂ - q is an integer, q̂ - q ≤ 2.
+
+Note that when x/y < T/2, the bounds tighten to x₁/y₁ - x/y < 1 and therefore
+q̂ - q ≤ 1.
+
+Note also that in the general case 2n-by-n division where we don't know that
+x/y < T, we do know that x/y < 2T, yielding the bound q̂ - q ≤ 4. So we could
+remove the special case first step of long division as long as we allow the
+first fixup loop to run up to four times. (Using a simple comparison to decide
+whether the first digit is 0 or 1 is still more efficient, though.)
+
+Finally, note that when dividing three leading base-B digits by two (scaled),
+we have T = B² and x/y < B = T/B, a much tighter bound than x/y < T.
+This in turn yields the much tighter bound x₁/y₁ - x/y < 2/B. This means that
+⌊x₁/y₁⌋ and ⌊x/y⌋ can only differ when x/y is less than 2/B greater than an
+integer. For random x and y, the chance of this is 2/B, or, for large B,
+approximately zero. This means that after we produce the 3-by-2 guess in the
+long division algorithm, the fixup loop essentially never runs.
+
+In the recursive algorithm, the extra digit in (2·⌊n/2⌋+1)-by-(⌊n/2⌋+1)-digit
+division has exactly the same effect: the probability of needing a fixup is the
+same 2/B. Even better, we can allow the general case x/y < 2T and the fixup
+probability only grows to 4/B, still essentially zero.
+
+
+References
+
+There are no great references for implementing long division; thus this comment.
+Here are some notes about what to expect from the obvious references.
+
+Knuth Volume 2 (Seminumerical Algorithms) section 4.3.1 is the usual canonical
+reference for long division, but that entire series is highly compressed, never
+repeating a necessary fact and leaving important insights to the exercises.
+For example, no rationale whatsoever is given for the calculation that extends
+q̂ from a 2-by-1 to a 3-by-2 guess, nor why it reduces the error bound.
+The proof that the calculation even has the desired effect is left to exercises.
+The solutions to those exercises provided at the back of the book are entirely
+calculations, still with no explanation as to what is going on or how you would
+arrive at the idea of doing those exact calculations. Nowhere is it mentioned
+that this test extends the 2-by-1 guess into a 3-by-2 guess. The proof of the
+Good Guess Guarantee is only for the 2-by-1 guess and argues by contradiction,
+making it difficult to understand how modifications like adding another digit
+or adjusting the quotient range affects the overall bound.
+
+All that said, Knuth remains the canonical reference. It is dense but packed
+full of information and references, and the proofs are simpler than many other
+presentations. The proofs above are reworkings of Knuth's to remove the
+arguments by contradiction and add explanations or steps that Knuth omitted.
+But beware of errors in older printings. Take the published errata with you.
+
+Brinch Hansen's “Multiple-length Division Revisited: a Tour of the Minefield”
+starts with a blunt critique of Knuth's presentation (among others) and then
+presents a more detailed and easier to follow treatment of long division,
+including an implementation in Pascal. But the algorithm and implementation
+work entirely in terms of 3-by-2 division, which is much less useful on modern
+hardware than an algorithm using 2-by-1 division. The proofs are a bit too
+focused on digit counting and seem needlessly complex, especially compared to
+the ones given above.
+
+Burnikel and Ziegler's “Fast Recursive Division” introduced the key insight of
+implementing division by an n-digit divisor using recursive calls to division
+by an n/2-digit divisor, relying on Karatsuba multiplication to yield a
+sub-quadratic run time. However, the presentation decisions are made almost
+entirely for the purpose of simplifying the run-time analysis, rather than
+simplifying the presentation. Instead of a single algorithm that loops over
+quotient digits, the paper presents two mutually-recursive algorithms, for
+2n-by-n and 3n-by-2n. The paper also does not present any general (n+m)-by-n
+algorithm.
+
+The proofs in the paper are remarkably complex, especially considering that
+the algorithm is at its core just long division on wide digits, so that the
+usual long division proofs apply essentially unaltered.
+*/
+
+package big
+
+import "math/bits"
+
+// div returns q, r such that q = ⌊u/v⌋ and r = u%v = u - q·v.
+// It uses z and z2 as the storage for q and r.
+func (z nat) div(z2, u, v nat) (q, r nat) {
+	if len(v) == 0 {
+		panic("division by zero")
+	}
+
+	if u.cmp(v) < 0 {
+		q = z[:0]
+		r = z2.set(u)
+		return
+	}
+
+	if len(v) == 1 {
+		// Short division: long optimized for a single-word divisor.
+		// In that case, the 2-by-1 guess is all we need at each step.
+		var r2 Word
+		q, r2 = z.divW(u, v[0])
+		r = z2.setWord(r2)
+		return
+	}
+
+	q, r = z.divLarge(z2, u, v)
+	return
+}
+
+// divW returns q, r such that q = ⌊x/y⌋ and r = x%y = x - q·y.
+// It uses z as the storage for q.
+// Note that y is a single digit (Word), not a big number.
+func (z nat) divW(x nat, y Word) (q nat, r Word) {
+	m := len(x)
+	switch {
+	case y == 0:
+		panic("division by zero")
+	case y == 1:
+		q = z.set(x) // result is x
+		return
+	case m == 0:
+		q = z[:0] // result is 0
+		return
+	}
+	// m > 0
+	z = z.make(m)
+	r = divWVW(z, 0, x, y)
+	q = z.norm()
+	return
+}
+
+// modW returns x % d.
+func (x nat) modW(d Word) (r Word) {
+	// TODO(agl): we don't actually need to store the q value.
+	var q nat
+	q = q.make(len(x))
+	return divWVW(q, 0, x, d)
+}
+
+// divWVW overwrites z with ⌊x/y⌋, returning the remainder r.
+// The caller must ensure that len(z) = len(x).
+func divWVW(z []Word, xn Word, x []Word, y Word) (r Word) {
+	r = xn
+	if len(x) == 1 {
+		qq, rr := bits.Div(uint(r), uint(x[0]), uint(y))
+		z[0] = Word(qq)
+		return Word(rr)
+	}
+	rec := reciprocalWord(y)
+	for i := len(z) - 1; i >= 0; i-- {
+		z[i], r = divWW(r, x[i], y, rec)
+	}
+	return r
+}
+
+// div returns q, r such that q = ⌊uIn/vIn⌋ and r = uIn%vIn = uIn - q·vIn.
+// It uses z and u as the storage for q and r.
+// The caller must ensure that len(vIn) ≥ 2 (use divW otherwise)
+// and that len(uIn) ≥ len(vIn) (the answer is 0, uIn otherwise).
+func (z nat) divLarge(u, uIn, vIn nat) (q, r nat) {
+	n := len(vIn)
+	m := len(uIn) - n
+
+	// Scale the inputs so vIn's top bit is 1 (see “Scaling Inputs” above).
+	// vIn is treated as a read-only input (it may be in use by another
+	// goroutine), so we must make a copy.
+	// uIn is copied to u.
+	shift := nlz(vIn[n-1])
+	vp := getNat(n)
+	v := *vp
+	shlVU(v, vIn, shift)
+	u = u.make(len(uIn) + 1)
+	u[len(uIn)] = shlVU(u[0:len(uIn)], uIn, shift)
+
+	// The caller should not pass aliased z and u, since those are
+	// the two different outputs, but correct just in case.
+	if alias(z, u) {
+		z = nil
+	}
+	q = z.make(m + 1)
+
+	// Use basic or recursive long division depending on size.
+	if n < divRecursiveThreshold {
+		q.divBasic(u, v)
+	} else {
+		q.divRecursive(u, v)
+	}
+	putNat(vp)
+
+	q = q.norm()
+
+	// Undo scaling of remainder.
+	shrVU(u, u, shift)
+	r = u.norm()
+
+	return q, r
+}
+
+// divBasic implements long division as described above.
+// It overwrites q with ⌊u/v⌋ and overwrites u with the remainder r.
+// q must be large enough to hold ⌊u/v⌋.
+func (q nat) divBasic(u, v nat) {
+	n := len(v)
+	m := len(u) - n
+
+	qhatvp := getNat(n + 1)
+	qhatv := *qhatvp
+
+	// Set up for divWW below, precomputing reciprocal argument.
+	vn1 := v[n-1]
+	rec := reciprocalWord(vn1)
+
+	// Compute each digit of quotient.
+	for j := m; j >= 0; j-- {
+		// Compute the 2-by-1 guess q̂.
+		// The first iteration must invent a leading 0 for u.
+		qhat := Word(_M)
+		var ujn Word
+		if j+n < len(u) {
+			ujn = u[j+n]
+		}
+
+		// ujn ≤ vn1, or else q̂ would be more than one digit.
+		// For ujn == vn1, we set q̂ to the max digit M above.
+		// Otherwise, we compute the 2-by-1 guess.
+		if ujn != vn1 {
+			var rhat Word
+			qhat, rhat = divWW(ujn, u[j+n-1], vn1, rec)
+
+			// Refine q̂ to a 3-by-2 guess. See “Refining Guesses” above.
+			vn2 := v[n-2]
+			x1, x2 := mulWW(qhat, vn2)
+			ujn2 := u[j+n-2]
+			for greaterThan(x1, x2, rhat, ujn2) { // x1x2 > r̂ u[j+n-2]
+				qhat--
+				prevRhat := rhat
+				rhat += vn1
+				// If r̂  overflows, then
+				// r̂ u[j+n-2]v[n-1] is now definitely > x1 x2.
+				if rhat < prevRhat {
+					break
+				}
+				// TODO(rsc): No need for a full mulWW.
+				// x2 += vn2; if x2 overflows, x1++
+				x1, x2 = mulWW(qhat, vn2)
+			}
+		}
+
+		// Compute q̂·v.
+		qhatv[n] = mulAddVWW(qhatv[0:n], v, qhat, 0)
+		qhl := len(qhatv)
+		if j+qhl > len(u) && qhatv[n] == 0 {
+			qhl--
+		}
+
+		// Subtract q̂·v from the current section of u.
+		// If it underflows, q̂·v > u, which we fix up
+		// by decrementing q̂ and adding v back.
+		c := subVV(u[j:j+qhl], u[j:], qhatv)
+		if c != 0 {
+			c := addVV(u[j:j+n], u[j:], v)
+			// If n == qhl, the carry from subVV and the carry from addVV
+			// cancel out and don't affect u[j+n].
+			if n < qhl {
+				u[j+n] += c
+			}
+			qhat--
+		}
+
+		// Save quotient digit.
+		// Caller may know the top digit is zero and not leave room for it.
+		if j == m && m == len(q) && qhat == 0 {
+			continue
+		}
+		q[j] = qhat
+	}
+
+	putNat(qhatvp)
+}
+
+// greaterThan reports whether the two digit numbers x1 x2 > y1 y2.
+// TODO(rsc): In contradiction to most of this file, x1 is the high
+// digit and x2 is the low digit. This should be fixed.
+func greaterThan(x1, x2, y1, y2 Word) bool {
+	return x1 > y1 || x1 == y1 && x2 > y2
+}
+
+// divRecursiveThreshold is the number of divisor digits
+// at which point divRecursive is faster than divBasic.
+const divRecursiveThreshold = 100
+
+// divRecursive implements recursive division as described above.
+// It overwrites z with ⌊u/v⌋ and overwrites u with the remainder r.
+// z must be large enough to hold ⌊u/v⌋.
+// This function is just for allocating and freeing temporaries
+// around divRecursiveStep, the real implementation.
+func (z nat) divRecursive(u, v nat) {
+	// Recursion depth is (much) less than 2 log₂(len(v)).
+	// Allocate a slice of temporaries to be reused across recursion,
+	// plus one extra temporary not live across the recursion.
+	recDepth := 2 * bits.Len(uint(len(v)))
+	tmp := getNat(3 * len(v))
+	temps := make([]*nat, recDepth)
+
+	z.clear()
+	z.divRecursiveStep(u, v, 0, tmp, temps)
+
+	// Free temporaries.
+	for _, n := range temps {
+		if n != nil {
+			putNat(n)
+		}
+	}
+	putNat(tmp)
+}
+
+// divRecursiveStep is the actual implementation of recursive division.
+// It adds ⌊u/v⌋ to z and overwrites u with the remainder r.
+// z must be large enough to hold ⌊u/v⌋.
+// It uses temps[depth] (allocating if needed) as a temporary live across
+// the recursive call. It also uses tmp, but not live across the recursion.
+func (z nat) divRecursiveStep(u, v nat, depth int, tmp *nat, temps []*nat) {
+	// u is a subsection of the original and may have leading zeros.
+	// TODO(rsc): The v = v.norm() is useless and should be removed.
+	// We know (and require) that v's top digit is ≥ B/2.
+	u = u.norm()
+	v = v.norm()
+	if len(u) == 0 {
+		z.clear()
+		return
+	}
+
+	// Fall back to basic division if the problem is now small enough.
+	n := len(v)
+	if n < divRecursiveThreshold {
+		z.divBasic(u, v)
+		return
+	}
+
+	// Nothing to do if u is shorter than v (implies u < v).
+	m := len(u) - n
+	if m < 0 {
+		return
+	}
+
+	// We consider B digits in a row as a single wide digit.
+	// (See “Recursive Division” above.)
+	//
+	// TODO(rsc): rename B to Wide, to avoid confusion with _B,
+	// which is something entirely different.
+	// TODO(rsc): Look into whether using ⌈n/2⌉ is better than ⌊n/2⌋.
+	B := n / 2
+
+	// Allocate a nat for qhat below.
+	if temps[depth] == nil {
+		temps[depth] = getNat(n) // TODO(rsc): Can be just B+1.
+	} else {
+		*temps[depth] = temps[depth].make(B + 1)
+	}
+
+	// Compute each wide digit of the quotient.
+	//
+	// TODO(rsc): Change the loop to be
+	//	for j := (m+B-1)/B*B; j > 0; j -= B {
+	// which will make the final step a regular step, letting us
+	// delete what amounts to an extra copy of the loop body below.
+	j := m
+	for j > B {
+		// Divide u[j-B:j+n] (3 wide digits) by v (2 wide digits).
+		// First make the 2-by-1-wide-digit guess using a recursive call.
+		// Then extend the guess to the full 3-by-2 (see “Refining Guesses”).
+		//
+		// For the 2-by-1-wide-digit guess, instead of doing 2B-by-B-digit,
+		// we use a (2B+1)-by-(B+1) digit, which handles the possibility that
+		// the result has an extra leading 1 digit as well as guaranteeing
+		// that the computed q̂ will be off by at most 1 instead of 2.
+
+		// s is the number of digits to drop from the 3B- and 2B-digit chunks.
+		// We drop B-1 to be left with 2B+1 and B+1.
+		s := (B - 1)
+
+		// uu is the up-to-3B-digit section of u we are working on.
+		uu := u[j-B:]
+
+		// Compute the 2-by-1 guess q̂, leaving r̂ in uu[s:B+n].
+		qhat := *temps[depth]
+		qhat.clear()
+		qhat.divRecursiveStep(uu[s:B+n], v[s:], depth+1, tmp, temps)
+		qhat = qhat.norm()
+
+		// Extend to a 3-by-2 quotient and remainder.
+		// Because divRecursiveStep overwrote the top part of uu with
+		// the remainder r̂, the full uu already contains the equivalent
+		// of r̂·B + uₙ₋₂ from the “Refining Guesses” discussion.
+		// Subtracting q̂·vₙ₋₂ from it will compute the full-length remainder.
+		// If that subtraction underflows, q̂·v > u, which we fix up
+		// by decrementing q̂ and adding v back, same as in long division.
+
+		// TODO(rsc): Instead of subtract and fix-up, this code is computing
+		// q̂·vₙ₋₂ and decrementing q̂ until that product is ≤ u.
+		// But we can do the subtraction directly, as in the comment above
+		// and in long division, because we know that q̂ is wrong by at most one.
+		qhatv := tmp.make(3 * n)
+		qhatv.clear()
+		qhatv = qhatv.mul(qhat, v[:s])
+		for i := 0; i < 2; i++ {
+			e := qhatv.cmp(uu.norm())
+			if e <= 0 {
+				break
+			}
+			subVW(qhat, qhat, 1)
+			c := subVV(qhatv[:s], qhatv[:s], v[:s])
+			if len(qhatv) > s {
+				subVW(qhatv[s:], qhatv[s:], c)
+			}
+			addAt(uu[s:], v[s:], 0)
+		}
+		if qhatv.cmp(uu.norm()) > 0 {
+			panic("impossible")
+		}
+		c := subVV(uu[:len(qhatv)], uu[:len(qhatv)], qhatv)
+		if c > 0 {
+			subVW(uu[len(qhatv):], uu[len(qhatv):], c)
+		}
+		addAt(z, qhat, j-B)
+		j -= B
+	}
+
+	// TODO(rsc): Rewrite loop as described above and delete all this code.
+
+	// Now u < (v<<B), compute lower bits in the same way.
+	// Choose shift = B-1 again.
+	s := B - 1
+	qhat := *temps[depth]
+	qhat.clear()
+	qhat.divRecursiveStep(u[s:].norm(), v[s:], depth+1, tmp, temps)
+	qhat = qhat.norm()
+	qhatv := tmp.make(3 * n)
+	qhatv.clear()
+	qhatv = qhatv.mul(qhat, v[:s])
+	// Set the correct remainder as before.
+	for i := 0; i < 2; i++ {
+		if e := qhatv.cmp(u.norm()); e > 0 {
+			subVW(qhat, qhat, 1)
+			c := subVV(qhatv[:s], qhatv[:s], v[:s])
+			if len(qhatv) > s {
+				subVW(qhatv[s:], qhatv[s:], c)
+			}
+			addAt(u[s:], v[s:], 0)
+		}
+	}
+	if qhatv.cmp(u.norm()) > 0 {
+		panic("impossible")
+	}
+	c := subVV(u[0:len(qhatv)], u[0:len(qhatv)], qhatv)
+	if c > 0 {
+		c = subVW(u[len(qhatv):], u[len(qhatv):], c)
+	}
+	if c > 0 {
+		panic("impossible")
+	}
+
+	// Done!
+	addAt(z, qhat.norm(), 0)
+}