diff options
author | Iain Buclaw <ibuclaw@gcc.gnu.org> | 2018-10-28 19:51:47 +0000 |
---|---|---|
committer | Iain Buclaw <ibuclaw@gcc.gnu.org> | 2018-10-28 19:51:47 +0000 |
commit | b4c522fabd0df7be08882d2207df8b2765026110 (patch) | |
tree | b5ffc312b0a441c1ba24323152aec463fdbe5e9f /libphobos/src/std/internal/math | |
parent | 01ce9e31a02c8039d88e90f983735104417bf034 (diff) | |
download | gcc-b4c522fabd0df7be08882d2207df8b2765026110.zip gcc-b4c522fabd0df7be08882d2207df8b2765026110.tar.gz gcc-b4c522fabd0df7be08882d2207df8b2765026110.tar.bz2 |
Add D front-end, libphobos library, and D2 testsuite.
ChangeLog:
* Makefile.def (target_modules): Add libphobos.
(flags_to_pass): Add GDC, GDCFLAGS, GDC_FOR_TARGET and
GDCFLAGS_FOR_TARGET.
(dependencies): Make libphobos depend on libatomic, libbacktrace
configure, and zlib configure.
(language): Add language d.
* Makefile.in: Rebuild.
* Makefile.tpl (BUILD_EXPORTS): Add GDC and GDCFLAGS.
(HOST_EXPORTS): Add GDC.
(POSTSTAGE1_HOST_EXPORTS): Add GDC and GDC_FOR_BUILD.
(BASE_TARGET_EXPORTS): Add GDC.
(GDC_FOR_BUILD, GDC, GDCFLAGS): New variables.
(GDC_FOR_TARGET, GDC_FLAGS_FOR_TARGET): New variables.
(EXTRA_HOST_FLAGS): Add GDC.
(STAGE1_FLAGS_TO_PASS): Add GDC.
(EXTRA_TARGET_FLAGS): Add GDC and GDCFLAGS.
* config-ml.in: Treat GDC and GDCFLAGS like other compiler/flag
environment variables.
* configure: Rebuild.
* configure.ac: Add target-libphobos to target_libraries. Set and
substitute GDC_FOR_BUILD and GDC_FOR_TARGET.
config/ChangeLog:
* multi.m4: Set GDC.
gcc/ChangeLog:
* Makefile.in (tm_d_file_list, tm_d_include_list): New variables.
(TM_D_H, D_TARGET_DEF, D_TARGET_H, D_TARGET_OBJS): New variables.
(tm_d.h, cs-tm_d.h, default-d.o): New rules.
(d/d-target-hooks-def.h, s-d-target-hooks-def-h): New rules.
(s-tm-texi): Also check timestamp on d-target.def.
(generated_files): Add TM_D_H and d-target-hooks-def.h.
(build/genhooks.o): Also depend on D_TARGET_DEF.
* config.gcc (tm_d_file, d_target_objs, target_has_targetdm): New
variables.
* config/aarch64/aarch64-d.c: New file.
* config/aarch64/aarch64-linux.h (GNU_USER_TARGET_D_CRITSEC_SIZE):
Define.
* config/aarch64/aarch64-protos.h (aarch64_d_target_versions): New
prototype.
* config/aarch64/aarch64.h (TARGET_D_CPU_VERSIONS): Define.
* config/aarch64/t-aarch64 (aarch64-d.o): New rule.
* config/arm/arm-d.c: New file.
* config/arm/arm-protos.h (arm_d_target_versions): New prototype.
* config/arm/arm.h (TARGET_D_CPU_VERSIONS): Define.
* config/arm/linux-eabi.h (EXTRA_TARGET_D_OS_VERSIONS): Define.
* config/arm/t-arm (arm-d.o): New rule.
* config/default-d.c: New file.
* config/glibc-d.c: New file.
* config/gnu.h (GNU_USER_TARGET_D_OS_VERSIONS): Define.
* config/i386/i386-d.c: New file.
* config/i386/i386-protos.h (ix86_d_target_versions): New prototype.
* config/i386/i386.h (TARGET_D_CPU_VERSIONS): Define.
* config/i386/linux-common.h (EXTRA_TARGET_D_OS_VERSIONS): Define.
(GNU_USER_TARGET_D_CRITSEC_SIZE): Define.
* config/i386/t-i386 (i386-d.o): New rule.
* config/kfreebsd-gnu.h (GNU_USER_TARGET_D_OS_VERSIONS): Define.
* config/kopensolaris-gnu.h (GNU_USER_TARGET_D_OS_VERSIONS): Define.
* config/linux-android.h (ANDROID_TARGET_D_OS_VERSIONS): Define.
* config/linux.h (GNU_USER_TARGET_D_OS_VERSIONS): Define.
* config/mips/linux-common.h (EXTRA_TARGET_D_OS_VERSIONS): Define.
* config/mips/mips-d.c: New file.
* config/mips/mips-protos.h (mips_d_target_versions): New prototype.
* config/mips/mips.h (TARGET_D_CPU_VERSIONS): Define.
* config/mips/t-mips (mips-d.o): New rule.
* config/powerpcspe/linux.h (GNU_USER_TARGET_D_OS_VERSIONS): Define.
* config/powerpcspe/linux64.h (GNU_USER_TARGET_D_OS_VERSIONS): Define.
* config/powerpcspe/powerpcspe-d.c: New file.
* config/powerpcspe/powerpcspe-protos.h (rs6000_d_target_versions):
New prototype.
* config/powerpcspe/powerpcspe.c (rs6000_output_function_epilogue):
Support GNU D by using 0 as the language type.
* config/powerpcspe/powerpcspe.h (TARGET_D_CPU_VERSIONS): Define.
* config/powerpcspe/t-powerpcspe (powerpcspe-d.o): New rule.
* config/riscv/riscv-d.c: New file.
* config/riscv/riscv-protos.h (riscv_d_target_versions): New
prototype.
* config/riscv/riscv.h (TARGET_D_CPU_VERSIONS): Define.
* config/riscv/t-riscv (riscv-d.o): New rule.
* config/rs6000/linux.h (GNU_USER_TARGET_D_OS_VERSIONS): Define.
* config/rs6000/linux64.h (GNU_USER_TARGET_D_OS_VERSIONS): Define.
* config/rs6000/rs6000-d.c: New file.
* config/rs6000/rs6000-protos.h (rs6000_d_target_versions): New
prototype.
* config/rs6000/rs6000.c (rs6000_output_function_epilogue):
Support GNU D by using 0 as the language type.
* config/rs6000/rs6000.h (TARGET_D_CPU_VERSIONS): Define.
* config/rs6000/t-rs6000 (rs6000-d.o): New rule.
* config/s390/s390-d.c: New file.
* config/s390/s390-protos.h (s390_d_target_versions): New prototype.
* config/s390/s390.h (TARGET_D_CPU_VERSIONS): Define.
* config/s390/t-s390 (s390-d.o): New rule.
* config/sparc/sparc-d.c: New file.
* config/sparc/sparc-protos.h (sparc_d_target_versions): New
prototype.
* config/sparc/sparc.h (TARGET_D_CPU_VERSIONS): Define.
* config/sparc/t-sparc (sparc-d.o): New rule.
* config/t-glibc (glibc-d.o): New rule.
* configure: Regenerated.
* configure.ac (tm_d_file): New variable.
(tm_d_file_list, tm_d_include_list, d_target_objs): Add substitutes.
* doc/contrib.texi (Contributors): Add self for the D frontend.
* doc/frontends.texi (G++ and GCC): Mention D as a supported language.
* doc/install.texi (Configuration): Mention libphobos as an option for
--enable-shared. Mention d as an option for --enable-languages.
(Testing): Mention check-d as a target.
* doc/invoke.texi (Overall Options): Mention .d, .dd, and .di as file
name suffixes. Mention d as a -x option.
* doc/sourcebuild.texi (Top Level): Mention libphobos.
* doc/standards.texi (Standards): Add section on D language.
* doc/tm.texi: Regenerated.
* doc/tm.texi.in: Add @node for D language and ABI, and @hook for
TARGET_CPU_VERSIONS, TARGET_D_OS_VERSIONS, and TARGET_D_CRITSEC_SIZE.
* dwarf2out.c (is_dlang): New function.
(gen_compile_unit_die): Use DW_LANG_D for D.
(declare_in_namespace): Return module die for D, instead of adding
extra declarations into the namespace.
(gen_namespace_die): Generate DW_TAG_module for D.
(gen_decl_die): Handle CONST_DECLSs for D.
(dwarf2out_decl): Likewise.
(prune_unused_types_walk_local_classes): Handle DW_tag_interface_type.
(prune_unused_types_walk): Handle DW_tag_interface_type same as other
kinds of aggregates.
* gcc.c (default_compilers): Add entries for .d, .dd and .di.
* genhooks.c: Include d/d-target.def.
gcc/po/ChangeLog:
* EXCLUDES: Add sources from d/dmd.
gcc/testsuite/ChangeLog:
* gcc.misc-tests/help.exp: Add D to option descriptions check.
* gdc.dg/asan/asan.exp: New file.
* gdc.dg/asan/gdc272.d: New test.
* gdc.dg/compilable.d: New test.
* gdc.dg/dg.exp: New file.
* gdc.dg/gdc254.d: New test.
* gdc.dg/gdc260.d: New test.
* gdc.dg/gdc270a.d: New test.
* gdc.dg/gdc270b.d: New test.
* gdc.dg/gdc282.d: New test.
* gdc.dg/gdc283.d: New test.
* gdc.dg/imports/gdc170.d: New test.
* gdc.dg/imports/gdc231.d: New test.
* gdc.dg/imports/gdc239.d: New test.
* gdc.dg/imports/gdc241a.d: New test.
* gdc.dg/imports/gdc241b.d: New test.
* gdc.dg/imports/gdc251a.d: New test.
* gdc.dg/imports/gdc251b.d: New test.
* gdc.dg/imports/gdc253.d: New test.
* gdc.dg/imports/gdc254a.d: New test.
* gdc.dg/imports/gdc256.d: New test.
* gdc.dg/imports/gdc27.d: New test.
* gdc.dg/imports/gdcpkg256/package.d: New test.
* gdc.dg/imports/runnable.d: New test.
* gdc.dg/link.d: New test.
* gdc.dg/lto/lto.exp: New file.
* gdc.dg/lto/ltotests_0.d: New test.
* gdc.dg/lto/ltotests_1.d: New test.
* gdc.dg/runnable.d: New test.
* gdc.dg/simd.d: New test.
* gdc.test/gdc-test.exp: New file.
* lib/gdc-dg.exp: New file.
* lib/gdc.exp: New file.
libphobos/ChangeLog:
* Makefile.am: New file.
* Makefile.in: New file.
* acinclude.m4: New file.
* aclocal.m4: New file.
* config.h.in: New file.
* configure: New file.
* configure.ac: New file.
* d_rules.am: New file.
* libdruntime/Makefile.am: New file.
* libdruntime/Makefile.in: New file.
* libdruntime/__entrypoint.di: New file.
* libdruntime/__main.di: New file.
* libdruntime/gcc/attribute.d: New file.
* libdruntime/gcc/backtrace.d: New file.
* libdruntime/gcc/builtins.d: New file.
* libdruntime/gcc/config.d.in: New file.
* libdruntime/gcc/deh.d: New file.
* libdruntime/gcc/libbacktrace.d.in: New file.
* libdruntime/gcc/unwind/arm.d: New file.
* libdruntime/gcc/unwind/arm_common.d: New file.
* libdruntime/gcc/unwind/c6x.d: New file.
* libdruntime/gcc/unwind/generic.d: New file.
* libdruntime/gcc/unwind/package.d: New file.
* libdruntime/gcc/unwind/pe.d: New file.
* m4/autoconf.m4: New file.
* m4/druntime.m4: New file.
* m4/druntime/cpu.m4: New file.
* m4/druntime/libraries.m4: New file.
* m4/druntime/os.m4: New file.
* m4/gcc_support.m4: New file.
* m4/gdc.m4: New file.
* m4/libtool.m4: New file.
* src/Makefile.am: New file.
* src/Makefile.in: New file.
* src/libgphobos.spec.in: New file.
* testsuite/Makefile.am: New file.
* testsuite/Makefile.in: New file.
* testsuite/config/default.exp: New file.
* testsuite/lib/libphobos-dg.exp: New file.
* testsuite/lib/libphobos.exp: New file.
* testsuite/testsuite_flags.in: New file.
From-SVN: r265573
Diffstat (limited to 'libphobos/src/std/internal/math')
-rw-r--r-- | libphobos/src/std/internal/math/biguintcore.d | 2571 | ||||
-rw-r--r-- | libphobos/src/std/internal/math/biguintnoasm.d | 370 | ||||
-rw-r--r-- | libphobos/src/std/internal/math/biguintx86.d | 1353 | ||||
-rw-r--r-- | libphobos/src/std/internal/math/errorfunction.d | 1145 | ||||
-rw-r--r-- | libphobos/src/std/internal/math/gammafunction.d | 1834 |
5 files changed, 7273 insertions, 0 deletions
diff --git a/libphobos/src/std/internal/math/biguintcore.d b/libphobos/src/std/internal/math/biguintcore.d new file mode 100644 index 0000000..f5cd769 --- /dev/null +++ b/libphobos/src/std/internal/math/biguintcore.d @@ -0,0 +1,2571 @@ +/** Fundamental operations for arbitrary-precision arithmetic + * + * These functions are for internal use only. + */ +/* Copyright Don Clugston 2008 - 2010. + * Distributed under the Boost Software License, Version 1.0. + * (See accompanying file LICENSE_1_0.txt or copy at + * http://www.boost.org/LICENSE_1_0.txt) + */ +/* References: + "Modern Computer Arithmetic" (MCA) is the primary reference for all + algorithms used in this library. + - R.P. Brent and P. Zimmermann, "Modern Computer Arithmetic", + Version 0.5.9, (Oct 2010). + - C. Burkinel and J. Ziegler, "Fast Recursive Division", MPI-I-98-1-022, + Max-Planck Institute fuer Informatik, (Oct 1998). + - G. Hanrot, M. Quercia, and P. Zimmermann, "The Middle Product Algorithm, I.", + INRIA 4664, (Dec 2002). + - M. Bodrato and A. Zanoni, "What about Toom-Cook Matrices Optimality?", + http://bodrato.it/papers (2006). + - A. Fog, "Optimizing subroutines in assembly language", + www.agner.org/optimize (2008). + - A. Fog, "The microarchitecture of Intel and AMD CPU's", + www.agner.org/optimize (2008). + - A. Fog, "Instruction tables: Lists of instruction latencies, throughputs + and micro-operation breakdowns for Intel and AMD CPU's.", www.agner.org/optimize (2008). + +Idioms: + Many functions in this module use + 'func(Tulong)(Tulong x) if (is(Tulong == ulong))' rather than 'func(ulong x)' + in order to disable implicit conversion. + +*/ +module std.internal.math.biguintcore; + +version (D_InlineAsm_X86) +{ + import std.internal.math.biguintx86; +} +else +{ + import std.internal.math.biguintnoasm; +} + +alias multibyteAdd = multibyteAddSub!('+'); +alias multibyteSub = multibyteAddSub!('-'); + + +import core.cpuid; +public import std.ascii : LetterCase; +import std.range.primitives; +import std.traits; + +shared static this() +{ + CACHELIMIT = core.cpuid.datacache[0].size*1024/2; +} + +private: +// Limits for when to switch between algorithms. +immutable size_t CACHELIMIT; // Half the size of the data cache. +enum size_t FASTDIVLIMIT = 100; // crossover to recursive division + + +// These constants are used by shift operations +static if (BigDigit.sizeof == int.sizeof) +{ + enum { LG2BIGDIGITBITS = 5, BIGDIGITSHIFTMASK = 31 }; + alias BIGHALFDIGIT = ushort; +} +else static if (BigDigit.sizeof == long.sizeof) +{ + alias BIGHALFDIGIT = uint; + enum { LG2BIGDIGITBITS = 6, BIGDIGITSHIFTMASK = 63 }; +} +else static assert(0, "Unsupported BigDigit size"); + +import std.exception : assumeUnique; +import std.traits : isIntegral; +enum BigDigitBits = BigDigit.sizeof*8; +template maxBigDigits(T) +if (isIntegral!T) +{ + enum maxBigDigits = (T.sizeof+BigDigit.sizeof-1)/BigDigit.sizeof; +} + +static immutable BigDigit[] ZERO = [0]; +static immutable BigDigit[] ONE = [1]; +static immutable BigDigit[] TWO = [2]; +static immutable BigDigit[] TEN = [10]; + + +public: + +/// BigUint performs memory management and wraps the low-level calls. +struct BigUint +{ +private: + pure invariant() + { + assert( data.length >= 1 && (data.length == 1 || data[$-1] != 0 )); + } + + immutable(BigDigit) [] data = ZERO; + + this(immutable(BigDigit) [] x) pure nothrow @nogc @safe + { + data = x; + } + package(std) // used from: std.bigint + this(T)(T x) pure nothrow @safe if (isIntegral!T) + { + opAssign(x); + } + + enum trustedAssumeUnique = function(BigDigit[] input) pure @trusted @nogc { + return assumeUnique(input); + }; +public: + // Length in uints + @property size_t uintLength() pure nothrow const @safe @nogc + { + static if (BigDigit.sizeof == uint.sizeof) + { + return data.length; + } + else static if (BigDigit.sizeof == ulong.sizeof) + { + return data.length * 2 - + ((data[$-1] & 0xFFFF_FFFF_0000_0000L) ? 1 : 0); + } + } + @property size_t ulongLength() pure nothrow const @safe @nogc + { + static if (BigDigit.sizeof == uint.sizeof) + { + return (data.length + 1) >> 1; + } + else static if (BigDigit.sizeof == ulong.sizeof) + { + return data.length; + } + } + + // The value at (cast(ulong[]) data)[n] + ulong peekUlong(int n) pure nothrow const @safe @nogc + { + static if (BigDigit.sizeof == int.sizeof) + { + if (data.length == n*2 + 1) return data[n*2]; + return data[n*2] + ((cast(ulong) data[n*2 + 1]) << 32 ); + } + else static if (BigDigit.sizeof == long.sizeof) + { + return data[n]; + } + } + uint peekUint(int n) pure nothrow const @safe @nogc + { + static if (BigDigit.sizeof == int.sizeof) + { + return data[n]; + } + else + { + immutable x = data[n >> 1]; + return (n & 1) ? cast(uint)(x >> 32) : cast(uint) x; + } + } +public: + /// + void opAssign(Tulong)(Tulong u) pure nothrow @safe if (is (Tulong == ulong)) + { + if (u == 0) data = ZERO; + else if (u == 1) data = ONE; + else if (u == 2) data = TWO; + else if (u == 10) data = TEN; + else + { + static if (BigDigit.sizeof == int.sizeof) + { + uint ulo = cast(uint)(u & 0xFFFF_FFFF); + uint uhi = cast(uint)(u >> 32); + if (uhi == 0) + { + data = [ulo]; + } + else + { + data = [ulo, uhi]; + } + } + else static if (BigDigit.sizeof == long.sizeof) + { + data = [u]; + } + } + } + void opAssign(Tdummy = void)(BigUint y) pure nothrow @nogc @safe + { + this.data = y.data; + } + + /// + int opCmp(Tdummy = void)(const BigUint y) pure nothrow @nogc const @safe + { + if (data.length != y.data.length) + return (data.length > y.data.length) ? 1 : -1; + size_t k = highestDifferentDigit(data, y.data); + if (data[k] == y.data[k]) + return 0; + return data[k] > y.data[k] ? 1 : -1; + } + + /// + int opCmp(Tulong)(Tulong y) pure nothrow @nogc const @safe if (is (Tulong == ulong)) + { + if (data.length > maxBigDigits!Tulong) + return 1; + + foreach_reverse (i; 0 .. maxBigDigits!Tulong) + { + BigDigit tmp = cast(BigDigit)(y>>(i*BigDigitBits)); + if (tmp == 0) + if (data.length >= i+1) + { + // Since ZERO is [0], so we cannot simply return 1 here, as + // data[i] would be 0 for i == 0 in that case. + return (data[i] > 0) ? 1 : 0; + } + else + continue; + else + if (i+1 > data.length) + return -1; + else if (tmp != data[i]) + return data[i] > tmp ? 1 : -1; + } + return 0; + } + + bool opEquals(Tdummy = void)(ref const BigUint y) pure nothrow @nogc const @safe + { + return y.data[] == data[]; + } + + bool opEquals(Tdummy = void)(ulong y) pure nothrow @nogc const @safe + { + if (data.length > 2) + return false; + uint ylo = cast(uint)(y & 0xFFFF_FFFF); + uint yhi = cast(uint)(y >> 32); + if (data.length == 2 && data[1]!=yhi) + return false; + if (data.length == 1 && yhi != 0) + return false; + return (data[0] == ylo); + } + + bool isZero() pure const nothrow @safe @nogc + { + return data.length == 1 && data[0] == 0; + } + + size_t numBytes() pure nothrow const @safe @nogc + { + return data.length * BigDigit.sizeof; + } + + // the extra bytes are added to the start of the string + char [] toDecimalString(int frontExtraBytes) const pure nothrow + { + immutable predictlength = 20+20*(data.length/2); // just over 19 + char [] buff = new char[frontExtraBytes + predictlength]; + ptrdiff_t sofar = biguintToDecimal(buff, data.dup); + return buff[sofar-frontExtraBytes..$]; + } + + /** Convert to a hex string, printing a minimum number of digits 'minPadding', + * allocating an additional 'frontExtraBytes' at the start of the string. + * Padding is done with padChar, which may be '0' or ' '. + * 'separator' is a digit separation character. If non-zero, it is inserted + * between every 8 digits. + * Separator characters do not contribute to the minPadding. + */ + char [] toHexString(int frontExtraBytes, char separator = 0, + int minPadding=0, char padChar = '0', + LetterCase letterCase = LetterCase.upper) const pure nothrow @safe + { + // Calculate number of extra padding bytes + size_t extraPad = (minPadding > data.length * 2 * BigDigit.sizeof) + ? minPadding - data.length * 2 * BigDigit.sizeof : 0; + + // Length not including separator bytes + size_t lenBytes = data.length * 2 * BigDigit.sizeof; + + // Calculate number of separator bytes + size_t mainSeparatorBytes = separator ? (lenBytes / 8) - 1 : 0; + immutable totalSeparatorBytes = separator ? ((extraPad + lenBytes + 7) / 8) - 1: 0; + + char [] buff = new char[lenBytes + extraPad + totalSeparatorBytes + frontExtraBytes]; + biguintToHex(buff[$ - lenBytes - mainSeparatorBytes .. $], data, separator, letterCase); + if (extraPad > 0) + { + if (separator) + { + size_t start = frontExtraBytes; // first index to pad + if (extraPad &7) + { + // Do 1 to 7 extra zeros. + buff[frontExtraBytes .. frontExtraBytes + (extraPad & 7)] = padChar; + buff[frontExtraBytes + (extraPad & 7)] = (padChar == ' ' ? ' ' : separator); + start += (extraPad & 7) + 1; + } + for (int i=0; i< (extraPad >> 3); ++i) + { + buff[start .. start + 8] = padChar; + buff[start + 8] = (padChar == ' ' ? ' ' : separator); + start += 9; + } + } + else + { + buff[frontExtraBytes .. frontExtraBytes + extraPad]=padChar; + } + } + int z = frontExtraBytes; + if (lenBytes > minPadding) + { + // Strip leading zeros. + ptrdiff_t maxStrip = lenBytes - minPadding; + while (z< buff.length-1 && (buff[z]=='0' || buff[z]==padChar) && maxStrip>0) + { + ++z; + --maxStrip; + } + } + if (padChar!='0') + { + // Convert leading zeros into padChars. + for (size_t k= z; k< buff.length-1 && (buff[k]=='0' || buff[k]==padChar); ++k) + { + if (buff[k]=='0') buff[k]=padChar; + } + } + return buff[z-frontExtraBytes..$]; + } + + /** + * Convert to an octal string. + */ + char[] toOctalString() const + { + auto predictLength = 1 + data.length*BigDigitBits / 3; + char[] buff = new char[predictLength]; + size_t firstNonZero = biguintToOctal(buff, data); + return buff[firstNonZero .. $]; + } + + // return false if invalid character found + bool fromHexString(Range)(Range s) if ( + isBidirectionalRange!Range && isSomeChar!(ElementType!Range)) + { + import std.range : walkLength; + + //Strip leading zeros + while (!s.empty && s.front == '0') + s.popFront; + + if (s.empty) + { + data = ZERO; + return true; + } + + immutable len = (s.save.walkLength + 15) / 4; + auto tmp = new BigDigit[len + 1]; + uint part, sofar, partcount; + + foreach_reverse (character; s) + { + if (character == '_') + continue; + + uint x; + if (character >= '0' && character <= '9') + { + x = character - '0'; + } + else if (character >= 'A' && character <= 'F') + { + x = character - 'A' + 10; + } + else if (character >= 'a' && character <= 'f') + { + x = character - 'a' + 10; + } + else + { + return false; + } + + part >>= 4; + part |= (x << (32 - 4)); + ++partcount; + + if (partcount == 8) + { + tmp[sofar] = part; + ++sofar; + partcount = 0; + part = 0; + } + } + if (part) + { + for ( ; partcount != 8; ++partcount) part >>= 4; + tmp[sofar] = part; + ++sofar; + } + if (sofar == 0) + data = ZERO; + else + data = trustedAssumeUnique(tmp[0 .. sofar]); + + return true; + } + + // return true if OK; false if erroneous characters found + bool fromDecimalString(Range)(Range s) if ( + isForwardRange!Range && isSomeChar!(ElementType!Range)) + { + import std.range : walkLength; + + while (!s.empty && s.front == '0') + { + s.popFront; + } + + if (s.empty) + { + data = ZERO; + return true; + } + + auto predict_length = (18 * 2 + 2 * s.save.walkLength) / 19; + auto tmp = new BigDigit[predict_length]; + + tmp.length = biguintFromDecimal(tmp, s); + + data = trustedAssumeUnique(tmp); + return true; + } + + //////////////////////// + // + // All of these member functions create a new BigUint. + + // return x >> y + BigUint opShr(Tulong)(Tulong y) pure nothrow const if (is (Tulong == ulong)) + { + assert(y>0); + uint bits = cast(uint) y & BIGDIGITSHIFTMASK; + if ((y >> LG2BIGDIGITBITS) >= data.length) return BigUint(ZERO); + uint words = cast(uint)(y >> LG2BIGDIGITBITS); + if (bits == 0) + { + return BigUint(data[words..$]); + } + else + { + uint [] result = new BigDigit[data.length - words]; + multibyteShr(result, data[words..$], bits); + + if (result.length > 1 && result[$-1] == 0) + return BigUint(trustedAssumeUnique(result[0 .. $-1])); + else + return BigUint(trustedAssumeUnique(result)); + } + } + + // return x << y + BigUint opShl(Tulong)(Tulong y) pure nothrow const if (is (Tulong == ulong)) + { + assert(y>0); + if (isZero()) return this; + uint bits = cast(uint) y & BIGDIGITSHIFTMASK; + assert((y >> LG2BIGDIGITBITS) < cast(ulong)(uint.max)); + uint words = cast(uint)(y >> LG2BIGDIGITBITS); + BigDigit [] result = new BigDigit[data.length + words+1]; + result[0 .. words] = 0; + if (bits == 0) + { + result[words .. words+data.length] = data[]; + return BigUint(trustedAssumeUnique(result[0 .. words+data.length])); + } + else + { + immutable c = multibyteShl(result[words .. words+data.length], data, bits); + if (c == 0) return BigUint(trustedAssumeUnique(result[0 .. words+data.length])); + result[$-1] = c; + return BigUint(trustedAssumeUnique(result)); + } + } + + // If wantSub is false, return x + y, leaving sign unchanged + // If wantSub is true, return abs(x - y), negating sign if x < y + static BigUint addOrSubInt(Tulong)(const BigUint x, Tulong y, + bool wantSub, ref bool sign) pure nothrow if (is(Tulong == ulong)) + { + BigUint r; + if (wantSub) + { // perform a subtraction + if (x.data.length > 2) + { + r.data = subInt(x.data, y); + } + else + { // could change sign! + ulong xx = x.data[0]; + if (x.data.length > 1) + xx += (cast(ulong) x.data[1]) << 32; + ulong d; + if (xx <= y) + { + d = y - xx; + sign = !sign; + } + else + { + d = xx - y; + } + if (d == 0) + { + r = 0UL; + sign = false; + return r; + } + if (d > uint.max) + { + r.data = [cast(uint)(d & 0xFFFF_FFFF), cast(uint)(d >> 32)]; + } + else + { + r.data = [cast(uint)(d & 0xFFFF_FFFF)]; + } + } + } + else + { + r.data = addInt(x.data, y); + } + return r; + } + + // If wantSub is false, return x + y, leaving sign unchanged. + // If wantSub is true, return abs(x - y), negating sign if x<y + static BigUint addOrSub(BigUint x, BigUint y, bool wantSub, bool *sign) + pure nothrow + { + BigUint r; + if (wantSub) + { // perform a subtraction + bool negative; + r.data = sub(x.data, y.data, &negative); + *sign ^= negative; + if (r.isZero()) + { + *sign = false; + } + } + else + { + r.data = add(x.data, y.data); + } + return r; + } + + + // return x*y. + // y must not be zero. + static BigUint mulInt(T = ulong)(BigUint x, T y) pure nothrow + { + if (y == 0 || x == 0) return BigUint(ZERO); + uint hi = cast(uint)(y >>> 32); + uint lo = cast(uint)(y & 0xFFFF_FFFF); + uint [] result = new BigDigit[x.data.length+1+(hi != 0)]; + result[x.data.length] = multibyteMul(result[0 .. x.data.length], x.data, lo, 0); + if (hi != 0) + { + result[x.data.length+1] = multibyteMulAdd!('+')(result[1 .. x.data.length+1], + x.data, hi, 0); + } + return BigUint(removeLeadingZeros(trustedAssumeUnique(result))); + } + + /* return x * y. + */ + static BigUint mul(BigUint x, BigUint y) pure nothrow + { + if (y == 0 || x == 0) + return BigUint(ZERO); + auto len = x.data.length + y.data.length; + BigDigit [] result = new BigDigit[len]; + if (y.data.length > x.data.length) + { + mulInternal(result, y.data, x.data); + } + else + { + if (x.data[]==y.data[]) squareInternal(result, x.data); + else mulInternal(result, x.data, y.data); + } + // the highest element could be zero, + // in which case we need to reduce the length + return BigUint(removeLeadingZeros(trustedAssumeUnique(result))); + } + + // return x / y + static BigUint divInt(T)(BigUint x, T y_) pure nothrow + if ( is(Unqual!T == uint) ) + { + uint y = y_; + if (y == 1) + return x; + uint [] result = new BigDigit[x.data.length]; + if ((y&(-y))==y) + { + assert(y != 0, "BigUint division by zero"); + // perfect power of 2 + uint b = 0; + for (;y != 1; y>>=1) + { + ++b; + } + multibyteShr(result, x.data, b); + } + else + { + result[] = x.data[]; + cast(void) multibyteDivAssign(result, y, 0); + } + return BigUint(removeLeadingZeros(trustedAssumeUnique(result))); + } + + static BigUint divInt(T)(BigUint x, T y) pure nothrow + if ( is(Unqual!T == ulong) ) + { + if (y <= uint.max) + return divInt!uint(x, cast(uint) y); + if (x.data.length < 2) + return BigUint(ZERO); + uint hi = cast(uint)(y >>> 32); + uint lo = cast(uint)(y & 0xFFFF_FFFF); + immutable uint[2] z = [lo, hi]; + BigDigit[] result = new BigDigit[x.data.length - z.length + 1]; + divModInternal(result, null, x.data, z[]); + return BigUint(removeLeadingZeros(trustedAssumeUnique(result))); + } + + // return x % y + static uint modInt(T)(BigUint x, T y_) pure if ( is(Unqual!T == uint) ) + { + import core.memory : GC; + uint y = y_; + assert(y != 0); + if ((y&(-y)) == y) + { // perfect power of 2 + return x.data[0] & (y-1); + } + else + { + // horribly inefficient - malloc, copy, & store are unnecessary. + uint [] wasteful = new BigDigit[x.data.length]; + wasteful[] = x.data[]; + immutable rem = multibyteDivAssign(wasteful, y, 0); + () @trusted { GC.free(wasteful.ptr); } (); + return rem; + } + } + + // return x / y + static BigUint div(BigUint x, BigUint y) pure nothrow + { + if (y.data.length > x.data.length) + return BigUint(ZERO); + if (y.data.length == 1) + return divInt(x, y.data[0]); + BigDigit [] result = new BigDigit[x.data.length - y.data.length + 1]; + divModInternal(result, null, x.data, y.data); + return BigUint(removeLeadingZeros(trustedAssumeUnique(result))); + } + + // return x % y + static BigUint mod(BigUint x, BigUint y) pure nothrow + { + if (y.data.length > x.data.length) return x; + if (y.data.length == 1) + { + return BigUint([modInt(x, y.data[0])]); + } + BigDigit [] result = new BigDigit[x.data.length - y.data.length + 1]; + BigDigit [] rem = new BigDigit[y.data.length]; + divModInternal(result, rem, x.data, y.data); + return BigUint(removeLeadingZeros(trustedAssumeUnique(rem))); + } + + // return x op y + static BigUint bitwiseOp(string op)(BigUint x, BigUint y, bool xSign, bool ySign, ref bool resultSign) + pure nothrow @safe if (op == "|" || op == "^" || op == "&") + { + auto d1 = includeSign(x.data, y.uintLength, xSign); + auto d2 = includeSign(y.data, x.uintLength, ySign); + + foreach (i; 0 .. d1.length) + { + mixin("d1[i] " ~ op ~ "= d2[i];"); + } + + mixin("resultSign = xSign " ~ op ~ " ySign;"); + + if (resultSign) + { + twosComplement(d1, d1); + } + + return BigUint(removeLeadingZeros(trustedAssumeUnique(d1))); + } + + /** + * Return a BigUint which is x raised to the power of y. + * Method: Powers of 2 are removed from x, then left-to-right binary + * exponentiation is used. + * Memory allocation is minimized: at most one temporary BigUint is used. + */ + static BigUint pow(BigUint x, ulong y) pure nothrow + { + // Deal with the degenerate cases first. + if (y == 0) return BigUint(ONE); + if (y == 1) return x; + if (x == 0 || x == 1) return x; + + BigUint result; + + // Simplify, step 1: Remove all powers of 2. + uint firstnonzero = firstNonZeroDigit(x.data); + // Now we know x = x[firstnonzero..$] * (2^^(firstnonzero*BigDigitBits)) + // where BigDigitBits = BigDigit.sizeof * 8 + + // See if x[firstnonzero..$] can now fit into a single digit. + bool singledigit = ((x.data.length - firstnonzero) == 1); + // If true, then x0 is that digit + // and the result will be (x0 ^^ y) * (2^^(firstnonzero*y*BigDigitBits)) + BigDigit x0 = x.data[firstnonzero]; + assert(x0 != 0); + // Length of the non-zero portion + size_t nonzerolength = x.data.length - firstnonzero; + ulong y0; + uint evenbits = 0; // number of even bits in the bottom of x + while (!(x0 & 1)) + { + x0 >>= 1; + ++evenbits; + } + + if (x.data.length- firstnonzero == 2) + { + // Check for a single digit straddling a digit boundary + const BigDigit x1 = x.data[firstnonzero+1]; + if ((x1 >> evenbits) == 0) + { + x0 |= (x1 << (BigDigit.sizeof * 8 - evenbits)); + singledigit = true; + } + } + // Now if (singledigit), x^^y = (x0 ^^ y) * 2^^(evenbits * y) * 2^^(firstnonzero*y*BigDigitBits)) + + uint evenshiftbits = 0; // Total powers of 2 to shift by, at the end + + // Simplify, step 2: For singledigits, see if we can trivially reduce y + + BigDigit finalMultiplier = 1UL; + + if (singledigit) + { + // x fits into a single digit. Raise it to the highest power we can + // that still fits into a single digit, then reduce the exponent accordingly. + // We're quite likely to have a residual multiply at the end. + // For example, 10^^100 = (((5^^13)^^7) * 5^^9) * 2^^100. + // and 5^^13 still fits into a uint. + evenshiftbits = cast(uint)( (evenbits * y) & BIGDIGITSHIFTMASK); + if (x0 == 1) + { // Perfect power of 2 + result = 1UL; + return result << (evenbits + firstnonzero * 8 * BigDigit.sizeof) * y; + } + immutable p = highestPowerBelowUintMax(x0); + if (y <= p) + { // Just do it with pow + result = cast(ulong) intpow(x0, y); + if (evenbits + firstnonzero == 0) + return result; + return result << (evenbits + firstnonzero * 8 * BigDigit.sizeof) * y; + } + y0 = y / p; + finalMultiplier = intpow(x0, y - y0*p); + x0 = intpow(x0, p); + // Result is x0 + nonzerolength = 1; + } + // Now if (singledigit), x^^y = finalMultiplier * (x0 ^^ y0) * 2^^(evenbits * y) * 2^^(firstnonzero*y*BigDigitBits)) + + // Perform a crude check for overflow and allocate result buffer. + // The length required is y * lg2(x) bits. + // which will always fit into y*x.length digits. But this is + // a gross overestimate if x is small (length 1 or 2) and the highest + // digit is nearly empty. + // A better estimate is: + // y * lg2(x[$-1]/BigDigit.max) + y * (x.length - 1) digits, + // and the first term is always between + // y * (bsr(x.data[$-1]) + 1) / BIGDIGITBITS and + // y * (bsr(x.data[$-1]) + 2) / BIGDIGITBITS + // For single digit payloads, we already have + // x^^y = finalMultiplier * (x0 ^^ y0) * 2^^(evenbits * y) * 2^^(firstnonzero*y*BigDigitBits)) + // and x0 is almost a full digit, so it's a tight estimate. + // Number of digits is therefore 1 + x0.length*y0 + (evenbits*y)/BIGDIGIT + firstnonzero*y + // Note that the divisions must be rounded up. + + // Estimated length in BigDigits + immutable estimatelength = singledigit + ? 1 + y0 + ((evenbits*y + BigDigit.sizeof * 8 - 1) / (BigDigit.sizeof *8)) + firstnonzero*y + : x.data.length * y; + // Imprecise check for overflow. Makes the extreme cases easier to debug + // (less extreme overflow will result in an out of memory error). + if (estimatelength > uint.max/(4*BigDigit.sizeof)) + assert(0, "Overflow in BigInt.pow"); + + // The result buffer includes space for all the trailing zeros + BigDigit [] resultBuffer = new BigDigit[cast(size_t) estimatelength]; + + // Do all the powers of 2! + size_t result_start = cast(size_t)( firstnonzero * y + + (singledigit ? ((evenbits * y) >> LG2BIGDIGITBITS) : 0)); + + resultBuffer[0 .. result_start] = 0; + BigDigit [] t1 = resultBuffer[result_start..$]; + BigDigit [] r1; + + if (singledigit) + { + r1 = t1[0 .. 1]; + r1[0] = x0; + y = y0; + } + else + { + // It's not worth right shifting by evenbits unless we also shrink the length after each + // multiply or squaring operation. That might still be worthwhile for large y. + r1 = t1[0 .. x.data.length - firstnonzero]; + r1[0..$] = x.data[firstnonzero..$]; + } + + if (y>1) + { // Set r1 = r1 ^^ y. + // The secondary buffer only needs space for the multiplication results + BigDigit [] t2 = new BigDigit[resultBuffer.length - result_start]; + BigDigit [] r2; + + int shifts = 63; // num bits in a long + while (!(y & 0x8000_0000_0000_0000L)) + { + y <<= 1; + --shifts; + } + y <<=1; + + while (y != 0) + { + // For each bit of y: Set r1 = r1 * r1 + // If the bit is 1, set r1 = r1 * x + // Eg, if y is 0b101, result = ((x^^2)^^2)*x == x^^5. + // Optimization opportunity: if more than 2 bits in y are set, + // it's usually possible to reduce the number of multiplies + // by caching odd powers of x. eg for y = 54, + // (0b110110), set u = x^^3, and result is ((u^^8)*u)^^2 + r2 = t2[0 .. r1.length*2]; + squareInternal(r2, r1); + if (y & 0x8000_0000_0000_0000L) + { + r1 = t1[0 .. r2.length + nonzerolength]; + if (singledigit) + { + r1[$-1] = multibyteMul(r1[0 .. $-1], r2, x0, 0); + } + else + { + mulInternal(r1, r2, x.data[firstnonzero..$]); + } + } + else + { + r1 = t1[0 .. r2.length]; + r1[] = r2[]; + } + y <<=1; + shifts--; + } + while (shifts>0) + { + r2 = t2[0 .. r1.length * 2]; + squareInternal(r2, r1); + r1 = t1[0 .. r2.length]; + r1[] = r2[]; + --shifts; + } + } + + if (finalMultiplier != 1) + { + const BigDigit carry = multibyteMul(r1, r1, finalMultiplier, 0); + if (carry) + { + r1 = t1[0 .. r1.length + 1]; + r1[$-1] = carry; + } + } + if (evenshiftbits) + { + const BigDigit carry = multibyteShl(r1, r1, evenshiftbits); + if (carry != 0) + { + r1 = t1[0 .. r1.length + 1]; + r1[$ - 1] = carry; + } + } + while (r1[$ - 1]==0) + { + r1=r1[0 .. $ - 1]; + } + return BigUint(trustedAssumeUnique(resultBuffer[0 .. result_start + r1.length])); + } + + // Implement toHash so that BigUint works properly as an AA key. + size_t toHash() const @trusted nothrow + { + return typeid(data).getHash(&data); + } + +} // end BigUint + +@safe pure nothrow unittest +{ + // ulong comparison test + BigUint a = [1]; + assert(a == 1); + assert(a < 0x8000_0000_0000_0000UL); // bug 9548 + + // bug 12234 + BigUint z = [0]; + assert(z == 0UL); + assert(!(z > 0UL)); + assert(!(z < 0UL)); +} + +// Remove leading zeros from x, to restore the BigUint invariant +inout(BigDigit) [] removeLeadingZeros(inout(BigDigit) [] x) pure nothrow @safe +{ + size_t k = x.length; + while (k>1 && x[k - 1]==0) --k; + return x[0 .. k]; +} + +pure @system unittest +{ + BigUint r = BigUint([5]); + BigUint t = BigUint([7]); + BigUint s = BigUint.mod(r, t); + assert(s == 5); +} + + +@safe pure unittest +{ + BigUint r; + r = 5UL; + assert(r.peekUlong(0) == 5UL); + assert(r.peekUint(0) == 5U); + r = 0x1234_5678_9ABC_DEF0UL; + assert(r.peekUlong(0) == 0x1234_5678_9ABC_DEF0UL); + assert(r.peekUint(0) == 0x9ABC_DEF0U); +} + + +// Pow tests +pure @system unittest +{ + BigUint r, s; + r.fromHexString("80000000_00000001"); + s = BigUint.pow(r, 5); + r.fromHexString("08000000_00000000_50000000_00000001_40000000_00000002_80000000" + ~ "_00000002_80000000_00000001"); + assert(s == r); + s = 10UL; + s = BigUint.pow(s, 39); + r.fromDecimalString("1000000000000000000000000000000000000000"); + assert(s == r); + r.fromHexString("1_E1178E81_00000000"); + s = BigUint.pow(r, 15); // Regression test: this used to overflow array bounds + + r.fromDecimalString("000_000_00"); + assert(r == 0); + r.fromDecimalString("0007"); + assert(r == 7); + r.fromDecimalString("0"); + assert(r == 0); +} + +// Radix conversion tests +@safe pure unittest +{ + BigUint r; + r.fromHexString("1_E1178E81_00000000"); + assert(r.toHexString(0, '_', 0) == "1_E1178E81_00000000"); + assert(r.toHexString(0, '_', 20) == "0001_E1178E81_00000000"); + assert(r.toHexString(0, '_', 16+8) == "00000001_E1178E81_00000000"); + assert(r.toHexString(0, '_', 16+9) == "0_00000001_E1178E81_00000000"); + assert(r.toHexString(0, '_', 16+8+8) == "00000000_00000001_E1178E81_00000000"); + assert(r.toHexString(0, '_', 16+8+8+1) == "0_00000000_00000001_E1178E81_00000000"); + assert(r.toHexString(0, '_', 16+8+8+1, ' ') == " 1_E1178E81_00000000"); + assert(r.toHexString(0, 0, 16+8+8+1) == "00000000000000001E1178E8100000000"); + r = 0UL; + assert(r.toHexString(0, '_', 0) == "0"); + assert(r.toHexString(0, '_', 7) == "0000000"); + assert(r.toHexString(0, '_', 7, ' ') == " 0"); + assert(r.toHexString(0, '#', 9) == "0#00000000"); + assert(r.toHexString(0, 0, 9) == "000000000"); +} + +// +@safe pure unittest +{ + BigUint r; + r.fromHexString("1_E1178E81_00000000"); + assert(r.toHexString(0, '_', 0, '0', LetterCase.lower) == "1_e1178e81_00000000"); + assert(r.toHexString(0, '_', 20, '0', LetterCase.lower) == "0001_e1178e81_00000000"); + assert(r.toHexString(0, '_', 16+8, '0', LetterCase.lower) == "00000001_e1178e81_00000000"); + assert(r.toHexString(0, '_', 16+9, '0', LetterCase.lower) == "0_00000001_e1178e81_00000000"); + assert(r.toHexString(0, '_', 16+8+8, '0', LetterCase.lower) == "00000000_00000001_e1178e81_00000000"); + assert(r.toHexString(0, '_', 16+8+8+1, '0', LetterCase.lower) == "0_00000000_00000001_e1178e81_00000000"); + assert(r.toHexString(0, '_', 16+8+8+1, ' ', LetterCase.lower) == " 1_e1178e81_00000000"); + assert(r.toHexString(0, 0, 16+8+8+1, '0', LetterCase.lower) == "00000000000000001e1178e8100000000"); + r = 0UL; + assert(r.toHexString(0, '_', 0, '0', LetterCase.lower) == "0"); + assert(r.toHexString(0, '_', 7, '0', LetterCase.lower) == "0000000"); + assert(r.toHexString(0, '_', 7, ' ', LetterCase.lower) == " 0"); + assert(r.toHexString(0, '#', 9, '0', LetterCase.lower) == "0#00000000"); + assert(r.toHexString(0, 'Z', 9, '0', LetterCase.lower) == "0Z00000000"); + assert(r.toHexString(0, 0, 9, '0', LetterCase.lower) == "000000000"); +} + + +private: +void twosComplement(const(BigDigit) [] x, BigDigit[] result) +pure nothrow @safe +{ + foreach (i; 0 .. x.length) + { + result[i] = ~x[i]; + } + result[x.length..$] = BigDigit.max; + + foreach (i; 0 .. result.length) + { + if (result[i] == BigDigit.max) + { + result[i] = 0; + } + else + { + result[i] += 1; + break; + } + } +} + +// Encode BigInt as BigDigit array (sign and 2's complement) +BigDigit[] includeSign(const(BigDigit) [] x, size_t minSize, bool sign) +pure nothrow @safe +{ + size_t length = (x.length > minSize) ? x.length : minSize; + BigDigit [] result = new BigDigit[length]; + if (sign) + { + twosComplement(x, result); + } + else + { + result[0 .. x.length] = x; + } + return result; +} + +// works for any type +T intpow(T)(T x, ulong n) pure nothrow @safe +{ + T p; + + switch (n) + { + case 0: + p = 1; + break; + + case 1: + p = x; + break; + + case 2: + p = x * x; + break; + + default: + p = 1; + while (1) + { + if (n & 1) + p *= x; + n >>= 1; + if (!n) + break; + x *= x; + } + break; + } + return p; +} + + +// returns the maximum power of x that will fit in a uint. +int highestPowerBelowUintMax(uint x) pure nothrow @safe +{ + assert(x>1); + static immutable ubyte [22] maxpwr = [ 31, 20, 15, 13, 12, 11, 10, 10, 9, 9, + 8, 8, 8, 8, 7, 7, 7, 7, 7, 7, 7, 7]; + if (x<24) return maxpwr[x-2]; + if (x<41) return 6; + if (x<85) return 5; + if (x<256) return 4; + if (x<1626) return 3; + if (x<65_536) return 2; + return 1; +} + +// returns the maximum power of x that will fit in a ulong. +int highestPowerBelowUlongMax(uint x) pure nothrow @safe +{ + assert(x>1); + static immutable ubyte [39] maxpwr = [ 63, 40, 31, 27, 24, 22, 21, 20, 19, 18, + 17, 17, 16, 16, 15, 15, 15, 15, 14, 14, + 14, 14, 13, 13, 13, 13, 13, 13, 13, 12, + 12, 12, 12, 12, 12, 12, 12, 12, 12]; + if (x<41) return maxpwr[x-2]; + if (x<57) return 11; + if (x<85) return 10; + if (x<139) return 9; + if (x<256) return 8; + if (x<566) return 7; + if (x<1626) return 6; + if (x<7132) return 5; + if (x<65_536) return 4; + if (x<2_642_246) return 3; + return 2; +} + +version (unittest) +{ + +int slowHighestPowerBelowUintMax(uint x) pure nothrow @safe +{ + int pwr = 1; + for (ulong q = x;x*q < cast(ulong) uint.max; ) + { + q*=x; ++pwr; + } + return pwr; +} + +@safe pure unittest +{ + assert(highestPowerBelowUintMax(10)==9); + for (int k=82; k<88; ++k) + { + assert(highestPowerBelowUintMax(k)== slowHighestPowerBelowUintMax(k)); + } +} + +} + + +/* General unsigned subtraction routine for bigints. + * Sets result = x - y. If the result is negative, negative will be true. + */ +BigDigit [] sub(const BigDigit [] x, const BigDigit [] y, bool *negative) +pure nothrow +{ + if (x.length == y.length) + { + // There's a possibility of cancellation, if x and y are almost equal. + ptrdiff_t last = highestDifferentDigit(x, y); + BigDigit [] result = new BigDigit[last+1]; + if (x[last] < y[last]) + { // we know result is negative + multibyteSub(result[0 .. last+1], y[0 .. last+1], x[0 .. last+1], 0); + *negative = true; + } + else + { // positive or zero result + multibyteSub(result[0 .. last+1], x[0 .. last+1], y[0 .. last+1], 0); + *negative = false; + } + while (result.length > 1 && result[$-1] == 0) + { + result = result[0..$-1]; + } +// if (result.length >1 && result[$-1]==0) return result[0..$-1]; + return result; + } + // Lengths are different + const(BigDigit) [] large, small; + if (x.length < y.length) + { + *negative = true; + large = y; small = x; + } + else + { + *negative = false; + large = x; small = y; + } + // result.length will be equal to larger length, or could decrease by 1. + + + BigDigit [] result = new BigDigit[large.length]; + BigDigit carry = multibyteSub(result[0 .. small.length], large[0 .. small.length], small, 0); + result[small.length..$] = large[small.length..$]; + if (carry) + { + multibyteIncrementAssign!('-')(result[small.length..$], carry); + } + while (result.length > 1 && result[$-1] == 0) + { + result = result[0..$-1]; + } + return result; +} + + +// return a + b +BigDigit [] add(const BigDigit [] a, const BigDigit [] b) pure nothrow +{ + const(BigDigit) [] x, y; + if (a.length < b.length) + { + x = b; y = a; + } + else + { + x = a; y = b; + } + // now we know x.length > y.length + // create result. add 1 in case it overflows + BigDigit [] result = new BigDigit[x.length + 1]; + + BigDigit carry = multibyteAdd(result[0 .. y.length], x[0 .. y.length], y, 0); + if (x.length != y.length) + { + result[y.length..$-1]= x[y.length..$]; + carry = multibyteIncrementAssign!('+')(result[y.length..$-1], carry); + } + if (carry) + { + result[$-1] = carry; + return result; + } + else + return result[0..$-1]; +} + +/** return x + y + */ +BigDigit [] addInt(const BigDigit[] x, ulong y) pure nothrow +{ + uint hi = cast(uint)(y >>> 32); + uint lo = cast(uint)(y& 0xFFFF_FFFF); + auto len = x.length; + if (x.length < 2 && hi != 0) ++len; + BigDigit [] result = new BigDigit[len+1]; + result[0 .. x.length] = x[]; + if (x.length < 2 && hi != 0) + { + result[1]=hi; + hi=0; + } + uint carry = multibyteIncrementAssign!('+')(result[0..$-1], lo); + if (hi != 0) carry += multibyteIncrementAssign!('+')(result[1..$-1], hi); + if (carry) + { + result[$-1] = carry; + return result; + } + else + return result[0..$-1]; +} + +/** Return x - y. + * x must be greater than y. + */ +BigDigit [] subInt(const BigDigit[] x, ulong y) pure nothrow +{ + uint hi = cast(uint)(y >>> 32); + uint lo = cast(uint)(y & 0xFFFF_FFFF); + BigDigit [] result = new BigDigit[x.length]; + result[] = x[]; + multibyteIncrementAssign!('-')(result[], lo); + if (hi) + multibyteIncrementAssign!('-')(result[1..$], hi); + if (result[$-1] == 0) + return result[0..$-1]; + else + return result; +} + +/** General unsigned multiply routine for bigints. + * Sets result = x * y. + * + * The length of y must not be larger than the length of x. + * Different algorithms are used, depending on the lengths of x and y. + * TODO: "Modern Computer Arithmetic" suggests the OddEvenKaratsuba algorithm for the + * unbalanced case. (But I doubt it would be faster in practice). + * + */ +void mulInternal(BigDigit[] result, const(BigDigit)[] x, const(BigDigit)[] y) + pure nothrow +{ + import core.memory : GC; + assert( result.length == x.length + y.length ); + assert( y.length > 0 ); + assert( x.length >= y.length); + if (y.length <= KARATSUBALIMIT) + { + // Small multiplier, we'll just use the asm classic multiply. + if (y.length == 1) + { // Trivial case, no cache effects to worry about + result[x.length] = multibyteMul(result[0 .. x.length], x, y[0], 0); + return; + } + + if (x.length + y.length < CACHELIMIT) + return mulSimple(result, x, y); + + // If x is so big that it won't fit into the cache, we divide it into chunks + // Every chunk must be greater than y.length. + // We make the first chunk shorter, if necessary, to ensure this. + + auto chunksize = CACHELIMIT / y.length; + immutable residual = x.length % chunksize; + if (residual < y.length) + { + chunksize -= y.length; + } + + // Use schoolbook multiply. + mulSimple(result[0 .. chunksize + y.length], x[0 .. chunksize], y); + auto done = chunksize; + + while (done < x.length) + { + // result[done .. done+ylength] already has a value. + chunksize = (done + (CACHELIMIT / y.length) < x.length) ? (CACHELIMIT / y.length) : x.length - done; + BigDigit [KARATSUBALIMIT] partial; + partial[0 .. y.length] = result[done .. done+y.length]; + mulSimple(result[done .. done+chunksize+y.length], x[done .. done+chunksize], y); + addAssignSimple(result[done .. done+chunksize + y.length], partial[0 .. y.length]); + done += chunksize; + } + return; + } + + immutable half = (x.length >> 1) + (x.length & 1); + if (2*y.length*y.length <= x.length*x.length) + { + // UNBALANCED MULTIPLY + // Use school multiply to cut into quasi-squares of Karatsuba-size + // or larger. The ratio of the two sides of the 'square' must be + // between 1.414:1 and 1:1. Use Karatsuba on each chunk. + // + // For maximum performance, we want the ratio to be as close to + // 1:1 as possible. To achieve this, we can either pad x or y. + // The best choice depends on the modulus x%y. + auto numchunks = x.length / y.length; + auto chunksize = y.length; + auto extra = x.length % y.length; + auto maxchunk = chunksize + extra; + bool paddingY; // true = we're padding Y, false = we're padding X. + if (extra * extra * 2 < y.length*y.length) + { + // The leftover bit is small enough that it should be incorporated + // in the existing chunks. + // Make all the chunks a tiny bit bigger + // (We're padding y with zeros) + chunksize += extra / numchunks; + extra = x.length - chunksize*numchunks; + // there will probably be a few left over. + // Every chunk will either have size chunksize, or chunksize+1. + maxchunk = chunksize + 1; + paddingY = true; + assert(chunksize + extra + chunksize *(numchunks-1) == x.length ); + } + else + { + // the extra bit is large enough that it's worth making a new chunk. + // (This means we're padding x with zeros, when doing the first one). + maxchunk = chunksize; + ++numchunks; + paddingY = false; + assert(extra + chunksize *(numchunks-1) == x.length ); + } + // We make the buffer a bit bigger so we have space for the partial sums. + BigDigit [] scratchbuff = new BigDigit[karatsubaRequiredBuffSize(maxchunk) + y.length]; + BigDigit [] partial = scratchbuff[$ - y.length .. $]; + size_t done; // how much of X have we done so far? + if (paddingY) + { + // If the first chunk is bigger, do it first. We're padding y. + mulKaratsuba(result[0 .. y.length + chunksize + (extra > 0 ? 1 : 0 )], + x[0 .. chunksize + (extra>0?1:0)], y, scratchbuff); + done = chunksize + (extra > 0 ? 1 : 0); + if (extra) --extra; + } + else + { // We're padding X. Begin with the extra bit. + mulKaratsuba(result[0 .. y.length + extra], y, x[0 .. extra], scratchbuff); + done = extra; + extra = 0; + } + immutable basechunksize = chunksize; + while (done < x.length) + { + chunksize = basechunksize + (extra > 0 ? 1 : 0); + if (extra) --extra; + partial[] = result[done .. done+y.length]; + mulKaratsuba(result[done .. done + y.length + chunksize], + x[done .. done+chunksize], y, scratchbuff); + addAssignSimple(result[done .. done + y.length + chunksize], partial); + done += chunksize; + } + () @trusted { GC.free(scratchbuff.ptr); } (); + } + else + { + // Balanced. Use Karatsuba directly. + BigDigit [] scratchbuff = new BigDigit[karatsubaRequiredBuffSize(x.length)]; + mulKaratsuba(result, x, y, scratchbuff); + () @trusted { GC.free(scratchbuff.ptr); } (); + } +} + +/** General unsigned squaring routine for BigInts. + * Sets result = x*x. + * NOTE: If the highest half-digit of x is zero, the highest digit of result will + * also be zero. + */ +void squareInternal(BigDigit[] result, const BigDigit[] x) pure nothrow +{ + import core.memory : GC; + // Squaring is potentially half a multiply, plus add the squares of + // the diagonal elements. + assert(result.length == 2*x.length); + if (x.length <= KARATSUBASQUARELIMIT) + { + if (x.length == 1) + { + result[1] = multibyteMul(result[0 .. 1], x, x[0], 0); + return; + } + return squareSimple(result, x); + } + // The nice thing about squaring is that it always stays balanced + BigDigit [] scratchbuff = new BigDigit[karatsubaRequiredBuffSize(x.length)]; + squareKaratsuba(result, x, scratchbuff); + () @trusted { GC.free(scratchbuff.ptr); } (); +} + + +import core.bitop : bsr; + +/// if remainder is null, only calculate quotient. +void divModInternal(BigDigit [] quotient, BigDigit[] remainder, const BigDigit [] u, + const BigDigit [] v) pure nothrow +{ + import core.memory : GC; + assert(quotient.length == u.length - v.length + 1); + assert(remainder == null || remainder.length == v.length); + assert(v.length > 1); + assert(u.length >= v.length); + + // Normalize by shifting v left just enough so that + // its high-order bit is on, and shift u left the + // same amount. The highest bit of u will never be set. + + BigDigit [] vn = new BigDigit[v.length]; + BigDigit [] un = new BigDigit[u.length + 1]; + // How much to left shift v, so that its MSB is set. + uint s = BIGDIGITSHIFTMASK - bsr(v[$-1]); + if (s != 0) + { + multibyteShl(vn, v, s); + un[$-1] = multibyteShl(un[0..$-1], u, s); + } + else + { + vn[] = v[]; + un[0..$-1] = u[]; + un[$-1] = 0; + } + if (quotient.length<FASTDIVLIMIT) + { + schoolbookDivMod(quotient, un, vn); + } + else + { + blockDivMod(quotient, un, vn); + } + + // Unnormalize remainder, if required. + if (remainder != null) + { + if (s == 0) remainder[] = un[0 .. vn.length]; + else multibyteShr(remainder, un[0 .. vn.length+1], s); + } + () @trusted { GC.free(un.ptr); GC.free(vn.ptr); } (); +} + +pure @system unittest +{ + immutable(uint) [] u = [0, 0xFFFF_FFFE, 0x8000_0000]; + immutable(uint) [] v = [0xFFFF_FFFF, 0x8000_0000]; + uint [] q = new uint[u.length - v.length + 1]; + uint [] r = new uint[2]; + divModInternal(q, r, u, v); + assert(q[]==[0xFFFF_FFFFu, 0]); + assert(r[]==[0xFFFF_FFFFu, 0x7FFF_FFFF]); + u = [0, 0xFFFF_FFFE, 0x8000_0001]; + v = [0xFFFF_FFFF, 0x8000_0000]; + divModInternal(q, r, u, v); +} + + +private: +// Converts a big uint to a hexadecimal string. +// +// Optionally, a separator character (eg, an underscore) may be added between +// every 8 digits. +// buff.length must be data.length*8 if separator is zero, +// or data.length*9 if separator is non-zero. It will be completely filled. +char [] biguintToHex(char [] buff, const BigDigit [] data, char separator=0, + LetterCase letterCase = LetterCase.upper) pure nothrow @safe +{ + int x=0; + for (ptrdiff_t i=data.length - 1; i >= 0; --i) + { + toHexZeroPadded(buff[x .. x+8], data[i], letterCase); + x+=8; + if (separator) + { + if (i>0) buff[x] = separator; + ++x; + } + } + return buff; +} + +/** + * Convert a big uint into an octal string. + * + * Params: + * buff = The destination buffer for the octal string. Must be large enough to + * store the result, including leading zeroes, which is + * ceil(data.length * BigDigitBits / 3) characters. + * The buffer is filled from back to front, starting from `buff[$-1]`. + * data = The biguint to be converted. + * + * Returns: The index of the leading non-zero digit in `buff`. Will be + * `buff.length - 1` if the entire big uint is zero. + */ +size_t biguintToOctal(char[] buff, const(BigDigit)[] data) + pure nothrow @safe @nogc +{ + ubyte carry = 0; + int shift = 0; + size_t penPos = buff.length - 1; + size_t lastNonZero = buff.length - 1; + + pragma(inline) void output(uint digit) @nogc nothrow + { + if (digit != 0) + lastNonZero = penPos; + buff[penPos--] = cast(char)('0' + digit); + } + + foreach (bigdigit; data) + { + if (shift < 0) + { + // Some bits were carried over from previous word. + assert(shift > -3); + output(((bigdigit << -shift) | carry) & 0b111); + shift += 3; + assert(shift > 0); + } + + while (shift <= BigDigitBits - 3) + { + output((bigdigit >>> shift) & 0b111); + shift += 3; + } + + if (shift < BigDigitBits) + { + // Some bits need to be carried forward. + carry = (bigdigit >>> shift) & 0b11; + } + shift -= BigDigitBits; + assert(shift >= -2 && shift <= 0); + } + + if (shift < 0) + { + // Last word had bits that haven't been output yet. + assert(shift > -3); + output(carry); + } + + return lastNonZero; +} + +/** Convert a big uint into a decimal string. + * + * Params: + * data The biguint to be converted. Will be destroyed. + * buff The destination buffer for the decimal string. Must be + * large enough to store the result, including leading zeros. + * Will be filled backwards, starting from buff[$-1]. + * + * buff.length must be >= (data.length*32)/log2(10) = 9.63296 * data.length. + * Returns: + * the lowest index of buff which was used. + */ +size_t biguintToDecimal(char [] buff, BigDigit [] data) pure nothrow +{ + ptrdiff_t sofar = buff.length; + // Might be better to divide by (10^38/2^32) since that gives 38 digits for + // the price of 3 divisions and a shr; this version only gives 27 digits + // for 3 divisions. + while (data.length>1) + { + uint rem = multibyteDivAssign(data, 10_0000_0000, 0); + itoaZeroPadded(buff[sofar-9 .. sofar], rem); + sofar -= 9; + if (data[$-1] == 0 && data.length > 1) + { + data.length = data.length - 1; + } + } + itoaZeroPadded(buff[sofar-10 .. sofar], data[0]); + sofar -= 10; + // and strip off the leading zeros + while (sofar != buff.length-1 && buff[sofar] == '0') + sofar++; + return sofar; +} + +/** Convert a decimal string into a big uint. + * + * Params: + * data The biguint to be receive the result. Must be large enough to + * store the result. + * s The decimal string. May contain _ or 0 .. 9 + * + * The required length for the destination buffer is slightly less than + * 1 + s.length/log2(10) = 1 + s.length/3.3219. + * + * Returns: + * the highest index of data which was used. + */ +int biguintFromDecimal(Range)(BigDigit[] data, Range s) +if ( + isInputRange!Range && + isSomeChar!(ElementType!Range) && + !isInfinite!Range) +in +{ + static if (hasLength!Range) + assert((data.length >= 2) || (data.length == 1 && s.length == 1)); +} +body +{ + import std.conv : ConvException; + + // Convert to base 1e19 = 10_000_000_000_000_000_000. + // (this is the largest power of 10 that will fit into a long). + // The length will be less than 1 + s.length/log2(10) = 1 + s.length/3.3219. + // 485 bits will only just fit into 146 decimal digits. + // As we convert the string, we record the number of digits we've seen in base 19: + // hi is the number of digits/19, lo is the extra digits (0 to 18). + // TODO: This is inefficient for very large strings (it is O(n^^2)). + // We should take advantage of fast multiplication once the numbers exceed + // Karatsuba size. + uint lo = 0; // number of powers of digits, 0 .. 18 + uint x = 0; + ulong y = 0; + uint hi = 0; // number of base 1e19 digits + data[0] = 0; // initially number is 0. + if (data.length > 1) + data[1] = 0; + + foreach (character; s) + { + if (character == '_') + continue; + + if (character < '0' || character > '9') + throw new ConvException("invalid digit"); + x *= 10; + x += character - '0'; + ++lo; + if (lo == 9) + { + y = x; + x = 0; + } + if (lo == 18) + { + y *= 10_0000_0000; + y += x; + x = 0; + } + if (lo == 19) + { + y *= 10; + y += x; + x = 0; + // Multiply existing number by 10^19, then add y1. + if (hi>0) + { + data[hi] = multibyteMul(data[0 .. hi], data[0 .. hi], 1_220_703_125*2u, 0); // 5^13*2 = 0x9184_E72A + ++hi; + data[hi] = multibyteMul(data[0 .. hi], data[0 .. hi], 15_625*262_144u, 0); // 5^6*2^18 = 0xF424_0000 + ++hi; + } + else + hi = 2; + uint c = multibyteIncrementAssign!('+')(data[0 .. hi], cast(uint)(y&0xFFFF_FFFF)); + c += multibyteIncrementAssign!('+')(data[1 .. hi], cast(uint)(y >> 32)); + if (c != 0) + { + data[hi]=c; + ++hi; + } + y = 0; + lo = 0; + } + } + // Now set y = all remaining digits. + if (lo >= 18) + { + } + else if (lo >= 9) + { + for (int k=9; k<lo; ++k) y*=10; + y+=x; + } + else + { + for (int k=0; k<lo; ++k) y*=10; + y+=x; + } + if (lo != 0) + { + if (hi == 0) + { + data[0] = cast(uint) y; + if (data.length == 1) + { + hi = 1; + } + else + { + data[1] = cast(uint)(y >>> 32); + hi=2; + } + } + else + { + while (lo>0) + { + immutable c = multibyteMul(data[0 .. hi], data[0 .. hi], 10, 0); + if (c != 0) + { + data[hi]=c; + ++hi; + } + --lo; + } + uint c = multibyteIncrementAssign!('+')(data[0 .. hi], cast(uint)(y&0xFFFF_FFFF)); + if (y > 0xFFFF_FFFFL) + { + c += multibyteIncrementAssign!('+')(data[1 .. hi], cast(uint)(y >> 32)); + } + if (c != 0) + { + data[hi]=c; + ++hi; + } + } + } + while (hi>1 && data[hi-1]==0) + --hi; + return hi; +} + + +private: +// ------------------------ +// These in-place functions are only for internal use; they are incompatible +// with COW. + +// Classic 'schoolbook' multiplication. +void mulSimple(BigDigit[] result, const(BigDigit) [] left, + const(BigDigit)[] right) pure nothrow +in +{ + assert(result.length == left.length + right.length); + assert(right.length>1); +} +body +{ + result[left.length] = multibyteMul(result[0 .. left.length], left, right[0], 0); + multibyteMultiplyAccumulate(result[1..$], left, right[1..$]); +} + +// Classic 'schoolbook' squaring +void squareSimple(BigDigit[] result, const(BigDigit) [] x) pure nothrow +in +{ + assert(result.length == 2*x.length); + assert(x.length>1); +} +body +{ + multibyteSquare(result, x); +} + + +// add two uints of possibly different lengths. Result must be as long +// as the larger length. +// Returns carry (0 or 1). +uint addSimple(BigDigit[] result, const BigDigit [] left, const BigDigit [] right) +pure nothrow +in +{ + assert(result.length == left.length); + assert(left.length >= right.length); + assert(right.length>0); +} +body +{ + uint carry = multibyteAdd(result[0 .. right.length], + left[0 .. right.length], right, 0); + if (right.length < left.length) + { + result[right.length .. left.length] = left[right.length .. $]; + carry = multibyteIncrementAssign!('+')(result[right.length..$], carry); + } + return carry; +} + +// result = left - right +// returns carry (0 or 1) +BigDigit subSimple(BigDigit [] result,const(BigDigit) [] left, + const(BigDigit) [] right) pure nothrow +in +{ + assert(result.length == left.length); + assert(left.length >= right.length); + assert(right.length>0); +} +body +{ + BigDigit carry = multibyteSub(result[0 .. right.length], + left[0 .. right.length], right, 0); + if (right.length < left.length) + { + result[right.length .. left.length] = left[right.length .. $]; + carry = multibyteIncrementAssign!('-')(result[right.length..$], carry); + } //else if (result.length == left.length+1) { result[$-1] = carry; carry=0; } + return carry; +} + + +/* result = result - right + * Returns carry = 1 if result was less than right. +*/ +BigDigit subAssignSimple(BigDigit [] result, const(BigDigit) [] right) +pure nothrow +{ + assert(result.length >= right.length); + uint c = multibyteSub(result[0 .. right.length], result[0 .. right.length], right, 0); + if (c && result.length > right.length) + c = multibyteIncrementAssign!('-')(result[right.length .. $], c); + return c; +} + +/* result = result + right +*/ +BigDigit addAssignSimple(BigDigit [] result, const(BigDigit) [] right) +pure nothrow +{ + assert(result.length >= right.length); + uint c = multibyteAdd(result[0 .. right.length], result[0 .. right.length], right, 0); + if (c && result.length > right.length) + c = multibyteIncrementAssign!('+')(result[right.length .. $], c); + return c; +} + +/* performs result += wantSub? - right : right; +*/ +BigDigit addOrSubAssignSimple(BigDigit [] result, const(BigDigit) [] right, + bool wantSub) pure nothrow +{ + if (wantSub) + return subAssignSimple(result, right); + else + return addAssignSimple(result, right); +} + + +// return true if x<y, considering leading zeros +bool less(const(BigDigit)[] x, const(BigDigit)[] y) pure nothrow +{ + assert(x.length >= y.length); + auto k = x.length-1; + while (x[k]==0 && k >= y.length) + --k; + if (k >= y.length) + return false; + while (k>0 && x[k]==y[k]) + --k; + return x[k] < y[k]; +} + +// Set result = abs(x-y), return true if result is negative(x<y), false if x <= y. +bool inplaceSub(BigDigit[] result, const(BigDigit)[] x, const(BigDigit)[] y) + pure nothrow +{ + assert(result.length == (x.length >= y.length) ? x.length : y.length); + + size_t minlen; + bool negative; + if (x.length >= y.length) + { + minlen = y.length; + negative = less(x, y); + } + else + { + minlen = x.length; + negative = !less(y, x); + } + const (BigDigit)[] large, small; + if (negative) + { + large = y; small = x; + } + else + { + large = x; small = y; + } + + BigDigit carry = multibyteSub(result[0 .. minlen], large[0 .. minlen], small[0 .. minlen], 0); + if (x.length != y.length) + { + result[minlen .. large.length]= large[minlen..$]; + result[large.length..$] = 0; + if (carry) + multibyteIncrementAssign!('-')(result[minlen..$], carry); + } + return negative; +} + +/* Determine how much space is required for the temporaries + * when performing a Karatsuba multiplication. + */ +size_t karatsubaRequiredBuffSize(size_t xlen) pure nothrow @safe +{ + return xlen <= KARATSUBALIMIT ? 0 : 2*xlen; // - KARATSUBALIMIT+2; +} + +/* Sets result = x*y, using Karatsuba multiplication. +* x must be longer or equal to y. +* Valid only for balanced multiplies, where x is not shorter than y. +* It is superior to schoolbook multiplication if and only if +* sqrt(2)*y.length > x.length > y.length. +* Karatsuba multiplication is O(n^1.59), whereas schoolbook is O(n^2) +* The maximum allowable length of x and y is uint.max; but better algorithms +* should be used far before that length is reached. +* Params: +* scratchbuff An array long enough to store all the temporaries. Will be destroyed. +*/ +void mulKaratsuba(BigDigit [] result, const(BigDigit) [] x, + const(BigDigit)[] y, BigDigit [] scratchbuff) pure nothrow +{ + assert(x.length >= y.length); + assert(result.length < uint.max, "Operands too large"); + assert(result.length == x.length + y.length); + if (x.length <= KARATSUBALIMIT) + { + return mulSimple(result, x, y); + } + // Must be almost square (otherwise, a schoolbook iteration is better) + assert(2L * y.length * y.length > (x.length-1) * (x.length-1), + "Bigint Internal Error: Asymmetric Karatsuba"); + + // The subtractive version of Karatsuba multiply uses the following result: + // (Nx1 + x0)*(Ny1 + y0) = (N*N)*x1y1 + x0y0 + N * (x0y0 + x1y1 - mid) + // where mid = (x0-x1)*(y0-y1) + // requiring 3 multiplies of length N, instead of 4. + // The advantage of the subtractive over the additive version is that + // the mid multiply cannot exceed length N. But there are subtleties: + // (x0-x1),(y0-y1) may be negative or zero. To keep it simple, we + // retain all of the leading zeros in the subtractions + + // half length, round up. + auto half = (x.length >> 1) + (x.length & 1); + + const(BigDigit) [] x0 = x[0 .. half]; + const(BigDigit) [] x1 = x[half .. $]; + const(BigDigit) [] y0 = y[0 .. half]; + const(BigDigit) [] y1 = y[half .. $]; + BigDigit [] mid = scratchbuff[0 .. half*2]; + BigDigit [] newscratchbuff = scratchbuff[half*2 .. $]; + BigDigit [] resultLow = result[0 .. 2*half]; + BigDigit [] resultHigh = result[2*half .. $]; + // initially use result to store temporaries + BigDigit [] xdiff= result[0 .. half]; + BigDigit [] ydiff = result[half .. half*2]; + + // First, we calculate mid, and sign of mid + immutable bool midNegative = inplaceSub(xdiff, x0, x1) + ^ inplaceSub(ydiff, y0, y1); + mulKaratsuba(mid, xdiff, ydiff, newscratchbuff); + + // Low half of result gets x0 * y0. High half gets x1 * y1 + + mulKaratsuba(resultLow, x0, y0, newscratchbuff); + + if (2L * y1.length * y1.length < x1.length * x1.length) + { + // an asymmetric situation has been created. + // Worst case is if x:y = 1.414 : 1, then x1:y1 = 2.41 : 1. + // Applying one schoolbook multiply gives us two pieces each 1.2:1 + if (y1.length <= KARATSUBALIMIT) + mulSimple(resultHigh, x1, y1); + else + { + // divide x1 in two, then use schoolbook multiply on the two pieces. + auto quarter = (x1.length >> 1) + (x1.length & 1); + immutable ysmaller = (quarter >= y1.length); + mulKaratsuba(resultHigh[0 .. quarter+y1.length], ysmaller ? x1[0 .. quarter] : y1, + ysmaller ? y1 : x1[0 .. quarter], newscratchbuff); + // Save the part which will be overwritten. + immutable ysmaller2 = ((x1.length - quarter) >= y1.length); + newscratchbuff[0 .. y1.length] = resultHigh[quarter .. quarter + y1.length]; + mulKaratsuba(resultHigh[quarter..$], ysmaller2 ? x1[quarter..$] : y1, + ysmaller2 ? y1 : x1[quarter..$], newscratchbuff[y1.length..$]); + + resultHigh[quarter..$].addAssignSimple(newscratchbuff[0 .. y1.length]); + } + } + else + mulKaratsuba(resultHigh, x1, y1, newscratchbuff); + + /* We now have result = x0y0 + (N*N)*x1y1 + Before adding or subtracting mid, we must calculate + result += N * (x0y0 + x1y1) + We can do this with three half-length additions. With a = x0y0, b = x1y1: + aHI aLO + + aHI aLO + + bHI bLO + + bHI bLO + = R3 R2 R1 R0 + R1 = aHI + bLO + aLO + R2 = aHI + bLO + aHI + carry_from_R1 + R3 = bHi + carry_from_R2 + + It might actually be quicker to do it in two full-length additions: + newscratchbuff[2*half] = addSimple(newscratchbuff[0 .. 2*half], result[0 .. 2*half], result[2*half..$]); + addAssignSimple(result[half..$], newscratchbuff[0 .. 2*half+1]); + */ + BigDigit[] R1 = result[half .. half*2]; + BigDigit[] R2 = result[half*2 .. half*3]; + BigDigit[] R3 = result[half*3..$]; + BigDigit c1 = multibyteAdd(R2, R2, R1, 0); // c1:R2 = R2 + R1 + BigDigit c2 = multibyteAdd(R1, R2, result[0 .. half], 0); // c2:R1 = R2 + R1 + R0 + BigDigit c3 = addAssignSimple(R2, R3); // R2 = R2 + R1 + R3 + if (c1+c2) + multibyteIncrementAssign!('+')(result[half*2..$], c1+c2); + if (c1+c3) + multibyteIncrementAssign!('+')(R3, c1+c3); + + // And finally we subtract mid + addOrSubAssignSimple(result[half..$], mid, !midNegative); +} + +void squareKaratsuba(BigDigit [] result, const BigDigit [] x, + BigDigit [] scratchbuff) pure nothrow +{ + // See mulKaratsuba for implementation comments. + // Squaring is simpler, since it never gets asymmetric. + assert(result.length < uint.max, "Operands too large"); + assert(result.length == 2*x.length); + if (x.length <= KARATSUBASQUARELIMIT) + { + return squareSimple(result, x); + } + // half length, round up. + auto half = (x.length >> 1) + (x.length & 1); + + const(BigDigit)[] x0 = x[0 .. half]; + const(BigDigit)[] x1 = x[half .. $]; + BigDigit [] mid = scratchbuff[0 .. half*2]; + BigDigit [] newscratchbuff = scratchbuff[half*2 .. $]; + // initially use result to store temporaries + BigDigit [] xdiff= result[0 .. half]; + const BigDigit [] ydiff = result[half .. half*2]; + + // First, we calculate mid. We don't need its sign + inplaceSub(xdiff, x0, x1); + squareKaratsuba(mid, xdiff, newscratchbuff); + + // Set result = x0x0 + (N*N)*x1x1 + squareKaratsuba(result[0 .. 2*half], x0, newscratchbuff); + squareKaratsuba(result[2*half .. $], x1, newscratchbuff); + + /* result += N * (x0x0 + x1x1) + Do this with three half-length additions. With a = x0x0, b = x1x1: + R1 = aHI + bLO + aLO + R2 = aHI + bLO + aHI + carry_from_R1 + R3 = bHi + carry_from_R2 + */ + BigDigit[] R1 = result[half .. half*2]; + BigDigit[] R2 = result[half*2 .. half*3]; + BigDigit[] R3 = result[half*3..$]; + BigDigit c1 = multibyteAdd(R2, R2, R1, 0); // c1:R2 = R2 + R1 + BigDigit c2 = multibyteAdd(R1, R2, result[0 .. half], 0); // c2:R1 = R2 + R1 + R0 + BigDigit c3 = addAssignSimple(R2, R3); // R2 = R2 + R1 + R3 + if (c1+c2) multibyteIncrementAssign!('+')(result[half*2..$], c1+c2); + if (c1+c3) multibyteIncrementAssign!('+')(R3, c1+c3); + + // And finally we subtract mid, which is always positive + subAssignSimple(result[half..$], mid); +} + +/* Knuth's Algorithm D, as presented in + * H.S. Warren, "Hacker's Delight", Addison-Wesley Professional (2002). + * Also described in "Modern Computer Arithmetic" 0.2, Exercise 1.8.18. + * Given u and v, calculates quotient = u / v, u = u % v. + * v must be normalized (ie, the MSB of v must be 1). + * The most significant words of quotient and u may be zero. + * u[0 .. v.length] holds the remainder. + */ +void schoolbookDivMod(BigDigit [] quotient, BigDigit [] u, in BigDigit [] v) + pure nothrow +{ + assert(quotient.length == u.length - v.length); + assert(v.length > 1); + assert(u.length >= v.length); + assert((v[$-1]&0x8000_0000)!=0); + assert(u[$-1] < v[$-1]); + // BUG: This code only works if BigDigit is uint. + uint vhi = v[$-1]; + uint vlo = v[$-2]; + + for (ptrdiff_t j = u.length - v.length - 1; j >= 0; j--) + { + // Compute estimate of quotient[j], + // qhat = (three most significant words of u)/(two most sig words of v). + uint qhat; + if (u[j + v.length] == vhi) + { + // uu/vhi could exceed uint.max (it will be 0x8000_0000 or 0x8000_0001) + qhat = uint.max; + } + else + { + uint ulo = u[j + v.length - 2]; + version (D_InlineAsm_X86) + { + // Note: On DMD, this is only ~10% faster than the non-asm code. + uint *p = &u[j + v.length - 1]; + asm pure nothrow + { + mov EAX, p; + mov EDX, [EAX+4]; + mov EAX, [EAX]; + div dword ptr [vhi]; + mov qhat, EAX; + mov ECX, EDX; +div3by2correction: + mul dword ptr [vlo]; // EDX:EAX = qhat * vlo + sub EAX, ulo; + sbb EDX, ECX; + jbe div3by2done; + mov EAX, qhat; + dec EAX; + mov qhat, EAX; + add ECX, dword ptr [vhi]; + jnc div3by2correction; +div3by2done: ; + } + } + else + { // version (InlineAsm) + ulong uu = (cast(ulong)(u[j + v.length]) << 32) | u[j + v.length - 1]; + immutable bigqhat = uu / vhi; + ulong rhat = uu - bigqhat * vhi; + qhat = cast(uint) bigqhat; +again: + if (cast(ulong) qhat * vlo > ((rhat << 32) + ulo)) + { + --qhat; + rhat += vhi; + if (!(rhat & 0xFFFF_FFFF_0000_0000L)) + goto again; + } + } // version (InlineAsm) + } + // Multiply and subtract. + uint carry = multibyteMulAdd!('-')(u[j .. j + v.length], v, qhat, 0); + + if (u[j+v.length] < carry) + { + // If we subtracted too much, add back + --qhat; + carry -= multibyteAdd(u[j .. j + v.length],u[j .. j + v.length], v, 0); + } + quotient[j] = qhat; + u[j + v.length] = u[j + v.length] - carry; + } +} + +private: + +// TODO: Replace with a library call +void itoaZeroPadded(char[] output, uint value) + pure nothrow @safe @nogc +{ + for (auto i = output.length; i--;) + { + if (value < 10) + { + output[i] = cast(char)(value + '0'); + value = 0; + } + else + { + output[i] = cast(char)(value % 10 + '0'); + value /= 10; + } + } +} + +void toHexZeroPadded(char[] output, uint value, + LetterCase letterCase = LetterCase.upper) pure nothrow @safe +{ + ptrdiff_t x = output.length - 1; + static immutable string upperHexDigits = "0123456789ABCDEF"; + static immutable string lowerHexDigits = "0123456789abcdef"; + for ( ; x >= 0; --x) + { + if (letterCase == LetterCase.upper) + { + output[x] = upperHexDigits[value & 0xF]; + } + else + { + output[x] = lowerHexDigits[value & 0xF]; + } + value >>= 4; + } +} + +private: + +// Returns the highest value of i for which left[i]!=right[i], +// or 0 if left[] == right[] +size_t highestDifferentDigit(const BigDigit [] left, const BigDigit [] right) +pure nothrow @nogc @safe +{ + assert(left.length == right.length); + for (ptrdiff_t i = left.length - 1; i>0; --i) + { + if (left[i] != right[i]) + return i; + } + return 0; +} + +// Returns the lowest value of i for which x[i]!=0. +int firstNonZeroDigit(const BigDigit [] x) pure nothrow @nogc @safe +{ + int k = 0; + while (x[k]==0) + { + ++k; + assert(k<x.length); + } + return k; +} + +/* + Calculate quotient and remainder of u / v using fast recursive division. + v must be normalised, and must be at least half as long as u. + Given u and v, v normalised, calculates quotient = u/v, u = u%v. + scratch is temporary storage space, length must be >= quotient + 1. + +Returns: + u[0 .. v.length] is the remainder. u[v.length..$] is corrupted. + + Implements algorithm 1.8 from MCA. + This algorithm has an annoying special case. After the first recursion, the + highest bit of the quotient may be set. This means that in the second + recursive call, the 'in' contract would be violated. (This happens only + when the top quarter of u is equal to the top half of v. A base 10 + equivalent example of this situation is 5517/56; the first step gives + 55/5 = 11). To maintain the in contract, we pad a zero to the top of both + u and the quotient. 'mayOverflow' indicates that that the special case + has occurred. + (In MCA, a different strategy is used: the in contract is weakened, and + schoolbookDivMod is more general: it allows the high bit of u to be set). + See also: + - C. Burkinel and J. Ziegler, "Fast Recursive Division", MPI-I-98-1-022, + Max-Planck Institute fuer Informatik, (Oct 1998). +*/ +void recursiveDivMod(BigDigit[] quotient, BigDigit[] u, const(BigDigit)[] v, + BigDigit[] scratch, bool mayOverflow = false) + pure nothrow +in +{ + // v must be normalized + assert(v.length > 1); + assert((v[$ - 1] & 0x8000_0000) != 0); + assert(!(u[$ - 1] & 0x8000_0000)); + assert(quotient.length == u.length - v.length); + if (mayOverflow) + { + assert(u[$-1] == 0); + assert(u[$-2] & 0x8000_0000); + } + + // Must be symmetric. Use block schoolbook division if not. + assert((mayOverflow ? u.length-1 : u.length) <= 2 * v.length); + assert((mayOverflow ? u.length-1 : u.length) >= v.length); + assert(scratch.length >= quotient.length + (mayOverflow ? 0 : 1)); +} +body +{ + if (quotient.length < FASTDIVLIMIT) + { + return schoolbookDivMod(quotient, u, v); + } + + // Split quotient into two halves, but keep padding in the top half + auto k = (mayOverflow ? quotient.length - 1 : quotient.length) >> 1; + + // RECURSION 1: Calculate the high half of the quotient + + // Note that if u and quotient were padded, they remain padded during + // this call, so in contract is satisfied. + recursiveDivMod(quotient[k .. $], u[2 * k .. $], v[k .. $], + scratch, mayOverflow); + + // quotient[k..$] is our guess at the high quotient. + // u[2*k .. 2.*k + v.length - k = k + v.length] is the high part of the + // first remainder. u[0 .. 2*k] is the low part. + + // Calculate the full first remainder to be + // remainder - highQuotient * lowDivisor + // reducing highQuotient until the remainder is positive. + // The low part of the remainder, u[0 .. k], cannot be altered by this. + + adjustRemainder(quotient[k .. $], u[k .. k + v.length], v, k, + scratch[0 .. quotient.length], mayOverflow); + + // RECURSION 2: Calculate the low half of the quotient + // The full first remainder is now in u[0 .. k + v.length]. + + if (u[k + v.length - 1] & 0x8000_0000) + { + // Special case. The high quotient is 0x1_00...000 or 0x1_00...001. + // This means we need an extra quotient word for the next recursion. + // We need to restore the invariant for the recursive calls. + // We do this by padding both u and quotient. Extending u is trivial, + // because the higher words will not be used again. But for the + // quotient, we're clobbering the low word of the high quotient, + // so we need save it, and add it back in after the recursive call. + + auto clobberedQuotient = quotient[k]; + u[k+v.length] = 0; + + recursiveDivMod(quotient[0 .. k+1], u[k .. k + v.length+1], + v[k .. $], scratch, true); + adjustRemainder(quotient[0 .. k+1], u[0 .. v.length], v, k, + scratch[0 .. 2 * k+1], true); + + // Now add the quotient word that got clobbered earlier. + multibyteIncrementAssign!('+')(quotient[k..$], clobberedQuotient); + } + else + { + // The special case has NOT happened. + recursiveDivMod(quotient[0 .. k], u[k .. k + v.length], v[k .. $], + scratch, false); + + // high remainder is in u[k .. k+(v.length-k)] == u[k .. v.length] + + adjustRemainder(quotient[0 .. k], u[0 .. v.length], v, k, + scratch[0 .. 2 * k]); + } +} + +// rem -= quot * v[0 .. k]. +// If would make rem negative, decrease quot until rem is >= 0. +// Needs (quot.length * k) scratch space to store the result of the multiply. +void adjustRemainder(BigDigit[] quot, BigDigit[] rem, const(BigDigit)[] v, + ptrdiff_t k, + BigDigit[] scratch, bool mayOverflow = false) pure nothrow +{ + assert(rem.length == v.length); + mulInternal(scratch, quot, v[0 .. k]); + uint carry = 0; + if (mayOverflow) + carry = scratch[$-1] + subAssignSimple(rem, scratch[0..$-1]); + else + carry = subAssignSimple(rem, scratch); + while (carry) + { + multibyteIncrementAssign!('-')(quot, 1); // quot-- + carry -= multibyteAdd(rem, rem, v, 0); + } +} + +// Cope with unbalanced division by performing block schoolbook division. +void blockDivMod(BigDigit [] quotient, BigDigit [] u, in BigDigit [] v) +pure nothrow +{ + import core.memory : GC; + assert(quotient.length == u.length - v.length); + assert(v.length > 1); + assert(u.length >= v.length); + assert((v[$-1] & 0x8000_0000)!=0); + assert((u[$-1] & 0x8000_0000)==0); + BigDigit [] scratch = new BigDigit[v.length + 1]; + + // Perform block schoolbook division, with 'v.length' blocks. + auto m = u.length - v.length; + while (m > v.length) + { + immutable mayOverflow = (u[m + v.length -1 ] & 0x8000_0000)!=0; + BigDigit saveq; + if (mayOverflow) + { + u[m + v.length] = 0; + saveq = quotient[m]; + } + recursiveDivMod(quotient[m-v.length .. m + (mayOverflow? 1: 0)], + u[m - v.length .. m + v.length + (mayOverflow? 1: 0)], v, scratch, mayOverflow); + if (mayOverflow) + { + assert(quotient[m] == 0); + quotient[m] = saveq; + } + m -= v.length; + } + recursiveDivMod(quotient[0 .. m], u[0 .. m + v.length], v, scratch); + () @trusted { GC.free(scratch.ptr); } (); +} + +@system unittest +{ + import core.stdc.stdio; + + void printBiguint(const uint [] data) + { + char [] buff = biguintToHex(new char[data.length*9], data, '_'); + printf("%.*s\n", buff.length, buff.ptr); + } + + void printDecimalBigUint(BigUint data) + { + auto str = data.toDecimalString(0); + printf("%.*s\n", str.length, str.ptr); + } + + uint [] a, b; + a = new uint[43]; + b = new uint[179]; + for (int i=0; i<a.length; ++i) a[i] = 0x1234_B6E9 + i; + for (int i=0; i<b.length; ++i) b[i] = 0x1BCD_8763 - i*546; + + a[$-1] |= 0x8000_0000; + uint [] r = new uint[a.length]; + uint [] q = new uint[b.length-a.length+1]; + + divModInternal(q, r, b, a); + q = q[0..$-1]; + uint [] r1 = r.dup; + uint [] q1 = q.dup; + blockDivMod(q, b, a); + r = b[0 .. a.length]; + assert(r[] == r1[]); + assert(q[] == q1[]); +} + +// biguintToOctal +@safe unittest +{ + enum bufSize = 5 * BigDigitBits / 3 + 1; + auto buf = new char[bufSize]; + size_t i; + BigDigit[] data = [ 342391 ]; + + // Basic functionality with single word + i = biguintToOctal(buf, data); + assert(i == bufSize - 7 && buf[i .. $] == "1234567"); + + // Test carrying bits between words + data = [ 0x77053977, 0x39770539, 0x00000005 ]; + i = biguintToOctal(buf, data); + assert(i == bufSize - 23 && buf[i .. $] == "12345670123456701234567"); + + // Test carried bits in the last word + data = [ 0x80000000 ]; + i = biguintToOctal(buf, data); + assert(buf[i .. $] == "20000000000"); + + // Test boundary between 3rd and 4th word where the number of bits is + // divisible by 3 and no bits should be carried. + // + // The 0xC0000000's are for "poisoning" the carry to be non-zero when the + // rollover happens, so that if any bugs happen in wrongly adding the carry + // to the next word, non-zero bits will show up in the output. + data = [ 0xC0000000, 0xC0000000, 0xC0000000, 0x00000010 ]; + i = biguintToOctal(buf, data); + assert(buf[i .. $] == "2060000000001400000000030000000000"); + + // Boundary case: 0 + data = [ 0 ]; + i = biguintToOctal(buf, data); + assert(buf[i .. $] == "0"); +} diff --git a/libphobos/src/std/internal/math/biguintnoasm.d b/libphobos/src/std/internal/math/biguintnoasm.d new file mode 100644 index 0000000..aea1d50 --- /dev/null +++ b/libphobos/src/std/internal/math/biguintnoasm.d @@ -0,0 +1,370 @@ +/** Arbitrary precision arithmetic ('bignum') for processors with no asm support + * + * All functions operate on arrays of uints, stored LSB first. + * If there is a destination array, it will be the first parameter. + * Currently, all of these functions are subject to change, and are + * intended for internal use only. + * This module is intended only to assist development of high-speed routines + * on currently unsupported processors. + * The X86 asm version is about 30 times faster than the D version (DMD). + */ + +/* Copyright Don Clugston 2008 - 2010. + * Distributed under the Boost Software License, Version 1.0. + * (See accompanying file LICENSE_1_0.txt or copy at + * http://www.boost.org/LICENSE_1_0.txt) + */ + +module std.internal.math.biguintnoasm; + +nothrow: +@safe: + +public: +alias BigDigit = uint; // A Bignum is an array of BigDigits. + + // Limits for when to switch between multiplication algorithms. +enum int KARATSUBALIMIT = 10; // Minimum value for which Karatsuba is worthwhile. +enum int KARATSUBASQUARELIMIT = 12; // Minimum value for which square Karatsuba is worthwhile + + +/** Multi-byte addition or subtraction + * dest[] = src1[] + src2[] + carry (0 or 1). + * or dest[] = src1[] - src2[] - carry (0 or 1). + * Returns carry or borrow (0 or 1). + * Set op == '+' for addition, '-' for subtraction. + */ +uint multibyteAddSub(char op)(uint[] dest, const(uint) [] src1, + const (uint) [] src2, uint carry) pure @nogc @safe +{ + ulong c = carry; + for (size_t i = 0; i < src2.length; ++i) + { + static if (op=='+') c = c + src1[i] + src2[i]; + else c = cast(ulong) src1[i] - src2[i] - c; + dest[i] = cast(uint) c; + c = (c > 0xFFFF_FFFF); + } + return cast(uint) c; +} + +@safe unittest +{ + uint [] a = new uint[40]; + uint [] b = new uint[40]; + uint [] c = new uint[40]; + for (size_t i = 0; i < a.length; ++i) + { + if (i&1) a[i]=cast(uint)(0x8000_0000 + i); + else a[i]=cast(uint) i; + b[i]= 0x8000_0003; + } + c[19]=0x3333_3333; + uint carry = multibyteAddSub!('+')(c[0 .. 18], b[0 .. 18], a[0 .. 18], 0); + assert(c[0]==0x8000_0003); + assert(c[1]==4); + assert(c[19]==0x3333_3333); // check for overrun + assert(carry == 1); + for (size_t i = 0; i < a.length; ++i) + { + a[i] = b[i] = c[i] = 0; + } + a[8]=0x048D159E; + b[8]=0x048D159E; + a[10]=0x1D950C84; + b[10]=0x1D950C84; + a[5] =0x44444444; + carry = multibyteAddSub!('-')(a[0 .. 12], a[0 .. 12], b[0 .. 12], 0); + assert(a[11] == 0); + for (size_t i = 0; i < 10; ++i) + if (i != 5) + assert(a[i] == 0); + + for (size_t q = 3; q < 36; ++q) + { + for (size_t i = 0; i< a.length; ++i) + { + a[i] = b[i] = c[i] = 0; + } + a[q-2]=0x040000; + b[q-2]=0x040000; + carry = multibyteAddSub!('-')(a[0 .. q], a[0 .. q], b[0 .. q], 0); + assert(a[q-2]==0); + } +} + + + +/** dest[] += carry, or dest[] -= carry. + * op must be '+' or '-' + * Returns final carry or borrow (0 or 1) + */ +uint multibyteIncrementAssign(char op)(uint[] dest, uint carry) + pure @nogc @safe +{ + static if (op=='+') + { + ulong c = carry; + c += dest[0]; + dest[0] = cast(uint) c; + if (c <= 0xFFFF_FFFF) + return 0; + + for (size_t i = 1; i < dest.length; ++i) + { + ++dest[i]; + if (dest[i] != 0) + return 0; + } + return 1; + } + else + { + ulong c = carry; + c = dest[0] - c; + dest[0] = cast(uint) c; + if (c <= 0xFFFF_FFFF) + return 0; + for (size_t i = 1; i < dest.length; ++i) + { + --dest[i]; + if (dest[i] != 0xFFFF_FFFF) + return 0; + } + return 1; + } +} + +/** dest[] = src[] << numbits + * numbits must be in the range 1 .. 31 + */ +uint multibyteShl(uint [] dest, const(uint) [] src, uint numbits) + pure @nogc @safe +{ + ulong c = 0; + for (size_t i = 0; i < dest.length; ++i) + { + c += (cast(ulong)(src[i]) << numbits); + dest[i] = cast(uint) c; + c >>>= 32; + } + return cast(uint) c; +} + + +/** dest[] = src[] >> numbits + * numbits must be in the range 1 .. 31 + */ +void multibyteShr(uint [] dest, const(uint) [] src, uint numbits) + pure @nogc @safe +{ + ulong c = 0; + for (ptrdiff_t i = dest.length; i != 0; --i) + { + c += (src[i-1] >>numbits) + (cast(ulong)(src[i-1]) << (64 - numbits)); + dest[i-1] = cast(uint) c; + c >>>= 32; + } +} + +@safe unittest +{ + + uint [] aa = [0x1222_2223, 0x4555_5556, 0x8999_999A, 0xBCCC_CCCD, 0xEEEE_EEEE]; + multibyteShr(aa[0..$-2], aa, 4); + assert(aa[0] == 0x6122_2222 && aa[1] == 0xA455_5555 && aa[2] == 0x0899_9999); + assert(aa[3] == 0xBCCC_CCCD); + + aa = [0x1222_2223, 0x4555_5556, 0x8999_999A, 0xBCCC_CCCD, 0xEEEE_EEEE]; + multibyteShr(aa[0..$-1], aa, 4); + assert(aa[0] == 0x6122_2222 && aa[1] == 0xA455_5555 + && aa[2] == 0xD899_9999 && aa[3] == 0x0BCC_CCCC); + + aa = [0xF0FF_FFFF, 0x1222_2223, 0x4555_5556, 0x8999_999A, 0xBCCC_CCCD, + 0xEEEE_EEEE]; + multibyteShl(aa[1 .. 4], aa[1..$], 4); + assert(aa[0] == 0xF0FF_FFFF && aa[1] == 0x2222_2230 + && aa[2]==0x5555_5561 && aa[3]==0x9999_99A4 && aa[4]==0x0BCCC_CCCD); +} + +/** dest[] = src[] * multiplier + carry. + * Returns carry. + */ +uint multibyteMul(uint[] dest, const(uint)[] src, uint multiplier, uint carry) + pure @nogc @safe +{ + assert(dest.length == src.length); + ulong c = carry; + for (size_t i = 0; i < src.length; ++i) + { + c += cast(ulong)(src[i]) * multiplier; + dest[i] = cast(uint) c; + c>>=32; + } + return cast(uint) c; +} + +@safe unittest +{ + uint [] aa = [0xF0FF_FFFF, 0x1222_2223, 0x4555_5556, 0x8999_999A, + 0xBCCC_CCCD, 0xEEEE_EEEE]; + multibyteMul(aa[1 .. 4], aa[1 .. 4], 16, 0); + assert(aa[0] == 0xF0FF_FFFF && aa[1] == 0x2222_2230 && aa[2]==0x5555_5561 + && aa[3]==0x9999_99A4 && aa[4]==0x0BCCC_CCCD); +} + +/** + * dest[] += src[] * multiplier + carry(0 .. FFFF_FFFF). + * Returns carry out of MSB (0 .. FFFF_FFFF). + */ +uint multibyteMulAdd(char op)(uint [] dest, const(uint)[] src, + uint multiplier, uint carry) pure @nogc @safe +{ + assert(dest.length == src.length); + ulong c = carry; + for (size_t i = 0; i < src.length; ++i) + { + static if (op=='+') + { + c += cast(ulong)(multiplier) * src[i] + dest[i]; + dest[i] = cast(uint) c; + c >>= 32; + } + else + { + c += cast(ulong) multiplier * src[i]; + ulong t = cast(ulong) dest[i] - cast(uint) c; + dest[i] = cast(uint) t; + c = cast(uint)((c >> 32) - (t >> 32)); + } + } + return cast(uint) c; +} + +@safe unittest +{ + + uint [] aa = [0xF0FF_FFFF, 0x1222_2223, 0x4555_5556, 0x8999_999A, + 0xBCCC_CCCD, 0xEEEE_EEEE]; + uint [] bb = [0x1234_1234, 0xF0F0_F0F0, 0x00C0_C0C0, 0xF0F0_F0F0, + 0xC0C0_C0C0]; + multibyteMulAdd!('+')(bb[1..$-1], aa[1..$-2], 16, 5); + assert(bb[0] == 0x1234_1234 && bb[4] == 0xC0C0_C0C0); + assert(bb[1] == 0x2222_2230 + 0xF0F0_F0F0 + 5 + && bb[2] == 0x5555_5561 + 0x00C0_C0C0 + 1 + && bb[3] == 0x9999_99A4 + 0xF0F0_F0F0 ); +} + + +/** + Sets result = result[0 .. left.length] + left * right + + It is defined in this way to allow cache-efficient multiplication. + This function is equivalent to: + ---- + for (size_t i = 0; i< right.length; ++i) + { + dest[left.length + i] = multibyteMulAdd(dest[i .. left.length+i], + left, right[i], 0); + } + ---- + */ +void multibyteMultiplyAccumulate(uint [] dest, const(uint)[] left, const(uint) + [] right) pure @nogc @safe +{ + for (size_t i = 0; i < right.length; ++i) + { + dest[left.length + i] = multibyteMulAdd!('+')(dest[i .. left.length+i], + left, right[i], 0); + } +} + +/** dest[] /= divisor. + * overflow is the initial remainder, and must be in the range 0 .. divisor-1. + */ +uint multibyteDivAssign(uint [] dest, uint divisor, uint overflow) + pure @nogc @safe +{ + ulong c = cast(ulong) overflow; + for (ptrdiff_t i = dest.length-1; i >= 0; --i) + { + c = (c << 32) + cast(ulong)(dest[i]); + uint q = cast(uint)(c/divisor); + c -= divisor * q; + dest[i] = q; + } + return cast(uint) c; +} + +@safe unittest +{ + uint [] aa = new uint[101]; + for (uint i = 0; i < aa.length; ++i) + aa[i] = 0x8765_4321 * (i+3); + uint overflow = multibyteMul(aa, aa, 0x8EFD_FCFB, 0x33FF_7461); + uint r = multibyteDivAssign(aa, 0x8EFD_FCFB, overflow); + for (uint i=0; i<aa.length; ++i) + { + assert(aa[i] == 0x8765_4321 * (i+3)); + } + assert(r == 0x33FF_7461); + +} +// Set dest[2*i .. 2*i+1]+=src[i]*src[i] +void multibyteAddDiagonalSquares(uint[] dest, const(uint)[] src) + pure @nogc @safe +{ + ulong c = 0; + for (size_t i = 0; i < src.length; ++i) + { + // At this point, c is 0 or 1, since FFFF*FFFF+FFFF_FFFF = 1_0000_0000. + c += cast(ulong)(src[i]) * src[i] + dest[2*i]; + dest[2*i] = cast(uint) c; + c = (c>>=32) + dest[2*i+1]; + dest[2*i+1] = cast(uint) c; + c >>= 32; + } +} + +// Does half a square multiply. (square = diagonal + 2*triangle) +void multibyteTriangleAccumulate(uint[] dest, const(uint)[] x) + pure @nogc @safe +{ + // x[0]*x[1...$] + x[1]*x[2..$] + ... + x[$-2]x[$-1..$] + dest[x.length] = multibyteMul(dest[1 .. x.length], x[1..$], x[0], 0); + if (x.length < 4) + { + if (x.length == 3) + { + ulong c = cast(ulong)(x[$-1]) * x[$-2] + dest[2*x.length-3]; + dest[2*x.length - 3] = cast(uint) c; + c >>= 32; + dest[2*x.length - 2] = cast(uint) c; + } + return; + } + for (size_t i = 2; i < x.length - 2; ++i) + { + dest[i-1+ x.length] = multibyteMulAdd!('+')( + dest[i+i-1 .. i+x.length-1], x[i..$], x[i-1], 0); + } + // Unroll the last two entries, to reduce loop overhead: + ulong c = cast(ulong)(x[$-3]) * x[$-2] + dest[2*x.length-5]; + dest[2*x.length-5] = cast(uint) c; + c >>= 32; + c += cast(ulong)(x[$-3]) * x[$-1] + dest[2*x.length-4]; + dest[2*x.length-4] = cast(uint) c; + c >>= 32; + c += cast(ulong)(x[$-1]) * x[$-2]; + dest[2*x.length-3] = cast(uint) c; + c >>= 32; + dest[2*x.length-2] = cast(uint) c; +} + +void multibyteSquare(BigDigit[] result, const(BigDigit) [] x) pure @nogc @safe +{ + multibyteTriangleAccumulate(result, x); + result[$-1] = multibyteShl(result[1..$-1], result[1..$-1], 1); // mul by 2 + result[0] = 0; + multibyteAddDiagonalSquares(result, x); +} diff --git a/libphobos/src/std/internal/math/biguintx86.d b/libphobos/src/std/internal/math/biguintx86.d new file mode 100644 index 0000000..bd03d2e --- /dev/null +++ b/libphobos/src/std/internal/math/biguintx86.d @@ -0,0 +1,1353 @@ +/** Optimised asm arbitrary precision arithmetic ('bignum') + * routines for X86 processors. + * + * All functions operate on arrays of uints, stored LSB first. + * If there is a destination array, it will be the first parameter. + * Currently, all of these functions are subject to change, and are + * intended for internal use only. + * The symbol [#] indicates an array of machine words which is to be + * interpreted as a multi-byte number. + */ + +/* Copyright Don Clugston 2008 - 2010. + * Distributed under the Boost Software License, Version 1.0. + * (See accompanying file LICENSE_1_0.txt or copy at + * http://www.boost.org/LICENSE_1_0.txt) + */ +/** + * In simple terms, there are 3 modern x86 microarchitectures: + * (a) the P6 family (Pentium Pro, PII, PIII, PM, Core), produced by Intel; + * (b) the K6, Athlon, and AMD64 families, produced by AMD; and + * (c) the Pentium 4, produced by Marketing. + * + * This code has been optimised for the Intel P6 family. + * Generally the code remains near-optimal for Intel Core2/Corei7, after + * translating EAX-> RAX, etc, since all these CPUs use essentially the same + * pipeline, and are typically limited by memory access. + * The code uses techniques described in Agner Fog's superb Pentium manuals + * available at www.agner.org. + * Not optimised for AMD, which can do two memory loads per cycle (Intel + * CPUs can only do one). Despite this, performance is superior on AMD. + * Performance is dreadful on P4. + * + * Timing results (cycles per int) + * --Intel Pentium-- --AMD-- + * PM P4 Core2 K7 + * +,- 2.25 15.6 2.25 1.5 + * <<,>> 2.0 6.6 2.0 5.0 + * (<< MMX) 1.7 5.3 1.5 1.2 + * * 5.0 15.0 4.0 4.3 + * mulAdd 5.7 19.0 4.9 4.0 + * div 30.0 32.0 32.0 22.4 + * mulAcc(32) 6.5 20.0 5.4 4.9 + * + * mulAcc(32) is multiplyAccumulate() for a 32*32 multiply. Thus it includes + * function call overhead. + * The timing for Div is quite unpredictable, but it's probably too slow + * to be useful. On 64-bit processors, these times should + * halve if run in 64-bit mode, except for the MMX functions. + */ + +module std.internal.math.biguintx86; + +@system: +pure: +nothrow: + +/* + Naked asm is used throughout, because: + (a) it frees up the EBP register + (b) compiler bugs prevent the use of .ptr when a frame pointer is used. +*/ + +version (D_InlineAsm_X86) +{ + +private: + +/* Duplicate string s, with n times, substituting index for '@'. + * + * Each instance of '@' in s is replaced by 0,1,...n-1. This is a helper + * function for some of the asm routines. + */ +string indexedLoopUnroll(int n, string s) pure @safe +{ + string u; + for (int i = 0; i<n; ++i) + { + string nstr= (i>9 ? ""~ cast(char)('0'+i/10) : "") ~ cast(char)('0' + i%10); + + int last = 0; + for (int j = 0; j<s.length; ++j) + { + if (s[j]=='@') + { + u ~= s[last .. j] ~ nstr; + last = j+1; + } + } + if (last<s.length) u = u ~ s[last..$]; + + } + return u; +} +@safe unittest +{ + assert(indexedLoopUnroll(3, "@*23;")=="0*23;1*23;2*23;"); +} + +public: + +alias BigDigit = uint; // A Bignum is an array of BigDigits. Usually the machine word size. + +// Limits for when to switch between multiplication algorithms. +enum : int { KARATSUBALIMIT = 18 }; // Minimum value for which Karatsuba is worthwhile. +enum : int { KARATSUBASQUARELIMIT=26 }; // Minimum value for which square Karatsuba is worthwhile + +/** Multi-byte addition or subtraction + * dest[#] = src1[#] + src2[#] + carry (0 or 1). + * or dest[#] = src1[#] - src2[#] - carry (0 or 1). + * Returns carry or borrow (0 or 1). + * Set op == '+' for addition, '-' for subtraction. + */ +uint multibyteAddSub(char op)(uint[] dest, const uint [] src1, const uint [] + src2, uint carry) pure +{ + // Timing: + // Pentium M: 2.25/int + // P6 family, Core2 have a partial flags stall when reading the carry flag in + // an ADC, SBB operation after an operation such as INC or DEC which + // modifies some, but not all, flags. We avoid this by storing carry into + // a resister (AL), and restoring it after the branch. + + enum { LASTPARAM = 4*4 } // 3* pushes + return address. + asm pure nothrow { + naked; + push EDI; + push EBX; + push ESI; + mov ECX, [ESP + LASTPARAM + 4*4]; // dest.length; + mov EDX, [ESP + LASTPARAM + 3*4]; // src1.ptr + mov ESI, [ESP + LASTPARAM + 1*4]; // src2.ptr + mov EDI, [ESP + LASTPARAM + 5*4]; // dest.ptr + // Carry is in EAX + // Count UP to zero (from -len) to minimize loop overhead. + lea EDX, [EDX + 4*ECX]; // EDX = end of src1. + lea ESI, [ESI + 4*ECX]; // EBP = end of src2. + lea EDI, [EDI + 4*ECX]; // EDI = end of dest. + + neg ECX; + add ECX, 8; + jb L2; // if length < 8 , bypass the unrolled loop. +L_unrolled: + shr AL, 1; // get carry from EAX + } + mixin(" asm pure nothrow {" + ~ indexedLoopUnroll( 8, + "mov EAX, [@*4-8*4+EDX+ECX*4];" + ~ ( op == '+' ? "adc" : "sbb" ) ~ " EAX, [@*4-8*4+ESI+ECX*4];" + ~ "mov [@*4-8*4+EDI+ECX*4], EAX;") + ~ "}"); + asm pure nothrow { + setc AL; // save carry + add ECX, 8; + ja L_unrolled; +L2: // Do the residual 1 .. 7 ints. + + sub ECX, 8; + jz done; +L_residual: + shr AL, 1; // get carry from EAX + } + mixin(" asm pure nothrow {" + ~ indexedLoopUnroll( 1, + "mov EAX, [@*4+EDX+ECX*4];" + ~ ( op == '+' ? "adc" : "sbb" ) ~ " EAX, [@*4+ESI+ECX*4];" + ~ "mov [@*4+EDI+ECX*4], EAX;") ~ "}"); + asm pure nothrow { + setc AL; // save carry + add ECX, 1; + jnz L_residual; +done: + and EAX, 1; // make it O or 1. + pop ESI; + pop EBX; + pop EDI; + ret 6*4; + } +} + +@system unittest +{ + uint [] a = new uint[40]; + uint [] b = new uint[40]; + uint [] c = new uint[40]; + for (int i=0; i<a.length; ++i) + { + if (i&1) a[i]=0x8000_0000 + i; + else a[i]=i; + b[i]= 0x8000_0003; + } + c[19]=0x3333_3333; + uint carry = multibyteAddSub!('+')(c[0 .. 18], a[0 .. 18], b[0 .. 18], 0); + assert(carry == 1); + assert(c[0]==0x8000_0003); + assert(c[1]==4); + assert(c[19]==0x3333_3333); // check for overrun + for (int i=0; i<a.length; ++i) + { + a[i]=b[i]=c[i]=0; + } + a[8]=0x048D159E; + b[8]=0x048D159E; + a[10]=0x1D950C84; + b[10]=0x1D950C84; + a[5] =0x44444444; + carry = multibyteAddSub!('-')(a[0 .. 12], a[0 .. 12], b[0 .. 12], 0); + assert(a[11]==0); + for (int i=0; i<10; ++i) if (i != 5) assert(a[i]==0); + + for (int q=3; q<36;++q) + { + for (int i=0; i<a.length; ++i) + { + a[i]=b[i]=c[i]=0; + } + a[q-2]=0x040000; + b[q-2]=0x040000; + carry = multibyteAddSub!('-')(a[0 .. q], a[0 .. q], b[0 .. q], 0); + assert(a[q-2]==0); + } +} + +/** dest[#] += carry, or dest[#] -= carry. + * op must be '+' or '-' + * Returns final carry or borrow (0 or 1) + */ +uint multibyteIncrementAssign(char op)(uint[] dest, uint carry) pure +{ + enum { LASTPARAM = 1*4 } // 0* pushes + return address. + asm pure nothrow { + naked; + mov ECX, [ESP + LASTPARAM + 0*4]; // dest.length; + mov EDX, [ESP + LASTPARAM + 1*4]; // dest.ptr + // EAX = carry +L1: ; + } + static if (op=='+') + asm pure nothrow { add [EDX], EAX; } + else + asm pure nothrow { sub [EDX], EAX; } + asm pure nothrow { + mov EAX, 1; + jnc L2; + add EDX, 4; + dec ECX; + jnz L1; + mov EAX, 2; +L2: dec EAX; + ret 2*4; + } +} + +/** dest[#] = src[#] << numbits + * numbits must be in the range 1 .. 31 + * Returns the overflow + */ +uint multibyteShlNoMMX(uint [] dest, const uint [] src, uint numbits) pure +{ + // Timing: Optimal for P6 family. + // 2.0 cycles/int on PPro .. PM (limited by execution port p0) + // 5.0 cycles/int on Athlon, which has 7 cycles for SHLD!! + enum { LASTPARAM = 4*4 } // 3* pushes + return address. + asm pure nothrow { + naked; + push ESI; + push EDI; + push EBX; + mov EDI, [ESP + LASTPARAM + 4*3]; //dest.ptr; + mov EBX, [ESP + LASTPARAM + 4*2]; //dest.length; + mov ESI, [ESP + LASTPARAM + 4*1]; //src.ptr; + mov ECX, EAX; // numbits; + + mov EAX, [-4+ESI + 4*EBX]; + mov EDX, 0; + shld EDX, EAX, CL; + push EDX; // Save return value + cmp EBX, 1; + jz L_last; + mov EDX, [-4+ESI + 4*EBX]; + test EBX, 1; + jz L_odd; + sub EBX, 1; +L_even: + mov EDX, [-4+ ESI + 4*EBX]; + shld EAX, EDX, CL; + mov [EDI+4*EBX], EAX; +L_odd: + mov EAX, [-8+ESI + 4*EBX]; + shld EDX, EAX, CL; + mov [-4+EDI + 4*EBX], EDX; + sub EBX, 2; + jg L_even; +L_last: + shl EAX, CL; + mov [EDI], EAX; + pop EAX; // pop return value + pop EBX; + pop EDI; + pop ESI; + ret 4*4; + } +} + +/** dest[#] = src[#] >> numbits + * numbits must be in the range 1 .. 31 + * This version uses MMX. + */ +uint multibyteShl(uint [] dest, const uint [] src, uint numbits) pure +{ + // Timing: + // K7 1.2/int. PM 1.7/int P4 5.3/int + enum { LASTPARAM = 4*4 } // 3* pushes + return address. + asm pure nothrow { + naked; + push ESI; + push EDI; + push EBX; + mov EDI, [ESP + LASTPARAM + 4*3]; //dest.ptr; + mov EBX, [ESP + LASTPARAM + 4*2]; //dest.length; + mov ESI, [ESP + LASTPARAM + 4*1]; //src.ptr; + + movd MM3, EAX; // numbits = bits to shift left + xor EAX, 63; + align 16; + inc EAX; + movd MM4, EAX ; // 64-numbits = bits to shift right + + // Get the return value into EAX + and EAX, 31; // EAX = 32-numbits + movd MM2, EAX; // 32-numbits + movd MM1, [ESI+4*EBX-4]; + psrlq MM1, MM2; + movd EAX, MM1; // EAX = return value + test EBX, 1; + jz L_even; +L_odd: + cmp EBX, 1; + jz L_length1; + + // deal with odd lengths + movq MM1, [ESI+4*EBX-8]; + psrlq MM1, MM2; + movd [EDI +4*EBX-4], MM1; + sub EBX, 1; +L_even: // It's either singly or doubly even + movq MM2, [ESI + 4*EBX - 8]; + psllq MM2, MM3; + sub EBX, 2; + jle L_last; + movq MM1, MM2; + add EBX, 2; + test EBX, 2; + jz L_onceeven; + sub EBX, 2; + + // MAIN LOOP -- 128 bytes per iteration + L_twiceeven: // here MM2 is the carry + movq MM0, [ESI + 4*EBX-8]; + psrlq MM0, MM4; + movq MM1, [ESI + 4*EBX-8]; + psllq MM1, MM3; + por MM2, MM0; + movq [EDI +4*EBX], MM2; +L_onceeven: // here MM1 is the carry + movq MM0, [ESI + 4*EBX-16]; + psrlq MM0, MM4; + movq MM2, [ESI + 4*EBX-16]; + por MM1, MM0; + movq [EDI +4*EBX-8], MM1; + psllq MM2, MM3; + sub EBX, 4; + jg L_twiceeven; +L_last: + movq [EDI +4*EBX], MM2; +L_alldone: + emms; // NOTE: costs 6 cycles on Intel CPUs + pop EBX; + pop EDI; + pop ESI; + ret 4*4; + +L_length1: + // length 1 is a special case + movd MM1, [ESI]; + psllq MM1, MM3; + movd [EDI], MM1; + jmp L_alldone; + } +} + +void multibyteShr(uint [] dest, const uint [] src, uint numbits) pure +{ + enum { LASTPARAM = 4*4 } // 3* pushes + return address. + asm pure nothrow { + naked; + push ESI; + push EDI; + push EBX; + mov EDI, [ESP + LASTPARAM + 4*3]; //dest.ptr; + mov EBX, [ESP + LASTPARAM + 4*2]; //dest.length; +align 16; + mov ESI, [ESP + LASTPARAM + 4*1]; //src.ptr; + lea EDI, [EDI + 4*EBX]; // EDI = end of dest + lea ESI, [ESI + 4*EBX]; // ESI = end of src + neg EBX; // count UP to zero. + + movd MM3, EAX; // numbits = bits to shift right + xor EAX, 63; + inc EAX; + movd MM4, EAX ; // 64-numbits = bits to shift left + + test EBX, 1; + jz L_even; +L_odd: + // deal with odd lengths + and EAX, 31; // EAX = 32-numbits + movd MM2, EAX; // 32-numbits + cmp EBX, -1; + jz L_length1; + + movq MM0, [ESI+4*EBX]; + psrlq MM0, MM3; + movd [EDI +4*EBX], MM0; + add EBX, 1; +L_even: + movq MM2, [ESI + 4*EBX]; + psrlq MM2, MM3; + + movq MM1, MM2; + add EBX, 4; + cmp EBX, -2+4; + jz L_last; + // It's either singly or doubly even + sub EBX, 2; + test EBX, 2; + jnz L_onceeven; + add EBX, 2; + + // MAIN LOOP -- 128 bytes per iteration + L_twiceeven: // here MM2 is the carry + movq MM0, [ESI + 4*EBX-8]; + psllq MM0, MM4; + movq MM1, [ESI + 4*EBX-8]; + psrlq MM1, MM3; + por MM2, MM0; + movq [EDI +4*EBX-16], MM2; +L_onceeven: // here MM1 is the carry + movq MM0, [ESI + 4*EBX]; + psllq MM0, MM4; + movq MM2, [ESI + 4*EBX]; + por MM1, MM0; + movq [EDI +4*EBX-8], MM1; + psrlq MM2, MM3; + add EBX, 4; + jl L_twiceeven; +L_last: + movq [EDI +4*EBX-16], MM2; +L_alldone: + emms; // NOTE: costs 6 cycles on Intel CPUs + pop EBX; + pop EDI; + pop ESI; + ret 4*4; + +L_length1: + // length 1 is a special case + movd MM1, [ESI+4*EBX]; + psrlq MM1, MM3; + movd [EDI +4*EBX], MM1; + jmp L_alldone; + + } +} + +/** dest[#] = src[#] >> numbits + * numbits must be in the range 1 .. 31 + */ +void multibyteShrNoMMX(uint [] dest, const uint [] src, uint numbits) pure +{ + // Timing: Optimal for P6 family. + // 2.0 cycles/int on PPro .. PM (limited by execution port p0) + // Terrible performance on AMD64, which has 7 cycles for SHRD!! + enum { LASTPARAM = 4*4 } // 3* pushes + return address. + asm pure nothrow { + naked; + push ESI; + push EDI; + push EBX; + mov EDI, [ESP + LASTPARAM + 4*3]; //dest.ptr; + mov EBX, [ESP + LASTPARAM + 4*2]; //dest.length; + mov ESI, [ESP + LASTPARAM + 4*1]; //src.ptr; + mov ECX, EAX; // numbits; + + lea EDI, [EDI + 4*EBX]; // EDI = end of dest + lea ESI, [ESI + 4*EBX]; // ESI = end of src + neg EBX; // count UP to zero. + mov EAX, [ESI + 4*EBX]; + cmp EBX, -1; + jz L_last; + mov EDX, [ESI + 4*EBX]; + test EBX, 1; + jz L_odd; + add EBX, 1; +L_even: + mov EDX, [ ESI + 4*EBX]; + shrd EAX, EDX, CL; + mov [-4 + EDI+4*EBX], EAX; +L_odd: + mov EAX, [4 + ESI + 4*EBX]; + shrd EDX, EAX, CL; + mov [EDI + 4*EBX], EDX; + add EBX, 2; + jl L_even; +L_last: + shr EAX, CL; + mov [-4 + EDI], EAX; + + pop EBX; + pop EDI; + pop ESI; + ret 4*4; + } +} + +@system unittest +{ + + uint [] aa = [0x1222_2223, 0x4555_5556, 0x8999_999A, 0xBCCC_CCCD, 0xEEEE_EEEE]; + multibyteShr(aa[0..$-1], aa, 4); + assert(aa[0] == 0x6122_2222 && aa[1]==0xA455_5555 + && aa[2]==0xD899_9999 && aa[3]==0x0BCC_CCCC); + + aa = [0x1222_2223, 0x4555_5556, 0x8999_999A, 0xBCCC_CCCD, 0xEEEE_EEEE]; + multibyteShr(aa[2..$-1], aa[2..$-1], 4); + assert(aa[0] == 0x1222_2223 && aa[1]==0x4555_5556 + && aa[2]==0xD899_9999 && aa[3]==0x0BCC_CCCC); + + aa = [0x1222_2223, 0x4555_5556, 0x8999_999A, 0xBCCC_CCCD, 0xEEEE_EEEE]; + multibyteShr(aa[0..$-2], aa, 4); + assert(aa[1]==0xA455_5555 && aa[2]==0x0899_9999); + assert(aa[0]==0x6122_2222); + assert(aa[3]==0xBCCC_CCCD); + + + aa = [0xF0FF_FFFF, 0x1222_2223, 0x4555_5556, 0x8999_999A, 0xBCCC_CCCD, 0xEEEE_EEEE]; + uint r = multibyteShl(aa[2 .. 4], aa[2 .. 4], 4); + assert(aa[0] == 0xF0FF_FFFF && aa[1]==0x1222_2223 + && aa[2]==0x5555_5560 && aa[3]==0x9999_99A4 && aa[4]==0xBCCC_CCCD); + assert(r == 8); + + aa = [0xF0FF_FFFF, 0x1222_2223, 0x4555_5556, 0x8999_999A, 0xBCCC_CCCD, 0xEEEE_EEEE]; + r = multibyteShl(aa[1 .. 4], aa[1 .. 4], 4); + assert(aa[0] == 0xF0FF_FFFF + && aa[2]==0x5555_5561); + assert(aa[3]==0x9999_99A4 && aa[4]==0xBCCC_CCCD); + assert(r == 8); + assert(aa[1]==0x2222_2230); + + aa = [0xF0FF_FFFF, 0x1222_2223, 0x4555_5556, 0x8999_999A, 0xBCCC_CCCD, 0xEEEE_EEEE]; + r = multibyteShl(aa[0 .. 4], aa[1 .. 5], 31); +} + +/** dest[#] = src[#] * multiplier + carry. + * Returns carry. + */ +uint multibyteMul(uint[] dest, const uint[] src, uint multiplier, uint carry) + pure +{ + // Timing: definitely not optimal. + // Pentium M: 5.0 cycles/operation, has 3 resource stalls/iteration + // Fastest implementation found was 4.6 cycles/op, but not worth the complexity. + + enum { LASTPARAM = 4*4 } // 4* pushes + return address. + // We'll use p2 (load unit) instead of the overworked p0 or p1 (ALU units) + // when initializing variables to zero. + version (D_PIC) + { + enum { zero = 0 } + } + else + { + __gshared int zero = 0; + } + asm pure nothrow { + naked; + push ESI; + push EDI; + push EBX; + + mov EDI, [ESP + LASTPARAM + 4*4]; // dest.ptr + mov EBX, [ESP + LASTPARAM + 4*3]; // dest.length + mov ESI, [ESP + LASTPARAM + 4*2]; // src.ptr + align 16; + lea EDI, [EDI + 4*EBX]; // EDI = end of dest + lea ESI, [ESI + 4*EBX]; // ESI = end of src + mov ECX, EAX; // [carry]; -- last param is in EAX. + neg EBX; // count UP to zero. + test EBX, 1; + jnz L_odd; + add EBX, 1; + L1: + mov EAX, [-4 + ESI + 4*EBX]; + mul int ptr [ESP+LASTPARAM]; //[multiplier]; + add EAX, ECX; + mov ECX, zero; + mov [-4+EDI + 4*EBX], EAX; + adc ECX, EDX; +L_odd: + mov EAX, [ESI + 4*EBX]; // p2 + mul int ptr [ESP+LASTPARAM]; //[multiplier]; // p0*3, + add EAX, ECX; + mov ECX, zero; + adc ECX, EDX; + mov [EDI + 4*EBX], EAX; + add EBX, 2; + jl L1; + + mov EAX, ECX; // get final carry + + pop EBX; + pop EDI; + pop ESI; + ret 5*4; + } +} + +@system unittest +{ + uint [] aa = [0xF0FF_FFFF, 0x1222_2223, 0x4555_5556, 0x8999_999A, 0xBCCC_CCCD, 0xEEEE_EEEE]; + multibyteMul(aa[1 .. 4], aa[1 .. 4], 16, 0); + assert(aa[0] == 0xF0FF_FFFF && aa[1] == 0x2222_2230 && + aa[2]==0x5555_5561 && aa[3]==0x9999_99A4 && aa[4]==0x0BCCC_CCCD); +} + +// The inner multiply-and-add loop, together with the Even entry point. +// Multiples by M_ADDRESS which should be "ESP+LASTPARAM" or "ESP". OP must be "add" or "sub" +// This is the most time-critical code in the BigInt library. +// It is used by both MulAdd, multiplyAccumulate, and triangleAccumulate +string asmMulAdd_innerloop(string OP, string M_ADDRESS) pure { + // The bottlenecks in this code are extremely complicated. The MUL, ADD, and ADC + // need 4 cycles on each of the ALUs units p0 and p1. So we use memory load + // (unit p2) for initializing registers to zero. + // There are also dependencies between the instructions, and we run up against the + // ROB-read limit (can only read 2 registers per cycle). + // We also need the number of uops in the loop to be a multiple of 3. + // The only available execution unit for this is p3 (memory write). Unfortunately we can't do that + // if Position-Independent Code is required. + + // Register usage + // ESI = end of src + // EDI = end of dest + // EBX = index. Counts up to zero (in steps of 2). + // EDX:EAX = scratch, used in multiply. + // ECX = carry1. + // EBP = carry2. + // ESP = points to the multiplier. + + // The first member of 'dest' which will be modified is [EDI+4*EBX]. + // EAX must already contain the first member of 'src', [ESI+4*EBX]. + + version (D_PIC) { bool using_PIC = true; } else { bool using_PIC = false; } + return " + // Entry point for even length + add EBX, 1; + mov EBP, ECX; // carry + + mul int ptr [" ~ M_ADDRESS ~ "]; // M + mov ECX, 0; + + add EBP, EAX; + mov EAX, [ESI+4*EBX]; + adc ECX, EDX; + + mul int ptr [" ~ M_ADDRESS ~ "]; // M + " ~ OP ~ " [-4+EDI+4*EBX], EBP; + mov EBP, zero; + + adc ECX, EAX; + mov EAX, [4+ESI+4*EBX]; + + adc EBP, EDX; + add EBX, 2; + jnl L_done; +L1: + mul int ptr [" ~ M_ADDRESS ~ "]; + " ~ OP ~ " [-8+EDI+4*EBX], ECX; + adc EBP, EAX; + mov ECX, zero; + mov EAX, [ESI+4*EBX]; + adc ECX, EDX; +" ~ + (using_PIC ? "" : " mov storagenop, EDX; ") // make #uops in loop a multiple of 3, can't do this in PIC mode. +~ " + mul int ptr [" ~ M_ADDRESS ~ "]; + " ~ OP ~ " [-4+EDI+4*EBX], EBP; + mov EBP, zero; + + adc ECX, EAX; + mov EAX, [4+ESI+4*EBX]; + + adc EBP, EDX; + add EBX, 2; + jl L1; +L_done: " ~ OP ~ " [-8+EDI+4*EBX], ECX; + adc EBP, 0; +"; + // final carry is now in EBP +} + +string asmMulAdd_enter_odd(string OP, string M_ADDRESS) pure +{ + return " + mul int ptr [" ~M_ADDRESS ~"]; + mov EBP, zero; + add ECX, EAX; + mov EAX, [4+ESI+4*EBX]; + + adc EBP, EDX; + add EBX, 2; + jl L1; + jmp L_done; +"; +} + + + +/** + * dest[#] += src[#] * multiplier OP carry(0 .. FFFF_FFFF). + * where op == '+' or '-' + * Returns carry out of MSB (0 .. FFFF_FFFF). + */ +uint multibyteMulAdd(char op)(uint [] dest, const uint [] src, uint + multiplier, uint carry) pure { + // Timing: This is the most time-critical bignum function. + // Pentium M: 5.4 cycles/operation, still has 2 resource stalls + 1load block/iteration + + // The main loop is pipelined and unrolled by 2, + // so entry to the loop is also complicated. + + // Register usage + // EDX:EAX = multiply + // EBX = counter + // ECX = carry1 + // EBP = carry2 + // EDI = dest + // ESI = src + + enum string OP = (op=='+')? "add" : "sub"; + version (D_PIC) + { + enum { zero = 0 } + } + else + { + // use p2 (load unit) instead of the overworked p0 or p1 (ALU units) + // when initializing registers to zero. + __gshared int zero = 0; + // use p3/p4 units + __gshared int storagenop; // write-only + } + + enum { LASTPARAM = 5*4 } // 4* pushes + return address. + asm pure nothrow { + naked; + + push ESI; + push EDI; + push EBX; + push EBP; + mov EDI, [ESP + LASTPARAM + 4*4]; // dest.ptr + mov EBX, [ESP + LASTPARAM + 4*3]; // dest.length + align 16; + nop; + mov ESI, [ESP + LASTPARAM + 4*2]; // src.ptr + lea EDI, [EDI + 4*EBX]; // EDI = end of dest + lea ESI, [ESI + 4*EBX]; // ESI = end of src + mov EBP, 0; + mov ECX, EAX; // ECX = input carry. + neg EBX; // count UP to zero. + mov EAX, [ESI+4*EBX]; + test EBX, 1; + jnz L_enter_odd; + } + // Main loop, with entry point for even length + mixin("asm pure nothrow {" ~ asmMulAdd_innerloop(OP, "ESP+LASTPARAM") ~ "}"); + asm pure nothrow { + mov EAX, EBP; // get final carry + pop EBP; + pop EBX; + pop EDI; + pop ESI; + ret 5*4; + } +L_enter_odd: + mixin("asm pure nothrow {" ~ asmMulAdd_enter_odd(OP, "ESP+LASTPARAM") ~ "}"); +} + +@system unittest +{ + + uint [] aa = [0xF0FF_FFFF, 0x1222_2223, 0x4555_5556, 0x8999_999A, 0xBCCC_CCCD, 0xEEEE_EEEE]; + uint [] bb = [0x1234_1234, 0xF0F0_F0F0, 0x00C0_C0C0, 0xF0F0_F0F0, 0xC0C0_C0C0]; + multibyteMulAdd!('+')(bb[1..$-1], aa[1..$-2], 16, 5); + assert(bb[0] == 0x1234_1234 && bb[4] == 0xC0C0_C0C0); + assert(bb[1] == 0x2222_2230 + 0xF0F0_F0F0+5 && bb[2] == 0x5555_5561+0x00C0_C0C0+1 + && bb[3] == 0x9999_99A4+0xF0F0_F0F0 ); +} + +/** + Sets result[#] = result[0 .. left.length] + left[#] * right[#] + + It is defined in this way to allow cache-efficient multiplication. + This function is equivalent to: + ---- + for (int i = 0; i< right.length; ++i) + { + dest[left.length + i] = multibyteMulAdd(dest[i .. left.length+i], + left, right[i], 0); + } + ---- + */ +void multibyteMultiplyAccumulate(uint [] dest, const uint[] left, + const uint [] right) pure { + // Register usage + // EDX:EAX = used in multiply + // EBX = index + // ECX = carry1 + // EBP = carry2 + // EDI = end of dest for this pass through the loop. Index for outer loop. + // ESI = end of left. never changes + // [ESP] = M = right[i] = multiplier for this pass through the loop. + // right.length is changed into dest.ptr+dest.length + version (D_PIC) + { + enum { zero = 0 } + } + else + { + // use p2 (load unit) instead of the overworked p0 or p1 (ALU units) + // when initializing registers to zero. + __gshared int zero = 0; + // use p3/p4 units + __gshared int storagenop; // write-only + } + + enum { LASTPARAM = 6*4 } // 4* pushes + local + return address. + asm pure nothrow { + naked; + + push ESI; + push EDI; + align 16; + push EBX; + push EBP; + push EAX; // local variable M + mov EDI, [ESP + LASTPARAM + 4*5]; // dest.ptr + mov EBX, [ESP + LASTPARAM + 4*2]; // left.length + mov ESI, [ESP + LASTPARAM + 4*3]; // left.ptr + lea EDI, [EDI + 4*EBX]; // EDI = end of dest for first pass + + mov EAX, [ESP + LASTPARAM + 4*0]; // right.length + lea EAX, [EDI + 4*EAX]; + mov [ESP + LASTPARAM + 4*0], EAX; // last value for EDI + + lea ESI, [ESI + 4*EBX]; // ESI = end of left + mov EAX, [ESP + LASTPARAM + 4*1]; // right.ptr + mov EAX, [EAX]; + mov [ESP], EAX; // M +outer_loop: + mov EBP, 0; + mov ECX, 0; // ECX = input carry. + neg EBX; // count UP to zero. + mov EAX, [ESI+4*EBX]; + test EBX, 1; + jnz L_enter_odd; + } + // -- Inner loop, with even entry point + mixin("asm pure nothrow { " ~ asmMulAdd_innerloop("add", "ESP") ~ "}"); + asm pure nothrow { + mov [-4+EDI+4*EBX], EBP; + add EDI, 4; + cmp EDI, [ESP + LASTPARAM + 4*0]; // is EDI = &dest[$]? + jz outer_done; + mov EAX, [ESP + LASTPARAM + 4*1]; // right.ptr + mov EAX, [EAX+4]; // get new M + mov [ESP], EAX; // save new M + add int ptr [ESP + LASTPARAM + 4*1], 4; // right.ptr + mov EBX, [ESP + LASTPARAM + 4*2]; // left.length + jmp outer_loop; +outer_done: + pop EAX; + pop EBP; + pop EBX; + pop EDI; + pop ESI; + ret 6*4; + } +L_enter_odd: + mixin("asm pure nothrow {" ~ asmMulAdd_enter_odd("add", "ESP") ~ "}"); +} + +/** dest[#] /= divisor. + * overflow is the initial remainder, and must be in the range 0 .. divisor-1. + * divisor must not be a power of 2 (use right shift for that case; + * A division by zero will occur if divisor is a power of 2). + * Returns the final remainder + * + * Based on public domain code by Eric Bainville. + * (http://www.bealto.com/) Used with permission. + */ +uint multibyteDivAssign(uint [] dest, uint divisor, uint overflow) pure +{ + // Timing: limited by a horrible dependency chain. + // Pentium M: 18 cycles/op, 8 resource stalls/op. + // EAX, EDX = scratch, used by MUL + // EDI = dest + // CL = shift + // ESI = quotient + // EBX = remainderhi + // EBP = remainderlo + // [ESP-4] = mask + // [ESP] = kinv (2^64 /divisor) + enum { LASTPARAM = 5*4 } // 4* pushes + return address. + enum { LOCALS = 2*4} // MASK, KINV + asm pure nothrow { + naked; + + push ESI; + push EDI; + push EBX; + push EBP; + + mov EDI, [ESP + LASTPARAM + 4*2]; // dest.ptr + mov EBX, [ESP + LASTPARAM + 4*1]; // dest.length + + // Loop from msb to lsb + lea EDI, [EDI + 4*EBX]; + mov EBP, EAX; // rem is the input remainder, in 0 .. divisor-1 + // Build the pseudo-inverse of divisor k: 2^64/k + // First determine the shift in ecx to get the max number of bits in kinv + xor ECX, ECX; + mov EAX, [ESP + LASTPARAM]; //divisor; + mov EDX, 1; +kinv1: + inc ECX; + ror EDX, 1; + shl EAX, 1; + jnc kinv1; + dec ECX; + // Here, ecx is a left shift moving the msb of k to bit 32 + + mov EAX, 1; + shl EAX, CL; + dec EAX; + ror EAX, CL ; //ecx bits at msb + push EAX; // MASK + + // Then divide 2^(32+cx) by divisor (edx already ok) + xor EAX, EAX; + div int ptr [ESP + LASTPARAM + LOCALS-4*1]; //divisor; + push EAX; // kinv + align 16; +L2: + // Get 32 bits of quotient approx, multiplying + // most significant word of (rem*2^32+input) + mov EAX, [ESP+4]; //MASK; + and EAX, [EDI - 4]; + or EAX, EBP; + rol EAX, CL; + mov EBX, EBP; + mov EBP, [EDI - 4]; + mul int ptr [ESP]; //KINV; + + shl EAX, 1; + rcl EDX, 1; + + // Multiply by k and subtract to get remainder + // Subtraction must be done on two words + mov EAX, EDX; + mov ESI, EDX; // quot = high word + mul int ptr [ESP + LASTPARAM+LOCALS]; //divisor; + sub EBP, EAX; + sbb EBX, EDX; + jz Lb; // high word is 0, goto adjust on single word + + // Adjust quotient and remainder on two words +Ld: inc ESI; + sub EBP, [ESP + LASTPARAM+LOCALS]; //divisor; + sbb EBX, 0; + jnz Ld; + + // Adjust quotient and remainder on single word +Lb: cmp EBP, [ESP + LASTPARAM+LOCALS]; //divisor; + jc Lc; // rem in 0 .. divisor-1, OK + sub EBP, [ESP + LASTPARAM+LOCALS]; //divisor; + inc ESI; + jmp Lb; + + // Store result +Lc: + mov [EDI - 4], ESI; + lea EDI, [EDI - 4]; + dec int ptr [ESP + LASTPARAM + 4*1+LOCALS]; // len + jnz L2; + + pop EAX; // discard kinv + pop EAX; // discard mask + + mov EAX, EBP; // return final remainder + pop EBP; + pop EBX; + pop EDI; + pop ESI; + ret 3*4; + } +} + +@system unittest +{ + uint [] aa = new uint[101]; + for (int i=0; i<aa.length; ++i) aa[i] = 0x8765_4321 * (i+3); + uint overflow = multibyteMul(aa, aa, 0x8EFD_FCFB, 0x33FF_7461); + uint r = multibyteDivAssign(aa, 0x8EFD_FCFB, overflow); + for (int i=0; i<aa.length-1; ++i) assert(aa[i] == 0x8765_4321 * (i+3)); + assert(r == 0x33FF_7461); +} + +// Set dest[2*i .. 2*i+1]+=src[i]*src[i] +void multibyteAddDiagonalSquares(uint [] dest, const uint [] src) pure +{ + /* Unlike mulAdd, the carry is only 1 bit, + since FFFF*FFFF+FFFF_FFFF = 1_0000_0000. + Note also that on the last iteration, no carry can occur. + As for multibyteAdd, we save & restore carry flag through the loop. + + The timing is entirely dictated by the dependency chain. We could + improve it by moving the mov EAX after the adc [EDI], EAX. Probably not worthwhile. + */ + enum { LASTPARAM = 4*5 } // 4* pushes + return address. + asm pure nothrow { + naked; + push ESI; + push EDI; + push EBX; + push ECX; + mov EDI, [ESP + LASTPARAM + 4*3]; //dest.ptr; + mov EBX, [ESP + LASTPARAM + 4*0]; //src.length; + mov ESI, [ESP + LASTPARAM + 4*1]; //src.ptr; + lea EDI, [EDI + 8*EBX]; // EDI = end of dest + lea ESI, [ESI + 4*EBX]; // ESI = end of src + neg EBX; // count UP to zero. + xor ECX, ECX; // initial carry = 0. +L1: + mov EAX, [ESI + 4*EBX]; + mul EAX, EAX; + shr CL, 1; // get carry + adc [EDI + 8*EBX], EAX; + adc [EDI + 8*EBX + 4], EDX; + setc CL; // save carry + inc EBX; + jnz L1; + + pop ECX; + pop EBX; + pop EDI; + pop ESI; + ret 4*4; + } +} + +@system unittest +{ + uint [] aa = new uint[13]; + uint [] bb = new uint[6]; + for (int i=0; i<aa.length; ++i) aa[i] = 0x8000_0000; + for (int i=0; i<bb.length; ++i) bb[i] = i; + aa[$-1]= 7; + multibyteAddDiagonalSquares(aa[0..$-1], bb); + assert(aa[$-1]==7); + for (int i=0; i<bb.length; ++i) { assert(aa[2*i]==0x8000_0000+i*i); assert(aa[2*i+1]==0x8000_0000); } +} + +void multibyteTriangleAccumulateD(uint[] dest, uint[] x) pure +{ + for (int i = 0; i < x.length-3; ++i) + { + dest[i+x.length] = multibyteMulAdd!('+')( + dest[i+i+1 .. i+x.length], x[i+1..$], x[i], 0); + } + ulong c = cast(ulong)(x[$-3]) * x[$-2] + dest[$-5]; + dest[$-5] = cast(uint) c; + c >>= 32; + c += cast(ulong)(x[$-3]) * x[$-1] + dest[$-4]; + dest[$-4] = cast(uint) c; + c >>= 32; +length2: + c += cast(ulong)(x[$-2]) * x[$-1]; + dest[$-3] = cast(uint) c; + c >>= 32; + dest[$-2] = cast(uint) c; +} + +//dest += src[0]*src[1...$] + src[1]*src[2..$] + ... + src[$-3]*src[$-2..$]+ src[$-2]*src[$-1] +// assert(dest.length = src.length*2); +// assert(src.length >= 3); +void multibyteTriangleAccumulateAsm(uint[] dest, const uint[] src) pure +{ + // Register usage + // EDX:EAX = used in multiply + // EBX = index + // ECX = carry1 + // EBP = carry2 + // EDI = end of dest for this pass through the loop. Index for outer loop. + // ESI = end of src. never changes + // [ESP] = M = src[i] = multiplier for this pass through the loop. + // dest.length is changed into dest.ptr+dest.length + version (D_PIC) + { + enum { zero = 0 } + } + else + { + // use p2 (load unit) instead of the overworked p0 or p1 (ALU units) + // when initializing registers to zero. + __gshared int zero = 0; + // use p3/p4 units + __gshared int storagenop; // write-only + } + + enum { LASTPARAM = 6*4 } // 4* pushes + local + return address. + asm pure nothrow { + naked; + + push ESI; + push EDI; + align 16; + push EBX; + push EBP; + push EAX; // local variable M= src[i] + mov EDI, [ESP + LASTPARAM + 4*3]; // dest.ptr + mov EBX, [ESP + LASTPARAM + 4*0]; // src.length + mov ESI, [ESP + LASTPARAM + 4*1]; // src.ptr + + lea ESI, [ESI + 4*EBX]; // ESI = end of left + add int ptr [ESP + LASTPARAM + 4*1], 4; // src.ptr, used for getting M + + // local variable [ESP + LASTPARAM + 4*2] = last value for EDI + lea EDI, [EDI + 4*EBX]; // EDI = end of dest for first pass + + lea EAX, [EDI + 4*EBX-3*4]; // up to src.length - 3 + mov [ESP + LASTPARAM + 4*2], EAX; // last value for EDI = &dest[src.length*2 -3] + + cmp EBX, 3; + jz length_is_3; + + // We start at src[1], not src[0]. + dec EBX; + mov [ESP + LASTPARAM + 4*0], EBX; + +outer_loop: + mov EBX, [ESP + LASTPARAM + 4*0]; // src.length + mov EBP, 0; + mov ECX, 0; // ECX = input carry. + dec [ESP + LASTPARAM + 4*0]; // Next time, the length will be shorter by 1. + neg EBX; // count UP to zero. + + mov EAX, [ESI + 4*EBX - 4*1]; // get new M + mov [ESP], EAX; // save new M + + mov EAX, [ESI+4*EBX]; + test EBX, 1; + jnz L_enter_odd; + } + // -- Inner loop, with even entry point + mixin("asm pure nothrow { " ~ asmMulAdd_innerloop("add", "ESP") ~ "}"); + asm pure nothrow { + mov [-4+EDI+4*EBX], EBP; + add EDI, 4; + cmp EDI, [ESP + LASTPARAM + 4*2]; // is EDI = &dest[$-3]? + jnz outer_loop; +length_is_3: + mov EAX, [ESI - 4*3]; + mul EAX, [ESI - 4*2]; + mov ECX, 0; + add [EDI-2*4], EAX; // ECX:dest[$-5] += x[$-3] * x[$-2] + adc ECX, EDX; + + mov EAX, [ESI - 4*3]; + mul EAX, [ESI - 4*1]; // x[$-3] * x[$-1] + add EAX, ECX; + mov ECX, 0; + adc EDX, 0; + // now EDX: EAX = c + x[$-3] * x[$-1] + add [EDI-1*4], EAX; // ECX:dest[$-4] += (EDX:EAX) + adc ECX, EDX; // ECX holds dest[$-3], it acts as carry for the last row +// do length == 2 + mov EAX, [ESI - 4*2]; + mul EAX, [ESI - 4*1]; + add ECX, EAX; + adc EDX, 0; + mov [EDI - 0*4], ECX; // dest[$-2:$-3] = c + x[$-2] * x[$-1]; + mov [EDI + 1*4], EDX; + + pop EAX; + pop EBP; + pop EBX; + pop EDI; + pop ESI; + ret 4*4; + } +L_enter_odd: + mixin("asm pure nothrow {" ~ asmMulAdd_enter_odd("add", "ESP") ~ "}"); +} + +@system unittest +{ + uint [] aa = new uint[200]; + uint [] a = aa[0 .. 100]; + uint [] b = new uint [100]; + aa[] = 761; + a[] = 0; + b[] = 0; + a[3] = 6; + b[0]=1; + b[1] = 17; + b[50 .. 100]=78; + multibyteTriangleAccumulateAsm(a, b[0 .. 50]); + uint [] c = new uint[100]; + c[] = 0; + c[1] = 17; + c[3] = 6; + assert(a[]==c[]); + assert(a[0]==0); + aa[] = 0xFFFF_FFFF; + a[] = 0; + b[] = 0; + b[0]= 0xbf6a1f01; + b[1]= 0x6e38ed64; + b[2]= 0xdaa797ed; + b[3] = 0; + + multibyteTriangleAccumulateAsm(a[0 .. 8], b[0 .. 4]); + assert(a[1]==0x3a600964); + assert(a[2]==0x339974f6); + assert(a[3]==0x46736fce); + assert(a[4]==0x5e24a2b4); + + b[3] = 0xe93ff9f4; + b[4] = 0x184f03; + a[]=0; + multibyteTriangleAccumulateAsm(a[0 .. 14], b[0 .. 7]); + assert(a[3]==0x79fff5c2); + assert(a[4]==0xcf384241); + assert(a[5]== 0x4a17fc8); + assert(a[6]==0x4d549025); +} + + +void multibyteSquare(BigDigit[] result, const BigDigit [] x) pure +{ + if (x.length < 4) + { + // Special cases, not worth doing triangular. + result[x.length] = multibyteMul(result[0 .. x.length], x, x[0], 0); + multibyteMultiplyAccumulate(result[1..$], x, x[1..$]); + return; + } + // Do half a square multiply. + // dest += src[0]*src[1...$] + src[1]*src[2..$] + ... + src[$-3]*src[$-2..$]+ src[$-2]*src[$-1] + result[x.length] = multibyteMul(result[1 .. x.length], x[1..$], x[0], 0); + multibyteTriangleAccumulateAsm(result[2..$], x[1..$]); + // Multiply by 2 + result[$-1] = multibyteShlNoMMX(result[1..$-1], result[1..$-1], 1); + // And add the diagonal elements + result[0] = 0; + multibyteAddDiagonalSquares(result, x); +} + +version (BignumPerformanceTest) +{ +import core.stdc.stdio; +int clock() { asm { push EBX; xor EAX, EAX; cpuid; pop EBX; rdtsc; } } + +__gshared uint [2200] X1; +__gshared uint [2200] Y1; +__gshared uint [4000] Z1; + +void testPerformance() pure +{ + // The performance results at the top of this file were obtained using + // a Windows device driver to access the CPU performance counters. + // The code below is less accurate but more widely usable. + // The value for division is quite inconsistent. + for (int i=0; i<X1.length; ++i) { X1[i]=i; Y1[i]=i; Z1[i]=i; } + int t, t0; + multibyteShl(Z1[0 .. 2000], X1[0 .. 2000], 7); + t0 = clock(); + multibyteShl(Z1[0 .. 1000], X1[0 .. 1000], 7); + t = clock(); + multibyteShl(Z1[0 .. 2000], X1[0 .. 2000], 7); + auto shltime = (clock() - t) - (t - t0); + t0 = clock(); + multibyteShr(Z1[2 .. 1002], X1[4 .. 1004], 13); + t = clock(); + multibyteShr(Z1[2 .. 2002], X1[4 .. 2004], 13); + auto shrtime = (clock() - t) - (t - t0); + t0 = clock(); + multibyteAddSub!('+')(Z1[0 .. 1000], X1[0 .. 1000], Y1[0 .. 1000], 0); + t = clock(); + multibyteAddSub!('+')(Z1[0 .. 2000], X1[0 .. 2000], Y1[0 .. 2000], 0); + auto addtime = (clock() - t) - (t-t0); + t0 = clock(); + multibyteMul(Z1[0 .. 1000], X1[0 .. 1000], 7, 0); + t = clock(); + multibyteMul(Z1[0 .. 2000], X1[0 .. 2000], 7, 0); + auto multime = (clock() - t) - (t - t0); + multibyteMulAdd!('+')(Z1[0 .. 2000], X1[0 .. 2000], 217, 0); + t0 = clock(); + multibyteMulAdd!('+')(Z1[0 .. 1000], X1[0 .. 1000], 217, 0); + t = clock(); + multibyteMulAdd!('+')(Z1[0 .. 2000], X1[0 .. 2000], 217, 0); + auto muladdtime = (clock() - t) - (t - t0); + multibyteMultiplyAccumulate(Z1[0 .. 64], X1[0 .. 32], Y1[0 .. 32]); + t = clock(); + multibyteMultiplyAccumulate(Z1[0 .. 64], X1[0 .. 32], Y1[0 .. 32]); + auto accumtime = clock() - t; + t0 = clock(); + multibyteDivAssign(Z1[0 .. 2000], 217, 0); + t = clock(); + multibyteDivAssign(Z1[0 .. 1000], 37, 0); + auto divtime = (t - t0) - (clock() - t); + t= clock(); + multibyteSquare(Z1[0 .. 64], X1[0 .. 32]); + auto squaretime = clock() - t; + + printf("-- BigInt asm performance (cycles/int) --\n"); + printf("Add: %.2f\n", addtime/1000.0); + printf("Shl: %.2f\n", shltime/1000.0); + printf("Shr: %.2f\n", shrtime/1000.0); + printf("Mul: %.2f\n", multime/1000.0); + printf("MulAdd: %.2f\n", muladdtime/1000.0); + printf("Div: %.2f\n", divtime/1000.0); + printf("MulAccum32: %.2f*n*n (total %d)\n", accumtime/(32.0*32.0), accumtime); + printf("Square32: %.2f*n*n (total %d)\n\n", squaretime/(32.0*32.0), squaretime); +} + +static this() +{ + testPerformance(); +} +} + +} // version (D_InlineAsm_X86) diff --git a/libphobos/src/std/internal/math/errorfunction.d b/libphobos/src/std/internal/math/errorfunction.d new file mode 100644 index 0000000..4012e64 --- /dev/null +++ b/libphobos/src/std/internal/math/errorfunction.d @@ -0,0 +1,1145 @@ +/** + * Error Functions and Normal Distribution. + * + * License: $(HTTP boost.org/LICENSE_1_0.txt, Boost License 1.0). + * Copyright: Based on the CEPHES math library, which is + * Copyright (C) 1994 Stephen L. Moshier (moshier@world.std.com). + * Authors: Stephen L. Moshier, ported to D by Don Clugston and David Nadlinger + */ +/** + * Macros: + * NAN = $(RED NAN) + * SUP = <span style="vertical-align:super;font-size:smaller">$0</span> + * GAMMA = Γ + * INTEGRAL = ∫ + * INTEGRATE = $(BIG ∫<sub>$(SMALL $1)</sub><sup>$2</sup>) + * POWER = $1<sup>$2</sup> + * BIGSUM = $(BIG Σ <sup>$2</sup><sub>$(SMALL $1)</sub>) + * CHOOSE = $(BIG () <sup>$(SMALL $1)</sup><sub>$(SMALL $2)</sub> $(BIG )) + * TABLE_SV = <table border="1" cellpadding="4" cellspacing="0"> + * <caption>Special Values</caption> + * $0</table> + * SVH = $(TR $(TH $1) $(TH $2)) + * SV = $(TR $(TD $1) $(TD $2)) + */ +module std.internal.math.errorfunction; +import std.math; + +pure: +nothrow: +@safe: +@nogc: + +private { +immutable real EXP_2 = 0.135335283236612691893999494972484403L; /* exp(-2) */ +enum real SQRT2PI = 2.50662827463100050241576528481104525L; // sqrt(2pi) + + +enum real MAXLOG = 0x1.62e42fefa39ef358p+13L; // log(real.max) +enum real MINLOG = -0x1.6436716d5406e6d8p+13L; // log(real.min*real.epsilon) = log(smallest denormal) +} + +T rationalPoly(T)(T x, const(T) [] numerator, const(T) [] denominator) pure nothrow +{ + return poly(x, numerator)/poly(x, denominator); +} + + +private { + +/* erfc(x) = exp(-x^2) P(1/x)/Q(1/x) + 1/8 <= 1/x <= 1 + Peak relative error 5.8e-21 */ +immutable real[10] P = [ -0x1.30dfa809b3cc6676p-17, 0x1.38637cd0913c0288p+18, + 0x1.2f015e047b4476bp+22, 0x1.24726f46aa9ab08p+25, 0x1.64b13c6395dc9c26p+27, + 0x1.294c93046ad55b5p+29, 0x1.5962a82f92576dap+30, 0x1.11a709299faba04ap+31, + 0x1.11028065b087be46p+31, 0x1.0d8ef40735b097ep+30 +]; + +immutable real[11] Q = [ 0x1.14d8e2a72dec49f4p+19, 0x1.0c880ff467626e1p+23, + 0x1.04417ef060b58996p+26, 0x1.404e61ba86df4ebap+28, 0x1.0f81887bc82b873ap+30, + 0x1.4552a5e39fb49322p+31, 0x1.11779a0ceb2a01cep+32, 0x1.3544dd691b5b1d5cp+32, + 0x1.a91781f12251f02ep+31, 0x1.0d8ef3da605a1c86p+30, 1.0 +]; + +// For 128 bit quadruple-precision floats, we use a higher-precision implementation +// with more polynomial segments. +enum isIEEEQuadruple = floatTraits!real.realFormat == RealFormat.ieeeQuadruple; +static if (isIEEEQuadruple) +{ + // erfc(x + 0.25) = erfc(0.25) + x R(x) + // 0 <= x < 0.125 + // Peak relative error 1.4e-35 + immutable real[9] RNr13 = [ + -2.353707097641280550282633036456457014829E3L, + 3.871159656228743599994116143079870279866E2L, + -3.888105134258266192210485617504098426679E2L, + -2.129998539120061668038806696199343094971E1L, + -8.125462263594034672468446317145384108734E1L, + 8.151549093983505810118308635926270319660E0L, + -5.033362032729207310462422357772568553670E0L, + -4.253956621135136090295893547735851168471E-2L, + -8.098602878463854789780108161581050357814E-2L + ]; + immutable real[9] RDr13 = [ + 2.220448796306693503549505450626652881752E3L, + 1.899133258779578688791041599040951431383E2L, + 1.061906712284961110196427571557149268454E3L, + 7.497086072306967965180978101974566760042E1L, + 2.146796115662672795876463568170441327274E2L, + 1.120156008362573736664338015952284925592E1L, + 2.211014952075052616409845051695042741074E1L, + 6.469655675326150785692908453094054988938E-1L, + 1.0 + ]; + + // erfc(0.25) = C13a + C13b to extra precision. + immutable real C13a = 0.723663330078125L; + immutable real C13b = 1.0279753638067014931732235184287934646022E-5L; + + // erfc(x + 0.375) = erfc(0.375) + x R(x) + // 0 <= x < 0.125 + // Peak relative error 1.2e-35 + immutable real[9] RNr14 = [ + -2.446164016404426277577283038988918202456E3L, + 6.718753324496563913392217011618096698140E2L, + -4.581631138049836157425391886957389240794E2L, + -2.382844088987092233033215402335026078208E1L, + -7.119237852400600507927038680970936336458E1L, + 1.313609646108420136332418282286454287146E1L, + -6.188608702082264389155862490056401365834E0L, + -2.787116601106678287277373011101132659279E-2L, + -2.230395570574153963203348263549700967918E-2L + ]; + immutable real[9] RDr14 = [ + 2.495187439241869732696223349840963702875E3L, + 2.503549449872925580011284635695738412162E2L, + 1.159033560988895481698051531263861842461E3L, + 9.493751466542304491261487998684383688622E1L, + 2.276214929562354328261422263078480321204E2L, + 1.367697521219069280358984081407807931847E1L, + 2.276988395995528495055594829206582732682E1L, + 7.647745753648996559837591812375456641163E-1L, + 1.0 + ]; + + // erfc(0.375) = C14a + C14b to extra precision. + immutable real C14a = 0.5958709716796875L; + immutable real C14b = 1.2118885490201676174914080878232469565953E-5L; + + // erfc(x + 0.5) = erfc(0.5) + x R(x) + // 0 <= x < 0.125 + // Peak relative error 4.7e-36 + immutable real[9] RNr15 = [ + -2.624212418011181487924855581955853461925E3L, + 8.473828904647825181073831556439301342756E2L, + -5.286207458628380765099405359607331669027E2L, + -3.895781234155315729088407259045269652318E1L, + -6.200857908065163618041240848728398496256E1L, + 1.469324610346924001393137895116129204737E1L, + -6.961356525370658572800674953305625578903E0L, + 5.145724386641163809595512876629030548495E-3L, + 1.990253655948179713415957791776180406812E-2L + ]; + immutable real[9] RDr15 = [ + 2.986190760847974943034021764693341524962E3L, + 5.288262758961073066335410218650047725985E2L, + 1.363649178071006978355113026427856008978E3L, + 1.921707975649915894241864988942255320833E2L, + 2.588651100651029023069013885900085533226E2L, + 2.628752920321455606558942309396855629459E1L, + 2.455649035885114308978333741080991380610E1L, + 1.378826653595128464383127836412100939126E0L, + 1.0 + ]; + // erfc(0.5) = C15a + C15b to extra precision. + immutable real C15a = 0.4794921875L; + immutable real C15b = 7.9346869534623172533461080354712635484242E-6L; + + // erfc(x + 0.625) = erfc(0.625) + x R(x) + // 0 <= x < 0.125 + // Peak relative error 5.1e-36 + immutable real[9] RNr16 = [ + -2.347887943200680563784690094002722906820E3L, + 8.008590660692105004780722726421020136482E2L, + -5.257363310384119728760181252132311447963E2L, + -4.471737717857801230450290232600243795637E1L, + -4.849540386452573306708795324759300320304E1L, + 1.140885264677134679275986782978655952843E1L, + -6.731591085460269447926746876983786152300E0L, + 1.370831653033047440345050025876085121231E-1L, + 2.022958279982138755020825717073966576670E-2L, + ]; + immutable real[9] RDr16 = [ + 3.075166170024837215399323264868308087281E3L, + 8.730468942160798031608053127270430036627E2L, + 1.458472799166340479742581949088453244767E3L, + 3.230423687568019709453130785873540386217E2L, + 2.804009872719893612081109617983169474655E2L, + 4.465334221323222943418085830026979293091E1L, + 2.612723259683205928103787842214809134746E1L, + 2.341526751185244109722204018543276124997E0L, + 1.0 + ]; + // erfc(0.625) = C16a + C16b to extra precision. + immutable real C16a = 0.3767547607421875L; + immutable real C16b = 4.3570693945275513594941232097252997287766E-6L; + + // erfc(x + 0.75) = erfc(0.75) + x R(x) + // 0 <= x < 0.125 + // Peak relative error 1.7e-35 + immutable real[9] RNr17 = [ + -1.767068734220277728233364375724380366826E3L, + 6.693746645665242832426891888805363898707E2L, + -4.746224241837275958126060307406616817753E2L, + -2.274160637728782675145666064841883803196E1L, + -3.541232266140939050094370552538987982637E1L, + 6.988950514747052676394491563585179503865E0L, + -5.807687216836540830881352383529281215100E0L, + 3.631915988567346438830283503729569443642E-1L, + -1.488945487149634820537348176770282391202E-2L + ]; + immutable real[9] RDr17 = [ + 2.748457523498150741964464942246913394647E3L, + 1.020213390713477686776037331757871252652E3L, + 1.388857635935432621972601695296561952738E3L, + 3.903363681143817750895999579637315491087E2L, + 2.784568344378139499217928969529219886578E2L, + 5.555800830216764702779238020065345401144E1L, + 2.646215470959050279430447295801291168941E1L, + 2.984905282103517497081766758550112011265E0L, + 1.0 + ]; + // erfc(0.75) = C17a + C17b to extra precision. + immutable real C17a = 0.2888336181640625L; + immutable real C17b = 1.0748182422368401062165408589222625794046E-5L; + + + // erfc(x + 0.875) = erfc(0.875) + x R(x) + // 0 <= x < 0.125 + // Peak relative error 2.2e-35 + immutable real[9] RNr18 = [ + -1.342044899087593397419622771847219619588E3L, + 6.127221294229172997509252330961641850598E2L, + -4.519821356522291185621206350470820610727E2L, + 1.223275177825128732497510264197915160235E1L, + -2.730789571382971355625020710543532867692E1L, + 4.045181204921538886880171727755445395862E0L, + -4.925146477876592723401384464691452700539E0L, + 5.933878036611279244654299924101068088582E-1L, + -5.557645435858916025452563379795159124753E-2L + ]; + immutable real[9] RDr18 = [ + 2.557518000661700588758505116291983092951E3L, + 1.070171433382888994954602511991940418588E3L, + 1.344842834423493081054489613250688918709E3L, + 4.161144478449381901208660598266288188426E2L, + 2.763670252219855198052378138756906980422E2L, + 5.998153487868943708236273854747564557632E1L, + 2.657695108438628847733050476209037025318E1L, + 3.252140524394421868923289114410336976512E0L, + 1.0 + ]; + + // erfc(0.875) = C18a + C18b to extra precision. + immutable real C18a = 0.215911865234375L; + immutable real C18b = 1.3073705765341685464282101150637224028267E-5L; + + // erfc(x + 1.0) = erfc(1.0) + x R(x) + // 0 <= x < 0.125 + // Peak relative error 1.6e-35 + immutable real[9] RNr19 = [ + -1.139180936454157193495882956565663294826E3L, + 6.134903129086899737514712477207945973616E2L, + -4.628909024715329562325555164720732868263E2L, + 4.165702387210732352564932347500364010833E1L, + -2.286979913515229747204101330405771801610E1L, + 1.870695256449872743066783202326943667722E0L, + -4.177486601273105752879868187237000032364E0L, + 7.533980372789646140112424811291782526263E-1L, + -8.629945436917752003058064731308767664446E-2L + ]; + immutable real[9] RDr19 = [ + 2.744303447981132701432716278363418643778E3L, + 1.266396359526187065222528050591302171471E3L, + 1.466739461422073351497972255511919814273E3L, + 4.868710570759693955597496520298058147162E2L, + 2.993694301559756046478189634131722579643E2L, + 6.868976819510254139741559102693828237440E1L, + 2.801505816247677193480190483913753613630E1L, + 3.604439909194350263552750347742663954481E0L, + 1.0 + ]; + + // erfc(1.0) = C19a + C19b to extra precision. + immutable real C19a = 0.15728759765625L; + immutable real C19b = 1.1609394035130658779364917390740703933002E-5L; + + // erfc(x + 1.125) = erfc(1.125) + x R(x) + // 0 <= x < 0.125 + // Peak relative error 3.6e-36 + immutable real[9] RNr20 = [ + -9.652706916457973956366721379612508047640E2L, + 5.577066396050932776683469951773643880634E2L, + -4.406335508848496713572223098693575485978E2L, + 5.202893466490242733570232680736966655434E1L, + -1.931311847665757913322495948705563937159E1L, + -9.364318268748287664267341457164918090611E-2L, + -3.306390351286352764891355375882586201069E0L, + 7.573806045289044647727613003096916516475E-1L, + -9.611744011489092894027478899545635991213E-2L + ]; + immutable real[9] RDr20 = [ + 3.032829629520142564106649167182428189014E3L, + 1.659648470721967719961167083684972196891E3L, + 1.703545128657284619402511356932569292535E3L, + 6.393465677731598872500200253155257708763E2L, + 3.489131397281030947405287112726059221934E2L, + 8.848641738570783406484348434387611713070E1L, + 3.132269062552392974833215844236160958502E1L, + 4.430131663290563523933419966185230513168E0L, + 1.0 + ]; + + // erfc(1.125) = C20a + C20b to extra precision. + immutable real C20a = 0.111602783203125L; + immutable real C20b = 8.9850951672359304215530728365232161564636E-6L; + + // erfc(x) = exp(-x^2) 1/x R(1/x^2) / S(1/x^2) + // 7/8 <= 1/x < 1 + // Peak relative error 1.4e-35 + immutable real[10] RNr8 = [ + 3.587451489255356250759834295199296936784E1L, + 5.406249749087340431871378009874875889602E2L, + 2.931301290625250886238822286506381194157E3L, + 7.359254185241795584113047248898753470923E3L, + 9.201031849810636104112101947312492532314E3L, + 5.749697096193191467751650366613289284777E3L, + 1.710415234419860825710780802678697889231E3L, + 2.150753982543378580859546706243022719599E2L, + 8.740953582272147335100537849981160931197E0L, + 4.876422978828717219629814794707963640913E-2L + ]; + immutable real[10] RDr8 = [ + 6.358593134096908350929496535931630140282E1L, + 9.900253816552450073757174323424051765523E2L, + 5.642928777856801020545245437089490805186E3L, + 1.524195375199570868195152698617273739609E4L, + 2.113829644500006749947332935305800887345E4L, + 1.526438562626465706267943737310282977138E4L, + 5.561370922149241457131421914140039411782E3L, + 9.394035530179705051609070428036834496942E2L, + 6.147019596150394577984175188032707343615E1L, + 1.0L + ]; + + // erfc(x) = exp(-x^2) 1/x R(1/x^2) / S(1/x^2) + // 3/4 <= 1/x < 7/8 + // Peak relative error 1.7e-36 + immutable real[10] RNr7 = [ + 1.293515364043117601705535485785956592493E2L, + 2.474534371269189867053251150063459055230E3L, + 1.756537563959875738809491329250457486510E4L, + 5.977479535376639763773153344676726091607E4L, + 1.054313816139671870123172936972055385049E5L, + 9.754699773487726957401038094714603033904E4L, + 4.579131242577671038339922925213209214880E4L, + 1.000710322164510887997115157797717324370E4L, + 8.496863250712471449526805271633794700452E2L, + 1.797349831386892396933210199236530557333E1L + ]; + immutable real[11] RDr7 = [ + 2.292696320307033494820399866075534515002E2L, + 4.500632608295626968062258401895610053116E3L, + 3.321218723485498111535866988511716659339E4L, + 1.196084512221845156596781258490840961462E5L, + 2.287033883912529843927983406878910939930E5L, + 2.370223495794642027268482075021298394425E5L, + 1.305173734022437154610938308944995159199E5L, + 3.589386258485887630236490009835928559621E4L, + 4.339996864041074149726360516336773136101E3L, + 1.753135522665469574605384979152863899099E2L, + 1.0L + ]; + + // erfc(x) = exp(-x^2) 1/x R(1/x^2) / S(1/x^2) + // 5/8 <= 1/x < 3/4 + // Peak relative error 1.6e-35 + immutable real[10] RNr6 = [ + 1.423313561367394923305025174137639124533E1L, + 3.244462503273003837514629113846075327206E2L, + 2.784937282403293364911673341412846781934E3L, + 1.163060685810874867196849890286455473382E4L, + 2.554141394931962276102205517358731053756E4L, + 2.982733782500729530503336931258698708782E4L, + 1.789683564523810605328169719436374742840E4L, + 5.056032142227470121262177112822018882754E3L, + 5.605349942234782054561269306895707034586E2L, + 1.561652599080729507274832243665726064881E1L + ]; + immutable real[11] RDr6 = [ + 2.522757606611479946069351519410222913326E1L, + 5.876797910931896554014229647006604017806E2L, + 5.211092128250480712011248211246144751074E3L, + 2.282679910855404599271496827409168580797E4L, + 5.371245819205596609986320599133109262447E4L, + 6.926186102106400355114925675028888924445E4L, + 4.794366033363621432575096487724913414473E4L, + 1.673190682734065914573814938835674963896E4L, + 2.589544846151313120096957014256536236242E3L, + 1.349438432583208276883323156200117027433E2L, + 1.0L + ]; + + // erfc(x) = exp(-x^2) 1/x R(1/x^2) / S(1/x^2) + // 1/2 <= 1/x < 5/8 + // Peak relative error 4.3e-36 + immutable real[11] RNr5 = [ + 6.743447478279267339131211137241149796763E-2L, + 2.031339015932962998168472743355874796350E0L, + 2.369234815713196480221800940201367484379E1L, + 1.387819614016107433603101545594790875922E2L, + 4.435600256936515839937720907171966121786E2L, + 7.881577949936817507981170892417739733047E2L, + 7.615749099291545976179905281851765734680E2L, + 3.752484528663442467089606663006771157777E2L, + 8.279644286027145214308303292537009564726E1L, + 6.201462983413738162709722770960040042647E0L, + 6.649631608720062333043506249503378282697E-2L + ]; + immutable real[11] RDr5 = [ + 1.195244945161889822018178270706903972343E-1L, + 3.660216908153253021384862427197665991311E0L, + 4.373405883243078019655721779021995159854E1L, + 2.653305963056235008916733402786877121865E2L, + 8.921329790491152046318422124415895506335E2L, + 1.705552231555600759729260640562363304312E3L, + 1.832711109606893446763174603477244625325E3L, + 1.056823953275835649973998168744261083316E3L, + 2.975561981792909722126456997074344895584E2L, + 3.393149095158232521894537008472203487436E1L, + 1.0L + ]; + + // erfc(x) = exp(-x^2) 1/x R(1/x^2) / S(1/x^2) + // 3/8 <= 1/x < 1/2 + // Peak relative error 1.8e-36 + immutable real[11] RNr4 = [ + 3.558685919236420073872459554885612994007E-2L, + 1.460223642496950651561817195253277924528E0L, + 2.379856746555189546876720308841066577268E1L, + 2.005205521246554860334064698817220160117E2L, + 9.533374928664989955629120027419490517596E2L, + 2.623024576994438336130421711314560425373E3L, + 4.126446434603735586340585027628851620886E3L, + 3.540675861596687801829655387867654030013E3L, + 1.506037084891064572653273761987617394259E3L, + 2.630715699182706745867272452228891752353E2L, + 1.202476629652900619635409242749750364878E1L + ]; + immutable real[12] RDr4 = [ + 6.307606561714590590399683184410336583739E-2L, + 2.619717051134271249293056836082721776665E0L, + 4.344441402681725017630451522968410844608E1L, + 3.752891116408399440953195184301023399176E2L, + 1.849305988804654653921972804388006355502E3L, + 5.358505261991675891835885654499883449403E3L, + 9.091890995405251314631428721090705475825E3L, + 8.731418313949291797856351745278287516416E3L, + 4.420211285043270337492325400764271868740E3L, + 1.031487363021856106882306509107923200832E3L, + 8.387036084846046121805145056040429461783E1L, + 1.0L + ]; + + // erfc(x) = exp(-x^2) 1/x R(1/x^2) / S(1/x^2) + // 1/4 <= 1/x < 3/8 + // Peak relative error 8.1e-37 + immutable real[12] RNr3 = [ + 4.584481542956275354582319313040418316755E-5L, + 2.674923158288848442110883948437930349128E-3L, + 6.344232532055212248017211243740466847311E-2L, + 7.985145965992002744933550450451513513963E-1L, + 5.845078061888281665064746347663123946270E0L, + 2.566625318816866587482759497608029522596E1L, + 6.736225182343446605268837827950856640948E1L, + 1.021796460139598089409347761712730512053E2L, + 8.344336615515430530929955615400706619764E1L, + 3.207749011528249356283897356277376306967E1L, + 4.386185123863412086856423971695142026036E0L, + 8.971576448581208351826868348023528863856E-2L + ]; + immutable real[12] RDr3 = [ + 8.125781965218112303281657065320409661370E-5L, + 4.781806762611504685247817818428945295520E-3L, + 1.147785538413798317790357996845767614561E-1L, + 1.469285552007088106614218996464752307606E0L, + 1.101712261349880339221039938999124077650E1L, + 5.008507527095093413713171655268276861071E1L, + 1.383058691613468970486425146336829447433E2L, + 2.264114250278912520501010108736339599752E2L, + 2.081377197698598680576330179979996940039E2L, + 9.724438129690013609440151781601781137944E1L, + 1.907905050771832372089975877589291760121E1L, + 1.0L + ]; + + // erfc(x) = exp(-x^2) 1/x R(1/x^2) / S(1/x^2) + // 1/8 <= 1/x < 1/4 + // Peak relative error 1.5e-36 + immutable real[12] RNr2 = [ + 6.928615158005256885698045840589513728399E-7L, + 5.616245938942075826026382337922413007879E-5L, + 1.871624980715261794832358438894219696113E-3L, + 3.349922063795792371642023765253747563009E-2L, + 3.531865233974349943956345502463135695834E-1L, + 2.264714157625072773976468825160906342360E0L, + 8.810720294489253776747319730638214883026E0L, + 2.014056685571655833019183248931442888437E1L, + 2.524586947657190747039554310814128743320E1L, + 1.520656940937208886246188940244581671609E1L, + 3.334145500790963675035841482334493680498E0L, + 1.122108380007109245896534245151140632457E-1L + ]; + immutable real[12] RDr2 = [ + 1.228065061824874795984937092427781089256E-6L, + 1.001593999520159167559129042893802235969E-4L, + 3.366527555699367241421450749821030974446E-3L, + 6.098626947195865254152265585991861150369E-2L, + 6.541547922508613985813189387198804660235E-1L, + 4.301130233305371976727117480925676583204E0L, + 1.737155892350891711527711121692994762909E1L, + 4.206892112110558214680649401236873828801E1L, + 5.787487996025016843403524261574779631219E1L, + 4.094047148590822715163181507813774861621E1L, + 1.230603930056944875836549716515643997094E1L, + 1.0L + ]; + + // erfc(x) = exp(-x^2) 1/x R(1/x^2) / S(1/x^2) + // 1/128 <= 1/x < 1/8 + // Peak relative error 2.2e-36 + immutable real[10] RNr1 = [ + 1.293111801138199795250035229383033322539E-6L, + 9.785224720880980456171438759161161816706E-5L, + 2.932474396233212166056331430621176065943E-3L, + 4.496429595719847083917435337780697436921E-2L, + 3.805989855927720844877478869846718877846E-1L, + 1.789745532222426292126781724570152590071E0L, + 4.465737379634389318903237306594171764628E0L, + 5.268535822258082278401240171488850433767E0L, + 2.258357766807433839494276681092713991651E0L, + 1.504459334078750002966538036652860809497E-1L + ]; + immutable real[10] RDr1 = [ + 2.291980991578770070179177302906728371406E-6L, + 1.745845828808028552029674694534934620384E-4L, + 5.283248841982102317072923869576785278019E-3L, + 8.221212297078141470944454807434634848018E-2L, + 7.120500674861902950423510939060230945621E-1L, + 3.475435367560809622183983439133664598155E0L, + 9.243253391989233533874386043611304387113E0L, + 1.227894792475280941511758877318903197188E1L, + 6.789361410398841316638617624392719077724E0L, + 1.0L + ]; + + // erf(z+1) = erfConst + P(z)/Q(z) + // -.125 <= z <= 0 + // Peak relative error 7.3e-36 + immutable real erfConst = 0.845062911510467529296875L; + immutable real[9] TN2 = [ + -4.088889697077485301010486931817357000235E1L, + 7.157046430681808553842307502826960051036E3L, + -2.191561912574409865550015485451373731780E3L, + 2.180174916555316874988981177654057337219E3L, + 2.848578658049670668231333682379720943455E2L, + 1.630362490952512836762810462174798925274E2L, + 6.317712353961866974143739396865293596895E0L, + 2.450441034183492434655586496522857578066E1L, + 5.127662277706787664956025545897050896203E-1L + ]; + immutable real[10] TD2 = [ + 1.731026445926834008273768924015161048885E4L, + 1.209682239007990370796112604286048173750E4L, + 1.160950290217993641320602282462976163857E4L, + 5.394294645127126577825507169061355698157E3L, + 2.791239340533632669442158497532521776093E3L, + 8.989365571337319032943005387378993827684E2L, + 2.974016493766349409725385710897298069677E2L, + 6.148192754590376378740261072533527271947E1L, + 1.178502892490738445655468927408440847480E1L, + 1.0L + ]; + + // erf(x) = x + x P(x^2)/Q(x^2) + // 0 <= x <= 7/8 + // Peak relative error 1.8e-35 + immutable real[9] TN1 = [ + -3.858252324254637124543172907442106422373E10L, + 9.580319248590464682316366876952214879858E10L, + 1.302170519734879977595901236693040544854E10L, + 2.922956950426397417800321486727032845006E9L, + 1.764317520783319397868923218385468729799E8L, + 1.573436014601118630105796794840834145120E7L, + 4.028077380105721388745632295157816229289E5L, + 1.644056806467289066852135096352853491530E4L, + 3.390868480059991640235675479463287886081E1L + ]; + immutable real[10] TD1 = [ + -3.005357030696532927149885530689529032152E11L, + -1.342602283126282827411658673839982164042E11L, + -2.777153893355340961288511024443668743399E10L, + -3.483826391033531996955620074072768276974E9L, + -2.906321047071299585682722511260895227921E8L, + -1.653347985722154162439387878512427542691E7L, + -6.245520581562848778466500301865173123136E5L, + -1.402124304177498828590239373389110545142E4L, + -1.209368072473510674493129989468348633579E2L, + 1.0L + ]; +} +else +{ + /* erfc(x) = exp(-x^2) 1/x R(1/x^2) / S(1/x^2) + 1/128 <= 1/x < 1/8 + Peak relative error 1.9e-21 */ + immutable real[5] R = [ 0x1.b9f6d8b78e22459ep-6, 0x1.1b84686b0a4ea43ap-1, + 0x1.b8f6aebe96000c2ap+1, 0x1.cb1dbedac27c8ec2p+2, 0x1.cf885f8f572a4c14p+1 + ]; + + immutable real[6] S = [ + 0x1.87ae3cae5f65eb5ep-5, 0x1.01616f266f306d08p+0, 0x1.a4abe0411eed6c22p+2, + 0x1.eac9ce3da600abaap+3, 0x1.5752a9ac2faebbccp+3, 1.0 + ]; + + /* erf(x) = x P(x^2)/Q(x^2) + 0 <= x <= 1 + Peak relative error 7.6e-23 */ + immutable real[7] T = [ 0x1.0da01654d757888cp+20, 0x1.2eb7497bc8b4f4acp+17, + 0x1.79078c19530f72a8p+15, 0x1.4eaf2126c0b2c23p+11, 0x1.1f2ea81c9d272a2ep+8, + 0x1.59ca6e2d866e625p+2, 0x1.c188e0b67435faf4p-4 + ]; + + immutable real[7] U = [ 0x1.dde6025c395ae34ep+19, 0x1.c4bc8b6235df35aap+18, + 0x1.8465900e88b6903ap+16, 0x1.855877093959ffdp+13, 0x1.e5c44395625ee358p+9, + 0x1.6a0fed103f1c68a6p+5, 1.0 + ]; +} +} + +/** + * Complementary error function + * + * erfc(x) = 1 - erf(x), and has high relative accuracy for + * values of x far from zero. (For values near zero, use erf(x)). + * + * 1 - erf(x) = 2/ $(SQRT)(π) + * $(INTEGRAL x, $(INFINITY)) exp( - $(POWER t, 2)) dt + * + * + * For small x, erfc(x) = 1 - erf(x); otherwise rational + * approximations are computed. + * + * A special function expx2(x) is used to suppress error amplification + * in computing exp(-x^2). + */ +real erfc(real a) +{ + if (a == real.infinity) + return 0.0; + if (a == -real.infinity) + return 2.0; + + immutable x = (a < 0.0) ? -a : a; + + if (x < (isIEEEQuadruple ? 0.25 : 1.0)) + return 1.0 - erf(a); + + static if (isIEEEQuadruple) + { + if (x < 1.25) + { + real y; + final switch (cast(int)(8.0 * x)) + { + case 2: + const z = x - 0.25; + y = C13b + z * rationalPoly(z, RNr13, RDr13); + y += C13a; + break; + case 3: + const z = x - 0.375; + y = C14b + z * rationalPoly(z, RNr14, RDr14); + y += C14a; + break; + case 4: + const z = x - 0.5; + y = C15b + z * rationalPoly(z, RNr15, RDr15); + y += C15a; + break; + case 5: + const z = x - 0.625; + y = C16b + z * rationalPoly(z, RNr16, RDr16); + y += C16a; + break; + case 6: + const z = x - 0.75; + y = C17b + z * rationalPoly(z, RNr17, RDr17); + y += C17a; + break; + case 7: + const z = x - 0.875; + y = C18b + z * rationalPoly(z, RNr18, RDr18); + y += C18a; + break; + case 8: + const z = x - 1.0; + y = C19b + z * rationalPoly(z, RNr19, RDr19); + y += C19a; + break; + case 9: + const z = x - 1.125; + y = C20b + z * rationalPoly(z, RNr20, RDr20); + y += C20a; + break; + } + if (a < 0.0) + y = 2.0 - y; + return y; + } + } + + if (-a * a < -MAXLOG) + { + // underflow + if (a < 0.0) return 2.0; + else return 0.0; + } + + real y; + immutable z = expx2(a, -1); + + static if (isIEEEQuadruple) + { + y = z * erfce(x); + } + else + { + y = 1.0 / x; + if (x < 8.0) + y = z * rationalPoly(y, P, Q); + else + y = z * y * rationalPoly(y * y, R, S); + } + + if (a < 0.0) + y = 2.0 - y; + + if (y == 0.0) + { + // underflow + if (a < 0.0) return 2.0; + else return 0.0; + } + + return y; +} + + +private { +/* Exponentially scaled erfc function + exp(x^2) erfc(x) + valid for x > 1. + Use with normalDistribution and expx2. */ +static if (isIEEEQuadruple) +{ + real erfce(real x) + { + immutable z = 1.0L / (x * x); + + real p; + switch (cast(int)(8.0 / x)) + { + default: + case 0: + p = rationalPoly(z, RNr1, RDr1); + break; + case 1: + p = rationalPoly(z, RNr2, RDr2); + break; + case 2: + p = rationalPoly(z, RNr3, RDr3); + break; + case 3: + p = rationalPoly(z, RNr4, RDr4); + break; + case 4: + p = rationalPoly(z, RNr5, RDr5); + break; + case 5: + p = rationalPoly(z, RNr6, RDr6); + break; + case 6: + p = rationalPoly(z, RNr7, RDr7); + break; + case 7: + p = rationalPoly(z, RNr8, RDr8); + break; + } + return p / x; + } +} +else +{ + real erfce(real x) + { + real y = 1.0/x; + + if (x < 8.0) + { + return rationalPoly(y, P, Q); + } + else + { + return y * rationalPoly(y * y, R, S); + } + } +} +} + +/** + * Error function + * + * The integral is + * + * erf(x) = 2/ $(SQRT)(π) + * $(INTEGRAL 0, x) exp( - $(POWER t, 2)) dt + * + * The magnitude of x is limited to about 106.56 for IEEE 80-bit + * arithmetic; 1 or -1 is returned outside this range. + * + * For 0 <= |x| < 1, a rational polynomials are used; otherwise + * erf(x) = 1 - erfc(x). + * + * ACCURACY: + * Relative error: + * arithmetic domain # trials peak rms + * IEEE 0,1 50000 2.0e-19 5.7e-20 + */ +real erf(real x) +{ + if (x == 0.0) + return x; // deal with negative zero + if (x == -real.infinity) + return -1.0; + if (x == real.infinity) + return 1.0; + immutable ax = abs(x); + if (ax > 1.0L) + return 1.0L - erfc(x); + + static if (isIEEEQuadruple) + { + immutable z = x * x; + + real y; + if (ax < 0.875) + { + y = ax + ax * rationalPoly(x * x, TN1, TD1); + } + else + { + y = erfConst + rationalPoly(ax - 1.0L, TN2, TD2); + } + + if (x < 0) + y = -y; + return y; + } + else + { + real z = x * x; + return x * rationalPoly(x * x, T, U); + } +} + +@safe unittest +{ + // High resolution test points. + enum real erfc0_250 = 0.723663330078125 + 1.0279753638067014931732235184287934646022E-5; + enum real erfc0_375 = 0.5958709716796875 + 1.2118885490201676174914080878232469565953E-5; + enum real erfc0_500 = 0.4794921875 + 7.9346869534623172533461080354712635484242E-6; + enum real erfc0_625 = 0.3767547607421875 + 4.3570693945275513594941232097252997287766E-6; + enum real erfc0_750 = 0.2888336181640625 + 1.0748182422368401062165408589222625794046E-5; + enum real erfc0_875 = 0.215911865234375 + 1.3073705765341685464282101150637224028267E-5; + enum real erfc1_000 = 0.15728759765625 + 1.1609394035130658779364917390740703933002E-5; + enum real erfc1_125 = 0.111602783203125 + 8.9850951672359304215530728365232161564636E-6; + + enum real erf0_875 = (1-0.215911865234375) - 1.3073705765341685464282101150637224028267E-5; + + static bool isNaNWithPayload(real x, ulong payload) @safe pure nothrow @nogc + { + return isNaN(x) && getNaNPayload(x) == payload; + } + + assert(feqrel(erfc(0.250L), erfc0_250 )>=real.mant_dig-1); + assert(feqrel(erfc(0.375L), erfc0_375 )>=real.mant_dig-0); + assert(feqrel(erfc(0.500L), erfc0_500 )>=real.mant_dig-2); + assert(feqrel(erfc(0.625L), erfc0_625 )>=real.mant_dig-1); + assert(feqrel(erfc(0.750L), erfc0_750 )>=real.mant_dig-1); + assert(feqrel(erfc(0.875L), erfc0_875 )>=real.mant_dig-4); + assert(feqrel(erfc(1.000L), erfc1_000 )>=real.mant_dig-2); + assert(feqrel(erfc(1.125L), erfc1_125 )>=real.mant_dig-2); + assert(feqrel(erf(0.875L), erf0_875 )>=real.mant_dig-1); + // The DMC implementation of erfc() fails this next test (just). + // Test point from Mathematica 11.0. + assert(feqrel(erfc(4.1L), 6.70002765408489837272673380763418472e-9L) >= real.mant_dig-5); + + assert(isIdentical(erf(0.0),0.0)); + assert(isIdentical(erf(-0.0),-0.0)); + assert(erf(real.infinity) == 1.0); + assert(erf(-real.infinity) == -1.0); + assert(isNaNWithPayload(erf(NaN(0xDEF)), 0xDEF)); + assert(isNaNWithPayload(erfc(NaN(0xDEF)), 0xDEF)); + assert(isIdentical(erfc(real.infinity),0.0)); + assert(erfc(-real.infinity) == 2.0); + assert(erfc(0) == 1.0); +} + +/* + * Exponential of squared argument + * + * Computes y = exp(x*x) while suppressing error amplification + * that would ordinarily arise from the inexactness of the + * exponential argument x*x. + * + * If sign < 0, the result is inverted; i.e., y = exp(-x*x) . + * + * ACCURACY: + * Relative error: + * arithmetic domain # trials peak rms + * IEEE -106.566, 106.566 10^5 1.6e-19 4.4e-20 + */ + +real expx2(real x, int sign) +{ + /* + Cephes Math Library Release 2.9: June, 2000 + Copyright 2000 by Stephen L. Moshier + */ + const real M = 32_768.0; + const real MINV = 3.0517578125e-5L; + + x = abs(x); + if (sign < 0) + x = -x; + + /* Represent x as an exact multiple of M plus a residual. + M is a power of 2 chosen so that exp(m * m) does not overflow + or underflow and so that |x - m| is small. */ + real m = MINV * floor(M * x + 0.5L); + real f = x - m; + + /* x^2 = m^2 + 2mf + f^2 */ + real u = m * m; + real u1 = 2 * m * f + f * f; + + if (sign < 0) + { + u = -u; + u1 = -u1; + } + + if ((u+u1) > MAXLOG) + return real.infinity; + + /* u is exact, u1 is small. */ + return exp(u) * exp(u1); +} + + +/* +Computes the normal distribution function. + +The normal (or Gaussian, or bell-shaped) distribution is +defined as: + +normalDist(x) = 1/$(SQRT) π $(INTEGRAL -$(INFINITY), x) exp( - $(POWER t, 2)/2) dt + = 0.5 + 0.5 * erf(x/sqrt(2)) + = 0.5 * erfc(- x/sqrt(2)) + +To maintain accuracy at high values of x, use +normalDistribution(x) = 1 - normalDistribution(-x). + +Accuracy: +Within a few bits of machine resolution over the entire +range. + +References: +$(LINK http://www.netlib.org/cephes/ldoubdoc.html), +G. Marsaglia, "Evaluating the Normal Distribution", +Journal of Statistical Software <b>11</b>, (July 2004). +*/ +real normalDistributionImpl(real a) +{ + real x = a * SQRT1_2; + real z = abs(x); + + if ( z < 1.0 ) + return 0.5L + 0.5L * erf(x); + else + { + real y = 0.5L * erfce(z); + /* Multiply by exp(-x^2 / 2) */ + z = expx2(a, -1); + y = y * sqrt(z); + if ( x > 0.0L ) + y = 1.0L - y; + return y; + } +} + +@safe unittest +{ +assert(fabs(normalDistributionImpl(1L) - (0.841344746068543))< 0.0000000000000005); +assert(isIdentical(normalDistributionImpl(NaN(0x325)), NaN(0x325))); +} + +/* + * Inverse of Normal distribution function + * + * Returns the argument, x, for which the area under the + * Normal probability density function (integrated from + * minus infinity to x) is equal to p. + * + * For small arguments 0 < p < exp(-2), the program computes + * z = sqrt( -2 log(p) ); then the approximation is + * x = z - log(z)/z - (1/z) P(1/z) / Q(1/z) . + * For larger arguments, x/sqrt(2 pi) = w + w^3 R(w^2)/S(w^2)) , + * where w = p - 0.5. + */ +// TODO: isIEEEQuadruple (128 bit) real implementation; not available from CEPHES. +real normalDistributionInvImpl(real p) +in { + assert(p >= 0.0L && p <= 1.0L, "Domain error"); +} +body +{ +static immutable real[8] P0 = +[ -0x1.758f4d969484bfdcp-7, 0x1.53cee17a59259dd2p-3, + -0x1.ea01e4400a9427a2p-1, 0x1.61f7504a0105341ap+1, -0x1.09475a594d0399f6p+2, + 0x1.7c59e7a0df99e3e2p+1, -0x1.87a81da52edcdf14p-1, 0x1.1fb149fd3f83600cp-7 +]; + +static immutable real[8] Q0 = +[ -0x1.64b92ae791e64bb2p-7, 0x1.7585c7d597298286p-3, + -0x1.40011be4f7591ce6p+0, 0x1.1fc067d8430a425ep+2, -0x1.21008ffb1e7ccdf2p+3, + 0x1.3d1581cf9bc12fccp+3, -0x1.53723a89fd8f083cp+2, 1.0 +]; + +static immutable real[10] P1 = +[ 0x1.20ceea49ea142f12p-13, 0x1.cbe8a7267aea80bp-7, + 0x1.79fea765aa787c48p-2, 0x1.d1f59faa1f4c4864p+1, 0x1.1c22e426a013bb96p+4, + 0x1.a8675a0c51ef3202p+5, 0x1.75782c4f83614164p+6, 0x1.7a2f3d90948f1666p+6, + 0x1.5cd116ee4c088c3ap+5, 0x1.1361e3eb6e3cc20ap+2 +]; + +static immutable real[10] Q1 = +[ 0x1.3a4ce1406cea98fap-13, 0x1.f45332623335cda2p-7, + 0x1.98f28bbd4b98db1p-2, 0x1.ec3b24f9c698091cp+1, 0x1.1cc56ecda7cf58e4p+4, + 0x1.92c6f7376bf8c058p+5, 0x1.4154c25aa47519b4p+6, 0x1.1b321d3b927849eap+6, + 0x1.403a5f5a4ce7b202p+4, 1.0 +]; + +static immutable real[8] P2 = +[ 0x1.8c124a850116a6d8p-21, 0x1.534abda3c2fb90bap-13, + 0x1.29a055ec93a4718cp-7, 0x1.6468e98aad6dd474p-3, 0x1.3dab2ef4c67a601cp+0, + 0x1.e1fb3a1e70c67464p+1, 0x1.b6cce8035ff57b02p+2, 0x1.9f4c9e749ff35f62p+1 +]; + +static immutable real[8] Q2 = +[ 0x1.af03f4fc0655e006p-21, 0x1.713192048d11fb2p-13, + 0x1.4357e5bbf5fef536p-7, 0x1.7fdac8749985d43cp-3, 0x1.4a080c813a2d8e84p+0, + 0x1.c3a4b423cdb41bdap+1, 0x1.8160694e24b5557ap+2, 1.0 +]; + +static immutable real[8] P3 = +[ -0x1.55da447ae3806168p-34, -0x1.145635641f8778a6p-24, + -0x1.abf46d6b48040128p-17, -0x1.7da550945da790fcp-11, -0x1.aa0b2a31157775fap-8, + 0x1.b11d97522eed26bcp-3, 0x1.1106d22f9ae89238p+1, 0x1.029a358e1e630f64p+1 +]; + +static immutable real[8] Q3 = +[ -0x1.74022dd5523e6f84p-34, -0x1.2cb60d61e29ee836p-24, + -0x1.d19e6ec03a85e556p-17, -0x1.9ea2a7b4422f6502p-11, -0x1.c54b1e852f107162p-8, + 0x1.e05268dd3c07989ep-3, 0x1.239c6aff14afbf82p+1, 1.0 +]; + + if (p <= 0.0L || p >= 1.0L) + { + if (p == 0.0L) + return -real.infinity; + if ( p == 1.0L ) + return real.infinity; + return real.nan; // domain error + } + int code = 1; + real y = p; + if ( y > (1.0L - EXP_2) ) + { + y = 1.0L - y; + code = 0; + } + + real x, z, y2, x0, x1; + + if ( y > EXP_2 ) + { + y = y - 0.5L; + y2 = y * y; + x = y + y * (y2 * rationalPoly( y2, P0, Q0)); + return x * SQRT2PI; + } + + x = sqrt( -2.0L * log(y) ); + x0 = x - log(x)/x; + z = 1.0L/x; + if ( x < 8.0L ) + { + x1 = z * rationalPoly( z, P1, Q1); + } + else if ( x < 32.0L ) + { + x1 = z * rationalPoly( z, P2, Q2); + } + else + { + x1 = z * rationalPoly( z, P3, Q3); + } + x = x0 - x1; + if ( code != 0 ) + { + x = -x; + } + return x; +} + + +@safe unittest +{ + // TODO: Use verified test points. + // The values below are from Excel 2003. + assert(fabs(normalDistributionInvImpl(0.001) - (-3.09023230616779))< 0.00000000000005); + assert(fabs(normalDistributionInvImpl(1e-50) - (-14.9333375347885))< 0.00000000000005); + assert(feqrel(normalDistributionInvImpl(0.999), -normalDistributionInvImpl(0.001)) > real.mant_dig-6); + + // Excel 2003 gets all the following values wrong! + assert(normalDistributionInvImpl(0.0) == -real.infinity); + assert(normalDistributionInvImpl(1.0) == real.infinity); + assert(normalDistributionInvImpl(0.5) == 0); + // (Excel 2003 returns norminv(p) = -30 for all p < 1e-200). + // The value tested here is the one the function returned in Jan 2006. + real unknown1 = normalDistributionInvImpl(1e-250L); + assert( fabs(unknown1 -(-33.79958617269L) ) < 0.00000005); +} diff --git a/libphobos/src/std/internal/math/gammafunction.d b/libphobos/src/std/internal/math/gammafunction.d new file mode 100644 index 0000000..dd20691 --- /dev/null +++ b/libphobos/src/std/internal/math/gammafunction.d @@ -0,0 +1,1834 @@ +/** + * Implementation of the gamma and beta functions, and their integrals. + * + * License: $(HTTP boost.org/LICENSE_1_0.txt, Boost License 1.0). + * Copyright: Based on the CEPHES math library, which is + * Copyright (C) 1994 Stephen L. Moshier (moshier@world.std.com). + * Authors: Stephen L. Moshier (original C code). Conversion to D by Don Clugston + * + * +Macros: + * TABLE_SV = <table border="1" cellpadding="4" cellspacing="0"> + * <caption>Special Values</caption> + * $0</table> + * SVH = $(TR $(TH $1) $(TH $2)) + * SV = $(TR $(TD $1) $(TD $2)) + * GAMMA = Γ + * INTEGRATE = $(BIG ∫<sub>$(SMALL $1)</sub><sup>$2</sup>) + * POWER = $1<sup>$2</sup> + * NAN = $(RED NAN) + */ +module std.internal.math.gammafunction; +import std.internal.math.errorfunction; +import std.math; + +pure: +nothrow: +@safe: +@nogc: + +private { + +enum real SQRT2PI = 2.50662827463100050242E0L; // sqrt(2pi) +immutable real EULERGAMMA = 0.57721_56649_01532_86060_65120_90082_40243_10421_59335_93992L; /** Euler-Mascheroni constant 0.57721566.. */ + +// Polynomial approximations for gamma and loggamma. + +immutable real[8] GammaNumeratorCoeffs = [ 1.0, + 0x1.acf42d903366539ep-1, 0x1.73a991c8475f1aeap-2, 0x1.c7e918751d6b2a92p-4, + 0x1.86d162cca32cfe86p-6, 0x1.0c378e2e6eaf7cd8p-8, 0x1.dc5c66b7d05feb54p-12, + 0x1.616457b47e448694p-15 +]; + +immutable real[9] GammaDenominatorCoeffs = [ 1.0, + 0x1.a8f9faae5d8fc8bp-2, -0x1.cb7895a6756eebdep-3, -0x1.7b9bab006d30652ap-5, + 0x1.c671af78f312082ep-6, -0x1.a11ebbfaf96252dcp-11, -0x1.447b4d2230a77ddap-10, + 0x1.ec1d45bb85e06696p-13,-0x1.d4ce24d05bd0a8e6p-17 +]; + +immutable real[9] GammaSmallCoeffs = [ 1.0, + 0x1.2788cfc6fb618f52p-1, -0x1.4fcf4026afa2f7ecp-1, -0x1.5815e8fa24d7e306p-5, + 0x1.5512320aea2ad71ap-3, -0x1.59af0fb9d82e216p-5, -0x1.3b4b61d3bfdf244ap-7, + 0x1.d9358e9d9d69fd34p-8, -0x1.38fc4bcbada775d6p-10 +]; + +immutable real[9] GammaSmallNegCoeffs = [ -1.0, + 0x1.2788cfc6fb618f54p-1, 0x1.4fcf4026afa2bc4cp-1, -0x1.5815e8fa2468fec8p-5, + -0x1.5512320baedaf4b6p-3, -0x1.59af0fa283baf07ep-5, 0x1.3b4a70de31e05942p-7, + 0x1.d9398be3bad13136p-8, 0x1.291b73ee05bcbba2p-10 +]; + +immutable real[7] logGammaStirlingCoeffs = [ + 0x1.5555555555553f98p-4, -0x1.6c16c16c07509b1p-9, 0x1.a01a012461cbf1e4p-11, + -0x1.3813089d3f9d164p-11, 0x1.b911a92555a277b8p-11, -0x1.ed0a7b4206087b22p-10, + 0x1.402523859811b308p-8 +]; + +immutable real[7] logGammaNumerator = [ + -0x1.0edd25913aaa40a2p+23, -0x1.31c6ce2e58842d1ep+24, -0x1.f015814039477c3p+23, + -0x1.74ffe40c4b184b34p+22, -0x1.0d9c6d08f9eab55p+20, -0x1.54c6b71935f1fc88p+16, + -0x1.0e761b42932b2aaep+11 +]; + +immutable real[8] logGammaDenominator = [ + -0x1.4055572d75d08c56p+24, -0x1.deeb6013998e4d76p+24, -0x1.106f7cded5dcc79ep+24, + -0x1.25e17184848c66d2p+22, -0x1.301303b99a614a0ap+19, -0x1.09e76ab41ae965p+15, + -0x1.00f95ced9e5f54eep+9, 1.0 +]; + +/* + * Helper function: Gamma function computed by Stirling's formula. + * + * Stirling's formula for the gamma function is: + * + * $(GAMMA)(x) = sqrt(2 π) x<sup>x-0.5</sup> exp(-x) (1 + 1/x P(1/x)) + * + */ +real gammaStirling(real x) +{ + // CEPHES code Copyright 1994 by Stephen L. Moshier + + static immutable real[9] SmallStirlingCoeffs = [ + 0x1.55555555555543aap-4, 0x1.c71c71c720dd8792p-9, -0x1.5f7268f0b5907438p-9, + -0x1.e13cd410e0477de6p-13, 0x1.9b0f31643442616ep-11, 0x1.2527623a3472ae08p-14, + -0x1.37f6bc8ef8b374dep-11,-0x1.8c968886052b872ap-16, 0x1.76baa9c6d3eeddbcp-11 + ]; + + static immutable real[7] LargeStirlingCoeffs = [ 1.0L, + 8.33333333333333333333E-2L, 3.47222222222222222222E-3L, + -2.68132716049382716049E-3L, -2.29472093621399176955E-4L, + 7.84039221720066627474E-4L, 6.97281375836585777429E-5L + ]; + + real w = 1.0L/x; + real y = exp(x); + if ( x > 1024.0L ) + { + // For large x, use rational coefficients from the analytical expansion. + w = poly(w, LargeStirlingCoeffs); + // Avoid overflow in pow() + real v = pow( x, 0.5L * x - 0.25L ); + y = v * (v / y); + } + else + { + w = 1.0L + w * poly( w, SmallStirlingCoeffs); + static if (floatTraits!(real).realFormat == RealFormat.ieeeDouble) + { + // Avoid overflow in pow() for 64-bit reals + if (x > 143.0) + { + real v = pow( x, 0.5 * x - 0.25 ); + y = v * (v / y); + } + else + { + y = pow( x, x - 0.5 ) / y; + } + } + else + { + y = pow( x, x - 0.5L ) / y; + } + } + y = SQRT2PI * y * w; + return y; +} + +/* + * Helper function: Incomplete gamma function computed by Temme's expansion. + * + * This is a port of igamma_temme_large from Boost. + * + */ +real igammaTemmeLarge(real a, real x) +{ + static immutable real[][13] coef = [ + [ -0.333333333333333333333, 0.0833333333333333333333, + -0.0148148148148148148148, 0.00115740740740740740741, + 0.000352733686067019400353, -0.0001787551440329218107, + 0.39192631785224377817e-4, -0.218544851067999216147e-5, + -0.18540622107151599607e-5, 0.829671134095308600502e-6, + -0.176659527368260793044e-6, 0.670785354340149858037e-8, + 0.102618097842403080426e-7, -0.438203601845335318655e-8, + 0.914769958223679023418e-9, -0.255141939949462497669e-10, + -0.583077213255042506746e-10, 0.243619480206674162437e-10, + -0.502766928011417558909e-11 ], + [ -0.00185185185185185185185, -0.00347222222222222222222, + 0.00264550264550264550265, -0.000990226337448559670782, + 0.000205761316872427983539, -0.40187757201646090535e-6, + -0.18098550334489977837e-4, 0.764916091608111008464e-5, + -0.161209008945634460038e-5, 0.464712780280743434226e-8, + 0.137863344691572095931e-6, -0.575254560351770496402e-7, + 0.119516285997781473243e-7, -0.175432417197476476238e-10, + -0.100915437106004126275e-8, 0.416279299184258263623e-9, + -0.856390702649298063807e-10 ], + [ 0.00413359788359788359788, -0.00268132716049382716049, + 0.000771604938271604938272, 0.200938786008230452675e-5, + -0.000107366532263651605215, 0.529234488291201254164e-4, + -0.127606351886187277134e-4, 0.342357873409613807419e-7, + 0.137219573090629332056e-5, -0.629899213838005502291e-6, + 0.142806142060642417916e-6, -0.204770984219908660149e-9, + -0.140925299108675210533e-7, 0.622897408492202203356e-8, + -0.136704883966171134993e-8 ], + [ 0.000649434156378600823045, 0.000229472093621399176955, + -0.000469189494395255712128, 0.000267720632062838852962, + -0.756180167188397641073e-4, -0.239650511386729665193e-6, + 0.110826541153473023615e-4, -0.56749528269915965675e-5, + 0.142309007324358839146e-5, -0.278610802915281422406e-10, + -0.169584040919302772899e-6, 0.809946490538808236335e-7, + -0.191111684859736540607e-7 ], + [ -0.000861888290916711698605, 0.000784039221720066627474, + -0.000299072480303190179733, -0.146384525788434181781e-5, + 0.664149821546512218666e-4, -0.396836504717943466443e-4, + 0.113757269706784190981e-4, 0.250749722623753280165e-9, + -0.169541495365583060147e-5, 0.890750753220530968883e-6, + -0.229293483400080487057e-6], + [ -0.000336798553366358150309, -0.697281375836585777429e-4, + 0.000277275324495939207873, -0.000199325705161888477003, + 0.679778047793720783882e-4, 0.141906292064396701483e-6, + -0.135940481897686932785e-4, 0.801847025633420153972e-5, + -0.229148117650809517038e-5 ], + [ 0.000531307936463992223166, -0.000592166437353693882865, + 0.000270878209671804482771, 0.790235323266032787212e-6, + -0.815396936756196875093e-4, 0.561168275310624965004e-4, + -0.183291165828433755673e-4, -0.307961345060330478256e-8, + 0.346515536880360908674e-5, -0.20291327396058603727e-5, + 0.57887928631490037089e-6 ], + [ 0.000344367606892377671254, 0.517179090826059219337e-4, + -0.000334931610811422363117, 0.000281269515476323702274, + -0.000109765822446847310235, -0.127410090954844853795e-6, + 0.277444515115636441571e-4, -0.182634888057113326614e-4, + 0.578769494973505239894e-5 ], + [ -0.000652623918595309418922, 0.000839498720672087279993, + -0.000438297098541721005061, -0.696909145842055197137e-6, + 0.000166448466420675478374, -0.000127835176797692185853, + 0.462995326369130429061e-4 ], + [ -0.000596761290192746250124, -0.720489541602001055909e-4, + 0.000678230883766732836162, -0.0006401475260262758451, + 0.000277501076343287044992 ], + [ 0.00133244544948006563713, -0.0019144384985654775265, + 0.00110893691345966373396 ], + [ 0.00157972766073083495909, 0.000162516262783915816899, + -0.00206334210355432762645, 0.00213896861856890981541, + -0.00101085593912630031708 ], + [ -0.00407251211951401664727, 0.00640336283380806979482, + -0.00404101610816766177474 ] + ]; + + // avoid nans when one of the arguments is inf: + if (x == real.infinity && a != real.infinity) + return 0; + + if (x != real.infinity && a == real.infinity) + return 1; + + real sigma = (x - a) / a; + real phi = sigma - log(sigma + 1); + + real y = a * phi; + real z = sqrt(2 * phi); + if (x < a) + z = -z; + + real[13] workspace; + foreach (i; 0 .. coef.length) + workspace[i] = poly(z, coef[i]); + + real result = poly(1 / a, workspace); + result *= exp(-y) / sqrt(2 * PI * a); + if (x < a) + result = -result; + + result += erfc(sqrt(y)) / 2; + + return result; +} + +} // private + +public: +/// The maximum value of x for which gamma(x) < real.infinity. +static if (floatTraits!(real).realFormat == RealFormat.ieeeQuadruple) + enum real MAXGAMMA = 1755.5483429L; +else static if (floatTraits!(real).realFormat == RealFormat.ieeeExtended) + enum real MAXGAMMA = 1755.5483429L; +else static if (floatTraits!(real).realFormat == RealFormat.ieeeDouble) + enum real MAXGAMMA = 171.6243769L; +else + static assert(0, "missing MAXGAMMA for other real types"); + + +/***************************************************** + * The Gamma function, $(GAMMA)(x) + * + * $(GAMMA)(x) is a generalisation of the factorial function + * to real and complex numbers. + * Like x!, $(GAMMA)(x+1) = x*$(GAMMA)(x). + * + * Mathematically, if z.re > 0 then + * $(GAMMA)(z) = $(INTEGRATE 0, ∞) $(POWER t, z-1)$(POWER e, -t) dt + * + * $(TABLE_SV + * $(SVH x, $(GAMMA)(x) ) + * $(SV $(NAN), $(NAN) ) + * $(SV ±0.0, ±∞) + * $(SV integer > 0, (x-1)! ) + * $(SV integer < 0, $(NAN) ) + * $(SV +∞, +∞ ) + * $(SV -∞, $(NAN) ) + * ) + */ +real gamma(real x) +{ +/* Based on code from the CEPHES library. + * CEPHES code Copyright 1994 by Stephen L. Moshier + * + * Arguments |x| <= 13 are reduced by recurrence and the function + * approximated by a rational function of degree 7/8 in the + * interval (2,3). Large arguments are handled by Stirling's + * formula. Large negative arguments are made positive using + * a reflection formula. + */ + + real q, z; + if (isNaN(x)) return x; + if (x == -x.infinity) return real.nan; + if ( fabs(x) > MAXGAMMA ) return real.infinity; + if (x == 0) return 1.0 / x; // +- infinity depending on sign of x, create an exception. + + q = fabs(x); + + if ( q > 13.0L ) + { + // Large arguments are handled by Stirling's + // formula. Large negative arguments are made positive using + // the reflection formula. + + if ( x < 0.0L ) + { + if (x < -1/real.epsilon) + { + // Large negatives lose all precision + return real.nan; + } + int sgngam = 1; // sign of gamma. + long intpart = cast(long)(q); + if (q == intpart) + return real.nan; // poles for all integers <0. + real p = intpart; + if ( (intpart & 1) == 0 ) + sgngam = -1; + z = q - p; + if ( z > 0.5L ) + { + p += 1.0L; + z = q - p; + } + z = q * sin( PI * z ); + z = fabs(z) * gammaStirling(q); + if ( z <= PI/real.max ) return sgngam * real.infinity; + return sgngam * PI/z; + } + else + { + return gammaStirling(x); + } + } + + // Arguments |x| <= 13 are reduced by recurrence and the function + // approximated by a rational function of degree 7/8 in the + // interval (2,3). + + z = 1.0L; + while ( x >= 3.0L ) + { + x -= 1.0L; + z *= x; + } + + while ( x < -0.03125L ) + { + z /= x; + x += 1.0L; + } + + if ( x <= 0.03125L ) + { + if ( x == 0.0L ) + return real.nan; + else + { + if ( x < 0.0L ) + { + x = -x; + return z / (x * poly( x, GammaSmallNegCoeffs )); + } + else + { + return z / (x * poly( x, GammaSmallCoeffs )); + } + } + } + + while ( x < 2.0L ) + { + z /= x; + x += 1.0L; + } + if ( x == 2.0L ) return z; + + x -= 2.0L; + return z * poly( x, GammaNumeratorCoeffs ) / poly( x, GammaDenominatorCoeffs ); +} + +@safe unittest +{ + // gamma(n) = factorial(n-1) if n is an integer. + real fact = 1.0L; + for (int i=1; fact<real.max; ++i) + { + // Require exact equality for small factorials + if (i<14) assert(gamma(i*1.0L) == fact); + assert(feqrel(gamma(i*1.0L), fact) >= real.mant_dig-15); + fact *= (i*1.0L); + } + assert(gamma(0.0) == real.infinity); + assert(gamma(-0.0) == -real.infinity); + assert(isNaN(gamma(-1.0))); + assert(isNaN(gamma(-15.0))); + assert(isIdentical(gamma(NaN(0xABC)), NaN(0xABC))); + assert(gamma(real.infinity) == real.infinity); + assert(gamma(real.max) == real.infinity); + assert(isNaN(gamma(-real.infinity))); + assert(gamma(real.min_normal*real.epsilon) == real.infinity); + assert(gamma(MAXGAMMA)< real.infinity); + assert(gamma(MAXGAMMA*2) == real.infinity); + + // Test some high-precision values (50 decimal digits) + real SQRT_PI = 1.77245385090551602729816748334114518279754945612238L; + + + assert(feqrel(gamma(0.5L), SQRT_PI) >= real.mant_dig-1); + assert(feqrel(gamma(17.25L), 4.224986665692703551570937158682064589938e13L) >= real.mant_dig-4); + + assert(feqrel(gamma(1.0 / 3.0L), 2.67893853470774763365569294097467764412868937795730L) >= real.mant_dig-2); + assert(feqrel(gamma(0.25L), + 3.62560990822190831193068515586767200299516768288006L) >= real.mant_dig-1); + assert(feqrel(gamma(1.0 / 5.0L), + 4.59084371199880305320475827592915200343410999829340L) >= real.mant_dig-1); +} + +/***************************************************** + * Natural logarithm of gamma function. + * + * Returns the base e (2.718...) logarithm of the absolute + * value of the gamma function of the argument. + * + * For reals, logGamma is equivalent to log(fabs(gamma(x))). + * + * $(TABLE_SV + * $(SVH x, logGamma(x) ) + * $(SV $(NAN), $(NAN) ) + * $(SV integer <= 0, +∞ ) + * $(SV ±∞, +∞ ) + * ) + */ +real logGamma(real x) +{ + /* Based on code from the CEPHES library. + * CEPHES code Copyright 1994 by Stephen L. Moshier + * + * For arguments greater than 33, the logarithm of the gamma + * function is approximated by the logarithmic version of + * Stirling's formula using a polynomial approximation of + * degree 4. Arguments between -33 and +33 are reduced by + * recurrence to the interval [2,3] of a rational approximation. + * The cosecant reflection formula is employed for arguments + * less than -33. + */ + real q, w, z, f, nx; + + if (isNaN(x)) return x; + if (fabs(x) == x.infinity) return x.infinity; + + if ( x < -34.0L ) + { + q = -x; + w = logGamma(q); + real p = floor(q); + if ( p == q ) + return real.infinity; + int intpart = cast(int)(p); + real sgngam = 1; + if ( (intpart & 1) == 0 ) + sgngam = -1; + z = q - p; + if ( z > 0.5L ) + { + p += 1.0L; + z = p - q; + } + z = q * sin( PI * z ); + if ( z == 0.0L ) + return sgngam * real.infinity; + /* z = LOGPI - logl( z ) - w; */ + z = log( PI/z ) - w; + return z; + } + + if ( x < 13.0L ) + { + z = 1.0L; + nx = floor( x + 0.5L ); + f = x - nx; + while ( x >= 3.0L ) + { + nx -= 1.0L; + x = nx + f; + z *= x; + } + while ( x < 2.0L ) + { + if ( fabs(x) <= 0.03125 ) + { + if ( x == 0.0L ) + return real.infinity; + if ( x < 0.0L ) + { + x = -x; + q = z / (x * poly( x, GammaSmallNegCoeffs)); + } else + q = z / (x * poly( x, GammaSmallCoeffs)); + return log( fabs(q) ); + } + z /= nx + f; + nx += 1.0L; + x = nx + f; + } + z = fabs(z); + if ( x == 2.0L ) + return log(z); + x = (nx - 2.0L) + f; + real p = x * rationalPoly( x, logGammaNumerator, logGammaDenominator); + return log(z) + p; + } + + // const real MAXLGM = 1.04848146839019521116e+4928L; + // if ( x > MAXLGM ) return sgngaml * real.infinity; + + const real LOGSQRT2PI = 0.91893853320467274178L; // log( sqrt( 2*pi ) ) + + q = ( x - 0.5L ) * log(x) - x + LOGSQRT2PI; + if (x > 1.0e10L) return q; + real p = 1.0L / (x*x); + q += poly( p, logGammaStirlingCoeffs ) / x; + return q ; +} + +@safe unittest +{ + assert(isIdentical(logGamma(NaN(0xDEF)), NaN(0xDEF))); + assert(logGamma(real.infinity) == real.infinity); + assert(logGamma(-1.0) == real.infinity); + assert(logGamma(0.0) == real.infinity); + assert(logGamma(-50.0) == real.infinity); + assert(isIdentical(0.0L, logGamma(1.0L))); + assert(isIdentical(0.0L, logGamma(2.0L))); + assert(logGamma(real.min_normal*real.epsilon) == real.infinity); + assert(logGamma(-real.min_normal*real.epsilon) == real.infinity); + + // x, correct loggamma(x), correct d/dx loggamma(x). + immutable static real[] testpoints = [ + 8.0L, 8.525146484375L + 1.48766904143001655310E-5, 2.01564147795560999654E0L, + 8.99993896484375e-1L, 6.6375732421875e-2L + 5.11505711292524166220E-6L, -7.54938684259372234258E-1, + 7.31597900390625e-1L, 2.2369384765625e-1 + 5.21506341809849792422E-6L, -1.13355566660398608343E0L, + 2.31639862060546875e-1L, 1.3686676025390625L + 1.12609441752996145670E-5L, -4.56670961813812679012E0, + 1.73162841796875L, -8.88214111328125e-2L + 3.36207740803753034508E-6L, 2.33339034686200586920E-1L, + 1.23162841796875L, -9.3902587890625e-2L + 1.28765089229009648104E-5L, -2.49677345775751390414E-1L, + 7.3786976294838206464e19L, 3.301798506038663053312e21L - 1.656137564136932662487046269677E5L, + 4.57477139169563904215E1L, + 1.08420217248550443401E-19L, 4.36682586669921875e1L + 1.37082843669932230418E-5L, + -9.22337203685477580858E18L, + 1.0L, 0.0L, -5.77215664901532860607E-1L, + 2.0L, 0.0L, 4.22784335098467139393E-1L, + -0.5L, 1.2655029296875L + 9.19379714539648894580E-6L, 3.64899739785765205590E-2L, + -1.5L, 8.6004638671875e-1L + 6.28657731014510932682E-7L, 7.03156640645243187226E-1L, + -2.5L, -5.6243896484375E-2L + 1.79986700949327405470E-7, 1.10315664064524318723E0L, + -3.5L, -1.30902099609375L + 1.43111007079536392848E-5L, 1.38887092635952890151E0L + ]; + // TODO: test derivatives as well. + for (int i=0; i<testpoints.length; i+=3) + { + assert( feqrel(logGamma(testpoints[i]), testpoints[i+1]) > real.mant_dig-5); + if (testpoints[i]<MAXGAMMA) + { + assert( feqrel(log(fabs(gamma(testpoints[i]))), testpoints[i+1]) > real.mant_dig-5); + } + } + assert(logGamma(-50.2) == log(fabs(gamma(-50.2)))); + assert(logGamma(-0.008) == log(fabs(gamma(-0.008)))); + assert(feqrel(logGamma(-38.8),log(fabs(gamma(-38.8)))) > real.mant_dig-4); + static if (real.mant_dig >= 64) // incl. 80-bit reals + assert(feqrel(logGamma(1500.0L),log(gamma(1500.0L))) > real.mant_dig-2); + else static if (real.mant_dig >= 53) // incl. 64-bit reals + assert(feqrel(logGamma(150.0L),log(gamma(150.0L))) > real.mant_dig-2); +} + + +private { +/* + * These value can be calculated like this: + * 1) Get exact real.max/min_normal/epsilon from compiler: + * writefln!"%a"(real.max/min_normal_epsilon) + * 2) Convert for Wolfram Alpha + * 0xf.fffffffffffffffp+16380 ==> (f.fffffffffffffff base 16) * 2^16380 + * 3) Calculate result on wofram alpha: + * http://www.wolframalpha.com/input/?i=ln((1.ffffffffffffffffffffffffffff+base+16)+*+2%5E16383)+in+base+2 + * 4) Convert to proper format: + * string mantissa = "1.011..."; + * write(mantissa[0 .. 2]); mantissa = mantissa[2 .. $]; + * for (size_t i = 0; i < mantissa.length/4; i++) + * { + * writef!"%x"(to!ubyte(mantissa[0 .. 4], 2)); mantissa = mantissa[4 .. $]; + * } + */ +static if (floatTraits!(real).realFormat == RealFormat.ieeeQuadruple) +{ + enum real MAXLOG = 0x1.62e42fefa39ef35793c7673007e6p+13; // log(real.max) + enum real MINLOG = -0x1.6546282207802c89d24d65e96274p+13; // log(real.min_normal*real.epsilon) = log(smallest denormal) +} +else static if (floatTraits!(real).realFormat == RealFormat.ieeeExtended) +{ + enum real MAXLOG = 0x1.62e42fefa39ef358p+13L; // log(real.max) + enum real MINLOG = -0x1.6436716d5406e6d8p+13L; // log(real.min_normal*real.epsilon) = log(smallest denormal) +} +else static if (floatTraits!(real).realFormat == RealFormat.ieeeDouble) +{ + enum real MAXLOG = 0x1.62e42fefa39efp+9L; // log(real.max) + enum real MINLOG = -0x1.74385446d71c3p+9L; // log(real.min_normal*real.epsilon) = log(smallest denormal) +} +else + static assert(0, "missing MAXLOG and MINLOG for other real types"); + +enum real BETA_BIG = 9.223372036854775808e18L; +enum real BETA_BIGINV = 1.084202172485504434007e-19L; +} + +/** Incomplete beta integral + * + * Returns incomplete beta integral of the arguments, evaluated + * from zero to x. The regularized incomplete beta function is defined as + * + * betaIncomplete(a, b, x) = Γ(a+b)/(Γ(a) Γ(b)) * + * $(INTEGRATE 0, x) $(POWER t, a-1)$(POWER (1-t),b-1) dt + * + * and is the same as the the cumulative distribution function. + * + * The domain of definition is 0 <= x <= 1. In this + * implementation a and b are restricted to positive values. + * The integral from x to 1 may be obtained by the symmetry + * relation + * + * betaIncompleteCompl(a, b, x ) = betaIncomplete( b, a, 1-x ) + * + * The integral is evaluated by a continued fraction expansion + * or, when b*x is small, by a power series. + */ +real betaIncomplete(real aa, real bb, real xx ) +{ + if ( !(aa>0 && bb>0) ) + { + if ( isNaN(aa) ) return aa; + if ( isNaN(bb) ) return bb; + return real.nan; // domain error + } + if (!(xx>0 && xx<1.0)) + { + if (isNaN(xx)) return xx; + if ( xx == 0.0L ) return 0.0; + if ( xx == 1.0L ) return 1.0; + return real.nan; // domain error + } + if ( (bb * xx) <= 1.0L && xx <= 0.95L) + { + return betaDistPowerSeries(aa, bb, xx); + } + real x; + real xc; // = 1 - x + + real a, b; + int flag = 0; + + /* Reverse a and b if x is greater than the mean. */ + if ( xx > (aa/(aa+bb)) ) + { + // here x > aa/(aa+bb) and (bb*x>1 or x>0.95) + flag = 1; + a = bb; + b = aa; + xc = xx; + x = 1.0L - xx; + } + else + { + a = aa; + b = bb; + xc = 1.0L - xx; + x = xx; + } + + if ( flag == 1 && (b * x) <= 1.0L && x <= 0.95L) + { + // here xx > aa/(aa+bb) and ((bb*xx>1) or xx>0.95) and (aa*(1-xx)<=1) and xx > 0.05 + return 1.0 - betaDistPowerSeries(a, b, x); // note loss of precision + } + + real w; + // Choose expansion for optimal convergence + // One is for x * (a+b+2) < (a+1), + // the other is for x * (a+b+2) > (a+1). + real y = x * (a+b-2.0L) - (a-1.0L); + if ( y < 0.0L ) + { + w = betaDistExpansion1( a, b, x ); + } + else + { + w = betaDistExpansion2( a, b, x ) / xc; + } + + /* Multiply w by the factor + a b + x (1-x) Gamma(a+b) / ( a Gamma(a) Gamma(b) ) . */ + + y = a * log(x); + real t = b * log(xc); + if ( (a+b) < MAXGAMMA && fabs(y) < MAXLOG && fabs(t) < MAXLOG ) + { + t = pow(xc,b); + t *= pow(x,a); + t /= a; + t *= w; + t *= gamma(a+b) / (gamma(a) * gamma(b)); + } + else + { + /* Resort to logarithms. */ + y += t + logGamma(a+b) - logGamma(a) - logGamma(b); + y += log(w/a); + + t = exp(y); +/+ + // There seems to be a bug in Cephes at this point. + // Problems occur for y > MAXLOG, not y < MINLOG. + if ( y < MINLOG ) + { + t = 0.0L; + } + else + { + t = exp(y); + } ++/ + } + if ( flag == 1 ) + { +/+ // CEPHES includes this code, but I think it is erroneous. + if ( t <= real.epsilon ) + { + t = 1.0L - real.epsilon; + } else ++/ + t = 1.0L - t; + } + return t; +} + +/** Inverse of incomplete beta integral + * + * Given y, the function finds x such that + * + * betaIncomplete(a, b, x) == y + * + * Newton iterations or interval halving is used. + */ +real betaIncompleteInv(real aa, real bb, real yy0 ) +{ + real a, b, y0, d, y, x, x0, x1, lgm, yp, di, dithresh, yl, yh, xt; + int i, rflg, dir, nflg; + + if (isNaN(yy0)) return yy0; + if (isNaN(aa)) return aa; + if (isNaN(bb)) return bb; + if ( yy0 <= 0.0L ) + return 0.0L; + if ( yy0 >= 1.0L ) + return 1.0L; + x0 = 0.0L; + yl = 0.0L; + x1 = 1.0L; + yh = 1.0L; + if ( aa <= 1.0L || bb <= 1.0L ) + { + dithresh = 1.0e-7L; + rflg = 0; + a = aa; + b = bb; + y0 = yy0; + x = a/(a+b); + y = betaIncomplete( a, b, x ); + nflg = 0; + goto ihalve; + } + else + { + nflg = 0; + dithresh = 1.0e-4L; + } + + // approximation to inverse function + + yp = -normalDistributionInvImpl( yy0 ); + + if ( yy0 > 0.5L ) + { + rflg = 1; + a = bb; + b = aa; + y0 = 1.0L - yy0; + yp = -yp; + } + else + { + rflg = 0; + a = aa; + b = bb; + y0 = yy0; + } + + lgm = (yp * yp - 3.0L)/6.0L; + x = 2.0L/( 1.0L/(2.0L * a-1.0L) + 1.0L/(2.0L * b - 1.0L) ); + d = yp * sqrt( x + lgm ) / x + - ( 1.0L/(2.0L * b - 1.0L) - 1.0L/(2.0L * a - 1.0L) ) + * (lgm + (5.0L/6.0L) - 2.0L/(3.0L * x)); + d = 2.0L * d; + if ( d < MINLOG ) + { + x = 1.0L; + goto under; + } + x = a/( a + b * exp(d) ); + y = betaIncomplete( a, b, x ); + yp = (y - y0)/y0; + if ( fabs(yp) < 0.2 ) + goto newt; + + /* Resort to interval halving if not close enough. */ +ihalve: + + dir = 0; + di = 0.5L; + for ( i=0; i<400; i++ ) + { + if ( i != 0 ) + { + x = x0 + di * (x1 - x0); + if ( x == 1.0L ) + { + x = 1.0L - real.epsilon; + } + if ( x == 0.0L ) + { + di = 0.5; + x = x0 + di * (x1 - x0); + if ( x == 0.0 ) + goto under; + } + y = betaIncomplete( a, b, x ); + yp = (x1 - x0)/(x1 + x0); + if ( fabs(yp) < dithresh ) + goto newt; + yp = (y-y0)/y0; + if ( fabs(yp) < dithresh ) + goto newt; + } + if ( y < y0 ) + { + x0 = x; + yl = y; + if ( dir < 0 ) + { + dir = 0; + di = 0.5L; + } else if ( dir > 3 ) + di = 1.0L - (1.0L - di) * (1.0L - di); + else if ( dir > 1 ) + di = 0.5L * di + 0.5L; + else + di = (y0 - y)/(yh - yl); + dir += 1; + if ( x0 > 0.95L ) + { + if ( rflg == 1 ) + { + rflg = 0; + a = aa; + b = bb; + y0 = yy0; + } + else + { + rflg = 1; + a = bb; + b = aa; + y0 = 1.0 - yy0; + } + x = 1.0L - x; + y = betaIncomplete( a, b, x ); + x0 = 0.0; + yl = 0.0; + x1 = 1.0; + yh = 1.0; + goto ihalve; + } + } + else + { + x1 = x; + if ( rflg == 1 && x1 < real.epsilon ) + { + x = 0.0L; + goto done; + } + yh = y; + if ( dir > 0 ) + { + dir = 0; + di = 0.5L; + } + else if ( dir < -3 ) + di = di * di; + else if ( dir < -1 ) + di = 0.5L * di; + else + di = (y - y0)/(yh - yl); + dir -= 1; + } + } + if ( x0 >= 1.0L ) + { + // partial loss of precision + x = 1.0L - real.epsilon; + goto done; + } + if ( x <= 0.0L ) + { +under: + // underflow has occurred + x = real.min_normal * real.min_normal; + goto done; + } + +newt: + + if ( nflg ) + { + goto done; + } + nflg = 1; + lgm = logGamma(a+b) - logGamma(a) - logGamma(b); + + for ( i=0; i<15; i++ ) + { + /* Compute the function at this point. */ + if ( i != 0 ) + y = betaIncomplete(a,b,x); + if ( y < yl ) + { + x = x0; + y = yl; + } + else if ( y > yh ) + { + x = x1; + y = yh; + } + else if ( y < y0 ) + { + x0 = x; + yl = y; + } + else + { + x1 = x; + yh = y; + } + if ( x == 1.0L || x == 0.0L ) + break; + /* Compute the derivative of the function at this point. */ + d = (a - 1.0L) * log(x) + (b - 1.0L) * log(1.0L - x) + lgm; + if ( d < MINLOG ) + { + goto done; + } + if ( d > MAXLOG ) + { + break; + } + d = exp(d); + /* Compute the step to the next approximation of x. */ + d = (y - y0)/d; + xt = x - d; + if ( xt <= x0 ) + { + y = (x - x0) / (x1 - x0); + xt = x0 + 0.5L * y * (x - x0); + if ( xt <= 0.0L ) + break; + } + if ( xt >= x1 ) + { + y = (x1 - x) / (x1 - x0); + xt = x1 - 0.5L * y * (x1 - x); + if ( xt >= 1.0L ) + break; + } + x = xt; + if ( fabs(d/x) < (128.0L * real.epsilon) ) + goto done; + } + /* Did not converge. */ + dithresh = 256.0L * real.epsilon; + goto ihalve; + +done: + if ( rflg ) + { + if ( x <= real.epsilon ) + x = 1.0L - real.epsilon; + else + x = 1.0L - x; + } + return x; +} + +@safe unittest { // also tested by the normal distribution + // check NaN propagation + assert(isIdentical(betaIncomplete(NaN(0xABC),2,3), NaN(0xABC))); + assert(isIdentical(betaIncomplete(7,NaN(0xABC),3), NaN(0xABC))); + assert(isIdentical(betaIncomplete(7,15,NaN(0xABC)), NaN(0xABC))); + assert(isIdentical(betaIncompleteInv(NaN(0xABC),1,17), NaN(0xABC))); + assert(isIdentical(betaIncompleteInv(2,NaN(0xABC),8), NaN(0xABC))); + assert(isIdentical(betaIncompleteInv(2,3, NaN(0xABC)), NaN(0xABC))); + + assert(isNaN(betaIncomplete(-1, 2, 3))); + + assert(betaIncomplete(1, 2, 0)==0); + assert(betaIncomplete(1, 2, 1)==1); + assert(isNaN(betaIncomplete(1, 2, 3))); + assert(betaIncompleteInv(1, 1, 0)==0); + assert(betaIncompleteInv(1, 1, 1)==1); + + // Test against Mathematica betaRegularized[z,a,b] + // These arbitrary points are chosen to give good code coverage. + assert(feqrel(betaIncomplete(8, 10, 0.2), 0.010_934_315_234_099_2L) >= real.mant_dig - 5); + assert(feqrel(betaIncomplete(2, 2.5, 0.9), 0.989_722_597_604_452_767_171_003_59L) >= real.mant_dig - 1); + static if (real.mant_dig >= 64) // incl. 80-bit reals + assert(feqrel(betaIncomplete(1000, 800, 0.5), 1.179140859734704555102808541457164E-06L) >= real.mant_dig - 13); + else + assert(feqrel(betaIncomplete(1000, 800, 0.5), 1.179140859734704555102808541457164E-06L) >= real.mant_dig - 14); + assert(feqrel(betaIncomplete(0.0001, 10000, 0.0001), 0.999978059362107134278786L) >= real.mant_dig - 18); + assert(betaIncomplete(0.01, 327726.7, 0.545113) == 1.0); + assert(feqrel(betaIncompleteInv(8, 10, 0.010_934_315_234_099_2L), 0.2L) >= real.mant_dig - 2); + assert(feqrel(betaIncomplete(0.01, 498.437, 0.0121433), 0.99999664562033077636065L) >= real.mant_dig - 1); + assert(feqrel(betaIncompleteInv(5, 10, 0.2000002972865658842), 0.229121208190918L) >= real.mant_dig - 3); + assert(feqrel(betaIncompleteInv(4, 7, 0.8000002209179505L), 0.483657360076904L) >= real.mant_dig - 3); + + // Coverage tests. I don't have correct values for these tests, but + // these values cover most of the code, so they are useful for + // regression testing. + // Extensive testing failed to increase the coverage. It seems likely that about + // half the code in this function is unnecessary; there is potential for + // significant improvement over the original CEPHES code. + static if (real.mant_dig == 64) // 80-bit reals + { + assert(betaIncompleteInv(0.01, 8e-48, 5.45464e-20) == 1-real.epsilon); + assert(betaIncompleteInv(0.01, 8e-48, 9e-26) == 1-real.epsilon); + + // Beware: a one-bit change in pow() changes almost all digits in the result! + assert(feqrel( + betaIncompleteInv(0x1.b3d151fbba0eb18p+1, 1.2265e-19, 2.44859e-18), + 0x1.c0110c8531d0952cp-1L + ) > 10); + // This next case uncovered a one-bit difference in the FYL2X instruction + // between Intel and AMD processors. This difference gets magnified by 2^^38. + // WolframAlpha crashes attempting to calculate this. + assert(feqrel(betaIncompleteInv(0x1.ff1275ae5b939bcap-41, 4.6713e18, 0.0813601), + 0x1.f97749d90c7adba8p-63L) >= real.mant_dig - 39); + real a1 = 3.40483; + assert(betaIncompleteInv(a1, 4.0640301659679627772e19L, 0.545113) == 0x1.ba8c08108aaf5d14p-109); + real b1 = 2.82847e-25; + assert(feqrel(betaIncompleteInv(0.01, b1, 9e-26), 0x1.549696104490aa9p-830L) >= real.mant_dig-10); + + // --- Problematic cases --- + // This is a situation where the series expansion fails to converge + assert( isNaN(betaIncompleteInv(0.12167, 4.0640301659679627772e19L, 0.0813601))); + // This next result is almost certainly erroneous. + // Mathematica states: "(cannot be determined by current methods)" + assert(betaIncomplete(1.16251e20, 2.18e39, 5.45e-20) == -real.infinity); + // WolframAlpha gives no result for this, though indicates that it approximately 1.0 - 1.3e-9 + assert(1 - betaIncomplete(0.01, 328222, 4.0375e-5) == 0x1.5f62926b4p-30); + } +} + + +private { +// Implementation functions + +// Continued fraction expansion #1 for incomplete beta integral +// Use when x < (a+1)/(a+b+2) +real betaDistExpansion1(real a, real b, real x ) +{ + real xk, pk, pkm1, pkm2, qk, qkm1, qkm2; + real k1, k2, k3, k4, k5, k6, k7, k8; + real r, t, ans; + int n; + + k1 = a; + k2 = a + b; + k3 = a; + k4 = a + 1.0L; + k5 = 1.0L; + k6 = b - 1.0L; + k7 = k4; + k8 = a + 2.0L; + + pkm2 = 0.0L; + qkm2 = 1.0L; + pkm1 = 1.0L; + qkm1 = 1.0L; + ans = 1.0L; + r = 1.0L; + n = 0; + const real thresh = 3.0L * real.epsilon; + do + { + xk = -( x * k1 * k2 )/( k3 * k4 ); + pk = pkm1 + pkm2 * xk; + qk = qkm1 + qkm2 * xk; + pkm2 = pkm1; + pkm1 = pk; + qkm2 = qkm1; + qkm1 = qk; + + xk = ( x * k5 * k6 )/( k7 * k8 ); + pk = pkm1 + pkm2 * xk; + qk = qkm1 + qkm2 * xk; + pkm2 = pkm1; + pkm1 = pk; + qkm2 = qkm1; + qkm1 = qk; + + if ( qk != 0.0L ) + r = pk/qk; + if ( r != 0.0L ) + { + t = fabs( (ans - r)/r ); + ans = r; + } + else + { + t = 1.0L; + } + + if ( t < thresh ) + return ans; + + k1 += 1.0L; + k2 += 1.0L; + k3 += 2.0L; + k4 += 2.0L; + k5 += 1.0L; + k6 -= 1.0L; + k7 += 2.0L; + k8 += 2.0L; + + if ( (fabs(qk) + fabs(pk)) > BETA_BIG ) + { + pkm2 *= BETA_BIGINV; + pkm1 *= BETA_BIGINV; + qkm2 *= BETA_BIGINV; + qkm1 *= BETA_BIGINV; + } + if ( (fabs(qk) < BETA_BIGINV) || (fabs(pk) < BETA_BIGINV) ) + { + pkm2 *= BETA_BIG; + pkm1 *= BETA_BIG; + qkm2 *= BETA_BIG; + qkm1 *= BETA_BIG; + } + } + while ( ++n < 400 ); +// loss of precision has occurred +// mtherr( "incbetl", PLOSS ); + return ans; +} + +// Continued fraction expansion #2 for incomplete beta integral +// Use when x > (a+1)/(a+b+2) +real betaDistExpansion2(real a, real b, real x ) +{ + real xk, pk, pkm1, pkm2, qk, qkm1, qkm2; + real k1, k2, k3, k4, k5, k6, k7, k8; + real r, t, ans, z; + + k1 = a; + k2 = b - 1.0L; + k3 = a; + k4 = a + 1.0L; + k5 = 1.0L; + k6 = a + b; + k7 = a + 1.0L; + k8 = a + 2.0L; + + pkm2 = 0.0L; + qkm2 = 1.0L; + pkm1 = 1.0L; + qkm1 = 1.0L; + z = x / (1.0L-x); + ans = 1.0L; + r = 1.0L; + int n = 0; + const real thresh = 3.0L * real.epsilon; + do + { + xk = -( z * k1 * k2 )/( k3 * k4 ); + pk = pkm1 + pkm2 * xk; + qk = qkm1 + qkm2 * xk; + pkm2 = pkm1; + pkm1 = pk; + qkm2 = qkm1; + qkm1 = qk; + + xk = ( z * k5 * k6 )/( k7 * k8 ); + pk = pkm1 + pkm2 * xk; + qk = qkm1 + qkm2 * xk; + pkm2 = pkm1; + pkm1 = pk; + qkm2 = qkm1; + qkm1 = qk; + + if ( qk != 0.0L ) + r = pk/qk; + if ( r != 0.0L ) + { + t = fabs( (ans - r)/r ); + ans = r; + } else + t = 1.0L; + + if ( t < thresh ) + return ans; + k1 += 1.0L; + k2 -= 1.0L; + k3 += 2.0L; + k4 += 2.0L; + k5 += 1.0L; + k6 += 1.0L; + k7 += 2.0L; + k8 += 2.0L; + + if ( (fabs(qk) + fabs(pk)) > BETA_BIG ) + { + pkm2 *= BETA_BIGINV; + pkm1 *= BETA_BIGINV; + qkm2 *= BETA_BIGINV; + qkm1 *= BETA_BIGINV; + } + if ( (fabs(qk) < BETA_BIGINV) || (fabs(pk) < BETA_BIGINV) ) + { + pkm2 *= BETA_BIG; + pkm1 *= BETA_BIG; + qkm2 *= BETA_BIG; + qkm1 *= BETA_BIG; + } + } while ( ++n < 400 ); +// loss of precision has occurred +//mtherr( "incbetl", PLOSS ); + return ans; +} + +/* Power series for incomplete gamma integral. + Use when b*x is small. */ +real betaDistPowerSeries(real a, real b, real x ) +{ + real ai = 1.0L / a; + real u = (1.0L - b) * x; + real v = u / (a + 1.0L); + real t1 = v; + real t = u; + real n = 2.0L; + real s = 0.0L; + real z = real.epsilon * ai; + while ( fabs(v) > z ) + { + u = (n - b) * x / n; + t *= u; + v = t / (a + n); + s += v; + n += 1.0L; + } + s += t1; + s += ai; + + u = a * log(x); + if ( (a+b) < MAXGAMMA && fabs(u) < MAXLOG ) + { + t = gamma(a+b)/(gamma(a)*gamma(b)); + s = s * t * pow(x,a); + } + else + { + t = logGamma(a+b) - logGamma(a) - logGamma(b) + u + log(s); + + if ( t < MINLOG ) + { + s = 0.0L; + } else + s = exp(t); + } + return s; +} + +} + +/*************************************** + * Incomplete gamma integral and its complement + * + * These functions are defined by + * + * gammaIncomplete = ( $(INTEGRATE 0, x) $(POWER e, -t) $(POWER t, a-1) dt )/ $(GAMMA)(a) + * + * gammaIncompleteCompl(a,x) = 1 - gammaIncomplete(a,x) + * = ($(INTEGRATE x, ∞) $(POWER e, -t) $(POWER t, a-1) dt )/ $(GAMMA)(a) + * + * In this implementation both arguments must be positive. + * The integral is evaluated by either a power series or + * continued fraction expansion, depending on the relative + * values of a and x. + */ +real gammaIncomplete(real a, real x ) +in { + assert(x >= 0); + assert(a > 0); +} +body { + /* left tail of incomplete gamma function: + * + * inf. k + * a -x - x + * x e > ---------- + * - - + * k=0 | (a+k+1) + * + */ + if (x == 0) + return 0.0L; + + if ( (x > 1.0L) && (x > a ) ) + return 1.0L - gammaIncompleteCompl(a,x); + + real ax = a * log(x) - x - logGamma(a); +/+ + if ( ax < MINLOGL ) return 0; // underflow + // { mtherr( "igaml", UNDERFLOW ); return( 0.0L ); } ++/ + ax = exp(ax); + + /* power series */ + real r = a; + real c = 1.0L; + real ans = 1.0L; + + do + { + r += 1.0L; + c *= x/r; + ans += c; + } while ( c/ans > real.epsilon ); + + return ans * ax/a; +} + +/** ditto */ +real gammaIncompleteCompl(real a, real x ) +in { + assert(x >= 0); + assert(a > 0); +} +body { + if (x == 0) + return 1.0L; + if ( (x < 1.0L) || (x < a) ) + return 1.0L - gammaIncomplete(a,x); + + // DAC (Cephes bug fix): This is necessary to avoid + // spurious nans, eg + // log(x)-x = NaN when x = real.infinity + const real MAXLOGL = 1.1356523406294143949492E4L; + if (x > MAXLOGL) + return igammaTemmeLarge(a, x); + + real ax = a * log(x) - x - logGamma(a); +//const real MINLOGL = -1.1355137111933024058873E4L; +// if ( ax < MINLOGL ) return 0; // underflow; + ax = exp(ax); + + + /* continued fraction */ + real y = 1.0L - a; + real z = x + y + 1.0L; + real c = 0.0L; + + real pk, qk, t; + + real pkm2 = 1.0L; + real qkm2 = x; + real pkm1 = x + 1.0L; + real qkm1 = z * x; + real ans = pkm1/qkm1; + + do + { + c += 1.0L; + y += 1.0L; + z += 2.0L; + real yc = y * c; + pk = pkm1 * z - pkm2 * yc; + qk = qkm1 * z - qkm2 * yc; + if ( qk != 0.0L ) + { + real r = pk/qk; + t = fabs( (ans - r)/r ); + ans = r; + } + else + { + t = 1.0L; + } + pkm2 = pkm1; + pkm1 = pk; + qkm2 = qkm1; + qkm1 = qk; + + const real BIG = 9.223372036854775808e18L; + + if ( fabs(pk) > BIG ) + { + pkm2 /= BIG; + pkm1 /= BIG; + qkm2 /= BIG; + qkm1 /= BIG; + } + } while ( t > real.epsilon ); + + return ans * ax; +} + +/** Inverse of complemented incomplete gamma integral + * + * Given a and p, the function finds x such that + * + * gammaIncompleteCompl( a, x ) = p. + * + * Starting with the approximate value x = a $(POWER t, 3), where + * t = 1 - d - normalDistributionInv(p) sqrt(d), + * and d = 1/9a, + * the routine performs up to 10 Newton iterations to find the + * root of incompleteGammaCompl(a,x) - p = 0. + */ +real gammaIncompleteComplInv(real a, real p) +in { + assert(p >= 0 && p <= 1); + assert(a>0); +} +body { + if (p == 0) return real.infinity; + + real y0 = p; + const real MAXLOGL = 1.1356523406294143949492E4L; + real x0, x1, x, yl, yh, y, d, lgm, dithresh; + int i, dir; + + /* bound the solution */ + x0 = real.max; + yl = 0.0L; + x1 = 0.0L; + yh = 1.0L; + dithresh = 4.0 * real.epsilon; + + /* approximation to inverse function */ + d = 1.0L/(9.0L*a); + y = 1.0L - d - normalDistributionInvImpl(y0) * sqrt(d); + x = a * y * y * y; + + lgm = logGamma(a); + + for ( i=0; i<10; i++ ) + { + if ( x > x0 || x < x1 ) + goto ihalve; + y = gammaIncompleteCompl(a,x); + if ( y < yl || y > yh ) + goto ihalve; + if ( y < y0 ) + { + x0 = x; + yl = y; + } + else + { + x1 = x; + yh = y; + } + /* compute the derivative of the function at this point */ + d = (a - 1.0L) * log(x0) - x0 - lgm; + if ( d < -MAXLOGL ) + goto ihalve; + d = -exp(d); + /* compute the step to the next approximation of x */ + d = (y - y0)/d; + x = x - d; + if ( i < 3 ) continue; + if ( fabs(d/x) < dithresh ) return x; + } + + /* Resort to interval halving if Newton iteration did not converge. */ +ihalve: + d = 0.0625L; + if ( x0 == real.max ) + { + if ( x <= 0.0L ) + x = 1.0L; + while ( x0 == real.max ) + { + x = (1.0L + d) * x; + y = gammaIncompleteCompl( a, x ); + if ( y < y0 ) + { + x0 = x; + yl = y; + break; + } + d = d + d; + } + } + d = 0.5L; + dir = 0; + + for ( i=0; i<400; i++ ) + { + x = x1 + d * (x0 - x1); + y = gammaIncompleteCompl( a, x ); + lgm = (x0 - x1)/(x1 + x0); + if ( fabs(lgm) < dithresh ) + break; + lgm = (y - y0)/y0; + if ( fabs(lgm) < dithresh ) + break; + if ( x <= 0.0L ) + break; + if ( y > y0 ) + { + x1 = x; + yh = y; + if ( dir < 0 ) + { + dir = 0; + d = 0.5L; + } else if ( dir > 1 ) + d = 0.5L * d + 0.5L; + else + d = (y0 - yl)/(yh - yl); + dir += 1; + } + else + { + x0 = x; + yl = y; + if ( dir > 0 ) + { + dir = 0; + d = 0.5L; + } else if ( dir < -1 ) + d = 0.5L * d; + else + d = (y0 - yl)/(yh - yl); + dir -= 1; + } + } + /+ + if ( x == 0.0L ) + mtherr( "igamil", UNDERFLOW ); + +/ + return x; +} + +@safe unittest +{ +//Values from Excel's GammaInv(1-p, x, 1) +assert(fabs(gammaIncompleteComplInv(1, 0.5) - 0.693147188044814) < 0.00000005); +assert(fabs(gammaIncompleteComplInv(12, 0.99) - 5.42818075054289) < 0.00000005); +assert(fabs(gammaIncompleteComplInv(100, 0.8) - 91.5013985848288L) < 0.000005); +assert(gammaIncomplete(1, 0)==0); +assert(gammaIncompleteCompl(1, 0)==1); +assert(gammaIncomplete(4545, real.infinity)==1); + +// Values from Excel's (1-GammaDist(x, alpha, 1, TRUE)) + +assert(fabs(1.0L-gammaIncompleteCompl(0.5, 2) - 0.954499729507309L) < 0.00000005); +assert(fabs(gammaIncomplete(0.5, 2) - 0.954499729507309L) < 0.00000005); +// Fixed Cephes bug: +assert(gammaIncompleteCompl(384, real.infinity)==0); +assert(gammaIncompleteComplInv(3, 0)==real.infinity); +// Fixed a bug that caused gammaIncompleteCompl to return a wrong value when +// x was larger than a, but not by much, and both were large: +// The value is from WolframAlpha (Gamma[100000, 100001, inf] / Gamma[100000]) +static if (real.mant_dig >= 64) // incl. 80-bit reals + assert(fabs(gammaIncompleteCompl(100000, 100001) - 0.49831792109) < 0.000000000005); +else + assert(fabs(gammaIncompleteCompl(100000, 100001) - 0.49831792109) < 0.00000005); +} + + +// DAC: These values are Bn / n for n=2,4,6,8,10,12,14. +immutable real [7] Bn_n = [ + 1.0L/(6*2), -1.0L/(30*4), 1.0L/(42*6), -1.0L/(30*8), + 5.0L/(66*10), -691.0L/(2730*12), 7.0L/(6*14) ]; + +/** Digamma function +* +* The digamma function is the logarithmic derivative of the gamma function. +* +* digamma(x) = d/dx logGamma(x) +* +* References: +* 1. Abramowitz, M., and Stegun, I. A. (1970). +* Handbook of mathematical functions. Dover, New York, +* pages 258-259, equations 6.3.6 and 6.3.18. +*/ +real digamma(real x) +{ + // Based on CEPHES, Stephen L. Moshier. + + real p, q, nz, s, w, y, z; + long i, n; + int negative; + + negative = 0; + nz = 0.0; + + if ( x <= 0.0 ) + { + negative = 1; + q = x; + p = floor(q); + if ( p == q ) + { + return real.nan; // singularity. + } + /* Remove the zeros of tan(PI x) + * by subtracting the nearest integer from x + */ + nz = q - p; + if ( nz != 0.5 ) + { + if ( nz > 0.5 ) + { + p += 1.0; + nz = q - p; + } + nz = PI/tan(PI*nz); + } + else + { + nz = 0.0; + } + x = 1.0 - x; + } + + // check for small positive integer + if ((x <= 13.0) && (x == floor(x)) ) + { + y = 0.0; + n = lrint(x); + // DAC: CEPHES bugfix. Cephes did this in reverse order, which + // created a larger roundoff error. + for (i=n-1; i>0; --i) + { + y+=1.0L/i; + } + y -= EULERGAMMA; + goto done; + } + + s = x; + w = 0.0; + while ( s < 10.0 ) + { + w += 1.0/s; + s += 1.0; + } + + if ( s < 1.0e17 ) + { + z = 1.0/(s * s); + y = z * poly(z, Bn_n); + } else + y = 0.0; + + y = log(s) - 0.5L/s - y - w; + +done: + if ( negative ) + { + y -= nz; + } + return y; +} + +@safe unittest +{ + // Exact values + assert(digamma(1.0)== -EULERGAMMA); + assert(feqrel(digamma(0.25), -PI/2 - 3* LN2 - EULERGAMMA) >= real.mant_dig-7); + assert(feqrel(digamma(1.0L/6), -PI/2 *sqrt(3.0L) - 2* LN2 -1.5*log(3.0L) - EULERGAMMA) >= real.mant_dig-7); + assert(digamma(-5.0).isNaN()); + assert(feqrel(digamma(2.5), -EULERGAMMA - 2*LN2 + 2.0 + 2.0L/3) >= real.mant_dig-9); + assert(isIdentical(digamma(NaN(0xABC)), NaN(0xABC))); + + for (int k=1; k<40; ++k) + { + real y=0; + for (int u=k; u >= 1; --u) + { + y += 1.0L/u; + } + assert(feqrel(digamma(k+1.0), -EULERGAMMA + y) >= real.mant_dig-2); + } +} + +/** Log Minus Digamma function +* +* logmdigamma(x) = log(x) - digamma(x) +* +* References: +* 1. Abramowitz, M., and Stegun, I. A. (1970). +* Handbook of mathematical functions. Dover, New York, +* pages 258-259, equations 6.3.6 and 6.3.18. +*/ +real logmdigamma(real x) +{ + if (x <= 0.0) + { + if (x == 0.0) + { + return real.infinity; + } + return real.nan; + } + + real s = x; + real w = 0.0; + while ( s < 10.0 ) + { + w += 1.0/s; + s += 1.0; + } + + real y; + if ( s < 1.0e17 ) + { + immutable real z = 1.0/(s * s); + y = z * poly(z, Bn_n); + } else + y = 0.0; + + return x == s ? y + 0.5L/s : (log(x/s) + 0.5L/s + y + w); +} + +@safe unittest +{ + assert(logmdigamma(-5.0).isNaN()); + assert(isIdentical(logmdigamma(NaN(0xABC)), NaN(0xABC))); + assert(logmdigamma(0.0) == real.infinity); + for (auto x = 0.01; x < 1.0; x += 0.1) + assert(approxEqual(digamma(x), log(x) - logmdigamma(x))); + for (auto x = 1.0; x < 15.0; x += 1.0) + assert(approxEqual(digamma(x), log(x) - logmdigamma(x))); +} + +/** Inverse of the Log Minus Digamma function + * + * Returns x such $(D log(x) - digamma(x) == y). + * + * References: + * 1. Abramowitz, M., and Stegun, I. A. (1970). + * Handbook of mathematical functions. Dover, New York, + * pages 258-259, equation 6.3.18. + * + * Authors: Ilya Yaroshenko + */ +real logmdigammaInverse(real y) +{ + import std.numeric : findRoot; + // FIXME: should be returned back to enum. + // Fix requires CTFEable `log` on non-x86 targets (check both LDC and GDC). + immutable maxY = logmdigamma(real.min_normal); + assert(maxY > 0 && maxY <= real.max); + + if (y >= maxY) + { + //lim x->0 (log(x)-digamma(x))*x == 1 + return 1 / y; + } + if (y < 0) + { + return real.nan; + } + if (y < real.min_normal) + { + //6.3.18 + return 0.5 / y; + } + if (y > 0) + { + // x/2 <= logmdigamma(1 / x) <= x, x > 0 + // calls logmdigamma ~6 times + return 1 / findRoot((real x) => logmdigamma(1 / x) - y, y, 2*y); + } + return y; //NaN +} + +@safe unittest +{ + import std.typecons; + //WolframAlpha, 22.02.2015 + immutable Tuple!(real, real)[5] testData = [ + tuple(1.0L, 0.615556766479594378978099158335549201923L), + tuple(1.0L/8, 4.15937801516894947161054974029150730555L), + tuple(1.0L/1024, 512.166612384991507850643277924243523243L), + tuple(0.000500083333325000003968249801594877323784632117L, 1000.0L), + tuple(1017.644138623741168814449776695062817947092468536L, 1.0L/1024), + ]; + foreach (test; testData) + assert(approxEqual(logmdigammaInverse(test[0]), test[1], 2e-15, 0)); + + assert(approxEqual(logmdigamma(logmdigammaInverse(1)), 1, 1e-15, 0)); + assert(approxEqual(logmdigamma(logmdigammaInverse(real.min_normal)), real.min_normal, 1e-15, 0)); + assert(approxEqual(logmdigamma(logmdigammaInverse(real.max/2)), real.max/2, 1e-15, 0)); + assert(approxEqual(logmdigammaInverse(logmdigamma(1)), 1, 1e-15, 0)); + assert(approxEqual(logmdigammaInverse(logmdigamma(real.min_normal)), real.min_normal, 1e-15, 0)); + assert(approxEqual(logmdigammaInverse(logmdigamma(real.max/2)), real.max/2, 1e-15, 0)); +} |