From 56dfadd185bbcc6af82a59bbf60f042646c2f636 Mon Sep 17 00:00:00 2001 From: Gerald Pfeifer Date: Sun, 29 Dec 2024 21:44:50 +0800 Subject: libstdc++: Delete leftover from Profile Mode removal Commit 544be2beb1fa in 2019 remove Profile Mode and associated docs including the XML version of profile_mode_diagnostics.html. Somehow the latter survived until now. Simply delete it as well. libstdc++-v3: * doc/html/manual/profile_mode_diagnostics.html: Delete. --- .../doc/html/manual/profile_mode_diagnostics.html | 557 --------------------- 1 file changed, 557 deletions(-) delete mode 100644 libstdc++-v3/doc/html/manual/profile_mode_diagnostics.html (limited to 'libstdc++-v3/doc/html') diff --git a/libstdc++-v3/doc/html/manual/profile_mode_diagnostics.html b/libstdc++-v3/doc/html/manual/profile_mode_diagnostics.html deleted file mode 100644 index 2c4140f..0000000 --- a/libstdc++-v3/doc/html/manual/profile_mode_diagnostics.html +++ /dev/null @@ -1,557 +0,0 @@ - -Diagnostics

Diagnostics

- The table below presents all the diagnostics we intend to implement. - Each diagnostic has a corresponding compile time switch - -D_GLIBCXX_PROFILE_<diagnostic>. - Groups of related diagnostics can be turned on with a single switch. - For instance, -D_GLIBCXX_PROFILE_LOCALITY is equivalent to - -D_GLIBCXX_PROFILE_SOFTWARE_PREFETCH - -D_GLIBCXX_PROFILE_RBTREE_LOCALITY. -

- The benefit, cost, expected frequency and accuracy of each diagnostic - was given a grade from 1 to 10, where 10 is highest. - A high benefit means that, if the diagnostic is accurate, the expected - performance improvement is high. - A high cost means that turning this diagnostic on leads to high slowdown. - A high frequency means that we expect this to occur relatively often. - A high accuracy means that the diagnostic is unlikely to be wrong. - These grades are not perfect. They are just meant to guide users with - specific needs or time budgets. -

Table 19.2. Profile Diagnostics


Diagnostic Template

  • Switch: - _GLIBCXX_PROFILE_<diagnostic>. -

  • Goal: What problem will it diagnose? -

  • Fundamentals:. - What is the fundamental reason why this is a problem

  • Sample runtime reduction: - Percentage reduction in execution time. When reduction is more than - a constant factor, describe the reduction rate formula. -

  • Recommendation: - What would the advise look like?

  • To instrument: - What stdlibc++ components need to be instrumented?

  • Analysis: - How do we decide when to issue the advice?

  • Cost model: - How do we measure benefits? Math goes here.

  • Example: -

    -program code
    -...
    -advice sample
    -

    -

Containers

-Switch: - _GLIBCXX_PROFILE_CONTAINERS. -

Hashtable Too Small

  • Switch: - _GLIBCXX_PROFILE_HASHTABLE_TOO_SMALL. -

  • Goal: Detect hashtables with many - rehash operations, small construction size and large destruction size. -

  • Fundamentals: Rehash is very expensive. - Read content, follow chains within bucket, evaluate hash function, place at - new location in different order.

  • Sample runtime reduction: 36%. - Code similar to example below. -

  • Recommendation: - Set initial size to N at construction site S. -

  • To instrument: - unordered_set, unordered_map constructor, destructor, rehash. -

  • Analysis: - For each dynamic instance of unordered_[multi]set|map, - record initial size and call context of the constructor. - Record size increase, if any, after each relevant operation such as insert. - Record the estimated rehash cost.

  • Cost model: - Number of individual rehash operations * cost per rehash.

  • Example: -

    -1 unordered_set<int> us;
    -2 for (int k = 0; k < 1000000; ++k) {
    -3   us.insert(k);
    -4 }
    -
    -foo.cc:1: advice: Changing initial unordered_set size from 10 to 1000000 saves 1025530 rehash operations.
    -

    -

Hashtable Too Large

  • Switch: - _GLIBCXX_PROFILE_HASHTABLE_TOO_LARGE. -

  • Goal: Detect hashtables which are - never filled up because fewer elements than reserved are ever - inserted. -

  • Fundamentals: Save memory, which - is good in itself and may also improve memory reference performance through - fewer cache and TLB misses.

  • Sample runtime reduction: unknown. -

  • Recommendation: - Set initial size to N at construction site S. -

  • To instrument: - unordered_set, unordered_map constructor, destructor, rehash. -

  • Analysis: - For each dynamic instance of unordered_[multi]set|map, - record initial size and call context of the constructor, and correlate it - with its size at destruction time. -

  • Cost model: - Number of iteration operations + memory saved.

  • Example: -

    -1 vector<unordered_set<int>> v(100000, unordered_set<int>(100)) ;
    -2 for (int k = 0; k < 100000; ++k) {
    -3   for (int j = 0; j < 10; ++j) {
    -4     v[k].insert(k + j);
    -5  }
    -6 }
    -
    -foo.cc:1: advice: Changing initial unordered_set size from 100 to 10 saves N
    -bytes of memory and M iteration steps.
    -

    -

Inefficient Hash

  • Switch: - _GLIBCXX_PROFILE_INEFFICIENT_HASH. -

  • Goal: Detect hashtables with polarized - distribution. -

  • Fundamentals: A non-uniform - distribution may lead to long chains, thus possibly increasing complexity - by a factor up to the number of elements. -

  • Sample runtime reduction: factor up - to container size. -

  • Recommendation: Change hash function - for container built at site S. Distribution score = N. Access score = S. - Longest chain = C, in bucket B. -

  • To instrument: - unordered_set, unordered_map constructor, destructor, [], - insert, iterator. -

  • Analysis: - Count the exact number of link traversals. -

  • Cost model: - Total number of links traversed.

  • Example: -

    -class dumb_hash {
    - public:
    -  size_t operator() (int i) const { return 0; }
    -};
    -...
    -  unordered_set<int, dumb_hash> hs;
    -  ...
    -  for (int i = 0; i < COUNT; ++i) {
    -    hs.find(i);
    -  }
    -

    -

Vector Too Small

  • Switch: - _GLIBCXX_PROFILE_VECTOR_TOO_SMALL. -

  • Goal:Detect vectors with many - resize operations, small construction size and large destruction size.. -

  • Fundamentals:Resizing can be expensive. - Copying large amounts of data takes time. Resizing many small vectors may - have allocation overhead and affect locality.

  • Sample runtime reduction:%. -

  • Recommendation: - Set initial size to N at construction site S.

  • To instrument:vector. -

  • Analysis: - For each dynamic instance of vector, - record initial size and call context of the constructor. - Record size increase, if any, after each relevant operation such as - push_back. Record the estimated resize cost. -

  • Cost model: - Total number of words copied * time to copy a word.

  • Example: -

    -1 vector<int> v;
    -2 for (int k = 0; k < 1000000; ++k) {
    -3   v.push_back(k);
    -4 }
    -
    -foo.cc:1: advice: Changing initial vector size from 10 to 1000000 saves
    -copying 4000000 bytes and 20 memory allocations and deallocations.
    -

    -

Vector Too Large

  • Switch: - _GLIBCXX_PROFILE_VECTOR_TOO_LARGE -

  • Goal:Detect vectors which are - never filled up because fewer elements than reserved are ever - inserted. -

  • Fundamentals:Save memory, which - is good in itself and may also improve memory reference performance through - fewer cache and TLB misses.

  • Sample runtime reduction:%. -

  • Recommendation: - Set initial size to N at construction site S.

  • To instrument:vector. -

  • Analysis: - For each dynamic instance of vector, - record initial size and call context of the constructor, and correlate it - with its size at destruction time.

  • Cost model: - Total amount of memory saved.

  • Example: -

    -1 vector<vector<int>> v(100000, vector<int>(100)) ;
    -2 for (int k = 0; k < 100000; ++k) {
    -3   for (int j = 0; j < 10; ++j) {
    -4     v[k].insert(k + j);
    -5  }
    -6 }
    -
    -foo.cc:1: advice: Changing initial vector size from 100 to 10 saves N
    -bytes of memory and may reduce the number of cache and TLB misses.
    -

    -

Vector to Hashtable

  • Switch: - _GLIBCXX_PROFILE_VECTOR_TO_HASHTABLE. -

  • Goal: Detect uses of - vector that can be substituted with unordered_set - to reduce execution time. -

  • Fundamentals: - Linear search in a vector is very expensive, whereas searching in a hashtable - is very quick.

  • Sample runtime reduction:factor up - to container size. -

  • Recommendation:Replace - vector with unordered_set at site S. -

  • To instrument:vector - operations and access methods.

  • Analysis: - For each dynamic instance of vector, - record call context of the constructor. Issue the advice only if the - only methods called on this vector are push_back, - insert and find. -

  • Cost model: - Cost(vector::push_back) + cost(vector::insert) + cost(find, vector) - - cost(unordered_set::insert) + cost(unordered_set::find). -

  • Example: -

    -1  vector<int> v;
    -...
    -2  for (int i = 0; i < 1000; ++i) {
    -3    find(v.begin(), v.end(), i);
    -4  }
    -
    -foo.cc:1: advice: Changing "vector" to "unordered_set" will save about 500,000
    -comparisons.
    -

    -

Hashtable to Vector

  • Switch: - _GLIBCXX_PROFILE_HASHTABLE_TO_VECTOR. -

  • Goal: Detect uses of - unordered_set that can be substituted with vector - to reduce execution time. -

  • Fundamentals: - Hashtable iterator is slower than vector iterator.

  • Sample runtime reduction:95%. -

  • Recommendation:Replace - unordered_set with vector at site S. -

  • To instrument:unordered_set - operations and access methods.

  • Analysis: - For each dynamic instance of unordered_set, - record call context of the constructor. Issue the advice only if the - number of find, insert and [] - operations on this unordered_set are small relative to the - number of elements, and methods begin or end - are invoked (suggesting iteration).

  • Cost model: - Number of .

  • Example: -

    -1  unordered_set<int> us;
    -...
    -2  int s = 0;
    -3  for (unordered_set<int>::iterator it = us.begin(); it != us.end(); ++it) {
    -4    s += *it;
    -5  }
    -
    -foo.cc:1: advice: Changing "unordered_set" to "vector" will save about N
    -indirections and may achieve better data locality.
    -

    -

Vector to List

  • Switch: - _GLIBCXX_PROFILE_VECTOR_TO_LIST. -

  • Goal: Detect cases where - vector could be substituted with list for - better performance. -

  • Fundamentals: - Inserting in the middle of a vector is expensive compared to inserting in a - list. -

  • Sample runtime reduction:factor up to - container size. -

  • Recommendation:Replace vector with list - at site S.

  • To instrument:vector - operations and access methods.

  • Analysis: - For each dynamic instance of vector, - record the call context of the constructor. Record the overhead of each - insert operation based on current size and insert position. - Report instance with high insertion overhead. -

  • Cost model: - (Sum(cost(vector::method)) - Sum(cost(list::method)), for - method in [push_back, insert, erase]) - + (Cost(iterate vector) - Cost(iterate list))

  • Example: -

    -1  vector<int> v;
    -2  for (int i = 0; i < 10000; ++i) {
    -3    v.insert(v.begin(), i);
    -4  }
    -
    -foo.cc:1: advice: Changing "vector" to "list" will save about 5,000,000
    -operations.
    -

    -

List to Vector

  • Switch: - _GLIBCXX_PROFILE_LIST_TO_VECTOR. -

  • Goal: Detect cases where - list could be substituted with vector for - better performance. -

  • Fundamentals: - Iterating through a vector is faster than through a list. -

  • Sample runtime reduction:64%. -

  • Recommendation:Replace list with vector - at site S.

  • To instrument:vector - operations and access methods.

  • Analysis: - Issue the advice if there are no insert operations. -

  • Cost model: - (Sum(cost(vector::method)) - Sum(cost(list::method)), for - method in [push_back, insert, erase]) - + (Cost(iterate vector) - Cost(iterate list))

  • Example: -

    -1  list<int> l;
    -...
    -2  int sum = 0;
    -3  for (list<int>::iterator it = l.begin(); it != l.end(); ++it) {
    -4    sum += *it;
    -5  }
    -
    -foo.cc:1: advice: Changing "list" to "vector" will save about 1000000 indirect
    -memory references.
    -

    -

List to Forward List (Slist)

  • Switch: - _GLIBCXX_PROFILE_LIST_TO_SLIST. -

  • Goal: Detect cases where - list could be substituted with forward_list for - better performance. -

  • Fundamentals: - The memory footprint of a forward_list is smaller than that of a list. - This has beneficial effects on memory subsystem, e.g., fewer cache misses. -

  • Sample runtime reduction:40%. - Note that the reduction is only noticeable if the size of the forward_list - node is in fact larger than that of the list node. For memory allocators - with size classes, you will only notice an effect when the two node sizes - belong to different allocator size classes. -

  • Recommendation:Replace list with - forward_list at site S.

  • To instrument:list - operations and iteration methods.

  • Analysis: - Issue the advice if there are no backwards traversals - or insertion before a given node. -

  • Cost model: - Always true.

  • Example: -

    -1  list<int> l;
    -...
    -2  int sum = 0;
    -3  for (list<int>::iterator it = l.begin(); it != l.end(); ++it) {
    -4    sum += *it;
    -5  }
    -
    -foo.cc:1: advice: Change "list" to "forward_list".
    -

    -

Ordered to Unordered Associative Container

  • Switch: - _GLIBCXX_PROFILE_ORDERED_TO_UNORDERED. -

  • Goal: Detect cases where ordered - associative containers can be replaced with unordered ones. -

  • Fundamentals: - Insert and search are quicker in a hashtable than in - a red-black tree.

  • Sample runtime reduction:52%. -

  • Recommendation: - Replace set with unordered_set at site S.

  • To instrument: - set, multiset, map, - multimap methods.

  • Analysis: - Issue the advice only if we are not using operator ++ on any - iterator on a particular [multi]set|map. -

  • Cost model: - (Sum(cost(hashtable::method)) - Sum(cost(rbtree::method)), for - method in [insert, erase, find]) - + (Cost(iterate hashtable) - Cost(iterate rbtree))

  • Example: -

    -1  set<int> s;
    -2  for (int i = 0; i < 100000; ++i) {
    -3    s.insert(i);
    -4  }
    -5  int sum = 0;
    -6  for (int i = 0; i < 100000; ++i) {
    -7    sum += *s.find(i);
    -8  }
    -

    -

Algorithms

Switch: - _GLIBCXX_PROFILE_ALGORITHMS. -

Sort Algorithm Performance

  • Switch: - _GLIBCXX_PROFILE_SORT. -

  • Goal: Give measure of sort algorithm - performance based on actual input. For instance, advise Radix Sort over - Quick Sort for a particular call context. -

  • Fundamentals: - See papers: - - A framework for adaptive algorithm selection in STAPL and - - Optimizing Sorting with Machine Learning Algorithms. -

  • Sample runtime reduction:60%. -

  • Recommendation: Change sort algorithm - at site S from X Sort to Y Sort.

  • To instrument: sort - algorithm.

  • Analysis: - Issue the advice if the cost model tells us that another sort algorithm - would do better on this input. Requires us to know what algorithm we - are using in our sort implementation in release mode.

  • Cost model: - Runtime(algo) for algo in [radix, quick, merge, ...]

  • Example: -

    -

    -

Data Locality

Switch: - _GLIBCXX_PROFILE_LOCALITY. -

Need Software Prefetch

  • Switch: - _GLIBCXX_PROFILE_SOFTWARE_PREFETCH. -

  • Goal: Discover sequences of indirect - memory accesses that are not regular, thus cannot be predicted by - hardware prefetchers. -

  • Fundamentals: - Indirect references are hard to predict and are very expensive when they - miss in caches.

  • Sample runtime reduction:25%. -

  • Recommendation: Insert prefetch - instruction.

  • To instrument: Vector iterator and - access operator []. -

  • Analysis: - First, get cache line size and page size from system. - Then record iterator dereference sequences for which the value is a pointer. - For each sequence within a container, issue a warning if successive pointer - addresses are not within cache lines and do not form a linear pattern - (otherwise they may be prefetched by hardware). - If they also step across page boundaries, make the warning stronger. -

    The same analysis applies to containers other than vector. - However, we cannot give the same advice for linked structures, such as list, - as there is no random access to the n-th element. The user may still be - able to benefit from this information, for instance by employing frays (user - level light weight threads) to hide the latency of chasing pointers. -

    - This analysis is a little oversimplified. A better cost model could be - created by understanding the capability of the hardware prefetcher. - This model could be trained automatically by running a set of synthetic - cases. -

  • Cost model: - Total distance between pointer values of successive elements in vectors - of pointers.

  • Example: -

    -1 int zero = 0;
    -2 vector<int*> v(10000000, &zero);
    -3 for (int k = 0; k < 10000000; ++k) {
    -4   v[random() % 10000000] = new int(k);
    -5 }
    -6 for (int j = 0; j < 10000000; ++j) {
    -7   count += (*v[j] == 0 ? 0 : 1);
    -8 }
    -
    -foo.cc:7: advice: Insert prefetch instruction.
    -

    -

Linked Structure Locality

  • Switch: - _GLIBCXX_PROFILE_RBTREE_LOCALITY. -

  • Goal: Give measure of locality of - objects stored in linked structures (lists, red-black trees and hashtables) - with respect to their actual traversal patterns. -

  • Fundamentals:Allocation can be tuned - to a specific traversal pattern, to result in better data locality. - See paper: - - Custom Memory Allocation for Free by Jula and Rauchwerger. -

  • Sample runtime reduction:30%. -

  • Recommendation: - High scatter score N for container built at site S. - Consider changing allocation sequence or choosing a structure conscious - allocator.

  • To instrument: Methods of all - containers using linked structures.

  • Analysis: - First, get cache line size and page size from system. - Then record the number of successive elements that are on different line - or page, for each traversal method such as find. Give advice - only if the ratio between this number and the number of total node hops - is above a threshold.

  • Cost model: - Sum(same_cache_line(this,previous))

  • Example: -

    - 1  set<int> s;
    - 2  for (int i = 0; i < 10000000; ++i) {
    - 3    s.insert(i);
    - 4  }
    - 5  set<int> s1, s2;
    - 6  for (int i = 0; i < 10000000; ++i) {
    - 7    s1.insert(i);
    - 8    s2.insert(i);
    - 9  }
    -...
    -      // Fast, better locality.
    -10    for (set<int>::iterator it = s.begin(); it != s.end(); ++it) {
    -11      sum += *it;
    -12    }
    -      // Slow, elements are further apart.
    -13    for (set<int>::iterator it = s1.begin(); it != s1.end(); ++it) {
    -14      sum += *it;
    -15    }
    -
    -foo.cc:5: advice: High scatter score NNN for set built here.  Consider changing
    -the allocation sequence or switching to a structure conscious allocator.
    -

    -

Multithreaded Data Access

- The diagnostics in this group are not meant to be implemented short term. - They require compiler support to know when container elements are written - to. Instrumentation can only tell us when elements are referenced. -

Switch: - _GLIBCXX_PROFILE_MULTITHREADED. -

Data Dependence Violations at Container Level

  • Switch: - _GLIBCXX_PROFILE_DDTEST. -

  • Goal: Detect container elements - that are referenced from multiple threads in the parallel region or - across parallel regions. -

  • Fundamentals: - Sharing data between threads requires communication and perhaps locking, - which may be expensive. -

  • Sample runtime reduction:?%. -

  • Recommendation: Change data - distribution or parallel algorithm.

  • To instrument: Container access methods - and iterators. -

  • Analysis: - Keep a shadow for each container. Record iterator dereferences and - container member accesses. Issue advice for elements referenced by - multiple threads. - See paper: - The LRPD test: speculative run-time parallelization of loops with - privatization and reduction parallelization. -

  • Cost model: - Number of accesses to elements referenced from multiple threads -

  • Example: -

    -

    -

False Sharing

  • Switch: - _GLIBCXX_PROFILE_FALSE_SHARING. -

  • Goal: Detect elements in the - same container which share a cache line, are written by at least one - thread, and accessed by different threads. -

  • Fundamentals: Under these assumptions, - cache protocols require - communication to invalidate lines, which may be expensive. -

  • Sample runtime reduction:68%. -

  • Recommendation: Reorganize container - or use padding to avoid false sharing.

  • To instrument: Container access methods - and iterators. -

  • Analysis: - First, get the cache line size. - For each shared container, record all the associated iterator dereferences - and member access methods with the thread id. Compare the address lists - across threads to detect references in two different threads to the same - cache line. Issue a warning only if the ratio to total references is - significant. Do the same for iterator dereference values if they are - pointers.

  • Cost model: - Number of accesses to same cache line from different threads. -

  • Example: -

    -1     vector<int> v(2, 0);
    -2 #pragma omp parallel for shared(v, SIZE) schedule(static, 1)
    -3     for (i = 0; i < SIZE; ++i) {
    -4       v[i % 2] += i;
    -5     }
    -
    -OMP_NUM_THREADS=2 ./a.out
    -foo.cc:1: advice: Change container structure or padding to avoid false
    -sharing in multithreaded access at foo.cc:4.  Detected N shared cache lines.
    -

    -

Statistics

-Switch: - _GLIBCXX_PROFILE_STATISTICS. -

- In some cases the cost model may not tell us anything because the costs - appear to offset the benefits. Consider the choice between a vector and - a list. When there are both inserts and iteration, an automatic advice - may not be issued. However, the programmer may still be able to make use - of this information in a different way. -

- This diagnostic will not issue any advice, but it will print statistics for - each container construction site. The statistics will contain the cost - of each operation actually performed on the container. -

\ No newline at end of file -- cgit v1.1