Diagnostics

Diagnostics
Prev	Chapter 19. Profile Mode	Next

- The table below presents all the diagnostics we intend to implement. - Each diagnostic has a corresponding compile time switch - -D_GLIBCXX_PROFILE_<diagnostic>. - Groups of related diagnostics can be turned on with a single switch. - For instance, -D_GLIBCXX_PROFILE_LOCALITY is equivalent to - -D_GLIBCXX_PROFILE_SOFTWARE_PREFETCH - -D_GLIBCXX_PROFILE_RBTREE_LOCALITY. -

- The benefit, cost, expected frequency and accuracy of each diagnostic - was given a grade from 1 to 10, where 10 is highest. - A high benefit means that, if the diagnostic is accurate, the expected - performance improvement is high. - A high cost means that turning this diagnostic on leads to high slowdown. - A high frequency means that we expect this to occur relatively often. - A high accuracy means that the diagnostic is unlikely to be wrong. - These grades are not perfect. They are just meant to guide users with - specific needs or time budgets. -

Table 19.2. Profile Diagnostics

Group	Flag	Benefit	Cost	Freq.	Implemented
- CONTAINERS	- HASHTABLE_TOO_SMALL	10	1		10	yes
	- HASHTABLE_TOO_LARGE	5	1		10	yes
	- INEFFICIENT_HASH	7	3		10	yes
	- VECTOR_TOO_SMALL	8	1		10	yes
	- VECTOR_TOO_LARGE	5	1		10	yes
	- VECTOR_TO_HASHTABLE	7	7		10	no
	- HASHTABLE_TO_VECTOR	7	7		10	no
	- VECTOR_TO_LIST	8	5		10	yes
	- LIST_TO_VECTOR	10	5		10	no
	- ORDERED_TO_UNORDERED	10	5		10	only map/unordered_map
- ALGORITHMS	- SORT	7	8		7	no
- LOCALITY	- SOFTWARE_PREFETCH	8	8		5	no
	- RBTREE_LOCALITY	4	8		5	no
	- FALSE_SHARING	8	10		10	no

Diagnostic Template

Switch: - _GLIBCXX_PROFILE_<diagnostic>. -
Goal: What problem will it diagnose? -
Fundamentals:. - What is the fundamental reason why this is a problem
Sample runtime reduction: - Percentage reduction in execution time. When reduction is more than - a constant factor, describe the reduction rate formula. -
Recommendation: - What would the advise look like?
To instrument: - What stdlibc++ components need to be instrumented?
Analysis: - How do we decide when to issue the advice?
Cost model: - How do we measure benefits? Math goes here.
Example: -
```
-program code
-...
-advice sample
-
```
-

Containers

-Switch: - _GLIBCXX_PROFILE_CONTAINERS. -

Hashtable Too Small

Switch: - _GLIBCXX_PROFILE_HASHTABLE_TOO_SMALL. -
Goal: Detect hashtables with many - rehash operations, small construction size and large destruction size. -
Fundamentals: Rehash is very expensive. - Read content, follow chains within bucket, evaluate hash function, place at - new location in different order.
Sample runtime reduction: 36%. - Code similar to example below. -
Recommendation: - Set initial size to N at construction site S. -
To instrument: - unordered_set, unordered_map constructor, destructor, rehash. -
Analysis: - For each dynamic instance of unordered_[multi]set|map, - record initial size and call context of the constructor. - Record size increase, if any, after each relevant operation such as insert. - Record the estimated rehash cost.
Cost model: - Number of individual rehash operations * cost per rehash.

Example: -

-1 unordered_set<int> us;
-2 for (int k = 0; k < 1000000; ++k) {
-3   us.insert(k);
-4 }
-
-foo.cc:1: advice: Changing initial unordered_set size from 10 to 1000000 saves 1025530 rehash operations.
-

Hashtable Too Large

Switch: - _GLIBCXX_PROFILE_HASHTABLE_TOO_LARGE. -
Goal: Detect hashtables which are - never filled up because fewer elements than reserved are ever - inserted. -
Fundamentals: Save memory, which - is good in itself and may also improve memory reference performance through - fewer cache and TLB misses.
Sample runtime reduction: unknown. -
Recommendation: - Set initial size to N at construction site S. -
To instrument: - unordered_set, unordered_map constructor, destructor, rehash. -
Analysis: - For each dynamic instance of unordered_[multi]set|map, - record initial size and call context of the constructor, and correlate it - with its size at destruction time. -
Cost model: - Number of iteration operations + memory saved.

Example: -

-1 vector<unordered_set<int>> v(100000, unordered_set<int>(100)) ;
-2 for (int k = 0; k < 100000; ++k) {
-3   for (int j = 0; j < 10; ++j) {
-4     v[k].insert(k + j);
-5  }
-6 }
-
-foo.cc:1: advice: Changing initial unordered_set size from 100 to 10 saves N
-bytes of memory and M iteration steps.
-

Inefficient Hash

Switch: - _GLIBCXX_PROFILE_INEFFICIENT_HASH. -
Goal: Detect hashtables with polarized - distribution. -
Fundamentals: A non-uniform - distribution may lead to long chains, thus possibly increasing complexity - by a factor up to the number of elements. -
Sample runtime reduction: factor up - to container size. -
Recommendation: Change hash function - for container built at site S. Distribution score = N. Access score = S. - Longest chain = C, in bucket B. -
To instrument: - unordered_set, unordered_map constructor, destructor, [], - insert, iterator. -
Analysis: - Count the exact number of link traversals. -
Cost model: - Total number of links traversed.

Example: -

-class dumb_hash {
- public:
-  size_t operator() (int i) const { return 0; }
-};
-...
-  unordered_set<int, dumb_hash> hs;
-  ...
-  for (int i = 0; i < COUNT; ++i) {
-    hs.find(i);
-  }
-

Vector Too Small

Switch: - _GLIBCXX_PROFILE_VECTOR_TOO_SMALL. -
Goal:Detect vectors with many - resize operations, small construction size and large destruction size.. -
Fundamentals:Resizing can be expensive. - Copying large amounts of data takes time. Resizing many small vectors may - have allocation overhead and affect locality.
Sample runtime reduction:%. -
Recommendation: - Set initial size to N at construction site S.
To instrument:vector. -
Analysis: - For each dynamic instance of vector, - record initial size and call context of the constructor. - Record size increase, if any, after each relevant operation such as - push_back. Record the estimated resize cost. -
Cost model: - Total number of words copied * time to copy a word.

Example: -

-1 vector<int> v;
-2 for (int k = 0; k < 1000000; ++k) {
-3   v.push_back(k);
-4 }
-
-foo.cc:1: advice: Changing initial vector size from 10 to 1000000 saves
-copying 4000000 bytes and 20 memory allocations and deallocations.
-

Vector Too Large

Switch: - _GLIBCXX_PROFILE_VECTOR_TOO_LARGE -
Goal:Detect vectors which are - never filled up because fewer elements than reserved are ever - inserted. -
Fundamentals:Save memory, which - is good in itself and may also improve memory reference performance through - fewer cache and TLB misses.
Sample runtime reduction:%. -
Recommendation: - Set initial size to N at construction site S.
To instrument:vector. -
Analysis: - For each dynamic instance of vector, - record initial size and call context of the constructor, and correlate it - with its size at destruction time.
Cost model: - Total amount of memory saved.

Example: -

-1 vector<vector<int>> v(100000, vector<int>(100)) ;
-2 for (int k = 0; k < 100000; ++k) {
-3   for (int j = 0; j < 10; ++j) {
-4     v[k].insert(k + j);
-5  }
-6 }
-
-foo.cc:1: advice: Changing initial vector size from 100 to 10 saves N
-bytes of memory and may reduce the number of cache and TLB misses.
-

Vector to Hashtable

Switch: - _GLIBCXX_PROFILE_VECTOR_TO_HASHTABLE. -
Goal: Detect uses of - vector that can be substituted with unordered_set - to reduce execution time. -
Fundamentals: - Linear search in a vector is very expensive, whereas searching in a hashtable - is very quick.
Sample runtime reduction:factor up - to container size. -
Recommendation:Replace - vector with unordered_set at site S. -
To instrument:vector - operations and access methods.
Analysis: - For each dynamic instance of vector, - record call context of the constructor. Issue the advice only if the - only methods called on this vector are push_back, - insert and find. -
Cost model: - Cost(vector::push_back) + cost(vector::insert) + cost(find, vector) - - cost(unordered_set::insert) + cost(unordered_set::find). -

Example: -

-1  vector<int> v;
-...
-2  for (int i = 0; i < 1000; ++i) {
-3    find(v.begin(), v.end(), i);
-4  }
-
-foo.cc:1: advice: Changing "vector" to "unordered_set" will save about 500,000
-comparisons.
-

Hashtable to Vector

Switch: - _GLIBCXX_PROFILE_HASHTABLE_TO_VECTOR. -
Goal: Detect uses of - unordered_set that can be substituted with vector - to reduce execution time. -
Fundamentals: - Hashtable iterator is slower than vector iterator.
Sample runtime reduction:95%. -
Recommendation:Replace - unordered_set with vector at site S. -
To instrument:unordered_set - operations and access methods.
Analysis: - For each dynamic instance of unordered_set, - record call context of the constructor. Issue the advice only if the - number of find, insert and [] - operations on this unordered_set are small relative to the - number of elements, and methods begin or end - are invoked (suggesting iteration).
Cost model: - Number of .

Example: -

-1  unordered_set<int> us;
-...
-2  int s = 0;
-3  for (unordered_set<int>::iterator it = us.begin(); it != us.end(); ++it) {
-4    s += *it;
-5  }
-
-foo.cc:1: advice: Changing "unordered_set" to "vector" will save about N
-indirections and may achieve better data locality.
-

Vector to List

Switch: - _GLIBCXX_PROFILE_VECTOR_TO_LIST. -
Goal: Detect cases where - vector could be substituted with list for - better performance. -
Fundamentals: - Inserting in the middle of a vector is expensive compared to inserting in a - list. -
Sample runtime reduction:factor up to - container size. -
Recommendation:Replace vector with list - at site S.
To instrument:vector - operations and access methods.
Analysis: - For each dynamic instance of vector, - record the call context of the constructor. Record the overhead of each - insert operation based on current size and insert position. - Report instance with high insertion overhead. -
Cost model: - (Sum(cost(vector::method)) - Sum(cost(list::method)), for - method in [push_back, insert, erase]) - + (Cost(iterate vector) - Cost(iterate list))

Example: -

-1  vector<int> v;
-2  for (int i = 0; i < 10000; ++i) {
-3    v.insert(v.begin(), i);
-4  }
-
-foo.cc:1: advice: Changing "vector" to "list" will save about 5,000,000
-operations.
-

List to Vector

Switch: - _GLIBCXX_PROFILE_LIST_TO_VECTOR. -
Goal: Detect cases where - list could be substituted with vector for - better performance. -
Fundamentals: - Iterating through a vector is faster than through a list. -
Sample runtime reduction:64%. -
Recommendation:Replace list with vector - at site S.
To instrument:vector - operations and access methods.
Analysis: - Issue the advice if there are no insert operations. -
Cost model: - (Sum(cost(vector::method)) - Sum(cost(list::method)), for - method in [push_back, insert, erase]) - + (Cost(iterate vector) - Cost(iterate list))

Example: -

-1  list<int> l;
-...
-2  int sum = 0;
-3  for (list<int>::iterator it = l.begin(); it != l.end(); ++it) {
-4    sum += *it;
-5  }
-
-foo.cc:1: advice: Changing "list" to "vector" will save about 1000000 indirect
-memory references.
-

List to Forward List (Slist)

Switch: - _GLIBCXX_PROFILE_LIST_TO_SLIST. -
Goal: Detect cases where - list could be substituted with forward_list for - better performance. -
Fundamentals: - The memory footprint of a forward_list is smaller than that of a list. - This has beneficial effects on memory subsystem, e.g., fewer cache misses. -
Sample runtime reduction:40%. - Note that the reduction is only noticeable if the size of the forward_list - node is in fact larger than that of the list node. For memory allocators - with size classes, you will only notice an effect when the two node sizes - belong to different allocator size classes. -
Recommendation:Replace list with - forward_list at site S.
To instrument:list - operations and iteration methods.
Analysis: - Issue the advice if there are no backwards traversals - or insertion before a given node. -
Cost model: - Always true.

Example: -

-1  list<int> l;
-...
-2  int sum = 0;
-3  for (list<int>::iterator it = l.begin(); it != l.end(); ++it) {
-4    sum += *it;
-5  }
-
-foo.cc:1: advice: Change "list" to "forward_list".
-

Ordered to Unordered Associative Container

Switch: - _GLIBCXX_PROFILE_ORDERED_TO_UNORDERED. -
Goal: Detect cases where ordered - associative containers can be replaced with unordered ones. -
Fundamentals: - Insert and search are quicker in a hashtable than in - a red-black tree.
Sample runtime reduction:52%. -
Recommendation: - Replace set with unordered_set at site S.
To instrument: - set, multiset, map, - multimap methods.
Analysis: - Issue the advice only if we are not using operator ++ on any - iterator on a particular [multi]set|map. -
Cost model: - (Sum(cost(hashtable::method)) - Sum(cost(rbtree::method)), for - method in [insert, erase, find]) - + (Cost(iterate hashtable) - Cost(iterate rbtree))

Example: -

-1  set<int> s;
-2  for (int i = 0; i < 100000; ++i) {
-3    s.insert(i);
-4  }
-5  int sum = 0;
-6  for (int i = 0; i < 100000; ++i) {
-7    sum += *s.find(i);
-8  }
-

Algorithms

Switch: - _GLIBCXX_PROFILE_ALGORITHMS. -

Sort Algorithm Performance

Switch: - _GLIBCXX_PROFILE_SORT. -
Goal: Give measure of sort algorithm - performance based on actual input. For instance, advise Radix Sort over - Quick Sort for a particular call context. -
Fundamentals: - See papers: - - A framework for adaptive algorithm selection in STAPL and - - Optimizing Sorting with Machine Learning Algorithms. -
Sample runtime reduction:60%. -
Recommendation: Change sort algorithm - at site S from X Sort to Y Sort.
To instrument: sort - algorithm.
Analysis: - Issue the advice if the cost model tells us that another sort algorithm - would do better on this input. Requires us to know what algorithm we - are using in our sort implementation in release mode.
Cost model: - Runtime(algo) for algo in [radix, quick, merge, ...]
Example: -
```
-
```
-

Data Locality

Switch: - _GLIBCXX_PROFILE_LOCALITY. -

Need Software Prefetch

Switch: - _GLIBCXX_PROFILE_SOFTWARE_PREFETCH. -
Goal: Discover sequences of indirect - memory accesses that are not regular, thus cannot be predicted by - hardware prefetchers. -
Fundamentals: - Indirect references are hard to predict and are very expensive when they - miss in caches.
Sample runtime reduction:25%. -
Recommendation: Insert prefetch - instruction.
To instrument: Vector iterator and - access operator []. -
Analysis: - First, get cache line size and page size from system. - Then record iterator dereference sequences for which the value is a pointer. - For each sequence within a container, issue a warning if successive pointer - addresses are not within cache lines and do not form a linear pattern - (otherwise they may be prefetched by hardware). - If they also step across page boundaries, make the warning stronger. -
The same analysis applies to containers other than vector. - However, we cannot give the same advice for linked structures, such as list, - as there is no random access to the n-th element. The user may still be - able to benefit from this information, for instance by employing frays (user - level light weight threads) to hide the latency of chasing pointers. -
- This analysis is a little oversimplified. A better cost model could be - created by understanding the capability of the hardware prefetcher. - This model could be trained automatically by running a set of synthetic - cases. -
Cost model: - Total distance between pointer values of successive elements in vectors - of pointers.

Example: -

-1 int zero = 0;
-2 vector<int*> v(10000000, &zero);
-3 for (int k = 0; k < 10000000; ++k) {
-4   v[random() % 10000000] = new int(k);
-5 }
-6 for (int j = 0; j < 10000000; ++j) {
-7   count += (*v[j] == 0 ? 0 : 1);
-8 }
-
-foo.cc:7: advice: Insert prefetch instruction.
-

Linked Structure Locality

Switch: - _GLIBCXX_PROFILE_RBTREE_LOCALITY. -
Goal: Give measure of locality of - objects stored in linked structures (lists, red-black trees and hashtables) - with respect to their actual traversal patterns. -
Fundamentals:Allocation can be tuned - to a specific traversal pattern, to result in better data locality. - See paper: - - Custom Memory Allocation for Free by Jula and Rauchwerger. -
Sample runtime reduction:30%. -
Recommendation: - High scatter score N for container built at site S. - Consider changing allocation sequence or choosing a structure conscious - allocator.
To instrument: Methods of all - containers using linked structures.
Analysis: - First, get cache line size and page size from system. - Then record the number of successive elements that are on different line - or page, for each traversal method such as find. Give advice - only if the ratio between this number and the number of total node hops - is above a threshold.
Cost model: - Sum(same_cache_line(this,previous))

Example: -

- 1  set<int> s;
- 2  for (int i = 0; i < 10000000; ++i) {
- 3    s.insert(i);
- 4  }
- 5  set<int> s1, s2;
- 6  for (int i = 0; i < 10000000; ++i) {
- 7    s1.insert(i);
- 8    s2.insert(i);
- 9  }
-...
-      // Fast, better locality.
-10    for (set<int>::iterator it = s.begin(); it != s.end(); ++it) {
-11      sum += *it;
-12    }
-      // Slow, elements are further apart.
-13    for (set<int>::iterator it = s1.begin(); it != s1.end(); ++it) {
-14      sum += *it;
-15    }
-
-foo.cc:5: advice: High scatter score NNN for set built here.  Consider changing
-the allocation sequence or switching to a structure conscious allocator.
-

Multithreaded Data Access

- The diagnostics in this group are not meant to be implemented short term. - They require compiler support to know when container elements are written - to. Instrumentation can only tell us when elements are referenced. -

Switch: - _GLIBCXX_PROFILE_MULTITHREADED. -

Data Dependence Violations at Container Level

Switch: - _GLIBCXX_PROFILE_DDTEST. -
Goal: Detect container elements - that are referenced from multiple threads in the parallel region or - across parallel regions. -
Fundamentals: - Sharing data between threads requires communication and perhaps locking, - which may be expensive. -
Sample runtime reduction:?%. -
Recommendation: Change data - distribution or parallel algorithm.
To instrument: Container access methods - and iterators. -
Analysis: - Keep a shadow for each container. Record iterator dereferences and - container member accesses. Issue advice for elements referenced by - multiple threads. - See paper: - The LRPD test: speculative run-time parallelization of loops with - privatization and reduction parallelization. -
Cost model: - Number of accesses to elements referenced from multiple threads -
Example: -
```
-
```
-

False Sharing

Switch: - _GLIBCXX_PROFILE_FALSE_SHARING. -
Goal: Detect elements in the - same container which share a cache line, are written by at least one - thread, and accessed by different threads. -
Fundamentals: Under these assumptions, - cache protocols require - communication to invalidate lines, which may be expensive. -
Sample runtime reduction:68%. -
Recommendation: Reorganize container - or use padding to avoid false sharing.
To instrument: Container access methods - and iterators. -
Analysis: - First, get the cache line size. - For each shared container, record all the associated iterator dereferences - and member access methods with the thread id. Compare the address lists - across threads to detect references in two different threads to the same - cache line. Issue a warning only if the ratio to total references is - significant. Do the same for iterator dereference values if they are - pointers.
Cost model: - Number of accesses to same cache line from different threads. -

Example: -

-1     vector<int> v(2, 0);
-2 #pragma omp parallel for shared(v, SIZE) schedule(static, 1)
-3     for (i = 0; i < SIZE; ++i) {
-4       v[i % 2] += i;
-5     }
-
-OMP_NUM_THREADS=2 ./a.out
-foo.cc:1: advice: Change container structure or padding to avoid false
-sharing in multithreaded access at foo.cc:4.  Detected N shared cache lines.
-

Statistics

-Switch: - _GLIBCXX_PROFILE_STATISTICS. -

- In some cases the cost model may not tell us anything because the costs - appear to offset the benefits. Consider the choice between a vector and - a list. When there are both inserts and iteration, an automatic advice - may not be issued. However, the programmer may still be able to make use - of this information in a different way. -

- This diagnostic will not issue any advice, but it will print statistics for - each container construction site. The statistics will contain the cost - of each operation actually performed on the container. -

Prev	Up	Next
Developer Information	Home	Chapter 20. The mt_allocator