diff --git a/_posts/2023-03-02-fast-21.md b/_posts/2023-03-02-fast-21.md index 0a3e0c3..bf03810 100644 --- a/_posts/2023-03-02-fast-21.md +++ b/_posts/2023-03-02-fast-21.md @@ -12,7 +12,7 @@ Originally posted as Fast TotW #21 on January 16, 2020 *By [Paul Wankadia](mailto:junyer@google.com) and [Darryl Gove](mailto:djgove@google.com)* -Updated 2024-10-21 +Updated 2025-09-03 Quicklink: [abseil.io/fast/21](https://abseil.io/fast/21) diff --git a/_posts/2023-03-02-fast-39.md b/_posts/2023-03-02-fast-39.md index 0ddae58..524290a 100644 --- a/_posts/2023-03-02-fast-39.md +++ b/_posts/2023-03-02-fast-39.md @@ -12,7 +12,7 @@ Originally posted as Fast TotW #39 on January 22, 2021 *By [Chris Kennelly](mailto:ckennelly@google.com) and [Alkis Evlogimenos](mailto:alkis@evlogimenos.com)* -Updated 2025-03-24 +Updated 2025-09-29 Quicklink: [abseil.io/fast/39](https://abseil.io/fast/39) @@ -112,10 +112,11 @@ challenging: Microbenchmarks tend to have small working sets that tend to be cache resident. Real code, particularly Google C++, is not. In production, the cacheline holding `kMasks` might be evicted, leading to much -worse stalls (hundreds of cycles to access main memory). Additionally, on x86 -processors since Haswell, this [optimization can be past its prime](/fast/9): -BMI2's `bzhi` instruction is both faster than loading and masking *and* delivers -more consistent performance. +worse stalls +([hundreds of cycles to access main memory](https://sre.google/static/pdf/rule-of-thumb-latency-numbers-letter.pdf)). +Additionally, on x86 processors since Haswell, this +[optimization can be past its prime](/fast/9): BMI2's `bzhi` instruction is both +faster than loading and masking *and* delivers more consistent performance. When developing benchmarks for [SwissMap](https://abseil.io/blog/20180927-swisstables), individual operations diff --git a/_posts/2023-03-02-fast-53.md b/_posts/2023-03-02-fast-53.md index 683b9f1..2275319 100644 --- a/_posts/2023-03-02-fast-53.md +++ b/_posts/2023-03-02-fast-53.md @@ -12,7 +12,7 @@ Originally posted as Fast TotW #53 on October 14, 2021 *By [Mircea Trofin](mailto:mtrofin@google.com)* -Updated 2024-11-19 +Updated 2025-09-03 Quicklink: [abseil.io/fast/53](https://abseil.io/fast/53) @@ -73,7 +73,7 @@ the process of writing a benchmark. An example of its use may be seen [here](https://github.com/llvm/llvm-test-suite/tree/main/MicroBenchmarks/LoopVectorization) The benchmark harness support for performance counters consists of allowing the -user to specify up to 3 counters in a comma-separated list, via the +user to specify counters in a comma-separated list, via the `--benchmark_perf_counters` flag, to be measured alongside the time measurement. Just like time measurement, each counter value is captured right before the benchmarked code is run, and right after. The difference is reported to the user @@ -131,13 +131,15 @@ instructions, and 6 memory ops per iteration. - *Number of counters*: At most 32 events may be requested for simultaneous collection. Note however, that the number of hardware counters available is - much lower (usually 4-8 on modern CPUs) -- requesting more events than the + much lower (usually 4-8 on modern CPUs, see + `PerfCounterValues::kMaxCounters`) -- requesting more events than the hardware counters will cause [multiplexing](https://perf.wiki.kernel.org/index.php/Tutorial#multiplexing_and_scaling_events) and decreased accuracy. 
-- *Visualization*: There is no visualization available, so the user needs to - rely on collecting JSON result files and summarizing the results. +- *Visualization*: There is no dedicated visualization UI available, so for + complex analysis, users may need to collect JSON result files and summarize + the results. - *Counting vs. Sampling*: The framework only collects counters in "counting" mode -- it answers how many cycles/cache misses/etc. happened, but not does diff --git a/_posts/2023-03-02-fast-9.md b/_posts/2023-03-02-fast-9.md index 4a50ed5..c197a82 100644 --- a/_posts/2023-03-02-fast-9.md +++ b/_posts/2023-03-02-fast-9.md @@ -12,7 +12,7 @@ Originally posted as Fast TotW #9 on June 24, 2019 *By [Chris Kennelly](mailto:ckennelly@google.com)* -Updated 2025-03-27 +Updated 2025-10-03 Quicklink: [abseil.io/fast/9](https://abseil.io/fast/9) diff --git a/_posts/2023-09-14-fast-7.md b/_posts/2023-09-14-fast-7.md index 5b28d6e..ad707b6 100644 --- a/_posts/2023-09-14-fast-7.md +++ b/_posts/2023-09-14-fast-7.md @@ -12,7 +12,7 @@ Originally posted as Fast TotW #7 on June 6, 2019 *By [Chris Kennelly](mailto:ckennelly@google.com)* -Updated 2025-03-25 +Updated 2025-10-03 Quicklink: [abseil.io/fast/7](https://abseil.io/fast/7) diff --git a/_posts/2023-09-30-fast-52.md b/_posts/2023-09-30-fast-52.md index 7db524e..f6465cc 100644 --- a/_posts/2023-09-30-fast-52.md +++ b/_posts/2023-09-30-fast-52.md @@ -12,7 +12,7 @@ Originally posted as Fast TotW #52 on September 30, 2021 *By [Chris Kennelly](mailto:ckennelly@google.com)* -Updated 2025-03-24 +Updated 2025-10-03 Quicklink: [abseil.io/fast/52](https://abseil.io/fast/52) diff --git a/_posts/2023-10-10-fast-64.md b/_posts/2023-10-10-fast-64.md index d3791df..c6ed19f 100644 --- a/_posts/2023-10-10-fast-64.md +++ b/_posts/2023-10-10-fast-64.md @@ -12,7 +12,7 @@ Originally posted as Fast TotW #64 on October 21, 2022 *By [Chris Kennelly](mailto:ckennelly@google.com)* -Updated 2025-03-24 +Updated 2025-09-29 Quicklink: [abseil.io/fast/64](https://abseil.io/fast/64) @@ -192,7 +192,7 @@ that can be returned. This approach has two problems: variable small string object buffer sizes. Returning `const std::string&` constrains the implementation to that particular size of buffer. -In contrast, by returning `std::string_view` (or our +In contrast, by returning [`std::string_view`](/tips/1) (or our [internal predecessor](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3442.html), `StringPiece`), we decouple callers from the internal representation. The API is the same, independent of whether the string is constant data (backed by the diff --git a/_posts/2023-10-15-fast-60.md b/_posts/2023-10-15-fast-60.md index fdfa157..838f3f8 100644 --- a/_posts/2023-10-15-fast-60.md +++ b/_posts/2023-10-15-fast-60.md @@ -12,14 +12,15 @@ Originally posted as Fast TotW #60 on June 6, 2022 *By [Chris Kennelly](mailto:ckennelly@google.com)* -Updated 2025-03-24 +Updated 2025-09-29 Quicklink: [abseil.io/fast/60](https://abseil.io/fast/60) [Google-Wide Profiling](https://research.google/pubs/pub36575/) collects data not just from our hardware performance counters, but also from in-process -profilers. +profilers. These have been covered in previous episodes covering +[hashtables](/fast/26). In-process profilers can give deeper insights about the state of the program that are hard to observe from the outside, such as lock contention, where memory @@ -39,8 +40,8 @@ decisions faster, shortening our The value is in pulling in the area-under-curve and landing in a better spot. 
An "imperfect" profiler that can help make a decision is better than a "perfect" profiler that is unwieldy to collect for performance or privacy reasons. Extra -information or precision is only useful insofar as it helps us make a *better* -decision or *changes* the outcome. +information or precision is only useful insofar as it helps us make a +[*better* decision or *changes* the outcome](/fast/94). For example, most new optimizations to [TCMalloc](https://github.com/google/tcmalloc/blob/master/tcmalloc) start from @@ -54,7 +55,7 @@ steps didn't directly save any CPU usage or bytes of RAM, but they enabled better decisions. Capabilities are harder to directly quantify, but they are the motor of progress. -## Leveraging existing profilers: the "No build" option +## Leveraging existing profilers: the "No build" option {#no-build} Developing a new profiler takes considerable time, both in terms of implementation and wallclock time to ready the fleet for collection at scale. @@ -65,19 +66,19 @@ For example, if the case for hashtable profiling was just reporting the capacity of hashtables, then we could also derive that information from heap profiles, TCMalloc's heap profiles of the fleet. Even where heap profiles might not be able to provide precise insights--the actual "size" of the hashtable, rather -than its capacity--we can make an informed guess from the profile combined with -knowledge about the typical load factors due to SwissMap's design. +than its capacity--we can make an [informed guess](/fast/90) from the profile +combined with knowledge about the typical load factors due to SwissMap's design. It is important to articulate the value of the new profiler over what is already provided. A key driver for hashtable-specific profiling is that the CPU profiles of a hashtable with a [bad hash function look similar to those](https://youtu.be/JZE3_0qvrMg?t=1864) -with a good hash function. The added information collected for stuck bits helps -us drive optimization decisions we wouldn't have been able to make. The capacity -information collected during hashtable-profiling is incidental to the profiler's -richer, hashtable-specific details, but wouldn't be a particularly compelling -reason to collect it on its own given the redundant information available from -ordinary heap profiles. +with a good hash function. The [added information collected](/fast/26) for stuck +bits helps us drive optimization decisions we wouldn't have been able to make. +The capacity information collected during hashtable-profiling is incidental to +the profiler's richer, hashtable-specific details, but wouldn't be a +particularly compelling reason to collect it on its own given the redundant +information available from ordinary heap profiles. ## Sampling strategies diff --git a/_posts/2023-10-20-fast-70.md b/_posts/2023-10-20-fast-70.md index 4f0337d..4430fb4 100644 --- a/_posts/2023-10-20-fast-70.md +++ b/_posts/2023-10-20-fast-70.md @@ -12,7 +12,7 @@ Originally posted as Fast TotW #70 on June 26, 2023 *By [Chris Kennelly](mailto:ckennelly@google.com)* -Updated 2025-03-25 +Updated 2025-10-03 Quicklink: [abseil.io/fast/70](https://abseil.io/fast/70) @@ -129,6 +129,13 @@ performance improvements. We still need to measure the impact on application and service-level performance, but the proxies help us hone in on an optimization that we want to deploy faster. +When we are considering multiple options for a project, secondary metrics can +give us confirmation after the fact that our expectations were correct. 
For +example, suppose we chose option A over option B because both provided +comparable performance but A would not impact reliability. We should measure +both the performance and reliability outcomes to support our engineering +decision. This lets us close the loop between expectations and reality. + ## Aligning with success The metrics we pick need to align with success. If a metric tells us to do the diff --git a/_posts/2023-11-10-fast-74.md b/_posts/2023-11-10-fast-74.md index d3bf02a..dbb8aba 100644 --- a/_posts/2023-11-10-fast-74.md +++ b/_posts/2023-11-10-fast-74.md @@ -12,7 +12,7 @@ Originally posted as Fast TotW #74 on September 29, 2023 *By [Chris Kennelly](mailto:ckennelly@google.com) and [Matt Kulukundis](mailto:kfm@google.com)* -Updated 2025-03-25 +Updated 2025-10-03 Quicklink: [abseil.io/fast/74](https://abseil.io/fast/74) @@ -74,12 +74,12 @@ understand, we might be tempted to remove it. TCMalloc's fast path would appear cheaper, but other code somewhere else would experience a cache miss and [application productivity](/fast/7) would decline. -To make matters worse, the cost is partly a profiling artifact. The TLB miss -blocks instruction retirement, but our processors are superscalar, out-of-order -behemoths. The processor can continue to execute further instructions in the -meantime, but this execution is not visible to a sampling profiler like -Google-Wide Profiling. IPC in the application may be improved, but not in a way -immediately associated with TCMalloc. +To make matters worse, the cost is partly [a profiling artifact](/fast/94). The +TLB miss blocks instruction retirement, but our processors are superscalar, +out-of-order behemoths. The processor can continue to execute further +instructions in the meantime, but this execution is not visible to a sampling +profiler like Google-Wide Profiling. IPC in the application may be improved, but +not in a way immediately associated with TCMalloc. ### Hidden context switch costs @@ -104,11 +104,11 @@ increase apparent kernel scheduler latency. ### Sweeping away protocol buffers -Consider an extreme example. When our hashtable profiler for Abseil's hashtables -indicates a problematic hashtable, a user could switch the offending table from -`absl::flat_hash_map` to `std::unordered_map`. Since the profiler doesn't -collect information about `std` containers, the offending table would no longer -show up, although the fleet itself would be dramatically worse. +Consider an extreme example. When [our hashtable profiler](/fast/26) for +Abseil's hashtables indicates a problematic hashtable, a user could switch the +offending table from `absl::flat_hash_map` to `std::unordered_map`. Since the +profiler doesn't collect information about `std` containers, the offending table +would no longer show up, although the fleet itself would be dramatically worse. While the above example may seem contrived, an almost entirely analogous recommendation comes up with some regularity: migrate users from protos to diff --git a/_posts/2023-11-10-fast-75.md b/_posts/2023-11-10-fast-75.md index 3ab9602..7f0579e 100644 --- a/_posts/2023-11-10-fast-75.md +++ b/_posts/2023-11-10-fast-75.md @@ -12,7 +12,7 @@ Originally posted as Fast TotW #75 on September 29, 2023 *By [Chris Kennelly](mailto:ckennelly@google.com)* -Updated 2025-03-25 +Updated 2025-10-03 Quicklink: [abseil.io/fast/75](https://abseil.io/fast/75) @@ -161,9 +161,10 @@ benchmark does, and that can have some profound effects on what we measure. 
For example, there, since we're iterating over the same buffer, and there's no dependency on the last value, the processor is very likely to be able to speculatively start the next iteration and won't need to undo the work. This -converts a benchmark that we'd like to measure as a chain of dependencies into a -measurement of the number of pipelines that the processor has (or the duration -of the dependency chain divided by the number of parallel executions). +converts a benchmark that we'd like to measure as a +[chain of dependencies](/fast/99) into a measurement of the number of pipelines +that the processor has (or the duration of the dependency chain divided by the +number of parallel executions). To make the benchmark more realistic, we can instead parse from a larger buffer of varints serialized end-on-end: diff --git a/_posts/2024-09-04-fast-62.md b/_posts/2024-09-04-fast-62.md index b13f37f..7f5e301 100644 --- a/_posts/2024-09-04-fast-62.md +++ b/_posts/2024-09-04-fast-62.md @@ -12,7 +12,7 @@ Originally posted as Fast TotW #62 on July 7, 2022 *By [Chris Kennelly](mailto:ckennelly@google.com), [Luis Otero](mailto:lotero@google.com) and [Carlos Villavieja](mailto:villavieja@google.com)* -Updated 2025-03-12 +Updated 2025-09-15 Quicklink: [abseil.io/fast/62](https://abseil.io/fast/62) diff --git a/_posts/2024-09-04-fast-72.md b/_posts/2024-09-04-fast-72.md index 46dd8da..dfc37b9 100644 --- a/_posts/2024-09-04-fast-72.md +++ b/_posts/2024-09-04-fast-72.md @@ -12,7 +12,7 @@ Originally posted as Fast TotW #72 on August 7, 2023 *By [Chris Kennelly](mailto:ckennelly@google.com)* -Updated 2025-02-18 +Updated 2025-08-23 Quicklink: [abseil.io/fast/72](https://abseil.io/fast/72) @@ -37,9 +37,9 @@ estimates to be correct, the primary goal is to have just enough information to optimization "A" over optimization "B" because "A" has a larger expected ROI. Oftentimes, we only need a [single significant figure](https://en.wikipedia.org/wiki/Significant_figures) -to do so: Spending more time making a better estimate does not make things more -efficient by itself. When new information arrives, we can update our priors -accordingly. +to do so: Spending more time making a better estimate or +[gathering more data](/fast/94) does not make things more efficient by itself. +When new information arrives, we can update our priors accordingly. Once we have identified an area to work in, we can shift to thinking about ways of tackling problems in that area. @@ -226,9 +226,11 @@ helps in several ways. Success in one area brings opportunities for cross-pollination. We can take the same solution, an - [algorithm](https://research.google/pubs/pub50370.pdf#page=7) or data - structure, and apply the idea to a related but different area. Without the - original landing, though, we might have never realized this. + [algorithm](https://research.google/pubs/pub50370.pdf#page=7) (pages on huge + pages) or data structure, and apply the idea to a + [related but different area](https://storage.googleapis.com/gweb-research2023-media/pubtools/7777.pdf#page=9) + (objects on pages, or "span" prioritization). Without the original landing, + though, we might have never realized this. * Circumstances are continually changing. The assumptions that started a project years ago may be invalid by the time the project is ready. 
diff --git a/_posts/2024-09-04-fast-79.md b/_posts/2024-09-04-fast-79.md index 63f8135..27df521 100644 --- a/_posts/2024-09-04-fast-79.md +++ b/_posts/2024-09-04-fast-79.md @@ -12,7 +12,7 @@ Originally posted as Fast TotW #79 on January 19, 2024 *By [Chris Kennelly](mailto:ckennelly@google.com) and [Matt Kulukundis](mailto:kfm@google.com)* -Updated 2024-12-10 +Updated 2025-06-20 Quicklink: [abseil.io/fast/79](https://abseil.io/fast/79) @@ -77,8 +77,9 @@ smoothly. TIP: Prefer switching defaults to migrating code if you can. -When we introduced hashtable profiling for monitoring tables fleet wide, some -users were surprised that tables could be sampled (triggering additional system +When we introduced +[hashtable profiling for monitoring tables fleet wide](/fast/26), some users +were surprised that tables could be sampled (triggering additional system calls). If we had tried to have sampled monitoring from the start, the migration would have had a new class of issues to debug. This also allowed us to have a [very clear opt-out for this specific feature](/fast/52) without delaying the @@ -91,7 +92,8 @@ class of issues at a time. ## Iterative improvement: Deploying TCMalloc's CPU caches -When TCMalloc was first introduced, it used per-thread caches, hence its name, +When [TCMalloc](github.com/google/tcmalloc/blob/master/docs/index.html) was +first introduced, it used per-thread caches, hence its name, "[Thread-Caching Malloc](https://goog-perftools.sourceforge.net/doc/tcmalloc.html)." As thread counts continued to increase, per-thread caches suffered from two growing problems: a per-process cache size was divided over more and more @@ -127,8 +129,8 @@ development of [several optimizations](https://research.google/pubs/characterizing-a-memory-allocator-at-warehouse-scale/). TCMalloc includes extensive telemetry that enabled us to calculate the amount of memory being used for per-vCPU caches which provided estimates of the potential -opportunity - to motivate the work - and the final impact - for recognising the -benefit. +opportunity - to motivate the work - and measure the final impact - for +recognising the benefit. TIP: Tracking metrics that we intend to optimize later, even if not right away, can help identify when an idea is worth pursuing and prioritizing. By monitoring diff --git a/_posts/2024-09-04-fast-83.md b/_posts/2024-09-04-fast-83.md index eea3320..cbbf060 100644 --- a/_posts/2024-09-04-fast-83.md +++ b/_posts/2024-09-04-fast-83.md @@ -12,7 +12,7 @@ Originally posted as Fast TotW #83 on June 17, 2024 *By [Chris Kennelly](mailto:ckennelly@google.com)* -Updated 2025-02-18 +Updated 2025-08-23 Quicklink: [abseil.io/fast/83](https://abseil.io/fast/83) diff --git a/_posts/2024-11-08-fast-87.md b/_posts/2024-11-08-fast-87.md index 873edeb..e3dd67d 100644 --- a/_posts/2024-11-08-fast-87.md +++ b/_posts/2024-11-08-fast-87.md @@ -12,7 +12,7 @@ Originally posted as Fast TotW #87 on October 16, 2024 *By [Chris Kennelly](mailto:ckennelly@google.com)* -Updated 2025-05-12 +Updated 2025-11-17 Quicklink: [abseil.io/fast/87](https://abseil.io/fast/87) @@ -76,7 +76,9 @@ like SwissMap might need thousands of usages to get meaningful traction. For SwissMap, nearly [drop-in API compatibility](https://abseil.io/docs/cpp/guides/container) makes a hypothetical rollback possible, whether by changing usage or by making the -implementation a wrapper for `std::unordered_map`. +implementation a wrapper for `std::unordered_map`. 
SwissMap makes strictly fewer +guarantees with respect to actually being unordered and iterator/pointer +stability, so it could be easily substituted for the existing implementation. At the other end of the spectrum are new programming paradigms. While a simple API might be possible to decompose in terms of other libraries, moving back and @@ -147,6 +149,29 @@ settings for the given application. This approach isn't always applicable--for example, we can't use it for decompression--but when it works it is a useful tool in higher-risk scenarios such as experimenting with data-at-rest. +## Latency injection and SLO brownouts + +Systems sometime underpromise on SLOs and overdeliver on actual performance. +This can make it challenging to determine whether higher-level services depend +on the better-than-promised performance characteristics. + +To mitigate this, one approach is to implement intentional +[brownouts](https://sre.google/sre-book/service-level-objectives/) for a +service. For example, Chubby did that in the past for its global instance, +ensuring that users experienced the letter of the SLO, allowing infeasible +dependencies to be uncovered in a controlled manner. + +Similarly, injecting latency can allow us to simulate a slower implementation, +which we might consider as part of a tradeoff of compute for other resources. An +implementation to add latency is straightforward to build and easy to turn on +and off. + +By handling this in a controlled setting, we can rapidly turn off the injection +experiment or availability brownout if we uncover downstream issues. This can +help mitigate the risk of building on the assumption that a particular tradeoff +was possible, only to discover a late-arrising issue partway through a large +scale rollout. + ## Handling bugs and regressions Bugs, regressions, and outages are never fun, but they are inevitable as the diff --git a/_posts/2025-02-06-fast-90.md b/_posts/2025-02-06-fast-90.md index fc37c8f..e7f9412 100644 --- a/_posts/2025-02-06-fast-90.md +++ b/_posts/2025-02-06-fast-90.md @@ -12,7 +12,7 @@ Originally posted as Fast TotW #90 on February 6, 2025 *By [Chris Kennelly](mailto:ckennelly@google.com)* -Updated 2025-02-06 +Updated 2025-08-11 Quicklink: [abseil.io/fast/90](https://abseil.io/fast/90) @@ -41,7 +41,9 @@ precision we might need. If we're considering project A that will make things 5% faster and project B that will make things 0.1% faster, we don't need to worry about additional significant figures for project A's estimate. A more precise estimate for project A of 5.134% won't change our prioritization all things -being equal (effort, complexity, etc.). +being equal (effort, complexity, etc.). In that situation, we should instead +prefer to focus on information that could [change our decision](/fast/94) rather +than gathering unneeded precision that won't. ## How big is it? @@ -113,8 +115,8 @@ and scope of a problem. bytes, the vector's size is likely 1. While proving a strict bound may be difficult, distributions from the heap - or container-specific profiles can give us a sense of what is most likely to - be encountered and worth optimizing for. + or [container-specific profiles](/fast/26) can give us a sense of what is + most likely to be encountered and worth optimizing for. ## How much can we change? @@ -155,7 +157,7 @@ optimization might be small. [Latency rules of thumb](https://sre.google/static/pdf/rule-of-thumb-latency-numbers-letter.pdf) or known hardware characteristics can give us another lower bound. 
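+
+For example, an operation that must make 10 dependent trips to main memory at
+roughly 100 ns each cannot take much less than about 1 microsecond
+(10 x 100 ns), no matter how lean the surrounding code is. The numbers here
+are assumed from the rules of thumb linked above, purely for illustration.
+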
-### Analytical methods +### Analytical methods {#analytical} We may be able to deduce a fraction based on the known properties of a library. @@ -165,7 +167,8 @@ We may be able to deduce a fraction based on the known properties of a library. at least 50%. A simple mean of the two extremes might be a good enough approximation. When -profiles are available for the container, we can be even more precise. +profiles are [available for the container](/fast/26), we can be even more +precise. Combining different pieces of information may also help us estimate headroom. While job reshaping has substantial tried-and-true savings results to draw from, diff --git a/_posts/2025-06-03-fast-93.md b/_posts/2025-06-03-fast-93.md index df30982..daca46b 100644 --- a/_posts/2025-06-03-fast-93.md +++ b/_posts/2025-06-03-fast-93.md @@ -84,6 +84,15 @@ finding bugs, allowing the engineer actively writing code to fix the issue right away, rather than waiting for a production outage with its commensurately higher costs. +Similarly, lock annotations allow us to programmatically +[convey thread safety requirements](https://clang.llvm.org/docs/ThreadSafetyAnalysis.html) +so they can be checked at compile time. Rather than having to remember to hold a +lock on every access of a member variable by hand, we can transform these into +build failures when we access it without holding the lock. While these +annotations can't express subtle, but correct, locking strategies, they can +allows us budget more time for those subtle cases, saved by not having to +carefully audit even mundate concurrency patterns. + ### Runtime hardening Over time, we've developed several techniques for warding off @@ -115,8 +124,8 @@ useful: tree walk, but incorrect sizes would lead to data corruption. In debug builds, TCMalloc checks the provided size against its internal metadata groundtruth. TCMalloc already samples periodically to drive heap profiling, - allowing it to provide extra checks on these sampled objects to proactively - find further issues in production. + allowing it to provide extra checks on these sampled objects to + [proactively find further issues in production](https://dl.acm.org/doi/10.1145/3639477.3640328). * Counterfactual checks: In sanitizer builds, we check that SwissMap's iterators remain valid after any potential invalidations. Even if a @@ -194,6 +203,7 @@ be functionally correct with or without the optimization. without actually using that much physical memory. Leveraging this design choice allowed a wider variety of edge cases to be tested and regressions avoided as new improvements are brought online. + * Protobuf elides copies in several situations. Tests ensure that this implementation detail is preserved to avoid substantial regressions. diff --git a/_posts/2025-06-27-fast-94.md b/_posts/2025-06-27-fast-94.md new file mode 100644 index 0000000..d1e9c05 --- /dev/null +++ b/_posts/2025-06-27-fast-94.md @@ -0,0 +1,192 @@ +--- +title: "Performance Tip of the Week #94: Decision making in a data-imperfect world" +layout: fast +sidenav: side-nav-fast.html +published: true +permalink: fast/94 +type: markdown +order: "094" +--- + +Originally posted as Fast TotW #94 on June 26, 2025 + +*By [Chris Kennelly](mailto:ckennelly@google.com)* + +Updated 2025-06-27 + +Quicklink: [abseil.io/fast/94](https://abseil.io/fast/94) + + +Profiling tools are vital to narrowing the search space of all possible changes +we could make to highlight the best uses of our time. 
In this episode, we +discuss strategies for identifying when we might want to seek out more data and +when we are in the midst of diminishing returns. + +## Focusing on outcomes + +The ultimate outcomes that we want from our tools are insights that let us +improve performance. + +We can almost always do more analysis -- increase the time horizon, auxiliary +metrics considered, alternative bucketing strategies -- to try to make a +decision. This can be initially helpful at finding unforeseen situations, but we +might mire ourselves in [spurious findings](/fast/88) or develop analysis +paralysis. + +### Changing decisions + +At some point, [redundant tools](/fast/60), [extra precision](/fast/98), or +further experiments add less value than their cost to obtain. While we can seek +these things anyway, if they aren't changing the decisions we make, we could do +without and achieve the same outcome. + +It's easy to fall into a trap of looking for one last thing, but this carries an +opportunity cost. We will occasionally make mistakes in the form of unsuccessful +projects. While some of these could be avoided in hindsight with extra analysis +or a small "fail fast" prototype, these are tolerable when the +[stakes are low](/fast/87). + +When considering an experiment to get additional information, ask yourself +questions like "can the possible range of outcomes convince me to change my +mind?" Otherwise, constructing additional experiments may merely serve to +provide confirmation bias. The following are examples where the extra analysis +seems appealing, but is ultimately unnecessary: + +* We might try out a new optimization technique with minimal rollout scope and + complexity. If it works, we've [gained confidence](/fast/72). If it fails, + we should change our plans. If either of those *possible* outcomes might be + [insufficient to persuade us](https://en.wikipedia.org/wiki/Ad_hoc_hypothesis), + we should instead develop a plan that will provide a definitive next step. + + If we plan to iteratively move + [from microbenchmark to loadtest to production](/fast/39), we should stop or + at least reconsider if the early benchmarks produce neutral results rather + than moving to production. The fact that a benchmark can give us a result + different from production might motivate us to move forward anyway, but + tediously collecting the data from them is pointless if our plan is to + unconditionally ignore them. + +* [Estimates](/fast/90) only need to be good enough to break ties. If our plan + is to look at the top 10 largest users of a library for performance + opportunities, improved accuracy might adjust the order somewhat, but not + dramatically. Historical data might give us slightly different orders for + rankings chosen using yesterday's data versus last week's, but the bulk of + the distribution is often roughly consistent. + +### Failing fast + +Consider a project where we're optimistic it will work, but we have a long slog +ahead of ourselves to get it over the finish line. There are a couple of +strategies we can use to quickly gain confidence in the optimization or abandon +it if it fails to provide the expected benefits: + +* A [new API](/fast/64) might be more expressive and easier to optimize, but + we need to migrate our existing users to it. Testing and canarying a single + user is easier to attempt (and rollback, if unsuccessful) than to deploy it + everywhere before switching implementations. 
+* We might be interested in deploying a new data layout, either in-memory or + on-disk. A simple prototype that simulates an approximation of that layout + can tell us a lot about the probable performance characteristics. Even + before we can handle every edge case, the "best case" scenario not panning + out gives us a reason to stop and go no further. + + For example, as we attempt to remove [data indirections](/fast/83), we might + [microbenchmark](/fast/75) a few candidate layouts for a data structure, + starting from the status quo `std::vector` to `std::deque`, + `absl::InlinedVector`, and `std::vector`. Each of these solutions + has its merits based on the constraints of `T` and the access pattern of the + program. Having a sense of where the largest opportunity is for the current + access pattern can help us focus our attention and avoid a situation where + we sink time into a migration before we can derisk anything. + +Getting things over the finish line may still be 80% of the work, but our +initial work will have derisked the outcome. Investing time in polish for a +project that ultimately fails carries a high opportunity cost. + +### Updating our priors + +The analysis we do is just to determine the course of action we should take. +Ultimately, we care about the impact of the action, not how elegant our plans +were. A promising change backed by good estimates and preliminary benchmarks has +to be successful when deployed to production to actually be a success. Good +benchmarks alone are not the [outcome](/fast/70) we are after. Sometimes we'll +find that promising analysis or benchmarks do not turn into benefits in +production, and the outcome is instead [an opportunity for learning](/fast/72). + +If we chose among several candidate solutions to a problem, we should confirm +that their qualities held up. For example, we might have picked a strategy that +was more complex but perceived to be more reliable than an alternative. Even if +the project was a success otherwise, but reliability instead suffered, we should +reconsider if the alternatives are worth revisiting. + +## Data censorship + +Our tools sometimes have blind spots that we need to consider when using them. +Simplifications to get a "good enough" result can help our priors, but we should +be cautious about extrapolating too broadly from them. + +More data points with the +[same caveats](https://en.wikipedia.org/wiki/Censoring_\(statistics\)) merely +make us overconfident, not more accurate. When the stakes are higher, +cross-validation against other information can help uncover gaps. More data +points from distinct vantage points are more valuable than more data points from +the same ones. We should prime ourselves to consider what new evidence would +cause us to reconsider our plans. + +### Profiler limitations {#profiler} + +Many profilers are cost-conscious to minimize their impact. To do this, they +employ sampling strategies that +[omit some data points](https://en.wikipedia.org/wiki/Censoring_\(statistics\)). + +Our [hashtable profiler](/fast/26) makes its sampling decisions when tables are +first mutated. Avoiding a sampling decision in the default constructor keeps +things efficient, but means that empty tables are not represented in the +statistics. Using other profilers, we can determine that many destroyed tables +are in fact empty. + +Historically, TCMalloc's lifetime profiler had a similar caveat. 
To simplify the +initial implementation, it only reported objects that had been both allocated +*and* deallocated during a profiling session. It omitted existing objects (left +censorship) and objects that outlived the session (right censorship). This +profiler has since been improved to include these, but understanding a +profiler's limitations is crucial to avoiding drawing the wrong conclusions from +biased data. + +Profilers tracking CPU cycles are often measuring how long it took for an +instruction to retire. The profile [hides](/fast/39) the cost of instructions +that come after a high latency one. In other situations, diffusely attributed +costs may obscure the actual critical path of the function found only by +[careful analysis](https://github.com/protocolbuffers/upb/pull/310) or tools +like [`llvm-mca`](https://llvm.org/docs/CommandGuide/llvm-mca.html). + +These examples illustrate how not everything can be measured with a +sampling-based profiler, but there are often different approaches. + +### Partial populations {#partial} + +Running a load test or even [canarying a change in production](/fast/39) for a +single application can increase our confidence that something will work. +Nevertheless, the limited scope doesn't assure us that the effect will be the +same on a wider population. + +This pitfall cuts in both positive and negative directions. If we have an +optimization for a library that isn't used by an application, no amount of +testing it with that application is likely to produce a [real effect](/fast/88). +A negative result where there is no opportunity for the optimization to show +benefit should not deter us, but we might abandon a good idea because of the +spurious result. Conversely, an optimization might falter as it is brought to a +broader scale, as our initial data points were biased by +[streetlamp effects](/fast/74). + +We can avoid this by measuring the effect on the wider population or by choosing +a broadly representative set of data points, instead of just one. Minimally we +should be confident that the method we have chosen to evaluate the optimization +has the potential to show a positive or negative impact. + +## Closing words + +As we work to gather data to guide our decisions, we should work to ensure we're +looking for features that could lead us to change our plans. It is easy to fall +into the trap of seeking additional data points to increase confidence, but we +may merely be falling into a trap of confirmation bias. diff --git a/_posts/2025-07-14-fast-95.md b/_posts/2025-07-14-fast-95.md new file mode 100644 index 0000000..bf880a3 --- /dev/null +++ b/_posts/2025-07-14-fast-95.md @@ -0,0 +1,142 @@ +--- +title: "Performance Tip of the Week #95: Spooky action at a distance" +layout: fast +sidenav: side-nav-fast.html +published: true +permalink: fast/95 +type: markdown +order: "095" +--- + +Originally posted as Fast TotW #95 on July 1, 2025 + +*By [Chris Kennelly](mailto:ckennelly@google.com)* + +Updated 2025-07-14 + +Quicklink: [abseil.io/fast/95](https://abseil.io/fast/95) + + +Shared resources can cause surprising performance impacts on seemingly unchanged +software. In this episode, we discuss how to anticipate these effects and +externalities. + +## Normalization techniques + +Workload changes can confound longitudinal analysis: If you optimize a library +like protocol buffers, does spending more time in that code mean your +optimization didn't work or that the application now serves more load? 
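+
+As a stylized example with assumed numbers: if total CPU spent in protobuf
+rises 10% while queries served rise 15%, protobuf CPU per query actually fell
+by about 4% (1.10 / 1.15 is roughly 0.96), even though a time-based profile
+makes the library look more expensive.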
+ +A/B tests can control for independent variables. Nevertheless, load balancing +can throw a wrench into this. A client-side load balancing algorithm (like +[Weighted Round Robin](https://sre.google/sre-book/load-balancing-datacenter/)) +might observe the better performance of some tasks and send more requests to +them. + +* In absolute terms, the total CPU usage of both experiment arms might be + unchanged. +* In relative terms, there may be small, second order changes in relative time + (for example, %-of-time in Protobuf). This might hide the true significance + of the optimization or worse, lead us to believe the change is a regression + when it is not. + +For individual jobs, normalizing by [useful work done](/fast/7) can let us +compare throughput-per-CPU or throughput-per-RAM to allow us to control for +workload shifts. + +## Choosing the right partitioning scheme + +Where we have shared resources at the process, machine, or even data center +level, we want to design an experimentation scheme that allows us to treat the +shared resources as part of our experiment and control groups too. + +### Cache locality and memory bandwidth + +Cache misses carry two costs for performance: they cause our code to wait for +the miss to resolve and every miss puts pressure on the memory subsystem for +other code that we are not closely studying. + +In a [previous episode](/fast/83), we discussed an optimization to move a hot +`absl::node_hash_set` to `absl::flat_hash_set`. This change only affected one +particular type of queries going through the server. While we saw 5-7% +improvements for those types of queries, completely unmodified code paths for +other query types also showed an improvement, albeit smaller. + +Measuring the latter effect required process-level partitioning, which selected +some processes to always use `absl::node_hash_set` and other processes to always +use `absl::flat_hash_set`. Without this technique, all requests handled by the +server coexist with a 50:50 mix of modified/unmodified code paths. The modified +code path would have shown a performance improvement from fewer cache misses on +its lookups. For query types unaffected by the change, their data structures +would have seen the same mix of cache pressure from neighboring modified and +unmodified requests, rather than ever seeing a 100% modified or 100% unmodified +neighbor. Per-request randomization would prevent us from measuring the "blast +radius" impact of our change. This happens all the time for any change that uses +a shared resource (aka most changes). + +This challenge of experiment design can be especially pronounced with load +tests. Without studying the right query distribution of different request types +concurrently, we won't see the full impact, both positive and negative, of our +change. + +### Better hugepage coverage lifted all boats + +Before +[Temeraire](https://storage.googleapis.com/gweb-research2023-media/pubtools/6170.pdf), +TCMalloc's hugepage-aware allocator, many teams had opted to not periodically +release memory from TCMalloc's page heap to maximize hugepage availability and +CPU performance. Other, less CPU-intensive workloads released memory +periodically to strike a different balance between CPU usage and RAM usage due +to free pages resting on TCMalloc's heap. 
+ +Several +[additional](https://storage.googleapis.com/gweb-research2023-media/pubtools/6213.pdf) +optimizations have since modified the behavior, but the initial rollout sought +to maintain the same "release on request" characteristic of the old page heap to +[minimize tradeoffs](/fast/79). Ensuring that almost all applications used no +more memory than they previously did allowed us to focus on CPU performance and +a handful of edge cases in allocation patterns. + +When we ran A/B experiments prior to the fleetwide rollout of Temeraire, we saw +improvements in hugepage coverage, even for applications that did not +periodically release memory. + +Managing memory in a hugepage-aware fashion everywhere improved performance even +where we did not anticipate substantial benefits. The new allocator allowed +whole hugepages to be returned, avoiding physical memory fragmentation +altogether, and allowed us to iteratively target already fragmented pages to +satisfy requests. + +These added benefits were only recognized by A/B testing at the machine level. +While we had enabled Temeraire with the help of many eager, early adopters, we +could only see the first order effect--direct impact on the modified +application--and not the second order effect--better behaved neighbors--during +those experiments. + +### Distributed systems + +In larger, distributed serving stacks, we might go further to partition +resources to prevent confounding our results. + +* For example, if we send more requests to server backends during an A/B + experiment, the effect of higher load will be smeared across all of the + backends, affecting both the experiment and control group's latency. Several + teams use a "stripes" setup for this purpose, partitioning the serving stack + so that once a request is routed to one particular partition, it does not + cross back into other partitions. +* A change to + [job scheduling](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43438.pdf) + might impact how work is distributed to individual machines in a data + center. If we were to improve performance on some machines, effectively + lowering their utilization, the control plane might shift machine + assignments to rebalance. While jobs may still have effectively the same + amount of CPU time, a busier machine can translate into lower per-core + performance due to memory bandwidth, cache contention, and frequency scaling + effects that create a headwind for the experiment group. + +## Closing words + +Managing and reducing contention for shared resources is a large source of +performance opportunities. Nevertheless, recognizing these opportunities fully +requires consideration of experiment design to ensure confounding factors do not +obscure the real benefits from us. 
diff --git a/_posts/2025-07-16-fast-26.md b/_posts/2025-07-16-fast-26.md new file mode 100644 index 0000000..2bd6901 --- /dev/null +++ b/_posts/2025-07-16-fast-26.md @@ -0,0 +1,267 @@ +--- +title: "Performance Tip of the Week #26: Fixing things with hashtable profiling" +layout: fast +sidenav: side-nav-fast.html +published: true +permalink: fast/26 +type: markdown +order: "026" +--- + +Originally posted as Fast TotW #26 on May 20, 2020 + +*By [Chris Kennelly](mailto:ckennelly@google.com) and [Zie Weaver](mailto:zearen@google.com)* + +Updated 2025-07-16 + +Quicklink: [abseil.io/fast/26](https://abseil.io/fast/26) + + +As discussed by Matt Kulukundis in his +[2019 CppCon talk](https://www.youtube.com/watch?v=JZE3_0qvrMg), identifying a +"slow" hashtable from a pure CPU profile can be challenging. Abseil's +[C++ hashtables](/tips/136) have a built-in profiler. In this episode we +describe what insights about the hash function quality and hash collisions it +can provide, making them discernible at scale. We also look at a couple of case +studies where this information was used to improve Google's production fleet +efficiency. + +## Starting the analysis + +We can gather hashtable profiles by collecting them from a server in production +and loading them into pprof. + +The data can be used to identify problematic hashtables. Bad hashtables usually +come from a low-entropy hash function or using the same hash function for both +the table and sharding across instances. These can be addressed by: + +1. Using [`absl::Hash`](https://abseil.io/docs/cpp/guides/hash) to provide a + well-mixed hash function. +1. Using a distinct hash function by salting the hash. + +## Finding stuck bits + +Stuck bits are bits that never change across every hash. Naturally, this +increases the probability of a collision exponentially in the number of stuck +bits. + +We can look at a few examples to demonstrate causes of stuck hash functions. + +### Case study: InternalSubscriber::MessageReceived + +This function showed up multiple times when searching for stuck bits. It's in a +client library used in many servers. + +We see an insertion into an indexed array called `pending_ack_callbacks`: + +
+pending_ack_callbacks_[i].shard.insert(ack_cb, ack_callback);
+
+ +Its definition is an array of hashtables: + +
+struct {
+  ABSL_CACHELINE_ALIGNED absl::flat_hash_map<Closure*, Closure*> shard;
+} pending_ack_callbacks_[kNumPendingAckShards];
+
+ +where `kNumPendingAckShards = 256`. Suspiciously, our stuck bits were 0xff = +255. We choose our shard based on the hash: + +
+/* static */
+int InternalSubscriber::ClosureAddressToShard(Closure *address) {
+  absl::flat_hash_set<Closure*>::hasher h;
+  return h(address) % kNumPendingAckShards;
+}
+
+
+For the hashtable of any given shard, the bottom eight bits of every key's
+hash are identical by definition: they are exactly the bits that selected the
+shard. The hashtable uses these same bits to determine where to insert new
+values, so we see worse performance due to collisions.
+
+This was fixed by salting the hash that determines sharding:
+
+
+absl::Hash<std::pair<Closure*, int>> h;
+return h(std::pair(address, 42)) % kNumPendingAckShards;
+
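+
+An equivalent way to spell a salted, independent hash, assuming a recent
+Abseil with `absl::HashOf` (from `absl/hash/hash.h`) is available, is sketched
+below. This is not the exact fix that landed, just another way to express the
+same idea:
+
+```cpp
+/* static */
+int InternalSubscriber::ClosureAddressToShard(Closure* address) {
+  // absl::HashOf combines its arguments through absl::Hash, so mixing in a
+  // fixed salt yields a hash independent of the one the per-shard table uses.
+  return absl::HashOf(address, /*salt=*/42) % kNumPendingAckShards;
+}
+```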
+ +This salting technique provides distinct hash functions for shard lookup and +table lookup. This halved the cost of insertions in this library. + +### Case study: ProcessWatchDone + +In a different library, we found an unusual pattern of stuck bits: +`0x4030802800000000`. Before it was fixed, we found the declarations: + +
+typedef absl::node_hash_map<NamespaceKey, int64, NamespaceKeyHash>
+    NsKeyRequestIdMap;
+
+
+Since the code previously used `std::unordered_map` and `NamespaceKey` wasn't
+hashable with `std::hash`, it specified a custom hasher, `NamespaceKeyHash`,
+which used `GoodFastHash`:
+
+
+size_t NamespaceKeyHash::operator()(const NamespaceKey& ns_key) const {
+  return GoodFastHash<NamespaceKey>()(ns_key);
+}
+
+ +`NamespaceKey` is a pair of `Cord`s. But `GoodFastHash` does not properly mix +bits for `std::pair`: + +
+typedef std::pair<absl::Cord, absl::Cord> NamespaceKey;
+
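+
+Because `absl::Hash` natively understands both `std::pair` and `absl::Cord`,
+the fix can be as simple as dropping the custom hasher and letting the
+container use its default. A sketch, with the caveat that the real declaration
+may differ in its details:
+
+```cpp
+// absl::node_hash_map hashes with absl::Hash<NamespaceKey> by default, which
+// mixes both Cords of the pair.
+typedef absl::node_hash_map<NamespaceKey, int64> NsKeyRequestIdMap;
+```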
+ +Migrating this to use `absl::Hash` fixed this by improving mixing on the +`std::pair`. + +## Finding tables with many collisions + +When a hashtable has more collisions, it has to do more probes to find a +particular element for a lookup. With more collisions, lookup performance scales +with O(n) rather than O(1). The required memory loads might be uncached, +[hurting performance](https://static.googleusercontent.com/media/sre.google/en//static/pdf/rule-of-thumb-latency-numbers-letter.pdf). + +A perfect probe length is 0, which means we found the object right where we +first looked. If the average probe length is greater than 0.1, then the table is +probing more often than should be necessary as ~10% of keys are encountering +collisions. We can calculate average probe length by dividing the +`total_probe_length` by the sum of the `num_erases` and `size` (these two +numbers capture the total number of elements that have been inserted into the +table). + +### Case study: PortMapperImpl::UpdateActiveMap + +This one showed a max probe length of 133 and an average probe length of 8.4. +This is effectively performing a linear search on many lookups. + +We can look at where we are mutating the table: + +
+// Allocate new ConnLabel element and update counters
+iter = active_map_->find_or_insert(*flow);
+iter->second = new ConnLabel;
+
+ +`active_map_`'s definition points us to: + +
+typedef priority_hash_map<IpFlow,
+                          ConnLabel*,
+                          IpFlowPrioritizer> ClientPortMap;
+
+ +`priority_hash_map` is a custom type wrapping a SwissMap, but with +less-than-ideal defaults: + +
+template<class _Key,
+         class _Val, // NOLINT - "Failed to find complete declaration of
+                     //           class _Val  [build/class] [5]"
+         class _PriorityFunc,
+         class _PriorityKey = int,
+         class _PriorityHash = hash<_PriorityKey>,
+         class _PriorityEqual = std::equal_to<_PriorityKey>,
+         class _PriorityCompare = std::less<_PriorityKey>,
+         class _KeyHash = hash<_Key>,
+         class _KeyEqual = std::equal_to<_Key> >
+class priority_hash_map {
+ public:
+  typedef absl::flat_hash_map<_Key, _Val, _KeyHash, _KeyEqual> hash_map_type;
+  ...
+
+ +`_PriorityHash` and `_KeyHash` are using `hash<>`, which provides a custom hash +specialization: + +
+template<> struct hash<IpFlow> {
+  size_t operator()(const IpFlow& flow) const {
+    return flow.remote_ip() ^
+           (flow.proto() << 24) ^ flow.local_port() ^
+           (flow.remote_port() << 16);
+ }
+};
+
+
+The `xor`'d bits can lead to poor mixing. For example, all you'd need for a
+collision is an IP `10.0.0.33` on local port `2038` and another IP `10.0.0.32`
+on local port `2039`: the low bit that differs in the address and the low bit
+that differs in the port cancel each other out under `xor`. To fix this, we
+implemented `AbslHashValue` for `IpFlow` so `priority_hash_map` could switch
+to `absl::Hash`:
+
+
+// Allow this to be used as a hash map key.
+template<typename H>
+friend H AbslHashValue(H h, const IpFlow& flow) {
+ return H::combine(
+     std::move(h),
+     flow.remote_ip(),
+     flow.proto(),
+     flow.local_port(),
+     flow.remote_port());
+}
+
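+
+When adding an `AbslHashValue` overload like this, it is also worth adding a
+test that the hash stays consistent with equality. It will not check
+distribution quality, but it catches hash implementations that disagree with
+`operator==` for the sample values provided. A hypothetical sketch -- the
+`IpFlow` constructor shown here is an assumption:
+
+```cpp
+#include "absl/hash/hash_testing.h"
+#include "gtest/gtest.h"
+
+TEST(IpFlowTest, HashIsConsistentWithEquality) {
+  EXPECT_TRUE(absl::VerifyTypeImplementsAbslHashCorrectly({
+      IpFlow(/*remote_ip=*/0x0A000020, /*proto=*/6, /*local_port=*/2038,
+             /*remote_port=*/443),
+      IpFlow(/*remote_ip=*/0x0A000021, /*proto=*/6, /*local_port=*/2039,
+             /*remote_port=*/443),
+  }));
+}
+```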
+
+## Hashtable statistics
+
+For each profiled hashtable, we record statistics in several tags:
+
+* `capacity` tells the exact capacity of the hashtable.
+* `size` is the current number of elements in the hashtable.
+* The probe length statistics (`max_probe_length`, `total_probe_length`) tell
+  us how many extra lookups were needed to find an element. In an ideal
+  hashtable, we find the element we are looking for on the first lookup
+  (`probe length = 0`). If there are collisions, we'll have a higher probe
+  length. `max_probe_length` tells us the worst case probe length for any
+  element in the hashtable. `total_probe_length` is the sum of probe lengths
+  for all elements of the hashtable.
+* `stuck_bits` is a bitmask that will indicate any bits for which the hash
+  codes in the table have only seen one value (either zero or one). For a good
+  hash function this should be 0 for any table >10 elements. This is
+  implemented as a (running `&` of all hashes) `|` `~`(running `|` of all
+  hashes). A stuck bit indicates that our hash function may not be providing
+  sufficient entropy.
+* `num_rehashes` records the number of times the table has been rehashed.
+  Calling `reserve` with an appropriate size (close to the true size of the
+  table) can avoid unnecessary rehashes.
+* `max_reserve` records the maximum size passed to a call to `reserve` for the
+  instance, or 0 if no such call has occurred. This can be useful for
+  identifying tables that are too large (`size << capacity`) because `reserve`
+  was called with too large a size. Similarly, tables with many rehashes can
+  save on rehashing costs by calling `reserve` with a sufficient size.
+* `num_erases` is the number of elements that have been erased from the
+  hashtable (since the last rehash). The sum of `num_erases` and `size`
+  indicates the number of elements added to the table since the last rehash.
+* `inline_element_size` is the size of the elements of the flat part of the
+  array. For flat hashtables, this is the size of the key-value pair. For node
+  hashtables, this is the size of the pointer to the key-value pair.
+* `key_size` is equal to `sizeof(key_type)` for the table.
+* `value_size` is equal to `sizeof(value_type)` for the table. On sets,
+  `sizeof(key_type) == sizeof(value_type)`. On maps, `value_type` holds both
+  `key_type` and `mapped_type` with appropriate padding for alignment.
+* `soo_capacity` is the number of elements in the table's small-object
+  optimization (SOO).
+* `table_age` reports the age in microseconds of the table.
+
+## Detecting bad hashtables in unit tests
+
+One additional safeguard is that we can run the profiler while running our
+tests. This requires running with the environment variable
+`ABSL_CONTAINER_FORCE_HASHTABLEZ_SAMPLE=1`.
+
+Since a detection triggers a test failure, the threshold for alerting is fairly
+high. Unit tests may not insert sufficiently large numbers of elements into the
+hashtables to produce collisions.
+
+## Closing words
+
+Finding issues in hashtables from a CPU profile alone is challenging since
+opportunities may not appear prominently in the profile. SwissMap's built-in
+hashtable profiling lets us peer into a number of hashtable properties to find
+low-quality hash functions, collisions, or excessive rehashing. 
diff --git a/_posts/2025-10-03-fast-98.md b/_posts/2025-10-03-fast-98.md new file mode 100644 index 0000000..495dada --- /dev/null +++ b/_posts/2025-10-03-fast-98.md @@ -0,0 +1,173 @@ +--- +title: "Performance Tip of the Week #98: Measurement has an ROI" +layout: fast +sidenav: side-nav-fast.html +published: true +permalink: fast/98 +type: markdown +order: "098" +--- + +Originally posted as Fast TotW #98 on August 23, 2025 + +*By [Chris Kennelly](mailto:ckennelly@google.com)* + +Updated 2025-10-03 + +Quicklink: [abseil.io/fast/98](https://abseil.io/fast/98) + + +Effectively measuring optimization projects is an important part of the +[lifecycle of an optimization](/fast/72). Overlooking a large positive (or +negative) [externality](/fast/95) can cause us to make the wrong decisions for +choosing our next steps and future projects. Nevertheless, this quest for +accuracy needs to be balanced against the ROI from a better measurement. In this +episode, we discuss strategies for deciding when to invest more time in +measuring a project and when to move on. + +## Getting the big picture right + +In choosing a measurement strategy, we want to get the big picture right. We +want to look for where a difference [changes our decisions](/fast/94) for a +project: Do we roll back because something wasn't really helpful, the complexity +unwarranted, or do we double down and keep working in the same space +[following success](/fast/72)? + +Spending a small amount of effort to realize a 2x difference in our measured +results can have a tremendously positive ROI: The extra measurement time is +easier than finding, developing, and landing another equally-sized optimization +from scratch. Conversely, spending twice the effort to refine the measurement of +a much smaller optimization has a poor ROI, as the added precision is lost in +the noise of larger effects. + +It is easy to fixate on the numbers that we have [on the page](/fast/74) while +overlooking the numbers that are off the page. It is more important that we +don't overlook large positive (or negative) externalities that change our +conclusions than to eke out a 7th digit of precision. + +## Significant figures + +Optimization outcomes are quantitative, which can make them an attractive +nuisance for believing they are overly precise and accurate. For many +techniques, we can only measure 1 or 2 +[significant digits](https://en.wikipedia.org/wiki/Significant_figures). + +With a tool like +[Google-Wide Profiling](https://research.google/pubs/google-wide-profiling-a-continuous-profiling-infrastructure-for-data-centers/), +we can represent the fraction of Google fleet cycles in TCMalloc as a 17 decimal +digit, double-precision floating point number. Writing out this many digits +doesn't actually mean the number has that +[precision](https://en.wikipedia.org/wiki/False_precision). Spending a few +additional milliseconds in TCMalloc on a single server will change some of the +digits in that number. While the overall trend is stable, there is a day-to-day +standard deviation in the total. A second-order effect with small magnitude +relative to that variation might be entirely lost in the noise. + +Aggregating more days can give us more samples, but longer longitudinal studies +can experience confounding factors from unrelated changes in the environment if +the true effect size is small. We may have achieved a more precise measurement, +but changes to the environment will negatively impact our accuracy beyond the +added precision. 
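+
+As a stylized illustration with assumed numbers: if the day-to-day standard
+deviation of a fleetwide ratio is 0.10 pp, averaging 4 days gives a standard
+error of about 0.05 pp and averaging 25 days about 0.02 pp, but a 0.05 pp
+drift in the workload mix over those weeks biases both averages by more than
+the precision we gained.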
+
+## Precision and accuracy
+
+We want to avoid confusing
+[precision for accuracy](https://en.wikipedia.org/wiki/Accuracy_and_precision),
+and vice-versa. Similar to how a long-duration analysis might be stymied by
+confounding factors, a carefully controlled experiment might claim a
+high-precision result without being accurate.
+
+Many load tests repeatedly replay traffic and compare the performance between a
+modified and a baseline configuration. These repeated runs can deliver large
+sample sizes that produce estimates with low standard deviations, instilling a
+(potentially false) belief that the result is accurate. By construction, load
+tests are [simplifications](/fast/39) of production and so they may omit certain
+[workload types, platforms, or environment features](/fast/94) that would be
+seen in widespread deployment.
+
+Even with tools like production A/B experiments, which can help give us accurate
+results by testing against the full variety of workloads, we still must account
+for nuances in the technique:
+
+* Our sample has to be representative. For example, an A/B test in a single
+  data center will encounter a different subset of workloads than the fleet as
+  a whole.
+* Our control and experiment groups have to be
+  [sufficiently isolated](/fast/95) from each other.
+* Our experimental setup needs to avoid statistical pitfalls.
+
+Spending time to account for these factors makes sense if we're making real
+gains in accuracy: Large-scale data locality and contention effects can be
+challenging to measure by other means. For small changes without these
+externalities, we may expend a lot of measurement effort without obtaining a
+better answer.
+
+## Increasing statistical power through aggregation
+
+[Small changes](https://abseil.io/resources/swe-book/html/ch09.html#write_small_changes)
+create agility by breaking work into manageable, focused chunks, which aids
+testing, debugging, and review. This can complicate a measurement story, since
+it adds more pieces that we need to track.
+
+Rather than allow the measurement tail to wag the dog by forcing us to combine
+unrelated optimizations or forgo them entirely, we may opt to track the
+aggregate benefit from a series of optimizations made over a brief period of
+time.
+
+* As we iteratively optimized TCMalloc's fast path, we could track the overall
+  trend and the performance impact against the baseline. Each individual
+  change could be evaluated through inspection of assembly or microbenchmarks.
+* During the SwissMap migration, we could track the overall trend in
+  associative container costs. Based on the overwhelming performance results
+  for early deployments, each individual change could be primarily evaluated
+  for correctness, with careful benchmarking reserved for only a fraction of
+  the changes in the migration.
+
+Even if we're pursuing a general strategy of aggregating similar changes, we may
+want to separate out the effects of more tradeoff-heavy optimizations. A change
+that creates subtle edge cases or long-lived technical debt might be worth it if
+it's a sufficiently large optimization. If we don't end up realizing the
+benefits, we might prefer to roll it back in favor of looking for simpler
+approaches.
+
+## Balancing complexity and simplicity
+
+Aiming for greater accuracy can tempt us to pursue more complex methodologies.
+An approach that is easier to implement, but captures most of the impact and +regression signals from our project, can afford us more time to work on the +actual optimizations, look for externalities, and be just as accurate in the +end. + +* Importantly, a simple measurement should not be confused with an incomplete + one. The goal is to avoid needless precision, not to overlook significant + impacts like transitive costs on other systems or shifts between different + resource dimensions. + + The more complex approach that captures small effects can draw + [attention](https://en.wikipedia.org/wiki/Law_of_triviality) to its details, + causing us to overlook the larger effects that we missed. Even though we + explicitly made a good faith attempt to quantify everything, it is easier to + focus on the words on the page than the words absent from it. + +* While we might capture additional phenomena relevant to performance, the + additional factors we consider introduce new sources of error for us. + Joining multiple data sources can be tempting; but if we are not familiar + with their idiosyncrasies, we might end up with surprising (or wrong) + results. + +* Simple is explainable. "[This performance optimization only has an effect + during Australian business hours](/fast/88)" might be completely correct due + to diurnal effects, but it is harder to see the immediate connection between + what our approach left implicit and a causal explanation. + +When we consider additional factors for measuring a project, we should try to +gauge the contribution of each relative to the significant figures provided by +the others. If one signal contributes 1% +/- 0.1pp, it will overwhelm another +term that contributes 0.01% +/- 0.001pp. + +## Conclusion + +In choosing a measurement strategy, we want to strike a balance between +precision, accuracy, and effort. Accurate measurements can help us to continue +pursuing fruitful areas for adjacent optimizations, but we should be mindful +where increasing effort produces diminishing returns. diff --git a/_posts/2025-10-07-fast-99.md b/_posts/2025-10-07-fast-99.md new file mode 100644 index 0000000..7f4812d --- /dev/null +++ b/_posts/2025-10-07-fast-99.md @@ -0,0 +1,611 @@ +--- +title: "Performance Tip of the Week #99: Illuminating the processor core with llvm-mca" +layout: fast +sidenav: side-nav-fast.html +published: true +permalink: fast/99 +type: markdown +order: "099" +--- + +Originally posted as Fast TotW #99 on September 29, 2025 + +*By [Chris Kennelly](mailto:ckennelly@google.com)* + +Updated 2025-10-07 + +Quicklink: [abseil.io/fast/99](https://abseil.io/fast/99) + + +The [RISC](https://en.wikipedia.org/wiki/Reduced_instruction_set_computer) +versus [CISC](https://en.wikipedia.org/wiki/Complex_instruction_set_computer) +debate ended in a draw: Modern processors decompose instructions into +[micro-ops](https://en.wikipedia.org/wiki/Micro-operation) handled by backend +execution units. Understanding how instructions are executed by these units can +give us insights on optimizing key functions that are backend bound. In this +episode, we walk through using +[`llvm-mca`](https://llvm.org/docs/CommandGuide/llvm-mca.html) to analyze +functions and identify performance insights from its simulation. + +## Preliminaries: Varint optimization + +`llvm-mca`, short for Machine Code Analyzer, is a tool within LLVM. It uses the +same datasets that the compiler uses for making instruction scheduling +decisions. 
This ensures that improvements made to compiler optimizations +automatically flow towards keeping `llvm-mca` representative. The flip side is +that the tool is only as good as LLVM's internal modeling of processor designs, +so certain quirks of individual microarchitecture generations might be omitted. +It also models the processor [behavior statically](#limitations), so cache +misses, branch mispredictions, and other dynamic properties aren't considered. + +Consider Protobuf's `VarintSize64` method: + +
+size_t CodedOutputStream::VarintSize64(uint64_t value) {
+#if PROTOBUF_CODED_STREAM_H_PREFER_BSR
+  // Explicit OR 0x1 to avoid calling absl::countl_zero(0), which
+  // requires a branch to check for on platforms without a clz instruction.
+  uint32_t log2value = (std::numeric_limits<uint64_t>::digits - 1) -
+                       absl::countl_zero(value | 0x1);
+  return static_cast<size_t>((log2value * 9 + (64 + 9)) / 64);
+#else
+  uint32_t clz = absl::countl_zero(value);
+  return static_cast<size_t>(
+      ((std::numeric_limits<uint64_t>::digits * 9 + 64) - (clz * 9)) / 64);
+#endif
+}
+
+ +This function calculates how many bytes an encoded integer will consume in +[Protobuf's wire +format](https://protobuf.dev/programming-guides/encoding/#varints). It first +computes the number of bits needed to represent the value by finding the log2 +size of the input, then approximates division by 7. The size of the input can be +calculated using the `absl::countl_zero` function. However this has two possible +implementations depending on whether the processor has a `lzcnt` (Leading Zero +Count) instruction available or if this operation needs to instead leverage the +`bsr` (Bit Scan Reverse) instruction. + +Under the hood of `absl::countl_zero`, we need to check whether the argument is +zero, since `__builtin_clz` (Count Leading Zeros) models the behavior of x86's +`bsr` (Bit Scan Reverse) instruction and has unspecified behavior if the input +is 0. The `| 0x1` avoids needing a branch by ensuring the argument is non-zero +in a way the compiler can follow. + +When we have `lzcnt` available, the compiler optimizes `x == 0 ? 32 : +__builtin_clz(x)` in `absl::countl_zero` to `lzcnt` without branches. This makes +the `| 0x1` unnecessary. + +Compiling this gives us two different assembly sequences depending on whether +the `lzcnt` instruction is available or not: + +`bsr` (`-march=ivybridge`): + +
+orq     $1, %rdi
+bsrq    %rdi, %rax
+leal    (%rax,%rax,8), %eax
+addl    $73, %eax
+shrl    $6, %eax
+
+ +`lzcnt` (`-march=haswell`): + +
+lzcntq  %rdi, %rax
+leal    (%rax,%rax,8), %ecx
+movl    $640, %eax
+subl    %ecx, %eax
+shrl    $6, %eax
+
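+
+For reference, the zero-handling described above can also be written directly in
+source form. The following is a minimal sketch with an illustrative helper name,
+not Abseil's actual implementation:
+
+```
+#include <cstdint>
+
+// __builtin_clzll is undefined for a zero input, so a portable countl_zero has
+// to guard against it. With lzcnt available (e.g., -march=haswell), the
+// compiler folds this pattern into a single branch-free lzcnt instruction,
+// which is why the `| 0x1` trick is only needed on the `bsr` path.
+inline int CountLeadingZeros64(uint64_t x) {
+  return x == 0 ? 64 : __builtin_clzll(x);
+}
+```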
+ +### Analyzing the code + +We can now use [Compiler Explorer](https://godbolt.org/z/39EMsWq7z) to run these +sequences through `llvm-mca` and get an analysis of how they would execute on a +simulated Skylake processor (`-mcpu=skylake`) for a single invocation +(`-iterations=1`) and include `-timeline`: + +`bsr` (`-march=ivybridge`): + +
+Iterations:        1
+Instructions:      5
+Total Cycles:      10
+Total uOps:        5
+
+Dispatch Width:    6
+uOps Per Cycle:    0.50
+IPC:               0.50
+Block RThroughput: 1.0
+
+Timeline view:
+Index     0123456789
+
+[0,0]     DeER .   .   orq      $1, %rdi
+[0,1]     D=eeeER  .   bsrq     %rdi, %rax
+[0,2]     D====eER .   leal     (%rax,%rax,8), %eax
+[0,3]     D=====eER.   addl     $73, %eax
+[0,4]     D======eER   shrl     $6, %eax
+
+ +`lzcnt` (`-march=haswell`): + +
+Iterations:        1
+Instructions:      5
+Total Cycles:      9
+Total uOps:        5
+
+Dispatch Width:    6
+uOps Per Cycle:    0.56
+IPC:               0.56
+Block RThroughput: 1.0
+
+Timeline view:
+Index     012345678
+
+[0,0]     DeeeER  .   lzcntq    %rdi, %rax
+[0,1]     D===eER .   leal      (%rax,%rax,8), %ecx
+[0,2]     DeE---R .   movl      $640, %eax
+[0,3]     D====eER.   subl      %ecx, %eax
+[0,4]     D=====eER   shrl      $6, %eax
+
+
+This can also be obtained via the command line:
+
+```
+$ clang file.cpp -O3 --target=x86_64 -S -o - | llvm-mca -mcpu=skylake -iterations=1 -timeline
+```
+
+There are two sections to this output: the first provides some summary
+statistics for the code, and the second covers the execution "timeline." The
+timeline provides interesting detail about how instructions flow through the
+execution pipelines in the processor. There are three columns, and each
+instruction is shown on a separate row. The three columns are as follows:
+
+ * The first column is a pair of numbers: the first number is the iteration,
+   and the second is the index of the instruction. In the above example
+   there's a single iteration, number 0, so just the index of the instruction
+   changes on each row.
+ * The third column is the instruction.
+ * The second column is the timeline. Each character in that column represents
+   a cycle, and the character indicates what's happening to the instruction in
+   that cycle.
+
+The timeline is counted in cycles. Each instruction goes through several steps:
+
+ * `D`: the instruction is dispatched by the processor; modern desktop or
+   server processors can dispatch many instructions per cycle. Little Arm cores
+   like the Cortex-A55 used in smartphones are more limited.
+ * `=`: the instruction is waiting to execute. In this case, the instructions
+   are waiting for the results of prior instructions to be available. In other
+   cases, there might be a bottleneck in the processor's backend.
+ * `e`: the instruction is executing.
+ * `E`: the instruction's output is available.
+ * `-`: the instruction has completed execution and is waiting to be retired.
+   Instructions generally retire in program order, i.e., the order in which
+   they appear in the program. An instruction will wait to retire until prior
+   ones have also retired. On some architectures like the Cortex-A55, there is
+   no `R` phase in the timeline as some instructions [retire
+   out-of-order](https://chipsandcheese.com/i/149874023/out-of-order-retire).
+ * `R`: the instruction has been retired, and is no longer occupying execution
+   resources.
+
+The output is lengthy, but we can extract a few high-level insights from it:
+
+ * The `lzcnt` implementation is quicker to execute (9 cycles) than the `bsr`
+   implementation (10 cycles). This is seen under the `Total Cycles` summary as
+   well as in the timeline.
+ * The routine is simple: with the exception of `movl`, the instructions depend
+   on each other sequentially (each instruction's first `e` lines up vertically
+   with the previous instruction's `E` in the timeline view).
+ * Bitwise-`or` of `0x1` delays `bsrq`'s input being available by 1 cycle,
+   contributing to the longer execution cost.
+ * Note that although `movl` starts immediately in the `lzcnt` implementation,
+   it can't retire until prior instructions are retired, since we retire in
+   program order.
+ * Both sequences are 5 instructions long, but the `lzcnt` implementation has
+   higher [instruction-level parallelism
+   (ILP)](https://en.wikipedia.org/wiki/Instruction-level_parallelism) because
+   the `mov` has no dependencies. This demonstrates that counting instructions
+   need not tell us the [cycle cost](/fast/7).
+
+`llvm-mca` is flexible and can model other processors:
+
+ * AMD Zen3 ([Compiler Explorer](https://godbolt.org/z/xbq9PqG8z)), where the
+   cost difference is more stark (8 cycles versus 12 cycles).
+ * Arm Neoverse-V2 ([Compiler Explorer](https://godbolt.org/z/sE3T65n8M)), a
+   server CPU where the difference is 7 vs 9 cycles.
+ * Arm Cortex-A55 ([Compiler Explorer](https://godbolt.org/z/vPP5EPrcW)), a + popular little core used in smartphones, where the difference is 8 vs 10 + cycles. Note how the much smaller dispatch width results in the `D` phase of + instructions starting later. + +### Throughput versus latency + +When designing [microbenchmarks](/fast/75), we sometimes want to distinguish +between throughput and latency microbenchmarks. If the input of one benchmark +iteration does not depend on the prior iteration, the processor can execute +multiple iterations in parallel. Generally for code that is expected to execute +in a loop we care more about throughput, and for code that is inlined in many +places interspersed with other logic we care more about latency. + +We can use `llvm-mca` to model execution of the block of code in a tight loop. +By specifying `-iterations=100` on the `lzcnt` version, we get a very different +set of results because one iteration's execution can overlap with the next: + +
+Iterations:        100
+Instructions:      500
+Total Cycles:      134
+Total uOps:        500
+
+Dispatch Width:    6
+uOps Per Cycle:    3.73
+IPC:               3.73
+Block RThroughput: 1.0
+
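+
+For reference, this throughput-style run can be reproduced on the command line.
+The invocation below is a sketch: it assumes the `lzcnt` version is compiled
+with `-march=haswell` and uses `file.cpp` as a stand-in for the source file, as
+in the earlier example:
+
+```
+$ clang file.cpp -O3 --target=x86_64 -march=haswell -S -o - | llvm-mca -mcpu=skylake -iterations=100
+```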
+ +We were able to execute 100 iterations in only 134 cycles (1.34 cycles/element) +by achieving high ILP. + +Achieving the best performance may sometimes entail trading off the latency of a +basic block in favor of higher throughput. Inside of the protobuf implementation +of `VarintSize` +([protobuf/wire_format_lite.cc](https://github.com/protocolbuffers/protobuf/tree/main/src/google/protobuf/wire_format_lite.cc)), +we use a vectorized version for realizing higher throughput albeit with worse +latency. A single iteration of the loop takes 29 cycles to process 32 elements +([Compiler Explorer](https://godbolt.org/z/TczKTaGd8)) for 0.91 cycles/element, +but 100 iterations (3200 elements) only requires 1217 cycles (0.38 +cycles/element - about 3x faster) showcasing the high throughput once setup +costs are amortized. + +## Understanding dependency chains + +When we are looking at CPU profiles, we are often tracking when instructions +*retire*. Costs are attributed to instructions that took longer to retire. +Suppose we profile a small function that accesses memory pseudo-randomly: + +
+unsigned Chains(unsigned* x) {
+   unsigned a0 = x[0];
+   unsigned b0 = x[1];
+
+   unsigned a1 = x[a0];
+   unsigned b1 = x[b0];
+
+   unsigned b2 = x[b1];
+
+   return a1 | b2;
+}
+
+ +`llvm-mca` models memory loads being an L1 hit ([Compiler +Explorer](https://godbolt.org/z/6PzTPYv8T)): It takes 5 cycles for the value of +a load to be available after the load starts execution. The output has been +annotated with the source code to make it easier to read. + +
+Iterations:        1
+Instructions:      6
+Total Cycles:      19
+Total uOps:        9
+
+Dispatch Width:    6
+uOps Per Cycle:    0.47
+IPC:               0.32
+Block RThroughput: 3.0
+
+Timeline view:
+                    012345678
+Index     0123456789
+
+
+[0,0]     DeeeeeER  .    .  .   movl    (%rdi), %ecx         // ecx = a0 = x[0]
+[0,1]     DeeeeeER  .    .  .   movl    4(%rdi), %eax        // eax = b0 = x[1]
+[0,2]     D=====eeeeeER  .  .   movl    (%rdi,%rax,4), %eax  // eax = b1 = x[b0]
+[0,3]     D==========eeeeeER.   movl    (%rdi,%rax,4), %eax  // eax = b2 = x[b1]
+[0,4]     D==========eeeeeeER   orl     (%rdi,%rcx,4), %eax  // eax |= a1 = x[a0]
+[0,5]     .DeeeeeeeE--------R   retq
+
+ +In this timeline the first two instructions load `a0` and `b0`. Both of these +operations can happen immediately. However, the load of `x[b0]` can only happen +once the value for `b0` is available in a register - after a 5 cycle delay. The +load of `x[b1]` can only happen once the value for `b1` is available after +another 5 cycle delay. + +This program has two places where we can execute loads in parallel: the pair +`a0` and `b0` and the pair `a1 and b1` (note: `llvm-mca` does not correctly +model the memory load uop from `orl` for `a1` starting). Since the processor +retires instructions in program order we expect the profile weight to appear on +the loads for `a0`, `b1`, and `b2`, even though we had parallel loads in-flight +simultaneously. + +If we examine this profile, we might try to optimize one of the memory +indirections because it appears in our profile. We might do this by miraculously +replacing `a0` with a constant ([Compiler +Explorer](https://godbolt.org/z/b68KzsKMP)). + +
+unsigned Chains(unsigned* x) {
+   unsigned a0 = 0;
+   unsigned b0 = x[1];
+
+   unsigned a1 = x[a0];
+   unsigned b1 = x[b0];
+
+   unsigned b2 = x[b1];
+
+   return a1 | b2;
+}
+
+ +
+Iterations:        1
+Instructions:      5
+Total Cycles:      19
+Total uOps:        8
+
+Dispatch Width:    6
+uOps Per Cycle:    0.42
+IPC:               0.26
+Block RThroughput: 2.5
+
+Timeline view:
+                    012345678
+Index     0123456789
+
+[0,0]     DeeeeeER  .    .  .   movl    4(%rdi), %eax
+[0,1]     D=====eeeeeER  .  .   movl    (%rdi,%rax,4), %eax
+[0,2]     D==========eeeeeER.   movl    (%rdi,%rax,4), %eax
+[0,3]     D==========eeeeeeER   orl     (%rdi), %eax
+[0,4]     .DeeeeeeeE--------R   retq
+
+ +Even though we got rid of the "expensive" load we saw in the CPU profile, we +didn't actually change the overall length of the critical path that was +dominated by the 3 load long "b" chain. The timeline view shows the critical +path for the function, and performance can only be improved if the duration of +the critical path is reduced. + +## Optimizing CRC32C + +CRC32C is a common hashing function and modern architectures include dedicated +instructions for calculating it. On short sizes, we're largely dealing with +handling odd numbers of bytes. For large sizes, we are constrained by repeatedly +invoking `crc32q` (x86) or similar every few bytes of the input. By examining +the repeated invocation, we can look at how the processor will execute it +([Compiler Explorer](https://godbolt.org/z/nEYsWWzWs)): + +
+uint32_t BlockHash() {
+ asm volatile("# LLVM-MCA-BEGIN");
+ uint32_t crc = 0;
+ for (int i = 0; i < 16; ++i) {
+   crc = _mm_crc32_u64(crc, i);
+ }
+ asm volatile("# LLVM-MCA-END" : "+r"(crc));
+ return crc;
+}
+
+ +This function doesn't hash anything useful, but it allows us to see the +back-to-back usage of one `crc32q`'s output with the next `crc32q`'s inputs. + +
+Iterations:        1
+Instructions:      32
+Total Cycles:      51
+Total uOps:        32
+
+Dispatch Width:    6
+uOps Per Cycle:    0.63
+IPC:               0.63
+Block RThroughput: 16.0
+
+Instruction Info:
+
+[1]: #uOps
+[2]: Latency
+[3]: RThroughput
+[4]: MayLoad
+[5]: MayStore
+[6]: HasSideEffects (U)
+
+[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
+1      0     0.17                        xorl   %eax, %eax
+1      3     1.00                        crc32q %rax, %rax
+1      1     0.25                        movl   $1, %ecx
+1      3     1.00                        crc32q %rcx, %rax
+...
+
+Resources:
+[0]   - SKLDivider
+[1]   - SKLFPDivider
+[2]   - SKLPort0
+[3]   - SKLPort1
+[4]   - SKLPort2
+[5]   - SKLPort3
+[6]   - SKLPort4
+[7]   - SKLPort5
+[8]   - SKLPort6
+[9]   - SKLPort7
+
+
+Resource pressure per iteration:
+[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]
+-      -     4.00   18.00   -     1.00    -     5.00   6.00    -
+
+Resource pressure by instruction:
+[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    Instructions:
+-      -      -      -      -      -      -      -      -      -     xorl       %eax, %eax
+-      -      -     1.00    -      -      -      -      -      -     crc32q     %rax, %rax
+-      -      -      -      -      -      -      -     1.00    -     movl       $1, %ecx
+-      -      -     1.00    -      -      -      -      -      -     crc32q     %rcx, %rax
+-      -      -      -      -      -      -     1.00    -      -     movl       $2, %ecx
+-      -      -     1.00    -      -      -      -      -      -     crc32q     %rcx, %rax
+...
+-      -      -      -      -      -      -      -     1.00    -     movl       $15, %ecx
+-      -      -     1.00    -      -      -      -      -      -     crc32q     %rcx, %rax
+-      -      -      -      -     1.00    -     1.00   1.00    -     retq
+
+
+Timeline view:
+                    0123456789          0123456789          0
+Index     0123456789          0123456789          0123456789
+
+[0,0]     DR   .    .    .    .    .    .    .    .    .    .   xorl    %eax, %eax
+[0,1]     DeeeER    .    .    .    .    .    .    .    .    .   crc32q  %rax, %rax
+[0,2]     DeE--R    .    .    .    .    .    .    .    .    .   movl    $1, %ecx
+[0,3]     D===eeeER .    .    .    .    .    .    .    .    .   crc32q  %rcx, %rax
+[0,4]     DeE-----R .    .    .    .    .    .    .    .    .   movl    $2, %ecx
+[0,5]     D======eeeER   .    .    .    .    .    .    .    .   crc32q  %rcx, %rax
+...
+[0,30]    .    DeE---------------------------------------R  .   movl    $15, %ecx
+[0,31]    .    D========================================eeeER   crc32q  %rcx, %rax
+
+ +Based on the "`Instruction Info`" table, `crc32q` has latency 3 and throughput +1: Every clock cycle, we can start processing a new invocation on port 1 (`[3]` +in the table), but it takes 3 cycles for the result to be available. + +Instructions decompose into individual micro operations (or "uops"). The +resources section lists the processor execution pipelines (often referred to as +ports). Every cycle uops can be issued to these ports. There are constraints - +no port can take every kind of uop and there is a maximum number of uops that +can be dispatched to the processor pipelines every cycle. + +For the instructions in our function, there is a one-to-one correspondence so +the number of instructions and the number of uops executed are equivalent (32). +The processor has several backends for processing uops. From the resource +pressure tables, we see that while `crc32` must execute on port 1, the `movl` +executes on any of ports 0, 1, 5, and 6. + +In the timeline view, we see that for our back-to-back sequence, we can't +actually begin processing the 2nd `crc32q` for several clock cycles until the +1st `crc32q` hasn't completed. This tells us that we're underutilizing port 1's +capabilities, since its throughput indicates that an instruction can be +dispatched to it once per cycle. + +If we restructure `BlockHash` to compute 3 parallel streams with a simulated +combine function (the code uses a bitwise or as a placeholder for the correct +logic that this approach requires), we can accomplish the same amount of work in +fewer clock cycles ([Compiler Explorer](https://godbolt.org/z/ha9KdYovh)): + +
+uint32_t ParallelBlockHash(const char* p) {
+ uint32_t crc0 = 0, crc1 = 0, crc2 = 0;
+ for (int i = 0; i < 5; ++i) {
+   crc0 = _mm_crc32_u64(crc0, 3 * i + 0);
+   crc1 = _mm_crc32_u64(crc1, 3 * i + 1);
+   crc2 = _mm_crc32_u64(crc2, 3 * i + 2);
+ }
+ crc0 = _mm_crc32_u64(crc0, 15);
+ return crc0 | crc1 | crc2;
+}
+
+ +
+Iterations:        1
+Instructions:      36
+Total Cycles:      22
+Total uOps:        36
+
+Dispatch Width:    6
+uOps Per Cycle:    1.64
+IPC:               1.64
+Block RThroughput: 16.0
+
+Timeline view:
+                    0123456789
+Index     0123456789          01
+
+[0,0]     DR   .    .    .    ..   xorl %eax, %eax
+[0,1]     DR   .    .    .    ..   xorl %ecx, %ecx
+[0,2]     DeeeER    .    .    ..   crc32q       %rcx, %rcx
+[0,3]     DeE--R    .    .    ..   movl $1, %esi
+[0,4]     D----R    .    .    ..   xorl %edx, %edx
+[0,5]     D=eeeER   .    .    ..   crc32q       %rsi, %rdx
+[0,6]     .DeE--R   .    .    ..   movl $2, %esi
+[0,7]     .D=eeeER  .    .    ..   crc32q       %rsi, %rax
+[0,8]     .DeE---R  .    .    ..   movl $3, %esi
+[0,9]     .D==eeeER .    .    ..   crc32q       %rsi, %rcx
+[0,10]    .DeE----R .    .    ..   movl $4, %esi
+[0,11]    .D===eeeER.    .    ..   crc32q       %rsi, %rdx
+...
+[0,32]    .    DeE-----------R..   movl $15, %esi
+[0,33]    .    D==========eeeER.   crc32q       %rsi, %rcx
+[0,34]    .    D============eER.   orl  %edx, %eax
+[0,35]    .    D=============eER   orl  %ecx, %eax
+
+ +The implementation invokes `crc32q` the same number of times, but the end-to-end +latency of the block is 22 cycles instead of 51 cycles The timeline view shows +that the processor can issue a `crc32` instruction every cycle. + +This modeling can be evidenced by [microbenchmark](/fast/75) results for +`absl::ComputeCrc32c` +([absl/crc/crc32c_benchmark.cc](https://github.com/abseil/abseil-cpp/blob/master/absl/crc/crc32c_benchmark.cc)). +The real implementation uses multiple streams (and correctly combines them). +Ablating these shows a regression, validating the value of the technique. + +
+name                        base CYCLES/op   ablated CYCLES/op   vs base
+BM_Calculate/0                   5.007 ± 0%     5.008 ± 0%         ~ (p=0.149 n=6)
+BM_Calculate/1                   6.669 ± 1%     8.012 ± 0%   +20.14% (p=0.002 n=6)
+BM_Calculate/100                 30.82 ± 0%     30.05 ± 0%    -2.49% (p=0.002 n=6)
+BM_Calculate/2048                285.6 ± 0%     644.8 ± 0%  +125.78% (p=0.002 n=6)
+BM_Calculate/10000               906.7 ± 0%    3633.8 ± 0%  +300.78% (p=0.002 n=6)
+BM_Calculate/500000             37.77k ± 0%   187.69k ± 0%  +396.97% (p=0.002 n=6)
+
+
+If we create a 4th stream for `ParallelBlockHash` ([Compiler
+Explorer](https://godbolt.org/z/eo36ejca7)), `llvm-mca` shows that the overall
+latency is unchanged since we are bottlenecked on port 1's throughput. Unrolling
+further adds overhead to combine the streams and makes prefetching harder
+without actually improving performance.
+
+To improve performance, many fast CRC32C implementations use other processor
+features. Instructions like carryless multiply (`pclmulqdq` on x86) can be used
+to implement another parallel stream. This allows additional ILP to be extracted
+by using the other ports of the processor without worsening the bottleneck on
+the port used by `crc32`.
+
+## Limitations
+
+While `llvm-mca` can be a useful tool in many situations, its modeling has
+limits:
+
+ * Memory accesses are modeled as L1 hits. In the real world, we can have much
+   longer stalls when we need to access the L2, L3, or even [main
+   memory](/fast/62).
+
+ * It cannot model branch predictor behavior.
+
+ * It does not model instruction fetch and decode steps.
+
+ * Its analysis is only as good as LLVM's processor models. If these do not
+   accurately model the processor, the simulation might differ from the real
+   processor's behavior.
+
+   For example, many ARM processor models are incomplete, and `llvm-mca` picks
+   a processor model that it estimates to be a good substitute; this is
+   generally fine for compiler heuristics, where differences only matter if
+   they would result in different generated code, but it can derail manual
+   optimization efforts.
+
+## Closing words
+
+Understanding how the processor executes and retires instructions can give us
+powerful insights for optimizing functions. `llvm-mca` lets us peer into the
+processor to understand bottlenecks and underutilized resources.
+
+## Further reading
+
+ * [uops.info](https://uops.info)
+ * Chandler Carruth's "[Going Nowhere
+   Faster](https://www.youtube.com/watch?v=2EWejmkKlxs)" talk
+ * Agner Fog's "[Software Optimization
+   Resources](https://www.agner.org/optimize/)"
diff --git a/_posts/2025-11-27-fast-97.md b/_posts/2025-11-27-fast-97.md
new file mode 100644
index 0000000..b986feb
--- /dev/null
+++ b/_posts/2025-11-27-fast-97.md
@@ -0,0 +1,287 @@
+---
+title: "Performance Tip of the Week #97: Virtuous ecosystem cycles"
+layout: fast
+sidenav: side-nav-fast.html
+published: true
+permalink: fast/97
+type: markdown
+order: "097"
+---
+
+Originally posted as Fast TotW #97 on August 21, 2025
+
+*By [Chris Kennelly](mailto:ckennelly@google.com)*
+
+Updated 2025-11-27
+
+Quicklink: [abseil.io/fast/97](https://abseil.io/fast/97)
+
+
+Software ecosystems aim to maximize qualities like efficiency, correctness, and
+reliability while minimizing the costs of achieving these properties. Improving
+a single service through customization can help build an optimization more
+expediently, but it has an inherently limited upside and increases
+[technical debt](/fast/52). A point solution fails to provide the full benefits
+from applying features horizontally. In this episode, we discuss how lessons
+learned from partnerships to improve individual applications can improve
+efficiency for everyone.
+
+## Case studies
+
+We look at several case studies where we took a feature that showed benefits
+for a single team and rolled it out to provide benefits for all of Google.
+ +### SwissMap + +Abseil's +[hash table implementation](https://abseil.io/blog/20180927-swisstables), +SwissMap, originated out of a partnership between our C++ core libraries team +and two Zurich-based engineers, Alkis Evlogimenos and Roman Perepelitsa, on our +search indexing team. They had set out to make an improved hash table that used +open addressing for [fewer data indirections](/fast/83), improved the design to +reduce memory usage, and added API features to avoid unnecessary copies. Jeff +Dean and Sanjay Ghemawat had suggested a SIMD-accelerated control block to speed +up comparison, allowing the table to efficiently run at a higher load factor. + +While hash tables are a key tool in a programmer's toolbox, they tend to not +span API boundaries to nearly the same degree as other +[vocabulary types](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p2125r0.pdf) +like `std::string` and `std::vector`. For the indexing team's purposes, most of +the benefits for their application could be obtained by replacing their hottest +hashtables and declaring victory. + +But rather than stopping, the team opted to drive widespread adoption of the +superior hashtable. Over tens of thousands of changes later, nearly every hash +table in Google's codebase is a SwissMap. The early years of the rollout saved a +substantial amount of CPU and RAM. These savings were only unlocked because we +pursued a scaled migration. + +The broad rollout produced ongoing ecosystem dividends: + +* "Just use SwissMap" was good advice, but for an engineer trying to + [develop a new feature](https://queue.acm.org/detail.cfm?id=3733703), it is + easier to use the existing types, even if SwissMap would be a better fit for + performance. The widespread usage of SwissMap made it easier for new code to + reach for it. +* Drawing on lessons from prior hash table changes, [randomization](/fast/93) + was introduced to prevent code from relying on the order of iteration. This + made it easier to land subsequent optimizations under the hood, allowing us + to iteratively improve the hash function by changing its algorithm. +* Because of our implementation freedom, we were able to add telemetry + features like [built-in profiling](/fast/26) that work across the fleet. + This has allowed us to find and fix bad hash functions, identify + optimizations for small tables, and proactively `reserve` containers to + their final sizes. + +### Size class-driven allocation + +Modern memory allocators like TCMalloc use +"[size classes](https://github.com/google/tcmalloc/blob/master/docs/design.md#objects)" +to efficiently cache small objects. Rather than fit each allocation precisely, +the allocator determines which size class a request falls into and checks the +freelist for that size class. When an object is freed, the allocator determines +which size class the memory belongs to and puts the object on the appropriate +freelist for future reuse. This design simplifies bookkeeping and reduces the +cost of allocating and deallocating memory. + +#### Sized deallocation + +Using size classes requires a memory allocator to determine which freelist to +put freed objects onto. TCMalloc's original mapping approach required +[several pointer dereferences](/fast/83). In 2012, Google's compiler +optimization team identified that the object's size is frequently already known +outside of the allocator. 
Passing this size as an
+[argument to the allocator](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3778.html)
+avoids the expensive lookup.
+
+Once sized deallocation was implemented, several teams were able to adopt the
+feature to reduce allocation costs. This optimization can exacerbate
+[existing undefined behavior](https://github.com/google/tcmalloc/blob/master/docs/mismatched-sized-delete.md#tcmalloc-is-buggy)
+or introduce bugs in the rare case of tail-padded structures, requiring would-be
+adopters to clean up their code and transitive dependencies. Because of these
+lurking issues, the default remained "unsized" delete.
+
+Motivated by
+[fleetwide profiles showing the horizontal cost of TCMalloc](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/44271.pdf),
+work began to enable sized deallocation for the entire codebase.
+
+* [Runtime assertions](/fast/93) were added to TCMalloc to catch bugs in debug
+  builds.
+* [Test failures](https://abseil.io/resources/swe-book/html/ch11.html#:~:text=The%20Beyonc%C3%A9%20Rule&text=The%20straightforward%20answer%20is%3A%20test,an%20automated%20test%20for%20it.)
+  were investigated and numerous fixes submitted.
+* The feature was progressively rolled out to Address Sanitizer, debug, and
+  optimized builds, preventing backsliding.
+
+Infrastructure improvements, like the runtime assertions themselves, cost the
+same to develop for one team as they would for widespread deployment. For
+other steps, like enabling sized deallocation in tests, operating centrally on
+all tests was cheaper and easier to do than asking each team to set up a
+parallel set of tests for themselves and their dependencies.
+
+While the project delivered on its original goal of reducing TCMalloc costs in
+the fleet, the central effort strengthened the ecosystem's reliability as a
+whole by creating a sturdier foundation:
+
+* By enabling sized deallocation for all tests, teams already relying on the
+  optimization benefited from the prevention of new bugs in their transitive
+  dependencies. In the past, bugs would manifest as production crashes
+  requiring teams to play whack-a-mole to set up extra testing. Moving the
+  entire platform to sized delete foreclosed this entire class of bugs while
+  avoiding the cost of duplicative and ad hoc testing setups.
+* The added information allows TCMalloc to
+  [occasionally double-check](https://dl.acm.org/doi/10.1145/3639477.3640328)
+  the purported size, allowing further memory corruption bugs to be uncovered.
+  This was only possible because size arguments are routinely passed to
+  deallocations.
+* The centralized burndown uncovered a relatively rare, but important, pattern
+  of tail-padded structures not working well with sized deallocation. This
+  spurred C++20's
+  [destroying delete](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0722r3.html)
+  feature, which may not have been standardized and adopted universally
+  without a central vantage point of its value.
+
+Had we eschewed centralized adoption, individual teams would have faced similar
+challenges, seen more bugs in production, and not benefited from the flywheel of
+follow-on improvements.
+
+#### Size class improvements
+
+Size classes simplify the logic required to allocate and deallocate memory, but
+they introduce a new optimization problem: which sizes should we select as our
+buckets?
+For many years, these partitions were chosen to minimize the amount of
+memory lost by rounding up the request to its bucket ("internal fragmentation")
+based on fleetwide profiles.
+
+Internal fragmentation is just [one metric](/fast/70) we can look at when making
+these choices. Merging two size classes would mean some smaller sizes grow,
+increasing roundup overhead, while being offset by reducing the amount of memory
+left in TCMalloc's caches at peak demand
+("[external fragmentation](https://storage.googleapis.com/gweb-research2023-media/pubtools/6213.pdf#page=4)").
+Having too few size classes would put more stress on TCMalloc's backend and
+global locks.
+
+One partnership with the Search retrieval team produced a series of
+optimizations. The experiments showed improved performance with two sets of size
+classes, "power-of-2 below 64" and "power-of-2 only." Rather than stop with just
+that service, we opted to evaluate the impact for all workloads.
+
+The former struck a tradeoff, improving CPU performance while modestly
+increasing RAM usage. Fleetwide A/B experimentation allowed this set to be
+evaluated and rolled out to improve aggregate CPU performance, a result
+comparable to the initial load tests motivating the work.
+
+Continued investigation of the "power-of-2 only" result pointed towards reducing
+the amount of time an object might sit on one of TCMalloc's caches before the
+next allocation that uses it. By choosing size classes that were likely to see
+high allocation rates rather than solely minimizing internal fragmentation, we
+were able to further improve fleetwide CPU performance.
+
+We can contrast this with the hypothetical counterfactual where we made size
+class [customizability](/fast/52) a full-fledged feature of TCMalloc and
+individual teams tuned for their workloads.
+
+* Teams with the time and inclination to customize their configuration might
+  have seen even larger wins for themselves than what they obtained from the
+  global rollout, but over time, these settings might have become stale. The
+  second round of global optimizations underscores the risk that individual
+  teams might have stopped at a suboptimal point.
+* More configurations make it harder to maintain TCMalloc, and hamper
+  [implementation freedom](/fast/52).
+* The tightly controlled configuration space made it possible to land
+  improvements centrally, without having to navigate the existing
+  customizations or risks from [breaking](https://hyrumslaw.com)
+  [someone's use case](https://xkcd.com/1172/).
+
+### Turbo troubles
+
+Early work to use AVX instructions in general-purpose workloads faced the
+headwind of downclocking on the first AVX-capable platforms. In a math-intensive
+workload, a modest frequency reduction would be a tolerable tradeoff for a
+doubling in floating-point throughput. In other workloads, the compiler's
+occasional use of AVX instructions in mostly scalar code produced performance
+regressions, not wins, which deterred adoption.
+
+Changing the compiler to constrain autovectorization broke the logjam. It was
+possible to make this change because:
+
+* Most math-intensive code already used intrinsics to ensure high-quality
+  vectorization, rather than relying on autovectorization. This was unaffected
+  by the compiler change.
+* Scalar code could unlock and leverage new instruction sets like
+  [BMI2](/fast/39) from subsequent microarchitectures.
+* In places where mostly scalar code could be vectorized somewhat, the
+  compiler was constrained from introducing heavyweight vector instructions
+  that triggered downclocking.
+
+Being able to leverage the performance benefits of these new instructions
+without downsides spurred broader adoption, creating a flywheel for further
+horizontally- and vertically-driven optimizations to be built on top of it.
+
+## Lessons
+
+### Don't assume the status quo is fixed
+
+Teams working on a single application sometimes treat the underlying ecosystem
+and platform as immutable, and solve problems within that constraint. The
+platform itself is rarely immutable if there's sufficient value to be gained
+from an ecosystem-wide change.
+
+Even if it makes sense to work around a problem for reasons of expediency,
+surfacing these pain points can raise awareness of how frequently they are
+encountered. A frequent problem might look rare if everyone works around it
+rather than seeing it get fixed in the platform once and for all.
+
+### Focus on problems and outcomes, not exact solutions
+
+[Prototypes](/fast/72) are an invaluable resource for assessing how to solve a
+problem and determining what strategies actually work. It can help to step back
+to think about requirements and [desired outcomes](/fast/70). We can say "yes,
+this is an important problem worth solving," while simultaneously saying "but in
+a different way."
+
+When charting a path for scaled adoption of a prototype, we might want to use a
+different implementation strategy to target a broader addressable market, take
+advantage of existing use cases, or avoid long-term technical debt. Rarely is
+"deploy exactly this prototype" the outcome we care about, so flexibility
+towards solving the problem in different ways can let us grow our impact.
+
+### Create and use leverage
+
+As we take vertically-inspired solutions and deploy them horizontally, we are
+often looking for leverage. If there is already a broadly used library or
+feature that we can improve, we get a larger benefit than if we created an
+entirely new thing that needed adoption. Rather than work around a problem, we
+should see if we can push changes further down the stack to fix issues more
+generally and for everyone.
+
+At other times, we might need to create a lever for ourselves. As seen with
+SwissMap and AVX, driving adoption helped us to bootstrap. It created usage
+where there was none previously, and then used that lever to motivate further
+centralized optimizations.
+
+### Understand what is seen and unseen
+
+Opinionated libraries that offer relatively few configuration options may
+appear to be an impediment to optimization, but it is easy to miss the
+opportunity costs of catering too much to specific use cases or flexibility. The
+costs of the former tend to be [visible](/fast/74), while the costs of the
+latter are not.
+
+A healthier, more sustainable core library can deliver other optimizations that
+help our service of interest. For example, we can see the costs of SwissMap's
+per-instance [randomization](/fast/93) when copying.
+
+* The direct savings are clear: Removing this randomization would help improve
+  performance for copying-intensive users.
+* The indirect costs are less so: Loss of randomization would make it harder
+  to add optimizations to common operations like lookup and insertion. This
+  would hurt the broad ecosystem, likely including copy-intensive users as
+  well.
+ +## Closing words + +Deep work with individual workloads can produce novel optimization insights. +Finding ways to scale these to the broader ecosystem can increase our impact, +both by removing bottlenecks for that workload and by addressing the problem +systematically.