Merged
2 changes: 1 addition & 1 deletion _posts/2023-03-02-fast-21.md
@@ -12,7 +12,7 @@ Originally posted as Fast TotW #21 on January 16, 2020

*By [Paul Wankadia](mailto:[email protected]) and [Darryl Gove](mailto:[email protected])*

-Updated 2024-10-21
+Updated 2025-09-03

Quicklink: [abseil.io/fast/21](https://abseil.io/fast/21)

11 changes: 6 additions & 5 deletions _posts/2023-03-02-fast-39.md
@@ -12,7 +12,7 @@ Originally posted as Fast TotW #39 on January 22, 2021

*By [Chris Kennelly](mailto:[email protected]) and [Alkis Evlogimenos](mailto:[email protected])*

-Updated 2025-03-24
+Updated 2025-09-29

Quicklink: [abseil.io/fast/39](https://abseil.io/fast/39)

@@ -112,10 +112,11 @@ challenging: Microbenchmarks tend to have small working sets that tend to be
cache resident. Real code, particularly Google C++, is not.

In production, the cacheline holding `kMasks` might be evicted, leading to much
-worse stalls (hundreds of cycles to access main memory). Additionally, on x86
-processors since Haswell, this [optimization can be past its prime](/fast/9):
-BMI2's `bzhi` instruction is both faster than loading and masking *and* delivers
-more consistent performance.
+worse stalls
+([hundreds of cycles to access main memory](https://sre.google/static/pdf/rule-of-thumb-latency-numbers-letter.pdf)).
+Additionally, on x86 processors since Haswell, this
+[optimization can be past its prime](/fast/9): BMI2's `bzhi` instruction is both
+faster than loading and masking *and* delivers more consistent performance.
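The trade-off above can be sketched in code. This is a hedged illustration, not the post's actual code: `MaskViaTable` stands in for a `kMasks`-style lookup, and `BzhiEmulated` mirrors the semantics of BMI2's `bzhi` in portable C++ rather than using the `_bzhi_u64` intrinsic (which requires BMI2-enabled compilation).

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Lookup-table masking: keep the low n bits of x by ANDing with a mask
// loaded from memory. A cache miss on this table stalls for hundreds of
// cycles in production.
inline const std::array<uint64_t, 65>& Masks() {
  static const std::array<uint64_t, 65> kMasks = [] {
    std::array<uint64_t, 65> m{};
    for (int i = 0; i <= 64; ++i) {
      m[i] = (i >= 64) ? ~uint64_t{0} : ((uint64_t{1} << i) - 1);
    }
    return m;
  }();
  return kMasks;
}

inline uint64_t MaskViaTable(uint64_t x, unsigned n) { return x & Masks()[n]; }

// bzhi computes the same result in a single register-only instruction, with
// no memory access: it zeroes bits [63:n] of the source (and leaves the
// source unchanged when n >= 64). Emulated portably here.
inline uint64_t BzhiEmulated(uint64_t x, unsigned n) {
  return (n >= 64) ? x : (x & ((uint64_t{1} << n) - 1));
}
```

Both paths compute the same value; the difference the post describes is purely in where the mask comes from (memory vs. a register operation).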

When developing benchmarks for
[SwissMap](https://abseil.io/blog/20180927-swisstables), individual operations
12 changes: 7 additions & 5 deletions _posts/2023-03-02-fast-53.md
@@ -12,7 +12,7 @@ Originally posted as Fast TotW #53 on October 14, 2021

*By [Mircea Trofin](mailto:[email protected])*

-Updated 2024-11-19
+Updated 2025-09-03

Quicklink: [abseil.io/fast/53](https://abseil.io/fast/53)

@@ -73,7 +73,7 @@ the process of writing a benchmark. An example of its use may be seen
[here](https://github.com/llvm/llvm-test-suite/tree/main/MicroBenchmarks/LoopVectorization).

The benchmark harness support for performance counters consists of allowing the
-user to specify up to 3 counters in a comma-separated list, via the
+user to specify counters in a comma-separated list, via the
`--benchmark_perf_counters` flag, to be measured alongside the time measurement.
Just like time measurement, each counter value is captured right before the
benchmarked code is run, and right after. The difference is reported to the user
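The harness internals are not shown in the post; the following is only a sketch of the before/after delta pattern just described, with a fake in-process counter standing in for a hardware perf event (all names here are assumptions for illustration).

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Stand-in for reading a hardware performance counter. In the real
// harness this would be a perf-event read; here it is a plain variable
// so the sketch is self-contained.
static uint64_t g_fake_counter = 0;
inline uint64_t ReadFakeCounter() { return g_fake_counter; }

// The pattern described above: snapshot the counter immediately before
// the benchmarked code runs and immediately after; the difference is
// what gets reported alongside the time measurement.
inline uint64_t MeasureDelta(const std::function<void()>& benchmarked_code) {
  const uint64_t before = ReadFakeCounter();
  benchmarked_code();
  const uint64_t after = ReadFakeCounter();
  return after - before;  // reported to the user
}
```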
@@ -131,13 +131,15 @@ instructions, and 6 memory ops per iteration.

- *Number of counters*: At most 32 events may be requested for simultaneous
collection. Note however, that the number of hardware counters available is
-much lower (usually 4-8 on modern CPUs) -- requesting more events than the
+much lower (usually 4-8 on modern CPUs, see
+`PerfCounterValues::kMaxCounters`) -- requesting more events than the
hardware counters will cause
[multiplexing](https://perf.wiki.kernel.org/index.php/Tutorial#multiplexing_and_scaling_events)
and decreased accuracy.
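When events are multiplexed, each one is only counted while it is scheduled on a hardware counter, and tools extrapolate from the fraction of time it ran. A sketch of the standard scaling rule (the function name is an assumption; the `enabled/running` extrapolation is the rule Linux perf-based tooling commonly applies):

```cpp
#include <cassert>
#include <cstdint>

// Extrapolate a multiplexed event's raw count to the full measurement
// window: scaled = raw * time_enabled / time_running. The further
// time_running falls below time_enabled, the more we are guessing --
// this is the "decreased accuracy" mentioned above.
inline double ScaledCount(uint64_t raw, uint64_t time_enabled_ns,
                          uint64_t time_running_ns) {
  if (time_running_ns == 0) return 0.0;  // event never scheduled
  return static_cast<double>(raw) * static_cast<double>(time_enabled_ns) /
         static_cast<double>(time_running_ns);
}
```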

-- *Visualization*: There is no visualization available, so the user needs to
-rely on collecting JSON result files and summarizing the results.
+- *Visualization*: There is no dedicated visualization UI available, so for
+complex analysis, users may need to collect JSON result files and summarize
+the results.

- *Counting vs. Sampling*: The framework only collects counters in "counting"
mode -- it answers how many cycles/cache misses/etc. happened, but does not
2 changes: 1 addition & 1 deletion _posts/2023-03-02-fast-9.md
@@ -12,7 +12,7 @@ Originally posted as Fast TotW #9 on June 24, 2019

*By [Chris Kennelly](mailto:[email protected])*

-Updated 2025-03-27
+Updated 2025-10-03

Quicklink: [abseil.io/fast/9](https://abseil.io/fast/9)

2 changes: 1 addition & 1 deletion _posts/2023-09-14-fast-7.md
@@ -12,7 +12,7 @@ Originally posted as Fast TotW #7 on June 6, 2019

*By [Chris Kennelly](mailto:[email protected])*

-Updated 2025-03-25
+Updated 2025-10-03

Quicklink: [abseil.io/fast/7](https://abseil.io/fast/7)

2 changes: 1 addition & 1 deletion _posts/2023-09-30-fast-52.md
@@ -12,7 +12,7 @@ Originally posted as Fast TotW #52 on September 30, 2021

*By [Chris Kennelly](mailto:[email protected])*

-Updated 2025-03-24
+Updated 2025-10-03

Quicklink: [abseil.io/fast/52](https://abseil.io/fast/52)

4 changes: 2 additions & 2 deletions _posts/2023-10-10-fast-64.md
@@ -12,7 +12,7 @@ Originally posted as Fast TotW #64 on October 21, 2022

*By [Chris Kennelly](mailto:[email protected])*

-Updated 2025-03-24
+Updated 2025-09-29

Quicklink: [abseil.io/fast/64](https://abseil.io/fast/64)

@@ -192,7 +192,7 @@ that can be returned. This approach has two problems:
variable small string object buffer sizes. Returning `const std::string&`
constrains the implementation to that particular size of buffer.
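The buffer-size coupling just described can be sketched as follows. This is a minimal illustration (the `Config` class and its members are assumed for this sketch, not from the post): a `const std::string&` accessor forces the implementation to hold a `std::string` of one particular layout, while a `std::string_view` accessor presents the same caller-facing type whether the data is an owned string or constant data.

```cpp
#include <cassert>
#include <string>
#include <string_view>

class Config {
 public:
  // Tied to the representation: there must be a std::string member (with
  // its particular small-string buffer size) to return a reference to.
  const std::string& owned_name() const { return name_; }

  // Decoupled: the same signature works for owned and constant data alike.
  std::string_view name_view() const { return name_; }
  static std::string_view default_name() { return "default"; }  // constant data

 private:
  std::string name_ = "gadget";
};
```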

-In contrast, by returning `std::string_view` (or our
+In contrast, by returning [`std::string_view`](/tips/1) (or our
[internal predecessor](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3442.html),
`StringPiece`), we decouple callers from the internal representation. The API is
the same, independent of whether the string is constant data (backed by the
27 changes: 14 additions & 13 deletions _posts/2023-10-15-fast-60.md
@@ -12,14 +12,15 @@ Originally posted as Fast TotW #60 on June 6, 2022

*By [Chris Kennelly](mailto:[email protected])*

-Updated 2025-03-24
+Updated 2025-09-29

Quicklink: [abseil.io/fast/60](https://abseil.io/fast/60)


[Google-Wide Profiling](https://research.google/pubs/pub36575/) collects data
not just from our hardware performance counters, but also from in-process
-profilers.
+profilers. These have been covered in previous episodes covering
+[hashtables](/fast/26).

In-process profilers can give deeper insights about the state of the program
that are hard to observe from the outside, such as lock contention, where memory
@@ -39,8 +40,8 @@ decisions faster, shortening our
The value is in pulling in the area-under-curve and landing in a better spot. An
"imperfect" profiler that can help make a decision is better than a "perfect"
profiler that is unwieldy to collect for performance or privacy reasons. Extra
-information or precision is only useful insofar as it helps us make a *better*
-decision or *changes* the outcome.
+information or precision is only useful insofar as it helps us make a
+[*better* decision or *changes* the outcome](/fast/94).

For example, most new optimizations to
[TCMalloc](https://github.com/google/tcmalloc/blob/master/tcmalloc) start from
@@ -54,7 +55,7 @@ steps didn't directly save any CPU usage or bytes of RAM, but they enabled
better decisions. Capabilities are harder to directly quantify, but they are the
motor of progress.

-## Leveraging existing profilers: the "No build" option
+## Leveraging existing profilers: the "No build" option {#no-build}

Developing a new profiler takes considerable time, both in terms of
implementation and wallclock time to ready the fleet for collection at scale.
@@ -65,19 +66,19 @@ For example, if the case for hashtable profiling was just reporting the capacity
of hashtables, then we could also derive that information from heap profiles,
TCMalloc's heap profiles of the fleet. Even where heap profiles might not be
able to provide precise insights--the actual "size" of the hashtable, rather
-than its capacity--we can make an informed guess from the profile combined with
-knowledge about the typical load factors due to SwissMap's design.
+than its capacity--we can make an [informed guess](/fast/90) from the profile
+combined with knowledge about the typical load factors due to SwissMap's design.
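The informed guess amounts to simple arithmetic on the profiled capacity. A hedged sketch (the function name and the assumed average load factor are illustration only -- SwissMap resizes at a 7/8 maximum load factor, so a live table sits somewhere below that, and the constant here is a placeholder guess, not a measured value):

```cpp
#include <cassert>
#include <cstddef>

// A heap profile reveals a table's capacity; multiplying by an assumed
// typical load factor estimates the number of live elements. 0.5 is a
// placeholder for "partway through a growth cycle", not a measured constant.
constexpr double kAssumedLoadFactor = 0.5;

inline size_t EstimatedSize(size_t capacity) {
  return static_cast<size_t>(static_cast<double>(capacity) *
                             kAssumedLoadFactor);
}
```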

It is important to articulate the value of the new profiler over what is already
provided. A key driver for hashtable-specific profiling is that the CPU profiles
of a hashtable with a
[bad hash function look similar to those](https://youtu.be/JZE3_0qvrMg?t=1864)
-with a good hash function. The added information collected for stuck bits helps
-us drive optimization decisions we wouldn't have been able to make. The capacity
-information collected during hashtable-profiling is incidental to the profiler's
-richer, hashtable-specific details, but wouldn't be a particularly compelling
-reason to collect it on its own given the redundant information available from
-ordinary heap profiles.
+with a good hash function. The [added information collected](/fast/26) for stuck
+bits helps us drive optimization decisions we wouldn't have been able to make.
+The capacity information collected during hashtable-profiling is incidental to
+the profiler's richer, hashtable-specific details, but wouldn't be a
+particularly compelling reason to collect it on its own given the redundant
+information available from ordinary heap profiles.

## Sampling strategies

9 changes: 8 additions & 1 deletion _posts/2023-10-20-fast-70.md
@@ -12,7 +12,7 @@ Originally posted as Fast TotW #70 on June 26, 2023

*By [Chris Kennelly](mailto:[email protected])*

-Updated 2025-03-25
+Updated 2025-10-03

Quicklink: [abseil.io/fast/70](https://abseil.io/fast/70)

@@ -129,6 +129,13 @@ performance improvements. We still need to measure the impact on application and
service-level performance, but the proxies help us home in on an optimization
that we want to deploy faster.

+When we are considering multiple options for a project, secondary metrics can
+give us confirmation after the fact that our expectations were correct. For
+example, suppose we chose option A over option B because both provided
+comparable performance but A would not impact reliability. We should measure
+both the performance and reliability outcomes to support our engineering
+decision. This lets us close the loop between expectations and reality.

## Aligning with success

The metrics we pick need to align with success. If a metric tells us to do the
24 changes: 12 additions & 12 deletions _posts/2023-11-10-fast-74.md
@@ -12,7 +12,7 @@ Originally posted as Fast TotW #74 on September 29, 2023

*By [Chris Kennelly](mailto:[email protected]) and [Matt Kulukundis](mailto:[email protected])*

-Updated 2025-03-25
+Updated 2025-10-03

Quicklink: [abseil.io/fast/74](https://abseil.io/fast/74)

@@ -74,12 +74,12 @@ understand, we might be tempted to remove it. TCMalloc's fast path would appear
cheaper, but other code somewhere else would experience a cache miss and
[application productivity](/fast/7) would decline.

-To make matters worse, the cost is partly a profiling artifact. The TLB miss
-blocks instruction retirement, but our processors are superscalar, out-of-order
-behemoths. The processor can continue to execute further instructions in the
-meantime, but this execution is not visible to a sampling profiler like
-Google-Wide Profiling. IPC in the application may be improved, but not in a way
-immediately associated with TCMalloc.
+To make matters worse, the cost is partly [a profiling artifact](/fast/94). The
+TLB miss blocks instruction retirement, but our processors are superscalar,
+out-of-order behemoths. The processor can continue to execute further
+instructions in the meantime, but this execution is not visible to a sampling
+profiler like Google-Wide Profiling. IPC in the application may be improved, but
+not in a way immediately associated with TCMalloc.

### Hidden context switch costs

@@ -104,11 +104,11 @@ increase apparent kernel scheduler latency.

### Sweeping away protocol buffers

-Consider an extreme example. When our hashtable profiler for Abseil's hashtables
-indicates a problematic hashtable, a user could switch the offending table from
-`absl::flat_hash_map` to `std::unordered_map`. Since the profiler doesn't
-collect information about `std` containers, the offending table would no longer
-show up, although the fleet itself would be dramatically worse.
+Consider an extreme example. When [our hashtable profiler](/fast/26) for
+Abseil's hashtables indicates a problematic hashtable, a user could switch the
+offending table from `absl::flat_hash_map` to `std::unordered_map`. Since the
+profiler doesn't collect information about `std` containers, the offending table
+would no longer show up, although the fleet itself would be dramatically worse.

While the above example may seem contrived, an almost entirely analogous
recommendation comes up with some regularity: migrate users from protos to
9 changes: 5 additions & 4 deletions _posts/2023-11-10-fast-75.md
@@ -12,7 +12,7 @@ Originally posted as Fast TotW #75 on September 29, 2023

*By [Chris Kennelly](mailto:[email protected])*

-Updated 2025-03-25
+Updated 2025-10-03

Quicklink: [abseil.io/fast/75](https://abseil.io/fast/75)

@@ -161,9 +161,10 @@ benchmark does, and that can have some profound effects on what we measure. For
example, there, since we're iterating over the same buffer, and there's no
dependency on the last value, the processor is very likely to be able to
speculatively start the next iteration and won't need to undo the work. This
-converts a benchmark that we'd like to measure as a chain of dependencies into a
-measurement of the number of pipelines that the processor has (or the duration
-of the dependency chain divided by the number of parallel executions).
+converts a benchmark that we'd like to measure as a
+[chain of dependencies](/fast/99) into a measurement of the number of pipelines
+that the processor has (or the duration of the dependency chain divided by the
+number of parallel executions).
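The two benchmark shapes can be sketched in miniature. This is not the varint benchmark itself, just a hedged illustration of the structural difference (function names assumed): in the independent form every iteration reads the same data with no cross-iteration dependency, so the processor can overlap iterations speculatively; in the chained form each iteration's input depends on the previous result, forcing the serial latency we actually wanted to measure.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Independent iterations: nothing carries from one iteration to the next,
// so a superscalar processor can execute several iterations in flight.
inline uint64_t SumIndependent(const std::vector<uint64_t>& buf, int iters) {
  uint64_t total = 0;
  for (int i = 0; i < iters; ++i) {
    total += buf[0];  // same input every time; iterations overlap freely
  }
  return total;
}

// Chained iterations: the next load address depends on the previous
// result, so iterations must execute one after another.
inline uint64_t SumChained(const std::vector<uint64_t>& buf, int iters) {
  uint64_t index = 0;
  for (int i = 0; i < iters; ++i) {
    index = buf[index] % buf.size();  // data-dependent on the prior result
  }
  return index;
}
```

Timed, the first form measures throughput (how many pipelines the processor has); the second measures the latency of the dependency chain.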

To make the benchmark more realistic, we can instead parse from a larger buffer
of varints serialized end-on-end:
2 changes: 1 addition & 1 deletion _posts/2024-09-04-fast-62.md
@@ -12,7 +12,7 @@ Originally posted as Fast TotW #62 on July 7, 2022

*By [Chris Kennelly](mailto:[email protected]), [Luis Otero](mailto:[email protected]) and [Carlos Villavieja](mailto:[email protected])*

-Updated 2025-03-12
+Updated 2025-09-15

Quicklink: [abseil.io/fast/62](https://abseil.io/fast/62)

16 changes: 9 additions & 7 deletions _posts/2024-09-04-fast-72.md
@@ -12,7 +12,7 @@ Originally posted as Fast TotW #72 on August 7, 2023

*By [Chris Kennelly](mailto:[email protected])*

-Updated 2025-02-18
+Updated 2025-08-23

Quicklink: [abseil.io/fast/72](https://abseil.io/fast/72)

@@ -37,9 +37,9 @@ estimates to be correct, the primary goal is to have just enough information to
optimization "A" over optimization "B" because "A" has a larger expected ROI.
Oftentimes, we only need a
[single significant figure](https://en.wikipedia.org/wiki/Significant_figures)
-to do so: Spending more time making a better estimate does not make things more
-efficient by itself. When new information arrives, we can update our priors
-accordingly.
+to do so: Spending more time making a better estimate or
+[gathering more data](/fast/94) does not make things more efficient by itself.
+When new information arrives, we can update our priors accordingly.

Once we have identified an area to work in, we can shift to thinking about ways
of tackling problems in that area.
@@ -226,9 +226,11 @@ helps in several ways.

Success in one area brings opportunities for cross-pollination. We can take
the same solution, an
-[algorithm](https://research.google/pubs/pub50370.pdf#page=7) or data
-structure, and apply the idea to a related but different area. Without the
-original landing, though, we might have never realized this.
+[algorithm](https://research.google/pubs/pub50370.pdf#page=7) (pages on huge
+pages) or data structure, and apply the idea to a
+[related but different area](https://storage.googleapis.com/gweb-research2023-media/pubtools/7777.pdf#page=9)
+(objects on pages, or "span" prioritization). Without the original landing,
+though, we might have never realized this.

* Circumstances are continually changing. The assumptions that started a
project years ago may be invalid by the time the project is ready.
14 changes: 8 additions & 6 deletions _posts/2024-09-04-fast-79.md
@@ -12,7 +12,7 @@ Originally posted as Fast TotW #79 on January 19, 2024

*By [Chris Kennelly](mailto:[email protected]) and [Matt Kulukundis](mailto:[email protected])*

-Updated 2024-12-10
+Updated 2025-06-20

Quicklink: [abseil.io/fast/79](https://abseil.io/fast/79)

@@ -77,8 +77,9 @@ smoothly.

TIP: Prefer switching defaults to migrating code if you can.

-When we introduced hashtable profiling for monitoring tables fleet wide, some
-users were surprised that tables could be sampled (triggering additional system
+When we introduced
+[hashtable profiling for monitoring tables fleet wide](/fast/26), some users
+were surprised that tables could be sampled (triggering additional system
calls). If we had tried to have sampled monitoring from the start, the migration
would have had a new class of issues to debug. This also allowed us to have a
[very clear opt-out for this specific feature](/fast/52) without delaying the
@@ -91,7 +92,8 @@ class of issues at a time.

## Iterative improvement: Deploying TCMalloc's CPU caches

-When TCMalloc was first introduced, it used per-thread caches, hence its name,
+When [TCMalloc](https://github.com/google/tcmalloc/blob/master/docs/index.html)
+was first introduced, it used per-thread caches, hence its name,
"[Thread-Caching Malloc](https://goog-perftools.sourceforge.net/doc/tcmalloc.html)."
As thread counts continued to increase, per-thread caches suffered from two
growing problems: a per-process cache size was divided over more and more
@@ -127,8 +129,8 @@ development of
[several optimizations](https://research.google/pubs/characterizing-a-memory-allocator-at-warehouse-scale/).
TCMalloc includes extensive telemetry that enabled us to calculate the amount of
memory being used for per-vCPU caches which provided estimates of the potential
-opportunity - to motivate the work - and the final impact - for recognising the
-benefit.
+opportunity - to motivate the work - and measure the final impact - for
+recognising the benefit.

TIP: Tracking metrics that we intend to optimize later, even if not right away,
can help identify when an idea is worth pursuing and prioritizing. By monitoring
2 changes: 1 addition & 1 deletion _posts/2024-09-04-fast-83.md
@@ -12,7 +12,7 @@ Originally posted as Fast TotW #83 on June 17, 2024

*By [Chris Kennelly](mailto:[email protected])*

-Updated 2025-02-18
+Updated 2025-08-23

Quicklink: [abseil.io/fast/83](https://abseil.io/fast/83)
