Improve compression performance by conditionally skipping lazy evaluation #4533

BiplabRaut · 2025-11-26T11:36:58Z

Optional ZSTD_COMPRESS_FAST flag provided to speed up compression by restricting lazy evaluation to short matches.
Applied to levels 6 and 7. This helps improve lazy compression strategy. Optimization disabled by default.
This code change improves compression performance with little to no tradeoff for compression ratio and decompression performance. It is useful for applications that have high compression speed requirements.
Skip determinism test when this optimization option is enabled. Since the compression ratio varies, so it can't satisfy the determinism tests.
Performance Highlights :-
Benchmark result (single-threaded execution) graphs are shared below for the datasets: Silesia, Calgary, Canterbury and FreeBSD.
Other datasets like geekBench and Enwik showed similar trends.
Test Setup: AMD Genoa (Zen4) CPU, SMT OFF, GCC 14.2.1

…tion Optional ZSTD_COMPRESS_FAST flag provided to speed up compression by restricting lazy evaluation to short matches. Applied to levels 6 and 7. This helps improve lazy compression strategy. Skip determinism test when enabled. Optimization disabled by default.

meta-cla · 2025-11-26T11:37:07Z

Hi @BiplabRaut!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

Cyan4973

The core idea seems sound: do not search for a better match when you already have a "good enough" one.
This will save some work, and consequently reduce the amount of cpu spent in search.
It will also reduce opportunities to find better matches, and therefore negatively impact compression ratio.

So it's not a clear win situation, and more a trade off.
Therefore, what matters is the relative sizes of these trade off: how much is won, how much is lost.

And when looking at the numbers provided here, it does not look great.
The announced compression speed gains are significant, in the double digit range.
But we also take reduction in compression ratio pretty seriously. Something < 0.1% could be considered roughly negligible. But anything closer to ~1% is significant. And there are a few cases where compression ratio is negatively impacted by a significant amount.

I guess we could have a discussion here, about the pros and cons of the trade off, and maybe find another compromise point.

The second part though is about code complexity.
The core of the concept is not difficult, it adds a branch within the search loop, in ZSTD_compressBlock_lazy_generic().

But unfortunately, the current implementation also adds a ton of other changes in other parts of the source code.
This is more difficult to maintain.
And considering the impact of the change, which merely offers a different position in the trade-off curve, I feel this is not a reasonable proposition. We need to keep things simpler if we want the source code to remain maintainable in the future.

As it stands, this PR is only suitable for a dedicated fork, where an independant maintenance policy can be set.

Cyan4973 · 2025-12-01T19:42:46Z

lib/compress/clevels.h

    { 21, 16, 17,  1,  5,  0, ZSTD_dfast   },  /* level  3 */
    { 21, 18, 18,  1,  5,  0, ZSTD_dfast   },  /* level  4 */
    { 21, 18, 19,  3,  5,  2, ZSTD_greedy  },  /* level  5 */
+#ifdef ZSTD_ENABLE_COMPRESS_FAST


This is too disruptive, and will make maintenance of these tables more complex for future patches.

Cyan4973 · 2025-12-01T19:44:00Z

lib/README.md

  For this scenario, it can be set as `ZDICT_QSORT=ZDICT_QSORT_C90`.
  Other selectable suffixes are `_GNU`, `_APPLE`, `_MSVC` and `_C11`.

+- The build macro `ZSTD_COMPRESS_FAST` can be defined for fast lazy evaluation


The macro name is not specific enough. The name implies that this is about the "fast mode", while the changes are actually impacting the ZSTD_lazy strategy. Consider a more accurate name, or better, if we can find just a better trade off, maybe there would be no need for another build macro.

I also see that there is a confusion between a C macro and a Makefile macro.
They are not the same.

Sure, we will address this

Cyan4973 · 2025-12-01T19:44:49Z

lib/compress/zstd_compress.c

    /* params are supposed to be fully validated at this point */
    assert(!ZSTD_isError(ZSTD_checkCParams(params->cParams)));
+#ifdef ZSTD_ENABLE_COMPRESS_FAST
+    compressionLevel = (params->compressionLevel == 0) ? cctx->requestedParams.compressionLevel 


create a block which both starts and end compressionLevel lifetime.

Cyan4973 · 2025-12-01T19:46:22Z

lib/compress/zstd_compress_internal.h

+#ifdef ZSTD_ENABLE_COMPRESS_FAST
+/* Lazy evaluation is performed only if the first match length 
+ * is <= ZSTD_compressFastLazyLimit */
+#define ZSTD_COMPRESS_FAST_BASE_LAZY_LIMIT 5


That seems way too low.

In fact, in many cases, the search is targeting 5-byte matches, so all matches would trigger this condition, effectively changing _lazy into _greedy.

Cyan4973 · 2025-12-01T19:47:03Z

lib/compress/zstd_compress_internal.h

    size_t maxNbSeq;
    size_t maxNbLit;
+#ifdef ZSTD_ENABLE_COMPRESS_FAST
+    size_t lazyLimit;      /* Match length limit to allow lazy evaluation */


This is unwelcome, I would prefer we avoid adding a conditional variable here.

Cyan4973 · 2025-12-01T19:47:25Z

lib/compress/zstd_lazy.c

                                     prefixLowestIndex - (U32)(dictEnd - dictBase) :
                                     0;
    const U32 dictAndPrefixLength = (U32)((ip - prefixLowest) + (dictEnd - dictLowest));
+#ifdef ZSTD_ENABLE_COMPRESS_FAST


too many such #ifdef blocks across the code base

Cyan4973 · 2025-12-01T19:52:59Z

tests/cli-tests/common/platform.sh

 fi

-if zstd -vv --version | grep -q 'non-deterministic'; then
+if zstd -vv --version | grep -q 'non-deterministic' || [ "$ZSTD_COMPRESS_FAST" = "1" ]; then


Removing determinism is pretty wild, though I guess it might be acceptable if this setting must be explicitly opt-in.

BiplabRaut · 2025-12-04T12:24:25Z

The core idea seems sound: do not search for a better match when you already have a "good enough" one. This will save some work, and consequently reduce the amount of cpu spent in search. It will also reduce opportunities to find better matches, and therefore negatively impact compression ratio.

So it's not a clear win situation, and more a trade off. Therefore, what matters is the relative sizes of these trade off: how much is won, how much is lost.

And when looking at the numbers provided here, it does not look great. The announced compression speed gains are significant, in the double digit range. But we also take reduction in compression ratio pretty seriously. Something < 0.1% could be considered roughly negligible. But anything closer to ~1% is significant. And there are a few cases where compression ratio is negatively impacted by a significant amount.

I guess we could have a discussion here, about the pros and cons of the trade off, and maybe find another compromise point.

The second part though is about code complexity. The core of the concept is not difficult, it adds a branch within the search loop, in ZSTD_compressBlock_lazy_generic().

But unfortunately, the current implementation also adds a ton of other changes in other parts of the source code. This is more difficult to maintain. And considering the impact of the change, which merely offers a different position in the trade-off curve, I feel this is not a reasonable proposition. We need to keep things simpler if we want the source code to remain maintainable in the future.

As it stands, this PR is only suitable for a dedicated fork, where an independent maintenance policy can be set.

Thank you Yann for your detailed review comments.

Broadly, there are two main points:-

compression ratio tradeoffs :
The compression ratio compromise is intentional and has been guided by a couple of our customer requests who are ok to such an aggressive tradeoff. For the same reason, this code can not be in default path so we introduced a separate macro.
maintainable simpler code changes
I will get back on this on what we can do to address your points

Thanks again.

Cyan4973 self-assigned this Dec 1, 2025

Cyan4973 requested changes Dec 1, 2025

View reviewed changes

Improve compression performance by conditionally skipping lazy evaluation #4533

Are you sure you want to change the base?

Improve compression performance by conditionally skipping lazy evaluation #4533

Uh oh!

Conversation

BiplabRaut commented Nov 26, 2025

Uh oh!

meta-cla bot commented Nov 26, 2025

Action Required

Process

Uh oh!

Cyan4973 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

BiplabRaut commented Dec 4, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants