Feature/refactor noodle masked load (WIP) #216

Draft

markos wants to merge 616 commits into develop from feature/refactor-noodle-masked-load

markos commented Dec 21, 2023

Refactored Noodle engine to be closer to already refactored Shufti/Truffle/etc.
Up to 2x as fast as the previous version.
Also implemented masked loads; this will be used as a model for masked loads on AVX512 and SVE2, as well as on other future vector architectures that provide predicated/masked loads (SVP64?).

BigRedEye and others added 30 commits

February 8, 2022 00:22


          fix: Mark operator bool explicit

6d6c291


          Merge pull request #90 from BigRedEye/vectorscan-master

2819dc3

Fix word boundary assertions under C++20


          Fix all ASAN issues in vectorscan

9af996b


          Add sanitize options

b3e88e4


          Fix a couple of tests

5f8729a


          change FAT_RUNTIME to a normal option so it can be set to off

d626381

fixes #89


          move to original position

b34aacd


          Merge pull request #93 from danlark1/master

5fa22e6

Fix all ASAN issues in vectorscan


          Merge pull request #94 from a16bitsysop/fat_runtime

edea9d1

change FAT_RUNTIME to a normal option so it can be set to off


          Optimized and correct version of movemask128 for ARM

288491d

Closes #99

https://gcc.godbolt.org/z/cTjKqzcvn

The previous version was not correct because it assumed that every set byte passed to movemask was 0xFF. We can fully match the semantics and do it faster with USRA instructions.

Re-submission to a develop branch
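
For readers following the correctness point in this commit message, here is a minimal scalar sketch of the movemask semantics being matched: bit i of the result is the most significant bit of byte i, whatever the other seven bits are. The name `movemask128_ref` is hypothetical and is not taken from the vectorscan sources. It only models the expected result and says nothing about how the USRA-based version computes it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar reference (not the vectorscan implementation):
// gather the top bit of each of 16 bytes into a 16-bit mask, so that
// bit i of the result equals bit 7 of bytes[i].
inline uint16_t movemask128_ref(const uint8_t *bytes) {
    assert(bytes != nullptr);
    uint16_t mask = 0;
    for (size_t i = 0; i < 16; ++i) {
        // Only the most significant bit matters; a byte such as 0x80
        // sets its bit just like 0xFF does.
        mask = static_cast<uint16_t>(mask | ((bytes[i] >> 7) << i));
    }
    return mask;
}
```

A vectorized version can be checked against this reference with inputs whose bytes are neither 0x00 nor 0xFF (for example 0x80 and 0x7F), which is the case the earlier version reportedly mishandled.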


          Merge pull request #102 from danlark1/patch-2

bd91134

Optimized and correct version of movemask128 for ARM


          add Jenkinsfile back to master branch

76b2b4b


          add Jenkinsfile back to master branch

f441213


          Merge pull request #104 from VectorCamp/bugfix/jenkinsfile

630f7b2

add Jenkinsfile back to master branch


          Delete JenkinsFile

fce10b5


          fix large pipeline error

b3d7174


          Merge pull request #105 from VectorCamp/bugfix/jenkins

e71fb5c

fix large pipeline error


          Update Jenkinsfile

2c78b77


          Update Jenkinsfile

59ffac5


          Update Jenkinsfile

6c24e61


          Merge pull request #103 from VectorCamp/develop

8739a6c

Develop


          Update CMakeLists.txt

fc5059a


          Use non-deprecated method of finding python

0a35a46


          Bump scripts to python3

85a77e3


          Merge pull request #108 from jth/cmake-python

73695e4

CMake: Use non-deprecated method for finding python


          Optimize vectorscan for aarch64 by using shrn instruction

49eb18e

This optimization is based on the thread
https://twitter.com/Danlark1/status/1539344279268691970 and uses
shift right and narrow by 4 instruction https://developer.arm.com/documentation/ddi0596/2020-12/SIMD-FP-Instructions/SHRN--SHRN2--Shift-Right-Narrow--immediate--

To achieve that, I needed to redesign a little movemask into comparemask
and have an additional step towards mask iteration. Our benchmarks
showed 10-15% improvement on average for long matches.
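
The comparemask idea described above can be modelled in scalar code. The sketch below assumes a little-endian 128-bit vector of byte-compare results (each byte 0x00 or 0xFF), treats it as eight 16-bit lanes, shifts each lane right by 4 and keeps the low 8 bits, which is what SHRN with an immediate of 4 does. The result is a 64-bit mask with 4 bits per input byte. The names `comparemask_ref`, `first_match_lane` and `kMaskWidth` are hypothetical, chosen for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed for this model: 4 mask bits per byte lane.
constexpr unsigned kMaskWidth = 4;

// Scalar model of "shift right and narrow by 4" over eight 16-bit lanes.
// cmp holds 16 compare results, each expected to be 0x00 or 0xFF.
inline uint64_t comparemask_ref(const uint8_t *cmp) {
    assert(cmp != nullptr);
    uint64_t mask = 0;
    for (size_t i = 0; i < 16; i += 2) {
        // Little-endian 16-bit lane built from two adjacent bytes.
        uint16_t lane = static_cast<uint16_t>(cmp[i] | (cmp[i + 1] << 8));
        // Shift right by 4 and narrow to 8 bits: one nibble per input byte.
        uint8_t narrowed = static_cast<uint8_t>(lane >> 4);
        mask |= static_cast<uint64_t>(narrowed) << (4 * i);
    }
    return mask;
}

// The extra step in mask iteration: a bit index becomes a byte index
// by dividing by the mask width.
inline unsigned first_match_lane(uint64_t mask) {
    assert(mask != 0);
    unsigned bit = 0;
    while (((mask >> bit) & 1ULL) == 0) {
        ++bit;
    }
    return bit / kMaskWidth;
}
```

With matches at bytes 5 and 9, the model sets bits 20-23 and 36-39, and `first_match_lane` returns 5.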


          Fix formatting of a couple files

8a49e20


          Minor fix


          Fix ppc64el debug

7e7f604


          Fix avx512 movemask call

db52ce6

markos and others added 21 commits

December 20, 2023 00:12


          add missing pdep64 for x86 bitutils

49e6fe1


          add fallback pdep64 for x86 if no HAVE_BMI2

1b915cf


          fix arch=native on arm+clang

2aa5e1c


          fix submodule headers detection

44f19c1


          reorganize OS detection

a7a1284


          GREATER_EQUAL

306e861


          native CPU on SIMDe will enable all sorts of features in an unpredict…

ef37e60

…ed manner, set sane defaults


          fix typo in baseline x86 arch definition

10d9574


          Merge pull request #212 from VectorCamp/bugfix/fix-simde-build

3113d1c

SIMDe on Clang needs SIMDE_NO_CHECK_IMMEDIATE_CONSTANT defined and other SIMDe related fixes now that SIMDe is part of the CI pipeline.

Some issue with SIMDe on x86 still remains because of an upstream bug:

simd-everywhere/simde#1119

Similarly SIMDe native with clang on Arm also poses a non-high priority build failure:

https://buildbot-ci.vectorcamp.gr/#/builders/129/builds/11

Possibly a SIMDe issue as well, need to investigate but will merge this PR as these are non-blockers.


          use ccache if available

ad70693


          Merge pull request #215 from VectorCamp/feature/use-ccache

17fb9f4

use ccache if available


          refactor Noodle to use the same loop as Shufti/Truffle, now it's at l…

d4fde85

…east 2x as fast


          define HAVE_MASKED_LOADS for AVX512

9f66822


          fix loadu_maskz, remove old defines

476cefb


          fix types of z in debug prints

5f65b9f


          refactor Noodle Single/Double to use masked loads

0e2f6c1


          remove unneeded shifts

5814d32


          comparemask_type is u64a on Arm, use single load_mask

db3b0e9


          fix debug formats for z on arm

f866b72


          fix debug prints for z on ppc64le

de66c74


          add missing findLSB for ppc64le

9a53b19

ypicchi-arm reviewed

src/hwlm/noodle_engine_simd.hpp Outdated

    
              		Z_TYPE z, size_t len, const struct cb_info *cbi) {

                                        Z_TYPE z, size_t len, const struct cb_info *cbi) {

                  while (unlikely(z)) {

                      Z_TYPE pos = JOIN(findAndClearLSB_, Z_BITS)(&z) >> Z_POSSHIFT;

ypicchi-arm Jan 4, 2024

As I understand it, this clears a single bit. We handle the case where the mask is wider with Z_POSSHIFT. But I believe that in the case of NEON all the bits of a matching lane would be 1, so we'd iterate 4 times in this loop? Or maybe I missed something else?

Author

markos Jan 4, 2024

It's still WIP; I have some local fixes for this, which is why it has not been merged yet.

src/hwlm/noodle_engine_simd.hpp Outdated

    
              static really_inline

              hwlm_error_t scanSingle(const struct noodTable *n, const u8 *buf, size_t len,

                                      size_t start, bool noCase, const struct cb_info *cbi) {

              /*    if (len < VECTORSIZE) {

ypicchi-arm Jan 4, 2024

Commented code. Shouldn't it be removed?

src/hwlm/noodle_engine_simd.hpp Outdated

    
                          size_t l = d1 - d;

                          SuperVector<S> chars = SuperVector<S>::loadu(d) & caseMask;

                          typename SuperVector<S>::comparemask_type mask = SINGLE_LOAD_MASK(l * SuperVector<S>::mask_width());

                          typename SuperVector<S>::comparemask_type z = mask & mask1.eqmask(chars);

ypicchi-arm Jan 4, 2024

I think you forgot the iteration_mask(z); here? I'm not sure what its purpose is, but it was there in the previous code, and it is also there in the double scan path.

Author

markos Jan 4, 2024

Again, this is WIP; there is uncommitted code that I need to fix. iteration_mask is a way to reproduce the movemask functionality on Intel; it is implemented differently on each architecture.
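
To make the two comments in this thread concrete, here is a small sketch of why an iteration mask matters when each byte lane occupies 4 mask bits. It assumes a 64-bit compare mask with 4 bits per lane and reduces it to one bit per lane before the find-and-clear loop. The helper names (`iteration_mask_w4`, `find_and_clear_lsb`, `match_positions`) are hypothetical and only stand in for the real per-architecture functions.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Assumed model for a 4-bit-per-lane mask: keep only the lowest bit of
// each nibble so every matching lane contributes exactly one set bit.
inline uint64_t iteration_mask_w4(uint64_t z) {
    return z & 0x1111111111111111ULL;
}

// Return the index of the lowest set bit and clear it.
inline unsigned find_and_clear_lsb(uint64_t *z) {
    assert(z != nullptr && *z != 0);
    unsigned pos = 0;
    while (((*z >> pos) & 1ULL) == 0) {
        ++pos;
    }
    *z &= *z - 1;  // clears the lowest set bit
    return pos;
}

// Collect byte positions; posshift converts a bit index to a byte index
// (2 for a 4-bit-per-lane mask, 0 for a 1-bit-per-lane mask).
inline std::vector<unsigned> match_positions(uint64_t z, unsigned posshift) {
    std::vector<unsigned> out;
    while (z != 0) {
        out.push_back(find_and_clear_lsb(&z) >> posshift);
    }
    return out;
}
```

Without the reduction, a mask with lanes 2 and 4 set (0xF0F00) yields eight loop iterations, four per lane; with `iteration_mask_w4` applied first it yields the two positions 2 and 4.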

src/hwlm/noodle_engine_simd.hpp Outdated

    
                      size_t l = buf_end - d;

                      typename SuperVector<S>::comparemask_type mask = SINGLE_LOAD_MASK(l * SuperVector<S>::mask_width());

                      typename SuperVector<S>::comparemask_type z = mask & mask1.eqmask(chars);

                      hwlm_error_t rv = single_zscan(n, d, buf, z, len, cbi);

ypicchi-arm Jan 4, 2024

Missing the iteration_mask(z); here too?
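
The tail handling in the snippet above multiplies the number of remaining bytes by the mask width before building the load mask. A minimal sketch of that arithmetic follows, assuming the mask is a 64-bit integer whose low `valid_bytes * mask_width` bits should be set. `low_bits_mask` and `tail_mask` are hypothetical names, not the definition of SINGLE_LOAD_MASK in the sources.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Set the low nbits bits. The nbits == 64 case is handled separately
// because shifting a 64-bit value by 64 is undefined behaviour in C++.
inline uint64_t low_bits_mask(unsigned nbits) {
    assert(nbits <= 64);
    return nbits == 64 ? ~0ULL : ((1ULL << nbits) - 1ULL);
}

// Mask covering the first valid_bytes lanes, with mask_width bits per lane
// (1 on x86-style masks, 4 in the NEON model discussed in this review).
inline uint64_t tail_mask(size_t valid_bytes, unsigned mask_width) {
    return low_bits_mask(static_cast<unsigned>(valid_bytes * mask_width));
}
```

For 3 valid bytes this gives 0x7 with a width of 1 and 0xFFF with a width of 4; a full 16-byte block at width 4 gives all 64 bits set.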

ypicchi-arm reviewed

ypicchi-arm left a comment

One thing I noticed is that you often make a change in a commit that would break/is missing something, and you later fix it in another commit. I suppose you plan on squashing/reworking those commits?

src/hwlm/noodle_engine_simd.hpp

src/util/supervector/arch/x86/impl.cpp

src/util/supervector/arch/x86/impl.cpp

src/hwlm/noodle_engine_simd.hpp

    
                          typename SuperVector<S>::comparemask_type z2 = mask2.eqmask(chars);

                          typename SuperVector<S>::comparemask_type z = (z1 << SuperVector<S>::mask_width()) & z2;

                          DEBUG_PRINTF("z: %0llx\n", z);

                          lastz1 = z1 >> (S - 1);

ypicchi-arm Jan 9, 2024

I think this assumes SuperVector<S>::mask_width() == 1, which is not always the case (for arm/neon it's 4).

src/hwlm/noodle_engine_simd.hpp

src/hwlm/noodle_engine_simd.hpp

    
                          typename SuperVector<S>::comparemask_type z2 = mask2.eqmask(chars);

                          typename SuperVector<S>::comparemask_type z = (z1 << SuperVector<S>::mask_width() | lastz1) & z2;

                          lastz1 = z1 >> (Z_SHIFT * SuperVector<S>::mask_width());

                          lastz1 = z1 >> (S - 1);

ypicchi-arm Jan 9, 2024

same issue with mask_width()
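
The mask_width() concern raised in these two comments can be illustrated with a scalar sketch of the double-character combine. It assumes masks are 64-bit integers holding `lanes` lanes of `w` bits each, and that the carry from the last lane of one block into the first lane of the next should be shifted by `(lanes - 1) * w` rather than `lanes - 1`. `combine_double` is a hypothetical helper, not the code under review.

```cpp
#include <cassert>
#include <cstdint>

// z1: lanes where the first character matched.
// z2: lanes where the second character matched.
// lastz1: carry of z1's last lane from the previous block (in lane 0).
// Returns lanes where the pair ends, and updates the carry.
inline uint64_t combine_double(uint64_t z1, uint64_t z2, unsigned lanes,
                               unsigned w, uint64_t *lastz1) {
    assert(lastz1 != nullptr);
    assert(lanes >= 1 && w >= 1 && w < 64 && lanes * w <= 64);
    // Move each first-character match one lane forward, bring in the
    // carry from the previous block, and require a second-character match.
    uint64_t z = ((z1 << w) | *lastz1) & z2;
    // Carry the last lane into lane 0: the shift scales with the lane width.
    *lastz1 = z1 >> ((lanes - 1) * w);
    return z;
}
```

With 16 lanes of 4 bits, a first-character match in lane 15 leaves a carry of 0xF, so a second-character match in lane 0 of the next block is reported. Shifting by `lanes - 1` (15) instead would leave the carry at bits 45-48, where lane 0 never sees it.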

src/hwlm/noodle_engine_simd.hpp

    
                          uint8_t l = d0 + S - d;

                          DEBUG_PRINTF("l: %d \n", l);

                          SuperVector<S> chars = SuperVector<S>::loadu_maskz(d, l) & caseMask;

                          chars.print8("chars");

ypicchi-arm Jan 9, 2024

debug print?

src/hwlm/noodle_engine_simd.hpp Outdated

    
                          typename SuperVector<S>::comparemask_type z1 = mask1.eqmask(chars);

                          typename SuperVector<S>::comparemask_type z2 = mask2.eqmask(chars);

                          typename SuperVector<S>::comparemask_type z = (z1 << SuperVector<S>::mask_width()) & z2;

                          DEBUG_PRINTF("z: %0llx\n", z);

ypicchi-arm Jan 9, 2024

Why delete the debug print here, when you added more like it for the scanSingle function?

Author

markos Jan 9, 2024

z has a different type in each architecture, this DEBUG_PRINTF fails to compile on some architectures, so I need to make it work and compile on all architectures.

ypicchi-arm Jan 9, 2024

What confuses me is that in the previous function, you added DEBUG_PRINTF("z: %08llx\n", (u64a) z);, so I believe you could have modified this print to work the same way by casting z?

Author

markos Jan 9, 2024

Yes, that's what I did locally after I realized I could just cast it :)
Unfortunately I had some other things to fix before the holidays and this was left unfinished, along with other fixes that I have locally. I will be committing more fixes over the next few days.
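
The cast discussed in this thread can be sketched as follows. Because the mask type differs per architecture, converting it to `unsigned long long` before formatting keeps a single `%llx` format string valid everywhere. `format_mask` is a hypothetical stand-in that uses `snprintf` instead of the project's DEBUG_PRINTF macro so that it is self-contained.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Format a compare mask of any unsigned integer type with one format string.
// The explicit cast makes the argument match %llx regardless of Mask's width.
template <typename Mask>
std::string format_mask(Mask z) {
    char buf[32];
    int n = std::snprintf(buf, sizeof(buf), "z: %08llx",
                          static_cast<unsigned long long>(z));
    assert(n > 0 && n < static_cast<int>(sizeof(buf)));
    (void)n;  // silence unused-variable warnings when asserts are disabled
    return std::string(buf);
}
```

The same call then works for a 16-bit mask and for a 64-bit one, for example `format_mask<uint16_t>(0x1234)` and `format_mask<uint64_t>(0xF000000000000000ULL)`.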

src/util/supervector/arch/arm/impl.cpp

    
                  DEBUG_PRINTF("mask = %08llx\n", mask);

                  SuperVector v = loadu(ptr);

                  (void)mask;

                  return v; // FIXME: & mask

ypicchi-arm Jan 9, 2024

FIXME
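
For context on what the FIXME leaves out, here is a portable sketch of the semantics a zeroing masked load is usually expected to have, in the style of the AVX-512 `maskz` load intrinsics: the first `len` bytes come from memory, the remaining lanes are zero, and no byte past `ptr + len` is read. This is an assumption about the intended behaviour, and `loadu_maskz_ref` is a hypothetical reference helper, not the Arm implementation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Reference zeroing masked load: copy len bytes, leave the rest zero.
// Unlike a full unaligned load, this never touches memory beyond ptr + len.
template <size_t S>
std::array<uint8_t, S> loadu_maskz_ref(const void *ptr, size_t len) {
    assert(len <= S);
    std::array<uint8_t, S> v{};  // value-initialized: all lanes zero
    if (len != 0) {
        assert(ptr != nullptr);
        std::memcpy(v.data(), ptr, len);
    }
    return v;
}
```

A reference like this can serve as a test oracle for per-architecture implementations, since a version that returns the unmasked load will differ whenever the bytes after `len` are non-zero.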

src/util/supervector/supervector.hpp

    
                static SuperVector loadu(void const *ptr);

                static SuperVector load(void const *ptr);

                static SuperVector loadu_maskz(void const *ptr, uint8_t const len);

                static SuperVector loadu_maskz(void const *ptr, typename base_type::comparemask_type const len);

ypicchi-arm Jan 9, 2024

I see you add the implementation for arm later on, but I didn't see any implementation for ppc64?

markos changed the title ~~Feature/refactor noodle masked load~~ Feature/refactor noodle masked load (WIP)

Author

markos commented Jan 9, 2024

You are reviewing code which is not ready to be merged yet.

One thing I noticed is that you often make a change in a commit that would break/is missing something, and you later fix it in another commit. I suppose you plan on squashing/reworking those commits?

The reason for that is that I may be working on e.g. Arm, fixing something that works on Arm, only to find that it fails on another architecture, or another compiler, or even with another configuration flag on the same architecture/compiler. Currently the CI compiles almost 100 different configs, and if a PR is to be merged it has to pass on all of them; otherwise it doesn't get merged. Hence the iterative approach. I have thought of squashing those commits, but I don't see it as a problem currently, as it's mostly myself working on this project, with occasional contributors. If git history pollution becomes an issue, I may rethink that.

ypicchi-arm commented Jan 9, 2024

As a WIP it's ok, yes. It's just that it wasn't marked as a draft, so I preferred to take the conservative approach of aiming for release quality. I usually prefer to comment on anything suspicious, even if it means raising a false positive, rather than discarding it. I'm sure you'll fix things like those FIXMEs, but the safe way is for me to remind you they are here just in case :)

markos marked this pull request as draft

January 9, 2024 17:21

Author

markos commented Jan 9, 2024

Well, previously I never had to draft my PRs as I was the single person reviewing them :)

markos added this to the 5.4.12 milestone

markos force-pushed the develop branch from 3a70ed4 to eaa8f91 Compare

October 29, 2025 22:06
