-
Notifications
You must be signed in to change notification settings - Fork 70
Feature/refactor noodle masked load (WIP) #216
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
Fix word boundary assertions under C++20
Fix all ASAN issues in vectorscan
change FAT_RUNTIME to a normal option so it can be set to off
Closes #99 https://gcc.godbolt.org/z/cTjKqzcvn Previous version was not correct because movemask thought of having bytes 0xFF. We can fully match the semantics + do it faster with USRA instructions. Re-submission to a develop branch
Optimized and correct version of movemask128 for ARM
add Jenkinsfile back to master branch
fix large pipeline error
CMake: Use non-deprecated method for finding python
This optimization is based on the thread https://twitter.com/Danlark1/status/1539344279268691970 and uses shift right and narrow by 4 instruction https://developer.arm.com/documentation/ddi0596/2020-12/SIMD-FP-Instructions/SHRN--SHRN2--Shift-Right-Narrow--immediate-- To achieve that, I needed to redesign a little movemask into comparemask and have an additional step towards mask iteration. Our benchmarks showed 10-15% improvement on average for long matches.
…ed manner, set sane defaults
SIMDe on Clang needs SIMDE_NO_CHECK_IMMEDIATE_CONSTANT defined and other SIMDe related fixes now that SIMDe is part of the CI pipeline. Some issue with SIMDe on x86 still remains because of an upstream bug: simd-everywhere/simde#1119 Similarly SIMDe native with clang on Arm also poses a non-high priority build failure: https://buildbot-ci.vectorcamp.gr/#/builders/129/builds/11 Possibly a SIMDe issue as well, need to investigate but will merge this PR as these are non-blockers.
use ccache if available
src/hwlm/noodle_engine_simd.hpp
Outdated
| Z_TYPE z, size_t len, const struct cb_info *cbi) { | ||
| Z_TYPE z, size_t len, const struct cb_info *cbi) { | ||
| while (unlikely(z)) { | ||
| Z_TYPE pos = JOIN(findAndClearLSB_, Z_BITS)(&z) >> Z_POSSHIFT; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I understand, this clear a single bit. We handle the case where the mask is wider with Z_POSSHIFT. But I believe in the case of neon, we'd have all the bits being 1, so we'd iterate 4 times in this loop? Or maybe I missed something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's still WIP, I have some local fixes for this that's why it has not been merged yet.
src/hwlm/noodle_engine_simd.hpp
Outdated
| static really_inline | ||
| hwlm_error_t scanSingle(const struct noodTable *n, const u8 *buf, size_t len, | ||
| size_t start, bool noCase, const struct cb_info *cbi) { | ||
| /* if (len < VECTORSIZE) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commented code. Shouldn't it be removed?
src/hwlm/noodle_engine_simd.hpp
Outdated
| size_t l = d1 - d; | ||
| SuperVector<S> chars = SuperVector<S>::loadu(d) & caseMask; | ||
| typename SuperVector<S>::comparemask_type mask = SINGLE_LOAD_MASK(l * SuperVector<S>::mask_width()); | ||
| typename SuperVector<S>::comparemask_type z = mask & mask1.eqmask(chars); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you forgot the iteration_mask(z); here ? I'm not sure what's its purpose, but it was there in the previous code, and is also there in the double scan path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again, this is WIP, there is uncommitted code that I need to fix. iteration_mask is a way to reproduce the movemask functionality on Intel, it performs a different way in each architecture.
src/hwlm/noodle_engine_simd.hpp
Outdated
| size_t l = buf_end - d; | ||
| typename SuperVector<S>::comparemask_type mask = SINGLE_LOAD_MASK(l * SuperVector<S>::mask_width()); | ||
| typename SuperVector<S>::comparemask_type z = mask & mask1.eqmask(chars); | ||
| hwlm_error_t rv = single_zscan(n, d, buf, z, len, cbi); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing the iteration_mask(z); here too?
ypicchi-arm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing I noticed is that you often make a change in a commit that would break/is missing something, and you later fix it in another commit. I suppose you plan on squashing/reworking those commits?
| typename SuperVector<S>::comparemask_type z2 = mask2.eqmask(chars); | ||
| typename SuperVector<S>::comparemask_type z = (z1 << SuperVector<S>::mask_width()) & z2; | ||
| DEBUG_PRINTF("z: %0llx\n", z); | ||
| lastz1 = z1 >> (S - 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this assume SuperVector<S>::mask_width() == 1 which is not always the case (for arm/neon it's 4)
| typename SuperVector<S>::comparemask_type z2 = mask2.eqmask(chars); | ||
| typename SuperVector<S>::comparemask_type z = (z1 << SuperVector<S>::mask_width() | lastz1) & z2; | ||
| lastz1 = z1 >> (Z_SHIFT * SuperVector<S>::mask_width()); | ||
| lastz1 = z1 >> (S - 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same issue with mask_width()
| uint8_t l = d0 + S - d; | ||
| DEBUG_PRINTF("l: %d \n", l); | ||
| SuperVector<S> chars = SuperVector<S>::loadu_maskz(d, l) & caseMask; | ||
| chars.print8("chars"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
debug print?
src/hwlm/noodle_engine_simd.hpp
Outdated
| typename SuperVector<S>::comparemask_type z1 = mask1.eqmask(chars); | ||
| typename SuperVector<S>::comparemask_type z2 = mask2.eqmask(chars); | ||
| typename SuperVector<S>::comparemask_type z = (z1 << SuperVector<S>::mask_width()) & z2; | ||
| DEBUG_PRINTF("z: %0llx\n", z); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why deleting the debug print here? when you added more of the likes for the scanSingle function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
z has a different type in each architecture, this DEBUG_PRINTF fails to compile on some architectures, so I need to make it work and compile on all architectures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What confuse me is that in the previous function, you added DEBUG_PRINTF("z: %08llx\n", (u64a) z);, so I believe you could have modified this print to work the same way by casting z?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, that's what I did locally after I realized I could just cast it :)
unfortunately I had some other things to fix before the holidays and this was left unfinished -along with other fixes that I have locally. I will be commiting more fixes over the next days.
| DEBUG_PRINTF("mask = %08llx\n", mask); | ||
| SuperVector v = loadu(ptr); | ||
| (void)mask; | ||
| return v; // FIXME: & mask |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FIXME
| static SuperVector loadu(void const *ptr); | ||
| static SuperVector load(void const *ptr); | ||
| static SuperVector loadu_maskz(void const *ptr, uint8_t const len); | ||
| static SuperVector loadu_maskz(void const *ptr, typename base_type::comparemask_type const len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see you add the implementation for arm later on, but I didn't see any implementation for ppc64 ?
|
You are reviewing code which is not ready to be merged yet.
The reason for that is that I may be working on eg. Arm, fixing something that works on Arm, only to find that it fails in another architecture, or another compiler, or even another configuration flag on the same architecture/compiler. Currently the CI compiles almost 100 different configs and if a PR is to be merged, it has to pass on all of them, otherwise it doesn't get merged. Hence the iterative approach. I have thought of squashing those commits I don't see it as a problem currently, as it's mostly myself working on this project, with the occasional contributors, if there is a git history pollution, I may rethink that. |
|
As a WIP it's ok, yes. It's just that it wasn't marked as a draft so I preferred to take the conservative approach of aiming for release quality. I usually prefer to comment on anything suspiscious, even if it means making false positive, than discarding it. I'm sure you'll fix things like those FIXME, but the safe way is for me to remind you they are here just in case :) |
|
Well, previously I never had to draft my PRs as I was the single person reviewing them :) |
Refactored Noodle engine to be closer to already refactored Shufti/Truffle/etc.
Up to 2x as fast than previous version.
Also implemented masked loads, this will be used as a model for masked loads on AVX512 and SVE2 but also other future vector archictectures that will provide predicated/masked loads (SVP64?).