Optimize UnicodeHash lookup performance using std::unordered_map by antalvdb · Pull Request #30 · LanguageMachines/ticcutils

antalvdb · 2025-11-21T21:32:06Z

Pull Request: Optimize UnicodeHash lookup performance using std::unordered_map

Description

This PR optimizes the Hash::UnicodeHash class by replacing the internal custom Trie implementation (Tries::UniTrie) with a standard std::unordered_map.

Profiling of the dependent timbl application revealed that UnicodeHash::lookup and UnicodeHash::hash were significant performance bottlenecks, together consuming over 50% of the CPU time during the learning phase on large datasets. This was due to the trie's O(L) lookup cost, where L is the length of the key string. Switching to a hash map provides O(1) average-case lookups.

Changes

include/ticcutils/UniHash.h:
- Replaced Tries::UniTrie<UniInfo> _tree with std::unordered_map<icu::UnicodeString, UniInfo*, UnicodeStringHash> _map.
- Added a custom UnicodeStringHash struct to support icu::UnicodeString keys.

src/UniHash.cxx:
- Refactored hash() and lookup() to use std::unordered_map API (find, insert).
- Added a conditional check Normalizer2::isNormalized to avoid expensive normalization if the input string is already in NFC format.
- Updated the destructor ~UnicodeHash() to explicitly delete UniInfo pointers stored in the map to prevent memory leaks.
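
The data-structure change listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: to stay self-contained it uses std::u16string as a stand-in for icu::UnicodeString, and the names Info, U16Hash, and StringIndex are hypothetical. The 1-based index and the "0 means unknown" return value are conventions of this sketch only. The Normalizer2::isNormalized check is left out because it requires ICU.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the PR's UniInfo record.
struct Info {
    std::u16string name;
    unsigned int index;
};

// Custom hasher passed as the third template argument of the map.
// Assumption: std::u16string replaces icu::UnicodeString in this sketch.
struct U16Hash {
    std::size_t operator()(const std::u16string& s) const noexcept {
        return std::hash<std::u16string>{}(s);
    }
};

class StringIndex {
public:
    StringIndex() = default;
    // The map owns raw pointers, so copying is disabled.
    StringIndex(const StringIndex&) = delete;
    StringIndex& operator=(const StringIndex&) = delete;

    // Explicitly delete the records stored in the map.
    ~StringIndex() {
        for (auto& kv : map_) {
            delete kv.second;
        }
    }

    // Return the index of s, assigning the next free index if s is new.
    unsigned int hash(const std::u16string& s) {
        auto it = map_.find(s);
        if (it != map_.end()) {
            return it->second->index;
        }
        Info* info = new Info{s, static_cast<unsigned int>(map_.size() + 1)};
        map_.insert({s, info});
        return info->index;
    }

    // Return the index of s, or 0 if s was never hashed.
    unsigned int lookup(const std::u16string& s) const {
        auto it = map_.find(s);
        return it == map_.end() ? 0u : it->second->index;
    }

private:
    std::unordered_map<std::u16string, Info*, U16Hash> map_;
};
```

With an ICU key type, the hasher could wrap UnicodeString::hashCode() in place of std::hash. Storing std::unique_ptr<Info> as the mapped type would remove the need for the manual destructor, at the cost of diverging from the raw-pointer layout described in the change list.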

Performance Analysis

Benchmarks were run using timbl on the edufineweb_train_000001-100k dataset (~4.9M lines).

| Metric | Original (UniTrie) | Optimized (unordered_map) | Improvement |
|---|---|---|---|
| Total Learning Time | ~274 s | ~125 s | ~2.2x speedup |
| Lookup + Hash Time | ~145.3 s | ~3.4 s | ~42x speedup |

Profiling Details (Top Functions)

Before (Original):
```text
 % time  cumulative (s)  self (s)  name
  38.44          101.33    101.33  Hash::UnicodeHash::lookup
  16.66          145.26     43.93  Hash::UnicodeHash::hash
```

After (Optimized):
```text
 % time  cumulative (s)  self (s)  name
   1.92           96.51      2.19  Hash::UnicodeHash::lookup
   1.05          101.13      1.20  Hash::UnicodeHash::hash
```
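
As a consistency check, the self-time and percentage columns of the two profiles reproduce the figures quoted in the description and in the summary table. The input values are taken from the listings above (times in seconds); the sums and ratios are computed here and rounded.

```math
\begin{aligned}
\text{lookup + hash, before:}\quad & 101.33 + 43.93 = 145.26 \approx 145.3 \\
\text{lookup + hash, after:}\quad & 2.19 + 1.20 = 3.39 \approx 3.4 \\
\text{lookup + hash speedup:}\quad & 145.26 / 3.39 \approx 42.8 \\
\text{total learning speedup:}\quad & 274 / 125 \approx 2.2 \\
\text{CPU share, before:}\quad & 38.44\% + 16.66\% = 55.10\% \\
\text{CPU share, after:}\quad & 1.92\% + 1.05\% = 2.97\%
\end{aligned}
```

The 55.10% share matches the "over 50% of the CPU time" stated in the description, and the ratio of roughly 42.8 is in line with the ~42x reported in the table.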

The bottleneck has been effectively eliminated, shifting the primary processing time to the core algorithm logic (ClassDistribution::IncFreq).
ticcutils_optimization_pr.zip

kosloot added 6 commits December 5, 2024 08:40

Squashed commit of the following:

2b0256e

modernizing, expanded tests, cleanup cleaning up XmlTools. update FileUtils to use filesystem::remove updating enum_types handling. reworking LogStream C++ code quality (C++17)

update NEWS

345b693

added a newline

6a62429

fix CI

c7058db

stupid typo

36f7964

update CI

d88ba65

kosloot self-assigned this Nov 22, 2025

kosloot added a commit that referenced this pull request Nov 22, 2025

updated as suggested by #30

f27ebe4

kosloot added 3 commits November 22, 2025 16:38

Merge branch 'develop' into master

99353c6

Squashed commit of the following:

4a0d8c0

Date: Sat Nov 22 16:07:08 2025 +0100 updated as suggested by #30 entering 2025 updated nlohmann json.hpp

Merge branch 'master' of github.com:LanguageMachines/ticcutils

244ee10

kosloot merged commit 244ee10 into develop Nov 22, 2025

kosloot merged 9 commits into develop from master
