Configurable minimum word length for tokenization #120
Status: Open
Labels: area: core (Core functionality affecting all classifiers), enhancement (New feature or request), good first issue (Good for newcomers), internationalization (International language support), priority: medium (Medium priority)
Summary
Allow configuring the minimum word length filter (currently hardcoded to 2) in tokenization.
Motivation
From classifier-reborn#176:
The current code filters out words with length ≤ 2.
This assumption is problematic in several cases.
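To illustrate the effect of such a filter, here is a minimal standalone sketch (not classifier-reborn's actual implementation); with a hardcoded minimum, short tokens, including many one- and two-character words in languages like Chinese or Japanese, are silently dropped:

```ruby
# Standalone illustration (not the gem's actual code): tokens shorter
# than min_length are dropped before counting.
def word_hash(words, min_length = 3)
  counts = Hash.new(0)
  words.each do |w|
    counts[w.downcase.to_sym] += 1 if w.length >= min_length
  end
  counts
end

tokens = %w[go ai 猫 ruby classifier]
word_hash(tokens)     # drops "go", "ai", and the single-character "猫"
word_hash(tokens, 1)  # a configurable minimum would keep them
```

With the default minimum of 3, only `ruby` and `classifier` survive; lowering the minimum to 1 keeps all five tokens.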
Proposed API
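The issue's API proposal is not preserved here. One plausible shape is a constructor option; note that the `min_token_length` keyword and the `SimpleTokenizer` class below are hypothetical illustrations, not the project's actual API:

```ruby
# Hypothetical option shape; `min_token_length` is an assumption,
# not classifier-reborn's actual API:
#
#   classifier = ClassifierReborn::Bayes.new('Spam', 'Ham', min_token_length: 1)
#
# A standalone tokenizer honoring such an option might look like:
class SimpleTokenizer
  def initialize(min_token_length: 3)
    @min = min_token_length
  end

  # Lowercase, extract word characters, drop tokens below the minimum.
  def tokenize(text)
    text.downcase.scan(/\p{Word}+/).select { |w| w.length >= @min }
  end
end

SimpleTokenizer.new(min_token_length: 1).tokenize("AI 猫 ruby")
# keeps all three tokens instead of only "ruby"
```

A keyword argument with the current value (length > 2) as its default would preserve backward compatibility while letting CJK or short-token use cases opt out.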
Workaround
Currently, users would need to monkey-patch `String#word_hash_for_words` or use a custom tokenizer (if #114 is implemented).
Related
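Returning to the workaround above, here is a sketch of the monkey-patch pattern. The first method body is a stand-in, since the gem's real `String#word_hash_for_words` implementation differs:

```ruby
# Stand-in for the gem's String#word_hash_for_words (the real body differs):
class String
  def word_hash_for_words(min_length = 3)
    counts = Hash.new(0)
    downcase.scan(/\p{Word}+/).each do |w|
      counts[w.to_sym] += 1 if w.length >= min_length
    end
    counts
  end
end

# The workaround: re-open String and lower the hardcoded minimum.
class String
  alias_method :original_word_hash_for_words, :word_hash_for_words

  def word_hash_for_words
    original_word_hash_for_words(1)
  end
end

"go to 京都".word_hash_for_words  # two-character tokens are now counted
```

Re-opening a core class like this is fragile (it affects every caller globally), which is part of the motivation for a first-class configuration option.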