# Add custom tokenizer support #118
Status: Open

Labels: area: core, enhancement, internationalization, priority: high
## Summary
Allow users to provide a custom tokenizer for text processing instead of using the built-in `String#word_hash` method.

## Motivation
From classifier-reborn#131:
The current tokenization is hardcoded:
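A simplified sketch of the kind of hardcoded splitting `String#word_hash` performs; this is illustrative only, the real method also applies stemming and stop-word removal, and the exact regexes here are assumptions:

```ruby
# Illustrative stand-in for the built-in tokenization: strip punctuation,
# downcase, split on whitespace, count occurrences. This baked-in behavior
# is what the issue proposes to make pluggable.
def word_hash(text)
  counts = Hash.new(0)
  text.gsub(/[^\w\s]/, '').downcase.split(/\s+/).each do |word|
    counts[word] += 1
  end
  counts
end

word_hash("The quick, quick fox!")
# => {"the"=>1, "quick"=>2, "fox"=>1}
```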
This doesn't work well for:

- languages that don't separate words with whitespace (hence the internationalization label)
## Proposed API
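One possible shape for the API, sketched with an assumed `tokenizer:` keyword argument and a call-able contract; none of these names or signatures are settled:

```ruby
module Classifier
  class Bayes
    # Default tokenizer mirrors today's downcase-and-split behavior.
    DEFAULT_TOKENIZER = ->(text) { text.downcase.scan(/\w+/) }

    # tokenizer: any object responding to #call(text) and returning an
    # Array of String tokens. (The keyword name is an assumption.)
    def initialize(*categories, tokenizer: DEFAULT_TOKENIZER)
      @categories = categories
      @tokenizer  = tokenizer
    end

    def tokenize(text)
      @tokenizer.call(text)
    end
  end
end

# Example: character bigrams for languages without whitespace word breaks.
bigrams = ->(text) { text.chars.each_cons(2).map(&:join) }
clf = Classifier::Bayes.new("Spam", "Ham", tokenizer: bigrams)
clf.tokenize("abcd")
# => ["ab", "bc", "cd"]
```

Passing a plain lambda (rather than requiring a subclass) keeps the default path unchanged while letting callers swap in anything that responds to `#call`.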
## Affected Classes

- `Classifier::Bayes`
- `Classifier::LSI`
- `Classifier::TFIDF`
- `Classifier::LogisticRegression`

## Related