@diegodlh

When running unicharset_extractor on the Spanish langdata, it warns that capital "Ñ", capital "É" and "«" are absent from the training text (while their counterparts "ñ", "é" and "»" are present). As a result, Tesseract fails to recognize these characters with --oem 0 (for example, it recognizes "Ñ" as "NN" and "É" as "EI").
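
To illustrate the kind of gap described above, here is a minimal sketch that lists uppercase letters and paired quotation marks missing from a text while their counterparts are present. It is an independent check, not a reimplementation of unicharset_extractor; the function name `missing_counterparts` and the `PAIRED_PUNCTUATION` table are hypothetical and chosen only for this example.

```python
# Hypothetical helper: not part of Tesseract or its training tools.
# Assumption: only the guillemet pair is checked as paired punctuation.
PAIRED_PUNCTUATION = {"»": "«", "«": "»"}


def missing_counterparts(text):
    """Return sorted characters whose counterpart occurs in text but which do not."""
    present = set(text)
    missing = set()
    for ch in present:
        upper = ch.upper()
        # Only consider lowercase letters with a single-character uppercase form.
        if ch.islower() and len(upper) == 1 and upper not in present:
            missing.add(upper)
        partner = PAIRED_PUNCTUATION.get(ch)
        if partner is not None and partner not in present:
            missing.add(partner)
    return sorted(missing)


if __name__ == "__main__":
    sample = "España años también México »"
    print(missing_counterparts(sample))
```

Run against the real training_text file (read with UTF-8 encoding), a check like this should flag "Ñ", "É" and "«" if they are indeed absent.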
I'm a beginner in the subject of Tesseract training and I'm not sure how these training_text files are generated. They seem to me to be a more or less random set of words and short phrases. It occurred to me that I could simply make some replacements to cover the missing characters: España -> ESPAÑA, años -> AÑOS, también -> TAMBIÉN, México -> MÉXICO. I also replaced half of the occurrences of "»" with "«".
If my assumption that this file is mostly random is correct, please consider pulling this commit into master. Thank you.
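
The replacements described above could be sketched roughly as follows. This is only an illustration, not the script used for the commit: the name `patch_training_text` is hypothetical, and the sketch assumes that every occurrence of each listed word is replaced and that every other "»" (starting with the first) becomes "«".

```python
# Hypothetical sketch of the described edits; the word list comes from the comment above.
REPLACEMENTS = {
    "España": "ESPAÑA",
    "años": "AÑOS",
    "también": "TAMBIÉN",
    "México": "MÉXICO",
}


def patch_training_text(text):
    """Uppercase the listed words and turn every other '»' into '«'."""
    # Assumption: all occurrences are replaced; pass a count to str.replace to limit this.
    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)

    # Split on '»' and rejoin, alternating between '«' and '»' as the separator.
    parts = text.split("»")
    rebuilt = [parts[0]]
    for index, part in enumerate(parts[1:]):
        separator = "«" if index % 2 == 0 else "»"
        rebuilt.append(separator + part)
    return "".join(rebuilt)


if __name__ == "__main__":
    print(patch_training_text("España y México » años » también »"))
```

Replacing every occurrence would remove the lowercase forms of those particular words, so in practice one would want to confirm that "ñ" and "é" still appear elsewhere in the text.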

@Shreeshrii
Contributor

Thank you. This training text file is suitable for tesseract 3.0x (base tesseract). For 4.0 and lstm training please see the langdata_lstm repo.

@diegodlh
Author

Indeed, I retried tesstrain.sh with langdata_lstm, and its training_text file is so long that this time unicharset_extractor did not complain about missing characters. However, since users may still be using langdata to train the Tesseract 3.0x engine (or Tesseract 4.0 with --oem 0, as I understand it), I think it would be useful to merge my commit into the master branch of plain langdata. Thanks!

