Ship fast tries in icu_normalizer_data by default #6836

@hsivonen

Changing from the small trie type to the fast trie type doubles the throughput of confirming that UTF-16 input that is already in NFC is indeed in NFC, for Japanese and Chinese. Korean becomes even faster. It seems reasonable to assume that other languages whose characters mostly fall between U+1000 and U+FFFF would get doubled throughput, too.

It seems bad to offer worse performance by default for this part of the BMP.
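To illustrate why this range in particular is affected, here is a minimal sketch. It assumes the usual code point trie design, in which the small type serves code points up to U+0FFF through the direct fast path and the fast type extends that path to U+FFFF. `TrieKind`, `LookupPath`, and `lookup_path` are hypothetical names for this illustration, not the `icu_collections` API.

```rust
/// Which lookup path a code point takes (illustrative model only).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum LookupPath {
    /// One index load followed by one data load.
    Direct,
    /// Extra index stages are walked before the data load.
    MultiStage,
}

/// Hypothetical stand-in for the two trie types discussed in this issue.
#[derive(Debug, Clone, Copy)]
enum TrieKind {
    Small,
    Fast,
}

impl TrieKind {
    /// Highest code point served by the direct path.
    /// Assumption: U+0FFF for the small type, U+FFFF for the fast type.
    fn direct_max(self) -> u32 {
        match self {
            TrieKind::Small => 0x0FFF,
            TrieKind::Fast => 0xFFFF,
        }
    }
}

/// Classifies a code point by the path it would take in a trie of `kind`.
fn lookup_path(kind: TrieKind, c: char) -> LookupPath {
    if (c as u32) <= kind.direct_max() {
        LookupPath::Direct
    } else {
        LookupPath::MultiStage
    }
}

fn main() {
    // Latin, Hiragana, CJK ideograph, Hangul syllable, supplementary-plane emoji.
    for c in ['\u{00E9}', '\u{3042}', '\u{4E2D}', '\u{D55C}', '\u{1F600}'] {
        println!(
            "U+{:04X}: small = {:?}, fast = {:?}",
            c as u32,
            lookup_path(TrieKind::Small, c),
            lookup_path(TrieKind::Fast, c)
        );
    }
}
```

Under these assumptions, only code points in U+1000 to U+FFFF change paths between the two trie types; code points below U+1000 and supplementary-plane code points behave the same either way.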

For nfd, databake claims 27948B (small) vs. 34748B (fast). (Fast is 6.6 KB larger.)

For nfkd, databake claims 43132B (small) vs. 51324B (fast). (Fast is 8.0 KB larger.)

For uts46d, databake claims 56200B vs. 69488B. (Fast is 13 KB larger.) It seems bad to pessimize widely-used languages to save 13 KB of binary size. However, people already complain about having to carry any IDNA data as a side effect of depending on url, and IDNA processing seems unlikely to be a perf bottleneck, so perhaps we should keep defaulting to the small trie type for this one.

For the Harfbuzz supplement, databake claims 4486B vs. 6382B, but there may be few or no characters in the U+1000 to U+FFFF range that result in a query to this trie, so perhaps it makes sense to keep the small trie type for this one. (But we should check whether there are characters of interest in the relevant range.)
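The size deltas quoted above can be reproduced with a few lines of arithmetic. This is a sketch that assumes "KB" means 1024 bytes, which is consistent with the 8.0 KB figure for nfkd; the table and function names are made up for this example, and the byte counts are the databake figures quoted in this issue.

```rust
/// (data set, small trie bytes, fast trie bytes), as quoted from databake above.
const SIZES: [(&str, u32, u32); 4] = [
    ("nfd", 27948, 34748),
    ("nfkd", 43132, 51324),
    ("uts46d", 56200, 69488),
    ("Harfbuzz supplement", 4486, 6382),
];

/// Extra bytes paid for choosing the fast trie type over the small one.
fn extra_bytes(small: u32, fast: u32) -> u32 {
    fast - small
}

/// The same difference expressed in units of 1024 bytes.
fn extra_kib(small: u32, fast: u32) -> f64 {
    f64::from(extra_bytes(small, fast)) / 1024.0
}

fn main() {
    for (name, small, fast) in SIZES {
        println!(
            "{name}: +{} B ({:.1} KB)",
            extra_bytes(small, fast),
            extra_kib(small, fast)
        );
    }
    // Cost of switching only nfd and nfkd to the fast type.
    let nfd_nfkd = extra_bytes(27948, 34748) + extra_bytes(43132, 51324);
    println!("nfd + nfkd: +{nfd_nfkd} B");
}
```

If only nfd and nfkd switched to the fast type while uts46d and the Harfbuzz supplement stayed small, the total increase would be 14992 bytes.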

Labels: A-data (Area: Data coverage or quality); A-performance (Area: Performance (CPU, Memory)); C-collator (Component: Collation, normalization); C-data-infra (Component: provider, datagen, fallback, adapters)