Superscript detection fails due to strict bbox overlap check #53

@Tenkeboks

Description

The span break on a possible superscript requires the char bbox to have no horizontal overlap with the previous span. This check is too strict for documents where char bboxes overlap slightly.

# Character is likely a superscript
if all([
char["bbox"][1] < (span["bbox"][1] - span["bbox"].height * line_distance_threshold), # char top is above span
char["bbox"][3] < (span["bbox"].height * superscript_height_threshold) + span["bbox"][1], # char bottom is not full line height
char["bbox"][0] > span["bbox"][2], # char is to the right of the span

If a char's left edge (bbox[0]) overlaps the previous span's right edge (bbox[2]), the third condition fails.

Reproduction

Test file: https://doi.org/10.1177/1748006X17699145
Current behavior: Detects 4 superscript spans.
Expected behavior: Should detect 35 superscript spans that pass the existing filtering logic.

Results

Using the superscript char's bbox width as the reference:
30 of the missed superscript chars have less than 5% overlap with the last char of the previous span. 1 of the missed superscript chars has around 7% overlap.
Average overlap: 2.33% of character width
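
For reference, the overlap percentages above can be computed roughly as follows. This is a minimal sketch, not code from the library: `overlap_fraction` is a hypothetical helper, and bboxes are assumed to be plain `(x0, y0, x1, y1)` tuples rather than the library's own bbox objects.

```python
def overlap_fraction(char_bbox, prev_bbox):
    """Horizontal overlap between a char and the preceding bbox,
    expressed as a fraction of the char's own width.

    Hypothetical helper; bboxes are assumed to be (x0, y0, x1, y1) tuples.
    """
    char_width = char_bbox[2] - char_bbox[0]
    if char_width <= 0:
        # Degenerate bbox: no meaningful width to compare against
        return 0.0
    # Positive only when the char's left edge sits left of the previous right edge
    overlap = max(0.0, prev_bbox[2] - char_bbox[0])
    return overlap / char_width


# Example: a 5-unit-wide char starting 0.5 units inside the previous bbox
print(overlap_fraction((9.5, 0.0, 14.5, 5.0), (0.0, 0.0, 10.0, 10.0)))
```

A value of 0.0 means the char starts at or beyond the previous right edge, which is the only case the current check accepts.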

Suggestion

Allow a small overlap in the third condition:

char["bbox"][0] > span["bbox"][2], # char is to the right of the span
