Description
The span break on a possible superscript requires the char bbox to have no overlap with the previous span. This logic is too strict for documents where char bboxes overlap slightly.
# Character is likely a superscript
if all([
char["bbox"][1] < (span["bbox"][1] - span["bbox"].height * line_distance_threshold), # char top is above span
char["bbox"][3] < (span["bbox"].height * superscript_height_threshold) + span["bbox"][1], # char bottom is not full line height
char["bbox"][0] > span["bbox"][2], # char is to the right of the span
If a char's left edge (bbox[0]) overlaps the previous span's right edge (bbox[2]), the third condition fails and the char is not detected as a superscript.
Reproduction
Test file: https://doi.org/10.1177/1748006X17699145
Current behavior: detects 4 superscript spans.
Expected behavior: Should detect 35 superscript spans that fit existing filtering logic.
Results
Using each superscript char's bbox width as the reference:
30 of the missed superscript chars overlap the last char of the previous span by less than 5%. 1 missed superscript char overlaps by about 7%.
Average overlap: 2.33% of character width
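For reference, the overlap percentages above can be computed along these lines. This is a minimal sketch, not the script used for the numbers in this issue: the helper name `horizontal_overlap_ratio` is hypothetical, and bboxes are assumed to be `(x0, y0, x1, y1)` sequences.

```python
def horizontal_overlap_ratio(char_bbox, prev_bbox):
    """Fraction of the char's width that lies left of prev_bbox's right edge.

    Both bboxes are assumed to be (x0, y0, x1, y1) sequences.
    """
    char_x0, char_x1 = char_bbox[0], char_bbox[2]
    prev_x1 = prev_bbox[2]
    width = char_x1 - char_x0
    if width <= 0:
        # Degenerate char bbox: report no overlap rather than divide by zero.
        return 0.0
    overlap = max(0.0, prev_x1 - char_x0)
    return min(overlap / width, 1.0)
```

A char whose left edge sits 0.5 units inside the previous char, with a width of 10 units, gives a ratio of 0.05 (5%).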
Suggestion
Allow a small overlap in the third condition:
char["bbox"][0] > span["bbox"][2], # char is to the right of the span
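One possible shape for the relaxed check is sketched below, under some assumptions. The function name `is_likely_superscript`, the `overlap_tolerance` parameter and its default of 0.1, and the `Box` class are all hypothetical; `Box` only stands in for the library's bbox type (indexable, with a `height`). The tolerance is expressed as a fraction of the char's width, matching how overlap was measured above.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical stand-in for the bbox type: indexable, with a height."""
    x0: float
    y0: float
    x1: float
    y1: float

    def __getitem__(self, i):
        return (self.x0, self.y0, self.x1, self.y1)[i]

    @property
    def height(self):
        return self.y1 - self.y0


def is_likely_superscript(char_bbox, span_bbox, line_distance_threshold,
                          superscript_height_threshold, overlap_tolerance=0.1):
    # overlap_tolerance is an assumed new parameter: the fraction of the
    # char's width that may overlap the previous span's right edge.
    char_width = char_bbox[2] - char_bbox[0]
    allowed_overlap = char_width * overlap_tolerance
    return all([
        # char top is above span
        char_bbox[1] < span_bbox[1] - span_bbox.height * line_distance_threshold,
        # char bottom is not full line height
        char_bbox[3] < span_bbox.height * superscript_height_threshold + span_bbox[1],
        # char is to the right of the span, allowing a small overlap
        char_bbox[0] > span_bbox[2] - allowed_overlap,
    ])
```

Setting `overlap_tolerance=0` reproduces the current strict comparison. A value of 0.1 would cover the roughly 7% maximum overlap observed in the test file, but the right default is open for discussion.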