outline_detection detects where one Tibetan text ends and another begins, using pattern rules (yig mgo ༄༅, section mark ༈, closing phrases) and an optional CRF sequence labeler. Built from analysis of 82,560 annotated boundary snippets (31,591 after deduplication).
pip install -e . # core (rule-based detection + evaluation)
pip install -e ".[crf]" # also install CRF extras (scikit-learn, sklearn-crfsuite)

Requires Python 3.9+. (pip install -r requirements.txt does the editable [crf] install.)
from outline_detection import detect_breakpoints
text = "...རྫོགས་སོ།། ༄༅། །next text..."
detect_breakpoints(text)
# {"breakpoints": [0, 152, 410, ...]}

detect_breakpoints returns a dict with key breakpoints whose value is the list of boundary start indices (character offsets) found by the rule-based detector.
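Because the breakpoints are character offsets into the input, the result can be turned into one string per detected text with plain slicing. The sketch below is illustrative only: split_at_breakpoints is a hypothetical helper, not part of outline_detection, and it assumes the offsets are boundary start positions as described above.

```python
def split_at_breakpoints(text, breakpoints):
    """Slice text into segments, one per boundary start offset."""
    # Keep only offsets that fall inside the text, sorted and deduplicated.
    starts = sorted({b for b in breakpoints if 0 <= b < len(text)})
    # Make sure the first segment starts at the beginning of the text.
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    ends = starts[1:] + [len(text)]
    return [text[s:e] for s, e in zip(starts, ends)]


# Stand-in for the dict returned by detect_breakpoints(text).
result = {"breakpoints": [0, 6, 13]}
segments = split_at_breakpoints("first second third", result["breakpoints"])
print(segments)
```

Concatenating the segments reproduces the original string, so no characters are lost at the boundaries.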
Options:
detect_breakpoints(text, profile="precision") # recall | balanced | precision
detect_breakpoints(text, min_confidence=0.5) # override threshold
detect_breakpoints(text, detailed=True) # adds per-boundary confidence + rule

For OCR output that arrives as a sequence of pages, the optional page-layout rules (I–L) use line density instead of orthographic signals. They are off by default and inert on continuous (single-page) text:
detect_breakpoints(
    ocr_text,
    rule_i_empty_page=True,   # an empty page marks a break
    rule_j_sparse_tail=True,  # dense page then two sparse pages
    line_threshold=4,         # T: "few lines" cutoff
    page_delimiter="\f",      # form feed (default), "blank"/"blankN", or regex
)

See docs/rules.md for the full rule set.
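To make the idea of line density concrete, here is a minimal sketch of how per-page line counts could feed rules like I and J. It is not the library's implementation: the three function names are hypothetical, and treating "sparse" as fewer than line_threshold non-blank lines (and "dense" as the opposite) is an assumption. Where the detector actually places the break relative to these pages is described in docs/rules.md, not here.

```python
def lines_per_page(ocr_text, page_delimiter="\f"):
    """Count non-blank lines on each page of delimiter-separated OCR text."""
    pages = ocr_text.split(page_delimiter)
    return [sum(1 for line in page.splitlines() if line.strip()) for page in pages]


def empty_page_indices(counts):
    """Rule I idea: pages with no non-blank lines at all."""
    return [i for i, n in enumerate(counts) if n == 0]


def sparse_tail_indices(counts, line_threshold=4):
    """Rule J idea: index of each dense page followed by two sparse pages.

    Assumption: a page is sparse when it has fewer than line_threshold lines.
    """
    hits = []
    for i in range(len(counts) - 2):
        dense = counts[i] >= line_threshold
        sparse_next = counts[i + 1] < line_threshold and counts[i + 2] < line_threshold
        if dense and sparse_next:
            hits.append(i)
    return hits


# Five pages separated by form feeds: 5, 1, 1, 0 and 1 non-blank lines.
ocr_text = "a\nb\nc\nd\ne\fx\n\fy\n\f\fz"
counts = lines_per_page(ocr_text)
print(counts, empty_page_indices(counts), sparse_tail_indices(counts))
```

On text without any page delimiter the split yields a single page, so neither helper can fire, which matches the note above that the page-layout rules are inert on continuous text.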
The install provides an outline-detect command.
Detect boundaries (text in -> JSON out):
outline-detect detect mytext.txt
# {"breakpoints": [0, 152, 410]}
echo "..." | outline-detect detect - # read from stdin
outline-detect detect --text "རྫོགས་སོ།། ༄༅། །next" --pretty
outline-detect detect mytext.txt -o result.json

Evaluate against annotated data:
outline-detect evaluate data/breakpoints_context_snippets_unique.json --profile balanced --tolerance 15
outline-detect evaluate data/breakpoints_context_snippets_unique.json --all-profiles --tolerance 15

Analyze boundary patterns:
outline-detect analyze data/breakpoints_context_snippets_unique.json

Annotate a raw file with boundary markers:
outline-detect predict data/samples/INPUT.txt --profile balanced

CRF (requires the [crf] extra):
# Full-corpus train with feature cache and post-train eval
outline-detect crf train data/breakpoints_context_snippets.json \
--save-model --features-cache reports/models/crf_features.pkl \
--eval-file data/breakpoints_context_snippets_unique.json
# Evaluate a saved model
outline-detect crf evaluate data/breakpoints_context_snippets_unique.json \
--model reports/models/boundary_crf.pkl --tolerance 15
outline-detect crf predict data/samples/INPUT.txt --model reports/models/boundary_crf.pkl

| File | Contents |
|---|---|
| data/breakpoints_context_snippets.json | 82,560 annotated snippets |
| data/breakpoints_context_snippets_unique.json | 31,591 deduplicated snippets |
| data/samples/ | Optional raw .txt files for prediction demos |
Boundaries in annotated JSON are marked with </b> (or <b>).
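As an illustration of how such markers map to offsets, the sketch below strips the markers from one annotated string and records where each boundary falls in the cleaned text. strip_markers is a hypothetical helper, and it works on a plain string only; the layout of the JSON files themselves is not assumed here.

```python
import re

# Matches either marker form mentioned above: </b> or <b>.
MARKER = re.compile(r"</?b>")


def strip_markers(annotated):
    """Return (clean_text, offsets); offsets index into clean_text."""
    parts, offsets = [], []
    clean_len, last = 0, 0
    for m in MARKER.finditer(annotated):
        chunk = annotated[last:m.start()]
        parts.append(chunk)
        clean_len += len(chunk)
        offsets.append(clean_len)  # boundary sits where the marker was
        last = m.end()
    parts.append(annotated[last:])
    return "".join(parts), offsets


clean, offsets = strip_markers("རྫོགས་སོ།། </b>༄༅། །next")
print(offsets, clean[offsets[0]:])
```

The text following the recorded offset starts with the yig mgo ༄༅, which is the position a boundary start index would point at.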
Hugging Face Hub:
| Resource | Repo |
|---|---|
| Full snippets (82,560) | ganga4364/tibetan-outline-boundary-snippets-full |
| Unique benchmark (31,591) | ganga4364/tibetan-outline-boundary-snippets-unique |
| CRF full (production) | ganga4364/tibetan-outline-boundary-crf-full |
| CRF unbiased (honest eval) | ganga4364/tibetan-outline-boundary-crf-unbiased |
hf download ganga4364/tibetan-outline-boundary-snippets-unique --repo-type dataset
hf download ganga4364/tibetan-outline-boundary-crf-unbiased boundary_crf.pkl --local-dir ./reports/models

evaluate, analyze, predict, and crf write under ./reports/ (relative to where you run the command; gitignored except .gitkeep):
| Directory | Contents |
|---|---|
| reports/evaluations/ | rule_based_evaluation_*.md |
| reports/analysis/ | boundary_report_*.md / .json |
| reports/diagnostics/ | false_negatives.json |
| reports/models/ | CRF .pkl models |
| reports/ | predicted_boundaries.txt, crf_predicted.txt |
| Method | F1 |
|---|---|
| Rule-based (balanced) | 0.601 |
| CRF full | 0.571 |
| CRF unbiased | 0.555 |
Rule-based balanced reaches ~63% precision and ~57.5% recall. Primary active rules: A (yig mgo) and G (༈). See docs/evaluation.md for full comparison and regeneration commands.
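For readers who want to see how the figures relate, the sketch below scores predicted offsets against gold offsets with a tolerance window and derives precision, recall, and F1. It is a simplified stand-in, not the package's evaluation code: the greedy nearest-match strategy and the reading of tolerance as a distance in characters are assumptions, and all function names are hypothetical.

```python
def count_matches(predicted, gold, tolerance=15):
    """Greedy one-to-one matching of predictions to gold boundaries."""
    unmatched = sorted(gold)
    true_positives = 0
    for p in sorted(predicted):
        candidates = [g for g in unmatched if abs(g - p) <= tolerance]
        if candidates:
            # Each gold boundary can be matched only once.
            unmatched.remove(min(candidates, key=lambda g: abs(g - p)))
            true_positives += 1
    return true_positives


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f1(predicted, gold, tolerance=15):
    tp = count_matches(predicted, gold, tolerance)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)


# Toy example: three of four predictions land within 15 characters of a gold boundary.
precision, recall, f1 = precision_recall_f1([0, 150, 400, 900], [0, 152, 410, 700, 1200])
print(round(precision, 3), round(recall, 3), round(f1, 3))

# The rounded precision and recall quoted above combine to the reported F1.
reported_f1 = round(f1_score(0.63, 0.575), 3)
print(reported_f1)
```

The last two lines show that the quoted ~63% precision and ~57.5% recall are consistent with the 0.601 F1 in the table.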
- docs/terminology.md — Tibetan signals, markup, metrics
- docs/rules.md — Rules A–H (orthographic) and I–L (page layout)
- docs/workflow.md — Full step-by-step workflow
- docs/huggingface.md — Hub datasets and models
- docs/evaluation.md — Benchmark results
- docs/README.md — Doc index
- CHANGELOG.md — Release notes
├── pyproject.toml
├── requirements.txt
├── src/
│ └── outline_detection/ # api, cli, detector, evaluation, analyzer, crf, utils, paths
├── data/ # Annotated JSON corpora and samples/
├── docs/ # Static reference
├── scripts/ # Training, comparison, and Hub upload helpers
└── reports/ # Generated outputs (gitignored)