Commit 115e54f

Author: Gal Ben David
- Added support for Python 3.9
- Unigrams and bigrams files are now pickled and gzip-compressed, so they load faster and take less space
1 parent 72bd8a9 commit 115e54f

10 files changed: 38 additions, 619,610 deletions
.github/workflows/build.yml

Lines changed: 5 additions & 5 deletions
@@ -7,7 +7,7 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - name: Checkout
-        uses: actions/checkout@v1
+        uses: actions/checkout@v2
       - name: Install latest rust
         uses: actions-rs/toolchain@v1
         with:
@@ -25,14 +25,14 @@ jobs:
     strategy:
       fail-fast: false
       matrix:
-        python-version: [3.6, 3.7, 3.8]
+        python-version: [3.6, 3.7, 3.8, 3.9]
         os: [ubuntu-latest , macos-latest, windows-latest]

     steps:
       - name: Checkout
-        uses: actions/checkout@v1
+        uses: actions/checkout@v2
       - name: Set up Python ${{ matrix.python-version }}
-        uses: actions/setup-python@v1
+        uses: actions/setup-python@v2
         with:
           python-version: ${{ matrix.python-version }}
       - name: Run image
@@ -46,5 +46,5 @@ jobs:
         run: poetry install
       - name: Build Python package
         run: poetry run maturin develop
-      - name: pytest
+      - name: Test
         run: poetry run pytest tests

Cargo.toml

Lines changed: 6 additions & 1 deletion
@@ -1,6 +1,6 @@
 [package]
 name = "pywordsegment"
-version = "0.1.1"
+version = "0.1.2"
 authors = ["Gal Ben David <[email protected]>"]
 edition = "2018"
 description = "Concatenated-word segmentation Python library written in Rust"
@@ -14,9 +14,14 @@ keywords = ["word", "segment", "rust", "pyo3"]
 requires-python = ">=3.6"
 classifier = [
     "License :: OSI Approved :: MIT License",
+    "Operating System :: MacOS",
+    "Operating System :: Microsoft",
+    "Operating System :: POSIX :: Linux",
     "Programming Language :: Python :: 3.6",
     "Programming Language :: Python :: 3.7",
     "Programming Language :: Python :: 3.8",
+    "Programming Language :: Python :: 3.9",
+    "Programming Language :: Rust",
 ]

 [lib]

README.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -9,7 +9,7 @@
99

1010

1111
![license](https://img.shields.io/badge/MIT-License-blue)
12-
![Python](https://img.shields.io/badge/Python-3.6%20%7C%203.7%20%7C%203.8-blue)
12+
![Python](https://img.shields.io/badge/Python-3.6%20%7C%203.7%20%7C%203.8%20%7C%203.9-blue)
1313
![OS](https://img.shields.io/badge/OS-Mac%20%7C%20Linux%20%7C%20Windows-blue)
1414
![Build](https://github.com/intsights/pywordsegment/workflows/Build/badge.svg)
1515
[![PyPi](https://img.shields.io/pypi/v/pywordsegment.svg)](https://pypi.org/project/pywordsegment/)

pyproject.toml

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -15,7 +15,7 @@ strip = true
1515

1616
[tool.poetry]
1717
name = "pywordsegment"
18-
version = "0.1.1"
18+
version = "0.1.2"
1919
authors = ["Gal Ben David <[email protected]>"]
2020
description = "Concatenated-word segmentation Python library written in Rust"
2121
readme = "README.md"
@@ -36,7 +36,8 @@ classifiers = [
3636
"Programming Language :: Python :: 3.6",
3737
"Programming Language :: Python :: 3.7",
3838
"Programming Language :: Python :: 3.8",
39-
"Programming Language :: Rust"
39+
"Programming Language :: Python :: 3.9",
40+
"Programming Language :: Rust",
4041
]
4142
packages = [
4243
{ include = "pywordsegment" },

pywordsegment/__init__.py

Lines changed: 20 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,6 @@
1+
import gzip
12
import pathlib
3+
import pickle
24
import typing
35

46
from . import pywordsegment
@@ -11,9 +13,25 @@ def __init__(
1113
self,
1214
) -> None:
1315
if WordSegmenter.word_segmenter is None:
16+
current_file_dir = pathlib.Path(__file__).parent.absolute()
17+
18+
unigrams_file = current_file_dir.joinpath('unigrams.pkl.gz')
19+
unigrams = pickle.load(
20+
file=gzip.GzipFile(
21+
filename=str(unigrams_file),
22+
),
23+
)
24+
25+
bigrams_file = current_file_dir.joinpath('bigrams.pkl.gz')
26+
bigrams = pickle.load(
27+
file=gzip.GzipFile(
28+
filename=str(bigrams_file),
29+
),
30+
)
31+
1432
WordSegmenter.word_segmenter = pywordsegment.WordSegmenter(
15-
unigrams_file_path=str(pathlib.Path(__file__).parent.absolute().joinpath('unigrams.txt')),
16-
bigrams_file_path=str(pathlib.Path(__file__).parent.absolute().joinpath('bigrams.txt')),
33+
unigrams=unigrams,
34+
bigrams=bigrams,
1735
total_words_frequency=1024908267229.0,
1836
)
1937

pywordsegment/bigrams.pkl.gz

2.1 MB
Binary file not shown.
