Merged
2 changes: 2 additions & 0 deletions Cargo.lock

Some generated files are not rendered by default.

3 changes: 3 additions & 0 deletions Cargo.toml
@@ -16,13 +16,16 @@ exclude = ["tests/*"]
 [features]
 default = ["onig", "hf-hub"]
 hf-hub = ["dep:hf-hub"]
+local-only = []
Comment on lines 17 to +19
Copilot AI Apr 2, 2026

local-only is currently a marker feature with no code-level effect: the remote-download path is gated only by feature = "hf-hub" (see src/model.rs:380+). As a result, enabling local-only alongside hf-hub still allows remote model downloads, contradicting the README wording. Consider enforcing the restriction in code (e.g., disable/compile-error the remote branch when local-only is enabled) and adding a test that verifies remote from_pretrained fails under local-only even if hf-hub is on.

 onig = ["tokenizers/onig",
 "tokenizers/progressbar",
 "tokenizers/esaxx_fast"]

 fancy-regex = ["tokenizers/fancy-regex",
 "tokenizers/progressbar",
 "tokenizers/esaxx_fast"]
+wasm = ["local-only",
+"tokenizers/unstable_wasm"]

 [dependencies]
 tokenizers = { version = "0.21", default-features = false }
30 changes: 30 additions & 0 deletions README.md
@@ -145,6 +145,36 @@ cargo build --release
* **Batch Processing:** Encodes multiple sentences in batches.
* **Configurable Encoding:** Allows customization of maximum sequence length and batch size during encoding.

### Feature flags

The crate exposes a few feature combinations for different runtimes:

* `default`: native build with `onig` tokenization and optional Hugging Face Hub downloads
* `fancy-regex`: alternative tokenizer backend for native builds
* `local-only`: disable remote model downloads and restrict loading to local paths or `from_bytes(...)`
* `wasm`: minimal WebAssembly-oriented feature set for in-memory loading via `from_bytes(...)`

Comment on lines +150 to +156
Copilot AI Apr 2, 2026

The feature list implies local-only and wasm are standalone “feature combinations”, but local-only doesn’t currently change behavior by itself, and wasm will generally need to be used with --no-default-features (otherwise default enables onig/hf-hub, which is not wasm-friendly). Suggest clarifying the expected invocation patterns (e.g., --no-default-features --features wasm / --no-default-features --features onig,local-only) to avoid confusing or non-working builds.

Typical invocations are:

* native local-only build:
`cargo build --no-default-features --features onig,local-only`
* wasm check:
`RUSTFLAGS='--cfg getrandom_backend="wasm_js"' cargo check --no-default-features --features wasm --target wasm32-unknown-unknown`

The `wasm` feature is intended for `wasm32-unknown-unknown` builds that load models
from in-memory bytes, for example after fetching assets over HTTP or embedding them
into the binary. Direct filesystem access is usually not available in browser-style
WebAssembly environments, so callers should pass file contents through `from_bytes(...)`.
Remote Hugging Face downloads are not available in this mode.

For `wasm32-unknown-unknown`, `getrandom` also requires a target-specific backend
configuration. The minimal check command is:

```bash
RUSTFLAGS='--cfg getrandom_backend="wasm_js"' \
cargo check --no-default-features --features wasm --target wasm32-unknown-unknown
```

## What is Model2Vec?

Model2Vec is a technique to distill large sentence transformer models into highly efficient static embedding models. This process significantly reduces model size and computational requirements for inference. For a detailed understanding of how Model2Vec works, including the distillation process and model training, please refer to the [main Model2Vec Python repository](https://github.com/MinishLab/model2vec) and its [documentation](https://github.com/MinishLab/model2vec/blob/main/docs/what_is_model2vec.md).
18 changes: 13 additions & 5 deletions src/model.rs
@@ -1,12 +1,12 @@
 use anyhow::{anyhow, Context, Result};
 use half::f16;
-#[cfg(feature = "hf-hub")]
+#[cfg(all(feature = "hf-hub", not(feature = "local-only")))]
 use hf_hub::api::sync::Api;
 use ndarray::{Array2, ArrayView2, CowArray, Ix2};
 use safetensors::{tensor::Dtype, SafeTensors};
 use serde_json::Value;
 use std::borrow::Cow;
-#[cfg(feature = "hf-hub")]
+#[cfg(all(feature = "hf-hub", not(feature = "local-only")))]
 use std::env;
 use std::{fs, path::Path};
 use tokenizers::Tokenizer;
@@ -384,6 +384,8 @@ fn resolve_model_files<P: AsRef<Path>>(
 ) -> Result<ModelFiles> {
     #[cfg(not(feature = "hf-hub"))]
     let _ = token;
+    #[cfg(feature = "local-only")]
+    let _ = token;

     let (tokenizer, model, config) = {
         let base = repo_or_path.as_ref();
@@ -397,12 +399,18 @@ fn resolve_model_files<P: AsRef<Path>>(
             }
             (tokenizer, model, config)
         } else {
-            #[cfg(feature = "hf-hub")]
+            #[cfg(all(feature = "hf-hub", not(feature = "local-only")))]
             {
                 let files = download_model_files(repo_or_path.as_ref().to_string_lossy().as_ref(), token, subfolder)?;
                 (files.tokenizer, files.model, files.config)
             }
-            #[cfg(not(feature = "hf-hub"))]
+            #[cfg(feature = "local-only")]
+            {
+                return Err(anyhow!(
+                    "remote model downloads are disabled by the `local-only` feature; pass a local model directory instead"
+                ));
+            }
+            #[cfg(all(not(feature = "hf-hub"), not(feature = "local-only")))]
             {
                 return Err(anyhow!(
                     "remote model downloads require the `hf-hub` feature; pass a local model directory instead"
@@ -418,7 +426,7 @@ fn resolve_model_files<P: AsRef<Path>>(
     })
 }

-#[cfg(feature = "hf-hub")]
+#[cfg(all(feature = "hf-hub", not(feature = "local-only")))]
 fn download_model_files(repo_id: &str, token: Option<&str>, subfolder: Option<&str>) -> Result<ModelFiles> {
     let previous = token.and_then(|_| env::var_os("HF_HUB_TOKEN"));
     if let Some(tok) = token {
10 changes: 10 additions & 0 deletions tests/test_model.rs
@@ -132,3 +132,13 @@ fn test_from_pretrained_remote_requires_hf_hub_feature() {
         "expected remote loading without hf-hub to mention the missing feature"
     );
 }
+
+#[cfg(all(feature = "hf-hub", feature = "local-only"))]
+#[test]
+fn test_from_pretrained_remote_disallowed_by_local_only_feature() {
+    let err = StaticModel::from_pretrained("minishlab/potion-base-2M", None, None, None).unwrap_err();
+    assert!(
+        err.to_string().contains("local-only"),
+        "expected remote loading with local-only to mention the local-only restriction"
+    );
+}
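
The three-way feature gating that this change introduces in `resolve_model_files` can be summarized in a standalone sketch. This is only an illustration under stated assumptions: `remote_downloads_enabled` and `resolve_remote` are hypothetical names that do not exist in the crate, the download itself is replaced by a placeholder string, and the sketch uses the `cfg!` macro instead of the `#[cfg(...)]` attributes in the actual diff, so all branches are compiled and only one is taken.

```rust
/// Hypothetical helper (not part of the crate): true only when the remote
/// path would be compiled in, i.e. `hf-hub` is on and `local-only` is off.
const fn remote_downloads_enabled() -> bool {
    cfg!(all(feature = "hf-hub", not(feature = "local-only")))
}

/// Hypothetical stand-in for the non-local branch of `resolve_model_files`.
/// The real code downloads files; this sketch only reports which branch
/// the active feature set selects.
fn resolve_remote(repo_id: &str) -> Result<String, String> {
    if remote_downloads_enabled() {
        // Placeholder for the actual Hub download.
        Ok(format!("would download {repo_id}"))
    } else if cfg!(feature = "local-only") {
        // `local-only` wins even when `hf-hub` is also enabled.
        Err("remote model downloads are disabled by the `local-only` feature; \
             pass a local model directory instead"
            .to_string())
    } else {
        // Neither `hf-hub` nor `local-only` is enabled.
        Err("remote model downloads require the `hf-hub` feature; \
             pass a local model directory instead"
            .to_string())
    }
}
```

Compiled with no features set, `resolve_remote("minishlab/potion-base-2M")` takes the last branch and returns the `hf-hub` error. With both `hf-hub` and `local-only` enabled it takes the middle branch, which is the case the new test covers.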