Low-latency local cognitive co-processor for developer workflows. A full-stack autocomplete platform that runs on your own machine via
llama-server or LM Studio, with a plugin system for code, CLI, SQL, JSON, configs, macros, intent expansion, writing, and an agentic execute-and-verify loop.
Autocomplete is usually framed as a convenience. The right framing is compression of human intent: the better you compress what the user is trying to do, the closer you get to real-time thinking assistance. This repo pushes toward that framing instead of stopping at "next-token prediction."
autocomplete/
├── autocomplete/ # installable package
│ ├── core/
│ │ ├── backends.py # llama-server, LM Studio, mock
│ │ ├── engine.py # CompletionEngine (cache + metrics)
│ │ ├── cache.py # LRU + prefix cache for streaming
│ │ ├── context.py # CompletionContext (cursor, file, repo)
│ │ ├── streaming.py # stream helpers + stop-word handling
│ │ ├── config.py # built-in model profiles
│ │ └── metrics.py # latency, acceptances, keystrokes saved
│ │
│ ├── plugins/
│ │ ├── base.py # Plugin interface
│ │ ├── registry.py # Registry + Router (match scoring)
│ │ ├── code.py # generic code (FIM)
│ │ ├── cli.py # shell command expansion
│ │ ├── sql.py # schema-aware SQL composer
│ │ ├── json_schema.py # JSON payloads + tiny validator
│ │ ├── config_gen.py # YAML / Dockerfile / Terraform / k8s
│ │ ├── intent.py # '# train cnn on cifar' -> full pipeline
│ │ ├── macros.py # /rag setup, /fastapi, /pytest, ...
│ │ └── writing.py # tone-aware email / note / journal
│ │
│ ├── context/
│ │ ├── embeddings.py # zero-dep hashing embedder
│ │ └── repo_index.py # repo-level retrieval
│ │
│ ├── agentic/
│ │ ├── executor.py # suggest -> run -> verify -> self-heal
│ │ └── patcher.py # apply completion at cursor
│ │
│ ├── experiments/
│ │ ├── speculative.py # draft-and-target speculative decoding
│ │ └── memory.py # persistent style memory (JSONL)
│ │
│ ├── benchmarks/
│ │ ├── latency.py # p50/p95/p99 against any backend
│ │ ├── keystrokes.py # keystrokes-saved simulator
│ │ └── accuracy.py # edit-distance + exact match vs truth
│ │
│ └── cli.py # unified `autocomplete` CLI
│
├── demos/ # runnable end-to-end demos (mock + real)
├── tests/ # 57 passing unit tests
├── Using_llama_server/ # original quickstart (kept)
├── Using_LM_Studio/ # LM Studio quickstart
└── pyproject.toml
```bash
pip install -r requirements.txt   # just `requests`
pip install -e .                  # exposes the `autocomplete` CLI
```

Option A — llama-server (recommended; used throughout the demos):

```bash
# pick any GGUF coding model, e.g. zeta-2 / nemotron-nano / lfm2.5
llama-server -m zed-industries_zeta-2-Q4_K_S.gguf
```

Option B — LM Studio (GUI): load your model, start the server, and leave http://localhost:1234/v1 as the endpoint. For vision models like Qwen3-VL-8B or Gemma 3 E4B, use this option.

Option C — offline: the mock backend needs no server and returns deterministic text. Great for CI and tests:
```bash
export AUTOCOMPLETE_BACKEND=mock
```

```bash
# auto-routes to the best plugin
echo 'def add(a, b):' | autocomplete complete

# force a specific plugin
autocomplete complete --plugin sql --prefix 'SELECT email, SUM'

# see which plugin would handle an input
autocomplete route --prefix 'git pu'
# 0.85 cli
# 0.40 writing
# 0.05 code
# Selected: cli

# stream tokens live
autocomplete stream --prefix 'def fib(n):\n'

# interactive REPL with warm cache + metrics
autocomplete repl

# benchmark latency on your model
autocomplete benchmark --prefix 'def f():' --n 30
```

| Idea from the brief | Where it lives |
|---|---|
| 1. Command autocomplete / intent blocks | plugins/cli.py, plugins/intent.py |
| 2. Structured autocomplete (JSON / SQL / YAML / TF) | plugins/json_schema.py, plugins/sql.py, plugins/config_gen.py |
| 3. Latency-sensitive flow-state | core/streaming.py, core/cache.py (prefix cache) |
| 4. Multi-modal (vision models) | core/config.py (qwen3-vl-8b, gemma3-e4b profiles) via core/backends.py LM Studio chat mode |
| 5. Context-aware completion | context/repo_index.py, core/context.py |
| 6. Decision layer | any plugin accepts an instruction in context; see plugins/intent.py |
| 7. Prompt-to-system macro pipelines | plugins/macros.py |
| 8. Agentic autocomplete | agentic/executor.py |
| 9. Personal cognition assistant | plugins/writing.py |
| 10. Benchmarking module | benchmarks/ |
| 11. Plug-in architecture | plugins/base.py, plugins/registry.py |
| 12. Wild demos | demos/ (code, CLI, SQL, intent, macro, stream, agentic) |
| 13a. Speculative decoding | experiments/speculative.py |
| 13b. Memory-augmented | experiments/memory.py |
| 13c. RL / reward signal | core/metrics.py (accept / reject / keystrokes_saved hooks) |
The same pipeline is available as a library:

```python
from autocomplete import CompletionEngine, load_backend, Router, registry
from autocomplete.core.context import CompletionContext

engine = CompletionEngine(load_backend("llama-server"))

ctx = CompletionContext(
    prefix="SELECT email, SUM(total_cents)\n",
    language="sql",
    extras={
        "dialect": "postgres",
        "schema": {"users": ["id", "email"], "orders": ["id", "user_id", "total_cents"]},
    },
)

plugin = Router(registry).select(ctx)         # auto-picks `sql`
print(engine.complete(plugin, ctx).text)      # runnable, schema-respecting SQL
```

Stream tokens as they arrive:

```python
for tok in engine.stream(plugin, ctx):
    print(tok, end="", flush=True)
```

Repo-level retrieval:

```python
from autocomplete.context import RepoIndex

idx = RepoIndex().build(".")                  # one-time, in-memory
ctx.repo_snippets = idx.search_snippets("user auth", k=4)
```

Agentic execute-and-verify:

```python
from autocomplete.agentic import AgenticExecutor

agent = AgenticExecutor(engine, registry.get("code"), max_steps=3)
completion, exec_result, trace = agent.run(ctx)
```

Speculative decoding (draft + target):

```python
from autocomplete.core.backends import LlamaServerBackend
from autocomplete.experiments import SpeculativeDecoder

draft = LlamaServerBackend(url="http://localhost:8081/v1/completions")    # LFM2.5
target = LlamaServerBackend(url="http://localhost:8080/v1/completions")   # zeta-2

sd = SpeculativeDecoder(draft, target)
prompt = "def fib(n):\n"                      # any prefix to complete
res = sd.complete(prompt, max_tokens=96)
print(res.text, "accepted-from-draft =", res.accepted_from_draft)
```

Persistent style memory:

```python
from autocomplete.experiments import StyleMemory

mem = StyleMemory(path="~/.autocomplete/style.jsonl")
mem.add(query="flask endpoint", accepted="@app.route('/ping')\ndef ping(): return 'pong'")

# later:
examples = mem.as_examples("another small flask endpoint", k=3)
```

`core/config.py` ships with tuned defaults for a handful of local models. Select one via env var or CLI flag:
| profile | backend | recommended use |
|---|---|---|
| zeta-2 | llama-server | code-tuned, best quality on this repo's demos |
| lfm2.5 | llama-server | 1.2B — the sub-100ms "flow state" tier |
| nemotron-nano | llama-server | 4B — quality / latency balance |
| qwen3-vl-8b | lm-studio (chat) | vision-capable; diagram + image intent |
| gemma3-e4b | lm-studio (chat) | Gemma 3 vision; LM Studio chat endpoint |
Override the URL/model with env vars:
```bash
export AUTOCOMPLETE_BACKEND=lm-studio
export LMSTUDIO_URL=http://localhost:1234/v1
export LMSTUDIO_MODEL=qwen3-vl-8b
```

Run the test suite:

```bash
python -m pytest tests -q
# 57 passed
```

All tests run offline — the MockBackend and deterministic plugin paths (templates, macros) make the suite deterministic and fast (<2s).
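A minimal offline test might look like the sketch below; it assumes `MockBackend` is importable from `autocomplete.core.backends` and takes no constructor arguments (both are assumptions about the exact API):

```python
from autocomplete import CompletionEngine, Router, registry
from autocomplete.core.backends import MockBackend     # assumed import path / zero-arg constructor
from autocomplete.core.context import CompletionContext

def test_routes_shell_prefix_offline():
    engine = CompletionEngine(MockBackend())           # no server, deterministic output
    ctx = CompletionContext(prefix="git pu", language="shell")
    plugin = Router(registry).select(ctx)              # the router should score `cli` highest
    assert engine.complete(plugin, ctx).text           # the mock backend always returns text
```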
Run the demos:

```bash
python demos/demo_code.py      # code completion (mock or real backend)
python demos/demo_cli.py       # CLI expansion (deterministic)
python demos/demo_sql.py       # schema-aware SQL
python demos/demo_intent.py    # '# train cnn on cifar' -> pipeline
python demos/demo_macro.py     # /rag setup, /fastapi, /pytest, ...
python demos/demo_stream.py    # live streaming
python demos/demo_agentic.py   # suggest -> execute -> verify loop
```

Every demo works with AUTOCOMPLETE_BACKEND=mock (no server needed); swap to a real backend with AUTOCOMPLETE_BACKEND=llama-server once llama-server is up on port 8080.
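The same switch works from library code; a minimal sketch, assuming `load_backend` accepts the same names used by `AUTOCOMPLETE_BACKEND`:

```python
import os

from autocomplete import CompletionEngine, load_backend

# Fall back to the offline mock backend unless a real one is configured.
backend_name = os.environ.get("AUTOCOMPLETE_BACKEND", "mock")
engine = CompletionEngine(load_backend(backend_name))
```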
- Zero extra deps. The only runtime dependency is `requests`. Embeddings use a pure-stdlib hashing vectorizer — not as strong as SBERT, but plenty for retrieving from your own code.
- Plugins are pure transforms. `build_prompt(ctx)` and `postprocess(raw, ctx)` — no side effects. This makes every plugin testable without a running server; see the plugin sketch after this list.
- Caches are process-local. A `CompletionCache` combines an exact-match LRU with a prefix cache that gives near-free completions while the user is streaming characters — the dominant case during inline autocomplete; the toy example after this list shows the idea.
- Router is explicit. Every routing decision is scorable and explainable via `autocomplete route --prefix ...`. No black-box heuristics.
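To make the plugin contract concrete, here is a minimal sketch of a custom plugin. It assumes the `Plugin` base class in `plugins/base.py` exposes exactly the two hooks named above and that a `name` attribute identifies the plugin; both are assumptions about the exact API, so check `plugins/base.py` and `plugins/registry.py` before copying this.

```python
from autocomplete.plugins.base import Plugin             # base interface per plugins/base.py
from autocomplete.core.context import CompletionContext

class TodoPlugin(Plugin):
    """Expands a rough note into a short TODO list. Illustrative sketch only."""

    name = "todo"    # assumed identifier attribute; the real registration API may differ

    def build_prompt(self, ctx: CompletionContext) -> str:
        # Pure transform: context in, prompt string out, no side effects.
        return f"Turn this note into a concrete TODO list:\n{ctx.prefix}"

    def postprocess(self, raw: str, ctx: CompletionContext) -> str:
        # Keep suggestions short enough to accept inline.
        return "\n".join(raw.strip().splitlines()[:5])
```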
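The prefix cache deserves a small illustration. The snippet below is a toy stand-in, not the repo's `CompletionCache`: once a completion is cached for a prefix, keystrokes that continue into the suggested text can be answered from the cache with no model call.

```python
class TinyPrefixCache:
    """Toy illustration of the prefix-cache idea, not the real CompletionCache."""

    def __init__(self):
        self._store = {}                       # prefix -> completion previously suggested

    def put(self, prefix: str, completion: str) -> None:
        self._store[prefix] = completion

    def get(self, prefix: str):
        # If the user has typed further *into* a cached suggestion,
        # serve the remaining text without calling the model.
        for old_prefix, completion in self._store.items():
            full = old_prefix + completion
            if full.startswith(prefix) and len(prefix) > len(old_prefix):
                return full[len(prefix):]
        return None

cache = TinyPrefixCache()
cache.put("def fib(", "n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)")
print(cache.get("def fib(n):\n    ret"))       # cache hit: remainder of the suggestion
```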
MIT.