Low-latency local cognitive co-processor for developer workflows. A full-stack autocomplete platform that runs on your own machine via
llama-server or LM Studio, with a plugin system for code, CLI, SQL, JSON, configs, macros, intent expansion, writing, and an agentic execute-and-verify loop.
Autocomplete is usually framed as a convenience. The right framing is compression of human intent: the better you compress what the user is trying to do, the closer you get to real-time thinking assistance. This repo pushes toward that framing instead of stopping at "next-token prediction."
autocomplete/
├── autocomplete/ # installable package
│ ├── core/
│ │ ├── backends.py # llama-server, LM Studio, mock
│ │ ├── engine.py # CompletionEngine (cache + metrics)
│ │ ├── cache.py # LRU + prefix cache for streaming
│ │ ├── context.py # CompletionContext (cursor, file, repo)
│ │ ├── streaming.py # stream helpers + stop-word handling
│ │ ├── config.py # built-in model profiles
│ │ └── metrics.py # latency, acceptances, keystrokes saved
│ │
│ ├── plugins/
│ │ ├── base.py # Plugin interface
│ │ ├── registry.py # Registry + Router (match scoring)
│ │ ├── code.py # generic code (FIM)
│ │ ├── cli.py # shell command expansion
│ │ ├── sql.py # schema-aware SQL composer
│ │ ├── json_schema.py # JSON payloads + tiny validator
│ │ ├── config_gen.py # YAML / Dockerfile / Terraform / k8s
│ │ ├── intent.py # '# train cnn on cifar' -> full pipeline
│ │ ├── macros.py # /rag setup, /fastapi, /pytest, ...
│ │ └── writing.py # tone-aware email / note / journal
│ │
│ ├── context/
│ │ ├── embeddings.py # zero-dep hashing embedder
│ │ └── repo_index.py # repo-level retrieval
│ │
│ ├── agentic/
│ │ ├── executor.py # suggest -> run -> verify -> self-heal
│ │ └── patcher.py # apply completion at cursor
│ │
│ ├── experiments/
│ │ ├── speculative.py # draft-and-target speculative decoding
│ │ └── memory.py # persistent style memory (JSONL)
│ │
│ ├── benchmarks/
│ │ ├── latency.py # p50/p95/p99 against any backend
│ │ ├── keystrokes.py # keystrokes-saved simulator
│ │ └── accuracy.py # edit-distance + exact match vs truth
│ │
│ └── cli.py # unified `autocomplete` CLI
│
├── demos/ # runnable end-to-end demos (mock + real)
├── tests/ # 57 passing unit tests
├── Using_llama_server/ # original quickstart (kept)
├── Using_LM_Studio/ # LM Studio quickstart
└── pyproject.toml
```bash
pip install -r requirements.txt   # just `requests`
pip install -e .                  # exposes the `autocomplete` CLI
```

Option A — llama-server (recommended; used throughout the demos):

```bash
# pick any GGUF coding model, e.g. zeta-2 / nemotron-nano / lfm2.5
llama-server -m zed-industries_zeta-2-Q4_K_S.gguf
```

Option B — LM Studio (GUI): load your model, start the server, and leave http://localhost:1234/v1 as the endpoint. For vision models like Qwen3-VL-8B or Gemma 3 E4B, use this option.

Option C — offline: the mock backend needs no server and returns deterministic text. Great for CI and tests:
```bash
export AUTOCOMPLETE_BACKEND=mock
```

```bash
# auto-routes to the best plugin
echo 'def add(a, b):' | autocomplete complete

# force a specific plugin
autocomplete complete --plugin sql --prefix 'SELECT email, SUM'

# see which plugin would handle an input
autocomplete route --prefix 'git pu'
# 0.85 cli
# 0.40 writing
# 0.05 code
# Selected: cli

# stream tokens live
autocomplete stream --prefix 'def fib(n):\n'

# interactive REPL with warm cache + metrics
autocomplete repl

# benchmark latency on your model
autocomplete benchmark --prefix 'def f():' --n 30
```

| Idea from the brief | Where it lives |
|---|---|
| 1. Command autocomplete / intent blocks | plugins/cli.py, plugins/intent.py |
| 2. Structured autocomplete (JSON / SQL / YAML / TF) | plugins/json_schema.py, plugins/sql.py, plugins/config_gen.py |
| 3. Latency-sensitive flow-state | core/streaming.py, core/cache.py (prefix cache) |
| 4. Multi-modal (vision models) | core/config.py (qwen3-vl-8b, gemma3-e4b profiles) via core/backends.py LM Studio chat mode |
| 5. Context-aware completion | context/repo_index.py, core/context.py |
| 6. Decision layer | any plugin accepts an instruction in context; see plugins/intent.py |
| 7. Prompt-to-system macro pipelines | plugins/macros.py |
| 8. Agentic autocomplete | agentic/executor.py |
| 9. Personal cognition assistant | plugins/writing.py |
| 10. Benchmarking module | benchmarks/ |
| 11. Plug-in architecture | plugins/base.py, plugins/registry.py |
| 12. Wild demos | demos/ (code, CLI, SQL, intent, macro, stream, agentic) |
| 13a. Speculative decoding | experiments/speculative.py |
| 13b. Memory-augmented | experiments/memory.py |
| 13c. RL / reward signal | core/metrics.py (accept / reject / keystrokes_saved hooks) |
The same pipeline is available as a library:

```python
from autocomplete import CompletionEngine, load_backend, Router, registry
from autocomplete.core.context import CompletionContext

engine = CompletionEngine(load_backend("llama-server"))

ctx = CompletionContext(
    prefix="SELECT email, SUM(total_cents)\n",
    language="sql",
    extras={
        "dialect": "postgres",
        "schema": {"users": ["id", "email"], "orders": ["id", "user_id", "total_cents"]},
    },
)

plugin = Router(registry).select(ctx)         # auto-picks `sql`
print(engine.complete(plugin, ctx).text)      # runnable, schema-respecting SQL
```

Stream tokens as they arrive:

```python
for tok in engine.stream(plugin, ctx):
    print(tok, end="", flush=True)
```

Repo-level retrieval:

```python
from autocomplete.context import RepoIndex

idx = RepoIndex().build(".")                  # one-time, in-memory
ctx.repo_snippets = idx.search_snippets("user auth", k=4)
```

Agentic execute-and-verify:

```python
from autocomplete.agentic import AgenticExecutor

agent = AgenticExecutor(engine, registry.get("code"), max_steps=3)
completion, exec_result, trace = agent.run(ctx)
```

Speculative decoding (draft + target):

```python
from autocomplete.core.backends import LlamaServerBackend
from autocomplete.experiments import SpeculativeDecoder

draft = LlamaServerBackend(url="http://localhost:8081/v1/completions")    # LFM2.5
target = LlamaServerBackend(url="http://localhost:8080/v1/completions")   # zeta-2

sd = SpeculativeDecoder(draft, target)
prompt = "def fib(n):\n"                      # any prefix to complete
res = sd.complete(prompt, max_tokens=96)
print(res.text, "accepted-from-draft =", res.accepted_from_draft)
```

Persistent style memory:

```python
from autocomplete.experiments import StyleMemory

mem = StyleMemory(path="~/.autocomplete/style.jsonl")
mem.add(query="flask endpoint", accepted="@app.route('/ping')\ndef ping(): return 'pong'")

# later:
examples = mem.as_examples("another small flask endpoint", k=3)
```

`core/config.py` ships with tuned defaults for a handful of local models. Select one via env var or CLI flag:
| profile | backend | recommended use |
|---|---|---|
| zeta-2 | llama-server | code-tuned, best quality on this repo's demos |
| lfm2.5 | llama-server | 1.2B — the sub-100ms "flow state" tier |
| nemotron-nano | llama-server | 4B — quality / latency balance |
| qwen3-vl-8b | lm-studio (chat) | vision-capable; diagram + image intent |
| gemma3-e4b | lm-studio (chat) | Gemma 3 vision; LM Studio chat endpoint |
Override the URL/model with env vars:
```bash
export AUTOCOMPLETE_BACKEND=lm-studio
export LMSTUDIO_URL=http://localhost:1234/v1
export LMSTUDIO_MODEL=qwen3-vl-8b
```

Run the test suite:

```bash
python -m pytest tests -q
# 57 passed
```

All tests run offline — the MockBackend and deterministic plugin paths (templates, macros) make the suite deterministic and fast (<2s).
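A minimal offline test might look like the sketch below; it assumes `MockBackend` is importable from `autocomplete.core.backends` and takes no constructor arguments (both are assumptions about the exact API):

```python
from autocomplete import CompletionEngine, Router, registry
from autocomplete.core.backends import MockBackend     # assumed import path / zero-arg constructor
from autocomplete.core.context import CompletionContext

def test_routes_shell_prefix_offline():
    engine = CompletionEngine(MockBackend())           # no server, deterministic output
    ctx = CompletionContext(prefix="git pu", language="shell")
    plugin = Router(registry).select(ctx)              # the router should score `cli` highest
    assert engine.complete(plugin, ctx).text           # the mock backend always returns text
```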
Run the demos:

```bash
python demos/demo_code.py      # code completion (mock or real backend)
python demos/demo_cli.py       # CLI expansion (deterministic)
python demos/demo_sql.py       # schema-aware SQL
python demos/demo_intent.py    # '# train cnn on cifar' -> pipeline
python demos/demo_macro.py     # /rag setup, /fastapi, /pytest, ...
python demos/demo_stream.py    # live streaming
python demos/demo_agentic.py   # suggest -> execute -> verify loop
```

Every demo works with AUTOCOMPLETE_BACKEND=mock (no server needed); swap to a real backend with AUTOCOMPLETE_BACKEND=llama-server once llama-server is up on port 8080.
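The same switch works from library code; a minimal sketch, assuming `load_backend` accepts the same names used by `AUTOCOMPLETE_BACKEND`:

```python
import os

from autocomplete import CompletionEngine, load_backend

# Fall back to the offline mock backend unless a real one is configured.
backend_name = os.environ.get("AUTOCOMPLETE_BACKEND", "mock")
engine = CompletionEngine(load_backend(backend_name))
```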
- Zero extra deps. The only runtime dependency is `requests`. Embeddings use a pure-stdlib hashing vectorizer — not as strong as SBERT, but plenty for retrieving from your own code.
- Plugins are pure transforms. `build_prompt(ctx)` and `postprocess(raw, ctx)` — no side effects. This makes every plugin testable without a running server; see the plugin sketch after this list.
- Caches are process-local. A `CompletionCache` combines an exact-match LRU with a prefix cache that gives near-free completions while the user is streaming characters — the dominant case during inline autocomplete; the toy example after this list shows the idea.
- Router is explicit. Every routing decision is scorable and explainable via `autocomplete route --prefix ...`. No black-box heuristics.
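To make the plugin contract concrete, here is a minimal sketch of a custom plugin. It assumes the `Plugin` base class in `plugins/base.py` exposes exactly the two hooks named above and that a `name` attribute identifies the plugin; both are assumptions about the exact API, so check `plugins/base.py` and `plugins/registry.py` before copying this.

```python
from autocomplete.plugins.base import Plugin             # base interface per plugins/base.py
from autocomplete.core.context import CompletionContext

class TodoPlugin(Plugin):
    """Expands a rough note into a short TODO list. Illustrative sketch only."""

    name = "todo"    # assumed identifier attribute; the real registration API may differ

    def build_prompt(self, ctx: CompletionContext) -> str:
        # Pure transform: context in, prompt string out, no side effects.
        return f"Turn this note into a concrete TODO list:\n{ctx.prefix}"

    def postprocess(self, raw: str, ctx: CompletionContext) -> str:
        # Keep suggestions short enough to accept inline.
        return "\n".join(raw.strip().splitlines()[:5])
```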
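The prefix cache deserves a small illustration. The snippet below is a toy stand-in, not the repo's `CompletionCache`: once a completion is cached for a prefix, keystrokes that continue into the suggested text can be answered from the cache with no model call.

```python
class TinyPrefixCache:
    """Toy illustration of the prefix-cache idea, not the real CompletionCache."""

    def __init__(self):
        self._store = {}                       # prefix -> completion previously suggested

    def put(self, prefix: str, completion: str) -> None:
        self._store[prefix] = completion

    def get(self, prefix: str):
        # If the user has typed further *into* a cached suggestion,
        # serve the remaining text without calling the model.
        for old_prefix, completion in self._store.items():
            full = old_prefix + completion
            if full.startswith(prefix) and len(prefix) > len(old_prefix):
                return full[len(prefix):]
        return None

cache = TinyPrefixCache()
cache.put("def fib(", "n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)")
print(cache.get("def fib(n):\n    ret"))       # cache hit: remainder of the suggestion
```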
MIT.