autocomplete

Low-latency local cognitive co-processor for developer workflows. A full-stack autocomplete platform that runs on your own machine via llama-server or LM Studio, with a plugin system for code, CLI, SQL, JSON, configs, macros, intent expansion, writing, and an agentic execute-and-verify loop.

Autocomplete is usually framed as a convenience. The right framing is compression of human intent: the better you compress what the user is trying to do, the closer you get to real-time thinking assistance. This repo pushes toward that framing instead of stopping at "next-token prediction."


What's inside

autocomplete/
├── autocomplete/                 # installable package
│   ├── core/
│   │   ├── backends.py           # llama-server, LM Studio, mock
│   │   ├── engine.py             # CompletionEngine (cache + metrics)
│   │   ├── cache.py              # LRU + prefix cache for streaming
│   │   ├── context.py            # CompletionContext (cursor, file, repo)
│   │   ├── streaming.py          # stream helpers + stop-word handling
│   │   ├── config.py             # built-in model profiles
│   │   └── metrics.py            # latency, acceptances, keystrokes saved
│   │
│   ├── plugins/
│   │   ├── base.py               # Plugin interface
│   │   ├── registry.py           # Registry + Router (match scoring)
│   │   ├── code.py               # generic code (FIM)
│   │   ├── cli.py                # shell command expansion
│   │   ├── sql.py                # schema-aware SQL composer
│   │   ├── json_schema.py        # JSON payloads + tiny validator
│   │   ├── config_gen.py         # YAML / Dockerfile / Terraform / k8s
│   │   ├── intent.py             # '# train cnn on cifar' -> full pipeline
│   │   ├── macros.py             # /rag setup, /fastapi, /pytest, ...
│   │   └── writing.py            # tone-aware email / note / journal
│   │
│   ├── context/
│   │   ├── embeddings.py         # zero-dep hashing embedder
│   │   └── repo_index.py         # repo-level retrieval
│   │
│   ├── agentic/
│   │   ├── executor.py           # suggest -> run -> verify -> self-heal
│   │   └── patcher.py            # apply completion at cursor
│   │
│   ├── experiments/
│   │   ├── speculative.py        # draft-and-target speculative decoding
│   │   └── memory.py             # persistent style memory (JSONL)
│   │
│   ├── benchmarks/
│   │   ├── latency.py            # p50/p95/p99 against any backend
│   │   ├── keystrokes.py         # keystrokes-saved simulator
│   │   └── accuracy.py           # edit-distance + exact match vs truth
│   │
│   └── cli.py                    # unified `autocomplete` CLI
│
├── demos/                        # runnable end-to-end demos (mock + real)
├── tests/                        # 57 passing unit tests
├── Using_llama_server/           # original quickstart (kept)
├── Using_LM_Studio/              # LM Studio quickstart
└── pyproject.toml

Quickstart

1. Install

pip install -r requirements.txt         # just `requests`
pip install -e .                        # exposes `autocomplete` CLI

2. Run a local model

Option A — llama-server (recommended; used throughout demos):

# pick any GGUF coding model, e.g. zeta-2 / nemotron-nano / lfm2.5
llama-server -m zed-industries_zeta-2-Q4_K_S.gguf

Option B — LM Studio (GUI): load your model, start the server, and leave http://localhost:1234/v1 as the endpoint. Vision models such as Qwen3-VL-8B or Gemma 3 E4B (the qwen3-vl-8b / gemma3-e4b profiles below) must use this backend.

Option C — offline: the mock backend needs no server and returns deterministic text. Great for CI and tests:

export AUTOCOMPLETE_BACKEND=mock

3. Use it

# auto-routes to the best plugin
echo 'def add(a, b):' | autocomplete complete

# force a specific plugin
autocomplete complete --plugin sql --prefix 'SELECT email, SUM'

# see which plugin would handle an input
autocomplete route --prefix 'git pu'
# 0.85  cli
# 0.40  writing
# 0.05  code
# Selected: cli

# stream tokens live
autocomplete stream --prefix 'def fib(n):\n'

# interactive REPL with warm cache + metrics
autocomplete repl

# benchmark latency on your model
autocomplete benchmark --prefix 'def f():' --n 30

The 13 feature ideas, mapped to the code

Idea from the brief                                     Where it lives
1.  Command autocomplete / intent blocks                plugins/cli.py, plugins/intent.py
2.  Structured autocomplete (JSON / SQL / YAML / TF)    plugins/json_schema.py, plugins/sql.py, plugins/config_gen.py
3.  Latency-sensitive flow-state                        core/streaming.py, core/cache.py (prefix cache)
4.  Multi-modal (vision models)                         core/config.py (qwen3-vl-8b, gemma3-e4b profiles) via the LM Studio chat mode in core/backends.py
5.  Context-aware completion                            context/repo_index.py, core/context.py
6.  Decision layer                                      any plugin accepts an instruction in context; see plugins/intent.py
7.  Prompt-to-system macro pipelines                    plugins/macros.py
8.  Agentic autocomplete                                agentic/executor.py
9.  Personal cognition assistant                        plugins/writing.py
10. Benchmarking module                                 benchmarks/
11. Plug-in architecture                                plugins/base.py, plugins/registry.py
12. Wild demos                                          demos/ (code, CLI, SQL, intent, macro, stream, agentic)
13a. Speculative decoding                               experiments/speculative.py
13b. Memory-augmented                                   experiments/memory.py
13c. RL / reward signal                                 core/metrics.py (accept / reject / keystrokes_saved hooks)
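
On 13c: the reward signal is nothing exotic, just the user's accept / reject decisions plus keystrokes saved. A minimal sketch of what such hooks can look like (illustrative only; core/metrics.py is the authoritative version and its names may differ):

from dataclasses import dataclass, field

@dataclass
class AcceptanceMetrics:
    # Illustrative hook shape, not the repo's exact core/metrics.py API.
    accepted: int = 0
    rejected: int = 0
    keystrokes_saved: int = 0
    latencies_ms: list = field(default_factory=list)

    def record(self, suggestion: str, accepted: bool, latency_ms: float) -> None:
        self.latencies_ms.append(latency_ms)
        if accepted:
            self.accepted += 1
            self.keystrokes_saved += len(suggestion)  # chars the user didn't type
        else:
            self.rejected += 1

    @property
    def acceptance_rate(self) -> float:
        total = self.accepted + self.rejected
        return self.accepted / total if total else 0.0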

Programmatic API

from autocomplete import CompletionEngine, load_backend, Router, registry
from autocomplete.core.context import CompletionContext

engine = CompletionEngine(load_backend("llama-server"))

ctx = CompletionContext(
    prefix="SELECT email, SUM(total_cents)\n",
    language="sql",
    extras={
        "dialect": "postgres",
        "schema": {"users": ["id","email"], "orders": ["id","user_id","total_cents"]},
    },
)

plugin = Router(registry).select(ctx)     # auto-picks `sql`
print(engine.complete(plugin, ctx).text)  # runnable, schema-respecting SQL

Streaming with a live cursor

for tok in engine.stream(plugin, ctx):
    print(tok, end="", flush=True)

Feeding repo context in

from autocomplete.context import RepoIndex
idx = RepoIndex().build(".")                          # one-time, in-memory
ctx.repo_snippets = idx.search_snippets("user auth", k=4)

Agentic suggest -> run -> verify

from autocomplete.agentic import AgenticExecutor
agent = AgenticExecutor(engine, registry.get("code"), max_steps=3)
completion, exec_result, trace = agent.run(ctx)
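
The loop is simple to state: generate a completion, run it, check the result, and feed any failure back into the context for the next attempt. A rough sketch of that control flow (function and field names here are illustrative, not the repo's exact internals):

import subprocess, sys, tempfile

def suggest_run_verify(engine, plugin, ctx, max_steps=3):
    # Illustrative control flow; agentic/executor.py is the real implementation.
    trace = []
    for _ in range(max_steps):
        completion = engine.complete(plugin, ctx)            # 1. suggest
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(ctx.prefix + completion.text)
        result = subprocess.run([sys.executable, f.name],    # 2. run
                                capture_output=True, text=True, timeout=10)
        trace.append((completion.text, result.returncode, result.stderr))
        if result.returncode == 0:                           # 3. verify
            return completion, result, trace
        ctx.extras["last_error"] = result.stderr             # 4. self-heal
    return completion, result, trace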

Speculative decoding (small-draft / big-target)

from autocomplete.core.backends import LlamaServerBackend
from autocomplete.experiments import SpeculativeDecoder

draft  = LlamaServerBackend(url="http://localhost:8081/v1/completions")  # LFM2.5
target = LlamaServerBackend(url="http://localhost:8080/v1/completions")  # zeta-2
sd = SpeculativeDecoder(draft, target)
prompt = "def fib(n):\n"
res = sd.complete(prompt, max_tokens=96)
print(res.text, "accepted-from-draft =", res.accepted_from_draft)
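
For background, draft-and-target decoding wins because the big model spends its time confirming cheap proposals instead of generating every token itself. A deliberately simplified, character-level sketch of the idea (the repo's SpeculativeDecoder and any real token-level implementation differ in detail):

def longest_agreed_prefix(proposed: str, verified: str) -> int:
    # How much of the draft's proposal does the target agree with?
    n = 0
    for a, b in zip(proposed, verified):
        if a != b:
            break
        n += 1
    return n

def speculative_complete(draft, target, prompt: str, max_tokens: int = 96, k: int = 8) -> str:
    out = ""
    while len(out) < max_tokens:
        proposed = draft.complete(prompt + out, max_tokens=k).text
        verified = target.complete(prompt + out, max_tokens=k).text
        # Keep the agreed prefix, always advancing by at least one character.
        # (A real implementation works on tokens and verifies the whole
        # draft in a single target forward pass.)
        out += verified[: max(longest_agreed_prefix(proposed, verified), 1)]
        if not verified:
            break
    return out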

Style memory

from autocomplete.experiments import StyleMemory
mem = StyleMemory(path="~/.autocomplete/style.jsonl")
mem.add(query="flask endpoint", accepted="@app.route('/ping')\ndef ping(): return 'pong'")
# later:
examples = mem.as_examples("another small flask endpoint", k=3)

Built-in model profiles

core/config.py ships with tuned defaults for the models used throughout this repo's demos. Select one via an env var or CLI flag:

profile          backend             rec. use
zeta-2           llama-server        code-tuned; best quality on this repo's demos
lfm2.5           llama-server        1.2B; the sub-100ms "flow state" tier
nemotron-nano    llama-server        4B; quality / latency balance
qwen3-vl-8b      lm-studio (chat)    vision-capable; diagram + image intent
gemma3-e4b       lm-studio (chat)    Gemma 3 vision; LM Studio chat endpoint

Override the URL/model with env vars:

export AUTOCOMPLETE_BACKEND=lm-studio
export LMSTUDIO_URL=http://localhost:1234/v1
export LMSTUDIO_MODEL=qwen3-vl-8b

Tests

python -m pytest tests -q
# 57 passed

All tests run offline: the MockBackend and the template-driven plugin paths (templates, macros) keep the suite deterministic and fast (<2s).
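
A new test keeps those properties as long as it goes through the mock backend. A minimal sketch using only the programmatic API shown above (the "mock" backend name mirrors AUTOCOMPLETE_BACKEND=mock and is an assumption here):

from autocomplete import CompletionEngine, load_backend, Router, registry
from autocomplete.core.context import CompletionContext

def test_sql_routing_offline():
    # Mock backend: no server, deterministic output, safe for CI.
    engine = CompletionEngine(load_backend("mock"))
    ctx = CompletionContext(prefix="SELECT email, SUM", language="sql")
    plugin = Router(registry).select(ctx)   # should pick the sql plugin
    result = engine.complete(plugin, ctx)
    assert result.text                      # mock output is non-empty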


Demos

python demos/demo_code.py        # code completion (mock or real backend)
python demos/demo_cli.py         # CLI expansion (deterministic)
python demos/demo_sql.py         # schema-aware SQL
python demos/demo_intent.py      # '# train cnn on cifar' -> pipeline
python demos/demo_macro.py       # /rag setup, /fastapi, /pytest, ...
python demos/demo_stream.py      # live streaming
python demos/demo_agentic.py     # suggest -> execute -> verify loop

Every demo works with AUTOCOMPLETE_BACKEND=mock (no server needed); swap to a real backend with AUTOCOMPLETE_BACKEND=llama-server once llama-server is up on port 8080.


Design notes

  • Zero extra deps. The only runtime dependency is requests. Embeddings use a pure-stdlib hashing vectorizer: not as strong as SBERT, but plenty for retrieving from your own code (sketched after this list).
  • Plugins are pure transforms. build_prompt(ctx) and postprocess(raw, ctx), with no side effects. This makes every plugin testable without a running server (example after this list).
  • Caches are process-local. A CompletionCache combines an exact-match LRU with a prefix cache that gives near-free completions while the user is streaming characters, the dominant case during inline autocomplete (sketched after this list).
  • Router is explicit. Every routing decision is scorable and explainable via autocomplete route --prefix .... No black-box heuristics.
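
The hashing vectorizer is the standard hashing trick, roughly this (a minimal sketch; context/embeddings.py is the real implementation and may differ):

import hashlib, math, re

def embed(text: str, dim: int = 256) -> list[float]:
    # Hashing trick: bucket each token by hash, count, then L2-normalize
    # so dot products behave like cosine similarity.
    vec = [0.0] * dim
    for token in re.findall(r"\w+", text.lower()):
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]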
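
What a pure-transform plugin looks like in practice: a hypothetical example plugin. plugins/base.py defines the real interface; the match score here assumes the Router ranks plugins on a [0, 1] scale, as the route output above suggests.

from autocomplete.plugins.base import Plugin
from autocomplete.core.context import CompletionContext

class ShoutPlugin(Plugin):
    # Hypothetical plugin; method names follow the interface described above.
    name = "shout"

    def match(self, ctx: CompletionContext) -> float:
        return 0.9 if ctx.prefix.endswith("!") else 0.0    # routing score

    def build_prompt(self, ctx: CompletionContext) -> str:
        return f"Continue, loudly:\n{ctx.prefix}"          # context in, prompt out

    def postprocess(self, raw: str, ctx: CompletionContext) -> str:
        return raw.strip().upper()                         # model text in, completion out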
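
And the prefix-cache idea in miniature: if a cached prefix plus its completion still starts with the user's new, longer prefix, the tail of that completion answers the new request for free. A minimal sketch (illustrative; the repo's CompletionCache pairs this with the exact-match LRU):

class PrefixCache:
    # Illustrative sketch, not the repo's exact core/cache.py.
    def __init__(self):
        self._entries: dict[str, str] = {}   # prefix -> completion

    def put(self, prefix: str, completion: str) -> None:
        self._entries[prefix] = completion

    def get(self, prefix: str) -> str | None:
        if prefix in self._entries:
            return self._entries[prefix]     # exact hit
        for old, completion in self._entries.items():
            full = old + completion
            if full.startswith(prefix) and len(full) > len(prefix):
                return full[len(prefix):]    # user typed into the cached text
        return None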

License

MIT.
