A reverse catalogue, availability tracker, and recommender for the Andalusian public-library network — built to help you find the book you didn't know you were looking for.
A personal side project that mirrors public catalogue data. Not affiliated with the Junta de Andalucía or any of its libraries.
The official public-library OPAC (AbsysNET) is great if you already know the title you want. biblioHack flips it around: it keeps a local, searchable mirror of the catalogue so you can explore it the way you'd browse a bookshop — search in natural Spanish (accents optional), filter to the literary catalogue, see which branch has a copy on the shelf right now, import your Goodreads shelf, and get recommendations based on what you've already read.
It's also a study in doing this politely and sustainably: a rate-limited crawler that's a good citizen of a public system, a history-preserving availability time-series, and a clean hexagonal codebase.
- 🔎 Full-text search in Spanish — accent-insensitive ("café" finds "cafe") via Postgres spanish_unaccent + a generated tsvector.
- 📖 Literary-first catalogue — records are classified by audience (adult / youth / children) and literary form (literary / non-fiction) from shelf-mark and CDU signals. The default scope shows the adult literary catalogue; a toggle includes children's, youth and non-fiction. Nothing is discarded, just scoped.
- 🟢 Live availability by branch — a history-preserving time-series of copies per branch, surfaced as "N disponibles ahora" badges.
- 🖼️ Book covers — resolved asynchronously (Open Library → Google Books → placeholder) and served from a content-addressed store, off the crawl path.
- 🤖 Autonomous, resumable crawler — a containerised, scheduled crawler walks the catalogue with a persisted cursor, growing the mirror hour by hour without ever re-scanning from the top.
- 🧠 Semantic search — BGE-M3 embeddings in pgvector: ?mode=semantic queries and "more like this" on every record.
- 👤 User accounts — public registration with email verification, Turnstile bot protection, Redis sessions, rate limiting, and GDPR self-service (data export + account deletion).
- 📥 Reading-history import — upload a Goodreads CSV; a background worker matches it against the catalogue (ISBN-13 first, fuzzy title+author fallback) and your shelf re-matches for free as the mirror grows.
- ✨ Per-user recommendations — a taste profile from your rated shelf drives pgvector retrieval over the catalogue, with optional LLM-written rationales.
- 📊 Production APM — OpenTelemetry tracing (FastAPI + asyncpg + Redis) exported to Grafana Tempo / SigNoz.
- 🔀 Hybrid retrieval — fusing keyword and semantic rankings for better search.
- 🗺️ Expansion beyond Huelva to other Andalusian provinces.
- 📱 Mobile app reusing the same API.
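The reading-history import listed above matches on ISBN-13 first and falls back to a fuzzy title+author comparison. A minimal sketch of that order of operations, using only the standard library, might look like this. The `Book` record, the `match_book` helper and the 0.85 threshold are hypothetical and not taken from the codebase:

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass(frozen=True)
class Book:
    """Hypothetical minimal record; the real catalogue model is richer."""
    title: str
    author: str
    isbn13: Optional[str] = None


def _key(book: Book) -> str:
    # Case-folded "title author" string used for the fuzzy comparison.
    return f"{book.title} {book.author}".casefold()


def match_book(entry: Book, catalogue: list[Book], threshold: float = 0.85) -> Optional[Book]:
    """Return the catalogue record matching a shelf entry, or None."""
    # 1. An exact ISBN-13 match wins outright.
    if entry.isbn13:
        for candidate in catalogue:
            if candidate.isbn13 == entry.isbn13:
                return candidate
    # 2. Otherwise keep the most similar title+author at or above the threshold.
    best, best_score = None, threshold
    for candidate in catalogue:
        score = SequenceMatcher(None, _key(entry), _key(candidate)).ratio()
        if score >= best_score:
            best, best_score = candidate, score
    return best
```

Because unmatched entries simply return `None`, the same function can be re-run later against a larger catalogue, which is one way the "re-matches as the mirror grows" behaviour could be achieved.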
biblioHack is a hexagonal (ports & adapters) modular monolith. The domain logic never imports a framework or a driver; adapters (the AbsysNET scraper, Postgres repositories, the cover providers) plug in behind Protocol ports, which keeps the core testable and the deployment topology a free choice.
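As an illustration of the ports & adapters idea, here is a minimal sketch in which the domain function depends only on a Protocol and a test swaps in an in-memory adapter. `CoverProvider`, `InMemoryCovers` and `resolve_cover` are hypothetical names chosen for this example, not the project's actual ports:

```python
from typing import Optional, Protocol


class CoverProvider(Protocol):
    """Port: anything that can look up a cover URL by ISBN."""

    def find_cover(self, isbn13: str) -> Optional[str]: ...


class InMemoryCovers:
    """Adapter used in tests; a real adapter would call an external API."""

    def __init__(self, covers: dict[str, str]) -> None:
        self._covers = covers

    def find_cover(self, isbn13: str) -> Optional[str]:
        return self._covers.get(isbn13)


def resolve_cover(isbn13: str, providers: list[CoverProvider], placeholder: str) -> str:
    """Domain logic: try each provider in order, fall back to a placeholder."""
    for provider in providers:
        url = provider.find_cover(isbn13)
        if url is not None:
            return url
    return placeholder
```

`resolve_cover` imports no HTTP client or database driver, so it can be unit-tested with plain dictionaries while production wiring passes in real adapters.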
             Cloudflare Tunnel (read-only, TLS at the edge)
                                    │
             ┌──────────────────────┴───────────────────────┐
             ▼                                              ▼
    ┌─────────────────┐                            ┌──────────────────┐
    │ Astro frontend  │ ── /catalog, /healthz ──▶ │ FastAPI (api)    │
    │ (static + React │                            │ + OpenTelemetry  │
    │ islands)        │                            └────────┬─────────┘
    └─────────────────┘                                     │
                                                            ▼
                                    ┌──────────────────────────────────────┐
                                    │ Postgres (TimescaleDB + pgvector +   │
                                    │ spanish_unaccent FTS) · MinIO covers │
                                    └──────────────────────▲───────────────┘
                                                           │  LAN / private network
    ┌───────────────────────────────────────────┐          │
    │ Crawl plane (off the public API)          │ ─────────┘
    │ scheduled discover → worker → refresh,    │
    │ Scrapling/Camoufox→Chromium against the   │
    │ public OPAC, polite by design             │
    └───────────────────────────────────────────┘
- The read + serve plane is public (read-only) behind the tunnel; write/admin, the database and the crawler are never exposed to the internet.
- The crawler runs separately from the API so a heavy headless-browser workload can't affect request latency. It paginates the OPAC's expert-query results with a resumable offset cursor, so it advances through the whole catalogue across runs and then tracks new arrivals.
- Availability is a time-series: copies are upserted (never delete+insert) so each re-scrape appends a snapshot rather than wiping history.
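The resumable offset cursor described in the list above could look roughly like the sketch below. The JSON cursor file, the `fetch_page` callable and the page size are hypothetical stand-ins: the real crawler may persist its cursor elsewhere, and it fetches pages through a headless browser.

```python
import json
from pathlib import Path
from typing import Callable


def load_offset(cursor_file: Path) -> int:
    """Read the last persisted offset, or start from 0 on the first run."""
    if cursor_file.exists():
        return int(json.loads(cursor_file.read_text())["offset"])
    return 0


def crawl_run(
    cursor_file: Path,
    fetch_page: Callable[[int, int], list[str]],
    page_size: int = 20,
    max_pages: int = 5,
) -> list[str]:
    """Process up to max_pages pages, persisting the cursor after each one."""
    offset = load_offset(cursor_file)
    seen: list[str] = []
    for _ in range(max_pages):
        records = fetch_page(offset, page_size)
        if not records:
            break  # reached the end of the result set for now
        seen.extend(records)
        offset += len(records)
        # Persist after every page so an interrupted run resumes from here.
        cursor_file.write_text(json.dumps({"offset": offset}))
    return seen
```

Capping each run with `max_pages` keeps a single run short, while the persisted offset lets the next scheduled run continue where the previous one stopped instead of starting from the top.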
See docs/design/architecture.md for the full design, the scrape state machine, and the data model.
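To make the availability time-series idea concrete, here is a small in-memory sketch of how a "currently available" figure can be derived from append-only snapshots. The `Snapshot` shape and the `available_now` helper are hypothetical; the actual schema is the one documented in the architecture file.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical row: copies available at one branch at one moment."""
    record_id: str
    branch: str
    available: int
    observed_at: datetime


def available_now(history: list[Snapshot], record_id: str) -> dict[str, int]:
    """Latest available count per branch for one record; history is not mutated."""
    latest: dict[str, Snapshot] = {}
    for snap in history:
        if snap.record_id != record_id:
            continue
        current = latest.get(snap.branch)
        if current is None or snap.observed_at > current.observed_at:
            latest[snap.branch] = snap
    return {branch: snap.available for branch, snap in latest.items()}
```

Older snapshots stay in `history`, so the same data can answer both "what is on the shelf now" and "how has availability changed over time".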
| Layer | Choice |
|---|---|
| Backend | Python 3.12, FastAPI, SQLAlchemy 2.0 (async) + asyncpg, Typer CLI, uv |
| Database | PostgreSQL + TimescaleDB (hypertables), pgvector, spanish_unaccent FTS |
| Scraping | Scrapling (Camoufox → Patchright Chromium), polite token-bucket throttle |
| Frontend | Astro (static) + React islands, Zod, pnpm |
| Covers/store | Pillow (WebP), content-addressed filesystem / MinIO (S3) |
| Observability | OpenTelemetry → OTLP → Grafana Tempo / SigNoz |
| Infra | Docker Compose, Synology NAS, Cloudflare Tunnel, GitHub Actions CI/CD |
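The "polite token-bucket throttle" in the scraping row refers to a standard rate-limiting technique. A minimal single-threaded sketch follows; the `TokenBucket` class, its parameters and the injectable clock are hypothetical, and the project's real throttle may differ:

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._last = clock()

    def _refill(self) -> None:
        # Add tokens for the time elapsed since the last check, capped at capacity.
        now = self._clock()
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now

    def try_acquire(self) -> bool:
        """Take one token if available; return False if the caller must wait."""
        self._refill()
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False

    def wait_time(self) -> float:
        """Seconds until the next token becomes available (0 if one is ready)."""
        self._refill()
        return max(0.0, (1 - self._tokens) / self.rate)
```

A crawler loop would typically call `time.sleep(bucket.wait_time())` before each request and then `try_acquire()`, so the long-run request rate never exceeds `rate`.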
biblioHack/
├── README.md # this file (GitHub front door)
├── AGENTS.md # conventions for AI assistants / contributors (CLAUDE.md points here)
├── docs/ # canonical Markdown — design/ · ops/ · outreach/ + generated site/
├── docker-compose.yml # dev environment
├── docker-compose.prod.yml # production (read + serve plane)
├── docker-compose.crawler.yml # the autonomous crawl plane
├── backend/ # FastAPI hexagonal modular monolith (uv)
│ └── src/bibliohack/ # bounded contexts: catalog · holdings · availability ·
│ # covers · identity · reading_history · recommendations · shared
├── frontend/ # Astro + React islands (pnpm)
├── infra/ # Dockerfiles, crawler image + schedule, cloudflared config
└── .github/workflows/ # CI (lint, typecheck, test, build, deploy)
- Python 3.12+ and uv
- Node 20+ and pnpm
- Docker + Docker Compose
- make (optional, for the convenience targets)
git clone https://github.com/jarcos/biblioHack.git
cd biblioHack
cp .env.example .env
# 1. Bring up postgres + redis + api + frontend
make dev-up
# 2. Backend checks (ruff + mypy + pytest)
make backend-check
# 3. Frontend checks (eslint + astro check + vitest)
make frontend-check
# 4. Open the apps
open http://localhost:8800/docs # FastAPI Swagger
open http://localhost:4321 # Astro frontend
Every make target is a one-liner, so peek inside the Makefile if you'd rather run things directly. Scraping is opt-in (it's a heavy, browser-backed dependency set): run cd backend && uv sync --extra scraper && uv run camoufox fetch, then uv run bibliohack catalog --help.
Copy .env.example to .env and adjust as needed. Sensible defaults are provided for local development; the OpenTelemetry exporter and other production settings stay dormant unless their env vars are set.
| Milestone | Scope | State |
|---|---|---|
| M0 | Foundations (scaffold, compose, CI) | ✅ Done |
| M1 | Catalogue ingest + accent-insensitive search + literary scoping | ✅ Done |
| M2 | Availability history + autonomous resumable crawler | ✅ Done |
| M2.5 | Book covers (resolution + content-addressed store) | ✅ Done |
| M3 | Semantic search (BGE-M3 + pgvector) | ✅ Done |
| M4 | Reading-history import (Goodreads) | ✅ Done |
| M5 | Recommender v1 (user-scoped) | ✅ Done |
| Identity | Public registration, per-user shelves, GDPR self-service, hardening | ✅ Done |
| M6.5 | CI/CD auto-deploy (green main → NAS) | ✅ Done |
| Relevance | relevance_score (demand + holdings + recency) drives browse/search ordering | ⏳ Planned |
| Libraries | Follow branches by proximity; scope browse/search/recs to your libraries | ⏳ Planned |
| M7+ | More provinces · mobile app · external canon boost | ⏳ Planned |
Public deploy is live at biblio.josearcos.me; see docs/design/architecture.md §11 for milestone detail.
The production api is instrumented with OpenTelemetry (APM / distributed tracing). The container runtime wraps uvicorn with opentelemetry-instrument, which auto-instruments FastAPI and asyncpg — HTTP requests and DB queries become spans with no application-code changes. It is a no-op locally: instrumentation only activates when the OTEL_* env vars are set (defined only in docker-compose.prod.yml), so dev runs and tests are unaffected.
In production, telemetry is exported via OTLP to a shared OpenTelemetry collector, which fans traces out to Grafana Tempo and SigNoz (service.name=bibliohack-api). See docs/design/architecture.md §10 for the full picture.
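To illustrate the "dormant unless configured" behaviour, here is a small sketch of an environment-variable gate. It is an assumption about how such a check could look, not the project's code: `tracing_enabled` is a hypothetical helper, while `OTEL_SDK_DISABLED` and `OTEL_EXPORTER_OTLP_ENDPOINT` are standard OpenTelemetry environment variables.

```python
import os
from typing import Mapping


def tracing_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Return True only when an OTLP endpoint is set and the SDK is not disabled."""
    if env.get("OTEL_SDK_DISABLED", "").strip().lower() == "true":
        return False
    return bool(env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "").strip())
```

With no `OTEL_*` variables in a local `.env`, a check like this returns `False`, which matches the intent that dev runs and tests are unaffected.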
Pushes to main are gated by CI (lint, typecheck, tests, image build); green builds then auto-deploy to a Synology NAS over Tailscale. The public surface is served through a Cloudflare Tunnel (read-only; no inbound ports). Database migrations ship in the API image and run on deploy. The crawl plane runs as a separate, self-restarting container so it can't affect the public site. Details, including the hard-won Synology specifics, live in docs/design/architecture.md §10.
This is primarily a personal project, but issues, ideas and PRs are welcome. If you're contributing code:
- Read AGENTS.md for the conventions (it's written for AI assistants but applies to humans too).
- Before pushing, the backend must pass ruff format --check ., ruff check ., mypy src, and pytest; the frontend must pass its lint/typecheck/test. All of this is enforced in CI.
- Add an Alembic migration for any schema change.
- Be a good OPAC citizen. The crawler is deliberately rate-limited and capped because it talks to a live public-library system. Please don't change it to be more aggressive.
biblioHack mirrors public-sector bibliographic data that belongs to the Junta de Andalucía and the Spanish public-library system, reused under the Spanish public-sector information rules (Ley 37/2007). Attribution: "Información obtenida del Portal de la Junta de Andalucía" (information obtained from the Junta de Andalucía portal), CC-BY 3.0 ES; derivatives of this data must carry the same attribution. The crawler identifies itself, throttles every request, and caps its volume so it never burdens the source system. This project is independent and unaffiliated.
MIT © José Arcos.