📚 biblioHack

A reverse catalogue, availability tracker, and recommender for the Andalusian public-library network — built to help you find the book you didn't know you were looking for.

Live instance · Architecture · Roadmap

A personal side project that mirrors public catalogue data. Not affiliated with the Junta de Andalucía or any of its libraries.

What is this?

The official public-library OPAC (AbsysNET) is great if you already know the title you want. biblioHack flips it around: it keeps a local, searchable mirror of the catalogue so you can explore it the way you'd browse a bookshop — search in natural Spanish (accents optional), filter to the literary catalogue, see which branch has a copy on the shelf right now, import your Goodreads shelf, and get recommendations based on what you've already read.

It's also a study in doing this politely and sustainably: a rate-limited crawler that's a good citizen of a public system, a history-preserving availability time-series, and a clean hexagonal codebase.

What works today

🔎 Full-text search in Spanish — accent-insensitive (café finds cafe) via Postgres spanish_unaccent + a generated tsvector.
📖 Literary-first catalogue — records are classified by audience (adult / youth / children) and literary form (literary / non-fiction) from shelf-mark and CDU signals. The default scope shows the adult literary catalogue; a toggle includes children's, youth and non-fiction. Nothing is discarded — just scoped.
🟢 Live availability by branch — a history-preserving time-series of copies per branch, surfaced as "N disponibles ahora" badges.
🖼️ Book covers — resolved asynchronously (Open Library → Google Books → placeholder) and served from a content-addressed store, off the crawl path.
🤖 Autonomous, resumable crawler — a containerised, scheduled crawler walks the catalogue with a persisted cursor, growing the mirror hour by hour without ever re-scanning from the top.
🧠 Semantic search — BGE-M3 embeddings in pgvector: ?mode=semantic queries and "more like this" on every record.
👤 User accounts — public registration with email verification, Turnstile bot protection, Redis sessions, rate limiting, and GDPR self-service (data export + account deletion).
📥 Reading-history import — upload a Goodreads CSV; a background worker matches it against the catalogue (ISBN-13 first, fuzzy title+author fallback) and your shelf re-matches for free as the mirror grows.
✨ Per-user recommendations — a taste profile from your rated shelf drives pgvector retrieval over the catalogue, with optional LLM-written rationales.
📊 Production APM — OpenTelemetry tracing (FastAPI + asyncpg + Redis) exported to Grafana Tempo / SigNoz.

On the roadmap

🔀 Hybrid retrieval — fusing keyword and semantic rankings for better search.
🗺️ Expansion beyond Huelva to other Andalusian provinces.
📱 Mobile app reusing the same API.

How it works

biblioHack is a hexagonal (ports & adapters) modular monolith. The domain logic never imports a framework or a driver; adapters (the AbsysNET scraper, Postgres repositories, the cover providers) plug in behind Protocol ports, which keeps the core testable and the deployment topology a free choice.

                    Cloudflare Tunnel (read-only, TLS at the edge)
                                   │
            ┌──────────────────────┴───────────────────────┐
            ▼                                               ▼
   ┌─────────────────┐                            ┌──────────────────┐
   │  Astro frontend │ ── /catalog, /healthz ──▶  │  FastAPI  (api)  │
   │  (static + React│                            │  + OpenTelemetry │
   │   islands)      │                            └────────┬─────────┘
   └─────────────────┘                                     │
                                                           ▼
                                    ┌──────────────────────────────────────┐
                                    │ Postgres (TimescaleDB + pgvector +    │
                                    │ spanish_unaccent FTS) · MinIO covers  │
                                    └──────────────────────▲───────────────┘
                                                           │ LAN / private network
   ┌───────────────────────────────────────────┐          │
   │ Crawl plane (off the public API)           │ ─────────┘
   │  scheduled discover → worker → refresh,     │
   │  Scrapling/Camoufox→Chromium against the    │
   │  public OPAC, polite by design              │
   └───────────────────────────────────────────┘

Read + serve plane is public (read-only) behind the tunnel; write/admin, the database and the crawler are never exposed to the internet.
The crawler runs separately from the API so a heavy headless-browser workload can't affect request latency. It paginates the OPAC's expert-query results with a resumable offset cursor, so it advances through the whole catalogue across runs and then tracks new arrivals.
Availability is a time-series: copies are upserted (never delete+insert) so each re-scrape appends a snapshot rather than wiping history.

See docs/design/architecture.md for the full design, the scrape state machine, and the data model.

Tech stack

Layer	Choice
Backend	Python 3.12, FastAPI, SQLAlchemy 2.0 (async) + asyncpg, Typer CLI, `uv`
Database	PostgreSQL + TimescaleDB (hypertables), pgvector, `spanish_unaccent` FTS
Scraping	Scrapling (Camoufox → Patchright Chromium), polite token-bucket throttle
Frontend	Astro (static) + React islands, Zod, pnpm
Covers/store	Pillow (WebP), content-addressed filesystem / MinIO (S3)
Observability	OpenTelemetry → OTLP → Grafana Tempo / SigNoz
Infra	Docker Compose, Synology NAS, Cloudflare Tunnel, GitHub Actions CI/CD

Repository layout

biblioHack/
├── README.md                # this file (GitHub front door)
├── AGENTS.md                # conventions for AI assistants / contributors (CLAUDE.md points here)
├── docs/                    # canonical Markdown — design/ · ops/ · outreach/ + generated site/
├── docker-compose.yml       # dev environment
├── docker-compose.prod.yml  # production (read + serve plane)
├── docker-compose.crawler.yml  # the autonomous crawl plane
├── backend/                 # FastAPI hexagonal modular monolith (uv)
│   └── src/bibliohack/       # bounded contexts: catalog · holdings · availability ·
│                             #   covers · identity · reading_history · recommendations · shared
├── frontend/                # Astro + React islands (pnpm)
├── infra/                   # Dockerfiles, crawler image + schedule, cloudflared config
└── .github/workflows/       # CI (lint, typecheck, test, build, deploy)

Getting started

Prerequisites

Python 3.12+ and uv
Node 20+ and pnpm
Docker + Docker Compose
make (optional, for the convenience targets)

Quick start

git clone https://github.com/jarcos/biblioHack.git
cd biblioHack
cp .env.example .env

# 1. Bring up postgres + redis + api + frontend
make dev-up

# 2. Backend checks (ruff + mypy + pytest)
make backend-check

# 3. Frontend checks (eslint + astro check + vitest)
make frontend-check

# 4. Open the apps
open http://localhost:8800/docs   # FastAPI Swagger
open http://localhost:4321        # Astro frontend

Every make target is a one-liner — peek inside the Makefile if you'd rather run things directly. Scraping is opt-in (it's a heavy, browser-backed dependency set): cd backend && uv sync --extra scraper && uv run camoufox fetch then uv run bibliohack catalog --help.

Configuration

Copy .env.example to .env and adjust as needed. Sensible defaults are provided for local development; the OpenTelemetry exporter and other production settings stay dormant unless their env vars are set.

📈 Status

Milestone	Scope	State
M0	Foundations (scaffold, compose, CI)	✅ Done
M1	Catalogue ingest + accent-insensitive search + literary scoping	✅ Done
M2	Availability history + autonomous resumable crawler	✅ Done
M2.5	Book covers (resolution + content-addressed store)	✅ Done
M3	Semantic search (BGE-M3 + pgvector)	✅ Done
M4	Reading-history import (Goodreads)	✅ Done
M5	Recommender v1 (user-scoped)	✅ Done
Identity	Public registration, per-user shelves, GDPR self-service, hardening	✅ Done
M6.5	CI/CD auto-deploy (green `main` → NAS)	✅ Done
Relevance	`relevance_score` (demand + holdings + recency) drives browse/search ordering	⏳ Planned
Libraries	Follow branches by proximity; scope browse/search/recs to your libraries	⏳ Planned
M7+	More provinces · mobile app · external canon boost	⏳ Planned

Public deploy is live at biblio.josearcos.me; see docs/design/architecture.md §11 for milestone detail.

Observability

The production api is instrumented with OpenTelemetry (APM / distributed tracing). The container runtime wraps uvicorn with opentelemetry-instrument, which auto-instruments FastAPI and asyncpg — HTTP requests and DB queries become spans with no application-code changes. It is a no-op locally: instrumentation only activates when the OTEL_* env vars are set (defined only in docker-compose.prod.yml), so dev runs and tests are unaffected.

In production, telemetry is exported via OTLP to a shared OpenTelemetry collector, which fans traces out to Grafana Tempo and SigNoz (service.name=bibliohack-api). See docs/design/architecture.md §10 for the full picture.

Deployment

Green pushes to main are gated by CI (lint, typecheck, tests, image build) and then auto-deploy to a Synology NAS over Tailscale. The public surface is served through a Cloudflare Tunnel (read-only; no inbound ports). Database migrations ship in the API image and run on deploy. The crawl plane runs as a separate, self-restarting container so it can't affect the public site. Details — including the hard-won Synology specifics — live in docs/design/architecture.md §10.

Contributing

This is primarily a personal project, but issues, ideas and PRs are welcome. If you're contributing code:

Read AGENTS.md for the conventions (it's written for AI assistants but applies to humans too).
Before pushing, the backend must pass ruff format --check ., ruff check ., mypy src, and pytest; the frontend must pass its lint/typecheck/test — all enforced in CI.
Add an Alembic migration for any schema change.
Be a good OPAC citizen. The crawler is deliberately rate-limited and capped because it talks to a live public-library system. Please don't change it to be more aggressive.

Responsible use & data

biblioHack mirrors public-sector bibliographic data that belongs to the Junta de Andalucía and the Spanish public-library system, reused under the Spanish public-sector information rules (Ley 37/2007). Información obtenida del Portal de la Junta de Andalucía (CC-BY 3.0 ES); derivatives of this data must carry the same attribution. The crawler identifies itself, throttles every request, and caps its volume so it never burdens the source system. This project is independent and unaffiliated.

License

_{Built with care for readers, libraries, and a polite internet.}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

📚 biblioHack

What is this?

What works today

On the roadmap

How it works

Tech stack

Repository layout

Getting started

Prerequisites

Quick start

Configuration

📈 Status

Observability

Deployment

Contributing

Responsible use & data

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 120 Commits
.github/workflows		.github/workflows
backend		backend
docs		docs
frontend		frontend
infra		infra
scripts		scripts
tools		tools
.dockerignore		.dockerignore
.editorconfig		.editorconfig
.env.example		.env.example
.gitattributes		.gitattributes
.gitignore		.gitignore
AGENTS.md		AGENTS.md
CLAUDE.md		CLAUDE.md
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
docker-compose.crawler.yml		docker-compose.crawler.yml
docker-compose.prod.yml		docker-compose.prod.yml
docker-compose.yml		docker-compose.yml
goodreads_library_export.csv		goodreads_library_export.csv

Folders and files

Latest commit

History

Repository files navigation

📚 biblioHack

What is this?

What works today

On the roadmap

How it works

Tech stack

Repository layout

Getting started

Prerequisites

Quick start

Configuration

📈 Status

Observability

Deployment

Contributing

Responsible use & data

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages