A CLI/API tool that uses browser automation with paid account cookies to access paywalled articles, extract content with images, generate AI summaries, and output structured markdown files.
Users with valid paid subscriptions to news/article platforms cannot easily:
- Programmatically access articles they have legitimate access to
- Extract full article content with embedded images
- Generate quick summaries for knowledge management
- Create persistent markdown records with all associated media
Current solutions either:
- Require manual browser interaction
- Lack image preservation capabilities
- Don't generate summaries automatically
- Use ethically questionable paywall bypass methods
Build an automation tool that:
- Authenticates using existing paid-account cookies (legally legitimate)
- Renders articles in a real browser (handles JavaScript-heavy paywalls)
- Extracts full article text + all images
- Summarizes content using LLM (Claude/GPT)
- Generates a markdown file with:
  - Article metadata (title, author, date, source URL)
  - Summary (2-3 paragraphs)
  - Embedded article images
  - Full article text (optional, for archival)
- Cookie Import: Load cookies from browser export or JSON file
- Cookie Validation: Verify cookies grant access before processing
- Session Management: Maintain persistent sessions across requests
- Timeout Handling: Refresh cookies if expired
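
The cookie import and validation steps above could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the `ExportedCookie` shape (including the `expirationDate` field) is an assumption about what a browser-extension export looks like, and `loadCookies` is a hypothetical helper name. The output shape is meant to match what Playwright's `browserContext.addCookies()` accepts, where `expires` is Unix time in seconds and `-1` marks a session cookie.

```typescript
// Assumed shape of one entry in a browser-extension cookie export (real exports vary).
interface ExportedCookie {
  name: string;
  value: string;
  domain: string;
  path?: string;
  secure?: boolean;
  httpOnly?: boolean;
  expirationDate?: number; // Unix time in seconds; absent for session cookies
}

// Normalized shape, intended for injection into the browser before navigation.
interface BrowserCookie {
  name: string;
  value: string;
  domain: string;
  path: string;
  secure: boolean;
  httpOnly: boolean;
  expires: number; // Unix time in seconds, -1 for session cookies
}

// Parse a JSON cookie export, skip malformed or expired entries,
// and fill in defaults for the optional fields.
function loadCookies(json: string, nowSeconds: number): BrowserCookie[] {
  const parsed: unknown = JSON.parse(json);
  if (!Array.isArray(parsed)) {
    throw new Error("Cookie file must contain a JSON array");
  }
  const result: BrowserCookie[] = [];
  for (const entry of parsed as Partial<ExportedCookie>[]) {
    if (
      typeof entry?.name !== "string" ||
      typeof entry.value !== "string" ||
      typeof entry.domain !== "string"
    ) {
      continue; // malformed entry
    }
    const expires =
      typeof entry.expirationDate === "number" ? entry.expirationDate : -1;
    if (expires !== -1 && expires <= nowSeconds) {
      continue; // already expired, would not grant access
    }
    result.push({
      name: entry.name,
      value: entry.value,
      domain: entry.domain,
      path: entry.path ?? "/",
      secure: entry.secure ?? false,
      httpOnly: entry.httpOnly ?? false,
      expires,
    });
  }
  return result;
}
```

An empty result is a cheap early signal that the user needs to re-export cookies. Confirming that the remaining cookies actually grant access still requires loading a known subscriber-only page.
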
- Headless Browser: Use Puppeteer (Node.js) or Playwright for rendering
- JavaScript Execution: Full page rendering to bypass client-side paywalls
- Dynamic Loading: Wait for lazy-loaded content
- Cookie Injection: Insert cookies before navigation
- Anti-Detection: Mimic human behavior (random delays, user-agent rotation)
- Text Extraction: Remove boilerplate (headers, footers, ads)
- Image Extraction: Download all article images
- Metadata Parsing: Title, author, publication date, source URL
- Link Preservation: Maintain reference links in markdown
- HTML Cleaning: Convert to clean markdown
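
As one illustration of the image extraction step, the sketch below turns the raw `src` attributes found in an article into a download plan. `planImageDownloads` and the numbered file-naming scheme are hypothetical choices made for this example. It resolves relative URLs against the page URL, drops duplicates, and skips anything that is not HTTPS, in line with the HTTPS-only requirement stated elsewhere in this document.

```typescript
interface ImagePlan {
  url: string;  // absolute URL to download
  file: string; // local file name to reference from the markdown
}

// Resolve, de-duplicate, and name the images referenced by an article.
function planImageDownloads(pageUrl: string, srcs: string[]): ImagePlan[] {
  const seen = new Set<string>();
  const plan: ImagePlan[] = [];
  for (const src of srcs) {
    let absolute: URL;
    try {
      absolute = new URL(src, pageUrl); // handles relative and absolute srcs
    } catch (_err) {
      continue; // unparseable src attribute
    }
    if (absolute.protocol !== "https:") continue; // also skips data: URIs
    if (seen.has(absolute.href)) continue;        // same image used twice
    seen.add(absolute.href);
    const baseName = absolute.pathname.split("/").pop() || "image";
    const index = String(plan.length + 1).padStart(2, "0");
    plan.push({ url: absolute.href, file: `${index}-${baseName}` });
  }
  return plan;
}
```

The numeric prefix keeps images in article order and avoids collisions when two CDNs serve files with the same base name.
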
- LLM Integration: Local Ollama models (no API keys required)
- Model Support:
  - Primary: llama3.1 (4.9GB) - Best balance of speed/quality
  - Fast Mode: qwen3:4b (2.5GB) - Quick summaries, lower resources
  - Quality Mode: mistral (4.4GB) - Nuanced understanding
  - Auto-fallback if model unavailable
- Smart Summarization:
  - Executive summary (2-3 paragraphs)
  - Key points extraction
  - Optional TL;DR (one-liner)
- Customizable Length: Short/Medium/Long summary options
- Offline-First: Runs locally, no internet required after model download
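
The auto-fallback and summary-length options above might look something like this. It is a sketch under assumptions: `pickModel` and `buildSummaryRequest` are hypothetical helpers, and the mapping from Short/Medium/Long to paragraph counts is invented for illustration (only the 2-3 paragraph default comes from this document). The installed-model list is assumed to come from Ollama's `GET /api/tags`, and the request body is shaped for `POST /api/generate` on `localhost:11434`.

```typescript
// Preference order taken from the model list above.
const MODEL_PREFERENCE = ["llama3.1", "qwen3:4b", "mistral"];

// One entry of the "models" array returned by GET /api/tags.
interface InstalledModel {
  name: string; // e.g. "llama3.1:latest"
}

// Return the first preferred model that is installed, or null if none is.
function pickModel(
  installed: InstalledModel[],
  preference: string[] = MODEL_PREFERENCE
): string | null {
  for (const wanted of preference) {
    const match = installed.find(
      (m) => m.name === wanted || m.name.startsWith(wanted + ":")
    );
    if (match) return match.name;
  }
  return null;
}

type SummaryLength = "short" | "medium" | "long";

// Assumed mapping; only "2-3 paragraphs" is specified in this document.
const LENGTH_HINT: Record<SummaryLength, string> = {
  short: "1 paragraph",
  medium: "2-3 paragraphs",
  long: "4-5 paragraphs",
};

interface GenerateRequest {
  model: string;
  prompt: string;
  stream: boolean;
}

// Build the JSON body for POST http://localhost:11434/api/generate.
// With stream set to false, Ollama replies with a single JSON object
// whose "response" field holds the generated text.
function buildSummaryRequest(
  model: string,
  articleText: string,
  length: SummaryLength
): GenerateRequest {
  const prompt =
    `Summarize the following article in ${LENGTH_HINT[length]}, ` +
    `then list the key points as bullets.\n\n${articleText}`;
  return { model, prompt, stream: false };
}
```

A `null` result from `pickModel` is the point at which the tool would show the setup guidance rather than attempt a request.
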
- Structured Format:

      # [Article Title]

      **Source**: [Publication Name] | [URL]
      **Author**: [Name] | **Date**: [Published Date]

      ## Summary
      [AI-generated summary]

      ## Images
      [Embedded images with captions]

      ## Full Article
      [Original article text]

- Image Embedding: Relative paths, downloadable to local folder
- File Organization: Organized by publication/date
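
To make the output format and file organization concrete, here is a minimal sketch of a renderer. The `Article` type, the `slugify` rule, the `images/` subfolder, and the `publication/date/title.md` layout are all assumptions chosen for illustration. The document only specifies the section order, relative image paths, and organization by publication and date.

```typescript
interface Article {
  title: string;
  publication: string;
  url: string;
  author: string;
  date: string; // ISO date such as "2024-05-01"
  summary: string;
  images: { file: string; caption: string }[];
  fullText?: string; // optional, for archival
}

// Lowercase, replace runs of non-alphanumerics with "-", trim stray dashes.
function slugify(text: string): string {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Organize output by publication, then date.
function outputPath(article: Article): string {
  return `${slugify(article.publication)}/${article.date}/${slugify(article.title)}.md`;
}

// Render the structured markdown file; images use relative paths.
function renderMarkdown(article: Article): string {
  const lines: string[] = [
    `# ${article.title}`,
    "",
    `**Source**: ${article.publication} | ${article.url}`,
    "",
    `**Author**: ${article.author} | **Date**: ${article.date}`,
    "",
    "## Summary",
    "",
    article.summary,
    "",
    "## Images",
    "",
    ...article.images.map((img) => `![${img.caption}](images/${img.file})`),
  ];
  if (article.fullText) {
    lines.push("", "## Full Article", "", article.fullText);
  }
  return lines.join("\n") + "\n";
}
```

Keeping image links relative means the article folder can be moved or archived as a unit without breaking the embedded images.
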
- Paywall detection failures
- Session expiration
- Image download failures
- Summarization failures (Ollama not running, model missing, or request timeout)
- Graceful degradation
    ┌─────────────────────────────────────┐
    │        CLI / API Entry Point        │
    │     (Node.js with Commander.js)     │
    └────────────────┬────────────────────┘
                     │
         ┌──────────┼──────────┐
         │          │          │
    ┌────▼───┐ ┌────▼───┐  ┌───▼─────┐
    │ Cookie │ │Browser │  │   LLM   │
    │Manager │ │Engine  │  │ Client  │
    └────┬───┘ └────┬───┘  └───┬─────┘
         │          │          │
         └──────────┼──────────┘
                     │
            ┌───────▼────────┐
            │ Content Store  │
            │ (Markdown/IMG) │
            └────────────────┘
| Component | Option 1 | Option 2 |
|---|---|---|
| Browser Automation | Puppeteer | Playwright |
| Runtime | Node.js 18+ | Bun |
| LLM Backend | Ollama (Local) | |
| LLM Communication | ollama npm package | Direct HTTP to localhost:11434 |
| Markdown Generation | Marked.js | Pandoc |
| Image Processing | Sharp | ImageMagick |
| CLI Framework | Commander.js | Yargs |
| HTTP Client | Axios | Node-fetch |
| Data Validation | Zod | Yup |
Why Ollama?
- ✅ No API costs (free, locally-hosted)
- ✅ No rate limits
- ✅ Works offline
- ✅ Complete privacy (data never leaves your machine)
- ✅ No network round trip (inference speed depends on local hardware)
- ✅ Multiple models available
User exports cookies from browser
↓
Provides article URL + cookie file
↓
System validates cookie access
↓
Confirms article is accessible

Load article URL with cookies
↓
Render in headless browser
↓
Detect if paywall present
↓
Extract text + images
↓
Parse metadata
↓
Summarize with LLM
↓
Generate markdown file
↓
Download images to local folder
↓
Output complete package
- Soft Paywalls (JavaScript overlays): MIT Technology Review, Medium, some news sites
- Client-Side Hard Paywalls: Sites that load content then hide it
- Server-Side (Limited): If user is pre-authenticated with cookies
Not Supported:
- Server-side paywalls requiring fresh authentication
- Token-based APIs without cookie equivalents
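
The supported and unsupported paywall categories above suggest a simple classification step after rendering. The sketch below is only a heuristic illustration: `PageSnapshot`, `classifyPaywall`, and the 1500-character threshold are assumptions, not part of the specification, and a real detector would need per-site tuning (a genuinely short article would be misclassified here).

```typescript
type PaywallKind = "none" | "overlay" | "hidden-content" | "server-side";

// Measurements taken from the rendered page (how they are gathered is out of scope here).
interface PageSnapshot {
  domTextLength: number;     // article text present in the DOM, visible or not
  visibleTextLength: number; // article text actually visible to the reader
  overlayDetected: boolean;  // a modal or overlay covers the article
}

// Map page measurements onto the paywall categories listed above.
function classifyPaywall(
  page: PageSnapshot,
  minArticleChars = 1500 // assumed threshold for "a full article is present"
): PaywallKind {
  // The server never sent the full text: cookies did not grant access.
  if (page.domTextLength < minArticleChars) return "server-side";
  // Full text is in the DOM but a JavaScript overlay sits on top of it.
  if (page.overlayDetected) return "overlay";
  // Full text was loaded, then hidden by client-side code.
  if (page.visibleTextLength < minArticleChars) return "hidden-content";
  return "none";
}

// Only cases where the text reached the browser can be extracted.
function isExtractable(kind: PaywallKind): boolean {
  return kind !== "server-side";
}
```

A `"server-side"` result with valid cookies is the signal to report that fresh authentication is needed rather than to retry extraction.
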
- ✅ Successfully extract >90% of article text
- ✅ Preserve >95% of article images
- ✅ Generate summaries within 30 seconds
- ✅ Markdown output renders correctly in CommonMark-compliant viewers
- ✅ Handle 50+ different paywall types
- ✅ Support batch processing (multiple URLs)
- Single article: < 60 seconds (render + extract + summarize)
- Batch processing: Parallel processing with rate limiting
- Image download: Concurrent (max 5 simultaneous)
- Retry logic for transient failures
- Graceful degradation (summary optional)
- Session recovery from cookie expiration
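
The retry and graceful-degradation requirements above could be approached along these lines. This is a sketch with assumed defaults: the retry count, the 500 ms base delay, the 8 s cap, and the helper names `backoffDelays`, `withRetry`, and `summarizeOrSkip` are all illustrative choices rather than specified values.

```typescript
// Exponential backoff schedule: baseMs, 2*baseMs, 4*baseMs, ... capped at maxMs.
function backoffDelays(retries: number, baseMs = 500, maxMs = 8000): number[] {
  return Array.from({ length: retries }, (_, i) =>
    Math.min(baseMs * 2 ** i, maxMs)
  );
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Run a task, retrying after each failure until the schedule is exhausted.
async function withRetry<T>(
  task: () => Promise<T>,
  retries = 3,
  sleep: (ms: number) => Promise<void> = defaultSleep
): Promise<T> {
  const delays = backoffDelays(retries);
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (err) {
      if (attempt >= delays.length) throw err; // out of retries
      await sleep(delays[attempt]);
    }
  }
}

// Graceful degradation: the summary is optional, so a persistent failure
// yields null and the markdown file is still written without it.
async function summarizeOrSkip(
  summarize: () => Promise<string>
): Promise<string | null> {
  try {
    return await withRetry(summarize);
  } catch (_err) {
    return null;
  }
}
```

The same `withRetry` wrapper would apply to image downloads, where a failed image is skipped rather than failing the whole article.
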
- Store cookies securely (encrypted config file)
- No credential logging
- HTTPS-only for all requests
- No telemetry/tracking
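
For the encrypted cookie storage requirement above, one possible approach uses Node's built-in `crypto` module with AES-256-GCM and a passphrase-derived key. This is a sketch, not a vetted security design: the helper names, the dot-separated base64 blob layout, and the use of scrypt with default parameters are assumptions made for illustration.

```typescript
import { createCipheriv, createDecipheriv, randomBytes, scryptSync } from "node:crypto";

// Encrypt the cookie JSON; output is "salt.iv.tag.ciphertext", each part base64.
function encryptCookies(plaintext: string, passphrase: string): string {
  const salt = randomBytes(16);
  const iv = randomBytes(12); // 96-bit nonce, the usual size for GCM
  const key = scryptSync(passphrase, salt, 32); // 256-bit key
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([
    cipher.update(plaintext, "utf8"),
    cipher.final(),
  ]);
  const tag = cipher.getAuthTag();
  return [salt, iv, tag, ciphertext].map((b) => b.toString("base64")).join(".");
}

// Decrypt a blob produced by encryptCookies; throws if the passphrase
// is wrong or the blob was tampered with.
function decryptCookies(blob: string, passphrase: string): string {
  const parts = blob.split(".").map((p) => Buffer.from(p, "base64"));
  if (parts.length !== 4) throw new Error("Malformed encrypted cookie blob");
  const [salt, iv, tag, ciphertext] = parts;
  const key = scryptSync(passphrase, salt, 32);
  const decipher = createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([
    decipher.update(ciphertext),
    decipher.final(),
  ]).toString("utf8");
}
```

Because GCM is authenticated, a wrong passphrase or a modified file fails loudly at `decipher.final()` instead of yielding corrupted cookies. The passphrase itself should never be written to logs, consistent with the no-credential-logging rule.
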
- Respect robots.txt, even when authenticated with a paid account
- Only access URLs user has legitimate subscription to
- Clear terms: "For archival of legitimately accessible content"
- Single article extraction
- Manual cookie import
- Basic text + image extraction
- Local Ollama summarization (llama3.1 primary, qwen3:4b fallback)
- Markdown output with images
- Support 5 major paywall types
- Auto-detect and use available Ollama model
- Batch URL processing
- Scheduled article downloads
- Web UI dashboard
- Browser extension integration
- Advanced metadata extraction
- Citation generation
- Database storage with full-text search
- Email delivery of summaries
- Webhook integration
- Custom LLM prompt templates
| Risk | Impact | Mitigation |
|---|---|---|
| Cookie expiration | Tool stops working | Auto-refresh, user notification |
| Paywall updates break extraction | Lost functionality | Regular testing, modular extractors |
| Ollama not running | Summaries fail | Clear error message, setup guide |
| Required model not downloaded | Processing blocked | Auto-detect available models, fallback chain |
| Insufficient VRAM for model | Out of memory errors | Model selection based on system specs |
| Legal concerns | Takedown risk | Clear ToS, user responsibility |
| Image CDN blocking | Missing images | Retry, fallback URLs |
- CLI accepts URL + cookie file
- Extracts article text accurately
- Downloads and embeds images in markdown
- Generates coherent summaries
- Creates properly formatted markdown files
- Handles cookie expiration gracefully
- Processes batch URLs
- Comprehensive error messages
- Works with 5+ different paywall types
- Complete documentation + examples