A simple, library-free web crawler demonstrating Intuned's extend_payload and persistent_store features for parallel crawling with deduplication.
This project showcases two key Intuned runtime features:
- **extend_payload**: Dynamically spawns new payloads within a job. This enables a fan-out pattern where one API call triggers many others, all within the same job run. Reference: https://intunedhq.com/docs/main/05-references/runtime-sdk-python/extend-payload
- **persistent_store**: A shared key-value store that persists across all payloads in a job. Used here for URL deduplication, preventing the same page from being crawled multiple times. Reference: https://intunedhq.com/docs/main/05-references/runtime-sdk-python/persistent-store
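Roughly, the two features combine in the crawl API like this. The snippet below is a minimal sketch, not the project's actual code: the SDK import path, the `attempt_store` return semantics, and the payload shape passed to `extend_payload` are assumptions; see the references above and `api/crawl.py` for the real implementation.

```python
# Illustrative sketch only -- the import path and call signatures below are
# assumptions; consult the extend_payload / persistent_store docs linked above.
from intuned.runtime import extend_payload, persistent_store  # assumed module path

from utils.content import extract_page_content          # utils/content.py
from utils.links import extract_links, get_base_domain  # utils/links.py

async def crawl(params, page):
    url = params["url"]

    # Dedup: assume attempt_store returns True only for the first payload that
    # writes this key, so each URL is processed at most once per job run.
    if not await persistent_store.attempt_store(f"visited:{url}", True):
        return None

    await page.goto(url)  # Playwright-style page object (assumed handler signature)
    result = await extract_page_content(page)

    # Fan out: queue one new payload of this same API for every discovered link.
    for link in await extract_links(page, get_base_domain(url)):
        await extend_payload({
            "api": "crawl",
            "parameters": {"url": link, "depth": params.get("depth", 0) + 1},
        })

    return result
```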
```
                  ┌─────────────────────────────────────────────────┐
                  │                     JOB RUN                     │
                  │                                                 │
                  │   ┌─────────────────────────────────────────┐   │
                  │   │        SHARED persistent_store          │   │
                  │   │         (tracks visited URLs)           │   │
                  │   └─────────────────────────────────────────┘   │
                  │                                                 │
Start Job ───────►│  1. crawl(seed_url)                             │
                  │     ├─► Extract markdown content                │
                  │     ├─► Discover 50 links                       │
                  │     └─► extend_payload(crawl) × 50              │
                  │                   │                             │
                  │     ┌─────────────┼─────────────┐               │
                  │     ▼             ▼             ▼               │
                  │  2. crawl(link1)  crawl(link2)  crawl(...)      │
                  │     ├─► Extract   ├─► Extract   ├─► ...         │
                  │     ├─► Discover  ├─► Discover  │               │
                  │     └─► extend... └─► extend... │               │
                  │                                                 │
                  │  Continues until depth/page limits reached      │
                  │  All results sent to job sink (webhook/S3)      │
                  └─────────────────────────────────────────────────┘
```
```
native-crawler/
├── api/
│   └── crawl.py              # Main API: extract content + discover links + recurse
├── utils/
│   ├── __init__.py
│   ├── content.py            # extract_page_content() - markdown extraction
│   └── links.py              # extract_links() - link discovery + normalization
├── intuned-resources/
│   └── jobs/
│       └── crawl.job.jsonc   # Job definition for crawl API
├── .parameters/api/          # Parameter files for testing
├── Intuned.jsonc
└── README.md
```
Crawls a URL: extracts content, discovers links, and queues them for further crawling.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `url` | string | required | URL to crawl |
| `max_depth` | int | 2 | Maximum crawl depth from seed |
| `max_pages` | int | 50 | Maximum total pages to process |
| `depth` | int | 0 | Current depth (set internally by extend_payload) |
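For a plain crawl (no schema, no attachments), `.parameters/api/crawl/default.json` can be as simple as the example below. This is an assumed sample that mirrors the documented defaults and the seed URL used in the job example further down; the file shipped with the repo may differ.

```json
{
  "url": "https://books.toscrape.com",
  "max_depth": 2,
  "max_pages": 50
}
```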
Install dependencies:

```bash
uv sync
```

If the intuned CLI is not installed, install it globally:

```bash
npm install -g @intuned/cli
```

After installation, the `intuned` command should be available in your environment.

Before running any API, provision and deploy the project first:

```bash
intuned dev provision
intuned dev deploy
```

To run the crawl API locally with the sample parameters:

```bash
intuned dev run api crawl .parameters/api/crawl/default.json
```

When run as a job, extend_payload spawns parallel payloads and attempt_store deduplicates across all of them:
```bash
curl -X POST "https://api.intunedhq.com/projects/{project}/jobs" \
  -H "Authorization: Bearer {api_key}" \
  -H "Content-Type: application/json" \
  -d '{
    "payload": {
      "api": "crawl",
      "parameters": {
        "url": "https://books.toscrape.com",
        "max_depth": 2,
        "max_pages": 100
      }
    },
    "sink": {
      "type": "webhook",
      "url": "https://your-webhook.com/results"
    }
  }'
```

You can extract structured data instead of markdown by providing a generic JSON schema. This uses Intuned's AI extraction model.
Edit `.parameters/api/crawl/default.json` to include a schema:

```json
{
  "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
  "max_depth": 0,
  "schema": {
    "type": "object",
    "properties": {
      "title": { "type": "string" },
      "price": { "type": "string" },
      "availability": { "type": "string" }
    },
    "required": ["title", "price"]
  }
}
```

Then run:

```bash
intuned dev run api crawl .parameters/api/crawl/default.json
```

You can automatically find and download files (PDFs, images, etc.) to S3 by enabling `include_attachments`.
Edit `.parameters/api/crawl/default.json`:

```json
{
  "url": "https://sandbox.intuned.dev/pdfs",
  "max_depth": 1,
  "include_attachments": true
}
```

Then run:
```bash
intuned dev run api crawl .parameters/api/crawl/default.json
```

The helpers in `utils/` used by the crawl API:

- `extract_page_content(page)` — Returns `{title, markdown, markdown_length}`
- `extract_links(page, base_domain)` — Returns a list of normalized internal URLs
- `normalize_url(url)` — Normalizes a URL (removes fragments and trailing slashes)
- `get_base_domain(url)` — Extracts the domain from a URL
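As an illustration of what the two URL helpers do, a stdlib-only version could look like the sketch below; the actual `utils/links.py` may be implemented differently.

```python
from urllib.parse import urlparse, urlunparse

def normalize_url(url: str) -> str:
    # Drop the fragment and any trailing slash so equivalent URLs compare equal.
    parsed = urlparse(url)
    path = parsed.path.rstrip("/")
    return urlunparse((parsed.scheme, parsed.netloc, path, parsed.params, parsed.query, ""))

def get_base_domain(url: str) -> str:
    # Host portion only, e.g. "books.toscrape.com".
    return urlparse(url).netloc
```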
The persistent_store uses these key patterns:
| Key Pattern | Purpose |
|---|---|
| `visited:{url}` | Tracks URLs that have been crawled |
| `__page_count__` | Global counter for pages processed |
| `__max_depth__` | Stored config: max depth |
| `__max_pages__` | Stored config: max pages |
| `__base_domain__` | Stored config: base domain for filtering |
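To illustrate how these keys could gate the fan-out, here is a hedged sketch of a limit check; the store's `get`/`set` methods, their signatures, and their atomicity are assumptions rather than the documented API.

```python
# Hypothetical helper -- key names match the table above; get/set signatures
# and atomicity are assumptions, not guarantees from the SDK.
async def should_crawl(store, depth: int) -> bool:
    max_depth = await store.get("__max_depth__")
    max_pages = await store.get("__max_pages__")
    pages_so_far = await store.get("__page_count__") or 0

    if depth > max_depth or pages_so_far >= max_pages:
        return False

    # A read-modify-write like this is racy across parallel payloads; the real
    # project may rely on attempt_store-style claims or an atomic counter instead.
    await store.set("__page_count__", pages_so_far + 1)
    return True
```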