v0.4.8 by D4Vinci · Pull Request #277 · D4Vinci/Scrapling

D4Vinci · 2026-05-11T01:28:16Z

A big spider update that takes the crawling framework to the next level 🕷️

🚀 New Stuff and quality of life changes

Added a LinkExtractor primitive in scrapling.spiders.LinkExtractor to pull URLs out of a Response. It comes with many filtering controls (check the docs).
```
from scrapling.spiders import LinkExtractor

extractor = LinkExtractor(allow=r"/posts/", deny_domains=["ads.example.com"])
```
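
To make the two options in that snippet concrete, here is a minimal standard-library sketch of the kind of filtering they suggest. It is not Scrapling's implementation: `filter_links` is a hypothetical helper, and it assumes `allow` is applied as a regex search over the absolute URL and `deny_domains` as an exact host match.

```python
import re
from urllib.parse import urljoin, urlparse


def filter_links(base_url, hrefs, allow=None, deny_domains=()):
    """Hypothetical helper: resolve hrefs, then apply allow/deny_domains filters."""
    pattern = re.compile(allow) if allow else None
    kept = []
    for href in hrefs:
        url = urljoin(base_url, href)  # make relative links absolute
        host = urlparse(url).hostname or ""
        if host in deny_domains:  # assumption: exact host match
            continue
        if pattern and not pattern.search(url):  # assumption: regex search
            continue
        if url not in kept:  # drop duplicates, keep first-seen order
            kept.append(url)
    return kept


links = filter_links(
    "https://example.com/",
    ["/posts/1", "/about", "https://ads.example.com/posts/2", "/posts/1"],
    allow=r"/posts/",
    deny_domains=["ads.example.com"],
)
print(links)  # ['https://example.com/posts/1']
```

The real extractor may match subdomains or normalize URLs differently, so treat the docs as the source of truth for the exact behavior.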

Added CrawlSpider and CrawlRule generic spider templates so you no longer have to hand-write the same "follow links matching this pattern" boilerplate. Override rules() to return a list of CrawlRule objects, each pairing a LinkExtractor with an optional callback. (Check the docs)

```
from scrapling.spiders import CrawlSpider, CrawlRule, LinkExtractor

class QuotesSpider(CrawlSpider):
    name = "blog"
    start_urls = ["https://quotes.toscrape.com/"]

    def rules(self):
        return [
            CrawlRule(LinkExtractor(allow=r"/author/"), callback=self.parse_author),
            CrawlRule(LinkExtractor(allow=r"/page/\d+/")),  # pagination, no callback
        ]

    async def parse_author(self, response):
        yield {
            "name": response.css(".author-title::text").get(),
            "birthday": response.css(".author-born-date::text").get(),
            "url": response.url,
        }
```
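
Before running a crawl, it can help to check which URLs the two patterns in these rules would pick up. The sketch below is a plain-Python stand-in, not the CrawlSpider dispatcher: `RULES` and `match_rule` are hypothetical, and both the first-match-wins order and the regex-search semantics are assumptions.

```python
import re

# Hypothetical stand-ins for the rules above: (pattern, callback name or None).
RULES = [
    (re.compile(r"/author/"), "parse_author"),
    (re.compile(r"/page/\d+/"), None),  # followed for pagination, no callback
]


def match_rule(url):
    """Return (matched, callback_name) for the first rule whose pattern occurs in url."""
    for pattern, callback in RULES:
        if pattern.search(url):
            return True, callback
    return False, None


for url in [
    "https://quotes.toscrape.com/author/Albert-Einstein/",
    "https://quotes.toscrape.com/page/2/",
    "https://quotes.toscrape.com/tag/love/",
]:
    print(url, match_rule(url))
```

Under these assumptions, the author page maps to parse_author, the pagination page is followed without a callback, and the tag page matches no rule.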

Added a SitemapSpider template that seeds a crawl directly from sitemap or robots.txt URLs. It handles gzip-compressed sitemaps and comes with many controls and options. URLs are dispatched via the crawl rules, as shown above for CrawlSpider. (Check the docs)

```
from scrapling.spiders import SitemapSpider, CrawlRule, LinkExtractor

class NewsSitemap(SitemapSpider):
    name = "news"
    sitemap_urls = ["https://example.com/robots.txt"]

    def rules(self):
        return [
            CrawlRule(LinkExtractor(allow=r"/articles/"), callback=self.parse_article),
        ]

    async def parse_article(self, response):
        yield {"url": response.url, "title": response.css("h1::text").get()}
```
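
For readers curious about what seeding from robots.txt and a gzip-compressed sitemap involves, here is a rough standard-library sketch. It is an illustration under stated assumptions, not SitemapSpider's code: `sitemaps_from_robots` and `urls_from_sitemap` are hypothetical helpers, and the sample robots.txt and sitemap bodies are made up.

```python
import gzip
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemaps_from_robots(robots_txt):
    """Collect the URLs declared on 'Sitemap:' lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")  # split at the first colon only
        if key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


def urls_from_sitemap(body):
    """Return the <loc> values of a sitemap, decompressing gzip bodies first."""
    if body[:2] == b"\x1f\x8b":  # gzip magic number
        body = gzip.decompress(body)
    root = ET.fromstring(body)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


# Made-up sample data for illustration.
robots = "User-agent: *\nDisallow: /admin/\nSitemap: https://example.com/sitemap.xml.gz\n"
sitemap_xml = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b"<url><loc>https://example.com/articles/1</loc></url>"
    b"<url><loc>https://example.com/about</loc></url>"
    b"</urlset>"
)

print(sitemaps_from_robots(robots))
print(urls_from_sitemap(gzip.compress(sitemap_xml)))
```

In the spider itself, the discovered URLs would then go through the crawl rules, so only the /articles/ URL in this sample would reach parse_article.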

Adaptive relocation now defaults to a 40% similarity threshold instead of 0 across all methods, so weak matches are no longer returned by default. When nothing crosses the threshold, a warning now tells you the top score it did see, so you can lower the percentage deliberately if needed.
Updated all browsers and fingerprints. Run scrapling install --force again after updating to refresh the browsers and fingerprints.
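
The threshold behavior can be pictured with a small sketch. This is not the adaptive engine: `relocate` is a hypothetical function, and `difflib.SequenceMatcher` stands in for whatever similarity score Scrapling actually computes. Only the 40% default and the "warn with the top score" behavior come from the notes above.

```python
import warnings
from difflib import SequenceMatcher


def relocate(target, candidates, percentage=40):
    """Hypothetical sketch: return the best candidate only if it clears the threshold."""
    if not candidates:
        return None
    # Stand-in similarity: character-level ratio scaled to 0..100.
    scored = [(SequenceMatcher(None, target, c).ratio() * 100, c) for c in candidates]
    top_score, best = max(scored, key=lambda pair: pair[0])
    if top_score >= percentage:
        return best
    warnings.warn(
        f"No candidate reached {percentage}% similarity; top score was {top_score:.1f}%"
    )
    return None


# A close match clears the default threshold.
print(relocate("div.product-title", ["div.product-name", "qqq"]))

# With a threshold of 0, even an unrelated candidate comes back as a "best guess".
print(relocate("div.product-title", ["qqq"], percentage=0))
```

Calling `relocate("div.product-title", ["qqq"])` with the default threshold returns None and emits the warning instead.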

🐛 Bug Fixes

Fixed Fetcher.configure(...) not applying to per-request calls. Same fix applied to AsyncFetcher.
Fixed incorrect request fingerprinting that caused duplicate requests in spiders by @yetval in #255.
Fixed the Adaptive scraping engine staying silent on weak matches. Combined with the threshold change above, you now get a warning instead of a misleading "best guess" element when relocation fails.
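
As background on the fingerprinting fix, the sketch below shows one common way to hash a request so that equivalent requests collapse to the same key. It is not the code from #255: `request_fingerprint` is a hypothetical function, and lowercasing header names and serializing with sorted keys are assumptions about what a stable fingerprint needs.

```python
import hashlib
import json


def request_fingerprint(method, url, headers=None, **kwargs):
    """Hypothetical sketch: hash method, URL, headers, and kwargs in a canonical form."""
    payload = {
        "method": method.upper(),
        "url": url,
        # Header names are case-insensitive, so normalize them before hashing.
        "headers": {k.lower(): v for k, v in (headers or {}).items()},
        "kwargs": kwargs,
    }
    # sort_keys makes the result independent of dict insertion order;
    # default=str covers values that JSON cannot serialize natively.
    raw = json.dumps(payload, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


url = "https://example.com/search"
a = request_fingerprint("get", url, headers={"Accept": "text/html", "X-Token": "1"},
                        params={"q": "a", "page": 1})
b = request_fingerprint("GET", url, headers={"x-token": "1", "accept": "text/html"},
                        params={"page": 1, "q": "a"})
c = request_fingerprint("GET", url, headers={"accept": "text/html", "x-token": "1"},
                        params={"q": "b", "page": 1})
print(a == b, a == c)  # True False
```

A deduplication filter keyed on such a hash treats `a` and `b` as the same request while still scheduling `c`.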

Docs

Refreshed older code examples across the documentation to match the current version.
Improved the code copy-paste experience on the docs site and trimmed the agent skill so it uses fewer tokens per invocation.

🙏 Special thanks to the community for all the continuous testing and feedback

D4Vinci and others added 27 commits April 22, 2026 15:31

fix: solving a bug with using configure on Fetcher

35032d6

test: adding new tests for the configure function

b626b4d

docs: improving the code copy-paste experience and use less tokens fo…

9af644e

…r the agent skill

build: pump up version and deps

b6f0f2a

docs: Updating deps and allow code copy

fa41ca6

Merge branch 'main' into dev

82837ee

docs: update old examples

835e7ca

fix: hash request kwargs and headers correctly

b173635

test: add request fingerprint regressions

a5a5652

Merge branch 'main' into dev

756a595

Merge branch 'main' into dev

f305580

fix: str(value)

334b0f5

Merge branch 'dev' into fix/request-fingerprint

809d478

fix(request fp): hash request kwargs and headers correctly (#255)

90a853d

Merge branch 'main' into dev

ea7b4bd

fix(parser): change the default threshold and add warning

333b6de

Merge branch 'main' into dev

e5d7ed2

feat(spiders): Add pure URL discovery primitive

18e9121

feat(spiders): Add CrawlSpider and CrawlRule

f093d0c

feat(spiders): Add SitemapSpider

f7da157

test: add tests accordingly

8a8b2d1

docs: add docs for the new features

6a64191

docs: update zensical version

c0ed894

build: update deps and browser useragents

ebd7e09

tests: remove old code and update the rest

34651ab

docs(agent): update skill zip file

5593189

Merge branch 'main' into dev

149d5f0

github-advanced-security AI found potential problems May 11, 2026

Comment thread tests/spiders/test_links.py Dismissed

ops: update tests deps

66d8917

D4Vinci merged commit 9ee3501 into main May 11, 2026
13 checks passed

D4Vinci deployed to PyPI May 11, 2026 01:59 with GitHub Actions

D4Vinci merged 28 commits into main from dev

4 participants
