
Releases: D4Vinci/Scrapling

v0.3

01 Sep 04:26
d5aaeb9

Scrapling v0.3.0 Release Notes

🎉 Major Release — Complete Architecture Overhaul

Scrapling v0.3 represents the most significant update in the project's history, featuring a complete architectural rewrite, considerable performance improvements, and powerful new features, including AI integration and interactive Web Scraping shell capabilities.

This release includes multiple breaking changes; please review the release notes carefully.

🚀 Major New Features

Session-Based Architecture

  • New Session Classes: Complete rewrite introducing persistent session support
    • FetcherSession - HTTP requests with persistent state management that works with both sync and async code
    • DynamicSession/AsyncDynamicSession - Browser automation that keeps the browser open until you finish
    • StealthySession/AsyncStealthySession - Stealth browsing that keeps the browser open until you finish
  • Async Browser Tabs Management: A new tab-pool feature, enabled through the max_pages argument, that rotates browser tabs for concurrent browser fetches
  • Concurrent Sessions: Run multiple isolated sessions simultaneously

Refer to the Fetching section on the website for more details.
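
The tab-pool idea can be pictured with a small, library-free sketch. This is not Scrapling's implementation: `TabPool`, `fetch_all`, and the simulated page load are hypothetical names used only for illustration. It shows how a `max_pages`-style limit can cap the number of fetches that run at the same time.

```python
import asyncio


class TabPool:
    """Toy stand-in for a pool of browser tabs (hypothetical, not Scrapling's code)."""

    def __init__(self, max_pages: int) -> None:
        self._slots = asyncio.Semaphore(max_pages)  # at most max_pages tabs in use
        self.in_use = 0
        self.peak = 0  # highest number of tabs used at the same time

    async def fetch(self, url: str) -> str:
        async with self._slots:  # wait here until a tab is free
            self.in_use += 1
            self.peak = max(self.peak, self.in_use)
            try:
                await asyncio.sleep(0.01)  # pretend to load the page
                return f"<html>{url}</html>"
            finally:
                self.in_use -= 1  # hand the tab back to the pool


async def fetch_all(urls: list[str], max_pages: int) -> tuple[list[str], int]:
    pool = TabPool(max_pages)
    pages = await asyncio.gather(*(pool.fetch(u) for u in urls))
    return list(pages), pool.peak


urls = [f"https://example.com/{i}" for i in range(6)]
pages, peak = asyncio.run(fetch_all(urls, max_pages=2))
print(len(pages), peak)  # 6 pages fetched, never more than 2 at once
```

A semaphore is only one way to bound concurrency; a real browser pool would also have to reuse and reset the tabs themselves.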

New Stealth/Anti-Bot Capabilities

  • 🤖 Cloudflare Solver: Automatic Cloudflare Turnstile challenge solving in StealthyFetcher and its session classes
  • Browser fingerprint impersonation: Mimic real browsers' TLS fingerprints, version-matching browser headers, HTTP/3 support, and more with the all-new Fetcher class
  • Improved stealth mode: The stealth mode for DynamicFetcher (formerly PlayWrightFetcher) and its session classes is now more robust and reliable

AI Integration & MCP Server

  • Built-in MCP Server: Model Context Protocol server for AI-assisted web scraping
  • 6 Powerful Tools: get, bulk_get, fetch, bulk_fetch, stealthy_fetch, bulk_stealthy_fetch
  • Smart Content Extraction: Convert web pages/elements to Markdown, HTML, or extract a clean version of the text content
  • CSS Selector Support: Use the Scrapling engine to target specific elements with precision before handing the content to the AI
  • Anti-Bot Bypass: Handle Cloudflare Turnstile and other protections
  • Proxy Support: Use proxies for anonymity and geo-targeting
  • Browser Impersonation: Mimic real browsers with TLS fingerprinting, real browser headers matching that version, and more
  • Parallel Processing: Scrape multiple URLs concurrently for efficiency
  • and more...

New Interactive Web Scraping Shell

  • A New Shell: Custom IPython shell with many smart Built-in Shortcuts like get, post, put, delete, fetch, and stealthy_fetch
  • Smart Page Management: New commands page and pages to automatically store the current page and history for all requests done through the shell
  • Curl Integration: Convert browser DevTools curl commands with uncurl and curl2fetcher functions to Fetcher requests
  • and more...

Scrape from the terminal without programming

  • New Extract Commands: Terminal-based scraping without programming
    • scrapling extract get/post/put/delete - Simple HTTP requests
    • scrapling extract fetch - Dynamic content scraping
    • scrapling extract stealthy-fetch - Anti-bot bypass
  • Downloads web pages and saves their content to files.
  • Converts HTML to readable formats like Markdown, keeps it as HTML, or just extracts the text content of the page.
  • Supports custom CSS selectors to extract specific parts of the page.
  • Handles HTTP requests and fetching through browsers.
  • Highly customizable with custom headers, cookies, proxies, and the rest of the options. Almost all the options available through the code are also accessible through the terminal.
  • and more...

🔧 Technical Improvements

Performance Enhancements

  • DynamicFetcher is now ~60% faster - The exact gain depends on your config (especially in stealth mode)
  • StealthyFetcher is now 20–30% faster - Thanks to the new structure and to starting to use our own implementation instead of Camoufox's Python interface
  • 50%+ combined speed gains across core selection methods (find_by_text, find_similar, find_by_regex, relocate, etc.) 🚀
  • ~10% speed increase for the CSS/XPath "first" methods - css_first and xpath_first are now faster than css and xpath
  • 40% faster get_all_text() method for content extraction
  • 20% speed improvement in adaptive element relocation
  • Navigation properties optimization — Properties like next, previous, below_elements, and more are now noticeably faster
  • 5x faster text cleaning operations
  • Memory efficiency improvements with optimized imports and reduced overhead
  • ⚡ Lightning-fast imports: Reduced startup time with optimized module loading
  • Better benchmarks: With all these speed improvements, Scrapling now compares even better against other libraries (1775x faster than BeautifulSoup and 5.1x faster than AutoScraper, check benchmarks)

Architecture, Code Quality, and Quality of Life

  • Persistent Context: All browser-based fetchers now use persistent context by default. (Solves #64 too)
  • All browser-based fetchers are now validated with msgspec, very quickly, before the requests run, so errors are easier to debug.
  • Cookies returned from a fetcher now match the format accepted by that same fetcher, so you can retrieve cookies and pass them back to all fetchers and their session classes.
  • Faster linting and formatting due to migrating to ruff
  • Modern Build System: Migrated from setup.py to pyproject.toml 📦
  • Better GitHub actions and workflows for smoother development and testing
  • 🎨 Enhanced Type Hints: Complete type coverage with modern Python standards for better IDE support and reliability
  • Cleaner Codebase: Removed dead code and optimized core functions 🧹
  • 🚀 Backward Compatibility: Added shortcuts to maintain compatibility with older code

Breaking Changes

Minimum Python Version

  • Python 3.10+ Required: Dropped support for Python 3.9 and below

Class and Method Naming

These renamings are intended to improve clarity and consistency, particularly for new users.

  • Adaptor → Selector: Core parsing class renamed (but it can still be imported as Adaptor for backward compatibility)
  • Adaptors → Selectors: Collection class renamed (but it can still be imported as Adaptors for backward compatibility)
  • auto_match → adaptive: Parameter renamed across all methods
  • adaptor_arguments → selector_config: Configuration parameter renamed
  • automatch_domain → adaptive_domain: Domain parameter renamed
  • additional_arguments → additional_args: Shortened parameter name
  • ⚠️ text/body → content: The Selector constructor parameter now accepts both str and bytes
  • PlayWrightFetcher → DynamicFetcher: Browser automation class renamed (but it can still be imported as PlayWrightFetcher for backward compatibility)
  • DynamicFetcher doesn't have the NSTBrowser logic/arguments anymore since it's pointless to leave this logic now anyway.
  • StealthyFetcher's headless argument can't accept 'virtual' as an argument anymore since we are not using Camoufox's library right now in anything other than getting the browser installation path and the rest of the launch options
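
For code that builds keyword arguments dynamically, the parameter renames above can be applied mechanically. The sketch below is a hypothetical helper (`migrate_kwargs` is not part of Scrapling) that uses only the four one-to-one renames listed in this section. It deliberately leaves out text/body → content, since that merges two parameters into one, and `timeout` is just a made-up pass-through key.

```python
# Old -> new keyword names, taken from the rename list in this section.
RENAMED_KWARGS = {
    "auto_match": "adaptive",
    "adaptor_arguments": "selector_config",
    "automatch_domain": "adaptive_domain",
    "additional_arguments": "additional_args",
}


def migrate_kwargs(kwargs: dict) -> dict:
    """Return a copy of kwargs with pre-0.3 names replaced by their 0.3 names."""
    migrated = {}
    for name, value in kwargs.items():
        new_name = RENAMED_KWARGS.get(name, name)
        if new_name in migrated:
            # Refuse to guess when both spellings were supplied.
            raise ValueError(f"both old and new spelling given for {new_name!r}")
        migrated[new_name] = value
    return migrated


old_call = {"auto_match": True, "automatch_domain": "example.com", "timeout": 30}
print(migrate_kwargs(old_call))
# {'adaptive': True, 'adaptive_domain': 'example.com', 'timeout': 30}
```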

🐛 Bug Fixes

  • Fixed nested children counting in ignored tags for get_all_text (#61)
  • Fixed the issue with installation due to spaces in Python's executable path (#57)
  • Resolved threading issues in storage with recursion handling while the adaptive feature is enabled
  • Fixed argument precedence issues using the Sentinel pattern in FetcherSession
  • Resolved proxy type handling in StealthyFetcher
  • Fixed referer and google_search argument conflicts
  • Fixed async stealth script injection problems
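
The Sentinel pattern mentioned in the FetcherSession fix can be illustrated with a generic sketch. The names here (`UNSET`, `ToySession`, `request_options`) are hypothetical and say nothing about how FetcherSession is actually written. The point is only that a dedicated marker object lets code tell "argument omitted" apart from "argument explicitly set to None".

```python
class _Unset:
    """Marker type meaning "the caller did not pass this argument"."""

    def __repr__(self) -> str:
        return "UNSET"


UNSET = _Unset()


class ToySession:
    """Hypothetical session holding defaults that individual requests may override."""

    def __init__(self, timeout=30, proxy=None):
        self.timeout = timeout
        self.proxy = proxy

    def request_options(self, timeout=UNSET, proxy=UNSET) -> dict:
        # Fall back to the session value only when the argument was truly omitted,
        # so an explicit None (e.g. "no proxy for this request") is respected.
        return {
            "timeout": self.timeout if timeout is UNSET else timeout,
            "proxy": self.proxy if proxy is UNSET else proxy,
        }


session = ToySession(timeout=10, proxy="http://127.0.0.1:8080")
print(session.request_options())            # session defaults apply
print(session.request_options(proxy=None))  # explicit None wins over the session proxy
```

With a plain `None` default, the second call would be indistinguishable from the first, which is the kind of precedence ambiguity a sentinel avoids.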

🙏 Special thanks to our Discord community for all the continuous testing, feedback, and contributions across the last four months


Big shoutout to our biggest Sponsors


v0.2.99

08 Apr 04:42
96e3c3d

This is an essential update for everyone to fully enjoy Scrapling as it's intended.

What's changed

New full documentation website

  • Yup, finally 😄 Check it out from here

Unified import logic for fetchers

  • Now you can import all fetchers with from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, PlayWrightFetcher, then use them directly like page = Fetcher.get(...) without initialization.
    This replaces the old import from scrapling.defaults import Fetcher, AsyncFetcher, StealthyFetcher, PlayWrightFetcher.

Breaking change: automatch is now turned off by default

  • Now there's new logic to enable automatch from fetchers or other parsing options. Check out the documentation page for details.

Old imports and logic are left usable with a warning for backward compatibility.

New options added to fetchers

  • Now, both StealthyFetcher and PlayWrightFetcher have a new argument while fetching called wait. This makes the fetcher wait/sleep for a specific period (milliseconds) before closing the page and returning the response to you.
  • The StealthyFetcher methods fetch and async_fetch now accept an additional_arguments argument, which is passed to Camoufox as additional settings and takes higher priority than Scrapling's settings (#54)

Bugs squashed

  • Fixed a bug in async_fetch in both StealthyFetcher and PlayWrightFetcher classes, with catching redirections.

Thanks for all your support and donations!


Big shoutout to our biggest Sponsor: Scrapeless


v0.2.98

17 Mar 13:26
e60d0cb

This is an essential update for everyone to fully enjoy Scrapling as it's intended.

What's changed

Various memory usage and speed optimizations

  • Memory usage of all selection methods is now ~40% of what it was before, and their speed increased slightly.
  • Implemented lazy loading for all submodules of the library, so now what you use is what you load. For example:
    Before the update, the import from scrapling import Adaptor used 30-40 MB of RAM because it loaded all the fetchers and more along with it; now it uses ~1.2 MB.
  • The last update made the library use ~32% of the memory it used before with a large requests pool; now the caching has been adjusted further to use even less than that.
  • Overall speed increase in the parser by a slight 2-5%
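
The release does not say how the lazy loading is implemented. One common way to get "what you use is what you load" in Python is a module-level `__getattr__` hook (PEP 562). The sketch below simulates that with a toy module and stdlib stand-ins. `toy_pkg`, `_LAZY_ATTRS`, and `loaded` are hypothetical names, not Scrapling's.

```python
import importlib
import types

# Attribute name -> module that really provides it (stdlib stand-ins for heavy submodules).
_LAZY_ATTRS = {"dumps": "json", "sqrt": "math"}
loaded = []  # records which backing modules were resolved through the hook


def _lazy_getattr(name: str):
    """PEP 562 hook: called only when normal attribute lookup on the module fails."""
    if name not in _LAZY_ATTRS:
        raise AttributeError(f"module 'toy_pkg' has no attribute {name!r}")
    module = importlib.import_module(_LAZY_ATTRS[name])
    loaded.append(_LAZY_ATTRS[name])
    value = getattr(module, name)
    setattr(toy_pkg, name, value)  # cache it so the hook is not hit again
    return value


# In a real package this function would be named __getattr__ inside __init__.py.
toy_pkg = types.ModuleType("toy_pkg")
toy_pkg.__getattr__ = _lazy_getattr

print(loaded)                   # [] : nothing resolved yet
print(toy_pkg.dumps({"a": 1}))  # first access triggers the import
print(loaded)                   # ['json']
```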

Thanks for all your support and donations!


Big shoutout to our biggest Sponsor: Scrapeless


v0.2.97

15 Mar 01:29

This is an essential update for everyone to fully enjoy Scrapling as it's intended

What's changed

Lower memory usage and small speed increase across all Fetchers.

  • With new limits on cache size across the library, you will notice significantly lower memory usage than before when doing large numbers of requests/operations.
  • Refactored big parts of the fetchers for easier maintainability and a small speed increase.
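
How the cache limits are enforced is not described here. As a generic illustration of why a size cap keeps memory flat under many operations, the sketch below uses the standard library's `functools.lru_cache`. `compile_selector` is a hypothetical function, and the cap of 2 is chosen only to make eviction visible.

```python
from functools import lru_cache


@lru_cache(maxsize=2)  # keep at most two results; the least recently used one is evicted
def compile_selector(selector: str) -> tuple:
    # Stand-in for an expensive step whose result is worth caching.
    return tuple(selector.split())


for s in ["div a", "ul li", "div a", "p span", "ul li"]:
    compile_selector(s)

info = compile_selector.cache_info()
print(info.hits, info.misses, info.currsize)  # 1 4 2
```

The cache never holds more than two entries no matter how many distinct selectors pass through it, at the cost of recomputing evicted ones.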

Bugs fixed

  • Fixed a bug in TextHandler where importing it alone and passing a non-string value converted the value to an empty string. Now anything passed to TextHandler is automatically converted to a string before being converted to TextHandler. This is forced on any value passed, since TextHandler, as the name implies, is intended to work with strings only after all! (#45)
  • Fixed a bug where the retries arguments weren't taken into account in most AsyncFetcher methods.
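
The TextHandler coercion fix above can be pictured with a minimal `str` subclass. This is a sketch of the behaviour, not Scrapling's TextHandler: `ChainableText` is a hypothetical class that converts any input to a string first and also keeps its own type when sliced or split.

```python
class ChainableText(str):
    """Hypothetical str subclass; a sketch of the behaviour, not Scrapling's TextHandler."""

    def __new__(cls, value=""):
        # Coerce any value to str first, so non-strings are never silently dropped.
        return super().__new__(cls, str(value))

    def __getitem__(self, key):
        # Slicing and indexing keep the subclass instead of falling back to plain str.
        return type(self)(super().__getitem__(key))

    def split(self, sep=None, maxsplit=-1):
        # Each piece is wrapped again so method chaining keeps working.
        return [type(self)(part) for part in super().split(sep, maxsplit)]


price = ChainableText(1999)
print(repr(price), type(price[:2]).__name__)  # '1999' ChainableText
```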

Miscellaneous

  • Updated type hints for most arguments in all fetchers to be clearer and more accurate.

Thanks for all your support and donations!


Big shoutout to our biggest Sponsor: Scrapeless


v0.2.96

05 Mar 01:45
3ca4ea1

This is an essential update for everyone to fully enjoy Scrapling as it's intended

What's changed

  1. Added the -f option to scrapling install to force reinstalling browser dependencies. I recommend you run scrapling install -f now to enjoy the big speed boost StealthyFetcher just got with the new Camoufox browser version :)
  2. Fixed a bug in TextHandler where slicing returned TextHandlers instead of TextHandler, and fixed the type hint there (#41)
  3. Fixed an issue where scrapling install might in some instances drop the user into a Python shell!

Thanks for all your support!


Big shoutout to our biggest Sponsor: Scrapeless


v0.2.95

25 Feb 22:28
573bfe0

This is an essential update for everyone to fully enjoy Scrapling as it's intended

What's changed

  1. Fixed a bug in Fetcher that made headers generated by the stealthy_headers argument overwrite some of the headers provided by the user, like Accept (#39)
  2. Improved the headers generation logic a bit so it should give a slight speed boost.

Thanks for all your support!


Shoutout to our biggest Sponsor: Scrapeless


v0.2.94

22 Feb 16:56
5251abb

This is an essential update for everyone to fully enjoy Scrapling as it's intended

What's changed

  1. Added the history property to all fetchers to show redirections (#32)
  2. Fixed the logic of the case_sensitive argument for all re/re_first methods. This may make your code return different results if you were using it (but you probably deserve it because you noticed it wasn't working as intended and didn't open an issue LOL)
  3. Updated dependencies and re-enabled coop in the Camoufox engine (StealthyFetcher).

Thanks for all your support!


Shoutout to our biggest Sponsor: Scrapeless


v0.2.93

31 Jan 01:58

This is an essential update for everyone to fully enjoy Scrapling as it's intended

What's changed

  1. The return type is now consistent across the whole parser engine, so you will always get one of Adaptor, Adaptors, TextHandler, TextHandlers, or None, or a list when you have mixed results such as from a combined CSS selector. This allows a better coding experience with minimal manual type checking, makes the library more stable, and makes chaining methods always possible.
  2. Most of the parser engine, especially the Adaptor class, was refactored into a cleaner and, most importantly, faster version. Almost all methods/properties, especially the searching methods, are now 5-40% faster. Some methods got bigger boosts: find_by_regex is ~60% faster! The automatch feature got a small ~5% speed boost.
  3. Fixed logic bugs in the find_all/find methods that made the passed filters apply sometimes as OR and other times as AND. Now every returned element must fulfill all the filters you pass.
  4. Now all regex-related methods return TextHandler/TextHandlers for easier methods chaining.
  5. Added a new below_elements property that returns an Adaptors object of all elements under the current element in the DOM tree.
  6. All methods/properties that returned the HTML source as a string now return it as TextHandler, so you can easily run regex on it, etc.
  7. StealthyFetcher is now a bit faster and more stealthy. Also, now it's possible to click Captchas in iframes like Cloudflare Turnstile.
  8. The auto-completion and type hints improved a lot in nearly half the library, especially for Adaptor, TextHandler, and TextHandlers.
  9. Slicing a TextHandler, accessing it by index, or using the split method now returns another TextHandler instead of a standard Python string. Almost all standard string operations/methods now return another TextHandler instead of a standard string, to make chaining methods/functions always possible.
  10. Fixed some small bugs and typos. For example, the Fetcher async_put was doing post request instead of put request 😶‍🌫️
  11. Improved the README a bit till I finish the documentation website.
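
The AND semantics described in item 3 can be sketched without the library. In this toy version, elements are plain dictionaries and filters are plain predicates. The `find_all`/`find` functions here are stand-ins written for illustration, not Scrapling's methods or signatures.

```python
from typing import Callable, Iterable

Element = dict  # toy element: {"tag": ..., "class": ..., "text": ...}
Filter = Callable[[Element], bool]


def find_all(elements: Iterable[Element], *filters: Filter) -> list[Element]:
    """Keep only the elements for which every filter returns True (AND semantics)."""
    return [el for el in elements if all(f(el) for f in filters)]


def find(elements: Iterable[Element], *filters: Filter):
    """Return the first element matching all filters, or None."""
    return next(iter(find_all(elements, *filters)), None)


page = [
    {"tag": "div", "class": "product", "text": "Laptop"},
    {"tag": "div", "class": "ad", "text": "Laptop deals"},
    {"tag": "span", "class": "product", "text": "Phone"},
]
is_div = lambda el: el["tag"] == "div"
is_product = lambda el: el["class"] == "product"

print(find_all(page, is_div, is_product))
# [{'tag': 'div', 'class': 'product', 'text': 'Laptop'}]
```

Under OR semantics the same call would have returned all three elements, which is why the fix can change the results of existing code.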

This was supposed to be a small update till version 0.3 but thought to make it better.

Thanks for all your support!


Shoutout to our biggest Sponsor: Scrapeless


v0.2.92

26 Dec 18:05
32d9660

What's changed

  • The response returned by browser-based fetchers now uses more reliable data sources in cases where the loaded page uses many iframes.
  • Installing Scrapling is now even easier: install it with pip, then run scrapling install in the terminal, and you are ready!
  • Fixed an inaccurate type hint in the parser.

Note

A friendly reminder that maintaining and improving Scrapling takes a lot of time and effort which I have been happily doing for months even though it's becoming harder. So, if you like Scrapling and want it to keep improving, you can help by supporting me through the Sponsor button.

v0.2.91

19 Dec 11:52
ee59914

What's changed

  • Fixed a bug where the fetch logging sentence was shown for the first request only.
  • By default, the Playwright API returns the first response that fulfills the load state given to the goto method ("load", "domcontentloaded", or "networkidle"). So if a website has a wait page, like Cloudflare's, that redirects you to the real website afterward, Playwright returns the first status code, which in this case would be something like 403. This update solves the issue for both PlayWrightFetcher and StealthyFetcher: both use the Playwright API, and the result no longer depends on Playwright's default behavior.
  • Added support for SOCKS proxies in the Fetcher class.
  • Fixed the type hint for the wait_selector_state argument so now it will show the accurate values you should use while auto-completing.

Note

A friendly reminder that maintaining and improving Scrapling takes a lot of time and effort which I have been happily doing for months even though it's becoming harder. So, if you like Scrapling and want it to keep improving, you can help by supporting me through the Sponsor button.