
Commit e40a1e1

Merge pull request #33 from D4Vinci/dev
v0.2.93
2 parents 198b76f + 7c35341

12 files changed: +299 additions, −275 deletions


README.md

Lines changed: 27 additions & 26 deletions
@@ -92,27 +92,27 @@ Scrapling is a high-performance, intelligent web scraping library for Python tha
 ## Key Features
 
 ### Fetch websites as you prefer with async support
-- **HTTP requests**: Stealthy and fast HTTP requests with `Fetcher`
-- **Stealthy fetcher**: Annoying anti-bot protection? No problem! Scrapling can bypass almost all of them with `StealthyFetcher` and its default configuration!
-- **Your preferred browser**: Use your real browser with CDP, [NSTbrowser](https://app.nstbrowser.io/r/1vO5e5)'s browserless, PlayWright with stealth mode, or even vanilla PlayWright - all are possible with `PlayWrightFetcher`!
+- **HTTP Requests**: Fast and stealthy HTTP requests with the `Fetcher` class.
+- **Dynamic Loading & Automation**: Fetch dynamic websites with the `PlayWrightFetcher` class through your real browser, Scrapling's stealth mode, Playwright's Chrome browser, or [NSTbrowser](https://app.nstbrowser.io/r/1vO5e5)'s browserless!
+- **Anti-bot Protections Bypass**: Easily bypass protections with the `StealthyFetcher` and `PlayWrightFetcher` classes.
 
 ### Adaptive Scraping
-- 🔄 **Smart Element Tracking**: Locate previously identified elements after website structure changes, using an intelligent similarity system and integrated storage.
-- 🎯 **Flexible Querying**: Use CSS selectors, XPath, element filters, text search, or regex - chain them however you want!
-- 🔍 **Find Similar Elements**: Automatically locate elements similar to the element you want on the page (e.g., other products like the product you found on the page).
+- 🔄 **Smart Element Tracking**: Relocate elements after website changes, using an intelligent similarity system and integrated storage.
+- 🎯 **Flexible Selection**: CSS selectors, XPath selectors, filter-based search, text search, regex search, and more.
+- 🔍 **Find Similar Elements**: Automatically locate elements similar to the element you found!
 - 🧠 **Smart Content Scraping**: Extract data from multiple websites without specific selectors, using Scrapling's powerful features.
 
-### Performance
-- 🚀 **Lightning Fast**: Built from the ground up with performance in mind, outperforming most popular Python scraping libraries (outperforming BeautifulSoup in parsing by up to 620x in our tests).
+### High Performance
+- 🚀 **Lightning Fast**: Built from the ground up with performance in mind, outperforming most popular Python scraping libraries.
 - 🔋 **Memory Efficient**: Optimized data structures for a minimal memory footprint.
-- ⚡ **Fast JSON serialization**: 10x faster JSON serialization than the standard json library, with more options.
+- ⚡ **Fast JSON serialization**: 10x faster than the standard library.
 
-### Developing Experience
-- 🛠️ **Powerful Navigation API**: Traverse the DOM tree easily in all directions and get the info you want (parent, ancestors, siblings, children, next/previous element, and more).
-- 🧬 **Rich Text Processing**: All strings have built-in methods for regex matching, cleaning, and more. All elements' attributes are read-only dictionaries that are faster than standard dictionaries, with added methods.
-- 📝 **Automatic Selector Generation**: Create robust CSS/XPath selectors for any element.
-- 🔌 **API Similar to Scrapy/BeautifulSoup**: Familiar methods and similar pseudo-elements for Scrapy and BeautifulSoup users.
-- 📘 **Type hints and test coverage**: Complete type coverage and almost full test coverage, for better IDE support and fewer bugs respectively.
+### Developer Friendly
+- 🛠️ **Powerful Navigation API**: Easy DOM traversal in all directions.
+- 🧬 **Rich Text Processing**: All strings have built-in regex, cleaning methods, and more. All elements' attributes are optimized dictionaries that take less memory than standard dictionaries, with added methods.
+- 📝 **Auto Selector Generation**: Generate robust short and full CSS/XPath selectors for any element.
+- 🔌 **Familiar API**: Similar to Scrapy/BeautifulSoup, with the same pseudo-elements used in Scrapy.
+- 📘 **Type hints**: Complete type/docstring coverage for future-proofing and the best autocompletion support.
 
 ## Getting Started
 
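As a quick illustration of the "Smart Element Tracking" feature described in the hunk above, here is a minimal sketch. The URL and selector are hypothetical; `auto_save`/`auto_match` are the names Scrapling's docs use for this feature, but treat the exact flow as an assumption rather than this PR's code:

```python
from scrapling import Fetcher

fetcher = Fetcher(auto_match=True)
page = fetcher.get('https://example.com')  # hypothetical target

# First run: save the element's unique properties to the integrated storage
products = page.css('#best-products', auto_save=True)

# A later run, after the site's structure changed: relocate the same
# element by similarity instead of failing on the stale selector
products = page.css('#best-products', auto_match=True)
```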
@@ -121,32 +121,33 @@ from scrapling import Fetcher
 
 fetcher = Fetcher(auto_match=False)
 
-# Fetch a web page and create an Adaptor instance
+# Do an HTTP GET request to a web page and create an Adaptor instance
 page = fetcher.get('https://quotes.toscrape.com/', stealthy_headers=True)
-# Get all strings in the full page
+# Get all text content from all HTML tags in the page, except `script` and `style` tags
 page.get_all_text(ignore_tags=('script', 'style'))
 
-# Get all quotes, any of these methods will return a list of strings (TextHandlers)
+# Get all quote elements; any of these methods will return a list of strings directly (TextHandlers)
 quotes = page.css('.quote .text::text') # CSS selector
 quotes = page.xpath('//span[@class="text"]/text()') # XPath
 quotes = page.css('.quote').css('.text::text') # Chained selectors
 quotes = [element.text for element in page.css('.quote .text')] # Slower than bulk query above
 
 # Get the first quote element
-quote = page.css_first('.quote') # / page.css('.quote').first / page.css('.quote')[0]
+quote = page.css_first('.quote') # same as page.css('.quote').first or page.css('.quote')[0]
 
 # Tired of selectors? Use find_all/find
+# Get all 'div' HTML tags where one of their 'class' values is 'quote'
 quotes = page.find_all('div', {'class': 'quote'})
 # Same as
 quotes = page.find_all('div', class_='quote')
 quotes = page.find_all(['div'], class_='quote')
 quotes = page.find_all(class_='quote') # and so on...
 
 # Working with elements
-quote.html_content # Inner HTML
-quote.prettify() # Prettified version of Inner HTML
-quote.attrib # Element attributes
-quote.path # DOM path to element (List)
+quote.html_content # Get the inner HTML of this element
+quote.prettify() # Prettified version of the inner HTML above
+quote.attrib # Get that element's attributes
+quote.path # DOM path to the element (list of all ancestors from the <html> tag to the element itself)
 ```
 To keep it simple, all methods can be chained on top of each other!
 
@@ -262,7 +263,7 @@ True
 | humanize | Humanize the cursor movement. Takes either True or the MAX duration in seconds of the cursor movement. The cursor typically takes up to 1.5 seconds to move across the window. | ✔️ |
 | allow_webgl | Enabled by default. Disabling WebGL is not recommended, as many WAFs now check whether WebGL is enabled. | ✔️ |
 | geoip | Recommended to use with proxies; automatically uses the IP's longitude, latitude, timezone, country, and locale, and spoofs the WebRTC IP address. It will also calculate and spoof the browser's language based on the distribution of language speakers in the target region. | ✔️ |
-| disable_ads | Enabled by default, this installs the `uBlock Origin` addon on the browser if enabled. | ✔️ |
+| disable_ads | Disabled by default, this installs the `uBlock Origin` addon on the browser if enabled. | ✔️ |
 | network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
 | timeout | The timeout in milliseconds that is used in all operations and waits through the page. The default is 30000. | ✔️ |
 | wait_selector | Wait for a specific CSS selector to be in a specific state. | ✔️ |
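To make the options in the table above concrete, a hedged sketch of passing them to `StealthyFetcher`. Only the parameter names come from the table; the target URL, the chosen values, and the exact call shape are illustrative assumptions:

```python
from scrapling import StealthyFetcher

# Values below are illustrative; parameter names are from the table above
page = StealthyFetcher().fetch(
    'https://example.com',     # hypothetical target
    humanize=1.5,              # cap cursor movement at 1.5 seconds
    allow_webgl=True,          # keep WebGL enabled; many WAFs check for it
    geoip=True,                # match locale/timezone/WebRTC to the IP used
    disable_ads=True,          # install the uBlock Origin addon
    network_idle=True,         # wait for 500 ms with no network connections
    timeout=30000,             # milliseconds
    wait_selector='.content',  # hypothetical selector to wait for
)
print(page.status)
```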
@@ -544,7 +545,7 @@ Inspired by BeautifulSoup's `find_all` function you can find elements by using `
 * Any string passed is considered a tag name.
 * Any iterable passed, like a List/Tuple/Set, is considered an iterable of tag names.
 * Any dictionary is considered a mapping of HTML element(s) attribute names and attribute values.
-* Any regex patterns passed are used as filters
+* Any regex patterns passed are used as filters matching elements by their text content
 * Any functions passed are used as filters
 * Any keyword argument passed is considered as an HTML element attribute with its value.
 
@@ -553,7 +554,7 @@ So the way it works is after collecting all passed arguments and keywords, each
 
 1. All elements with the passed tag name(s).
 2. All elements that match all passed attribute(s).
-3. All elements that match all passed regex patterns.
+3. All elements whose text content matches all passed regex patterns.
 4. All elements that fulfill all passed function(s).
 
 Note: The filtering process always starts from the first filter it finds in the filtering order above, so if no tag name(s) are passed but attributes are, the process starts from that layer, and so on. **But the order in which you pass the arguments doesn't matter.**
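Putting the argument kinds and the filtering order above together, a short sketch mixing several filter types in one `find_all` call. The page is the one from the Getting Started diff; the filter values are hypothetical, and the assumption that function filters receive the element is mine:

```python
import re
from scrapling import Fetcher

page = Fetcher(auto_match=False).get('https://quotes.toscrape.com/')

# One call mixing the argument kinds listed above; filtering begins
# at the tag-name layer regardless of the order the arguments appear in
quotes = page.find_all(
    'div',                                    # tag name (layer 1)
    {'class': 'quote'},                       # attribute mapping (layer 2)
    re.compile(r'life'),                      # regex against text content (layer 3)
    lambda element: len(element.text) > 20,   # function filter (layer 4)
)
```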

scrapling/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@
 from scrapling.parser import Adaptor, Adaptors
 
 __author__ = "Karim Shoair ([email protected])"
-__version__ = "0.2.92"
+__version__ = "0.2.93"
 __copyright__ = "Copyright (c) 2024 Karim Shoair"
 
scrapling/core/_types.py

Lines changed: 2 additions & 1 deletion
@@ -3,7 +3,8 @@
 """
 
 from typing import (TYPE_CHECKING, Any, Callable, Dict, Generator, Iterable,
-                    List, Literal, Optional, Pattern, Tuple, Type, Union)
+                    List, Literal, Optional, Pattern, Tuple, Type, TypeVar,
+                    Union)
 
 SelectorWaitStates = Literal["attached", "detached", "hidden", "visible"]
 
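The new `TypeVar` import suggests generic type annotations elsewhere in the codebase, in line with the "Type hints" feature above. As a hedged illustration only (the names below are hypothetical, not from this diff), a TypeVar lets a helper preserve its element type for the caller:

```python
from typing import List, Optional, TypeVar

T = TypeVar('T')  # hypothetical type variable, mirroring the new import

def first_or_none(items: List[T]) -> Optional[T]:
    # The return type tracks the input's element type for IDE autocompletion
    return items[0] if items else None
```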