Job postings + ProductList extraction #103
Status: Merged

Commits (25):

- 40fd2aa Job posting spider. (wRAR)
- 2acf2a8 Remove custom priorities from the job posting spider. (wRAR)
- d1cbd50 Rename JobPostingCrawlStrategy.category. (wRAR)
- 75e7bb0 Fix typing issues for typed Scrapy. (wRAR)
- cb76df4 Merge remote-tracking branch 'origin/main' into job-posting (wRAR)
- 60a7ac0 Merge remote-tracking branch 'origin/main' into job-posting (wRAR)
- ef3d776 Fixes. (wRAR)
- 9e0e658 Typing fixes. (wRAR)
- 1a1fe87 Basic ProductList extraction. (wRAR)
- a6ee446 Skip product links with dont_follow_product_links. (wRAR)
- 4f6768d Merge branch 'product-list' into articles_to_main (wRAR)
- d5ccf01 Merge branch 'job-posting' into articles_to_main (wRAR)
- 00f527f Merge remote-tracking branch 'origin/articles_to_main' into job-posti… (wRAR)
- 3c1c44b Add the spider to the API reference. (wRAR)
- 178bed4 E-commerce spider: Add an extract parameter (#94) (Gallaecio)
- aa164f0 Extend the Search Queries description (Gallaecio)
- 232e7ff Enable CI for all PRs (Gallaecio)
- e9725d6 Merge pull request #97 from Gallaecio/search-queries-tooltip (kmike)
- 6dcacac Merge remote-tracking branch 'origin/main' into job-posting-product-list (wRAR)
- ea3ee4b Do not let request-building exceptions break crawling (#96) (Gallaecio)
- 78c495c Merge remote-tracking branch 'origin/articles_to_main' into job-posti… (wRAR)
- 3367b0d Mark job posting template as experimental (kmike)
- d4c4855 Merge pull request #102 from zytedata/mark-job-postings-as-experimental (kmike)
- f35a621 Merge branch 'main' into job-posting-product-list (kmike)
- 8e040fa Merge pull request #92 from zytedata/job-posting-product-list (kmike)
CI workflow: the `pull_request` trigger is no longer restricted to `main`, enabling CI for all PRs.

```diff
@@ -7,7 +7,6 @@ on:
   push:
     branches: [ main ]
   pull_request:
-    branches: [ main ]
 
 jobs:
   test:
```
New documentation page for the job posting spider (reStructuredText, 19 lines added):

```rst
.. _job-posting:

=============================================
Job posting spider template (``job_posting``)
=============================================

Basic use
=========

.. code-block:: shell

    scrapy crawl job_posting -a url="https://books.toscrape.com"

Parameters
==========

.. autopydantic_model:: zyte_spider_templates.spiders.job_posting.JobPostingSpiderParams
   :inherited-members: BaseModel
   :exclude-members: model_computed_fields
```
Test helper: `get_crawler` gains a `spider_cls` keyword so tests can build a crawler for any spider class:

```diff
@@ -1,19 +1,22 @@
-from typing import Any, Dict, Optional
+from typing import Any, Dict, Optional, Type
 
 import pytest
+from scrapy import Spider
 from scrapy.utils.test import TestSpider
 
 # https://docs.pytest.org/en/stable/how-to/writing_plugins.html#assertion-rewriting
 pytest.register_assert_rewrite("tests.utils")
 
 
 # scrapy.utils.test.get_crawler alternative that does not freeze settings.
-def get_crawler(*, settings: Optional[Dict[str, Any]] = None):
+def get_crawler(
+    *, settings: Optional[Dict[str, Any]] = None, spider_cls: Type[Spider] = TestSpider
+):
     from scrapy.crawler import CrawlerRunner
 
     settings = settings or {}
     # Set by default settings that prevent deprecation warnings.
     settings["REQUEST_FINGERPRINTER_IMPLEMENTATION"] = "2.7"
     runner = CrawlerRunner(settings)
-    crawler = runner.create_crawler(TestSpider)
+    crawler = runner.create_crawler(spider_cls)
     return crawler
```
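The idea behind the new `spider_cls` keyword can be sketched without Scrapy. In the sketch below, `BaseSpider`, `CustomSpider`, and `make_crawler` are hypothetical stand-ins for illustration; they are not part of the diff:

```python
from typing import Any, Dict, Optional, Type


class BaseSpider:
    """Hypothetical default spider class (stands in for TestSpider)."""

    name = "test"


class CustomSpider(BaseSpider):
    """Hypothetical override a test might pass in."""

    name = "custom"


def make_crawler(
    *,
    settings: Optional[Dict[str, Any]] = None,
    spider_cls: Type[BaseSpider] = BaseSpider,
):
    # Copy the caller's settings (do not freeze or mutate them) and apply
    # a default, mirroring the shape of the helper in the diff above.
    settings = dict(settings or {})
    settings.setdefault("REQUEST_FINGERPRINTER_IMPLEMENTATION", "2.7")
    return spider_cls(), settings
```

Existing tests keep the old call shape (`make_crawler()` with no arguments), while new tests can pass `spider_cls=CustomSpider` to exercise a specific spider class.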
New pytest fixture module (9 lines added) exposing a session-scoped `mockserver` fixture:

```python
import pytest


@pytest.fixture(scope="session")
def mockserver():
    from .mockserver import MockServer

    with MockServer() as server:
        yield server
```
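The fixture follows the standard setup/yield/teardown shape: `MockServer` starts when the `with` block is entered and is torn down when the test session ends, even on failure. A minimal stand-in of that lifecycle pattern, with no pytest or real server (all names and the URL below are hypothetical):

```python
from contextlib import contextmanager

events = []


@contextmanager
def mock_server_lifecycle():
    # Setup runs before the yield, teardown runs after it in a finally
    # block, analogous to MockServer.__enter__/__exit__ in the fixture.
    events.append("start")
    try:
        yield "http://127.0.0.1:8000"  # hypothetical root_url
    finally:
        events.append("stop")
```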
New file `tests/mockserver.py` (178 lines added); indentation reconstructed from the flattened diff:

````python
import argparse
import json
import socket
import sys
import time
from importlib import import_module
from subprocess import PIPE, Popen
from typing import Any, Dict

from scrapy_zyte_api.responses import _API_RESPONSE
from twisted.internet import reactor
from twisted.web.resource import Resource
from twisted.web.server import Site


def get_ephemeral_port():
    s = socket.socket()
    s.bind(("", 0))
    return s.getsockname()[1]


class DefaultResource(Resource):
    """Mock server to fake Zyte API responses.

    To use, include the mockserver fixture in the signature of your test, and
    point the ZYTE_API_URL setting to the mock server. See
    ``tests/test_ecommerce.py::test_crawl_strategies`` for an example.

    This mock server is designed to fake a website with the following pages:

    ```
    https://example.com/
    https://example.com/page/2
    https://example.com/category/1
    https://example.com/category/1/page/2
    https://example.com/non-navigation
    ```

    When browserHtml is requested (for any URL, listed above or not), it is
    a minimal HTML with an anchor tag pointing to
    https://example.com/non-navigation.

    When productNavigation is requested, nextPage and subCategories are filled
    accordingly. productNavigation.items always has 2 product URLs, which are
    the result of appending ``/product/<n>`` to the request URL.
    https://example.com/non-navigation is not reachable through
    productNavigation.

    When product or productList is requested, an item with the current URL is
    always returned.

    All output also includes unsupported links (mailto:…).
    """

    def getChild(self, path, request):
        return self

    def render_POST(self, request):
        request_data = json.loads(request.content.read())
        request.responseHeaders.setRawHeaders(
            b"Content-Type",
            [b"application/json"],
        )
        request.responseHeaders.setRawHeaders(
            b"request-id",
            [b"abcd1234"],
        )

        response_data: _API_RESPONSE = {}

        response_data["url"] = request_data["url"]

        non_navigation_url = "https://example.com/non-navigation"
        html = f"""<html><body><a href="{non_navigation_url}"></a><a href="mailto:[email protected]"></a></body></html>"""
        if request_data.get("browserHtml", False) is True:
            response_data["browserHtml"] = html

        if request_data.get("product", False) is True:
            response_data["product"] = {
                "url": request_data["url"],
            }

        if request_data.get("productList", False) is True:
            response_data["productList"] = {
                "url": request_data["url"],
            }

        if request_data.get("productNavigation", False) is True:
            kwargs: Dict[str, Any] = {}
            if (
                "/page/" not in request_data["url"]
                and "/non-navigation" not in request_data["url"]
            ):
                kwargs["nextPage"] = {
                    "url": f"{request_data['url'].rstrip('/')}/page/2"
                }
                if "/category/" not in request_data["url"]:
                    kwargs["subCategories"] = [
                        {"url": "mailto:[email protected]"},
                        {"url": f"{request_data['url'].rstrip('/')}/category/1"},
                    ]
            else:
                kwargs["nextPage"] = {"url": "mailto:[email protected]"}
            response_data["productNavigation"] = {
                "url": request_data["url"],
                "items": [
                    {"url": "mailto:[email protected]"},
                    {"url": f"{request_data['url'].rstrip('/')}/product/1"},
                    {"url": f"{request_data['url'].rstrip('/')}/product/2"},
                ],
                **kwargs,
            }

        return json.dumps(response_data).encode()


class MockServer:
    def __init__(self, resource=None, port=None):
        resource = resource or DefaultResource
        self.resource = "{}.{}".format(resource.__module__, resource.__name__)
        self.proc = None
        self.host = socket.gethostbyname(socket.gethostname())
        self.port = port or get_ephemeral_port()
        self.root_url = "http://%s:%d" % (self.host, self.port)

    def __enter__(self):
        self.proc = Popen(
            [
                sys.executable,
                "-u",
                "-m",
                "tests.mockserver",
                self.resource,
                "--port",
                str(self.port),
            ],
            stdout=PIPE,
        )
        assert self.proc.stdout is not None
        self.proc.stdout.readline()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        assert self.proc is not None
        self.proc.kill()
        self.proc.wait()
        time.sleep(0.2)

    def urljoin(self, path):
        return self.root_url + path


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("resource")
    parser.add_argument("--port", type=int)
    args = parser.parse_args()
    module_name, name = args.resource.rsplit(".", 1)
    sys.path.append(".")
    resource = getattr(import_module(module_name), name)()
    # Typing issue: https://github.com/twisted/twisted/issues/9909
    http_port = reactor.listenTCP(args.port, Site(resource))  # type: ignore[attr-defined]

    def print_listening():
        host = http_port.getHost()
        print(
            "Mock server {} running at http://{}:{}".format(
                resource, host.host, host.port
            )
        )

    # Typing issue: https://github.com/twisted/twisted/issues/9909
    reactor.callWhenRunning(print_listening)  # type: ignore[attr-defined]
    reactor.run()  # type: ignore[attr-defined]


if __name__ == "__main__":
    main()
````
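As I read the diff, the `productNavigation` branching can be restated as a pure function. The `fake_navigation` helper below is a hypothetical sketch for illustration, reconstructed from the diff, not code from the PR:

```python
def fake_navigation(url: str) -> dict:
    """Sketch of the mock server's productNavigation rules (reconstructed)."""
    base = url.rstrip("/")
    nav = {
        "url": url,
        # items always holds two product URLs plus one unsupported link
        "items": [
            {"url": "mailto:[email protected]"},
            {"url": f"{base}/product/1"},
            {"url": f"{base}/product/2"},
        ],
    }
    if "/page/" not in url and "/non-navigation" not in url:
        # First page of a listing: link to page 2, and (outside a
        # category) also to a subcategory.
        nav["nextPage"] = {"url": f"{base}/page/2"}
        if "/category/" not in url:
            nav["subCategories"] = [
                {"url": "mailto:[email protected]"},
                {"url": f"{base}/category/1"},
            ]
    else:
        # Last page: only an unsupported nextPage link.
        nav["nextPage"] = {"url": "mailto:[email protected]"}
    return nav
```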
Check warning (Code scanning / CodeQL): Binding a socket to all network interfaces. Medium, test.

Copilot Autofix (AI, 3 months ago):
To fix the problem, we need to bind the socket to a specific interface instead of all interfaces. This can be achieved by replacing the empty string (`""`) with a specific IP address. In this case, we will use `127.0.0.1` to bind the socket to the localhost interface, which limits access to the local machine only. The change belongs in the `get_ephemeral_port` function in the `tests/mockserver.py` file: update the `s.bind` call on line 18 to use `127.0.0.1` instead of an empty string.

Reply: Not fixing it here; the intention of the PR was to merge changes to the main branch.
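The suggested fix, binding the ephemeral-port probe to the loopback interface only, might look like the sketch below. This is an illustration of the autofix, not code merged in this PR:

```python
import socket


def get_ephemeral_port(host: str = "127.0.0.1") -> int:
    # Bind to loopback instead of "" (all interfaces), so the probe
    # socket is only reachable from the local machine.
    s = socket.socket()
    try:
        s.bind((host, 0))  # port 0: let the OS pick a free port
        return s.getsockname()[1]
    finally:
        s.close()
```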