feat: lessons about crawling and scraping product detail pages in Python #1244
base: master
Conversation
Force-pushed from 625726c to 3199df8.
Force-pushed from fc1ebd6 to 4339dcd.
```diff
@@ -142,10 +142,10 @@ Letting our program visibly crash on error is enough for our purposes. Now, let'

 ### Scrape Amazon

-Download HTML of a product listing page, but this time from a real world e-commerce website. For example this page with Amazon search results:
+Download HTML of a product listing page, but this time from a real world e-commerce website. For example this page with AliExpress search results:
```
I decided to do this because this is the very first exercise, and I don't want the reader to end up with an error. I don't want to direct them to a different course when they're touching the internet for the first time with their tiny program. So I'm hoping for better results with AliExpress, which is a real-world e-commerce website as well, but doesn't seem to care so much about scraping.
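For context, a minimal sketch of the kind of first-exercise code in question, assuming a plain HTTP GET followed by an error check; the exact URL and HTTP client used in the lesson are assumptions here:

```python
import httpx  # assumption: any HTTP client would do; requests works the same way

# Hypothetical AliExpress search results URL, for illustration only
url = "https://www.aliexpress.com/w/wholesale-darth-vader.html"

response = httpx.get(url)
response.raise_for_status()  # a blocked request (as Amazon tends to return) raises here
print(response.text)
```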
```diff
@@ -122,6 +122,14 @@ for product in soup.select(".product-item"):

 This program does the same as the one we already had, but its code is more concise.

+:::note Fragile code
```
Explicitly addressing some previous concerns of @vdusek. I thought about it and I think it's better if we're upfront about this decision and the reader knows that this is expected.
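To illustrate what the note refers to, here's a minimal sketch of the kind of fragility involved; the markup and the `.product-item__title` class are assumptions loosely modeled on the example store:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real listing page
html = """
<div class="product-item">
  <a class="product-item__title" href="/products/tv">Sony XBR-950G</a>
  <span class="price">$1,398.00</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
for product in soup.select(".product-item"):
    # These selectors copy the site's current class names. If the site ever
    # renames or restructures them, the scraper breaks -- the "fragile" part
    # the admonition is upfront about.
    title = product.select_one(".product-item__title").text.strip()
    price = product.select_one(".price").text.strip()
    print(title, price)
```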
```diff
@@ -199,8 +199,12 @@ def export_json(file, data):
     json.dump(data, file, default=serialize, indent=2)

 listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
-soup = download(listing_url)
-data = [parse_product(product) for product in soup.select(".product-item")]
+listing_soup = download(listing_url)
```
I made this change because this way the code is better suited for the newly added lessons.
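A short sketch of how the renamed variable reads in context once product detail pages come into play; `item["url"]` and `product_soup` are hypothetical names, not necessarily what the new lessons use:

```python
listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
listing_soup = download(listing_url)  # soup of the listing page

data = []
for product in listing_soup.select(".product-item"):
    item = parse_product(product)
    # Each product detail page now gets a soup of its own, so a bare "soup"
    # would be ambiguous -- hence the more specific names.
    product_soup = download(item["url"])  # hypothetical follow-up request
    data.append(item)
```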
Thanks Honza, great job as always.
Resolved (outdated) review threads on:
- sources/academy/webscraping/scraping_basics_python/04_downloading_html.md
- sources/academy/webscraping/scraping_basics_python/11_scraping_variants.md
- sources/academy/webscraping/scraping_basics_python/11_scraping_variants.md
Force-pushed from 6187b18 to b82b673.
I addressed all comments from @mnmkng. I also added a GIF, as I thought it would make one part clearer. I'm not a seasoned GIF creator, so if someone can make a more beautiful one, go for it. Attaching the source recording as a file (Screen.Recording.2024-10-18.at.9.42.13.mov), the resulting GIF, and how I made it.
Unfortunately, it's not possible to show how the price changes dynamically, as the JS touches the parent elements, and in both Chrome and Firefox DevTools this causes the elements to collapse. At least it's visible that something changes there, and how the text of the stock availability changes.
@vdusek can you please check the Python so that we can merge? Thanks!
The code looks good and works well. I just have one suggestion regarding the images: currently they are quite large (~1.2 MB each). Could we apply some compression? I would say JPEG would be enough, to avoid unnecessary bloating of the Git history.
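One possible way to do the compression (not necessarily what ended up being used): a sketch with Pillow, assuming the screenshots sit in a hypothetical images/ directory:

```python
from pathlib import Path

from PIL import Image  # assumption: Pillow is installed

# Hypothetical directory holding the lesson screenshots
for png in Path("images").glob("*.png"):
    with Image.open(png) as image:
        # Convert to JPEG with moderate quality; transparency is flattened by
        # the RGB conversion. Re-saving the PNG with optimize=True would be
        # the lossless alternative.
        image.convert("RGB").save(png.with_suffix(".jpg"), quality=80, optimize=True)
```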
Co-authored-by: Ondra Urban <[email protected]>
Force-pushed from 5874efe to 30a75bd.
I applied compression to the PNG files added in this PR, results being:
I rebased the branch so that it contains only these new files; I hope that's gonna be sufficient. @vdusek @mnmkng, if you see no further problems with the PR, bless this with green and I think we can merge.
Done
This PR introduces two new lessons to the Python course, including real-world exercises. These two lessons conclude the core of the course: by the end, the reader should be able to build their own scraper. The final exercises focus on exactly that and test the reader's ability to build independently. The PR also includes some edits to the previous parts of the course (code examples, exercises).
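For reviewers who want the gist without opening the diff, this is roughly the pattern the new lessons build toward; the selectors and field names below are assumptions based on the example store, not the lessons' exact code:

```python
from urllib.parse import urljoin

import httpx
from bs4 import BeautifulSoup


def download(url):
    # Fetch a page and parse it into a soup; a failed request raises,
    # in line with the course's "let it crash visibly" approach.
    response = httpx.get(url)
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")


listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
listing_soup = download(listing_url)

data = []
for product in listing_soup.select(".product-item"):
    # Parse the listing card, then crawl the product's own detail page.
    title_link = product.select_one(".product-item__title")
    item = {
        "title": title_link.text.strip(),
        "url": urljoin(listing_url, title_link["href"]),
    }
    product_soup = download(item["url"])
    # The detail page can then be scraped for extra data, e.g. variants;
    # the selector below is a placeholder assumption.
    item["variant_count"] = len(product_soup.select(".product-form__option"))
    data.append(item)

print(data)
```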
Next
Before the course is done, there should be two more lessons: one about building the very same scraper using a framework (Crawlee), and one about deploying the scraper to a platform (Apify). Then I should return to the beginning and complete the three initial lessons about DevTools.
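As a teaser for the planned Crawlee lesson, a rough sketch of what the same listing scrape might look like there; the import path and API reflect my understanding of Crawlee for Python at the time of writing and should be treated as assumptions:

```python
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def handle_listing(context: BeautifulSoupCrawlingContext) -> None:
        # Same extraction idea as the plain-Python scraper, but Crawlee
        # handles the request queue, retries, and data storage.
        for product in context.soup.select(".product-item"):
            title = product.select_one(".product-item__title").text.strip()
            await context.push_data({"title": title})

    await crawler.run(["https://warehouse-theme-metal.myshopify.com/collections/sales"])


if __name__ == "__main__":
    asyncio.run(main())
```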