feat: lessons about crawling and scraping product detail pages in Python #1244

honzajavorek · 2024-10-08T14:39:24Z

This PR adds two new lessons to the Python course, including real-world exercises. These two lessons conclude the base of the course: by the end, the reader should be able to build their own scraper. The final exercises focus on that and test the reader's ability to build one independently. The PR also includes some edits to earlier parts of the course (code examples, exercises).

@@ -142,10 +142,10 @@ Letting our program visibly crash on error is enough for our purposes. Now, let'

 ### Scrape Amazon

-Download HTML of a product listing page, but this time from a real world e-commerce website. For example this page with Amazon search results:
+Download HTML of a product listing page, but this time from a real world e-commerce website. For example this page with AliExpress search results:


I decided to do this because this is like... the very first exercise, and I don't want the reader to end up with an error. I don't want to direct them to a different course when they're touching the internet for the first time with their tiny program. So I'm hoping for better results with AliExpress, which is a real-world e-commerce website as well, but doesn't seem to give as much of a damn about scraping.

honzajavorek · 2024-10-11T10:36:19Z

sources/academy/webscraping/scraping_basics_python/06_locating_elements.md

@@ -122,6 +122,14 @@ for product in soup.select(".product-item"):

 This program does the same as the one we already had, but its code is more concise.

+:::note Fragile code


Explicitly addressing some previous concerns of @vdusek. I thought about it and I think it's better if we're upfront about this decision and the reader knows that this is expected.

honzajavorek · 2024-10-11T10:37:11Z

sources/academy/webscraping/scraping_basics_python/09_getting_links.md

@@ -199,8 +199,12 @@ def export_json(file, data):
    json.dump(data, file, default=serialize, indent=2)

 listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
-soup = download(listing_url)
-data = [parse_product(product) for product in soup.select(".product-item")]
+listing_soup = download(listing_url)


I made this change because this way the code is better suited for the newly added lessons.
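To illustrate why the more specific name helps, here is a minimal sketch of the crawling pattern, assuming the later lessons download one detail page per product. The `crawl()` function, its callback parameters, the `/products/example-*` paths, and the fake pages are all hypothetical stand-ins, not code from the course; plain dicts play the role of parsed pages so the sketch runs offline.

```python
from urllib.parse import urljoin


def crawl(listing_url, download, extract_links, parse_detail):
    """Visit a listing page, then each product detail page it links to."""
    listing_soup = download(listing_url)  # the page listing many products
    items = []
    for href in extract_links(listing_soup):
        product_url = urljoin(listing_url, href)  # resolve relative links
        product_soup = download(product_url)  # one detail page per product
        items.append({"url": product_url, **parse_detail(product_soup)})
    return items


# Hypothetical stand-ins: dicts instead of downloaded and parsed HTML.
listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
fake_pages = {
    listing_url: {"links": ["/products/example-a", "/products/example-b"]},
    "https://warehouse-theme-metal.myshopify.com/products/example-a": {"vendor": "A"},
    "https://warehouse-theme-metal.myshopify.com/products/example-b": {"vendor": "B"},
}

items = crawl(
    listing_url,
    download=fake_pages.__getitem__,
    extract_links=lambda listing_soup: listing_soup["links"],
    parse_detail=lambda product_soup: {"vendor": product_soup["vendor"]},
)
print(items)
```

With two kinds of pages in play, a bare `soup` would be ambiguous, while `listing_soup` and `product_soup` make it clear which page each line works with.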

mnmkng

Thanks Honza, great job as always.

sources/academy/webscraping/scraping_basics_python/04_downloading_html.md

sources/academy/webscraping/scraping_basics_python/11_scraping_variants.md

honzajavorek · 2024-10-18T08:15:50Z

I addressed all comments from @mnmkng. I also added a gif, as I thought it would make one part clearer. I'm not a seasoned gif creator, so if someone can make a more beautiful one, go for it. I'm attaching the source recording as a file.

Screen.Recording.2024-10-18.at.9.42.13.mov

This is the gif:

I made it like this:

ffmpeg -i Screen\ Recording\ 2024-10-18\ at\ 9.42.13.mov -pix_fmt rgb24 -r 10 -f gif - | gifsicle --optimize=2 --delay=10 > variants-js.gif

Unfortunately, it's not possible to show how the price changes dynamically: the JS touches the parent elements, and in both Chrome and Firefox devtools this causes the elements to collapse. At least it's visible that something changes there, and how the text of the stock availability changes.
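For anyone who wants to regenerate the gif from a script, the shell pipeline above could be reproduced roughly like this. It is only a sketch: the function names and the `fps`/`delay` parameters are made up for illustration, and it assumes both `ffmpeg` and `gifsicle` are installed and on `PATH`. The flags are the ones from the command above; `-r 10` yields 10 frames per second, which matches `--delay=10` because gifsicle measures the delay in hundredths of a second.

```python
import subprocess


def build_gif_commands(recording, fps=10, delay=10):
    """Return the ffmpeg and gifsicle argument lists of the pipeline."""
    ffmpeg_cmd = [
        "ffmpeg", "-i", recording,
        "-pix_fmt", "rgb24",  # pixel format used in the original command
        "-r", str(fps),       # output frame rate
        "-f", "gif", "-",     # write a GIF to stdout
    ]
    gifsicle_cmd = ["gifsicle", "--optimize=2", f"--delay={delay}"]
    return ffmpeg_cmd, gifsicle_cmd


def convert(recording, output):
    """Pipe ffmpeg into gifsicle and write the optimized GIF to output."""
    ffmpeg_cmd, gifsicle_cmd = build_gif_commands(recording)
    with open(output, "wb") as gif_file:
        ffmpeg = subprocess.Popen(ffmpeg_cmd, stdout=subprocess.PIPE)
        gifsicle = subprocess.Popen(
            gifsicle_cmd, stdin=ffmpeg.stdout, stdout=gif_file
        )
        ffmpeg.stdout.close()  # let ffmpeg notice if gifsicle exits early
        gifsicle.wait()
        ffmpeg.wait()
    if ffmpeg.returncode or gifsicle.returncode:
        raise RuntimeError("GIF conversion failed")


# Usage (not executed here, as it needs the recording and both tools):
# convert("Screen Recording 2024-10-18 at 9.42.13.mov", "variants-js.gif")
```

Passing the arguments as a list avoids having to escape the spaces in the recording's file name.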

mnmkng · 2024-10-22T11:46:23Z

@vdusek can you please check the Python so that we can merge? Thanks!

vdusek

The code looks good and works well. I just have one suggestion regarding the images...

Currently, the images are quite large (~1.2 MB each). Could we apply some compression to avoid unnecessarily bloating the Git history? I would say using JPEG would be enough.

sources/academy/webscraping/scraping_basics_python/10_crawling.md

Co-authored-by: Ondra Urban <[email protected]>

honzajavorek · 2024-11-25T09:42:53Z

I applied compression to the PNG files added in this PR, results being:

+243 KB sources/academy/webscraping/scraping_basics_python/images/pdp.png
+494 KB sources/academy/webscraping/scraping_basics_python/images/variants.png

I rebased the branch so that it contains only these new files. I hope that's gonna be sufficient. @vdusek @mnmkng if you see no further problems with the PR, bless this with green and I think we can merge.
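To catch oversized images before they land in the Git history, a check along these lines could be run over the images directory. This is a rough sketch and not part of the PR: the `find_large_images()` helper, the 500 KB limit, and the suffix list are arbitrary assumptions chosen for illustration.

```python
from pathlib import Path


def find_large_images(directory, limit_kb=500,
                      suffixes=(".png", ".jpg", ".jpeg", ".gif")):
    """Return (path, size in KB) for images above limit_kb, largest first."""
    large = []
    for path in Path(directory).rglob("*"):
        if path.is_file() and path.suffix.lower() in suffixes:
            size_kb = path.stat().st_size / 1024
            if size_kb > limit_kb:
                large.append((path, round(size_kb)))
    return sorted(large, key=lambda item: item[1], reverse=True)


# Usage, assuming the script runs from the repository root:
# for path, size_kb in find_large_images(
#     "sources/academy/webscraping/scraping_basics_python/images"
# ):
#     print(f"{size_kb} KB {path}")
```

The function only reports file sizes; the compression itself still has to be done with a separate tool.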

sources/academy/webscraping/scraping_basics_python/06_locating_elements.md

vdusek

LGTM

honzajavorek added the t-academy Issues related to Web Scraping and Apify academies. label Oct 8, 2024

honzajavorek force-pushed the honzajavorek/py-crawling branch 3 times, most recently from 625726c to 3199df8 Compare October 10, 2024 14:14

honzajavorek changed the title ~~feat: Python course (end)~~ feat: lessons about crawling and scraping product detail pages in Python Oct 10, 2024

honzajavorek force-pushed the honzajavorek/py-crawling branch from fc1ebd6 to 4339dcd Compare October 11, 2024 10:26

honzajavorek marked this pull request as ready for review October 11, 2024 10:31

honzajavorek requested review from vdusek, mnmkng, metalwarrior665 and TC-MO October 11, 2024 10:31

honzajavorek commented Oct 11, 2024

mnmkng requested changes Oct 11, 2024

honzajavorek force-pushed the honzajavorek/py-crawling branch 2 times, most recently from 6187b18 to b82b673 Compare October 18, 2024 08:11

vdusek requested changes Oct 22, 2024

sources/academy/webscraping/scraping_basics_python/10_crawling.md

honzajavorek and others added 10 commits November 25, 2024 10:38

feat: draft further course development

5e2399b

feat: the crawling lesson and more

a0648f5

feat: lesson about scraping variants

9e0a6a1

feat: prepare next lesson stubs

16ea039

style: better English in the crawling lesson

ea4ec88

style: better English in the variants lesson

0749d5f

feat: add exercises

e0ecf7c

feat: exercises

f88abc8

style: better English

7680d36

fix: heading

3cf2893

Co-authored-by: Ondra Urban <[email protected]>

honzajavorek and others added 2 commits November 25, 2024 10:38

style: split sentences for better readability

017d2ae

Co-authored-by: Ondra Urban <[email protected]>

fix: part with samples more comprehensible, add gif of variants

30a75bd

honzajavorek force-pushed the honzajavorek/py-crawling branch from 5874efe to 30a75bd Compare November 25, 2024 09:38

TC-MO approved these changes Nov 25, 2024

sources/academy/webscraping/scraping_basics_python/06_locating_elements.md

honzajavorek mentioned this pull request Nov 25, 2024

feat: lesson about using a framework #1303

Merged

honzajavorek requested review from mnmkng and vdusek November 26, 2024 22:56

vdusek approved these changes Nov 27, 2024

mnmkng approved these changes Nov 27, 2024

honzajavorek merged commit a8cfbec into master Nov 27, 2024
8 checks passed

honzajavorek deleted the honzajavorek/py-crawling branch November 27, 2024 16:17
