feat: lessons about crawling and scraping product detail pages in Python #1244

Open
wants to merge 12 commits into
base: master

Collaborator

@honzajavorek honzajavorek commented Oct 8, 2024

Done

This PR introduces two new lessons to the Python course, including real-world exercises. These two lessons conclude the core of the course: by the end, the reader should be able to build their own scraper. The final exercises focus on that and test the reader's ability to build one independently. The PR also includes some edits to earlier parts of the course (code examples, exercises).

Next

Before the course is done, there should be two more lessons: one about building the very same scraper using a framework (Crawlee), and one about deploying the scraper to a platform (Apify). Then I should return to the beginning and complete the three initial lessons about DevTools.

@honzajavorek honzajavorek added the t-academy Issues related to Web Scraping and Apify academies. label Oct 8, 2024
@honzajavorek honzajavorek force-pushed the honzajavorek/py-crawling branch 3 times, most recently from 625726c to 3199df8 Compare October 10, 2024 14:14
@honzajavorek honzajavorek changed the title feat: Python course (end) feat: lessons about crawling and scraping product detail pages in Python Oct 10, 2024
@honzajavorek honzajavorek marked this pull request as ready for review October 11, 2024 10:31
@@ -142,10 +142,10 @@ Letting our program visibly crash on error is enough for our purposes. Now, let'

### Scrape Amazon

-Download HTML of a product listing page, but this time from a real world e-commerce website. For example this page with Amazon search results:
+Download HTML of a product listing page, but this time from a real world e-commerce website. For example this page with AliExpress search results:
Collaborator Author

@honzajavorek honzajavorek Oct 11, 2024

I decided to do this because this is the very first exercise, and I don't want the reader to end up with an error. I don't want to direct them to a different course when they're touching the internet for the first time with their tiny program. So I'm hoping for better results with AliExpress, which is a real-world e-commerce website as well, but doesn't seem to care nearly as much about scraping.

@@ -122,6 +122,14 @@ for product in soup.select(".product-item"):

This program does the same as the one we already had, but its code is more concise.

:::note Fragile code
Collaborator Author

Explicitly addressing some previous concerns of @vdusek. I thought about it and I think it's better if we're upfront about this decision and the reader knows that this is expected.

@@ -199,8 +199,12 @@ def export_json(file, data):
     json.dump(data, file, default=serialize, indent=2)

 listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
-soup = download(listing_url)
-data = [parse_product(product) for product in soup.select(".product-item")]
+listing_soup = download(listing_url)
Collaborator Author

I made this change because this way the code is better suited for the newly added lessons.
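
For readers of this thread: the hunk above calls `json.dump` with `default=serialize`, but the `serialize` helper itself is not shown in this excerpt. The following is only a minimal sketch of how such a hook typically works, assuming the scraped prices are stored as `Decimal` values. The body of `serialize` and the sample data are assumptions made for illustration, not the course's actual code.

```python
import io
import json
from decimal import Decimal


def serialize(obj):
    # Assumed helper: json.dump calls this for any object it cannot encode itself.
    if isinstance(obj, Decimal):
        return str(obj)
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


def export_json(file, data):
    # Same call as in the diff: unknown types are delegated to serialize().
    json.dump(data, file, default=serialize, indent=2)


# Made-up sample record, for illustration only.
data = [{"title": "Example product", "min_price": Decimal("74.95")}]

buffer = io.StringIO()
export_json(buffer, data)
exported = buffer.getvalue()
```

Converting `Decimal` to a string keeps the exact price digits in the JSON output, at the cost of the consumer having to parse the string back into a number.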

Member

@mnmkng mnmkng left a comment

Thanks Honza, great job as always.

@honzajavorek honzajavorek force-pushed the honzajavorek/py-crawling branch 2 times, most recently from 6187b18 to b82b673 Compare October 18, 2024 08:11
@honzajavorek
Collaborator Author

I addressed all comments from @mnmkng. I also added a GIF, as I thought it would make one part clearer. I'm not a seasoned GIF creator, so if someone can make a more beautiful one, go for it. I'm attaching the source recording as a file.

Screen.Recording.2024-10-18.at.9.42.13.mov

This is the gif:

variants-js

I made it like this:

ffmpeg -i Screen\ Recording\ 2024-10-18\ at\ 9.42.13.mov -pix_fmt rgb24 -r 10 -f gif - | gifsicle --optimize=2 --delay=10 > variants-js.gif

Unfortunately, it's not possible to show how the price changes dynamically, because the JS touches the parent elements, and in both Chrome and Firefox DevTools this causes the elements to collapse. At least it's visible that something changes there, and how the stock availability text changes.
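
For anyone regenerating the GIF, the shell pipeline above can also be driven from Python. This is a minimal sketch that reuses only the flags from that command. The helper names `build_gif_commands` and `convert_to_gif` are made up for illustration, and it assumes both `ffmpeg` and `gifsicle` are installed and available on `PATH`.

```python
import subprocess


def build_gif_commands(recording_path):
    # Same flags as the shell pipeline; the trailing "-" makes ffmpeg write to stdout.
    ffmpeg_cmd = [
        "ffmpeg", "-i", recording_path,
        "-pix_fmt", "rgb24", "-r", "10", "-f", "gif", "-",
    ]
    gifsicle_cmd = ["gifsicle", "--optimize=2", "--delay=10"]
    return ffmpeg_cmd, gifsicle_cmd


def convert_to_gif(recording_path, gif_path):
    ffmpeg_cmd, gifsicle_cmd = build_gif_commands(recording_path)
    with open(gif_path, "wb") as gif_file:
        # Pipe ffmpeg's stdout into gifsicle's stdin, as the "|" does in the shell.
        ffmpeg = subprocess.Popen(ffmpeg_cmd, stdout=subprocess.PIPE)
        gifsicle = subprocess.Popen(gifsicle_cmd, stdin=ffmpeg.stdout, stdout=gif_file)
        ffmpeg.stdout.close()  # lets ffmpeg see a broken pipe if gifsicle exits early
        gifsicle_status = gifsicle.wait()
        ffmpeg_status = ffmpeg.wait()
    if ffmpeg_status != 0 or gifsicle_status != 0:
        raise RuntimeError("GIF conversion failed")
```

Passing the path as a list element avoids the backslash escaping needed for the spaces in the recording's file name. The `-r 10` frame rate and the `--delay=10` setting (hundredths of a second per frame) both correspond to 10 frames per second.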

@mnmkng
Member

mnmkng commented Oct 22, 2024

@vdusek can you please check the Python so that we can merge? Thanks!

Contributor

@vdusek vdusek left a comment

The code looks good and works well. I just have one suggestion regarding the images...

Currently, the images are quite large (~1.2 MB each). Could we apply some compression? I would say using JPEG would be enough, to avoid unnecessarily bloating the Git history.

@honzajavorek
Collaborator Author

I applied compression to the PNG files added in this PR, results being:

+243 KB sources/academy/webscraping/scraping_basics_python/images/pdp.png
+494 KB sources/academy/webscraping/scraping_basics_python/images/variants.png

I rebased the branch so that it contains only these new files, I hope that's gonna be sufficient. @vdusek @mnmkng if you see no further problems with the PR, bless this with green and I think we can merge.
