feat: Advertools - Check website pages status code #2100

Merged
merged 7 commits into from
Aug 4, 2023

FlorentLvr
Contributor

This PR resolves #2099

@FlorentLvr FlorentLvr self-assigned this Aug 2, 2023
@FlorentLvr FlorentLvr added the templates maintainer To be prioritize for notebook templates maintainer label Aug 2, 2023
@FlorentLvr
Contributor Author

@eliasdabbas, I have created a notebook that crawls a website and checks the status code of all URLs using your notebook templates. It has helped us identify 100 failed URLs on our website. :)
Could you please take a look at it and let me know if there is anything you would suggest improving?🙏

@eliasdabbas
Contributor

@FlorentLvr
Great to see this, and happy to know that it helped discover pages with issues!

Please note that the crawl function already checks the status codes, and you can get them through the status column.
The main differences:

  • crawl_headers: You need to already know the URLs whose status codes you want to check. It is also much faster, because it doesn't download the page; it just sends a HEAD request, and the response includes the status code and a few other things.
  • crawl: It discovers URLs by starting from the home page, and discovering and following links in the website.

So, if the website is frequently changing, adding/removing new pages, crawl would be a better choice. If you have a static list of URLs that you want to check on a periodic basis, you can use crawl_headers.
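
As a rough sketch of the two approaches described above (not the notebook's actual code): `adv.crawl` with `follow_links=True` discovers pages from a starting URL, while `adv.crawl_headers` checks a known list. Both write a jsonlines (`.jl`) file. The function names `run_crawl`, `run_header_check`, and `failed_urls`, the output file names, and the 400 threshold are hypothetical, and the sketch assumes each output row has `url` and `status` fields.

```python
import json


def run_crawl(start_url, output_file="site_crawl.jl"):
    """Discover pages by following links from start_url (downloads each page)."""
    import advertools as adv  # imported lazily so the helper below works without it

    adv.crawl(start_url, output_file, follow_links=True)


def run_header_check(urls, output_file="header_check.jl"):
    """Send HEAD requests to an already-known list of URLs (no page download)."""
    import advertools as adv

    adv.crawl_headers(urls, output_file)


def failed_urls(jl_path, min_status=400):
    """Return (url, status) pairs whose status is >= min_status.

    Assumes one JSON object per line with "url" and "status" keys.
    """
    failures = []
    with open(jl_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            row = json.loads(line)
            status = row.get("status")
            if isinstance(status, int) and status >= min_status:
                failures.append((row.get("url"), status))
    return failures
```

For example, `run_crawl("https://example.com")` followed by `failed_urls("site_crawl.jl")` would list the pages that returned 4xx or 5xx codes; the same helper can read the file produced by `run_header_check`.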

So, based on this, how would you prefer to proceed with this notebook: split it, or something else?
Happy to help on this.

@FlorentLvr
Contributor Author

@eliasdabbas, thank you for the review and for the valuable feedback! Your observation about the status column in crawldf was very helpful. I have now removed the section that used crawl_headers, and the code works perfectly. 🙏

@FlorentLvr FlorentLvr merged commit 40688a3 into master Aug 4, 2023
5 checks passed
@FlorentLvr FlorentLvr deleted the 2099-advertools-check-website-pages-status-code branch August 4, 2023 07:57
@github-actions

github-actions bot commented Aug 4, 2023

The template is now available on the master branch on this link:
https://github.com/jupyter-naas/awesome-notebooks/blob/master/Advertools/Advertools_Check_website_pages_status_code.ipynb

@github-actions

github-actions bot commented Aug 4, 2023

Thank you for your contribution @FlorentLvr, your PR has been merged into the master branch of awesome-notebook.
Here is the contribution certificate you can share on social media so everybody knows how awesome you are 🤙🌎.
Spread the #opensource love 💚
