Is it possible to scrape all posts in a specific subreddit? #13
Comments
It is currently not possible to do this with URS. PRAW supported this in the past, but the feature is deprecated as of PRAW v6.0.0. Reddit used to provide the Cloudsearch API, which let clients search for posts by UNIX timestamp. Reddit has since removed that API, so the feature no longer works. There is an alternative for scraping Reddit, however. Pushshift.io looks promising, although I have not used it before. I am considering a future refactor that would use Pushshift's API instead of PRAW and allow for more versatile scraping. It seems like this is a feature many people would like to have in URS, and a growing number of social media websites are beginning to limit the versatility of their official APIs. I am not familiar with Pushshift, so I will have to do more research before changing any current functionality. I will likely refactor URS if Pushshift's API turns out to be more promising than PRAW. Stay tuned for updates!
Did you have any success using Pushshift? I'd like to try to use their API for scraping all the posts, but it's quite poorly documented.
I have had success using Pushshift and am in the process of integrating the API so that it can be accessed via command-line flags (spoiler: the Pushshift scrapers come with a lot of optional flags for granular scrape settings). In my May 2020 comment I mentioned that I would consider an entire refactor using Pushshift. After some research, I realized that integrating Pushshift alongside PRAW would be the better choice because each API provides unique features. Livestreaming comments or submissions as they are posted to a Subreddit is not possible with Pushshift, for example. The ability to use either API, or both, would make for a powerful Reddit scraping tool. As you mentioned, the Pushshift documentation is subpar. I would need to do a fair amount of testing and provide clear explanations of how each optional flag interacts with the API within this repository. I believe the integration is almost finished; however, I have stepped away from URS development for now to focus on other things: building a portfolio site and practicing Leetcode, since I am unfortunately still looking for a full-time job. I plan on releasing the Pushshift integration in the next minor iteration (v3.4.0), although it may take some time before that happens. Keep an eye out for updates!
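For anyone who wants to experiment in the meantime, here is a minimal sketch of the timestamp-window approach discussed above: request one page of submissions, then ask for everything older than the oldest post on that page, and repeat until nothing comes back. This is not the URS implementation. The endpoint URL, the `subreddit`, `size`, and `before` parameter names, the `{"data": [...]}` response shape, and the assumption that results come back newest-first are all things I recall about Pushshift's public API and have not verified here; access to the service may also be restricted. The `iter_submissions` and `fetch_page` names are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Dict, Iterator, List, Optional

# Assumed endpoint; verify it against the current Pushshift documentation.
PUSHSHIFT_URL = "https://api.pushshift.io/reddit/search/submission/"

Page = List[Dict]
Fetch = Callable[[Dict], Page]


def fetch_page(params: Dict) -> Page:
    """Request one page over HTTP (assumes a {"data": [...]} JSON body)."""
    url = PUSHSHIFT_URL + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response).get("data", [])


def iter_submissions(
    subreddit: str,
    fetch: Fetch = fetch_page,
    before: Optional[int] = None,
    size: int = 100,
) -> Iterator[Dict]:
    """Yield submissions by walking backwards through time.

    Assumes each page holds the newest posts older than `before`.
    """
    while True:
        params = {"subreddit": subreddit, "size": size}
        if before is not None:
            params["before"] = before
        page = fetch(params)
        if not page:
            return  # an empty page means the window is exhausted
        yield from page
        # Next window: everything strictly older than the oldest post seen.
        # Posts sharing that exact timestamp but cut off by `size` are skipped.
        before = min(post["created_utc"] for post in page)
```

Passing the fetcher in as an argument keeps the paging logic separate from the network call, so the loop can be exercised with an in-memory stand-in before pointing it at a real service. A real scraper would also want rate limiting and retries, which are left out here.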
That's good to hear :) Good luck with the job search!
Hey. Any updates on this issue request? I need to scrape some of my own posts.
This is not a bug report, but a question about the functionality. Can this scraper be used to obtain all posts in a specific subreddit, surpassing the htrcnrs criteria? It looks like the Reddit API now limits the number of posts we can pull.