Provide more help for spidering/crawling #193
Comments
I think it might make sense to start with a function that returns a list of pages, rather than flattening into a single list of nodes. You'd definitely want to be able to supply a maximum number of pages.
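A minimal sketch of what such a function could look like, purely for illustration: the name crawl_pages(), the next-link selector argument, and the stopping rule are assumptions, not an agreed rvest API.

```r
library(xml2)
library(rvest)

# Hypothetical: repeatedly follow a "next page" link, returning the pages
# themselves as a list and stopping after max_pages or when no link matches.
crawl_pages <- function(url, next_css, max_pages = 10) {
  pages <- list()
  while (!is.null(url) && length(pages) < max_pages) {
    page <- read_html(url)
    pages[[length(pages) + 1]] <- page
    next_link <- html_node(page, next_css)
    if (inherits(next_link, "xml_missing")) {
      url <- NULL
    } else {
      url <- url_absolute(html_attr(next_link, "href"), url)
    }
  }
  pages
}

# The caller then decides how to flatten pages into nodes, e.g.:
# purrr::map(crawl_pages(start_url, "a.next"), html_nodes, "h2")
```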
Seems like maybe it should be a function that works with an … Or would it be better to provide a new …?
In general, this issue is really about "crawling", i.e. making a queue of URLs and systematically visiting them, parsing data, and adding more links to the queue. In an ideal world this would be done as asynchronously as possible, so that one web page could be downloading while another is parsing (while still incorporating rate limiting to avoid hammering a single site). See https://github.com/salimk/Rcrawler for related work.
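A rough sketch of that queue-based loop, kept deliberately simple: it is fully synchronous, uses Sys.sleep() as crude rate limiting, and the limits are placeholders. Overlapping downloading and parsing, as described above, would need extra machinery (for example curl's multi interface) that isn't shown here.

```r
library(xml2)
library(rvest)

crawl <- function(start_url, max_pages = 20, delay = 1) {
  queue   <- start_url
  visited <- character()
  pages   <- list()

  while (length(queue) > 0 && length(pages) < max_pages) {
    url   <- queue[[1]]
    queue <- queue[-1]
    if (url %in% visited) next
    visited <- c(visited, url)

    Sys.sleep(delay)                       # crude rate limiting
    page <- read_html(url)
    pages[[url]] <- page

    # parse out links and add unseen ones to the back of the queue
    links <- html_attr(html_nodes(page, "a"), "href")
    links <- url_absolute(links[!is.na(links)], url)
    queue <- c(queue, setdiff(links, c(visited, queue)))
  }
  pages
}
```

A real crawler would also want to restrict the queue to a given host and respect robots.txt; those details are omitted here.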
That indeed makes sense, because you could always pass the list of pages to …
Or perhaps …
I think all matching links should be visited by default. Depending on the HTML, there might be a CSS selector or XPath expression that matches only one link. Of course this is not always the case, so I'm wondering whether it makes sense to allow something like: session_crawl(s, ~ html_nodes(.x, "a")[1]) %>% html_node("title")
My gut feeling tells me that most often you'd want to do a breadth-first search, because the length of a page is usually shorter than the potential number of links the crawler may wander off to. I guess it really depends on the situation, so concrete examples might be useful here. Or would it make sense to let the user specify this?
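One way to let the user specify this, sketched with hypothetical names: the only difference between breadth-first and depth-first is whether newly discovered links are appended to the back or pushed onto the front of the frontier. Here get_links is assumed to be a user-supplied function that returns the links found on a page.

```r
crawl_order <- function(start_url, get_links, max_pages = 20,
                        strategy = c("breadth", "depth")) {
  strategy <- match.arg(strategy)
  frontier <- start_url
  visited  <- character()

  while (length(frontier) > 0 && length(visited) < max_pages) {
    url      <- frontier[[1]]
    frontier <- frontier[-1]
    if (url %in% visited) next
    visited  <- c(visited, url)

    new_links <- setdiff(get_links(url), c(visited, frontier))
    if (strategy == "breadth") {
      frontier <- c(frontier, new_links)   # queue: breadth-first
    } else {
      frontier <- c(new_links, frontier)   # stack: depth-first
    }
  }
  visited
}
```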
Another place to look for API inspiration is scrapy.
Oftentimes, content is paginated into multiple HTML documents. When the number of HTML documents is known and their corresponding URLs can be generated beforehand, selecting the desired nodes is a matter of combining xml2::read_html() and rvest::html_nodes() with, say, purrr::map().
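For example (the URL template and CSS selector below are made up), that pattern might look like:

```r
library(xml2)
library(rvest)
library(purrr)

urls   <- sprintf("https://example.com/archive?page=%d", 1:10)
pages  <- map(urls, read_html)
titles <- map(pages, html_nodes, ".article-title")
```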
When the number of HTML documents is unknown, or when their corresponding URLs cannot be generated beforehand, we need a different approach. One approach is to "click" the More button using rvest::follow_link() and recursion. I recently implemented this approach as follows: …

I asked @hadley whether it makes sense to make this functionality part of rvest (see https://twitter.com/jeroenhjanssens/status/854390989942919170). I think at least the following questions need answering: …

I'd be happy to draft up a PR, but first I'm curious to hear your thoughts. Many thanks.
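For reference, a minimal sketch of the follow_link()-and-recursion idea using the older html_session()/follow_link() API; the "More" link text, the selector, and the helper name are placeholders rather than the implementation referred to above.

```r
library(rvest)
library(purrr)

# Follow the "More" link recursively, collecting each page's session and
# stopping when the link disappears or max_pages is reached.
follow_more <- function(session, max_pages = 10) {
  if (max_pages <= 1) return(list(session))
  more <- tryCatch(follow_link(session, "More"), error = function(e) NULL)
  if (is.null(more)) return(list(session))
  c(list(session), follow_more(more, max_pages - 1))
}

s      <- html_session("https://example.com/news")
pages  <- follow_more(s, max_pages = 5)
titles <- map(pages, html_nodes, ".headline")
```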