
Scrape the "all rooms" page #6

Open
mrdwab opened this issue Mar 31, 2016 · 4 comments

@mrdwab
Collaborator

mrdwab commented Mar 31, 2016

There should be a function that scrapes the "all rooms" page, with at least the "active" and "people" sort options (http://chat.stackoverflow.com/?tab=all&sort=active and http://chat.stackoverflow.com/?tab=all&sort=people), and returns a data.frame of the relevant URLs. This would make the package more generally useful.

@alistaire47, you seem to know what's up when it comes to scraping ;-)

I'm guessing it's something along the lines of starting with:

library(rvest)      # read_html(), html_node(), html_nodes()
library(magrittr)   # for the %>% pipe

the_url_i_want <- "http://chat.stackoverflow.com/?tab=all&sort=active"

the_url_i_want %>% 
  read_html() %>% 
  html_node('#roomlist') %>% 
  html_nodes("h3") %>% 
  html_nodes("a")
@alistaire47
Collaborator

Check it out: https://github.com/alistaire47/room_utils/blob/master/rooms.R

I tried to get the roxygen comments roughly right, but please double-check them before we integrate it; I'm still pretty new to package development.

Also, I realized that despite prefixing all the non-base functions with ::, the scraping scripts still won't run without attaching a package that provides the pipe (unless we write something like magrittr::`%>%`, but that seems absurd), so I just put library(magrittr) at the top. I'm not sure if there's a standard way to deal with that issue, but I'm sure somebody has encountered it before.

@romunov
Collaborator

romunov commented Mar 31, 2016

Of course you can: you declare the imports (and exports) in the roxygen block of the script. See my example here: https://github.com/romunov/zvau/blob/7135636db9d5a7436a35121dbbd26fd5c1396660/R/writeINEST.R
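For the pipe specifically, the usual pattern is a small roxygen block that imports (and optionally re-exports) %>% once for the whole package; the file name utils-pipe.R below is only a convention, and magrittr still has to be listed under Imports: in DESCRIPTION:

# R/utils-pipe.R

#' Pipe operator, imported from magrittr
#'
#' @name %>%
#' @rdname pipe
#' @keywords internal
#' @importFrom magrittr %>%
#' @export
NULL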


@alistaire47
Collaborator

@romunov Oh perfect, thanks! I updated the script linked above.

Also, here's a little add-on function, which is useful but slow because it scrapes everything every time. (Maybe rooms() could be cached and only called if there's no match or it's demanded by another parameter, but I'm not sure if anybody would use the function repeatedly anyway.)

find_room <- function(room_name, exact = FALSE) {
    # rooms() is defined in the rooms.R script linked above
    pattern <- if (exact) paste0('/', room_name, '$') else room_name
    grep(pattern, rooms(), value = TRUE, ignore.case = TRUE)
}

Documented: https://github.com/alistaire47/room_utils/blob/master/find_room.R
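If the repeated scraping ever becomes annoying, a minimal in-session cache could look something like this; rooms_cached() and .room_cache are hypothetical names, and rooms() is the function from the linked script:

.room_cache <- new.env(parent = emptyenv())

rooms_cached <- function(refresh = FALSE) {
    # scrape once per session, unless a refresh is explicitly requested
    if (refresh || is.null(.room_cache$rooms)) {
        .room_cache$rooms <- rooms()
    }
    .room_cache$rooms
}

find_room() could then grep over rooms_cached() instead of rooms().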

@romunov
Collaborator

romunov commented Mar 31, 2016

If you feel this overhead cost is too much, consider exporting the data to an external file and checking for its existence (and time stamp) before scraping all the rooms again.
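Something along these lines, for example; the file name and the one-hour threshold are arbitrary, and rooms() is again the function from the linked script:

rooms_from_file <- function(path = "rooms.rds", max_age_secs = 60 * 60) {
    # reuse the saved copy if it exists and is fresh enough
    if (file.exists(path) &&
        difftime(Sys.time(), file.mtime(path), units = "secs") < max_age_secs) {
        return(readRDS(path))
    }
    result <- rooms()
    saveRDS(result, path)
    result
}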

I would also suggest writing the code into the package's R folder on a separate branch. Once everyone likes the functionality (and it builds OK), that branch can be merged seamlessly into the main branch.
