
Scrape the "all rooms" page #6

Open
mrdwab opened this issue Mar 31, 2016 · 4 comments

@mrdwab
Collaborator

mrdwab commented Mar 31, 2016

There should be a function that scrapes the "all rooms" page, with at least the "active" and "people" sort options (http://chat.stackoverflow.com/?tab=all&sort=active and http://chat.stackoverflow.com/?tab=all&sort=people), and returns a data.frame of the relevant URLs. This would make the package more generally useful.

@alistaire47, you seem to know what's up when it comes to scraping ;-)

I'm guessing it's something along the lines of starting with:

library(rvest)      # read_html(), html_node(), html_nodes()
library(magrittr)   # for the %>% pipe

the_url_i_want <- "http://chat.stackoverflow.com/?tab=all&sort=active"

the_url_i_want %>% 
  read_html() %>% 
  html_node('#roomlist') %>% 
  html_nodes("h3") %>% 
  html_nodes("a")
@alistaire47
Collaborator

Check it out: https://github.com/alistaire47/room_utils/blob/master/rooms.R

I tried to get the roxygen comments roughly right, but please double-check them before we integrate it; I'm still pretty new to package development.

Also, I realized that despite prefixing all the non-base functions with ::, the scraping scripts still won't run without attaching a package that provides the pipe (unless we write something like magrittr::`%>%`, but that seems absurd), so I just put library(magrittr) at the top. I'm not sure if there's a standard way to deal with that issue, but I'm sure somebody has encountered it before.

@romunov
Collaborator

romunov commented Mar 31, 2016

Of course you can: you declare the imports (and exports) in the roxygen block of the script. See my example here: https://github.com/romunov/zvau/blob/7135636db9d5a7436a35121dbbd26fd5c1396660/R/writeINEST.R
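For the pipe specifically, the usual pattern is a small roxygen block that imports (and optionally re-exports) %>% once for the whole package; the file name utils-pipe.R below is only a convention, and magrittr still has to be listed under Imports: in DESCRIPTION:

# R/utils-pipe.R

#' Pipe operator, imported from magrittr
#'
#' @name %>%
#' @rdname pipe
#' @keywords internal
#' @importFrom magrittr %>%
#' @export
NULL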


@alistaire47
Collaborator

@romunov Oh perfect, thanks! I updated the script linked above.

Also, here's a little add-on function, which is useful but slow because it scrapes everything every time. (Maybe rooms() could be cached and only called if there's no match or it's demanded by another parameter, but I'm not sure if anybody would use the function repeatedly anyway.)

find_room <- function(room_name, exact = FALSE) {
    # rooms() is defined in the rooms.R script linked above
    pattern <- if (exact) paste0('/', room_name, '$') else room_name
    grep(pattern, rooms(), value = TRUE, ignore.case = TRUE)
}

Documented: https://github.com/alistaire47/room_utils/blob/master/find_room.R
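If the repeated scraping ever becomes annoying, a minimal in-session cache could look something like this; rooms_cached() and .room_cache are hypothetical names, and rooms() is the function from the linked script:

.room_cache <- new.env(parent = emptyenv())

rooms_cached <- function(refresh = FALSE) {
    # scrape once per session, unless a refresh is explicitly requested
    if (refresh || is.null(.room_cache$rooms)) {
        .room_cache$rooms <- rooms()
    }
    .room_cache$rooms
}

find_room() could then grep over rooms_cached() instead of rooms().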

@romunov
Collaborator

romunov commented Mar 31, 2016

If you feel this overhead cost is too much, consider exporting the data to an external file and checking for its existence (and time stamp) before scraping all the rooms again.
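Something along these lines, for example; the file name and the one-hour threshold are arbitrary, and rooms() is again the function from the linked script:

rooms_from_file <- function(path = "rooms.rds", max_age_secs = 60 * 60) {
    # reuse the saved copy if it exists and is fresh enough
    if (file.exists(path) &&
        difftime(Sys.time(), file.mtime(path), units = "secs") < max_age_secs) {
        return(readRDS(path))
    }
    result <- rooms()
    saveRDS(result, path)
    result
}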

I would also suggest writing the code into the package's R folder on a separate branch. Once everyone likes the functionality (and it builds OK), that branch can be merged seamlessly into the main branch.
