
Downloading data: child process has died #323

Open
peterdesmet opened this issue Sep 30, 2024 · 7 comments

@peterdesmet (Member)

I got the following error when trying to download the largest dataset I know:

> download_acoustic_dataset(animal_project_code = "2013_albertkanaal")
Downloading data to directory `2013_albertkanaal`:
* (1/6): downloading animals.csv
* (2/6): downloading tags.csv                                                                   
* (3/6): downloading detections.csv                                                             
Error: child process has died

In call:
tryCatch({
    if (length(priority)) 
        setpriority(priority)
    if (length(rlimits)) 
        set_rlimits(rlimits)
    if (length(gid)) 
        setgid(gid)
    if (length(uid)) 
        setuid(uid)
    if (length(profile)) 
        aa_change_profile(profile)
    if (length(device)) 
        options(device = device)
    graphics.off()
    options(menu.graphics = FALSE)
    serialize(withVisible(eval(orig_expr, parent.frame())), NULL)
}, error = function(e) {
    old_class <- attr(e, "class")
    structure(e, class = c(old_class, "eval_fork_error"))
}, finally = substitute(graphics.off()))

This type of time-out is expected when using the API. Is there an option to catch these and suggest something more helpful?
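For illustration, a minimal sketch (not existing etn code) of how the call could be wrapped so the low-level "child process has died" error is caught and re-raised with a more helpful hint; the wrapper name and message text are assumptions:

download_with_hint <- function(animal_project_code) {
  tryCatch(
    download_acoustic_dataset(animal_project_code = animal_project_code),
    error = function(e) {
      # Re-raise OpenCPU/child-process failures with a user-facing suggestion
      if (grepl("child process has died", conditionMessage(e), fixed = TRUE)) {
        stop(
          "The API timed out while downloading a large dataset. ",
          "Consider a direct database connection or downloading smaller subsets.",
          call. = FALSE
        )
      }
      stop(e)
    }
  )
}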

@peterdesmet peterdesmet added this to the v2.3 milestone Sep 30, 2024
@PietrH PietrH removed this from the v2.3 milestone Oct 1, 2024
@PietrH PietrH added the API label Oct 1, 2024
@PietrH PietrH added this to the v2.3.1 milestone Oct 1, 2024
@PietrH (Member) commented Oct 1, 2024

I get exactly the same error, at the same stage. I suspect the failure is actually at:

get_acoustic_detections(animal_project_code = "2013_albertkanaal")

I'm getting HTTP 502 request failed responses on the above call; this might be fixed with paging in https://github.com/inbo/etn/tree/paging:

etn/R/utils.R, lines 99 to 159 at 8012306:

fetch_result_paged <-
  function(connection,
           query,
           page_size = 1000,
           progress = FALSE) {
    assertthat::assert_that(assertthat::is.count(page_size))
    # Stop a progress bar from appearing if not required
    if (!progress) {
      withr::local_options(cli.progress_show_after = Inf)
    }
    # Create result object to page into = execute query on DB
    result <- DBI::dbSendQuery(connection, query, immediate = FALSE)
    # When this function exits, clear the result (mandatory)
    withr::defer(DBI::dbClearResult(result))
    # Fetch some information about our result object
    result_colnames <- DBI::dbColumnInfo(result)$name
    result_nrow <- DBI::dbGetInfo(result)$rows.affected
    # Create tempfile to write to, automatically deleted when function completes
    partial_result_file <- withr::local_tempfile()
    # Initialize a progress bar
    # pb <- progress::progress_bar$new(
    #   total = result_nrow,
    #   format = " fetching [:bar] :percent in :elapsed",
    #   width = 60
    # )
    # withr::defer(pb$terminate())
    cli::cli_progress_bar("Fetching result from ETN", total = result_nrow)
    ## Set object to keep track of how many rows have been fetched
    rows_done <- 0
    # Fetch pages of the result until we have everything
    while (!DBI::dbHasCompleted(result)) {
      readr::write_csv(DBI::dbFetch(result, n = page_size),
                       partial_result_file,
                       append = TRUE,
                       progress = FALSE)
      rows_done <- rows_done + page_size
      # rows_done <- DBI::dbGetInfo(result)$row.count
      # pb$update(rows_done / result_nrow)
      cli::cli_progress_update(set =
        # length(readr::read_lines(partial_result_file, progress = FALSE))
        rows_done
      )
    }
    # Read the temp file we wrote the result data frame to
    result_df <-
      readr::read_csv(
        partial_result_file,
        col_names = result_colnames,
        show_col_types = FALSE,
        progress = FALSE
      )
    return(result_df)
  }

Paging comes at a significant cost: not only the IO operations, but also having to either rely on readr's type parsing or store the column-type mapping somewhere and reapply it. I would like to avoid having to COUNT the size of a return object before deciding whether to page, and I think leaving the choice up to the user is not so friendly either.
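To illustrate the second option, a rough sketch (assumed, not code from the paging branch) of storing the column information reported by DBI and reapplying it when reading the temp file back with readr, instead of relying on readr's type guessing; the type-to-collector mapping is simplified:

# Capture the column names and types reported by the database driver
col_info <- DBI::dbColumnInfo(result)

# Map the reported types to readr collectors (illustrative mapping only)
to_collector <- function(type) {
  switch(type,
    integer = readr::col_integer(),
    double  = readr::col_double(),
    numeric = readr::col_double(),
    logical = readr::col_logical(),
    readr::col_character()  # fallback for character, timestamps, etc.
  )
}

col_types <- do.call(
  readr::cols,
  stats::setNames(lapply(col_info$type, to_collector), col_info$name)
)

# Reapply the stored mapping instead of letting readr guess
result_df <- readr::read_csv(
  partial_result_file,
  col_names = col_info$name,
  col_types = col_types,
  progress = FALSE
)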

I'm thinking about it. In any case, this might have to be fixed on the etnservice side.


etn::get_acoustic_detections(animal_project_code = "2013_albertkanaal", api = FALSE) does work

@PietrH (Member) commented Oct 1, 2024

Because it's a gateway error, I've contacted Stijn to see what he can see on his side.

I don't think the object is too big to pass over the API, especially compressed. I don't think server-side paging will fix this, but client-side paging might, although with a very significant overhead (because we'd need to implement sorting, or maybe use R sessions to fetch from OpenCPU).
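For reference, a very rough sketch of what client-side paging would involve; fetch_detections_page() is a hypothetical endpoint that does not exist today, and the stable sort key is exactly the part we would have to implement:

page_size <- 50000
offset <- 0
pages <- list()

repeat {
  # Hypothetical call: the server would need to apply ORDER BY so that
  # successive LIMIT/OFFSET requests partition the result without gaps
  page <- fetch_detections_page(
    animal_project_code = "2013_albertkanaal",
    order_by = "detection_id",
    limit = page_size,
    offset = offset
  )
  if (nrow(page) == 0) break
  pages[[length(pages) + 1]] <- page
  offset <- offset + page_size
}

detections <- dplyr::bind_rows(pages)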

@PietrH (Member) commented Oct 1, 2024

The 502 errors are coming from Nginx (opencpu-cache); I've forwarded this information to Stijn. We'll need to look into the admin logs for more info.


@Stijn-VLIZ (Collaborator)

I tried many different things, but my conclusion is that we are running into limits here.
The image below shows the memory usage of a local Docker container running etnservice.
The function get_acoustic_detections was altered so that no ordering is done and the data frame is emptied before being serialized, so the memory shown is for running the query only.
[screenshot: memory usage of the local etnservice Docker container]
This also runs for 9 minutes.
What I propose is indeed pagination; I will first investigate the possibilities on our side (the database).
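To make the database-side option concrete, one possibility is keyset pagination, sketched below; the table and column names are assumptions and the placeholder syntax is Postgres-flavoured:

# Illustrative keyset pagination: page on a monotonically increasing key
# instead of OFFSET, so each page costs roughly the same to fetch
last_id <- 0
pages <- list()

repeat {
  page <- DBI::dbGetQuery(connection, "
    SELECT *
    FROM acoustic.detections
    WHERE animal_project_code = '2013_albertkanaal'
      AND detection_id > $1
    ORDER BY detection_id
    LIMIT 50000
  ", params = list(last_id))
  if (nrow(page) == 0) break
  last_id <- max(page$detection_id)
  pages[[length(pages) + 1]] <- page
}

detections <- dplyr::bind_rows(pages)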

@PietrH (Member) commented Oct 4, 2024

If this is the case, why does the query work when using a local database connection? get_acoustic_detections(animal_project_code = "2013_albertkanaal", api = FALSE)

@Stijn-VLIZ (Collaborator)

That's a question of how OpenCPU works.
OpenCPU starts a new R session on the server and then runs the etn package.
It also creates a session of its own, where OpenCPU stores information about your request and your result.
How and why this impacts memory so heavily, I don't know.

There might be a solution in an async worker doing the query, writing it to a file and then returning that.
In that case you could use the async endpoint and check when the data is ready.
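Sketching the client side of that idea (all endpoints and the job-id handling are hypothetical; whether OpenCPU can support this still needs to be checked):

base_url <- "https://opencpu.example.org/etnservice"  # placeholder URL

# Submit the long-running query; the server immediately returns a job id
job <- httr::POST(
  paste0(base_url, "/detections/submit"),
  body = list(animal_project_code = "2013_albertkanaal"),
  encode = "json"
)
job_id <- httr::content(job)$job_id

# Poll until the async worker reports that the result file is ready
repeat {
  status <- httr::content(httr::GET(paste0(base_url, "/jobs/", job_id)))
  if (identical(status$state, "done")) break
  Sys.sleep(10)
}

# Download the finished result
detections <- readr::read_csv(paste0(base_url, "/jobs/", job_id, "/result.csv"))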

@PietrH (Member) commented Oct 8, 2024

I'm not sure if OpenCPU supports async requests. I agree that async requests would be the best solution for big datasets.

  1. Currently, if I make changes I'm directly working on a live environment that has some (beta) users. Especially towards the future, how can I experiment with fixes without affecting the live API that people are using?
  2. Is it possible the request succeeds on a local database connection simply because the RStudio Server has much more memory? In that case, optimizing the query might be the answer.
