Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cannot remove local file after operation with duckplyr #257

Open
rafapereirabr opened this issue Aug 31, 2024 · 7 comments
Open

Cannot remove local file after operation with duckplyr #257

rafapereirabr opened this issue Aug 31, 2024 · 7 comments
Labels
duckdb 🦆 Issues where work in the duckb package is needed
Milestone

Comments

@rafapereirabr
Copy link

hi all, I'm tryin to use {duckplyr} as a dependency in one of my packages but I'm facing a strange problem. Basically, my package downloads and caches locally a few parquet files that it then uses to do a few basic operations. Due to CRAN policies , users should be able to manage / delete all files from the cache. However, for some reason, I cannot remove a local file after processing it with an operation with {duckplyr}. See reproducible example below.

obs. {duckplyr} is an amazing package, so thank you all for such a great work !

reprex

library(arrow)
library(duckdb)
library(duckplyr)

# download parquet files
url1 <- 'https://github.com/ipeaGIT/censobr/releases/download/v0.2.0/2010_emigration_v0.2.0.parquet'
url2 <- 'https://github.com/ipeaGIT/censobr/releases/download/v0.2.0/2010_households_v0.2.0.parquet'
curl::curl_download(url = url1, destfile = 'temp_parquet1.parquet')
curl::curl_download(url = url2, destfile = 'temp_parquet2.parquet')

# open data sets
arrow1 <- arrow::open_dataset('temp_parquet1.parquet')
arrow2 <- arrow::open_dataset('temp_parquet2.parquet')


# simple function that emulates part of what I do in my package
merge_fun <- function(arrow1, arrow2){

  key_vars <- c('code_muni', 'code_state', 'abbrev_state','name_state',
                'code_region', 'name_region', 'code_weighting', 'V0300')

  # drop repeated vars
  all_common_vars <- names(arrow1)[names(arrow1) %in% names(arrow2)]
  vars_to_drop <- setdiff(all_common_vars, key_vars)
  arrow2 <- dplyr::select(arrow2, -all_of(vars_to_drop)) |>
    dplyr::compute()

  # convert to duckdb
  arrow1 <- arrow::to_duckdb(arrow1)
  arrow2 <- arrow::to_duckdb(arrow2)

  # merge
  df_merge <- duckplyr::left_join(arrow1, arrow2) |>
            dplyr::compute()

  df_merge <- arrow::to_arrow(df_merge)
  return(df_merge)
}

# run function
test <- merge_fun( arrow1, arrow2)

Now here's when the error happens. After running the code above, I cannot remove the local file "temp_parquet1.parquet".

# try to remove file 1. 
file.exists('temp_parquet1.parquet')
#> TRUE
file.remove('temp_parquet1.parquet')

Warning message:
In file.remove("temp_parquet1.parquet") : cannot remove file 'temp_parquet1.parquet', reason 'Permission denied'

Surprisingly, there is no problem when I try to delete "temp_parquet2.parquet".

# remove file 2
file.exists('temp_parquet2.parquet')
#> TRUE
file.remove('temp_parquet2.parquet')
#> TRUE
@vcardonas
Copy link

I replicated the code and it worked fine for me.

> # try to remove file 1. 
> file.exists('temp_parquet1.parquet')
[1] TRUE
> file.remove('temp_parquet1.parquet')
[1] TRUE
> # remove file 2
> file.exists('temp_parquet2.parquet')
[1] TRUE
> file.remove('temp_parquet2.parquet')
[1] TRUE

@eitsupi
Copy link

eitsupi commented Sep 11, 2024

Is it possible that this is a Windows issue?
As I recall that readr 2.0.0 also had the lazy loading problem on Windows.
https://www.tidyverse.org/blog/2021/07/readr-2-0-0/#lazy-reading

@rafapereirabr Are you using Windows?

@rafapereirabr
Copy link
Author

Hi all. Yes, @eitsupi , I'm using windows and this is very much the problem. I thought that perhaps there was an issue with closing the database connection, though. So, I've tested a sliglty different version of my function and it seems to work.

The difference is that in my previous code (above in this issue) I used arrow::to_duckdb(arrow1) to create a table that is passed to duckplyr::left_join(). In the new code (see below), I explicility create a duckdb connection, register the arrow tables, follow to perform the duckplyr::left_join(), and then explicitly close the connection before returning the function result.

I think this could solve my problem, but I thought perhaps there is something here that could be incorporated into {duckplyr} to avoid the same issue for future Windows users.

library(arrow)
library(duckdb)
library(duckplyr)

# download parquet files
url1 <- 'https://github.com/ipeaGIT/censobr/releases/download/v0.2.0/2010_emigration_v0.2.0.parquet'
url2 <- 'https://github.com/ipeaGIT/censobr/releases/download/v0.2.0/2010_households_v0.2.0.parquet'
curl::curl_download(url = url1, destfile = 'temp_parquet1.parquet', quiet = F)
curl::curl_download(url = url2, destfile = 'temp_parquet2.parquet', quiet = F)

# open data sets
arrow1 <- arrow::open_dataset('temp_parquet1.parquet')
arrow2 <- arrow::open_dataset('temp_parquet2.parquet')


# simple function that emulates part of what I do in my package
merge_fun <- function(arrow1, arrow2){
  
  key_vars <- c('code_muni', 'code_state', 'abbrev_state','name_state',
                'code_region', 'name_region', 'code_weighting', 'V0300')
  
  # drop repeated vars
  all_common_vars <- names(arrow1)[names(arrow1) %in% names(arrow2)]
  vars_to_drop <- setdiff(all_common_vars, key_vars)
  arrow2 <- dplyr::select(arrow2, -all_of(vars_to_drop)) |>
    dplyr::compute()
  
  # OLD code # convert to duckdb
  # OLD code # arrow1 <- arrow::to_duckdb(arrow1)
  # OLD code # arrow2 <- arrow::to_duckdb(arrow2)
  
  # con <- DBI::dbConnect(duckdb::duckdb(), read_only = FALSE)
  con <- duckdb::dbConnect(duckdb::duckdb(), read_only = FALSE)
  duckdb::duckdb_register_arrow(con, 'arrow1', arrow1)
  duckdb::duckdb_register_arrow(con, 'arrow2', arrow2)
  
  # merge
  df_merge <- duckplyr::left_join(dplyr::tbl(con, "arrow1"),
                                  dplyr::tbl(con, "arrow2"))
  
  df_merge <- dplyr::compute(df_merge)
  
  df_merge <- arrow::to_arrow(df_merge)  |>
              dplyr::compute()
  
  # remove duckdb instance
  duckdb::duckdb_unregister_arrow(con, 'arrow1')
  duckdb::duckdb_unregister_arrow(con, 'arrow2')
  duckdb::dbDisconnect(con, shutdown = TRUE)
  rm(con)
  gc()
  return(df_merge)
}

# run function
test <- merge_fun( arrow1, arrow2)

# try to remove file 1.
file.exists('temp_parquet1.parquet')
file.remove('temp_parquet1.parquet')

# remove file 2
file.exists('temp_parquet2.parquet')
file.remove('temp_parquet2.parquet')


@eitsupi
Copy link

eitsupi commented Sep 12, 2024

In your example, I felt that you may have misunderstood the function of duckplyr.
arrow::to_duckdb() is a function for working with dbplyr, not duckdplyr, and should not be intended to be combined with duckplyr.
How about specifying dplyr::left_join() instead of duckplyr::left_join()?

Also, since duckdb is faster than arrow in my experience, it may be faster to use dbplyr (or duckplyr) completely instead of using arrow::to_duckdb().

@rafapereirabr
Copy link
Author

Hi @eitsupi . For the moment, using directly dplyr::left_join() on arrow tables is not an option in my case. The arrow package in cannot cannot cope with merging large data sets (see this issue)

This is what motivated me to use duckplyr::left_join(), because this way I would get best performance with duckdb but would still be able to work with more tidyverse friendly code

@eitsupi
Copy link

eitsupi commented Sep 13, 2024

In that case, why would you use the arrow package? Simply duckdb and dbplyr (or duckplyr) may be sufficient.

@rafapereirabr
Copy link
Author

rafapereirabr commented Sep 15, 2024

I only knew {arrow} when I started building the {censobr} package, it was a perfect choice because of the the really simple integration with {dplyr} . The only bottleneck I've found with {arrow} so far is merging really large data sets. I'm now considering importing {duckdb} or {duckplyr} to improve the performance of {censobr} when performing out-of-memory joins.
Now I see {duckplyr} doesn't have integration with arrow. right?

@krlmlr krlmlr added this to the 1.0.0 milestone Oct 16, 2024
@krlmlr krlmlr added the duckdb 🦆 Issues where work in the duckb package is needed label Oct 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
duckdb 🦆 Issues where work in the duckb package is needed
Projects
None yet
Development

No branches or pull requests

4 participants