# add progress bar #124
Notes from testing with the dev build of duckdb:

```r
progress <- function(x) {
  if (x < 100 && cli::cli_progress_num() == 0) {
    cli::cli_progress_bar("Duckdb SQL", total = 100, .envir = .GlobalEnv)
  }
  cli::cli_progress_update(set = x, .envir = .GlobalEnv)
}
options("duckdb.progress_display" = progress)
```

Source: duckdb/duckdb-r#951 (comment). But I'm sure …

This is what it looks like now if we try to convert all v2 data:

[screenshot]

As noted above, on our side, it will probably be sufficient to check …
---

Tested on a long CSV to DuckDB convert job using the version in the associated branch. It just gets stuck on:

[screenshot of the stuck progress bar]

Even though the job completes successfully. Maybe I should test with Python duckdb to see if its conversion progress bar has the same issue.
---

Oh... Will be super nice to have in any case, good luck getting it working!
---

## Testing the progress bar for analysis tasks

### Install dev version of duckdb

```r
# build from source, but that takes a lot of time on a consumer machine
remotes::install_github("meztez/duckdb-r")

# or get the precompiled binaries of 'meztez/duckdb-r'
install.packages("duckdb", repos = c("https://e-kotov.r-universe.dev", "https://cloud.r-project.org"))
```

### Get the data

For reference, for this particular test I had all three years of origin-destination district level data, which can be retrieved (warning, it's hundreds of GB...) using:

```r
remotes::install_github("rOpenSpain/spanishoddata@124-add-progress-bar", force = TRUE)
library(spanishoddata)
spod_set_data_dir("./data")
spod_download(
  type = "od",
  zones = "distr",
  dates = spod_get_valid_dates(ver = 2),
  max_download_size_gb = 300,
  return_local_file_paths = TRUE
)
```

Convert for analysis to DuckDB:

```r
od_distr_db_v2 <- spod_convert(
type = "od",
zones = "distr",
save_format = "duckdb",
dates = "cached_v2",
max_mem_gb = mem,
max_n_cpu = ncpu,
overwrite = TRUE
)
```

And finally connect as a tbl_connection:

```r
od_distr <- spod_connect(
  od_distr_db_v2,
  max_mem_gb = 30, # choose your own memory limit
  max_n_cpu = 16   # choose your own cpu limit
)
```
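A query that must scan the whole table doubles as a quick sanity check of the connection and, once the progress display below is enabled, as a way to watch the bar move. This is plain dplyr/dbplyr, nothing spanishoddata-specific; `n_trips` is the column used throughout these tests:

```r
library(dplyr)

# A grand total forces a full scan, which (per the tests below) is the
# case where the progress bar behaves best.
od_distr |>
  summarise(total_trips = sum(n_trips, na.rm = TRUE)) |>
  collect()
```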
To make the progress bar print, do this (the current pull request to duckdb does not suggest a specific implementation for this, except in the comments):

```r
progress <- function(x) {
  if (x < 100 && cli::cli_progress_num() == 0) {
    cli::cli_progress_bar("Duckdb SQL", total = 100, .envir = .GlobalEnv)
  }
  cli::cli_progress_update(set = x, .envir = .GlobalEnv)
}
options("duckdb.progress_display" = progress)
```

### The tests

#### 1 of 16 months

During the analysis of a large batch, e.g.:

```r
od_distr |> # all 3 years of data in duckdb db file
  filter(month == 1, year == 2022) |>
  group_by(id_origin, id_destination, activity_origin) |>
  summarize(n_trips = sum(n_trips, na.rm = TRUE), .groups = "drop") |>
  collect()
```

The progress is also not very reliable:

[screenshot]
So it starts with 1%, goes to 2%, then hangs for some time, jumps to 99%, and finishes the job. I'm guessing that happens because the full dataset is actually 3 full years' worth of data and I am filtering to the first month.

#### Two months somewhere in the middle

If I do this:

```r
od_distr |> # all 3 years of data in duckdb db file
  filter(month %in% c(1, 10) & year == 2023) |>
  group_by(id_origin, id_destination, activity_destination) |>
  summarize(n_trips = sum(n_trips, na.rm = TRUE), .groups = "drop") |>
  collect()
```

I basically jump to the middle(-ish) (timeline-wise), so the progress bar starts at 32%. Makes sense. It then runs for a bit, incrementing the %, and then finishes the job by suddenly jumping to 100% from around 50%:

[screenshot]
#### Last month of the period

```r
od_distr |>
  filter(year == 2024, month == 12) |>
  group_by(id_origin, id_destination, activity_destination) |>
  summarize(n_trips = sum(n_trips, na.rm = TRUE), .groups = "drop") |>
  collect()
```
As expected, the progress jumps to the end (though not the very end; the data inside the duckdb file must be somewhat shuffled, depending on how multithreaded the CSV file readers and/or duckdb file writers were).
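To verify that shuffle directly, one option is DuckDB's `rowid` pseudo-column — a sketch that assumes `od_distr` is a plain table reference, so dbplyr can report the connection and table name behind it:

```r
# Physical row ranges per month: overlapping first_row/last_row ranges
# across months would confirm the writer interleaved rows instead of
# keeping strict chronological order.
con <- dbplyr::remote_con(od_distr)
tbl_name <- dbplyr::remote_name(od_distr)
DBI::dbGetQuery(con, paste0(
  "SELECT year, month, min(rowid) AS first_row, max(rowid) AS last_row ",
  "FROM ", tbl_name,
  " GROUP BY year, month ORDER BY first_row"
))
```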
#### Full data set

If we don't do any filtering:

[screenshot]

In this case, the progress bar seems to consistently go from 0% to 100%. This was tested with both a forced single core and the max cores I had on the testing machine (x16), to rule out jumps from 1-2% to 100% caused by parallel processes finishing around the same time and updating the overall progress together.

### Conclusion

So far, the conclusion is that the progress bar's usefulness depends on the total size of the data and the portion we are (or are not) filtering it to. For operations on the full dataset, the progress is quite useful. When the dataset is very large but we are only filtering to a small portion of it, it is quite useless.
---

This is an exciting development: {duckdb} will soon get the progress bar for lengthy operations (any queries, including file import, file writing, any computation, etc.)!

Continuing #35, now that the {duckdb} R package will soon finally get the progress bar (duckdb/duckdb-r#951), we should definitely have it enabled by default, as most operations take a significant amount of time.

My tests show that the dev version of {duckdb} is already working well, but the exact code needed for this to work still needs to be figured out.

And obviously, we will depend on this working only with the latest version of {duckdb} installed by the user, so we'll add the package version check (sketched after this list):

- If an old {duckdb} is installed, we'll provide the same message as now (the suggestion to monitor the file size of the duckdb file... which is lame, but the best we could do previously) + suggest installing the new {duckdb} to show the progress bar. For non-conversion tasks, such as group by + summarize, we could not and still cannot do anything on our side in this case.
- If the new {duckdb} is installed, show the progress bar.
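A minimal sketch of what that check could look like (the helper name is hypothetical, and the `"1.1.0"` threshold is a placeholder — the real minimum is whichever {duckdb} release ships duckdb/duckdb-r#951):

```r
# Hypothetical helper, not actual spanishoddata API.
duckdb_has_progress_bar <- function(min_version = "1.1.0") {
  # placeholder threshold: swap in the first {duckdb} release
  # that includes the progress bar from duckdb/duckdb-r#951
  utils::packageVersion("duckdb") >= package_version(min_version)
}

if (duckdb_has_progress_bar()) {
  # new {duckdb}: enable the progress display (callback from the tests above)
  progress <- function(x) {
    if (x < 100 && cli::cli_progress_num() == 0) {
      cli::cli_progress_bar("Duckdb SQL", total = 100, .envir = .GlobalEnv)
    }
    cli::cli_progress_update(set = x, .envir = .GlobalEnv)
  }
  options(duckdb.progress_display = progress)
} else {
  # old {duckdb}: keep the current advice + nudge towards upgrading
  message(
    "DuckDB progress bars need a newer {duckdb}; meanwhile you can ",
    "monitor the size of the output .duckdb file. Install the latest ",
    "{duckdb} to get a progress bar."
  )
}
```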