
add progress bar #124

Open · e-kotov opened this issue Jan 6, 2025 · 4 comments
Labels: enhancement (New feature or request)

e-kotov commented Jan 6, 2025

This is an exciting development: {duckdb} will soon get a progress bar for lengthy operations (any query, including file import, file writing, any computation, etc.)!

Continuing #35: now that the {duckdb} R package will finally get a progress bar (duckdb/duckdb-r#951), we should definitely enable it by default, as most operations take a significant amount of time.

My tests show that the dev version of {duckdb} already works well, but the exact code needed to hook into it still needs to be figured out.

And obviously, this will only work if the user has the latest version of {duckdb} installed, so we'll add a package version check (a rough sketch follows after the list below):

  • if an old {duckdb} is installed, we'll show the same message as now (the suggestion to monitor the file size of the duckdb file... which is lame, but the best we could do previously), plus a suggestion to install the new {duckdb} to get the progress bar. For non-conversion tasks, such as group by + summarize, we could not and still cannot do anything on our side in this case.
  • if the new {duckdb} is installed, show the progress bar.
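
Something like this for the version gate (a rough sketch; the threshold version is a placeholder assumption until the duckdb release that ships the progress bar is known):

duckdb_has_progress_bar <- function() {
  # placeholder threshold; adjust to the actual release version
  utils::packageVersion("duckdb") >= package_version("1.1.3.9000")
}

if (duckdb_has_progress_bar()) {
  # register the cli-based progress callback (see the next comment)
} else {
  message(
    "Monitor the size of the output .duckdb file for rough progress. ",
    "Install a newer {duckdb} to get a real progress bar."
  )
}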
@e-kotov e-kotov self-assigned this Jan 7, 2025
@e-kotov e-kotov added the enhancement New feature or request label Jan 7, 2025

e-kotov commented Jan 7, 2025

Notes from testing with the dev build of duckdb. Basically, the progress bar works out of the box. Sort of. We'll see what the final implementation of the actual progress bar will be. For now, we would need to create a progress bar function internally, like so:

progress <- function(x) {
  # duckdb calls this with x = percentage complete (0-100);
  # create the cli bar once, on the first callback
  if (x < 100 && cli::cli_progress_num() == 0) {
    cli::cli_progress_bar("Duckdb SQL", total = 100, .envir = .GlobalEnv)
  }
  cli::cli_progress_update(set = x, .envir = .GlobalEnv)
}

options("duckdb.progress_display" = progress)

Source: duckdb/duckdb-r#951 (comment).

But I'm sure {duckdb} maintainers will opt for some default option.

This is what it looks like now if we try to convert all v2 data:

[Screenshot (2025-01-08): cli progress bar printed while converting all v2 data]

As noted above, on our side it will probably be sufficient to check the {duckdb} version, skip the long warning message suggesting to check the file size while the function runs, and just let the progress bar do its job.


e-kotov commented Jan 9, 2025

Tested on a long CSV-to-DuckDB conversion job using the version in the associated branch. It just gets stuck on:

Duckdb SQL ■                                  1% | ETA:  2d

Even though the job completes successfully.

Maybe I should test with the Python duckdb package to see whether its conversion progress bar has the same issue.

Robinlovelace (Collaborator) commented:

Oh... Will be super nice to have in any case, good luck getting it working!


e-kotov commented Jan 9, 2025

Testing the progress bar for analysis tasks

Install dev version of duckdb

# build from source, but that takes a lot of time on a consumer machine
remotes::install_github("meztez/duckdb-r")

# or get the precompiled binaries of 'meztez/duckdb-r'
install.packages("duckdb", repos = c("https://e-kotov.r-universe.dev", "https://cloud.r-project.org"))
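
A quick sanity check that the dev build is the version actually in use:

packageVersion("duckdb")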

Get the data

For reference, for this particular test I had all three years of origin-destination district-level data, which can be retrieved (warning: it's hundreds of GB...) using:

remotes::install_github("rOpenSpain/spanishoddata@124-add-progress-bar", force = TRUE)
library(spanishoddata)
spod_set_data_dir("./data")

spod_download(
  type = "od",
  zones = "distr",
  dates = spod_get_valid_dates(ver = 2),
  max_download_size_gb = 300,
  return_local_file_paths = TRUE
)

Convert to DuckDB for analysis:

od_distr_db_v2 <- spod_convert(
  type = "od",
  zones = "distr",
  save_format = "duckdb",
  dates = "cached_v2",
  max_mem_gb = mem, # set to your own memory limit, e.g. mem <- 30
  max_n_cpu = ncpu, # set to your own cpu limit, e.g. ncpu <- 16
  overwrite = TRUE
)

And finally connect as a tbl_connection:

od_distr <- spod_connect(
  od_distr_db_v2,
  max_mem_gb = 30, # choose your own memory limit
  max_n_cpu = 16 # choose your own cpu limit
)

spod_connect() applies these configuration settings to the DuckDB connection behind the od_distr tbl_connection:

SET enable_progress_bar = true;
SET enable_progress_bar_print = true;
SET progress_bar_time = 500;
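
For reference, the same settings can be applied by hand on a plain duckdb connection via DBI (a sketch using a throwaway in-memory database; not necessarily how spod_connect() does it internally):

library(DBI)
library(duckdb)

con <- dbConnect(duckdb()) # in-memory database, for illustration only
dbExecute(con, "SET enable_progress_bar = true;")
dbExecute(con, "SET enable_progress_bar_print = true;")
dbExecute(con, "SET progress_bar_time = 500;") # ms before the bar appears
dbDisconnect(con, shutdown = TRUE)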

To make the progress bar print, do this (the current pull request to duckdb does not prescribe a specific implementation for this, apart from a suggestion in the comments):

progress <- function(x) {
  # duckdb calls this with x = percentage complete (0-100);
  # create the cli bar once, on the first callback
  if (x < 100 && cli::cli_progress_num() == 0) {
    cli::cli_progress_bar("Duckdb SQL", total = 100, .envir = .GlobalEnv)
  }
  cli::cli_progress_update(set = x, .envir = .GlobalEnv)
}

options("duckdb.progress_display" = progress)

The tests

1 of 16 months

During the analysis of a large batch, e.g.:

od_distr |> # all 3 years of data in duckdb db file
  filter(month == 1, year == 2022) |>
  group_by(id_origin, id_destination, activity_origin) |> 
  summarize(n_trips = sum(n_trips, na.rm = TRUE), .groups = "drop") |> 
  collect()

The progress bar is also not very reliable here:

Duckdb SQL ■                                  1% | ETA: 10h
Duckdb SQL ■                                  2% | ETA:  5h
Duckdb SQL ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■   99% | ETA:  4s

So it starts at 1%, goes to 2%, then hangs for some time, jumps to 99%, and finishes the job.

I'm guessing that happens because the full dataset is actually 3 full years' worth of data and I am filtering down to just the first month.
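
As an inspection aid (not a fix), dbplyr's show_query() prints the SQL that duckdb will run, which at least confirms the filter is pushed down into the query rather than applied after the scan:

od_distr |>
  filter(month == 1, year == 2022) |>
  group_by(id_origin, id_destination, activity_origin) |>
  summarize(n_trips = sum(n_trips, na.rm = TRUE), .groups = "drop") |>
  dplyr::show_query() # prints the generated SQL instead of executing it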

Two months somewhere in the middle

If I do this:

od_distr |> # all 3 years of data in duckdb db file
  filter(month %in% c(1,10) & year == 2023) |>
  group_by(id_origin, id_destination, activity_destination) |> 
  summarize(n_trips = sum(n_trips, na.rm = TRUE), .groups = "drop") |> 
  collect()

Timeline-wise, I basically jump to the middle(-ish), so the progress bar starts at 32%, which makes sense. It then runs for a bit, incrementing the percentage, and again finishes the job by suddenly jumping to 100% from around 50%:

Duckdb SQL ■■■■■■■■■■■■■■■■                  50% | ETA: 32m

Last month of the period

od_distr |>
  filter(year == 2024, month == 12) |>
  group_by(id_origin, id_destination, activity_destination) |>
  summarize(n_trips = sum(n_trips, na.rm = TRUE), .groups = "drop") |>
  collect()

Duckdb SQL ■■■■■■■■■■■■■■■■■■■■■■■           72% | ETA:  3m

As expected, the progress jumps close to the end (though not the very end; the data inside the duckdb file must be somewhat shuffled, depending on how multithreaded the CSV readers and/or duckdb file writers were).

Full data set

If we don't do any filtering:

test1 <- od_distr |>  # all 3 years of data in duckdb db file
  group_by(id_origin, id_destination, activity_destination) |> 
  summarize(n_trips = sum(n_trips, na.rm = TRUE), .groups = "drop") |> 
  collect()

In this case, the progress bar seems to consistently go from 0% to 100%.

This was tested with both a forced single core and the maximum number of cores on the testing machine (16), to rule out that the jumps from 1-2% to 100% were caused by parallel processes finishing at around the same time and updating the overall progress at once.
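
For the single-core run, reconnecting with one thread is enough, relying on the max_n_cpu argument shown above (the equivalent on a raw DBI connection would be "SET threads TO 1;"):

od_distr_1cpu <- spod_connect(
  od_distr_db_v2,
  max_mem_gb = 30,
  max_n_cpu = 1 # force single-threaded execution
)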

Conclusion

So far, the conclusion is that the usefulness of the progress bar depends on the total size of the data and the portion we are (or are not) filtering it down to. For operations on the full dataset, the progress bar is quite useful. When the dataset is very large but we only filter to a small portion of it, it is quite useless.
