
add progress bar #124

Open · e-kotov opened this issue Jan 6, 2025 · 4 comments
Labels: enhancement (New feature or request)

e-kotov commented Jan 6, 2025

This is an exciting development: {duckdb} will soon get a progress bar for lengthy operations (any query, including file import, file writing, any computation, etc.)!

Continuing #35: now that the {duckdb} R package will finally get a progress bar (duckdb/duckdb-r#951), we should definitely enable it by default, as most operations take a significant amount of time.

My tests show that the dev version of {duckdb} already works well, but the exact code needed to hook into it still needs to be figured out.

And obviously, this will only work if the user has the latest version of {duckdb} installed, so we'll add a package version check (a rough sketch follows after the list below):

  • if an old {duckdb} is installed, we'll show the same message as now (the suggestion to monitor the file size of the duckdb file... which is lame, but the best we could do previously), plus a suggestion to install the new {duckdb} to get the progress bar. For non-conversion tasks, such as group by + summarize, we could not and still cannot do anything on our side in this case.
  • if the new {duckdb} is installed, show the progress bar.
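
Something like this for the version gate (a rough sketch; the threshold version is a placeholder assumption until the duckdb release that ships the progress bar is known):

duckdb_has_progress_bar <- function() {
  # placeholder threshold; adjust to the actual release version
  utils::packageVersion("duckdb") >= package_version("1.1.3.9000")
}

if (duckdb_has_progress_bar()) {
  # register the cli-based progress callback (see the next comment)
} else {
  message(
    "Monitor the size of the output .duckdb file for rough progress. ",
    "Install a newer {duckdb} to get a real progress bar."
  )
}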
@e-kotov e-kotov self-assigned this Jan 7, 2025
@e-kotov e-kotov added the enhancement New feature or request label Jan 7, 2025

e-kotov commented Jan 7, 2025

Notes from testing with the dev build of duckdb. Basically, the progress bar works out of the box. Sort of. We'll see what the final implementation of the actual progress bar will be. For now, we would need to create a progress bar function internally, like so:

progress <- function(x) {
  # duckdb calls this with x = percentage complete (0-100);
  # create the cli bar once, on the first callback
  if (x < 100 && cli::cli_progress_num() == 0) {
    cli::cli_progress_bar("Duckdb SQL", total = 100, .envir = .GlobalEnv)
  }
  cli::cli_progress_update(set = x, .envir = .GlobalEnv)
}

options("duckdb.progress_display" = progress)

Source: duckdb/duckdb-r#951 (comment).

But I'm sure {duckdb} maintainers will opt for some default option.

This is what it looks like now if we try to convert all v2 data:

[Screenshot (2025-01-08): cli progress bar printed while converting all v2 data]

As noted above, on our side it will probably be sufficient to check the {duckdb} version, skip the long warning message suggesting to check the file size while the function runs, and just let the progress bar do its job.


e-kotov commented Jan 9, 2025

Tested on a long CSV-to-DuckDB conversion job using the version in the associated branch. It just gets stuck on:

Duckdb SQL ■                                  1% | ETA:  2d

Even though the job completes successfully.

Maybe I should test with the Python duckdb package to see whether its conversion progress bar has the same issue.

Robinlovelace (Collaborator) commented:

Oh... Will be super nice to have in any case, good luck getting it working!


e-kotov commented Jan 9, 2025

Testing the progress bar for analysis tasks

Install dev version of duckdb

# build from source, but that takes a lot of time on a consumer machine
remotes::install_github("meztez/duckdb-r")

# or get the precompiled binaries of 'meztez/duckdb-r'
install.packages("duckdb", repos = c("https://e-kotov.r-universe.dev", "https://cloud.r-project.org"))
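
A quick sanity check that the dev build is the version actually in use:

packageVersion("duckdb")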

Get the data

For reference, for this particular test I had all three years of origin-destination district-level data, which can be retrieved (warning: it's hundreds of GB...) using:

remotes::install_github("rOpenSpain/spanishoddata@124-add-progress-bar", force = TRUE)
library(spanishoddata)
spod_set_data_dir("./data")

spod_download(
  type = "od",
  zones = "distr",
  dates = spod_get_valid_dates(ver = 2),
  max_download_size_gb = 300,
  return_local_file_paths = TRUE
)

Convert to DuckDB for analysis:

od_distr_db_v2 <- spod_convert(
  type = "od",
  zones = "distr",
  save_format = "duckdb",
  dates = "cached_v2",
  max_mem_gb = mem, # set to your own memory limit, e.g. mem <- 30
  max_n_cpu = ncpu, # set to your own cpu limit, e.g. ncpu <- 16
  overwrite = TRUE
)

And finally connect as a tbl_connection:

od_distr <- spod_connect(
  od_distr_db_v2,
  max_mem_gb = 30, # choose your own memory limit
  max_n_cpu = 16 # choose your own cpu limit
)

spod_connect() applies these configuration settings to the DuckDB connection behind the od_distr tbl_connection:

SET enable_progress_bar = true;
SET enable_progress_bar_print = true;
SET progress_bar_time = 500;
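
For reference, the same settings can be applied by hand on a plain duckdb connection via DBI (a sketch using a throwaway in-memory database; not necessarily how spod_connect() does it internally):

library(DBI)
library(duckdb)

con <- dbConnect(duckdb()) # in-memory database, for illustration only
dbExecute(con, "SET enable_progress_bar = true;")
dbExecute(con, "SET enable_progress_bar_print = true;")
dbExecute(con, "SET progress_bar_time = 500;") # ms before the bar appears
dbDisconnect(con, shutdown = TRUE)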

To make the progress bar print, do this (the current pull request to duckdb does not prescribe a specific implementation for this, apart from a suggestion in the comments):

progress <- function(x) {
  # duckdb calls this with x = percentage complete (0-100);
  # create the cli bar once, on the first callback
  if (x < 100 && cli::cli_progress_num() == 0) {
    cli::cli_progress_bar("Duckdb SQL", total = 100, .envir = .GlobalEnv)
  }
  cli::cli_progress_update(set = x, .envir = .GlobalEnv)
}

options("duckdb.progress_display" = progress)

The tests

1 of 16 months

During the analysis of a large batch, e.g.:

od_distr |> # all 3 years of data in duckdb db file
  filter(month == 1, year == 2022) |>
  group_by(id_origin, id_destination, activity_origin) |> 
  summarize(n_trips = sum(n_trips, na.rm = TRUE), .groups = "drop") |> 
  collect()

The progress bar is also not very reliable here:

Duckdb SQL ■                                  1% | ETA: 10h
Duckdb SQL ■                                  2% | ETA:  5h
Duckdb SQL ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■   99% | ETA:  4s

So it starts at 1%, goes to 2%, then hangs for some time, jumps to 99%, and finishes the job.

I'm guessing that happens because the full dataset is actually 3 full years' worth of data and I am filtering down to just the first month.
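
As an inspection aid (not a fix), dbplyr's show_query() prints the SQL that duckdb will run, which at least confirms the filter is pushed down into the query rather than applied after the scan:

od_distr |>
  filter(month == 1, year == 2022) |>
  group_by(id_origin, id_destination, activity_origin) |>
  summarize(n_trips = sum(n_trips, na.rm = TRUE), .groups = "drop") |>
  dplyr::show_query() # prints the generated SQL instead of executing it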

Two months somewhere in the middle

If I do this:

od_distr |> # all 3 years of data in duckdb db file
  filter(month %in% c(1,10) & year == 2023) |>
  group_by(id_origin, id_destination, activity_destination) |> 
  summarize(n_trips = sum(n_trips, na.rm = TRUE), .groups = "drop") |> 
  collect()

Timeline-wise, I basically jump to the middle(-ish), so the progress bar starts at 32%, which makes sense. It then runs for a bit, incrementing the percentage, and again finishes the job by suddenly jumping to 100% from around 50%:

Duckdb SQL ■■■■■■■■■■■■■■■■                  50% | ETA: 32m

Last month of the period

od_distr |>
  filter(year == 2024, month == 12) |>
  group_by(id_origin, id_destination, activity_destination) |>
  summarize(n_trips = sum(n_trips, na.rm = TRUE), .groups = "drop") |>
  collect()

Duckdb SQL ■■■■■■■■■■■■■■■■■■■■■■■           72% | ETA:  3m

As expected, the progress jumps close to the end (though not the very end; the data inside the duckdb file must be somewhat shuffled, depending on how multithreaded the CSV readers and/or duckdb file writers were).

Full data set

If we don't do any filtering:

test1 <- od_distr |>  # all 3 years of data in duckdb db file
  group_by(id_origin, id_destination, activity_destination) |> 
  summarize(n_trips = sum(n_trips, na.rm = TRUE), .groups = "drop") |> 
  collect()

In this case, the progress bar seems to consistently go from 0% to 100%.

This was tested with both a forced single core and the maximum number of cores on the testing machine (16), to rule out that the jumps from 1-2% to 100% were caused by parallel processes finishing at around the same time and updating the overall progress at once.
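
For the single-core run, reconnecting with one thread is enough, relying on the max_n_cpu argument shown above (the equivalent on a raw DBI connection would be "SET threads TO 1;"):

od_distr_1cpu <- spod_connect(
  od_distr_db_v2,
  max_mem_gb = 30,
  max_n_cpu = 1 # force single-threaded execution
)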

Conclusion

So far, the conclusion is that the usefulness of the progress bar depends on the total size of the data and the portion we are (or are not) filtering it down to. For operations on the full dataset, the progress bar is quite useful. When the dataset is very large but we only filter to a small portion of it, it is quite useless.
