Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding progress bar display func #951

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open

Adding progress bar display func #951

wants to merge 2 commits into from

Conversation

meztez
Copy link

@meztez meztez commented Jan 5, 2025

For #199.

@meztez
Copy link
Author

meztez commented Jan 5, 2025

Not printing, looking for further insight. @krlmlr? Any ideas.

@krlmlr
Copy link
Collaborator

krlmlr commented Jan 5, 2025

Thanks for working on it! I think this should be an R option with a callback that is called when the option is set, see duckdb.materialize_callback for an example.

@krlmlr
Copy link
Collaborator

krlmlr commented Jan 5, 2025

Or, perhaps even a slot in the duckdb_connection class?

@meztez
Copy link
Author

meztez commented Jan 5, 2025

Well, I'm almost done with the callback. So I'll test that first.

@meztez meztez force-pushed the main branch 3 times, most recently from 0e03670 to b3b600e Compare January 6, 2025 03:52
@meztez
Copy link
Author

meztez commented Jan 6, 2025

@meztez
Copy link
Author

meztez commented Jan 6, 2025

All right, it works, now ironing out the bugs.

@meztez
Copy link
Author

meztez commented Jan 6, 2025

library(duckdb)
library(cli)

progress <- function(x) {
  if (cli::cli_progress_num() == 0) {
    cli::cli_progress_bar("Duckdb SQL", total = 100, .envir = .GlobalEnv)
  }
  cli::cli_progress_update(set = x, .envir = .GlobalEnv)
  if (x > 100) {
    cli::cli_progress_done(.envir = .GlobalEnv)
  }
}
options("duckdb.progress_display" = progress)
conn <- duckdb::dbConnect(duckdb::duckdb())
duckdb::dbSendQuery(conn, "SET progress_bar_time = 0;")
q <- "CREATE OR REPLACE TABLE BOB AS (
      SELECT * FROM 'ldbc-sf300-comments-creationDate.parquet')"
duckdb::dbSendQuery(conn, q)

Mytherin added a commit to duckdb/duckdb that referenced this pull request Jan 6, 2025
#ifndef DUCKDB_DISABLE_PRINT seems redundant since it is already used in
printer.cpp and it prevents from using a display set via
config.create_display_func when compiled with flag
-DDUCKDB_DISABLE_PRINT, like the duckdb-r package, where I'm trying to
implement a display.

https://github.com/duckdb/duckdb/blob/main/src/common/printer.cpp
duckdb/duckdb-r#951

PrintProgress -> TerminalProgressBarDisplay::Update ->
TerminalProgressBarDisplay::PrintProgressInternal -> Printer::RawPrint
and there is a macro there.

Plus there is already a config option to enable_progress_bar and default
is FALSE.

So. Can it be remove?
cc: @krlmlr
@meztez
Copy link
Author

meztez commented Jan 6, 2025

I'm done on this one. Let me know if this works for you.

@e-kotov
Copy link

e-kotov commented Jan 6, 2025

Testing with {spanishoddata}:

library(spanishoddata)
library(duckdb)
library(tidyverse)


x_dates <- c("2022-01-01", "2022-01-02", "2022-01-03", "2022-01-04")
x <- spod_get(type = "od", zones = "distr", dates = x_dates)

dbGetQuery(x$src$con, "SELECT current_setting('enable_progress_bar');")
dbSendQuery(x$src$con, "SET enable_progress_bar = true;")
dbGetQuery(x$src$con, "SELECT current_setting('enable_progress_bar');")


progress <- function(x) {
  if (cli::cli_progress_num() == 0) {
    cli::cli_progress_bar("Duckdb SQL", total = 100, .envir = .GlobalEnv)
  }
  cli::cli_progress_update(set = x, .envir = .GlobalEnv)
  if (x > 100) {
    cli::cli_progress_done(.envir = .GlobalEnv)
  }
}

options("duckdb.progress_display" = progress)
duckdb::dbSendQuery(x$src$con, "SET progress_bar_time = 0;")

xx <- x |> group_by(id_origin, date, activity_origin) |> summarise(mean_trips = mean(n_trips)) |> collect()
Screenshot 2025-01-07 at 00 31 57

And it works!

@meztez do we have to manually define the progress function though...? what is the final idea of this PR? I would expect that progress bar just 'magically' appears as soon as we do:

dbGetQuery(x$src$con, "SELECT current_setting('enable_progress_bar');")

p.s. in my case x$src$con is because spod_get returns tbl_duckdb_connection, so you have to reach out to the connection itself.

@meztez
Copy link
Author

meztez commented Jan 6, 2025

It could provide a dummy default. It's just a function(x) called with progress percentage from within duckdb-r.

I'm not the package maintainer and I just needed it for a deliverable, so whatever works is fine by me.

@krlmlr
Copy link
Collaborator

krlmlr commented Jan 7, 2025

Thanks for the PR!

Looking at the implementation, I think the callback function should be a slot in the connection object. There could be basic reporting (opt-out, in interactive mode only) in the duckdb R package, and more sophisticated progress in duckplyr.

@e-kotov
Copy link

e-kotov commented Jan 7, 2025

I'm not the package maintainer and I just needed it for a deliverable, so whatever works is fine by me.

@meztez totally makes sense. Thanks for the work in the internals to make this possible!

Looking forward for this to be merged!

@HenrikBengtsson
Copy link

In the above examples, (x > 100) indicates that the processing is complete. Shouldn't that be (x >= 100)? I think it's more common to consider 100% to indicate "done" than "still processing".

@meztez
Copy link
Author

meztez commented Jan 7, 2025

In the above examples, (x > 100) indicates that the processing is complete. Shouldn't that be (x >= 100)? I think it's more common to consider 100% to indicate "done" than "still processing".

progress <- function(x) {
  if (x < 100 && cli::cli_progress_num() == 0) {
    cli::cli_progress_bar("Duckdb SQL", total = 100, .envir = .GlobalEnv, )
  }
  cli::cli_progress_update(set = x, .envir = .GlobalEnv)
}

options("duckdb.progress_display" = progress)

@e-kotov
Copy link

e-kotov commented Jan 9, 2025

I have done some more testing here: rOpenSpain/spanishoddata#124 (comment).

To summarize, it seems like at the moment the progress bar behavior is dependent on the data size and if you are filtering to any particular part of the data. That is, if you have 100GB of data, and your query is running on the data that is stored somewhere in the beginning of the file (I used duckdb file format), then you will get some progress from 1% to 3%, and then it will just jump to 100%. Similarly, if you filter to the data that is somewhere in the end of the database file, it will jump to 70% or 90% from the very beginning of the query.

So at the moment the progress bar implementation is not very informative and useful.

The question is if this is an upstream problem (and normal behavior for DuckDB), or if this is an artifact of how the progress bar reporting was implemented in the R package in this PR.

@meztez @krlmlr do you have any insights if what I'm describing is expected behavior for the progress bar, and if not if this can be fixed?

@meztez
Copy link
Author

meztez commented Jan 9, 2025

@e-kotov Try the same thing with duckdb cli and see if you get the same behavior.
https://duckdb.org/docs/installation/index?version=stable&environment=cli&platform=win&download_method=package_manager&architecture=x86_64

I really do not see anything from the R package side that would make this behavior any different. Also this quick search in duckdb/duckdb issues : duckdb/duckdb#12454

@e-kotov
Copy link

e-kotov commented Jan 9, 2025

@e-kotov Try the same thing with duckdb cli and see if you get the same behavior.
https://duckdb.org/docs/installation/index?version=stable&environment=cli&platform=win&download_method=package_manager&architecture=x86_64

So far tried with Python module, and got similar behaviour.

Also this quick search in duckdb/duckdb issues : duckdb/duckdb#12454

I saw this one, but they said this particular one was fixed. There were a few more similar issues in the upstream repo, but nothing of the sort that I described.

I will report this upstream.

@meztez
Copy link
Author

meztez commented Jan 9, 2025

Thanks for the PR!

Looking at the implementation, I think the callback function should be a slot in the connection object. There could be basic reporting (opt-out, in interactive mode only) in the duckdb R package, and more sophisticated progress in duckplyr.

@krlmlr
If its a slot on the connection class, how would it determine in rapi_connect if it needs to set create_display_func?
Did you mean a slot on the Driver class? I don't grok the logic you had in mind for the slot. Sorry.

@krlmlr
Copy link
Collaborator

krlmlr commented Jan 20, 2025

@hannes: Can we guarantee that the progress update functions are always called from the "main" thread (the one that initiated the execution)?

@krlmlr
Copy link
Collaborator

krlmlr commented Jan 20, 2025

It does look like the calls are always coming from the same thread ID, but better to be sure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants