-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
duckplyr_df_from_csv
: overriding auto-detect for a single column apparently not available
#223
Labels
feature
a feature request or enhancement
Comments
Thanks, Ryan, this is helpful! This looks like it should work, here are a few more experiments: text <- "tconst\taverageRating\tnumVotes\ntt0000001\t5.7\t2051\ntt0000002\t5.7\t274\ntt0000003\t6.5\t2005"
writeLines(text, "ratings.tsv")
# Works but need to specify redundant type for the `averageRating` column
duckplyr::duckplyr_df_from_csv(
path = "ratings.tsv",
options = list(
delim = "\t",
types = list(c("VARCHAR", "DOUBLE", "INTEGER"))
)
)
#> No duckplyr fallback reports ready for upload.
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> read_csv_auto(ratings.tsv)
#>
#> ---------------------
#> -- Result Columns --
#> ---------------------
#> - tconst (VARCHAR)
#> - averageRating (DOUBLE)
#> - numVotes (INTEGER)
#>
#> # A tibble: 3 × 3
#> tconst averageRating numVotes
#> <chr> <dbl> <int>
#> 1 tt0000001 5.7 2051
#> 2 tt0000002 5.7 274
#> 3 tt0000003 6.5 2005
# Doesn't change `numVotes`
duckplyr::duckplyr_df_from_csv(
path = "ratings.tsv",
options = list(
delim = "\t",
types = list(c(tconst = "VARCHAR", numVotes = "INTEGER"))
)
)
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> read_csv_auto(ratings.tsv)
#>
#> ---------------------
#> -- Result Columns --
#> ---------------------
#> - tconst (VARCHAR)
#> - averageRating (INTEGER)
#> - numVotes (BIGINT)
#>
#> # A tibble: 3 × 3
#> tconst averageRating numVotes
#> <chr> <int> <dbl>
#> 1 tt0000001 6 2051
#> 2 tt0000002 6 274
#> 3 tt0000003 7 2005
# Fails
duckplyr::duckplyr_df_from_csv(
path = "ratings.tsv",
options = list(
delim = "\t",
types = c(tconst = "VARCHAR", numVotes = "INTEGER")
)
)
#> Error: rel_from_table_function: Need scalar parameter
# How is a struct mapped back to R?
con <- DBI::dbConnect(duckdb::duckdb())
as_tibble(dbGetQuery(
con,
"SELECT {'FlightDate': 'DATE', 'Origin': 'VARCHAR', 'Dest': 'VARCHAR', 'DepTime': 'INTEGER', 'ArrTime': 'INTEGER'} AS a"
))
#> Error in as_tibble(dbGetQuery(con, "SELECT {'FlightDate': 'DATE', 'Origin': 'VARCHAR', 'Dest': 'VARCHAR', 'DepTime': 'INTEGER', 'ArrTime': 'INTEGER'} AS a")): could not find function "as_tibble"
# Trying a data frame
duckplyr::duckplyr_df_from_csv(
path = "ratings.tsv",
options = list(
delim = "\t",
types = data.frame(tconst = "VARCHAR", numVotes = "INTEGER")
)
)
#> Error: rel_from_table_function: Need scalar parameter
duckplyr::duckplyr_df_from_csv(
path = "ratings.tsv",
options = list(
delim = "\t",
types = list(data.frame(tconst = "VARCHAR", numVotes = "INTEGER"))
)
)
#> Error: {"exception_type":"Binder","exception_message":"read_csv_auto types requires a list of types (varchar) as input"} Created on 2024-08-13 with reprex v2.1.0 |
With duckdb/duckdb-r#379, the following works, but other unintended usages still give surprises. text <- "tconst\taverageRating\tnumVotes\ntt0000001\t5.7\t2051\ntt0000002\t5.7\t274\ntt0000003\t6.5\t2005"
writeLines(text, "ratings.tsv")
# Works now
duckplyr::duckplyr_df_from_csv(
path = "ratings.tsv",
options = list(
delim = "\t",
types = data.frame(tconst = "VARCHAR", numVotes = "INTEGER")
)
)
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> read_csv_auto(ratings.tsv)
#>
#> ---------------------
#> -- Result Columns --
#> ---------------------
#> - tconst (VARCHAR)
#> - averageRating (DOUBLE)
#> - numVotes (INTEGER)
#>
#> # A tibble: 3 × 3
#> tconst averageRating numVotes
#> <chr> <dbl> <int>
#> 1 tt0000001 5.7 2051
#> 2 tt0000002 5.7 274
#> 3 tt0000003 6.5 2005
# Unexpected result
duckplyr::duckplyr_df_from_csv(
path = "ratings.tsv",
options = list(
delim = "\t",
types = list(a = "VARCHAR", b = "INTEGER")
)
)
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> read_csv_auto(ratings.tsv)
#>
#> ---------------------
#> -- Result Columns --
#> ---------------------
#> - tconst (VARCHAR)
#> - averageRating (DOUBLE)
#> - numVotes (BIGINT)
#>
#> # A tibble: 3 × 3
#> tconst averageRating numVotes
#> <chr> <dbl> <dbl>
#> 1 tt0000001 5.7 2051
#> 2 tt0000002 5.7 274
#> 3 tt0000003 6.5 2005 Created on 2024-09-14 with reprex v2.1.0 |
Now in the same PR: text <- "tconst\taverageRating\tnumVotes\ntt0000001\t5.7\t2051\ntt0000002\t5.7\t274\ntt0000003\t6.5\t2005"
writeLines(text, "ratings.tsv")
# Works now
duckplyr::duckplyr_df_from_csv(
path = "ratings.tsv",
options = list(
delim = "\t",
types = data.frame(tconst = "VARCHAR", numVotes = "INTEGER")
)
)
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> read_csv_auto(ratings.tsv)
#>
#> ---------------------
#> -- Result Columns --
#> ---------------------
#> - tconst (VARCHAR)
#> - averageRating (DOUBLE)
#> - numVotes (INTEGER)
#>
#> # A tibble: 3 × 3
#> tconst averageRating numVotes
#> <chr> <dbl> <int>
#> 1 tt0000001 5.7 2051
#> 2 tt0000002 5.7 274
#> 3 tt0000003 6.5 2005
# Unexpected result
duckplyr::duckplyr_df_from_csv(
path = "ratings.tsv",
options = list(
delim = "\t",
types = list(a = "VARCHAR", b = "INTEGER")
)
)
#> Error: rel_from_table_function: Need scalar parameter Created on 2024-09-14 with reprex v2.1.0 |
Does the current development version work for you? |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Attended posit::conf 2024 workshop and ran into this issue with
duckplyr_df_from_csv
:Create example data
using first three rows of IMDB data
Default behavior
automatic read in is fine (setting delim not required) but does not most accurately cast last column as integer
Manual fix
This is a correct way to read in this data, however, it requires manually re-specifying the first two column types. From https://duckdb.org/docs/data/csv/auto_detection,
types = {'numVotes': 'INTEGER'}
would be the SQL approachKirill's attempt
We could not find a way to specify just a single column. Here's one of Kirill's attempts:
The text was updated successfully, but these errors were encountered: