More robust type inference for exceptional CSV files with DuckDB data explorer backend #5746

Closed
wesm opened this issue Dec 15, 2024 · 2 comments
Assignees: wesm
Labels: area: data explorer, area: duckdb, enhancement, sharp-edge

Comments

wesm commented Dec 15, 2024

As described in #5707, certain large CSV files can cause type inference failures with the default settings of DuckDB's `read_csv` scan function.

The Tad data viewer (MIT licensed) uses DuckDB and disables data sampling for type inference when loading the whole CSV file fails after the initial inference attempt. We can probably take a similar approach: https://github.com/antonycourtney/tad/blob/master/packages/reltab-duckdb/src/csvimport.ts#L78.
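
A minimal sketch of that fallback strategy, assuming a duckdb-wasm `AsyncDuckDBConnection` (the function name, table name, and error handling here are illustrative, not Tad's or Positron's actual code):

```typescript
import * as duckdb from '@duckdb/duckdb-wasm';

// Try the fast path first: let DuckDB infer column types from a sample
// of rows. If the import fails (e.g., a column sampled as BIGINT later
// contains non-numeric values), retry with sample_size=-1, which makes
// DuckDB scan the entire file for type inference.
async function importCsvWithFallback(
    conn: duckdb.AsyncDuckDBConnection,
    tableName: string,
    filePath: string
): Promise<void> {
    try {
        await conn.query(
            `CREATE TABLE ${tableName} AS SELECT * FROM read_csv('${filePath}')`
        );
    } catch {
        // Slow path: full-file type inference, slower but more robust.
        await conn.query(
            `CREATE TABLE ${tableName} AS
             SELECT * FROM read_csv('${filePath}', sample_size=-1)`
        );
    }
}
```

With this shape, only the exceptional files pay the full-scan cost; well-behaved CSVs stay on the fast sampled path.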

wesm added the area: data explorer, area: duckdb, enhancement, and sharp-edge labels on Dec 15, 2024
wesm added this to the 2025.02.0 Pre-Release milestone on Dec 15, 2024
wesm self-assigned this on Dec 15, 2024
wesm commented Dec 16, 2024

In the meantime, I also opened duckdb/duckdb#15376 to see whether this is the behavior that DuckDB intends, or whether there might be a missing option to disable numeric type inference for quoted fields in the CSV reader.
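
For illustration, a hypothetical sketch of the behavior in question (the data is made up; `registerFileText` is duckdb-wasm's helper for registering an in-memory file):

```typescript
import * as duckdb from '@duckdb/duckdb-wasm';

// Even though every "id" value is quoted in the CSV, DuckDB's sniffer
// still considers integer types for the column. With default sampling,
// a file whose later (unsampled) rows contain a value like "D123" would
// then fail when the full scan casts the column to BIGINT.
async function showQuotedFieldInference(
    db: duckdb.AsyncDuckDB,
    conn: duckdb.AsyncDuckDBConnection
): Promise<void> {
    const csv = ['id,title', '"100","Widget"', '"101","Gadget"'].join('\n');
    await db.registerFileText('quoted.csv', csv);
    // DESCRIBE reports the inferred column types; quoting alone does
    // not force id to VARCHAR.
    const schema = await conn.query(
        `DESCRIBE SELECT * FROM read_csv('quoted.csv')`
    );
    console.log(schema.toArray());
}
```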

wesm added a commit that referenced this issue Dec 18, 2024
…mns having difficult-to-infer data types (#5764)

Addresses #5746. In very large CSV files, DuckDB can infer an integer
type for a column from the sampled rows and then fail with an error
when converting the rest of the data file to integers.
One such data file is found at
https://s3.amazonaws.com/data.patentsview.org/download/g_patent.tsv.zip.

This PR changes the CSV importing to fall back on `sample_size=-1`
(which uses the entire file for type inference, rather than a sample
of rows) in these exceptional cases. Loading the file takes longer,
but that is better than failing completely.

I made a couple of other incidental changes (a sketch of the resulting import split follows this list):

* Always use `CREATE TABLE` when importing CSV files, which gives better
performance at the cost of higher memory use (we can wait for people to
complain about memory use before working more on this; one potential
workaround is to use a temporary local DuckDB database file instead of
an in-memory one). I made sure these changes don't break live file
updates.
* Always use `CREATE VIEW` with Parquet files, since single-threaded
DuckDB is plenty snappy without converting the Parquet file to its own
internal data format.
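
A rough sketch of that import split, again assuming duckdb-wasm (the names here are illustrative, not the actual Positron code paths):

```typescript
import * as duckdb from '@duckdb/duckdb-wasm';

// CSV files are materialized with CREATE TABLE so subsequent data
// explorer queries hit DuckDB's native storage; Parquet files are
// wrapped in a CREATE VIEW because scanning Parquet directly is fast
// enough even single-threaded.
async function registerDataFile(
    conn: duckdb.AsyncDuckDBConnection,
    name: string,
    filePath: string
): Promise<void> {
    if (filePath.endsWith('.parquet')) {
        await conn.query(
            `CREATE VIEW ${name} AS SELECT * FROM read_parquet('${filePath}')`
        );
    } else {
        await conn.query(
            `CREATE TABLE ${name} AS SELECT * FROM read_csv('${filePath}')`
        );
    }
}
```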

## QA Notes

Loading this 1GB TSV file into the data explorer takes tens of seconds
because duckdb-wasm is single-threaded, so just wait! It will eventually
load.
testlabauto (Contributor) commented

Verified Fixed

Positron Version(s): 2025.01.0-135
OS Version: macOS

Test scenario(s)

https://s3.amazonaws.com/data.patentsview.org/download/g_patent.tsv.zip
loads without issue

Link(s) to TestRail test cases run or created:
