More robust type inference for exceptional CSV files with DuckDB data explorer backend #5746

Closed
wesm opened this issue Dec 15, 2024 · 2 comments
Assignees: wesm
Labels: area: data explorer, area: duckdb, enhancement, sharp-edge

Comments

wesm commented Dec 15, 2024

As described in #5707, certain large CSV files can cause type inference failures with the default settings of DuckDB's `read_csv` scan function.

The Tad data viewer (MIT licensed) uses DuckDB and disables data sampling for type inference when loading the whole CSV file fails after the initial inference attempt. We can probably take a similar approach: https://github.com/antonycourtney/tad/blob/master/packages/reltab-duckdb/src/csvimport.ts#L78.
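
A minimal sketch of that fallback strategy, assuming a duckdb-wasm `AsyncDuckDBConnection` (the function name, table name, and error handling here are illustrative, not Tad's or Positron's actual code):

```typescript
import * as duckdb from '@duckdb/duckdb-wasm';

// Try the fast path first: let DuckDB infer column types from a sample
// of rows. If the import fails (e.g., a column sampled as BIGINT later
// contains non-numeric values), retry with sample_size=-1, which makes
// DuckDB scan the entire file for type inference.
async function importCsvWithFallback(
    conn: duckdb.AsyncDuckDBConnection,
    tableName: string,
    filePath: string
): Promise<void> {
    try {
        await conn.query(
            `CREATE TABLE ${tableName} AS SELECT * FROM read_csv('${filePath}')`
        );
    } catch {
        // Slow path: full-file type inference, slower but more robust.
        await conn.query(
            `CREATE TABLE ${tableName} AS
             SELECT * FROM read_csv('${filePath}', sample_size=-1)`
        );
    }
}
```

With this shape, only the exceptional files pay the full-scan cost; well-behaved CSVs stay on the fast sampled path.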

wesm added the area: data explorer, area: duckdb, enhancement, and sharp-edge labels on Dec 15, 2024
wesm added this to the 2025.02.0 Pre-Release milestone on Dec 15, 2024
wesm self-assigned this on Dec 15, 2024
wesm commented Dec 16, 2024

In the meantime, I also opened duckdb/duckdb#15376 to see whether this is the behavior that DuckDB intends, or whether there might be a missing option to disable numeric type inference for quoted fields in the CSV reader.
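
For illustration, a hypothetical sketch of the behavior in question (the data is made up; `registerFileText` is duckdb-wasm's helper for registering an in-memory file):

```typescript
import * as duckdb from '@duckdb/duckdb-wasm';

// Even though every "id" value is quoted in the CSV, DuckDB's sniffer
// still considers integer types for the column. With default sampling,
// a file whose later (unsampled) rows contain a value like "D123" would
// then fail when the full scan casts the column to BIGINT.
async function showQuotedFieldInference(
    db: duckdb.AsyncDuckDB,
    conn: duckdb.AsyncDuckDBConnection
): Promise<void> {
    const csv = ['id,title', '"100","Widget"', '"101","Gadget"'].join('\n');
    await db.registerFileText('quoted.csv', csv);
    // DESCRIBE reports the inferred column types; quoting alone does
    // not force id to VARCHAR.
    const schema = await conn.query(
        `DESCRIBE SELECT * FROM read_csv('quoted.csv')`
    );
    console.log(schema.toArray());
}
```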

wesm added a commit that referenced this issue Dec 18, 2024
…mns having difficult-to-infer data types (#5764)

Addresses #5746. In very large CSV files, DuckDB can infer an integer
type for a column from the sampled rows and then fail with an error
when converting the rest of the data file to integers.
One such data file is found at
https://s3.amazonaws.com/data.patentsview.org/download/g_patent.tsv.zip.

This PR changes the CSV importing to fall back on `sample_size=-1`
(which uses the entire file for type inference, rather than a sample
of rows) in these exceptional cases. Loading the file takes longer,
but that is better than failing completely.

I made a couple of other incidental changes (a sketch of the resulting import split follows this list):

* Always use `CREATE TABLE` when importing CSV files, which gives better
performance at the cost of higher memory use (we can wait for people to
complain about memory use before working more on this; one potential
workaround is to use a temporary local DuckDB database file instead of
an in-memory one). I made sure these changes don't break live file
updates.
* Always use `CREATE VIEW` with Parquet files, since single-threaded
DuckDB is plenty snappy without converting the Parquet file to its own
internal data format.
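
A rough sketch of that import split, again assuming duckdb-wasm (the names here are illustrative, not the actual Positron code paths):

```typescript
import * as duckdb from '@duckdb/duckdb-wasm';

// CSV files are materialized with CREATE TABLE so subsequent data
// explorer queries hit DuckDB's native storage; Parquet files are
// wrapped in a CREATE VIEW because scanning Parquet directly is fast
// enough even single-threaded.
async function registerDataFile(
    conn: duckdb.AsyncDuckDBConnection,
    name: string,
    filePath: string
): Promise<void> {
    if (filePath.endsWith('.parquet')) {
        await conn.query(
            `CREATE VIEW ${name} AS SELECT * FROM read_parquet('${filePath}')`
        );
    } else {
        await conn.query(
            `CREATE TABLE ${name} AS SELECT * FROM read_csv('${filePath}')`
        );
    }
}
```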

## QA Notes

Loading this 1GB TSV file into the data explorer takes tens of seconds
because duckdb-wasm is single-threaded, so just wait! It will eventually
load.
testlabauto (Contributor) commented

Verified Fixed

Positron Version(s): 2025.01.0-135
OS Version: macOS

Test scenario(s)

https://s3.amazonaws.com/data.patentsview.org/download/g_patent.tsv.zip
loads without issue

Link(s) to TestRail test cases run or created:
