Data Explorer: Fix failures with DuckDB CSV reader on files with columns having difficult-to-infer data types (#5764)

Addresses #5746. In very large CSV files, DuckDB can infer an integer type for a column from its sample of rows and then fail when attempting to convert the rest of the file to integers. One such data file is found at https://s3.amazonaws.com/data.patentsview.org/download/g_patent.tsv.zip. This PR changes the CSV importing to fall back on `sample_size=-1` (which uses the entire file for type inference, rather than a sample of rows) in these exceptional cases. This makes the file take longer to load, but that is better than failing completely.

I made a couple of other incidental changes:

* Always use `CREATE TABLE` when importing CSV files, which gives better performance at the cost of higher memory use (we can wait for people to complain about memory problems before working more on this; one potential workaround is using a temporary local DuckDB database file instead of an in-memory one). I made sure these changes did not break live file updates.
* Always use `CREATE VIEW` with Parquet files, since single-threaded DuckDB is plenty snappy without converting the Parquet file to its own internal data format.

## QA Notes

Loading this 1 GB TSV file into the data explorer takes tens of seconds because duckdb-wasm is single-threaded, so just wait! It will eventually load.
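The fallback described above can be sketched as a try/except around the import statement. This is a minimal illustration of the control flow, not the actual duckdb-wasm implementation; the function name and connection object are hypothetical, while `read_csv(..., sample_size=-1)` is real DuckDB SQL syntax for scanning the whole file during type inference.

```python
def create_table_from_csv(con, table, path):
    """Import a CSV file, retrying with full-file type inference on failure.

    `con` is assumed to expose an `execute(sql)` method, as DuckDB
    connections do; the function name itself is illustrative.
    """
    try:
        # First attempt: let DuckDB infer column types from its default
        # sample of rows. This is fast but can mis-infer (e.g. INT64 for
        # a column that later contains non-integer values).
        con.execute(f"CREATE TABLE {table} AS SELECT * FROM read_csv('{path}')")
    except Exception:
        # Fallback: sample_size=-1 scans the entire file for type
        # inference. Slower, but avoids conversion failures partway
        # through the file.
        con.execute(
            f"CREATE TABLE {table} AS "
            f"SELECT * FROM read_csv('{path}', sample_size=-1)"
        )
```

The design trade-off is the one stated above: the slow path only runs when the fast path has already failed, so well-behaved files pay no extra cost.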