
Feature list while using ArrowFileFormat read or write parquet #1171

Open · 4 of 15 tasks
jackylee-ch opened this issue Nov 23, 2022 · 0 comments
Labels: enhancement (New feature or request)

Comments

jackylee-ch (Contributor) commented Nov 23, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
In #1161, we are trying to use ArrowFileFormat to read and write Parquet files, but several test suites failed, some of them due to missing ArrowFileFormat functionality. The feature list is below:

Highest priority

  • Support Parquet schema merge in infer_schema (see the sketch after this list). Related tests: read from parquet files with changing schema, Enabling/disabling merging partfiles when merging parquet schema, SPARK-10005 Schema merging for nested struct, alter datasource table add columns - parquet, alter datasource table add columns - partitioned - parquet, SPARK-10301 requested schema clipping - requested schema contains physical schema, schema mismatch failure error message for parquet vectorized reader
  • Support writing with other compression codecs. Related test: compression codec
  • Fix: timestamps are read and written with wrong values. Related tests: store and retrieve column stats in different time zones, analyze column command, writing with aggregation, Migration from INT96 to TIMESTAMP_MICROS timestamp type
  • Fix: a RuntimeException is thrown when reading duplicate fields in case-insensitive mode
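For reference, a minimal Spark SQL sketch of how these code paths are typically exercised (the paths and schemas below are made up for illustration; the options and configs are standard Spark/Parquet ones, not ArrowFileFormat-specific APIs):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("arrow-parquet-repro").getOrCreate()
import spark.implicits._

// Schema merge: two part-directories with evolving schemas; reading with
// mergeSchema=true must union both schemas during schema inference.
Seq((1, "a")).toDF("id", "name")
  .write.parquet("/tmp/t/part=1") // hypothetical path
Seq((2, "b", 3.0)).toDF("id", "name", "score")
  .write.parquet("/tmp/t/part=2")
val merged = spark.read.option("mergeSchema", "true").parquet("/tmp/t")
merged.printSchema() // expected: id, name, score (score nullable)

// Compression codec: writes must honor codecs other than the default.
merged.write
  .option("compression", "gzip") // also snappy, zstd, ...
  .parquet("/tmp/t_gzip")

// Timestamps: values must round-trip correctly across session time zones
// and across INT96 vs TIMESTAMP_MICROS physical types.
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
```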

Low priority

  • Support writing empty records to a metadata-only file. Related test: SPARK-26709: OptimizeMetadataOnlyQuery does not handle empty records correctly
  • Support invalid characters in HDFS paths. Related tests: SPARK-33593: Vector reader got incorrect data with binary partition value, SPARK-21167: encode and decode path correctly, special characters in output path
  • Support writing out metadata to the Parquet file. Related tests: SPARK-15804: write out the metadata to parquet file, SPARK-15895 summary files in non-leaf partition directories, SPARK-11044 Parquet writer version fixed as version1
  • Support reading struct data from Parquet. Related tests: SPARK-10005 Schema merging for nested struct, SPARK-10301 requested schema clipping - requested schema contains physical schema, SPARK-10301 requested schema clipping - physical schema contains requested schema, SPARK-10301 requested schema clipping - schemas overlap but don't contain each other, SPARK-10301 requested schema clipping - deeply nested struct, SPARK-10301 requested schema clipping - out of order, SPARK-10301 requested schema clipping - schema merging, Standard mode - SPARK-10301 requested schema clipping - UDT, Legacy mode - SPARK-10301 requested schema clipping - UDT
  • Fix: reading fails when a column is filtered with EqualTo or GreaterThanOrEqual against NaN. Related test: cases when literal is max
  • Support filter pushdown for timestamp and decimal filters (see the sketch after this list). Related tests: filter pushdown - timestamp, filter pushdown - decimal, filter pushdown - date
  • Support ignoreMissingFiles. Related test: Enabling/disabling ignoreMissingFiles using parquet
  • Support NullType. Related test: SPARK-24204 error handling for unsupported Null data types - csv, parquet, orc
  • Fix: "writing data out" metrics differ from vanilla Spark. Related test: writing data out metrics: parquet
  • Pass Parquet options through to the reader. Related test: Read row group containing both dictionary and plain encoded pages
  • Support extra committer class configuration. Related tests: SPARK-8121: spark.sql.parquet.output.committer.class shouldn't be overridden, SPARK-7837 Do not close output writer twice when commitTask() fails
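Likewise, a rough sketch of the knobs several low-priority items hinge on (the column name `ts` and the paths are hypothetical; the config keys are standard Spark SQL options):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("arrow-parquet-options").getOrCreate()
import spark.implicits._

// ignoreMissingFiles: files deleted between planning and execution should
// be skipped instead of failing the whole query.
spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")

// Filter pushdown: timestamp/decimal/date predicates should appear as
// PushedFilters in the scan rather than being evaluated row by row.
val df = spark.read.parquet("/tmp/t") // hypothetical path
df.filter($"ts" >= java.sql.Timestamp.valueOf("2022-11-23 00:00:00"))
  .explain() // look for the predicate under PushedFilters

// Committer class: a user-supplied committer must not be overridden
// (SPARK-8121); ArrowFileFormat should respect this config as well.
spark.conf.set(
  "spark.sql.parquet.output.committer.class",
  "org.apache.parquet.hadoop.ParquetOutputCommitter")
```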