[Feature][Connector-V2][HdfsFile] Support true large-file split for parallel read #10332
base: dev
Conversation
Resolved review threads:
- ...l/connectors/seatunnel/file/hdfs/source/split/HdfsFileAccordingToSplitSizeSplitStrategy.java
- ...ache/seatunnel/connectors/seatunnel/file/hdfs/source/split/HdfsFileSplitStrategyFactory.java
- ...ache/seatunnel/connectors/seatunnel/file/hdfs/source/split/HdfsParquetFileSplitStrategy.java

Commit: …trategies in base and fix restore
- ...va/org/apache/seatunnel/connectors/seatunnel/file/source/split/ParquetFileSplitStrategy.java
Hi, it seems that a Parquet use case is still missing.

Thanks for implementing this important feature! The overall approach looks solid, but I found two critical issues that should be addressed before merging:
1. Poor error observability (GEN-002)
2. Severe performance regression for large file splits (GEN-003)

Note on COR-001: The idempotency concern in …

The fix has been completed as requested.
Carl-Zhou-CN
left a comment
+1
    try (ParquetFileReader reader =
            ParquetFileReader.open(HadoopInputFile.fromPath(path, conf))) {
        return reader.getFooter().getBlocks();

    if (hadoopFileSystemProxy == null) {
Under what circumstances will hadoopFileSystemProxy be null?
hadoopFileSystemProxy is only null when ParquetFileSplitStrategy is instantiated via the constructor without HadoopConf (mainly for unit tests / direct usage). In the normal connector path when file split is enabled, FileSplitStrategyFactory requires a non-null HadoopConf and always creates ParquetFileSplitStrategy with it, so the proxy won’t be null at runtime. The null branch just falls back to using a default Configuration without any Hadoop auth.
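The fallback described in that answer can be sketched in plain Java. This is a hypothetical, simplified illustration: `Map<String, String>` stands in for Hadoop's `Configuration`, and `HadoopProxy` stands in for the connector's real proxy type, which this PR's code would supply via `HadoopConf`.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: the strategy only falls back to a default,
// auth-free configuration when no Hadoop proxy was injected
// (unit tests / direct construction); the normal connector path
// always supplies a proxy carrying Kerberos/HA/NameService settings.
public final class ProxyFallbackSketch {
    public interface HadoopProxy {
        Map<String, String> configuration(); // stand-in for Hadoop Configuration
    }

    private final HadoopProxy hadoopFileSystemProxy; // may be null in tests

    public ProxyFallbackSketch(HadoopProxy proxy) {
        this.hadoopFileSystemProxy = proxy;
    }

    public Map<String, String> resolveConfiguration() {
        if (hadoopFileSystemProxy == null) {
            // Fallback path: default configuration without any Hadoop auth.
            return new HashMap<>();
        }
        // Normal connector path: settings derived from HadoopConf.
        return hadoopFileSystemProxy.configuration();
    }
}
```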
Purpose of this pull request
#10326
HdfsFile source currently uses “one file = one split”, which limits parallelism when there are only a few huge files.
Does this PR introduce any user-facing change?
Yes.
- text/csv/json: split by file_split_size and align each split end to the next row_delimiter (HDFS seek-based implementation for large files).
- parquet: split by RowGroup (logical split) and read footer metadata using a HadoopConf-backed Configuration (works with Kerberos/HA/NameService).
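The text/csv/json split rule above can be sketched as follows. This is a simplified, hypothetical illustration (not the PR's actual code): a byte array stands in for an HDFS seek+read, and splits are cut every `splitSize` bytes, then each cut is moved forward past the next row delimiter so no row straddles two readers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of delimiter-aligned splitting: cut every
// splitSize bytes, then extend each split end to just past the next
// row delimiter (an HDFS seek in the real implementation).
public final class DelimiterAlignedSplitSketch {
    /** Returns [start, end) byte ranges covering the whole input. */
    public static List<long[]> planSplits(byte[] data, long splitSize, byte delimiter) {
        List<long[]> splits = new ArrayList<>();
        long start = 0;
        while (start < data.length) {
            long end = Math.min(start + splitSize, data.length);
            // Advance until the byte just before `end` is the delimiter
            // (or we hit end-of-file), so the split ends on a row boundary.
            while (end < data.length && data[(int) (end - 1)] != delimiter) {
                end++;
            }
            splits.add(new long[] {start, end});
            start = end;
        }
        return splits;
    }
}
```

Because every split ends immediately after a delimiter (or at end-of-file), reading the splits in order reproduces the full file, which mirrors what the unit test below verifies.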
How was this patch tested?
Unit tests
HdfsFileAccordingToSplitSizeSplitStrategyTest#testReadBySplitsShouldMatchFullRead
E2E
HdfsFileIT#testHdfsTextReadWithFileSplit
Check list
New License Guide
incompatible-changes.md to describe the incompatibility caused by this PR.