
Conversation

@yzeng1618 (Contributor) commented Jan 13, 2026

Purpose of this pull request

#10326
The HdfsFile source currently uses “one file = one split”, which limits parallelism when there are only a few huge files.

Does this PR introduce any user-facing change?

yes

  1. This PR adds enable_file_split / file_split_size options to the HdfsFile source and wires in HDFS-specific split strategies:
     • text/csv/json: split by file_split_size and align each split end to the next row_delimiter (HDFS seek-based implementation for large files).
     • parquet: split by RowGroup (logical split) and read footer metadata using a HadoopConf-backed Configuration (works with Kerberos/HA/NameService).

  2. Example (see the config sketch below):
     enable_file_split = true
     file_split_size = 268435456
     row_delimiter = "\n" (for text/csv/json)
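
For illustration only, a minimal sketch of where these options sit in an HdfsFile source block; fs.defaultFS, path, and file_format_type are the usual HdfsFile source options, and all values here are placeholders rather than anything taken from this PR:

    source {
      HdfsFile {
        # Usual HdfsFile options (placeholder values)
        fs.defaultFS = "hdfs://nameservice1"
        path = "/data/events"
        file_format_type = "text"
        row_delimiter = "\n"

        # New options from this PR: cut large files into ~256 MB splits
        enable_file_split = true
        file_split_size = 268435456
      }
    }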

How was this patch tested?

  1. Unit tests
    HdfsFileAccordingToSplitSizeSplitStrategyTest#testReadBySplitsShouldMatchFullRead

  2. E2E
    HdfsFileIT#testHdfsTextReadWithFileSplit

Check list

@yzeng1618 requested a review from chl-wxp on January 14, 2026, 02:10
@Carl-Zhou-CN (Member) commented:

Hi, it seems that a parquet use case is still missing.

@zhangshenghang (Member) commented:

Thanks for implementing this important feature! The overall approach looks solid, but I found 2 CRITICAL issues that should be addressed before merging:

1. Poor error observability (GEN-002)
AccordingToSplitSizeSplitStrategy.java:107-109,115-117 - All IOExceptions are mapped to FILE_READ_FAILED, losing distinction between FileNotFoundException, permission denied, network timeout, etc. This makes troubleshooting difficult. Consider catching specific exception types and mapping them to appropriate error codes.

2. Severe performance regression for large file splits (GEN-003)
AbstractReadStrategy.java:510-521 - safeSlice() uses InputStream.skip() in a loop, which reads byte-by-byte. For HDFS files with non-first splits (e.g., start=10GB), this causes massive network/disk I/O just to reach the offset. HDFS supports FSDataInputStream.seek(), which should be used instead. Without this fix, enabling splits on large HDFS files will be slower than not splitting at all.
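
To make point 2 concrete, here is a minimal seek-based sketch; it assumes only that the stream handed to the reader comes from Hadoop's FileSystem, and SplitSeekSketch / positionAt are illustrative names, not the PR's actual safeSlice code:

    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.hadoop.fs.FSDataInputStream;

    final class SplitSeekSketch {
        // Illustrative helper: move a stream to the split's start offset.
        static void positionAt(InputStream in, long start) throws IOException {
            if (in instanceof FSDataInputStream) {
                // HDFS can jump straight to the offset without reading the preceding bytes.
                ((FSDataInputStream) in).seek(start);
                return;
            }
            // Fallback for plain streams: skip() may advance fewer bytes than requested,
            // so loop until the offset is reached.
            long remaining = start;
            while (remaining > 0) {
                long skipped = in.skip(remaining);
                if (skipped <= 0) {
                    throw new IOException("Could not reach split offset " + start);
                }
                remaining -= skipped;
            }
        }
    }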

Note on COR-001: The idempotency concern in addSplitsBack() couldn't be fully verified from code review. The FileSourceSplit.equals() implementation suggests duplicates would be detected, but runtime behavior should be validated post-merge.

@yzeng1618 (Contributor, Author) commented:

The requested fixes have been completed.

@Carl-Zhou-CN (Member) left a review comment:

+1

    try (ParquetFileReader reader =
            ParquetFileReader.open(HadoopInputFile.fromPath(path, conf))) {
        return reader.getFooter().getBlocks();
    if (hadoopFileSystemProxy == null) {
Contributor commented:

Under what circumstances will hadoopFileSystemProxy be null?

@yzeng1618 (Contributor, Author) replied:

hadoopFileSystemProxy is only null when ParquetFileSplitStrategy is instantiated via the constructor without HadoopConf (mainly for unit tests / direct usage). In the normal connector path when file split is enabled, FileSplitStrategyFactory requires a non-null HadoopConf and always creates ParquetFileSplitStrategy with it, so the proxy won’t be null at runtime. The null branch just falls back to using a default Configuration without any Hadoop auth.
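
For illustration, a minimal sketch of that fallback; the method name footerConfiguration() and the proxy accessor getConfiguration() are assumptions, not the exact code in ParquetFileSplitStrategy:

    import org.apache.hadoop.conf.Configuration;

    // Sketch only: pick the Configuration used to read the parquet footer.
    private Configuration footerConfiguration() {
        if (hadoopFileSystemProxy == null) {
            // Constructed without HadoopConf (unit tests / direct usage):
            // plain defaults, no Kerberos/HA/NameService settings.
            return new Configuration();
        }
        // Normal connector path: Configuration derived from HadoopConf,
        // so Kerberos / HA / NameService settings are applied.
        return hadoopFileSystemProxy.getConfiguration();
    }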

@yzeng1618 requested a review from chl-wxp on January 19, 2026, 10:29
