
Conversation

@yzeng1618 (Contributor) commented Jan 13, 2026

Purpose of this pull request

#10326
The HdfsFile source currently uses “one file = one split”, which limits parallelism when there are only a few huge files.

Does this PR introduce any user-facing change?

yes

  1. This PR adds enable_file_split / file_split_size options to the HdfsFile source and wires in HDFS-specific split strategies:
     • text/csv/json: split by file_split_size and align each split end to the next row_delimiter (HDFS seek-based implementation for large files).
     • parquet: split by RowGroup (logical split) and read footer metadata using a HadoopConf-backed Configuration (works with Kerberos/HA/NameService).

  2. Example (see the config sketch below):
     enable_file_split = true
     file_split_size = 268435456
     row_delimiter = "\n" (for text/csv/json)
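
For illustration only, a minimal sketch of where these options sit in an HdfsFile source block; fs.defaultFS, path, and file_format_type are the usual HdfsFile source options, and all values here are placeholders rather than anything taken from this PR:

    source {
      HdfsFile {
        # Usual HdfsFile options (placeholder values)
        fs.defaultFS = "hdfs://nameservice1"
        path = "/data/events"
        file_format_type = "text"
        row_delimiter = "\n"

        # New options from this PR: cut large files into ~256 MB splits
        enable_file_split = true
        file_split_size = 268435456
      }
    }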

How was this patch tested?

  1. Unit tests
    HdfsFileAccordingToSplitSizeSplitStrategyTest#testReadBySplitsShouldMatchFullRead

  2. E2E
    HdfsFileIT#testHdfsTextReadWithFileSplit

Check list

@yzeng1618 requested a review from chl-wxp on January 14, 2026, 02:10
@Carl-Zhou-CN (Member) commented:

Hi, it seems that a parquet use case is still missing.

@zhangshenghang (Member) commented:

Thanks for implementing this important feature! The overall approach looks solid, but I found 2 CRITICAL issues that should be addressed before merging:

1. Poor error observability (GEN-002)
AccordingToSplitSizeSplitStrategy.java:107-109,115-117 - All IOExceptions are mapped to FILE_READ_FAILED, losing distinction between FileNotFoundException, permission denied, network timeout, etc. This makes troubleshooting difficult. Consider catching specific exception types and mapping them to appropriate error codes.

2. Severe performance regression for large file splits (GEN-003)
AbstractReadStrategy.java:510-521 - safeSlice() uses InputStream.skip() in a loop, which reads byte-by-byte. For HDFS files with non-first splits (e.g., start=10GB), this causes massive network/disk I/O just to reach the offset. HDFS supports FSDataInputStream.seek(), which should be used instead. Without this fix, enabling splits on large HDFS files will be slower than not splitting at all.
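
To make point 2 concrete, here is a minimal seek-based sketch; it assumes only that the stream handed to the reader comes from Hadoop's FileSystem, and SplitSeekSketch / positionAt are illustrative names, not the PR's actual safeSlice code:

    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.hadoop.fs.FSDataInputStream;

    final class SplitSeekSketch {
        // Illustrative helper: move a stream to the split's start offset.
        static void positionAt(InputStream in, long start) throws IOException {
            if (in instanceof FSDataInputStream) {
                // HDFS can jump straight to the offset without reading the preceding bytes.
                ((FSDataInputStream) in).seek(start);
                return;
            }
            // Fallback for plain streams: skip() may advance fewer bytes than requested,
            // so loop until the offset is reached.
            long remaining = start;
            while (remaining > 0) {
                long skipped = in.skip(remaining);
                if (skipped <= 0) {
                    throw new IOException("Could not reach split offset " + start);
                }
                remaining -= skipped;
            }
        }
    }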

Note on COR-001: The idempotency concern in addSplitsBack() couldn't be fully verified from code review. The FileSourceSplit.equals() implementation suggests duplicates would be detected, but runtime behavior should be validated post-merge.

@yzeng1618 (Contributor, Author) commented:

The requested fixes have been completed.

@Carl-Zhou-CN (Member) left a review comment:

+1

    try (ParquetFileReader reader =
            ParquetFileReader.open(HadoopInputFile.fromPath(path, conf))) {
        return reader.getFooter().getBlocks();
    if (hadoopFileSystemProxy == null) {
Contributor commented:

Under what circumstances will hadoopFileSystemProxy be null?

@yzeng1618 (Contributor, Author) replied:

hadoopFileSystemProxy is only null when ParquetFileSplitStrategy is instantiated via the constructor without HadoopConf (mainly for unit tests / direct usage). In the normal connector path when file split is enabled, FileSplitStrategyFactory requires a non-null HadoopConf and always creates ParquetFileSplitStrategy with it, so the proxy won’t be null at runtime. The null branch just falls back to using a default Configuration without any Hadoop auth.
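
For illustration, a minimal sketch of that fallback; the method name footerConfiguration() and the proxy accessor getConfiguration() are assumptions, not the exact code in ParquetFileSplitStrategy:

    import org.apache.hadoop.conf.Configuration;

    // Sketch only: pick the Configuration used to read the parquet footer.
    private Configuration footerConfiguration() {
        if (hadoopFileSystemProxy == null) {
            // Constructed without HadoopConf (unit tests / direct usage):
            // plain defaults, no Kerberos/HA/NameService settings.
            return new Configuration();
        }
        // Normal connector path: Configuration derived from HadoopConf,
        // so Kerberos / HA / NameService settings are applied.
        return hadoopFileSystemProxy.getConfiguration();
    }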

@yzeng1618 requested a review from chl-wxp on January 19, 2026, 10:29
