[SPARK-50607][SQL] Passdown fileLength when reading parquet footer to avoid call HDFS namenode in executor side #49225

Open
wants to merge 1 commit into
base: master

WangGuangxin
Contributor

What changes were proposed in this pull request?

Reading a Parquet footer requires the file length. Currently, every task obtains it by sending a request to the HDFS NameNode from the executor side.

// ParquetFooterReader.java
public static ParquetMetadata readFooter(Configuration configuration,
    Path file, ParquetMetadataConverter.MetadataFilter filter) throws IOException {
    return readFooter(HadoopInputFile.fromPath(file, configuration), filter);
}

// HadoopInputFile.java (from parquet-hadoop)

public static HadoopInputFile fromPath(Path path, Configuration conf) throws IOException {
  FileSystem fs = path.getFileSystem(conf);
  return new HadoopInputFile(fs, fs.getFileStatus(path), conf);
}

However, the file length is already known on the driver side from PartitionedFile. By passing it down to the executor, we can avoid a large number of NameNode requests.
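The idea can be illustrated with a minimal, self-contained sketch. Everything in it is a hypothetical stand-in written for illustration: `FakeNameNode`, `InputFile`, `fromPath`, and `fromKnownLength` are not Spark, Hadoop, or Parquet types, and this is not the code in this patch. It only shows that a reader which receives the length from the caller does not need a metadata lookup. In the real API, parquet-hadoop's `HadoopInputFile.fromStatus(FileStatus, Configuration)` accepts a caller-supplied `FileStatus`, which may be one way to achieve the same effect.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration only; none of these types exist
// in Spark, Hadoop, or Parquet.
final class FooterLengthSketch {

    /** Stand-in for the NameNode: stores file lengths and counts metadata lookups. */
    static final class FakeNameNode {
        private final Map<String, Long> lengths = new HashMap<>();
        private int lookups = 0;

        void addFile(String path, long length) {
            lengths.put(path, length);
        }

        /** Simulates the remote call made by fs.getFileStatus(path). */
        long getFileLength(String path) {
            lookups++;
            Long length = lengths.get(path);
            if (length == null) {
                throw new IllegalArgumentException("No such file: " + path);
            }
            return length;
        }

        int lookups() {
            return lookups;
        }
    }

    /** Stand-in for an input file handle; the footer is located relative to the length. */
    static final class InputFile {
        final String path;
        final long length;

        InputFile(String path, long length) {
            this.path = path;
            this.length = length;
        }
    }

    /** Current behaviour: each task asks the NameNode for the length. */
    static InputFile fromPath(FakeNameNode nameNode, String path) {
        return new InputFile(path, nameNode.getFileLength(path));
    }

    /** Proposed behaviour: reuse the length the driver already knows. */
    static InputFile fromKnownLength(String path, long knownLength) {
        return new InputFile(path, knownLength);
    }

    public static void main(String[] args) {
        FakeNameNode nameNode = new FakeNameNode();
        nameNode.addFile("/warehouse/t/part-0.parquet", 4096L);

        // One lookup per task when the length is resolved on the executor.
        fromPath(nameNode, "/warehouse/t/part-0.parquet");
        System.out.println("lookups after fromPath: " + nameNode.lookups());

        // No additional lookup when the driver-side length is passed down.
        fromKnownLength("/warehouse/t/part-0.parquet", 4096L);
        System.out.println("lookups after fromKnownLength: " + nameNode.lookups());
    }
}
```

The trade-off in this sketch is that the passed-down length is trusted as-is, so it has to match the file that the task actually reads.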

Why are the changes needed?

To reduce HDFS NameNode requests and lower read latency.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Manually

Was this patch authored or co-authored using generative AI tooling?

No

@WangGuangxin WangGuangxin changed the title [SPARK-50607][SQK] Passdown fileLength when reading parquet footer to avoid call HDFS namenode in executor side [SPARK-50607][SQL] Passdown fileLength when reading parquet footer to avoid call HDFS namenode in executor side Dec 18, 2024
@github-actions github-actions bot added the SQL label Dec 18, 2024
@WangGuangxin WangGuangxin force-pushed the reduce_nn_call_in_parquet_reader branch from 151a2a4 to 4053617 Compare December 18, 2024 08:12
@WangGuangxin
Contributor Author

@cloud-fan Could you please help review this?

@WangGuangxin WangGuangxin force-pushed the reduce_nn_call_in_parquet_reader branch from 4053617 to b65aee9 Compare December 18, 2024 13:00
@WangGuangxin WangGuangxin force-pushed the reduce_nn_call_in_parquet_reader branch from b65aee9 to f0567b3 Compare December 18, 2024 13:27