Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[openhouse] Implement TableOperations-specific FileIO instance #241

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

raymondlam12
Copy link

Summary

[Issue][(https://github.com/linkedin/openhouse/issues/#nnn)] Briefly discuss the summary of the changes made in this
pull request in 2-3 lines.](#240)

In order to implement FileIOs that are TableOperations-specific, creating a FileIO instance on a per TableOperations-basis is required.

This is similar to how other Iceberg-compliant Catalogs are implemented: apache/iceberg#10893 and implemented in the same way (taken from the HiveCatalog / GlueCatalog implementations before it got refactored into FileIOTracker)

This feature is needed to implement data access tokens.

Changes

  • Client-facing API Changes
  • [ x ] Internal API Changes
  • Bug Fixes
  • [ x ] New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

For all the boxes checked, please include additional details of the changes made in this pull request.

Testing Done

TODO

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

Copy link
Collaborator

@HotSushi HotSushi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Welcome to the OH community! Aligned on PR direction but left some comments

@@ -88,6 +92,8 @@ public class OpenHouseCatalog extends BaseMetastoreCatalog

protected Map<String, String> properties;

private Cache<TableOperations, FileIO> fileIOCloser;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these look like temporary duplication of code that will be available in upstream iceberg with version bump. Can we add a todo identifiying the PR and version of iceberg that introduced the fileIOcloser and instructions on removing the code duplication later?

.snapshotApi(snapshotApi)
.cluster(cluster)
.build();
FileIO tableOperationsLocalFileIO = loadFileIO(properties);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we also change the signature of loadFileIO to following:

protected FileIO loadFileIO(TableIdentifier tableIdentifier, Map<String, String> properties) {

reason:
Each table can have its own storage, table1 -> hdfs, table2 -> s3, table3 -> adls . We'd need to instantiate/load the right FileIO per table and properties.

@@ -592,4 +603,25 @@ private TableMetadata createStagedMetadata() {
return new StaticTableOperations(tableLocation, fileIO).refresh();
}
}

@Override
public void close() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add unit tests to validate this logic?

@@ -41,6 +42,7 @@ dependencies {
exclude group: 'com.zaxxer', module: 'HikariCP-java7'
exclude group: 'org.apache.commons', module: 'commons-lang3'
}
implementation("com.github.ben-manes.caffeine:caffeine:" + caffeineVersion)
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we also do output jar check to see if the newly added library is appropriately relocated? the output jar should look as follows:
image

@@ -2,6 +2,9 @@

import static com.linkedin.openhouse.javaclient.OpenHouseTableOperations.*;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -88,6 +92,8 @@ public class OpenHouseCatalog extends BaseMetastoreCatalog

protected Map<String, String> properties;

private Cache<TableOperations, FileIO> fileIOCloser;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious. How long the entries are present in the cache? Wondering if the cache will be populated with too many objects if there are more tables operation done in a short period of time. Do you think is going to consume more memory?

private Cache<TableOperations, FileIO> newFileIOCloser() {
return Caffeine.newBuilder()
.weakKeys()
.removalListener(
Copy link
Member

@abhisheknath2011 abhisheknath2011 Oct 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the listener is invoked to perform close?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants