-
Notifications
You must be signed in to change notification settings - Fork 52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix CTAS for non-hdfs storages, also fixes multi storage cases #256
base: main
Are you sure you want to change the base?
Conversation
e15e1dc
to
1a49f14
Compare
services/tables/src/main/java/com/linkedin/openhouse/tables/utils/TableUUIDGenerator.java
Outdated
Show resolved
Hide resolved
services/tables/src/main/java/com/linkedin/openhouse/tables/utils/TableUUIDGenerator.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with Storage API changes. One feedback on not introducing iceberg at rest layer.
...ter/storage/src/main/java/com/linkedin/openhouse/cluster/storage/hdfs/HdfsStorageClient.java
Show resolved
Hide resolved
services/tables/src/main/java/com/linkedin/openhouse/tables/utils/TableUUIDGenerator.java
Outdated
Show resolved
Hide resolved
cluster/storage/src/main/java/com/linkedin/openhouse/cluster/storage/BaseStorage.java
Outdated
Show resolved
Hide resolved
cluster/storage/src/main/java/com/linkedin/openhouse/cluster/storage/StorageClient.java
Outdated
Show resolved
Hide resolved
cluster/storage/src/main/java/com/linkedin/openhouse/cluster/storage/BaseStorage.java
Outdated
Show resolved
Hide resolved
cluster/storage/src/main/java/com/linkedin/openhouse/cluster/storage/Storage.java
Outdated
Show resolved
Hide resolved
cluster/storage/src/main/java/com/linkedin/openhouse/cluster/storage/StorageClient.java
Outdated
Show resolved
Hide resolved
...ter/storage/src/main/java/com/linkedin/openhouse/cluster/storage/adls/AdlsStorageClient.java
Show resolved
Hide resolved
...ter/storage/src/main/java/com/linkedin/openhouse/cluster/storage/hdfs/HdfsStorageClient.java
Show resolved
Hide resolved
cluster/storage/src/main/java/com/linkedin/openhouse/cluster/storage/s3/S3StorageClient.java
Outdated
Show resolved
Hide resolved
services/tables/src/main/java/com/linkedin/openhouse/tables/utils/TableUUIDGenerator.java
Outdated
Show resolved
Hide resolved
services/tables/src/main/java/com/linkedin/openhouse/tables/utils/TableUUIDGenerator.java
Outdated
Show resolved
Hide resolved
@HotSushi @jainlavina Addressed the comments. Some changes in the latest commit
|
cluster/storage/src/main/java/com/linkedin/openhouse/cluster/storage/StorageClient.java
Outdated
Show resolved
Hide resolved
cluster/storage/src/main/java/com/linkedin/openhouse/cluster/storage/StorageClient.java
Outdated
Show resolved
Hide resolved
cluster/storage/src/main/java/com/linkedin/openhouse/cluster/storage/StorageManager.java
Outdated
Show resolved
Hide resolved
cluster/storage/src/main/java/com/linkedin/openhouse/cluster/storage/StorageManager.java
Outdated
Show resolved
Hide resolved
cluster/storage/src/main/java/com/linkedin/openhouse/cluster/storage/StorageManager.java
Outdated
Show resolved
Hide resolved
...ter/storage/src/main/java/com/linkedin/openhouse/cluster/storage/hdfs/HdfsStorageClient.java
Outdated
Show resolved
Hide resolved
cluster/storage/src/main/java/com/linkedin/openhouse/cluster/storage/StorageClient.java
Outdated
Show resolved
Hide resolved
public Storage getStorageFromPath(String path) { | ||
for (Storage storage : storages) { | ||
if (storage.isConfigured()) { | ||
if (StorageType.LOCAL.equals(storage.getType())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be fallback only if path does not start with any other configured endpoint?
What if the path is an S3 storage path but local storage is also configured for some other tables?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
➕ lets have if (StorageType.LOCAL.equals(storage.getType())) {
in the fallback
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done. Please check
...ter/storage/src/main/java/com/linkedin/openhouse/cluster/storage/adls/AdlsStorageClient.java
Show resolved
Hide resolved
...ter/storage/src/main/java/com/linkedin/openhouse/cluster/storage/hdfs/HdfsStorageClient.java
Outdated
Show resolved
Hide resolved
cluster/storage/src/main/java/com/linkedin/openhouse/cluster/storage/s3/S3StorageClient.java
Outdated
Show resolved
Hide resolved
...ter/storage/src/main/java/com/linkedin/openhouse/cluster/storage/hdfs/HdfsStorageClient.java
Outdated
Show resolved
Hide resolved
...ter/storage/src/main/java/com/linkedin/openhouse/cluster/storage/hdfs/HdfsStorageClient.java
Show resolved
Hide resolved
...ter/storage/src/main/java/com/linkedin/openhouse/cluster/storage/hdfs/HdfsStorageClient.java
Outdated
Show resolved
Hide resolved
cluster/storage/src/main/java/com/linkedin/openhouse/cluster/storage/s3/S3StorageClient.java
Outdated
Show resolved
Hide resolved
...lcatalog/src/main/java/com/linkedin/openhouse/internal/catalog/OpenHouseInternalCatalog.java
Outdated
Show resolved
Hide resolved
extractFromTblPropsIfExists(databaseId + "." + tableId, tableProperties, DB_RAW_KEY); | ||
String tblIdFromProps = | ||
extractFromTblPropsIfExists(databaseId + "." + tableId, tableProperties, TBL_RAW_KEY); | ||
String tableURI = String.format("%s.%s", databaseId, tableId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rohitkum2506 can you check if the table reinstatement logic for replication is still intact? especially at:
String tableLocation = extractFromTblPropsIfExists(tableURI, tableProperties, TBL_LOC_RAW_KEY);
services/tables/src/main/java/com/linkedin/openhouse/tables/utils/TableUUIDGenerator.java
Outdated
Show resolved
Hide resolved
services/tables/src/test/java/com/linkedin/openhouse/tables/e2e/h2/SnapshotsControllerTest.java
Outdated
Show resolved
Hide resolved
public Storage getStorageFromPath(String path) { | ||
for (Storage storage : storages) { | ||
if (storage.isConfigured()) { | ||
if (StorageType.LOCAL.equals(storage.getType())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
➕ lets have if (StorageType.LOCAL.equals(storage.getType())) {
in the fallback
cluster/storage/src/main/java/com/linkedin/openhouse/cluster/storage/StorageClient.java
Outdated
Show resolved
Hide resolved
3cecc48
to
ea7a771
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For extractUUIDFromTableProperties
, the logic before would check for directory existence thats missing now. We should be able to incorporate it as part of Storage.isPathValid()
.
For extractUUIDFromExistingManifestListPath
we end up calling storage.getClient().getEndpoint()
, we should introduce better layering/ or redefine the logic. But i'm ok pursuing this in a future PR.
* Checks if the provided path is a valid path for this storage type. It defaults to checking if | ||
* the path starts with the endpoint (scheme) specified in cluster.yaml | ||
* | ||
* @param path path to a file/object |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can you add example of what a path string looks like for HDFS + s3, similar to StorageClient doc. Also add a doc explicitly that says prefix check wouldn't work, only object existance check works
storageManager, dbIdFromProps, tblIdFromProps, tableUUIDProperty); | ||
if (TableType.REPLICA_TABLE != tableType && !doesPathExist(previousPath)) { | ||
log.error("Previous tableLocation: {} doesn't exist", previousPath); | ||
if (TableType.REPLICA_TABLE != tableType && !storage.getClient().exists(tableLocation)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
storage.getClient().exists(tableLocation)
-> storage.isPathValid(tableLocation)
? after incorporating existence check feedback.
* @return true if endpoint is specified in cluster.yaml else false | ||
*/ | ||
default boolean isPathValid(String path) { | ||
return path.startsWith(getClient().getEndpoint()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldnt we also check for object existence here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isPathValid should check three things:
- its appropriate for the right storage (ie: prefix matches s3://)
- its directory structure is valid
- rootprefix is intact (ie. /data/openhouse/db/tb/metadata.json has rootPrefix data/openhouse)
- db/table directory is intact (ie. /data/openhouse/db/tb/metadata.json has directory structure thats created by storage.allocateTableLocation(), /db/tb-uuid)
- and object exists(ie. /data/openhouse/db/tb/metadata.json);
To achieve this, you might need to change the signature to: boolean isPathValid(String path, dbname, tbname, tbuuid)
*/ | ||
@Override | ||
public boolean isPathValid(String path) { | ||
return (super.isPathValid(path) || path.startsWith("/")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we'll also need to validate rootprefix as well, basically this method needs to be equivalent to:
java.nio.file.Path previousPath =
InternalRepositoryUtils.constructTablePath(
storageManager, dbIdFromProps, tblIdFromProps, tableUUIDProperty);
!doesPathExist(previousPath)
in TableUUIDGenerator
Preconditions.checkArgument( | ||
path.startsWith(getEndpoint()), String.format("Invalid S3 URL format %s", path)); | ||
try { | ||
URI uri = new URI(path); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we use s3 utils to get bucket, key information?
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.S3Utilities;
import software.amazon.awssdk.services.s3.model.S3Uri;
public class S3UriParser {
public static void main(String[] args) {
S3Utilities s3Utilities = s3Client.utilities();
String s3UriString = "s3://my-bucket/path/to/file.txt";
S3Uri s3Uri = s3Utilities.parseUri(URI.create(s3UriString));
String bucket = s3Uri.bucket().orElse(null);
String key = s3Uri.key().orElse(null);
System.out.println("Bucket: " + bucket);
System.out.println("Key: " + key);
}
}
String manifestListPathString = | ||
new Gson().fromJson(snapshotStr, JsonObject.class).get(manifestListKey).getAsString(); | ||
manifestListPathString = | ||
StringUtils.removeStart(manifestListPathString, storage.getClient().getEndpoint()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a better way to achieve this without calling storage.getClient().getEndpoint
, I'm ok with deferring this part to future pr.
Summary
This PR introduces the following changes
While extracting UUID from a snapshot, the code constructs the database path excluding the endpoint (scheme) when checking if it is a prefix of the manifestList that is part of the snapshot
Example
ManifestList (from snapshot): s3://bucket-name/database/table-uuid/file.avro
Database prefix: bucket-name/database (not a prefix of above)
Fix: Strip the endpoint(scheme) from the manifest list by resolving the correct storage from the tableLocation
After fix
ManifestList stripped : bucket-name/database/table-uuid/file.avro
Database prefix: bucket-name/database (is a prefix of above)
There are assumptions in code that storage is always cluster default. This fails when the default is a storage without scheme (hdfs) and the db.table is being stored in a storage with scheme (S3, BlobFs)
Fix: Resolve the correct storage for by extracting the tableLocation from table props and checking the scheme (endpoint)
Changes
For all the boxes checked, please include additional details of the changes made in this pull request.
Testing Done
For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.
Additional Information
For all the boxes checked, include additional details of the changes made in this pull request.