Add REST Catalog tests to Spark 3.5 integration test #11093

Open · wants to merge 28 commits into base: main
Conversation

haizhou-zhao (Contributor)

For issue: #11079

Haizhou Zhao added 2 commits September 18, 2024 15:44
Add REST Catalog tests to Spark 3.4 integration test

tmp save

Fix integ tests

Revert "Add REST Catalog tests to Spark 3.4 integration test"

This reverts commit d052416.

unneeded changes

fix test

retrigger checks

Fix integ test

Fix port already in use

Fix unmatched validation catalog

spotless

Fix sqlite related test failures
@@ -521,7 +524,7 @@ public void testFilesTableTimeTravelWithSchemaEvolution() throws Exception {
optional(3, "category", Types.StringType.get())));

spark.createDataFrame(newRecords, newSparkSchema).coalesce(1).writeTo(tableName).append();

table.refresh();
Contributor

is this refresh only needed for REST?

Contributor Author

Yes, RESTTableOperations and other TableOperations have different mechanisms for refreshing metadata.

// JdbcCatalog, then different jdbc connections could provide different views of table
// status even belonging to the same catalog. Reference:
// https://www.sqlite.org/inmemorydb.html
System.setProperty(CatalogProperties.CLIENT_POOL_SIZE, "1");
Contributor

is this strictly needed to make tests pass? I don't think we set this in any other tests for that specific purpose

Contributor Author

Yes, this is needed for the test to pass. I attempted to run the test without this line, and here are the results: https://github.com/apache/iceberg/actions/runs/11297060489/job/31423193928?pr=11093

System.setProperty(CatalogProperties.CLIENT_POOL_SIZE, "1");
restServer.start(false);
restCatalog = RCKUtils.initCatalogClient();
System.clearProperty("rest.port");
Contributor (@nastra), Oct 11, 2024

maybe we should pass a Map<String, String> to the REST server rather than having to set system properties (which then also need to be cleared again). @danielcweeks thoughts on this?
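A map-based extension along the lines suggested here could look roughly like the following. This is a hypothetical sketch only: the constructor shape and the `config()` accessor are assumptions for illustration, not the merged Iceberg API.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a RESTServerExtension that carries its catalog
// configuration in an explicit Map instead of JVM system properties,
// so nothing global needs to be set and then cleared again.
public class RESTServerExtension {
  private final Map<String, String> config;

  public RESTServerExtension(Map<String, String> config) {
    // Defensive copy: the extension owns this config for the server's lifetime.
    this.config = new HashMap<>(config);
  }

  public Map<String, String> config() {
    return Collections.unmodifiableMap(config);
  }
}
```

The key design point is that the configuration lives in the extension instance rather than in process-wide mutable state, so parallel test classes cannot interfere with each other.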

@BeforeEach
public void before() {
this.validationCatalog =
Contributor

what happens if we don't do any changes to how the validation catalog is configured?

Contributor Author

Something like: you create a table in the REST catalog, but validate against the Hive catalog.

Contributor

my point is that I think all tests should be passing when we don't do any changes to the validation catalog

Contributor Author

If I understand correctly, the original validationCatalog is set to a HadoopCatalog when the catalog under test is named testhadoop; otherwise, validationCatalog is set to the same catalog as catalog (which is strictly a HiveCatalog, as defined by the TestBase class).

That worked previously, when only two catalog types, Hadoop and Hive, were tested: if the test subject was not a Hadoop catalog, setting the validation catalog to the Hive catalog sufficed. However, with the introduction of a third type, the REST catalog, this breaks down. When the test subject is a REST catalog and validationCatalog is initialized the old way, it still points at a Hive catalog, so performing test operations on the REST catalog while validating the post-change state against the Hive catalog won't work.

Unless you are suggesting that we change the TestBase class so that the catalog under test does not strictly need to be a HiveCatalog and can be any type of catalog.
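The selection logic being described might be sketched as follows. This is illustrative only: the catalog names mirror the testhadoop convention mentioned above, but the helper and the "testrest" name are assumptions, not the actual TestBase wiring.

```java
// Hypothetical sketch of validation-catalog selection: with only Hadoop and
// Hive in play, defaulting to Hive was safe; a REST test subject needs a
// REST validation catalog, otherwise tables created via REST would be
// validated against a different catalog entirely.
public class ValidationCatalogSelector {
  static String validationCatalogType(String catalogName) {
    if ("testhadoop".equals(catalogName)) {
      return "hadoop"; // Hadoop test subject: validate with a HadoopCatalog
    }
    if ("testrest".equals(catalogName)) {
      return "rest";   // REST test subject: validate with the REST catalog
    }
    return "hive";     // legacy default: the TestBase HiveCatalog
  }
}
```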

Map<String, String> catalogProperties = RCKUtils.environmentCatalogConfig();
Map<String, String> catalogProperties = Maps.newHashMap();
catalogProperties.putAll(RCKUtils.environmentCatalogConfig());
catalogProperties.putAll(Maps.fromProperties(System.getProperties()));
Contributor

as mentioned in a comment further above, maybe we should consider passing a Map<String, String> to RESTCatalogServer rather than relying on system properties (which have to be cleared after they were configured)

Contributor Author

Done.

Contributor

that means we can revert this line here: catalogProperties.putAll(Maps.fromProperties(System.getProperties()));

Contributor Author

Reverted.

// JdbcCatalog, then different jdbc connections could provide different views of table
// status even belonging to the same catalog. Reference:
// https://www.sqlite.org/inmemorydb.html
System.setProperty(CatalogProperties.CLIENT_POOL_SIZE, "1");
Contributor Author

This conversation got lost (not sure why): #11093 (comment)

new RESTServerExtension(
Map.of(
RESTCatalogServer.REST_PORT, String.valueOf(RCKUtils.findFreePort()),
CatalogProperties.WAREHOUSE_LOCATION, warehouse.getAbsolutePath(),
Contributor

I commented on this yesterday but can't find the comment anymore. The warehouse is currently only used for Hadoop and the REST server will create a separate temp directory for the REST catalog's warehouse, so we can remove this

Contributor Author

Done

port = socket.getLocalPort();
socket.close();
} catch (IOException e) {
throw new RuntimeException(e);
Contributor

Nit: This could use try-with-resources, and we should probably throw UncheckedIOException rather than RuntimeException.
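The nit could be addressed along these lines. A sketch only: the class name `PortUtil` is a placeholder, not the name used in the PR.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.ServerSocket;

// Hypothetical sketch of the findFreePort helper using try-with-resources
// and UncheckedIOException, as suggested in the review comment.
public class PortUtil {
  static int findFreePort() {
    // Binding to port 0 asks the OS for an ephemeral free port; the
    // try-with-resources block closes the socket before we return,
    // so the port is immediately available for the REST server to bind.
    try (ServerSocket socket = new ServerSocket(0)) {
      return socket.getLocalPort();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```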

Contributor Author

Both done.

Contributor Author

Btw, since RCKUtils is package-protected again and this findFreePort util is needed on the Spark integ side, I moved it out of RCKUtils.

}

private static void initRESTCatalog() {
restCatalog = RCKUtils.initCatalogClient(REST_SERVER_EXTENSION.config());
Contributor (@danielcweeks), Oct 31, 2024

So rather than creating the client like this and making RCKUtils public, I think you can actually create the client in the extension and just grab it out of the context:

// In the RESTServerExtension class
@Override
public void beforeAll(ExtensionContext extensionContext) throws Exception {
  // ... other init
  // store the client
  Store store = extensionContext.getStore(Namespace.GLOBAL);
  store.put("rest-client", RCKUtils.initCatalogClient(...));
}

In this class, then, just use:

@Override
public void beforeEach(ExtensionContext context) {
  // get the REST catalog client from the context store
  Store store = context.getStore(Namespace.GLOBAL);
  this.restCatalog = (RESTCatalog) store.get("rest-client");
}

Contributor Author

I made RCKUtils package-protected again in the latest commit. A bit different from what you suggested above: the Spark integ tests rely on the REST client being initialized in a static context, so initializing restCatalog in beforeEach might be hard. I solved the problem by making RESTServerExtension take control of the client lifecycle alongside the server lifecycle. Anyway, in the end, RCKUtils is package-protected again.

@@ -76,15 +78,22 @@ static Map<String, String> environmentCatalogConfig() {
HashMap::new));
}

static RESTCatalog initCatalogClient() {
public static RESTCatalog initCatalogClient() {
Contributor

I think we can get away with leaving this package-protected and not expose the inner workings of the RCK and the Extension. Take a look at my example below.

Contributor Author

Done

@danielcweeks (Contributor)

Hey @haizhou-zhao this is looking really close. Some minor comments on how I think we can improve the Extension handling to better isolate the test, but other than that, I think it looks good.
