Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: implement InMemoryCatalog as a subclass of SqlCatalog #1140

Merged
merged 3 commits into from
Feb 10, 2025

Conversation

hussein-awala
Copy link
Member

@hussein-awala hussein-awala commented Sep 5, 2024

closes: #1110

This PR implement a new catalog InMemoryCatalog as a subclass of SqlCatalog with SQLite in-memory.

tests/catalog/test_base.py Outdated Show resolved Hide resolved
tests/catalog/test_base.py Outdated Show resolved Hide resolved
@hussein-awala hussein-awala changed the title replace InMemoryCatalog by a subclass of SqlCatalog with SQLite in-memory feat: implement InMemoryCatalog as a subclass of SqlCatalog Sep 6, 2024
@hussein-awala
Copy link
Member Author

@kevinjqliu I applied what you suggested in the comment above, could you recheck it now?

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added some nit comments. Thanks for working on this! The in-memory catalog is used at a bunch of places in the tests, so changing it has cascading effects

This is useful for test, demo, and playground but not in production as data is not persisted.
"""

def __init__(self, name: str, warehouse: str = "file:///tmp/warehouse", **kwargs: str) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: let's use something like /tmp/iceberg/warehouse to not conflict with other tmp directories. Also I'm not sure if this works when the warehouse directory is not created yet.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: i'd like to keep this as test_base because I want to parameterize all tests to make sure all the catalogs have the same behaviors (see #813)

Comment on lines 69 to 70
DROP_NOT_EXISTING_NAMESPACE_ERROR = "Namespace does not exist: \\('com', 'organization', 'department'\\)"
NO_SUCH_NAMESPACE_ERROR = "Namespace com.organization.department does not exists"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, merge these two

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be done in a separate PR

tests/catalog/test_memory.py Outdated Show resolved Hide resolved
@@ -806,7 +831,7 @@ def test_json_properties_get_table_does_not_exist(catalog: InMemoryCatalog) -> N
runner = CliRunner()
result = runner.invoke(run, ["--output=json", "properties", "get", "table", "doesnotexist"])
assert result.exit_code == 1
assert result.output == """{"type": "NoSuchTableError", "message": "Table does not exist: ('doesnotexist',)"}\n"""
assert result.output == """{"type": "ValueError", "message": "Empty namespace identifier"}\n"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: shouldnt this be NoSuchTableError? maybe the namespace needs to be created first

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added some nit comments. Thanks for working on this! The in-memory catalog is used at a bunch of places in the tests, so changing it has cascading effects

@kevinjqliu
Copy link
Contributor

@Fokko wydt of this change? i remember we had past discussions on adding a "new" catalog implementation

@Fokko
Copy link
Contributor

Fokko commented Oct 28, 2024

@hussein-awala Thanks for working on this 🚀 @kevinjqliu Regarding the new catalogs, my main concern was a proliferation of new catalogs, and that they would lack maintenance. I do like this change for two reasons:

  • It moves out of the InMemoryCatalog that's specific to tests. We want to have the catalog as part of the tests, otherwise we're testing a catalog that's not part of the normal code-path.
  • It merges the InMemory catalog into the SqlCatalog. This way, when new features are released, such as support for views, multi-table transactions, etc. we have fewer places where we need to implement them.

I'm positive about this change. The only consideration I could make is that we hide the SqlCatalog behind the InMemoryCatalog. Maybe it is interesting for folks to know that they can easily switch to a persistent catalog. What are your thoughts?

@kevinjqliu
Copy link
Contributor

I'm positive about this change. The only consideration I could make is that we hide the SqlCatalog behind the InMemoryCatalog. Maybe it is interesting for folks to know that they can easily switch to a persistent catalog. What are your thoughts?

I think it would be good to document the InMemoryCatalog, perhaps in the catalog section of the configuration page.
We can mention that it uses the SqlCatalog under the hood and to use another catalog implementation to persist the catalog metadata

@kevinjqliu kevinjqliu added this to the PyIceberg 0.9.0 release milestone Oct 30, 2024
@kevinjqliu
Copy link
Contributor

hey @hussein-awala would you like to make the above changes on docs? This PR is almost ready!

@hussein-awala
Copy link
Member Author

hey @hussein-awala would you like to make the above changes on docs? This PR is almost ready!

yes, I will make it ready ASAP

@hussein-awala hussein-awala force-pushed the remove_InMemoryCatalog branch from fa2321e to 7f274a0 Compare February 7, 2025 21:26
Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! And the great docs.

I like that we can replace the old implementation in test but I'm on the fence about whether we should expose/advertise this as a catalog type. It is useful for certain situations and for testing, but im not sure how much value there is to allow users to do

load_catalog()

and

catalog:
  default:
    type: in-memory
    warehouse: /tmp/pyiceberg/warehouse

WDYT @Fokko ?

@Fokko
Copy link
Contributor

Fokko commented Feb 10, 2025

Or just:

catalog = load_catalog('default', 'type'='in-memory', 'warehouse'='/tmp/pyiceberg/warehouse')

I agree that this catalog impl is mostly focussed on testing/demonstration. If you would use a Jupyter notebook, each time you restart the kernel, then you end up with a fresh catalog (don't have to clean up any old stuff lingering around).

@kevinjqliu kevinjqliu merged commit 17e9110 into apache:main Feb 10, 2025
8 checks passed
@kevinjqliu
Copy link
Contributor

Thanks for the contribution @hussein-awala and thanks for the review @Fokko

@kevinjqliu kevinjqliu mentioned this pull request Feb 10, 2025
Fokko pushed a commit that referenced this pull request Feb 10, 2025
CI is failing on main branch 

https://github.com/apache/iceberg-python/actions/runs/13247077313/job/36976096982

Caused by merge conflict after #1140. `InMemoryCatalog` now does not
automatically create the namespace.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove InMemoryCatalog from the test-codebase
3 participants