
Investigate returning ndarray when lazy_load=False #1787

Open
braingram opened this issue May 25, 2024 · 8 comments

Comments

@braingram
Contributor

The main purpose of NDArrayType is to allow "lazy loading" of array data. When this feature is disabled, the converter "realizes" the underlying array but still returns an NDArrayType:

if not ctx._blocks._lazy_load:
instance._make_array()
return instance

Investigate if returning the ndarray (instead of the NDArrayType) causes any downstream issues.
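For illustration, here is a minimal, self-contained sketch of the proposed behavior, using toy stand-ins for NDArrayType and the converter hook (the real asdf internals differ; the names below are hypothetical):

```python
import numpy as np

class NDArrayType:
    """Toy stand-in for asdf's NDArrayType (illustration only)."""
    def __init__(self, data):
        self._data = np.asarray(data)

    def _make_array(self):
        # "Realize" the underlying array, as the real class does.
        return self._data

def from_yaml_tree(instance, lazy_load):
    # Proposed behavior: when lazy loading is off, hand back the
    # plain ndarray instead of the still-wrapped NDArrayType.
    if not lazy_load:
        return instance._make_array()
    return instance

arr = from_yaml_tree(NDArrayType([1, 2, 3]), lazy_load=False)
assert isinstance(arr, np.ndarray)
```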

@ketozhang

ketozhang commented Jun 13, 2024

+1, I have a use case where this is important for validation on read:

arr = np.arange(10)
af = asdf.AsdfFile({"array": np.arange(10)})
af.write_to("array.asdf")

with asdf.open("array.asdf") as af:
    assert isinstance(af['array'], type(arr)), f"{type(af['array'])} is not {type(arr)}"

# AssertionError: <class 'asdf.tags.core.ndarray.NDArrayType'> is not <class 'numpy.ndarray'>

Although I would not implement validation like this, this is typically the assumption of serializers (e.g., Pydantic).
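To illustrate why wrapper types trip up this kind of check, a toy lazy wrapper (not the real NDArrayType) behaves like an array under duck typing but fails the strict isinstance test a serializer performs:

```python
import numpy as np

class LazyArray:
    """Hypothetical lazy wrapper, standing in for NDArrayType."""
    def __init__(self, data):
        self._data = np.asarray(data)

    def __array__(self, dtype=None, copy=None):
        # Duck-typed conversion hook used by np.asarray.
        return np.asarray(self._data, dtype=dtype)

wrapped = LazyArray([1, 2, 3])
# Duck typing works...
assert np.asarray(wrapped).sum() == 6
# ...but a strict type check (as done by serializers like Pydantic) fails:
assert not isinstance(wrapped, np.ndarray)
```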

@braingram
Contributor Author

Would a type union work?

@ketozhang

Yes it would, but it exposes NDArrayType specifics to the model unnecessarily. When I write an ASDF model, I really only want to think of !ndarray-1.0.0 as the same thing as np.ndarray.

If there are other lazy types, it won't scale well if I have to union every lazy-able tag.

@braingram
Contributor Author

braingram commented Jun 13, 2024

Thanks for pointing out the scalability issue with using a separate lazy type.

I am currently hopeful that the lazy_tree feature will allow us to remove NDArrayType (by only converting the ndarray-x.x tagged dict to a np.ndarray when it's accessed from the tree). However, this might be a worse option for pydantic, since any runtime "validation" will likely require accessing the tree (which will trigger the array to be loaded, losing any benefit of the "lazy tree").
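A toy sketch of the lazy-tree idea (the names and mechanics below are illustrative, not asdf's actual implementation) shows why any validation that touches a node defeats the laziness:

```python
import numpy as np

class LazyTree(dict):
    """Toy lazy tree: converts a tagged node to ndarray on first access."""
    def __getitem__(self, key):
        value = super().__getitem__(key)
        if isinstance(value, dict) and value.get("tag") == "ndarray":
            # Conversion (and hence data loading) happens here, on access.
            value = np.asarray(value["data"])
            super().__setitem__(key, value)  # cache the realized array
        return value

tree = LazyTree({"array": {"tag": "ndarray", "data": [1, 2, 3]}})
# Any validation that inspects the node triggers the conversion,
# so the laziness is lost as soon as a validator runs:
assert isinstance(tree["array"], np.ndarray)
```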

@ketozhang

ketozhang commented Jun 14, 2024

Thanks for the suggestion, I agree lazy tree would work but the lazy benefits are lost with pydantic.

I much appreciate the informative response at #1789 (comment). My takeaway: Pydantic validates on instantiation, instantiation occurs on read, ndarray-tagged objects get converted to NDArrayType, and NDArrayType is not a subclass of ndarray.

If we can change the last part (make NDArrayType a subclass of ndarray), then it wouldn't matter whether lazy loading is on or off.


If NDArrayType is a subclass of ndarray, then these two scenarios show where it is appropriate for users to think about laziness.

If written with an ndarray, the object read back must be a subclass of ndarray.

af = asdf.AsdfFile({"array": np.arange(10)})
af.write_to("array.asdf")

with asdf.open("array.asdf") as af:
    assert isinstance(af['array'], np.ndarray)

Users never need to think about laziness and can use NDArrayType as if it were an ndarray.

Users may copy an ndarray from one tree to another, but lazy objects must be materialized before writing.

copy_af = asdf.AsdfFile()

with asdf.open("array.asdf") as af:
    array = af.tree['array']   # BAD: still an NDArrayType; later use raises "OSError: ASDF file has already been closed"
    assert isinstance(af['array'], np.ndarray)

    array = af.tree['array'].copy()  # OK: .copy() returns a plain ndarray
    assert isinstance(af['array'], np.ndarray)

    copy_af.tree['array'] = array

copy_af.write_to('copy_array.asdf')

The current behavior raises OSError from the write_to() call. This is problematic because the user needs to remember which lines produced lazy objects. A warning should be raised during copy_af.tree['array'] = array; otherwise, users are forced to call write_to() inside the original file's context.
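One way the suggested warning could work, sketched with a hypothetical lazy wrapper and a hypothetical assignment helper (neither is asdf's actual API):

```python
import warnings
import numpy as np

class LazyArray:
    """Hypothetical lazy wrapper tied to an open file handle (stand-in)."""
    def __init__(self, data):
        self._data = np.asarray(data)

def assign_to_tree(tree, key, value):
    """Sketch: warn when a still-lazy object is placed into another tree."""
    if isinstance(value, LazyArray):
        warnings.warn(
            f"assigning lazy object to tree[{key!r}]; materialize it "
            "before the source file is closed",
            stacklevel=2,
        )
    tree[key] = value

tree = {}
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    assign_to_tree(tree, "array", LazyArray([1, 2, 3]))
assert len(caught) == 1  # the user is warned at assignment time
```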

@braingram
Contributor Author

Thanks for the suggestion.

I don't think making NDArrayType a subclass is required for returning an ndarray when lazy_load=False.

Would you try this branch to see if it works with your pydantic code (without any type union for NDArrayType):
https://github.com/braingram/asdf/tree/non_lazy_ndarray

It is far from "complete" as a solution and constitutes a breaking change (so it wouldn't be possible until asdf 4, coming this fall at the soonest), but I would be curious to hear if it makes the pydantic integration easier.

@ketozhang

Sorry! I didn't realize I hadn't updated this. This change does pass the test above.

This test in v3 would've failed at asdf.open due to Pydantic seeing NDArrayType != ndarray.

@ketozhang

After spending quite a bit of time on this, I'm convinced some sort of union type is needed, rather than forcing users to disable lazy loading to guarantee an ndarray is returned.

Would it be appropriate for this asdf project to maintain a list of types (e.g., SomeNameType = NDArrayType | np.ndarray)?
