Checkpoint state invalid for both Zarr and OCDBT formats #1484

Open
plra opened this issue Jan 13, 2025 · 0 comments

plra commented Jan 13, 2025

I'm using jax 0.4.34, flax 0.9.0 and orbax 0.7.0. Until recently I was using orbax 0.4.1. Certain checkpoints created with v0.4.1 have the following directory structure:

path/to/old_ckpt/
  0/
    commit_success.txt
    default/
      _sharding
      checkpoint
      commit_success.txt
      d/
        <hash>
        ...
      manifest.ocdbt
      <myparam>/
        kernel/
          d/
            <hash>
          manifest.ocdbt
      ...

For at least some of these checkpoints, when I try to restore with a PyTreeCheckpointHandler, I get:

ValueError: NOT_FOUND: Error opening "cast" driver: Error opening "zarr" driver:
Metadata at "<myparam>/kernel/scale/.zarray" in OCDBT database at
gs://<checkpoints>/<model>/<run>/<step>/default/ does not exist

Downgrading orbax back to 0.4.1 results in the same error. Did I corrupt my checkpoint state somehow? How can I rehabilitate these checkpoints?

For reference, my modern checkpoint dirs look like this:

path/to/new_ckpt/
  0/
    _CHECKPOINT_METADATA
    commit_success.txt
    default/
      _METADATA
      _sharding
      commit_success.txt
      d/
        <hash>
      manifest.ocdbt
      ocdbt.process_0/
        d/
          <hash>
          ...
        manifest.ocdbt

and I can load them just fine.
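
To compare the two layouts above more systematically, here is a minimal diagnostic sketch, not an Orbax API. It assumes a local copy of one checkpoint item directory (e.g. `<step>/default`) and uses only the Python standard library. The function name `summarize_ocdbt_layout` and the three category names are hypothetical, chosen for illustration. It reports where `manifest.ocdbt` files live: at the item root, under `ocdbt.process_N/` (as in the newer layout), or nested under parameter directories (as in the older layout).

```python
import os
import re
from typing import Dict, List


def summarize_ocdbt_layout(item_dir: str) -> Dict[str, List[str]]:
    """Group directories containing manifest.ocdbt by where they sit.

    item_dir is assumed to be a local checkpoint item directory such as
    path/to/ckpt/0/default. Categories are illustrative, not Orbax terms.
    """
    summary: Dict[str, List[str]] = {
        "root_manifest": [],      # manifest.ocdbt directly in item_dir
        "process_manifests": [],  # manifest.ocdbt in ocdbt.process_N/
        "nested_manifests": [],   # manifest.ocdbt anywhere else (e.g. <param>/kernel/)
    }
    for dirpath, _dirnames, filenames in os.walk(item_dir):
        if "manifest.ocdbt" not in filenames:
            continue
        # Normalize to forward slashes so output is the same on every OS.
        rel = os.path.relpath(dirpath, item_dir).replace(os.sep, "/")
        if rel == ".":
            summary["root_manifest"].append(rel)
        elif re.fullmatch(r"ocdbt\.process_\d+", rel):
            summary["process_manifests"].append(rel)
        else:
            summary["nested_manifests"].append(rel)
    for paths in summary.values():
        paths.sort()
    return summary
```

Running this on an old and a new checkpoint and diffing the results shows which parameters carry their own nested manifest. It only describes the on-disk layout; it does not read the OCDBT databases or say whether a given key such as `<myparam>/kernel/scale/.zarray` exists inside them.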
