@longshuicy longshuicy commented Jan 30, 2025

Update:

  • Ran into error: pyarrow.lib.ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 4380140751. Updated create_geotiff_images_dataset to read individual files and then repartition.

  • Ran into a GDAL write error:

File "/home/cwang138/miniconda3/envs/maple_py310_ray/lib/python3.10/site-packages/osgeo/gdal.py", line 3600, in VSIFWriteL
    return _gdal.VSIFWriteL(*args)
RuntimeError: too large buffer (>2GB)

Fixed by updating the GDAL file-write method to write in chunks.
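
For reference, the chunked-write idea can be sketched roughly as below. This is a minimal illustration, not the code in this PR: `write_in_chunks`, its `write` callable parameter, and the 256 MB default chunk size are all hypothetical, and the example writes to an in-memory buffer rather than through GDAL. In the real pipeline the GDAL write call from the traceback would stand in for `write`.

```python
import io


def write_in_chunks(write, data, chunk_size=256 * 1024 * 1024):
    """Pass `data` to `write` in slices of at most `chunk_size` bytes.

    `write` is any callable that accepts a bytes-like object (hypothetical
    stand-in for the real file-write call). Returns the bytes handed over.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    view = memoryview(data)  # slicing a memoryview does not copy the payload
    total = 0
    for start in range(0, len(view), chunk_size):
        chunk = view[start:start + chunk_size]
        write(chunk)
        total += len(chunk)
    return total


# Usage: a tiny payload and chunk size so the behaviour is easy to inspect.
buffer = io.BytesIO()
payload = b"abcdefghij"
written = write_in_chunks(buffer.write, payload, chunk_size=4)
```

Each call to `write` then receives a buffer no larger than `chunk_size`, which is the property needed to stay under the 2GB limit reported above.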


The latest Ray release, 2.41.0, introduces a new error: pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays, consider casting input from `binary` to `large_binary` first.

The workaround for now is pinning Ray to 2.32.0.


Some related notes:

  • I suspect the shapefile-stitching step is the cause: the stitched shapefile might have a very large binary field (> 2GB) that triggers the error.
data_per_image = inferenced_dataset.groupby("image_name").map_groups(
    ray_tile_and_stitch_util.stitch_shapefile, concurrency=args.concurrency)

@tcnichol tcnichol (Contributor) commented Feb 3, 2025

I have checked and with this version pinned everything runs. Doing some more review, will leave more comments. Just wanted to let you know that this now runs with no errors.
