Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How do you write a range request? #59

Open
ahalota opened this issue Jan 5, 2023 · 14 comments
Open

How do you write a range request? #59

ahalota opened this issue Jan 5, 2023 · 14 comments

Comments

@ahalota
Copy link

ahalota commented Jan 5, 2023

I'm trying to take advantage of COG format to write a python script to download only the data inside of a state border for example "california.shp". The data I want to download is hosted on AWS as requester pays, so it's important that I only download exactly what I need.

I have gone several pages in on google, and despite all the pages saying how great COG is because of the range requests, I just can't find a single real example showing how to do this.

So, given a shapefile, and an aws data COG data source (for example, naip-imagery), how would you do this? It would be great to have example scripts on this website (or anywhere, really).

This is the code I have so far. (Along with some creative suggestions from ChatGPT that didn't work out, I'm assuming it has just as little information to feed into the model as I did when trying to figure it out...)

import boto3
import shapefile
from shapely.geometry import Polygon, Point

# Read in shapefile, which is in WGS84
california_border = Polygon(shapefile.Reader("california.shp")).shapes()[0].points)
   
# Connect to AWS
client = boto3.client(
    "s3",
    aws_access_key_id=<MY_ACCESS_KEY>,
    aws_secret_access_key=<MY_SECRET_ACCESS_KEY>,
    region_name="us-west-2"
)

# List objects in the "naip-analytic" dataset
objects = client.list_objects_v2(
    Bucket="naip-analytic",
    Prefix="ca/2020/60cm/rgbir_cog/32114", #One out of 6 folders available for California, this holds 43 .tif files
    MaxKeys=1000,
    RequestPayer="requester"
)

for obj in objects["Contents"]:
    print(obj["Key"])
    # Get object metadata
    metadata = client.head_object(
        Bucket="naip-analytic",
        Key=obj["Key"],
        RequestPayer="requester" 
    )
    
    coords = metadata["Metadata"] #This doesn't actually hold any information (suggested by ChatGPT). Full output below

    #Everything below is non-working suggestions from ChatGPT, left in case they inspire any good ideas
    xmin = float(coords["s3:xmin"])
    ymin = float(coords["s3:ymin"])
    xmax = float(coords["s3:xmax"])
    ymax = float(coords["s3:ymax"])

    # Check if image intersects the California border
    if california_border.intersects(Polygon([(xmin, ymin), (xmax, ymin), (xmax, ymax), (xmin, ymax)])):
        # Calculate the overlap between the image and the California border
        overlap = california_border.intersection(Polygon([(xmin, ymin), (xmax, ymin), (xmax, ymax), (xmin, ymax)]))
        # Calculate the start and end bytes of the overlap
        start_x, start_y = overlap.exterior.coords[0]
        end_x, end_y = overlap.exterior.coords[2]
        start_byte = int((start_x - xmin) / (xmax - xmin) * obj["Size"]) #This was ChatGPT's guess. I am not sure if this is approrpiate
        end_byte = int((end_x - xmin) / (xmax - xmin) * obj["Size"])
        # Create a range request for the image
        url = f"https://naip-analytic.s3.amazonaws.com/{obj['Key']}?AWSAccessKeyId={client.meta.config.access_key}&Signature={client.meta.config.secret_key}&Expires={client.meta.config.signature_version}"
        headers = {
            "Range": f"bytes={start_byte}-{end_byte}"
        }
        print(url)

My metadata response from AWS looks like this:

{'ResponseMetadata': {'RequestId': 'BCZWDXWE7BJ1444B', 'HostId': 'BnfupuW09mBf8wBWNCYt2sLFLl78sY3niZ+TKZx6UkGZqCIFk3uNLCqPRG76I3qvAEWpkLLhR3s=', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amz-id-2': 'BnfupuW09mBf8wBWNCYt2sLFLl78sY3niZ+TKZx6UkGZqCIFk3uNLCqPRG76I3qvAEWpkLLhR3s=', 'x-amz-request-id': 'BCZWDXWE7BJ1444B', 'date': 'Thu, 05 Jan 2023 17:23:55 GMT', 'x-amz-request-charged': 'requester', 'last-modified': 'Fri, 28 Jan 2022 16:31:34 GMT', 'etag': '"3357a147f64f23e7b8471760b7faf242-44"', 'accept-ranges': 'bytes', 'content-type': 'image/tiff', 'server': 'AmazonS3', 'content-length': '361935973'}, 'RetryAttempts': 0}, 'AcceptRanges': 'bytes', 'LastModified': datetime.datetime(2022, 1, 28, 16, 31, 34, tzinfo=tzutc()), 'ContentLength': 361935973, 'ETag': '"3357a147f64f23e7b8471760b7faf242-44"', 'ContentType': 'image/tiff', 'Metadata': {}, 'RequestCharged': 'requester'}

I'm hoping someone here is familiar enough with the format to point me in the right direction, it's disappointing that something so simple isn't described anywhere.

I'm not sure where to find:

  1. The AWS-hosted COG's bounding box coordinates (the ones above ["s3:xmin"] etc. do not appear when I run the code)
  2. How do I calculate which bytes I need?
  3. Can I get more specific and only get the exact bytes within my border, or do I have to do a bounding box?

Thanks! I really appreciate any help.

@vincentsarago
Copy link
Member

vincentsarago commented Jan 5, 2023

checkout
https://rasterio.readthedocs.io/en/latest/topics/masking-by-shapefile.html
https://cogeotiff.github.io/rio-tiler/advanced/feature/

But that will only explain how to read a raster (COG). You have other issues to solve:

  • NAIP has multiple years of acquisition (your 6 sub-directories), so you have to select 1
  • for each years you have multiple small COG so you need to create a mosaic first. The easiest will be to create a VRT document with all the COG and then use it to fetch the part you need (from the shape)

You don't need to write the range-request, rasterio/GDAL to do that for you

ref:
https://github.com/kylebarron/naip-cogeo-mosaic
https://github.com/developmentseed/cogeo-mosaic/blob/main/docs/src/examples/Create_a_Dynamic_RtreeBackend.ipynb

@ahalota
Copy link
Author

ahalota commented Jan 5, 2023

Thanks for the links, I hadn't seen those before.

I'm trying to understand this in the most basic form possible (the naip-cogeo-mosaic link looks interesting, but I'd prefer not to get into another file format such as the mosaic.json described there).

After looking through those links, I'm still not clear on how I would proceed here.

I used the code above to grab my list of urls, and saved to "ca_urls.txt". A subset of a few here:

https://naip-analytic.s3.amazonaws.com/ca/2020/60cm/rgbir_cog/32114/m_3211419_nw_11_060_20200524.tif
https://naip-analytic.s3.amazonaws.com/ca/2020/60cm/rgbir_cog/32114/m_3211420_ne_11_060_20200524.tif
https://naip-analytic.s3.amazonaws.com/ca/2020/60cm/rgbir_cog/32114/m_3211420_nw_11_060_20200524.tif

I see several options for what should go in place of "https://" such as "s3://" or "/vsis3/", but don't know what the difference is.

Now I'm still not sure what to do with this. I looked at gdalbuildvrt, but couldn't find a configuration that actually generated a vrt.
gdalbuildvrt -input_file_lst ca_urls.txt ca_mosaic.vrt returns this error for each line: "Warning 1: Can't open . Skipping it"

What is the overlap with the vrt and the mosaic.json project? They sound similar, so if I could just use the gdal vrt format instead that would be preferrable to minimize extra libraries.

Would I then need to feed the vrt file into rio_tiler Reader, where it is treated like a regular file and I can then follow the example you linked (https://cogeotiff.github.io/rio-tiler/advanced/feature/) at which point my file will be downloaded and saved?

Once I have my vrt file, should a command like this work?
gdalwarp -cutline INPUT.shp INPUT.tif OUTPUT.tif (at this point I would expect to actually download the data and pay for it)

@vincentsarago
Copy link
Member

I see several options for what should go in place of "https://" such as "s3://" or "/vsis3/", but don't know what the difference is.

because the naip bucket is requester pays, you'll need to use the s3:// prefix and set AWS_REQUEST_PAYER= requester in you environment.

This should fix the "Warning 1: Can't open . Skipping it" issue.

What is the overlap with the vrt and the mosaic.json project? They sound similar, so if I could just use the gdal vrt format instead that would be preferrable to minimize extra libraries.

MosaicJSON is kinda like a VRT but optimized for dynamic tiling. For your use case VRT will be enough.

Would I then need to feed the vrt file into rio_tiler Reader, where it is treated like a regular file

Yes

and I can then follow the example you linked (https://cogeotiff.github.io/rio-tiler/advanced/feature/) at which point my file will be downloaded and saved?

Rio-tiler is not meant to save output file, you'll better served with gdalwarp command line as you mentioned.

@ahalota
Copy link
Author

ahalota commented Jan 5, 2023

Thanks for your help, it feels so close! I really appreciate it.

I now have my list of urls in "ca_32113_urls.txt":

s3://naip-analytic.s3.amazonaws.com/ca/2020/60cm/rgbir_cog/32114/m_3211419_nw_11_060_20200524.tif
s3://naip-analytic.s3.amazonaws.com/ca/2020/60cm/rgbir_cog/32114/m_3211420_ne_11_060_20200524.tif
s3://naip-analytic.s3.amazonaws.com/ca/2020/60cm/rgbir_cog/32114/m_3211420_nw_11_060_20200524.tif

And my command

gdalbuildvrt -input_file_list .\ca_32114_urls.txt ca_32114_urls.vrt 
--config AWS_REGION "us-west-2" 
--config AWS_ACCESS_KEY_ID <my_access_key> 
--config AWS_SECRET_ACCESS_KEY <my_secret_key> 
--config AWS_REQUEST_PAYER "requester"

But am still getting "Warning 1: Can't open . Skipping it"
When you say said to set this in the environment, does it have to be a standard windows environment variable, or would you expect it to work as shown in the command above

@vincentsarago
Copy link
Member

try with :

s3://naip-analytic/ca/2020/60cm/rgbir_cog/32114/m_3211419_nw_11_060_20200524.tif
s3://naip-analytic/ca/2020/60cm/rgbir_cog/32114/m_3211420_ne_11_060_20200524.tif
s3://naip-analytic/ca/2020/60cm/rgbir_cog/32114/m_3211420_nw_11_060_20200524.tif

@ahalota
Copy link
Author

ahalota commented Jan 5, 2023

That one didn't work either.

@vincentsarago
Copy link
Member

🤦 my bad

it should be

/vsis3/naip-analytic/ca/2020/60cm/rgbir_cog/32114/m_3211419_nw_11_060_20200524.tif
/vsis3/naip-analytic/ca/2020/60cm/rgbir_cog/32114/m_3211420_ne_11_060_20200524.tif
/vsis3/naip-analytic/ca/2020/60cm/rgbir_cog/32114/m_3211420_nw_11_060_20200524.tif

GDAL uses /vsis3/ prefix and not s3://

@ahalota
Copy link
Author

ahalota commented Jan 5, 2023

Ah, I tried that one already as well, didn't work unfortunately.

@vincentsarago
Copy link
Member

it worked for me 🤷‍♂️

@ahalota
Copy link
Author

ahalota commented Jan 5, 2023

Ah, that's a good start!

Did you run the same command I had listed above, or do you have the environment variables set locally?
Are you using Windows or Linux?

@vincentsarago
Copy link
Member

$ cat test.txt 
/vsis3/naip-analytic/ca/2020/60cm/rgbir_cog/32114/m_3211419_nw_11_060_20200524.tif
/vsis3/naip-analytic/ca/2020/60cm/rgbir_cog/32114/m_3211420_ne_11_060_20200524.tif
/vsis3/naip-analytic/ca/2020/60cm/rgbir_cog/32114/m_3211420_nw_11_060_20200524.tif

$ gdalbuildvrt -input_file_list test.txt ca_32114_urls.vrt --config AWS_REGION "us-west-2" --config AWS_REQUEST_PAYER "requester"
0...10...20...30...40...50...60...70...80...90...100 - done.

$ cat ca_32114_urls.vrt 
<VRTDataset rasterXSize="39990" rasterYSize="13100">
  <SRS dataAxisToSRSAxisMapping="1,2">PROJCS["NAD83 / UTM zone 11N",GEOGCS["NAD83",DATUM["North_American_Datum_1983",SPHEROID["GRS 1980",6378137,298.257222101,AUTHORITY["EPSG","7019"]],AUTHORITY["EPSG","6269"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4269"]],PROJECTION["Transverse_Mercator"],PARAMETER["latitude_of_origin",0],PARAMETER["central_meridian",-117],PARAMETER["scale_factor",0.9996],PARAMETER["false_easting",500000],PARAMETER["false_northing",0],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH],AUTHORITY["EPSG","26911"]]</SRS>
  <GeoTransform>  7.1059800000000000e+05,  5.9999999999999998e-01,  0.0000000000000000e+00,  3.6265380000000000e+06,  0.0000000000000000e+00, -5.9999999999999998e-01</GeoTransform>
  <VRTRasterBand dataType="Byte" band="1">
    <ColorInterp>Red</ColorInterp>
    <SimpleSource>
      <SourceFilename relativeToVRT="0">/vsis3/naip-analytic/ca/2020/60cm/rgbir_cog/32114/m_3211419_nw_11_060_20200524.tif</SourceFilename>
      <SourceBand>1</SourceBand>
      <SourceProperties RasterXSize="10680" RasterYSize="12440" DataType="Byte" BlockXSize="512" BlockYSize="512" />
      <SrcRect xOff="0" yOff="0" xSize="10680" ySize="12440" />
      <DstRect xOff="0" yOff="660" xSize="10680" ySize="12440" />
    </SimpleSource>
    <ComplexSource>
      <SourceFilename relativeToVRT="0">/vsis3/naip-analytic/ca/2020/60cm/rgbir_cog/32114/m_3211420_ne_11_060_20200524.tif</SourceFilename>
      <SourceBand>1</SourceBand>
      <SourceProperties RasterXSize="10710" RasterYSize="12450" DataType="Byte" BlockXSize="512" BlockYSize="512" />
      <SrcRect xOff="0" yOff="0" xSize="10710" ySize="12450" />
      <DstRect xOff="29280" yOff="0" xSize="10710" ySize="12450" />
      <NODATA>0</NODATA>
    </ComplexSource>
    <SimpleSource>
      <SourceFilename relativeToVRT="0">/vsis3/naip-analytic/ca/2020/60cm/rgbir_cog/32114/m_3211420_nw_11_060_20200524.tif</SourceFilename>
      <SourceBand>1</SourceBand>
      <SourceProperties RasterXSize="10700" RasterYSize="12440" DataType="Byte" BlockXSize="512" BlockYSize="512" />
      <SrcRect xOff="0" yOff="0" xSize="10700" ySize="12440" />
      <DstRect xOff="19520" yOff="230" xSize="10700" ySize="12440" />
    </SimpleSource>
  </VRTRasterBand>
  <VRTRasterBand dataType="Byte" band="2">
    <ColorInterp>Green</ColorInterp>
    <SimpleSource>
      <SourceFilename relativeToVRT="0">/vsis3/naip-analytic/ca/2020/60cm/rgbir_cog/32114/m_3211419_nw_11_060_20200524.tif</SourceFilename>
      <SourceBand>2</SourceBand>
      <SourceProperties RasterXSize="10680" RasterYSize="12440" DataType="Byte" BlockXSize="512" BlockYSize="512" />
      <SrcRect xOff="0" yOff="0" xSize="10680" ySize="12440" />
      <DstRect xOff="0" yOff="660" xSize="10680" ySize="12440" />
    </SimpleSource>
    <ComplexSource>
      <SourceFilename relativeToVRT="0">/vsis3/naip-analytic/ca/2020/60cm/rgbir_cog/32114/m_3211420_ne_11_060_20200524.tif</SourceFilename>
      <SourceBand>2</SourceBand>
      <SourceProperties RasterXSize="10710" RasterYSize="12450" DataType="Byte" BlockXSize="512" BlockYSize="512" />
      <SrcRect xOff="0" yOff="0" xSize="10710" ySize="12450" />
      <DstRect xOff="29280" yOff="0" xSize="10710" ySize="12450" />
      <NODATA>0</NODATA>
    </ComplexSource>
    <SimpleSource>
      <SourceFilename relativeToVRT="0">/vsis3/naip-analytic/ca/2020/60cm/rgbir_cog/32114/m_3211420_nw_11_060_20200524.tif</SourceFilename>
      <SourceBand>2</SourceBand>
      <SourceProperties RasterXSize="10700" RasterYSize="12440" DataType="Byte" BlockXSize="512" BlockYSize="512" />
      <SrcRect xOff="0" yOff="0" xSize="10700" ySize="12440" />
      <DstRect xOff="19520" yOff="230" xSize="10700" ySize="12440" />
    </SimpleSource>
  </VRTRasterBand>
  <VRTRasterBand dataType="Byte" band="3">
    <ColorInterp>Blue</ColorInterp>
    <SimpleSource>
      <SourceFilename relativeToVRT="0">/vsis3/naip-analytic/ca/2020/60cm/rgbir_cog/32114/m_3211419_nw_11_060_20200524.tif</SourceFilename>
      <SourceBand>3</SourceBand>
      <SourceProperties RasterXSize="10680" RasterYSize="12440" DataType="Byte" BlockXSize="512" BlockYSize="512" />
      <SrcRect xOff="0" yOff="0" xSize="10680" ySize="12440" />
      <DstRect xOff="0" yOff="660" xSize="10680" ySize="12440" />
    </SimpleSource>
    <ComplexSource>
      <SourceFilename relativeToVRT="0">/vsis3/naip-analytic/ca/2020/60cm/rgbir_cog/32114/m_3211420_ne_11_060_20200524.tif</SourceFilename>
      <SourceBand>3</SourceBand>
      <SourceProperties RasterXSize="10710" RasterYSize="12450" DataType="Byte" BlockXSize="512" BlockYSize="512" />
      <SrcRect xOff="0" yOff="0" xSize="10710" ySize="12450" />
      <DstRect xOff="29280" yOff="0" xSize="10710" ySize="12450" />
      <NODATA>0</NODATA>
    </ComplexSource>
    <SimpleSource>
      <SourceFilename relativeToVRT="0">/vsis3/naip-analytic/ca/2020/60cm/rgbir_cog/32114/m_3211420_nw_11_060_20200524.tif</SourceFilename>
      <SourceBand>3</SourceBand>
      <SourceProperties RasterXSize="10700" RasterYSize="12440" DataType="Byte" BlockXSize="512" BlockYSize="512" />
      <SrcRect xOff="0" yOff="0" xSize="10700" ySize="12440" />
      <DstRect xOff="19520" yOff="230" xSize="10700" ySize="12440" />
    </SimpleSource>
  </VRTRasterBand>
  <VRTRasterBand dataType="Byte" band="4">
    <SimpleSource>
      <SourceFilename relativeToVRT="0">/vsis3/naip-analytic/ca/2020/60cm/rgbir_cog/32114/m_3211419_nw_11_060_20200524.tif</SourceFilename>
      <SourceBand>4</SourceBand>
      <SourceProperties RasterXSize="10680" RasterYSize="12440" DataType="Byte" BlockXSize="512" BlockYSize="512" />
      <SrcRect xOff="0" yOff="0" xSize="10680" ySize="12440" />
      <DstRect xOff="0" yOff="660" xSize="10680" ySize="12440" />
    </SimpleSource>
    <ComplexSource>
      <SourceFilename relativeToVRT="0">/vsis3/naip-analytic/ca/2020/60cm/rgbir_cog/32114/m_3211420_ne_11_060_20200524.tif</SourceFilename>
      <SourceBand>4</SourceBand>
      <SourceProperties RasterXSize="10710" RasterYSize="12450" DataType="Byte" BlockXSize="512" BlockYSize="512" />
      <SrcRect xOff="0" yOff="0" xSize="10710" ySize="12450" />
      <DstRect xOff="29280" yOff="0" xSize="10710" ySize="12450" />
      <NODATA>0</NODATA>
    </ComplexSource>
    <SimpleSource>
      <SourceFilename relativeToVRT="0">/vsis3/naip-analytic/ca/2020/60cm/rgbir_cog/32114/m_3211420_nw_11_060_20200524.tif</SourceFilename>
      <SourceBand>4</SourceBand>
      <SourceProperties RasterXSize="10700" RasterYSize="12440" DataType="Byte" BlockXSize="512" BlockYSize="512" />
      <SrcRect xOff="0" yOff="0" xSize="10700" ySize="12440" />
      <DstRect xOff="19520" yOff="230" xSize="10700" ySize="12440" />
    </SimpleSource>
  </VRTRasterBand>
  <OverviewList resampling="nearest">2 4 8 16 32</OverviewList>
</VRTDataset>

I'm on Mac OS, I have couple env set but it shouldn't make any difference for accessing the files

$ env | grep "GDAL"
GDAL_CACHEMAX=75%
GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR
GDAL_HTTP_MERGE_CONSECUTIVE_RANGES=YES
GDAL_HTTP_MULTIPLEX=YES
GDAL_INGESTED_BYTES_AT_OPEN=32768
GDAL_DATA=/opt/homebrew/share/gdal
GDAL_HTTP_VERSION=2

GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR is maybe the most important, but this will only speed up the process

@ahalota
Copy link
Author

ahalota commented Jan 5, 2023

Do you have additional AWS environment variables set? I saw you omitted AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, so I'm guessing they are set as environment variables.

@vincentsarago
Copy link
Member

I'm guessing they are set as environment variables.

yes they are in my env by default so I omitted those

@ahalota
Copy link
Author

ahalota commented Jan 5, 2023

I tried a few more things, and it seems something was corrupted with the my input file list, "ca_32114_urls.txt".

I had created that one by copy and pasting the text in from prior command line output, then saving it via Notepad++. I created a new file, test.txt and copied and pasted in the same text from the previous file and it worked. I also tried out a few different configurations as far as the file name (underscores, digits, etc), but none of these seemed to recreate the issue. Renaming the original file did not resolve the error on it, no matter what I did I cannot make that one file work. If I make a copy of the original file, it also does not work.

With the new, uncorrupted urls.txt file, I was able to create my ca_mosaic.vrt
Then I can pass this into gdalwarp to download exactly the subset I am looking for. I ran it at a very low resolution initially to see if it was providing the data I wanted.

gdalwarp -cutline "california_boundary.shp" ca_32114_urls.vrt ca_32114_3pics.tiff 
-tr 600 -600 
--config AWS_REGION_NAME "us-west-2" 
--config AWS_REQUEST_PAYER "requester"
--config AWS_ACCESS_KEY_ID <my_access_key> 
--config AWS_SECRET_ACCESS_KEY <my_secret_key> 

Thank you so much for your help with this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants