
Add footprint finder code #127

Draft: snbianco wants to merge 9 commits into main
Conversation

@snbianco (Collaborator) commented Sep 18, 2024:

This is a draft PR that adds the footprint finder code from TESScut. Leaving as draft since I haven't added tests yet, but I'd welcome any feedback on the actual implementation.

I tried to make things as general as possible in terms of function and variable names, but some things are more TESS-specific and will need to be pulled out into a wrapper later when we generalize. Namely, _extract_sequence_information, _create_sequence_list, and _get_cube_files_from_sequence_obs are TESS-specific as of now. The same is true of certain parts of cube_cut_from_footprint, mainly some of its variables.

Something I was unsure about is how to best handle multiprocessing. The cube_cut_from_footprint function takes a threads parameter to use as max_workers when using cube_cut on the gathered cube files. However, the cube_cut function also takes its own threads parameter. Should these be two separate parameters to cube_cut_from_footprint, or the same? Should threads in cube_cut just be set to 'auto'?
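
For reference, here is a rough sketch of how the threads parameter in cube_cut_from_footprint is wired up at the moment; the helper names below are stand-ins for illustration, not the actual functions in this PR:

from concurrent.futures import ThreadPoolExecutor, as_completed


def _cutout_single_cube(cube_file):
    # Stand-in for the real cube_cut call, which also takes its own threads parameter
    return f"cutout_of_{cube_file}"


def _cutout_all_cubes(cube_files, threads=1):
    # Generate a cutout from each cube file, optionally with an outer thread pool
    if threads == 'auto' or threads > 1:
        max_workers = None if threads == 'auto' else threads
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            futures = [pool.submit(_cutout_single_cube, f) for f in cube_files]
            return [future.result() for future in as_completed(futures)]
    return [_cutout_single_cube(f) for f in cube_files]


print(_cutout_all_cubes(['cube_1.fits', 'cube_2.fits'], threads='auto'))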

def single_intersect(ffi, polygon):
    return ffi.intersects_poly(polygon)

return np.vectorize(single_intersect)(ffi_list["polygon"], polygon)
@snbianco (Collaborator, Author):

I tried adding multiprocessing to this function, but found that there was no performance improvement (if anything, execution time went up because of thread overhead). In any case, the ra_dec_crossmatch function takes less than a second to run.

Member:

When you say threads, do you actually mean threads or processes?

This is CPU bound, and currently, with the GIL, there will not be any improvement using threads. You would need to use multiple processes.

@snbianco (Collaborator, Author) commented Sep 18, 2024:

I did mean threads, but I just tried with ProcessPoolExecutor, and oddly enough, timing is a bit worse. It takes around 1 second longer with the process pool vs. vectorize. This is the code I'm using for comparison:

from concurrent.futures import ProcessPoolExecutor

import numpy as np
from astropy.table import Table
from spherical_geometry.polygon import SphericalPolygon


def single_intersect(ffi, polygon):
    return ffi.intersects_poly(polygon)


def _ffi_intersect_process(ffi_list: Table, polygon: SphericalPolygon, workers=None):
    """
    Process-pool version of the spherical_geometry intersects_poly check
    (adapted from the vectorized _ffi_intersect).
    """

    with ProcessPoolExecutor(max_workers=workers) as executor:
        # Map the function across the ffi_list["polygon"]
        results = list(executor.map(single_intersect, ffi_list["polygon"], [polygon] * len(ffi_list)))

    return np.array(results)

@falkben (Member) commented Sep 18, 2024:

I tried your attached code and couldn't get it to work in a different file without some modifications. I'm also not sure the above was correct, given the arguments you've specified in map.

However, I can confirm that using the process pool executor is no faster, and a tiny bit worse, no matter what number of processes I use. There is overhead to starting up subprocesses in Python, so perhaps that's the issue we're running into. Also, spherical_geometry might do some multiprocessing itself, and we're just getting in the way of it by starting multiple processes?

Running the code below, the intersect takes just over 1 second on my machine, whereas the version in this PR takes around a second or just under.

import functools
import sys
import time
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import freeze_support

import numpy as np
from astropy.coordinates import SkyCoord
from astropy.table import Table
from spherical_geometry.polygon import SphericalPolygon

import astrocut.cube_cut_from_footprint

ccff = sys.modules["astrocut.cube_cut_from_footprint"]


def single_intersect(polygon, ffi_poly):
    return ffi_poly.intersects_poly(polygon)


def _ffi_intersect_process(ffi_list: Table, polygon: SphericalPolygon, workers=None):
    """
    Vectorizing the spherical_coordinate intersects_polygon function
    """

    start_t = time.monotonic()
    with ProcessPoolExecutor(max_workers=4) as executor:
        # Map the function across the ffi_list["polygon"]

        single_intersect_poly = functools.partial(single_intersect, polygon)
        results = executor.map(single_intersect_poly, ffi_list["polygon"], chunksize=64)
    print("took:", time.monotonic() - start_t)

    return np.array(list(results))


if __name__ == "__main__":
    # monkeypatch our version in
    ccff._ffi_intersect = _ffi_intersect_process

    footprint_file = "s3://tesscut-ops-footprints/tess_ffi_footprint_cache.json"

    all_ffis = ccff._get_s3_ffis(footprint_file, as_table=True, load_polys=True)

    coordinates = "217.42893801 -62.67949189"
    coordinates = SkyCoord(coordinates, unit="deg")

    cutout_size = ccff.parse_size_input(10)
    cone_results = ccff._ra_dec_crossmatch(all_ffis, coordinates, cutout_size, ccff.TESS_ARCSEC_PER_PX)

    print(cone_results)

@scfleming (Collaborator) commented Sep 18, 2024:

My initial reaction is to say let's set up a single n_threads parameter for multi-threading of any function that can use it. While on paper it might be nice to say "use 8 threads for this one but 16 for that one", that sounds like over-engineering to me at this stage of the process, and it would be much simpler to have a single n_threads parameter used globally.


# Generate cutout from each cube file
cutout_files = []
if threads == 'auto' or threads > 1:
Member:


I am not sure threads here will help, and it may in fact hurt. cube_cut already uses threads for accessing the S3 data, so with this change, each cutout file would spawn that many threads. In my testing, there are diminishing returns after 8 threads. Since this could end up creating many times that number of threads, I expect we'd see thread contention here.

If you set threads to 0, versus setting it to "auto" or 8, what are the results here?
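
To make the nesting concern concrete, here is a rough stub (the names are stand-ins, not the real astrocut functions); with an outer pool of 8 and an inner pool of 8 per cube_cut call, you can end up with on the order of 64 worker threads alive at once:

import threading
import time
from concurrent.futures import ThreadPoolExecutor


def _s3_read(_):
    # Stand-in for an S3 range request made inside cube_cut
    time.sleep(0.2)
    return threading.active_count()


def cube_cut_stub(cube_file, inner_threads=8):
    # Each cube_cut call opens its own thread pool, as in the current draft
    with ThreadPoolExecutor(max_workers=inner_threads) as pool:
        return max(pool.map(_s3_read, range(inner_threads)))


if __name__ == '__main__':
    cube_files = [f'cube_{i}.fits' for i in range(7)]
    # Outer pool, standing in for the one in cube_cut_from_footprint
    with ThreadPoolExecutor(max_workers=8) as outer:
        peak = max(outer.map(cube_cut_stub, cube_files))
    print('peak thread count observed:', peak)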

@snbianco (Collaborator, Author) commented Sep 19, 2024:

Testing on my machine, I do see a performance improvement with a larger number of threads. It's more apparent when a sector isn't provided and more than 1 cutout is being generated. For example, these commands each generate 7 cutouts.

cube_cut_from_footprint('130 30', cutout_size=50, threads=0) --> 1 min, 23.4 sec
cube_cut_from_footprint('130 30', cutout_size=50, threads=8) --> 57.6 sec
cube_cut_from_footprint('130 30', cutout_size=50, threads='auto') --> 46.4 sec

Member:


Is this before or after you made the change to pass the same threads variable into the cube_cut function?

@snbianco (Collaborator, Author) commented Sep 19, 2024:

Looks like I ran that test before making the change, when threads was still set to 'auto' for cube_cut. When using the same threads variable, the call with threads=0 takes a lot longer, which makes sense. I also see that there's less than a second of difference between threads=8 and threads='auto'.

Maybe it would be best to keep threads for cube_cut constant at some value, like 'auto' or 8? I think that using threads in cube_cut_from_footprint is still worthwhile for the performance improvement when making several cutouts at once, but performance does seem to stagnate after a certain point.

I'm also thinking that the default value for threads in cube_cut_from_footprint should be set to 8 rather than 1, since performance is consistently better.
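
Roughly what I have in mind, as an illustrative stub (not the actual astrocut signatures): keep the value passed down to cube_cut pinned at 'auto', and default the outer pool to 8 workers.

from concurrent.futures import ThreadPoolExecutor


def cube_cut(cube_file, threads='auto'):
    # Stand-in for the real cube_cut; its threads value stays fixed at 'auto' under this proposal
    return f"cutout_of_{cube_file}"


def cube_cut_from_footprint(cube_files, threads=8):
    # Proposed default of 8 outer workers; threads=0 or 1 falls back to a serial loop
    if threads != 'auto' and threads <= 1:
        return [cube_cut(f) for f in cube_files]
    max_workers = None if threads == 'auto' else threads
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(cube_cut, cube_files))


print(cube_cut_from_footprint([f'cube_{i}.fits' for i in range(7)]))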

@@ -28,3 +28,4 @@ class UnsupportedPythonError(Exception):
from .cutout_processing import (path_to_footprints, center_on_path, # noqa
                                CutoutsCombiner, build_default_combine_function) # noqa
from .asdf_cutouts import asdf_cut, get_center_pixel # noqa
from .cube_cut_from_footprint import cube_cut_from_footprint # noqa
Member:


This hides the module. Is that intended?

I think it would be better to name the module differently from the function that you're intending to export.

@snbianco (Collaborator, Author):


Good point! I renamed the module to footprint_cutouts.

I'm not sure I follow what you mean by hiding the module, though. You can still import the module on its own.

Member:


Without the change to the module name, you couldn't easily import the module from another file. You had to resort to some trickery, e.g.:

import sys

import astrocut.cube_cut_from_footprint

ccff = sys.modules["astrocut.cube_cut_from_footprint"]

Member:


Note that the first import above doesn't give you the module; what you get is the function. But the module does get added to sys.modules when you import the function, so you can extract it from there.
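
Concretely, assuming the pre-rename layout where astrocut's __init__.py re-exports the function under the same name as its module:

import sys

import astrocut.cube_cut_from_footprint

# The attribute on the package has been rebound to the function by the
# `from .cube_cut_from_footprint import cube_cut_from_footprint` line in __init__.py...
print(type(astrocut.cube_cut_from_footprint))  # <class 'function'>

# ...but the submodule was still imported and registered under its dotted name,
# so it can be pulled back out of sys.modules.
ccff = sys.modules["astrocut.cube_cut_from_footprint"]
print(type(ccff))  # <class 'module'>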

@snbianco (Collaborator, Author):


I see, that makes sense!

Comment on lines 198 to 201
sequences : int, List[int], optional
Default None. Sequence(s) from which to generate cutouts. Can provide a single
sequence number as an int or a list of sequence numbers. If not specified,
cutouts will be generated from all sequences that contain the cutout.
Member:


Maybe you were trying to generalize this, but I'm not sure what sequences are. Are these sectors?

@snbianco (Collaborator, Author):


Yes, they refer to sectors! I was trying to make the parameter more general and borrowed "sequence" from the CAOM field descriptions: https://mast.stsci.edu/api/v0/_c_a_o_mfields.html

Member:


I guess for a user it might be a little unclear that for TESS this means sectors, and not cameras or CCDs or anything like that, so maybe some documentation or examples would help here.
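
For instance, something along these lines (hypothetical usage, assuming the current draft's signature):

from astrocut import cube_cut_from_footprint

# For TESS, `sequences` corresponds to sector numbers: this restricts the cutouts
# to sectors 27 and 28 instead of every sector that contains the target.
cutout_files = cube_cut_from_footprint('130 30', cutout_size=50, sequences=[27, 28])
print(cutout_files)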

@snbianco (Collaborator, Author) commented Sep 19, 2024:

Added some more info and an example to the docstring in the latest commit! I'll also be updating the documentation at some point (probably next sprint) and will definitely include examples there.
