Programmatic Monthly Reindexing Distributions #1048

Open
CarsonDavis opened this issue Oct 2, 2024 · 1 comment
CarsonDavis commented Oct 2, 2024

Description

When new collections are added, we need to ensure they are brought into the reindexing fold. There are several ways this could happen:

  • manually updating each scraper and indexing job file to contain the correct dates
  • having COSMOS assign each collection a date based on some criteria and write it into the job file when the file is created (a rough sketch follows below)
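
For the second option, here is a minimal sketch of what the assignment might look like. The job-file format, the field names, and the 28-day cycle are illustrative assumptions, not the actual COSMOS job format.

import hashlib
import json

DAYS_IN_CYCLE = 28  # assumption: a 28-day cycle so each collection gets a slot every month

def assign_reindex_day(config_name: str) -> int:
    # Hash the config name so the assignment is deterministic and roughly uniform
    digest = hashlib.sha256(config_name.encode()).hexdigest()
    return int(digest, 16) % DAYS_IN_CYCLE + 1  # day 1..28

def write_job_file(config_name: str, path: str) -> None:
    # Record the assigned day in the (hypothetical) job file at creation time
    job = {
        "config_name": config_name,
        "reindex_day_of_month": assign_reindex_day(config_name),
    }
    with open(path, "w") as f:
        json.dump(job, f, indent=2)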

Implementation Considerations

  • see comments for some mock code

Deliverable

A mechanism that programmatically assigns each collection to a monthly reindexing list, so new collections are scheduled without manual edits to scraper and job files.

Dependencies

depends on

CarsonDavis (Collaborator, Author) commented:
class Collection:
    def __init__(self, config_name: str, scrape_time: int):
        self.config_name = config_name
        self.scrape_time = scrape_time

    def __repr__(self):
        return f"Collection({self.config_name}, scrape_time={self.scrape_time})"


def segment_reindexing_lists(collections: list[Collection], hours_per_list: float) -> list[list[Collection]]:
    """Pack collections into reindexing lists using a first-fit-decreasing bin-packing heuristic."""
    # Convert hours to seconds for the max capacity per bin
    max_capacity = hours_per_list * 3600

    # Sort collections by scrape_time in descending order for efficient packing
    sorted_collections = sorted(collections, key=lambda x: x.scrape_time, reverse=True)

    # List to store bins (each bin is a list of collections)
    bins = []

    for collection in sorted_collections:
        placed = False
        # Try to place the collection in an existing bin
        for b in bins:
            if sum(item.scrape_time for item in b) + collection.scrape_time <= max_capacity:
                b.append(collection)
                placed = True
                break

        # If it can't be placed in any existing bin, create a new one
        if not placed:
            bins.append([collection])

    return bins


# Example usage
collections = [
    Collection("Collection1", scrape_time=400),
    Collection("Collection2", scrape_time=8000),
    Collection("Collection3", scrape_time=300),
    Collection("Collection4", scrape_time=900),
    Collection("Collection5", scrape_time=100),
    Collection("Collection6", scrape_time=4000),
    Collection("Collection7", scrape_time=500),
    Collection("Collection8", scrape_time=800),
    Collection("Collection9", scrape_time=1000),
    Collection("Collection10", scrape_time=700),
    Collection("Collection11", scrape_time=15),
    Collection("Collection12", scrape_time=200),
    Collection("Collection13", scrape_time=15),
    Collection("Collection14", scrape_time=8000),
    Collection("Collection15", scrape_time=100),
    Collection("Collection16", scrape_time=4000),
    Collection("Collection17", scrape_time=500),
    Collection("Collection18", scrape_time=1000),
    Collection("Collection19", scrape_time=400),
    Collection("Collection20", scrape_time=300),
]

hours_per_list = 2  # Maximum time per list in hours
result = segment_reindexing_lists(collections, hours_per_list)

for i, reindex_list in enumerate(result):
    print(f"Bin {i+1}: {reindex_list}, Total scrape_time: {sum(item.scrape_time for item in reindex_list)} seconds")

CarsonDavis removed their assignment Oct 21, 2024