Programmatic Monthly Reindexing Distributions #1048

Open
CarsonDavis opened this issue Oct 2, 2024 · 1 comment
CarsonDavis commented Oct 2, 2024

Description

When new collections are added, we need to ensure they are brought into the reindexing fold. There are several ways this could happen:

  • manually updating each scraper and indexing job file to contain the correct dates
  • having COSMOS assign each collection a date based on some criteria and write it into the job file when the file is created (a rough sketch follows below)
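
For the second option, here is a minimal sketch of what the assignment might look like. The job-file format, the field names, and the 28-day cycle are illustrative assumptions, not the actual COSMOS job format.

import hashlib
import json

DAYS_IN_CYCLE = 28  # assumption: a 28-day cycle so each collection gets a slot every month

def assign_reindex_day(config_name: str) -> int:
    # Hash the config name so the assignment is deterministic and roughly uniform
    digest = hashlib.sha256(config_name.encode()).hexdigest()
    return int(digest, 16) % DAYS_IN_CYCLE + 1  # day 1..28

def write_job_file(config_name: str, path: str) -> None:
    # Record the assigned day in the (hypothetical) job file at creation time
    job = {
        "config_name": config_name,
        "reindex_day_of_month": assign_reindex_day(config_name),
    }
    with open(path, "w") as f:
        json.dump(job, f, indent=2)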

Implementation Considerations

  • see comments for some mock code

Deliverable

A mechanism that programmatically assigns each collection to a monthly reindexing list, so new collections are scheduled without manual edits to scraper and job files.

Dependencies

depends on

CarsonDavis (Collaborator, Author) commented:
class Collection:
    def __init__(self, config_name: str, scrape_time: int):
        self.config_name = config_name
        self.scrape_time = scrape_time

    def __repr__(self):
        return f"Collection({self.config_name}, scrape_time={self.scrape_time})"


def segment_reindexing_lists(collections: list[Collection], hours_per_list: float) -> list[list[Collection]]:
    """Pack collections into reindexing lists using a first-fit-decreasing bin-packing heuristic."""
    # Convert hours to seconds for the max capacity per bin
    max_capacity = hours_per_list * 3600

    # Sort collections by scrape_time in descending order for efficient packing
    sorted_collections = sorted(collections, key=lambda x: x.scrape_time, reverse=True)

    # List to store bins (each bin is a list of collections)
    bins = []

    for collection in sorted_collections:
        placed = False
        # Try to place the collection in an existing bin
        for b in bins:
            if sum(item.scrape_time for item in b) + collection.scrape_time <= max_capacity:
                b.append(collection)
                placed = True
                break

        # If it can't be placed in any existing bin, create a new one
        if not placed:
            bins.append([collection])

    return bins


# Example usage
collections = [
    Collection("Collection1", scrape_time=400),
    Collection("Collection2", scrape_time=8000),
    Collection("Collection3", scrape_time=300),
    Collection("Collection4", scrape_time=900),
    Collection("Collection5", scrape_time=100),
    Collection("Collection6", scrape_time=4000),
    Collection("Collection7", scrape_time=500),
    Collection("Collection8", scrape_time=800),
    Collection("Collection9", scrape_time=1000),
    Collection("Collection10", scrape_time=700),
    Collection("Collection11", scrape_time=15),
    Collection("Collection12", scrape_time=200),
    Collection("Collection13", scrape_time=15),
    Collection("Collection14", scrape_time=8000),
    Collection("Collection15", scrape_time=100),
    Collection("Collection16", scrape_time=4000),
    Collection("Collection17", scrape_time=500),
    Collection("Collection18", scrape_time=1000),
    Collection("Collection19", scrape_time=400),
    Collection("Collection20", scrape_time=300),
]

hours_per_list = 2  # Maximum time per list in hours
result = segment_reindexing_lists(collections, hours_per_list)

for i, reindex_list in enumerate(result):
    print(f"Bin {i+1}: {reindex_list}, Total scrape_time: {sum(item.scrape_time for item in reindex_list)} seconds")

CarsonDavis removed their assignment Oct 21, 2024