Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upload and AssetBlob garbage collection implementation #2087

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

mvandenburgh
Copy link
Member

Implements the design outlined in #2068.

@mvandenburgh mvandenburgh marked this pull request as ready for review December 3, 2024 18:12


class GarbageCollectionEvent(models.Model):
created = models.DateTimeField(auto_now_add=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is super minor but I feel like timestamp would be a better word to describe this. If we're ever going through these events for some reason, seeing the term "created" thrown about might cause some confusion, as really things are being deleted (except for this record of such).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I agree timestamp is clearer.

type = models.CharField(
max_length=255, help_text='The model name of the records that were garbage collected.'
)
records = models.JSONField(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a particular reason you chose to store a collection of records in one "Event" object, rather than one row per object? I.e. records would become record, and garbage_collection_event_id could be a foreign key to a separate table that stores the events.

One thing that it seems would be easier in that model is searching for a particular garbage collected upload or asset blob.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I realize this is how it's outlined in the design doc, but this question is just coming to me now.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original intention was to just have a single GarbageCollectionEvent when garbage collection runs, and just shove the opaque serialized model into a JSONField for simplicity. But then the issue of JSONField size limits came up, so I switched to the chunking approach.

I agree that just making it one row per object makes more sense, I think I just didn't consider it because the starting point was having everything in one column.

Comment on lines 36 to 37
# Chunk the queryset to avoid creating a single
# GarbageCollectionEvent with a huge JSONField
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may also be something that's easier with an Event <- Record model, as I mentioned in my other comment. You could have many records pointing to one event, which would allow for just one event record to be stored per "actual" event. In the current case, there could be 10 GarbageCollectionEvent rows for what's really one "event', i.e. one invocation of _garbage_collect_uploads or _garbage_collect_asset_blobs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would also allow you to point both asset blobs and uploads back to the same single event record, rather than two separate records.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants