feat: Support for RS2 Downsampler #465

Each call to `inform_samples` should be provided with a different set of `sample_ids`.

Why? That would not be the case in the trainer server / PyTorch trainer due to the nature of downsampling, and it also will not make a difference.

Yeah, it does not make a difference here, as we just test the shape. But naturally they should be different, because in `sample_and_batch` in `pytorch_trainer.py` we first iterate over the dataloader and keep informing each batch in `_iterate_dataloader_and_compute_scores`:

modyn/modyn/trainer_server/internal/trainer/pytorch_trainer.py, line 836 in ac66eaf

The `sample_ids` come from the dataloader and should be naturally distinct, right? (They are the keys of the samples.)

No, they are not. What differs is the model output (on which true downsamplers sample), but the list of samples is always the same, since the trigger training set from the selector does not change between epochs. Since RS2 only relies on the IDs, it should not matter. The IDs will in all cases be identical across epochs.

I think we have a misunderstanding. I am saying that consecutive calls to `inform_samples` within the boundaries of two `select_points` calls should contain different sample IDs. I copy the code of `_iterate_dataloader_and_compute_scores` here:

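Roughly, that loop looks like the following sketch; this is not the verbatim Modyn code, and the batch layout and the exact `inform_samples` signature are simplified assumptions:

```python
# Sketch of the batch-wise informing pattern (not the verbatim Modyn code;
# the batch layout and the downsampler interface are simplified assumptions).
def _iterate_dataloader_and_compute_scores(dataloader, model, downsampler):
    for batch in dataloader:
        sample_ids, data, target = batch  # the IDs are the keys of the samples
        forward_output = model(data)
        # One call per batch: each batch carries its own, distinct sample_ids,
        # and select_points() only runs after the whole dataloader pass.
        downsampler.inform_samples(sample_ids, forward_output, target)
```
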
You see: we load one batch after another from the dataloader. One `inform_samples` call does not contain the entire dataset but just one batch. The first batch must have different sample IDs than the second batch. That means that if we do not call `select_points` in the middle, consecutive `inform_samples` calls should contain different sample IDs. I am not talking about sample IDs across epochs; those definitely do not change.

i.e. if we have:
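A hypothetical sketch of that pattern; `downsampler`, `forward_output`, and `target` stand in for the real fixtures:

```python
downsampler.inform_samples(list(range(10)), forward_output, target)
downsampler.select_points()
downsampler.inform_samples(list(range(10)), forward_output, target)  # same IDs are fine here
downsampler.select_points()
```
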
then the first `inform_samples` call can have the same sample IDs as the second one.

But when we do:
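Again a hypothetical sketch with the same placeholder names:

```python
downsampler.inform_samples(list(range(0, 10)), forward_output, target)   # batch 1
downsampler.inform_samples(list(range(10, 20)), forward_output, target)  # batch 2, distinct IDs
downsampler.select_points()
```
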
Suppose the whole dataset contains two batches. Then the first two `inform_samples` calls should contain different `sample_ids`.

In this unit test, we only keep calling `inform_samples(...)` without calling `select_points(...)`, so each call should contain distinct `sample_ids`.

Anyway, I think it does not really make a difference here to use different sample IDs. But I still do think that consecutive `inform_samples` calls (without a `select_points` call in the middle) should contain distinct sample IDs.

I understand your point and agree with your description, but I still don't understand why you are suggesting it here :D The code is this:
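A sketch of that test loop; the `forward_output`/`target` arguments are placeholders, not the verbatim test code:

```python
sample_ids = list(range(10))

for _ in range(3):  # the outer loop is the epoch loop (!)
    # one inform_samples call per epoch, always over the same distinct ID set
    downsampler.inform_samples(sample_ids, forward_output, target)
    selected_ids, weights = downsampler.select_points()
```
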
So the loop is the epoch loop (!). Since `sample_ids = list(range(10))`, we don't have duplicate samples within the same epoch, and we have consistent samples across epochs. This is exactly like you describe. I am not sure if I am missing something, or whether you just confused this loop with something else. I am merging this for now and am happy to do a follow-up PR in case I am missing something here.

Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
" this unit test, we only keep calling inform_samples(...) without calling select_points(...),"
i dont get it. isn't it directly below :D?