Custom Types DataLoader #3008
Conversation
Signed-off-by: Alex-Brooks <[email protected]>
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
To me, it's still not clear what we want to achieve here, even after reading the associated issue. Is the plan really to allow users to pass arbitrary iterables to
Good questions @BenjaminBossan! To be honest, after implementing it, I have mixed feelings too. My thoughts are:

Those are mostly the reasons it's currently implemented the way it is, where you can pass it either as an iterable or as a wrapped dataloader. I think the class itself is interesting and I could imagine it being useful somewhere, but either way feels a bit awkward to expose with respect to the current accelerator API. I'd be really interested in hearing @muellerzr's thoughts as well. In any case, if there is a common enough use case that makes this feature compelling, maybe it would also be useful to add a demo somewhere showing how it's intended to be leveraged, either as part of this PR or a follow-up!
From an API perspective, what might make more sense would be something like:

```python
from accelerate import Accelerator
from accelerate.data_loader import CustomTypesDataLoader

class MyIterableType:
    def __init__(self, data):
        self.data = data

    def __iter__(self):
        return iter(self.data)

# Need to think about what the DataLoaderConfiguration would be here still
accelerator = Accelerator(...)
some_iterable = MyIterableType(list(range(16)))

# Wraps as an `IterableDataset`, does not handle device placement at all.
# If you want to use this, you need to build it yourself and pass it to prepare();
# you can't make it through prepare() with any DataLoaderConfiguration.
custom_types_loader = CustomTypesDataLoader(some_iterable)

# Gives a DataLoaderDeviceMover or something whose only function is to get stuff
# from the provided dataloader and put it on the device
dl = accelerator.prepare(custom_types_loader)
```

This would preserve the capability to do the same thing without expanding the DataLoaderConfiguration/`prepare()` in a weird direction, and reduces the likelihood of bugs with how
Hi! Thanks so much for your first attempt at this. I had to trace back through the issues to figure out what I was really after with this "generic, only place on device" dataloader, and it was from this FR for WebLoader: #2083

The general idea at the time was to support a generic iterable (which would be a dataloader-type class) which we can then pipe into custom implementations more easily, such as WebLoader. IMO it should just place the outputs on the device, and have all the mixins our dataloaders have, to make it work with things like gradient accumulation properly (since that's how we check for it).

Let me know if you want to rethink your PR/simplify it at all/etc. given this new information. It's a great effort, and I agree I did not really expound too much on the FR issue at the time. Let me know if this path forward makes sense to you both and we can continue 🚀

(I'm not 100% certain if doing sharding makes sense for dataloaders like this, hence the "just move to device and make sure gradient accumulation doesn't break".)
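The "just place the outputs on the device, and keep things like gradient accumulation working" idea above can be sketched as a thin wrapper. This is a hedged, illustrative sketch, not the PR's actual implementation: `DevicePlacementLoader` and `move_fn` are hypothetical names, with `move_fn` standing in for a device-placement utility such as accelerate's `send_to_device`.

```python
# Minimal sketch (illustrative only): a wrapper whose sole job is to move each
# yielded batch to a device, delegating everything else to the wrapped iterable.
class DevicePlacementLoader:
    def __init__(self, inner, move_fn):
        self.inner = inner        # any iterable yielding batches
        self.move_fn = move_fn    # e.g. lambda b: send_to_device(b, device)

    def __iter__(self):
        for batch in self.inner:
            yield self.move_fn(batch)

    def __len__(self):
        # Delegate length so utilities that rely on len() (e.g. gradient
        # accumulation bookkeeping) keep working when the inner iterable has one.
        return len(self.inner)

# Usage with a stand-in "move" that just converts each batch to a tuple:
batches = [[0, 1], [2, 3]]
loader = DevicePlacementLoader(batches, move_fn=tuple)
print(list(loader))  # [(0, 1), (2, 3)]
```

The point of the design is that the wrapper is agnostic to where the batches come from, so a WebLoader-style iterable can be piped through it unchanged.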
Thanks for the guidance @muellerzr! It makes sense to try to rework this PR a bit so that we can discuss further - I think assuming that everything that should be prepared as a

could be a recommended solution for data-loader-like classes that aren't instances, which is a bit gross looking, but is at least gross looking outside of accelerate. If I have the bandwidth, I'll try to reproduce the WebLoader issue to see whether the revisions in this PR are able to fix it; it would be nice to validate its usefulness 🤞
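One way to handle "data-loader-like classes that aren't instances" of `torch.utils.data.DataLoader` is duck typing rather than an `isinstance` check. This is a hedged sketch of that idea only; `is_dataloader_like` and `WebLoaderStandIn` are hypothetical names, not anything in accelerate or WebDataset.

```python
# Illustrative sketch: treat any object that iterates and exposes `batch_size`
# as "dataloader-like", without requiring it to subclass DataLoader.
def is_dataloader_like(obj):
    return hasattr(obj, "__iter__") and hasattr(obj, "batch_size")

class WebLoaderStandIn:
    # Mimics a third-party loader that exposes batch_size but is not a
    # torch.utils.data.DataLoader subclass.
    batch_size = 4

    def __iter__(self):
        return iter([])

print(is_dataloader_like(WebLoaderStandIn()))  # True
print(is_dataloader_like([1, 2, 3]))           # False (no batch_size attribute)
```

The trade-off is exactly the "gross looking" one mentioned above: duck typing keeps accelerate's API unchanged but pushes the contract onto the user's class.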
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
What does this PR do?
Fixes #2975
This PR adds a `CustomTypesDataLoader`, which allows custom iterable types (either wrapped in a PyTorch DataLoader, which would normally throw a TypeError once you start to iterate over it, or passed directly) to be moved to the Accelerator state's device. This adds two args to the `DataLoaderConfiguration` class:
- `custom_types: bool`; indicates whether or not an instance of this class should be created when `prepare()` is invoked on a dataloader or iterable type
- `custom_type_batch_size: int`; batch size to be used for the `CustomTypesDataLoader`; if the iterable is already held under a PyTorch dataloader, this is optional, and the batch size will be pulled off of the data loader.

Minimal example:
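The PR's actual minimal example was not preserved in this page. As a hedged illustration of the batch-size behavior described above (optional when the source already exposes a batch size, required for a bare iterable), here is a sketch; `resolve_batch_size` and `batched` are illustrative names, not the PR's code.

```python
# Illustrative sketch of the described `custom_type_batch_size` semantics.
def resolve_batch_size(obj, custom_type_batch_size=None):
    existing = getattr(obj, "batch_size", None)
    if existing is not None:
        return existing  # already held under a dataloader: pull it off of that
    if custom_type_batch_size is None:
        raise ValueError("custom_type_batch_size is required for bare iterables")
    return custom_type_batch_size

def batched(iterable, batch_size):
    # Group a flat iterable into lists of at most `batch_size` items.
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

print(list(batched(range(5), resolve_batch_size(range(5), 2))))
# [[0, 1], [2, 3], [4]]
```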
Running with `export ACCELERATE_TORCH_DEVICE=cuda` provides the following output:

Before submitting
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@muellerzr @BenjaminBossan @SunMarc