Partition epoch as a multi-GPU dataset distribution method #712
base: master
Conversation
Why? What number? This sounds wrong. If the array of offsets fits into memory (which must be the case with our implementation), it should be fast to shuffle this. It is this code:
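(The code referenced here did not survive into this excerpt. As a rough sketch of what shuffling an in-memory array of offsets amounts to — my illustration with NumPy, not the actual RETURNN code:)

```python
import numpy as np

# Illustrative only: an index/offset array for a few million sequences
# easily fits into memory and can be shuffled very quickly.
num_seqs = 10_000_000
rng = np.random.RandomState(1)  # fixed seed, so every worker gets the same order
seq_order = rng.permutation(num_seqs)  # shuffled sequence indices/offsets
```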
But anyway, maybe you don't refer to "random" seq order but to "laplace"? This is slow because it needs the sequence length of every sequence up front. However, this can also be solved in a different way, more like TF is doing it. We can do the laplace sorting on the fly (just like TF does bucketing on the fly). So you use "random" seq order on the dataset itself and do the length-based grouping afterwards, as sketched below.
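(A minimal sketch of such on-the-fly length bucketing — generic Python, my illustration rather than TF's or RETURNN's actual implementation; `seq_lens`, `bucket_width` and `bucket_size` are made-up names:)

```python
from collections import defaultdict

def iter_bucketed(seq_indices, seq_lens, bucket_width=10, bucket_size=100):
    """Group sequences of similar length into batches on the fly:
    consume the (already randomly ordered) indices, collect them into
    length buckets, and emit a bucket as soon as it is full, so no
    global length-based sort over the whole dataset is needed."""
    buckets = defaultdict(list)
    for i in seq_indices:
        b = seq_lens[i] // bucket_width
        buckets[b].append(i)
        if len(buckets[b]) >= bucket_size:
            yield buckets.pop(b)
    for b in list(buckets):  # flush the partially filled buckets at the end
        yield buckets.pop(b)
```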
You mean because you stick to "default" seq order for the sub-datasets? But what if you use "random" seq order for the sub-datasets instead?
But this doesn't need to be the case. Instead, we could do this only if no batch slicing is used. With caching (which doesn't make sense here anyway), or for some other datasets, maybe it needs the small changes I described. So, in any case, I don't really see an advantage of the proposed method.
Force-pushed from f40d178 to 69dacf4.
You mentioned exactly the things I have been working on. 😄
Yes, besides the sequence order, this is the other issue. But regardless of these improvements, if you have a huge dataset and you want to shuffle it, you will always need to store an array with one entry per sequence.
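(To put a rough number on that — my own back-of-the-envelope estimate, not from the discussion — the per-sequence index array itself stays small even for very large datasets:)

```python
# Memory estimate for an index array over 100M sequences (illustrative):
num_seqs = 100_000_000
print(num_seqs * 4 / 1e9, "GB as uint32")  # ~0.4 GB
print(num_seqs * 8 / 1e9, "GB as uint64")  # ~0.8 GB
# A plain Python list of ints needs roughly an order of magnitude more.
```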
Oh, I didn't think of that. But then using CombinedDataset (or actually any CachedDataset2) is "the problem", as it actually loads and caches all the sequences from start to end.
Ok, with #568, you get rid of that. But what is the problem then? Storing the sequence-order array should not be an issue, and then shuffling it also not.
Ah yes. But my proposed changes should cover that. Alternatively, we could also change this in another way.
Btw, I just saw this Twitter post on large data shuffling behavior, which might be relevant for you.
Let's say 100M sequences. This is a very high-resource setting, but definitely realistic for MT data nowadays. I used this script to simulate what is currently done:
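(The script itself is not reproduced here; a rough sketch of such a simulation, assuming a plain Python list and `random.shuffle` — an assumption on my side, not necessarily the original script — could look like this:)

```python
import random
import resource
import time

num_seqs = 100_000_000  # "100M sequences"
start = time.time()
seq_order = list(range(num_seqs))  # plain Python list of indices
random.shuffle(seq_order)          # pure-Python Fisher-Yates shuffle
print("elapsed: %.1f min" % ((time.time() - start) / 60.0))
# ru_maxrss is reported in kB on Linux (bytes on macOS)
print("peak RSS: %.1f GB" % (resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1e6))
```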
For me, it runs 53 minutes and uses 3.8 GB of RAM. (For 10M sequences it is 5 minutes and 0.4 GB.) #568 should also help here. I will also try the fix you described.
For reference, the same with NumPy, using uint32 and then uint64 index arrays:
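(The measured numbers did not survive into this excerpt; a sketch of how such a measurement could look — my reconstruction, not the original snippet:)

```python
import time
import numpy as np

num_seqs = 100_000_000
for dtype in (np.uint32, np.uint64):
    start = time.time()
    # permutation() produces int64; casting to uint32 halves the memory
    seq_order = np.random.permutation(num_seqs).astype(dtype)
    print(dtype.__name__, "%.1f sec," % (time.time() - start),
          "%.1f GB" % (seq_order.nbytes / 1e9))
```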
(uint64 is faster, but needs more memory.) This seems to be fast enough. If you want it even faster, there are shuffling algorithms which work more on the fly. In the extreme case, we could also simply introduce another "random"-like sort scheme where you just compute the permuted index on the fly, as sketched below. (Probably needs to be slightly more complex to cover some edge cases, but you get the idea...) (If you want to implement and test this, please make this a separate PR, so we do not mix up things.)
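(One possible on-the-fly scheme — my own sketch of the general idea, not necessarily what was meant above: walk the indices with a stride that is coprime to the number of sequences, which gives a fixed pseudo-random-looking permutation without ever materializing an index array.)

```python
import math

def on_the_fly_order(num_seqs, seed):
    """Yield a permutation of range(num_seqs) without storing it:
    an affine permutation i -> (offset + i * stride) % num_seqs,
    with the stride chosen coprime to num_seqs. Illustrative sketch only."""
    if num_seqs <= 0:
        return
    stride = 2 * seed + 3  # some odd value derived from the seed
    while math.gcd(stride, num_seqs) != 1:
        stride += 2
    offset = (seed * 7919) % num_seqs  # arbitrary seed-dependent start offset
    for i in range(num_seqs):
        yield (offset + i * stride) % num_seqs
```

(This is a valid permutation as long as the stride is coprime to `num_seqs`, but it is far from a uniformly random shuffle, which is presumably part of the "slightly more complex" caveat above.)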
Please also make this a separate PR.
The main commit is 704ee2b. I propose to reuse the partition epoch logic to distribute different parts of the data to different GPUs.

The reason I needed an alternative to the `random_seed_offset` method is that it does not work in cases where the sequence ordering is not random, or at least not random in all aspects. In my case this was true for the following setup: several `HDFDataset`s in default sequence ordering (too costly to shuffle at run-time because of the huge number of sequences), combined via `CombinedDataset` with `laplace` sequence ordering, using the `sampling_sizes` parameter, i.e. taking a fixed number of sequences from the HDFDatasets per epoch. `random_seed_offset` has the desired effect on the `CombinedDataset` level; however, the sequences that are sampled from the HDFDatasets are the same for all GPUs, which is bad.

With the new `partition` method, the sampling and shuffling are done identically for all GPUs, but then a different partition is selected per GPU. This is done via the sequence ordering, so no data is loaded and then thrown away as with the `shard` method. An additional advantage over `random_seed_offset` is that the original meaning of an epoch is preserved.

Implementation-wise this is not 100% optimal, because it only works for datasets using `get_seq_order_for_epoch()`. Also, I had to add a nasty `disable_horovod_partition` attribute to only apply the partitioning on the `CombinedDataset` level and not in the sub-datasets. But this is a very special case for this sampling sequence ordering; I think all other current meta-dataset configurations should work with partitioning being done in the sub-datasets.
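(As a minimal sketch of the idea described above — my illustration with assumed names like `seq_order`, `rank` and `num_ranks`, not the actual code of the PR: every worker computes the same global sequence order and then keeps only the contiguous partition belonging to its rank, analogous to how `partition_epoch` slices an epoch.)

```python
import numpy as np

def partition_seq_order(seq_order, rank, num_ranks):
    """Return the slice of the (globally identical) sequence order that
    belongs to worker `rank` out of `num_ranks` workers. Only the order
    is partitioned, so no data is loaded and then thrown away."""
    partitions = np.array_split(np.asarray(seq_order), num_ranks)
    return partitions[rank]

# Example: 10 sequences, shuffled identically on every worker (same seed),
# then distributed over 3 workers.
order = np.random.RandomState(42).permutation(10)
for rank in range(3):
    print(rank, partition_seq_order(order, rank, num_ranks=3))
```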