HDF Dataset - General Design #434
-
I think the "Next-Gen" HDF tried to encode different data in "streams" instead of this fixed input/target distinction, but I guess this format is not really used. I am not familiar with what the HDF format supports, and what is efficient and what is not, so I can not make a guess what can be done. It is definitely an issue I often come across, and having to introduce a fake time-axis for fixed data and removing it again in the network is definitely not convenient. So I am happy to contribute to any changes if planned, I just did not find the time yet to start looking into this issue. |
-
All of that is already supported by the main hdf_dump tool. I forgot again: what was the purpose of having this specific hdf_dump_translation_dataset tool? I think it is bad design to have a separate tool just for the translation dataset. There should not be a reason for it (or if there is, we should fix that).
First of all, this is not a difference. This is just arbitrary. You can simply ignore this, or redefine it as you like.
Historical reason. There is no real reason and also no real difference conceptually. From the interface point of view, all are just data streams.
Yes, but that is just a technical internal detail. It doesn't really matter at all.
Why does that matter? You can configure it in any way, i.e. what you use in the network. Or if you really don't like it, you could simply remap it (see the sketch below for one option).
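For example, something along these lines should work, assuming MetaDataset's data_map (dataset and key names here are made up; check the MetaDataset documentation for the exact semantics):

```python
# Hypothetical config snippet: expose the HDF streams under different names via MetaDataset.
train = {
    "class": "MetaDataset",
    "datasets": {"hdf": {"class": "HDFDataset", "files": ["train.hdf"]}},
    "data_map": {
        # new name:          (sub-dataset, original data key)
        "source_factor0": ("hdf", "data"),
        "classes": ("hdf", "classes"),
    },
    "seq_order_control_dataset": "hdf",
}
```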
Again, it's just a technical internal detail. It should not matter at all.
Or you ignore the inputs (keep it empty), and just use the "targets".
Yes, but you could simply use a dummy dimension there. Maybe we could have an internal flag in the HDF such that in the network, you will not get this dummy dimension. I think it should be pretty simple to extend the format for that.
Yeah, but why is the artificial dimension a problem? In any case, as said, it should be extremely simple to extend HDF for that.
This is already supported.
Yes, the HDFDataset internal cache. Also note that there is another cache besides that one. There are two cases you need to distinguish: if the HDF file is small enough that it can be loaded completely into the cache (into memory), then you can use the cache, and this is probably faster (only the initial loading time is slow, because it does not fill the cache in the background; you have to wait for it). If it is too big to be loaded fully into memory, I very much recommend not using the HDFDataset internal cache. The way it is implemented is extremely inefficient and buggy as well: it will read N chunks, then do nothing for the next N chunks until the cache is empty, then again block and load N chunks. This is much slower than just doing it async in the background.
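As a rough config sketch of both cases (parameter name from memory, so double-check it against your RETURNN version):

```python
# Hypothetical dataset config: explicitly disable (or size) the HDFDataset-internal cache.
train = {
    "class": "HDFDataset",
    "files": ["train.hdf"],   # made-up file name
    "cache_byte_size": 0,     # 0 disables the internal cache (recommended if the file is big)
    # "cache_byte_size": 50 * 1024 ** 3,  # alternatively: big enough to hold the whole file
}
```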
Yeah, but why is that a problem? The access should be extremely cheap, and reading from disk is extremely fast. And it is already done async in the background anyway.
Yeah. Also just because the logic was somehow kept separate between inputs and targets, so it was too complicated and extra effort, and @doetsch was too lazy to do it. :)
Before you really implement something, you should first really understand where the bottleneck is, and why. As said, there is already a cache in place.
I don't understand. Why not just improve it?
-
We plan to use HDF Dataset even more in the future than we do now, expanding `tools/hdf_dump_translation_dataset.py` such that it supports adding basically any type of data (dense/sparse, scalar/vector/time-sequence) from text files, not just source/target for MT. Before we commit to that, I wanted to discuss some general design aspects:
Inputs vs. Targets
@JackTemaki seems to use HDF dataset without a target. In parallel to his recent PR, I actually implemented using HDF dataset with targets only. Is there even a fundamental reason why HDF Dataset distinguishes between inputs and targets? (Storing labels is one difference...) The current HDF file structure has exactly one input; in case you have multiple, you have to define all of them but one as targets. Also, the data key of the inputs is hardcoded to "data", which is annoying in case none of the inputs can be considered the "main" one.
So, is it a good idea to only use the targets? I guess it won't be possible to rename the "targets" field, but at least then everything is on one level.
Just for reference, this is the structure of our hdf files:
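(The layout can be listed quickly with h5py; a minimal sketch, where the file name is hypothetical and the names in the comments are just the usual HDFDataset ones, not necessarily exactly what is in our files:)

```python
import h5py

# Print all groups/datasets and the file-level attributes of an HDF file.
with h5py.File("train.hdf", "r") as f:
    f.visit(print)        # typically e.g. "inputs", "seqLengths", "seqTags", "targets/data/...", "targets/size/..."
    print(dict(f.attrs))  # file-level attributes, e.g. numSeqs, numTimesteps
```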
`inputs` should ideally be "source_factor0", and the other source factors should not be "targets".
Data Dimensions
See #331. The data is stored in the HDF file as one flat big sequence (for each data key). That is efficient, and I don't think we want to or can change it; see the sketch below. But currently, because of that, the data must have one time dimension, right? I think a feature dimension works too, but scalars are definitely not supported without adding an artificial dimension. Supporting N-dimensional data is not relevant for us right now, but would also be a nice-to-have.
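To make the flat-storage point concrete, a small numpy-only sketch (not actual HDFDataset code):

```python
import numpy as np

# All sequences of one data key are concatenated along the time axis,
# plus a separate array of sequence lengths to recover the boundaries.
seq_a = np.array([[1.0], [2.0], [3.0]])  # 3 frames, feature dim 1
seq_b = np.array([[4.0]])                # a per-sequence scalar needs a dummy time axis of length 1
inputs = np.concatenate([seq_a, seq_b], axis=0)  # flat array, shape (4, 1)
seq_lengths = np.array([3, 1])

# Reading sequence i back is just a slice of the flat array:
offsets = np.concatenate([[0], np.cumsum(seq_lengths)])
seq_b_restored = inputs[offsets[1]:offsets[2]]   # shape (1, 1), dummy axis still present
```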
I think no high-level discussion needed here, but I still wanted to put it here.
Caching
@albertz, I see you often recommend turning caching off; this is what we do for our MT system trainings and it works well. However, I trained small models where reading the HDF data was actually the bottleneck. If I profiled correctly, it takes more than 10 minutes to load 100k sequences with an HDFDataset (where the data for each sequence was actually tiny). This is because the HDF file is accessed 100k times, once per sequence (actually times the number of data streams, I think). I wanted to try out caching because of that, only to find out that actually only the "inputs" of the HDF file are cached, while the "targets" are loaded into memory as a whole, right? I assume this is because the dataset was originally used only for the ASR case, where targets are comparably small in size? (Maybe @doetsch can leave a short comment.)
Would it be desirable to implement caching for all data? I actually tried once but stopped when I saw it's quite an effort. But if I didn't do something completely wrong, it seems to be necessary for small/fast-to-compute networks.
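To illustrate the per-sequence access cost, a self-contained toy benchmark with synthetic data (nothing RETURNN-specific; absolute numbers will of course depend on the machine and disk):

```python
import time
import h5py
import numpy as np

n_seqs, frames_per_seq, dim = 100_000, 5, 8
with h5py.File("toy.hdf", "w") as f:
    f.create_dataset("inputs", data=np.random.rand(n_seqs * frames_per_seq, dim).astype("float32"))

with h5py.File("toy.hdf", "r") as f:
    ds = f["inputs"]

    t0 = time.time()
    for i in range(n_seqs):  # one HDF access per sequence
        _ = ds[i * frames_per_seq:(i + 1) * frames_per_seq]
    print("per-sequence reads: %.1f s" % (time.time() - t0))

    t0 = time.time()
    _ = ds[:]                # one bulk read into memory
    print("single bulk read:   %.1f s" % (time.time() - t0))
```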
By the way, the inverse of this - writing one sequence at a time to HDF - makes hdf_dump pretty slow. That is one reason why we implemented a separate HDF creation script; a buffered-writing sketch is below.
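A sketch of the buffered-writing idea (purely illustrative, not the actual code of either tool; names and buffer size are made up):

```python
import h5py
import numpy as np

def write_buffered(seqs, filename="dump.hdf", dim=8, flush_frames=100_000):
    """Append sequences to an HDF dataset in large blocks instead of one resize per sequence."""
    buf, buffered = [], 0
    with h5py.File(filename, "w") as f:
        ds = f.create_dataset("inputs", shape=(0, dim), maxshape=(None, dim), dtype="float32")

        def flush():
            nonlocal buf, buffered
            if not buf:
                return
            block = np.concatenate(buf, axis=0)
            ds.resize(ds.shape[0] + len(block), axis=0)
            ds[-len(block):] = block
            buf, buffered = [], 0

        for seq in seqs:       # seq: numpy array of shape [time, dim]
            buf.append(seq)
            buffered += len(seq)
            if buffered >= flush_frames:
                flush()
        flush()

# Example: 1000 random sequences of varying length.
write_buffered(np.random.rand(np.random.randint(1, 20), 8).astype("float32") for _ in range(1000))
```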