HDF Dataset - General Design #434
-
I think the "Next-Gen" HDF tried to encode different data in "streams" instead of this fixed input/target distinction, but I guess this format is not really used. I am not familiar with what the HDF format supports, and what is efficient and what is not, so I can not make a guess what can be done. It is definitely an issue I often come across, and having to introduce a fake time-axis for fixed data and removing it again in the network is definitely not convenient. So I am happy to contribute to any changes if planned, I just did not find the time yet to start looking into this issue. |
-
All of that is already supported by the main hdf_dump tool. I forgot again: what was the purpose of having this specific hdf_dump_translation_dataset tool? I think it is bad design to have a separate tool just for the translation dataset. There should not be a reason for it (or if there is, we should fix that).
First of all, this is not a difference. This is just arbitrary. You can simply ignore this, or redefine it as you like.
Historical reason. There is no real reason and also no real difference conceptually. From the interface point of view, all are just data streams.
Yes, but that is just a technical internal detail. It doesn't really matter at all.
Why does that matter? You can configure it in any way, i.e. what you use in the network. Or if you really don't like it, you could simply remap it (see the sketch below for one option).
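For example, something along these lines should work, assuming MetaDataset's data_map (dataset and key names here are made up; check the MetaDataset documentation for the exact semantics):

```python
# Hypothetical config snippet: expose the HDF streams under different names via MetaDataset.
train = {
    "class": "MetaDataset",
    "datasets": {"hdf": {"class": "HDFDataset", "files": ["train.hdf"]}},
    "data_map": {
        # new name:          (sub-dataset, original data key)
        "source_factor0": ("hdf", "data"),
        "classes": ("hdf", "classes"),
    },
    "seq_order_control_dataset": "hdf",
}
```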
Again, it's just a technical internal detail. It should not matter at all.
Or you ignore the inputs (keep it empty), and just use the "targets".
Yes, but you could simply use a dummy dimension there. Maybe we could have an internal flag in the HDF such that in the network, you will not get this dummy dimension. I think it should be pretty simple to extend the format for that.
Yeah, but why is the artificial dimension a problem? In any case, as said, it should be extremely simple to extend HDF for that.
This is already supported.
Yes, the HDFDataset internal cache. Also note that there is another cache besides that one. There are two cases you need to distinguish: if the HDF file is small enough that it can be loaded completely into the cache (into memory), then you can use the cache, and this is probably faster (only the initial loading time is slow, because it does not fill the cache in the background; you have to wait for it). If it is too big to be loaded fully into memory, I very much recommend not using the HDFDataset internal cache. The way it is implemented is extremely inefficient and buggy as well: it will read N chunks, then do nothing for the next N chunks until the cache is empty, then again block and load N chunks. This is much slower than just doing it async in the background.
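As a rough config sketch of both cases (parameter name from memory, so double-check it against your RETURNN version):

```python
# Hypothetical dataset config: explicitly disable (or size) the HDFDataset-internal cache.
train = {
    "class": "HDFDataset",
    "files": ["train.hdf"],   # made-up file name
    "cache_byte_size": 0,     # 0 disables the internal cache (recommended if the file is big)
    # "cache_byte_size": 50 * 1024 ** 3,  # alternatively: big enough to hold the whole file
}
```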
Yeah, but why is that a problem? The access should be extremely cheap, and reading from disk is extremely fast. And it is already done async in the background anyway.
Yeah. Also just because the logic was somehow kept separate between inputs and targets, so it was too complicated and extra effort, and @doetsch was too lazy to do it. :)
Before you really implement something, you should first really understand where the bottleneck is, and why. As said, there is already a cache in place.
I don't understand. Why not just improve it?
-
We plan to use HDF Dataset even more in the future than we do now, expanding `tools/hdf_dump_translation_dataset.py` such that it supports adding basically any type of data (dense/sparse, scalar/vector/time-sequence) from text files, not just source/target for MT. Before we commit to that, I wanted to discuss some general design aspects:
Inputs vs. Targets
@JackTemaki seems to use HDF dataset without a target. In parallel to his recent PR, I actually implemented using HDF dataset with targets only. Is there even a fundamental reason why HDF Dataset distinguishes between inputs and targets? (Storing labels is one difference...) The current HDF file structure has exactly one input; in case you have multiple, you have to define all of them but one as targets. Also, the data key of the inputs is hardcoded to "data", which is annoying in case none of the inputs can be considered the "main" one.
So, is it a good idea to only use the targets? I guess it won't be possible to rename the "targets" field, but at least then everything is on one level.
Just for reference, this is the structure of our hdf files:
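(The layout can be listed quickly with h5py; a minimal sketch, where the file name is hypothetical and the names in the comments are just the usual HDFDataset ones, not necessarily exactly what is in our files:)

```python
import h5py

# Print all groups/datasets and the file-level attributes of an HDF file.
with h5py.File("train.hdf", "r") as f:
    f.visit(print)        # typically e.g. "inputs", "seqLengths", "seqTags", "targets/data/...", "targets/size/..."
    print(dict(f.attrs))  # file-level attributes, e.g. numSeqs, numTimesteps
```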
`inputs` should ideally be "source_factor0", and the other source factors should not be "targets".
Data Dimensions
See #331. The data is stored in the HDF file as one flat big sequence (for each data key). That is efficient, and I don't think we want to or can change it; see the sketch below. But currently, because of that, the data must have one time dimension, right? I think a feature dimension works too, but scalars are definitely not supported without adding an artificial dimension. Supporting N-dimensional data is not relevant for us right now, but would also be a nice-to-have.
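To make the flat-storage point concrete, a small numpy-only sketch (not actual HDFDataset code):

```python
import numpy as np

# All sequences of one data key are concatenated along the time axis,
# plus a separate array of sequence lengths to recover the boundaries.
seq_a = np.array([[1.0], [2.0], [3.0]])  # 3 frames, feature dim 1
seq_b = np.array([[4.0]])                # a per-sequence scalar needs a dummy time axis of length 1
inputs = np.concatenate([seq_a, seq_b], axis=0)  # flat array, shape (4, 1)
seq_lengths = np.array([3, 1])

# Reading sequence i back is just a slice of the flat array:
offsets = np.concatenate([[0], np.cumsum(seq_lengths)])
seq_b_restored = inputs[offsets[1]:offsets[2]]   # shape (1, 1), dummy axis still present
```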
I think no high-level discussion needed here, but I still wanted to put it here.
Caching
@albertz, I see you often recommend turning caching off; this is what we do for our MT system trainings and it works well. However, I trained small models where reading the HDF data was actually the bottleneck. If I profiled correctly, it takes more than 10 minutes to load 100k sequences with an HDFDataset (where the data for each sequence was actually tiny). This is because the HDF file is accessed 100k times, once per sequence (actually times the number of data streams, I think). I wanted to try out caching because of that, only to find out that actually only the "inputs" of the HDF file are cached, while the "targets" are loaded into memory as a whole, right? I assume this is because the dataset was originally used only for the ASR case, where targets are comparably small in size? (Maybe @doetsch can leave a short comment.)
Would it be desirable to implement caching for all data? I actually tried once but stopped when I saw it's quite an effort. But if I didn't do something completely wrong, it seems to be necessary for small/fast-to-compute networks.
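To illustrate the per-sequence access cost, a self-contained toy benchmark with synthetic data (nothing RETURNN-specific; absolute numbers will of course depend on the machine and disk):

```python
import time
import h5py
import numpy as np

n_seqs, frames_per_seq, dim = 100_000, 5, 8
with h5py.File("toy.hdf", "w") as f:
    f.create_dataset("inputs", data=np.random.rand(n_seqs * frames_per_seq, dim).astype("float32"))

with h5py.File("toy.hdf", "r") as f:
    ds = f["inputs"]

    t0 = time.time()
    for i in range(n_seqs):  # one HDF access per sequence
        _ = ds[i * frames_per_seq:(i + 1) * frames_per_seq]
    print("per-sequence reads: %.1f s" % (time.time() - t0))

    t0 = time.time()
    _ = ds[:]                # one bulk read into memory
    print("single bulk read:   %.1f s" % (time.time() - t0))
```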
By the way, the inverse of this - writing one sequence at a time to HDF - makes hdf_dump pretty slow. That is one reason why we implemented a separate HDF creation script; a buffered-writing sketch is below.
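A sketch of the buffered-writing idea (purely illustrative, not the actual code of either tool; names and buffer size are made up):

```python
import h5py
import numpy as np

def write_buffered(seqs, filename="dump.hdf", dim=8, flush_frames=100_000):
    """Append sequences to an HDF dataset in large blocks instead of one resize per sequence."""
    buf, buffered = [], 0
    with h5py.File(filename, "w") as f:
        ds = f.create_dataset("inputs", shape=(0, dim), maxshape=(None, dim), dtype="float32")

        def flush():
            nonlocal buf, buffered
            if not buf:
                return
            block = np.concatenate(buf, axis=0)
            ds.resize(ds.shape[0] + len(block), axis=0)
            ds[-len(block):] = block
            buf, buffered = [], 0

        for seq in seqs:       # seq: numpy array of shape [time, dim]
            buf.append(seq)
            buffered += len(seq)
            if buffered >= flush_frames:
                flush()
        flush()

# Example: 1000 random sequences of varying length.
write_buffered(np.random.rand(np.random.randint(1, 20), 8).astype("float32") for _ in range(1000))
```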