Change warning for placeholder size to exception #1143
base: master
Conversation
Thanks for the PR. There are some issues with it. As a starting point for contributing, see here: https://github.com/rwth-i6/returnn/blob/master/CONTRIBUTING.md
|
Can you see the output of the failing tests? For example:
|
Thanks for the feedback, working on it. |
Btw, I just introduced a new behavior version, so you should rebase here.
|
See my updated comment above. As an example on the test cases:

```python
def test_Data_get_common_data_extra_static_spatial():
  d1 = Data(name='t', shape=(None, 32, 128), dtype='float32', auto_create_placeholders=True)
  d2 = Data(name='r', shape=(None, 32, 128), dtype='float32', auto_create_placeholders=True)
  d2.get_size_dim_tag(0).declare_same_as(d1.get_size_dim_tag(0))
  common = Data.get_common_data([d1, d2])
  assert d1.shape == common.shape
```

So it makes sense that this fails now because … It might work if you just remove the … This is a tricky example. We need to understand what this was actually testing. I think this was specifically testing the behavior for some older setups. If my suggestion above did not work, you could also wrap this all up in something like:

```python
orig_behavior_version = BehaviorVersion._behavior_version
try:
  BehaviorVersion._behavior_version = 0
  ...  # now the test
finally:
  BehaviorVersion._behavior_version = orig_behavior_version
```

However, we should be hesitant to use this too often. If possible, we should always update the test case properly. Only if we are fairly sure the test is there to ensure that some old behavior or old setup keeps working, and updating the test would miss some crucial parts of it, only then is it valid to use this. |
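If this pattern is needed in several tests, the try/finally above could be factored into a small context manager. This is only a sketch, not part of RETURNN; the helper name is made up, and it assumes the class exposes a `_behavior_version` attribute as in the snippet above:

```python
import contextlib

@contextlib.contextmanager
def behavior_version_override(cls, version):
    """Temporarily set cls._behavior_version, always restoring the original on exit."""
    orig = cls._behavior_version
    cls._behavior_version = version
    try:
        yield
    finally:
        cls._behavior_version = orig

# Toy stand-in for BehaviorVersion, just to demonstrate the helper.
class FakeBehaviorVersion:
    _behavior_version = 14

with behavior_version_override(FakeBehaviorVersion, 0):
    assert FakeBehaviorVersion._behavior_version == 0
assert FakeBehaviorVersion._behavior_version == 14  # restored even after the block
```

A test would then wrap only the legacy-behavior part in the `with` block, which keeps the override visually scoped.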
I'm running into this issue when trying to execute the tests. I'm working on a Mac M1, any idea how I can fix it?
|
You don't need to run all the tests. You can also just check them here in the issue, or otherwise only run |
It seems like we have to go for that implementation.
I'm not familiar with the code base, though. Should I add this section to every failing test? |
Why? As I said, I really would try to avoid this. Did you try my other suggestion with removing |
When I try to run the code with the changes you suggested, this is the error I get running test_TFUtil.py:

Executing: test_Data_copy_template_excluding_time_dim_two_time_dims
EXCEPTION
Traceback (most recent call last):
File "/Users/darkrock/returnn/tests/test_TFUtil.py", line 3941, in <module>
line: v()
locals:
v = <local> <function test_Data_copy_template_excluding_time_dim_two_time_dims at 0x165fe1c60>
File "/Users/darkrock/returnn/tests/test_TFUtil.py", line 202, in test_Data_copy_template_excluding_time_dim_two_time_dims
line: assert set(data.size_placeholder.keys()) == {0, 1}
locals:
set = <builtin> <class 'set'>
data = <local> Data{'ref_att_weights_output', [B?,T|'time:var-unk:ref_att_weights_output'[?],'spatial1:var-unk:ref_att_weights_output'[?],F|F'feature:ref_att_weights_output'(1)]}
data.size_placeholder = <local> {}, len = 0
data.size_placeholder.keys = <local> <bound method _SizePlaceholderProxy.keys of {}>
AssertionError |
My suggested changes were just for the function |
If I do not change it there, then the newly implemented exception is raised when running the tests:

RequirementNotSatisfied: Dim tags are same with different size placeholders: <tf.Tensor 'extern_data/placeholders/r/r_dim0_size:0' shape=(?,) dtype=int32> vs <tf.Tensor 'extern_data/placeholders/t/t_dim0_size_1:0' shape=(?,) dtype=int32> please check external_data (required since behavior_version >= 14) |
Can you show the full error with stacktrace?
|
Executing: test_Data_get_common_data_extra2_static_spatial
EXCEPTION
Traceback (most recent call last):
File "/Users/dark-rock/returnn/tests/test_TFUtil.py", line 3941, in <module>
line: v()
locals:
v = <local> <function test_Data_get_common_data_extra2_static_spatial at 0x1530c1ea0>
File "/Users/dark-rock/returnn/tests/test_TFUtil.py", line 696, in test_Data_get_common_data_extra2_static_spatial
line: d2.get_size_dim_tag(0).declare_same_as(d1.get_size_dim_tag(0))
locals:
d2 = <local> Data{'r', [B?,T|'time:var:extern_data:t'[B?],'spatial1:static32:r'(32),'spatial2:static32:r'(32),F|F'feature:r'(128)]}
d2.get_size_dim_tag = <local> <bound method Data.get_size_dim_tag of Data{'r', [B?,T|'time:var:extern_data:t'[B?],'spatial1:static32:r'(32),'spatial2:static32:r'(32),F|F'feature:r'(128)]}>
declare_same_as = <not found>
d1 = <local> Data{'t', [B?,T|'time:var:extern_data:t'[B?],'spatial1:static32:t'(32),'spatial2:static32:t'(32),F|F'feature:t'(128)]}
d1.get_size_dim_tag = <local> <bound method Data.get_size_dim_tag of Data{'t', [B?,T|'time:var:extern_data:t'[B?],'spatial1:static32:t'(32),'spatial2:static32:t'(32),F|F'feature:t'(128)]}>
File "/Users/dark-rock/returnn/returnn/tf/util/data.py", line 987, in Dim.declare_same_as
line: BehaviorVersion.require(False,"Dim tags are same with different size placeholders: %r vs %r please check external_data" % (self.dyn_size, other_same_base.dyn_size),14)
locals:
BehaviorVersion = <global> <class 'returnn.util.basic.BehaviorVersion'>
BehaviorVersion.require = <global> <bound method BehaviorVersion.require of <class 'returnn.util.basic.BehaviorVersion'>>
self = <local> Dim{'time:var:extern_data:t'[B?]}
self.dyn_size = <local> <tf.Tensor 'extern_data/placeholders/r/r_dim0_size:0' shape=(?,) dtype=int32>
other_same_base = <local> Dim{'time:var:extern_data:t'[B?]}
other_same_base.dyn_size = <local> <tf.Tensor 'extern_data/placeholders/t/t_dim0_size_1:0' shape=(?,) dtype=int32>
File "/Users/dark-rock/returnn/returnn/util/basic.py", line 302, in BehaviorVersion.require
line: raise BehaviorVersion.RequirementNotSatisfied(
"%s (required since behavior_version >= %i)" % (message, version))
locals:
BehaviorVersion = <global> <class 'returnn.util.basic.BehaviorVersion'>
BehaviorVersion.RequirementNotSatisfied = <global> <class 'returnn.util.basic.BehaviorVersion.RequirementNotSatisfied'>
message = <local> "Dim tags are same with different size placeholders: <tf.Tensor 'extern_data/placeholders/r/r_dim0_size:0' shape=(?,) dtype=int32> vs <tf.Tensor 'extern_data/placeholders/t/t_dim0_size_1:0' shape=(?,) dtype=int32> please check external_data", len = 239
version = <local> 14
RequirementNotSatisfied: Dim tags are same with different size placeholders: <tf.Tensor 'extern_data/placeholders/r/r_dim0_size:0' shape=(?,) dtype=int32> vs <tf.Tensor 'extern_data/placeholders/t/t_dim0_size_1:0' shape=(?,) dtype=int32> please check external_data (required since behavior_version >= 14) |
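For context on the gating seen in this traceback: `require` raises only once the configured behavior version reaches the given threshold, which is exactly this PR's point (what used to be a warning becomes an exception from version 14 on). The following is a simplified stand-in, not RETURNN's actual code:

```python
import warnings

class BehaviorVersionStub:
    """Toy sketch of the require() gating seen above; names mirror the traceback."""
    _behavior_version = 14

    class RequirementNotSatisfied(Exception):
        pass

    @classmethod
    def require(cls, condition, message, version):
        if condition:
            return
        if cls._behavior_version >= version:
            raise cls.RequirementNotSatisfied(
                "%s (required since behavior_version >= %i)" % (message, version))
        # Below the required behavior version, only warn instead of raising.
        warnings.warn(message)

BehaviorVersionStub.require(True, "ok", 14)  # condition holds, nothing happens
try:
    BehaviorVersionStub.require(False, "dim tags differ", 14)
except BehaviorVersionStub.RequirementNotSatisfied as exc:
    assert "behavior_version >= 14" in str(exc)
```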
But I was only speaking about test_Data_get_common_data_extra_static_spatial. |
Here's the stacktrace:

EXCEPTION
Traceback (most recent call last):
File "/Users/darkrock/returnn/tests/test_TFUtil.py", line 3932, in <module>
line: v()
locals:
v = <local> <function test_Data_get_common_data_extra_static_spatial at 0x1688c1630>
File "/Users/darkrock/returnn/tests/test_TFUtil.py", line 672, in test_Data_get_common_data_extra_static_spatial
line: d2.get_size_dim_tag(0).declare_same_as(d1.get_size_dim_tag(0))
locals:
d2 = <local> Data{'r', [B?,T|'time:var:extern_data:t'[B?],'spatial1:static32:r'(32),F|F'feature:r'(128)]}
d2.get_size_dim_tag = <local> <bound method Data.get_size_dim_tag of Data{'r', [B?,T|'time:var:extern_data:t'[B?],'spatial1:static32:r'(32),F|F'feature:r'(128)]}>
declare_same_as = <not found>
d1 = <local> Data{'t', [B?,T|'time:var:extern_data:t'[B?],'spatial1:static32:t'(32),F|F'feature:t'(128)]}
d1.get_size_dim_tag = <local> <bound method Data.get_size_dim_tag of Data{'t', [B?,T|'time:var:extern_data:t'[B?],'spatial1:static32:t'(32),F|F'feature:t'(128)]}>
File "/Users/darkrock/returnn/returnn/tf/util/data.py", line 987, in Dim.declare_same_as
line: BehaviorVersion.require(False,"Dim tags are same with different size placeholders: %r vs %r please check external_data" % (self.dyn_size, other_same_base.dyn_size),14)
locals:
BehaviorVersion = <global> <class 'returnn.util.basic.BehaviorVersion'>
BehaviorVersion.require = <global> <bound method BehaviorVersion.require of <class 'returnn.util.basic.BehaviorVersion'>>
self = <local> Dim{'time:var:extern_data:t'[B?]}
self.dyn_size = <local> <tf.Tensor 'extern_data/placeholders/r/r_dim0_size:0' shape=(?,) dtype=int32>
other_same_base = <local> Dim{'time:var:extern_data:t'[B?]}
other_same_base.dyn_size = <local> <tf.Tensor 'extern_data/placeholders/t/t_dim0_size_1:0' shape=(?,) dtype=int32>
File "/Users/darkrock/returnn/returnn/util/basic.py", line 302, in BehaviorVersion.require
line: raise BehaviorVersion.RequirementNotSatisfied(
"%s (required since behavior_version >= %i)" % (message, version))
locals:
BehaviorVersion = <global> <class 'returnn.util.basic.BehaviorVersion'>
BehaviorVersion.RequirementNotSatisfied = <global> <class 'returnn.util.basic.BehaviorVersion.RequirementNotSatisfied'>
message = <local> "Dim tags are same with different size placeholders: <tf.Tensor 'extern_data/placeholders/r/r_dim0_size:0' shape=(?,) dtype=int32> vs <tf.Tensor 'extern_data/placeholders/t/t_dim0_size_1:0' shape=(?,) dtype=int32> please check external_data", len = 239
version = <local> 14
RequirementNotSatisfied: Dim tags are same with different size placeholders: <tf.Tensor 'extern_data/placeholders/r/r_dim0_size:0' shape=(?,) dtype=int32> vs <tf.Tensor 'extern_data/placeholders/t/t_dim0_size_1:0' shape=(?,) dtype=int32> please check external_data (required since behavior_version >= 14) |
My suggestion was: In |
Here's the stacktrace:

Executing: test_Data_get_common_data_extra_static_spatial
EXCEPTION
Traceback (most recent call last):
File "/Users/darkrock/returnn/tests/test_TFUtil.py", line 3933, in <module>
line: v()
locals:
v = <local> <function test_Data_get_common_data_extra_static_spatial at 0x1551c1cf0>
File "/Users/darkrock/returnn/tests/test_TFUtil.py", line 672, in test_Data_get_common_data_extra_static_spatial
line: d2.get_size_dim_tag(0).declare_same_as(d1.get_size_dim_tag(0))
locals:
d2 = <local> Data{'r', [B?,T|'time:var-unk:r'[?],'spatial1:static32:r'(32),F|F'feature:r'(128)]}
d2.get_size_dim_tag = <local> <bound method Data.get_size_dim_tag of Data{'r', [B?,T|'time:var-unk:r'[?],'spatial1:static32:r'(32),F|F'feature:r'(128)]}>
declare_same_as = <not found>
d1 = <local> Data{'t', [B?,T|'time:var-unk:t'[?],'spatial1:static32:t'(32),F|F'feature:t'(128)]}
d1.get_size_dim_tag = <local> <bound method Data.get_size_dim_tag of Data{'t', [B?,T|'time:var-unk:t'[?],'spatial1:static32:t'(32),F|F'feature:t'(128)]}>
File "/Users/darkrock/returnn/returnn/tf/util/data.py", line 5415, in Data.get_size_dim_tag
line: axis_wo_batch = sorted(self.size_placeholder.keys())[number]
locals:
axis_wo_batch = <not found>
sorted = <builtin> <built-in function sorted>
self = <local> Data{'r', [B?,T|'time:var-unk:r'[?],'spatial1:static32:r'(32),F|F'feature:r'(128)]}
self.size_placeholder = <local> {}, len = 0
self.size_placeholder.keys = <local> <bound method _SizePlaceholderProxy.keys of {}>
number = <local> 0
IndexError: list index out of range |
Sorry, that was an unrelated bug in |
I think you did not correctly merge or rebase. GitHub still complains: "This branch cannot be rebased due to conflicts" |
Don't first create a new size placeholder and then later call declare_same_as. Esp. this is required when declare_same_as becomes stricter (#1143).
Fix wrong batch info: The dim tag could have an old invalid batch info, e.g. the global batch_dim when it comes from an old run. If we really need this, we should validate the dim tag first. But probably it's better to remove it and clean it up.
Engine reset global batch dim.
I now merged #1159 which should fix this. Can you rebase? |
So the remaining failing tests are:
|
Probably those need to be adapted. E.g. in …:

```python
in0 = Data(
  name="in0", shape=(None, None, n_dim), batch_dim_axis=1, auto_create_placeholders=True)
in1 = Data(
  # same time as first in in0
  name="in1", shape=(None, n_dim), auto_create_placeholders=True)
in2 = Data(
  # same time as second in in0
  name="in2", shape=(None, n_dim), batch_dim_axis=1, auto_create_placeholders=True)
extern_data.register_data(in0)
extern_data.register_data(in1)
extern_data.register_data(in2)
in1.get_size_dim_tag(0).declare_same_as(in0.get_size_dim_tag(0))
in2.get_size_dim_tag(0).declare_same_as(in0.get_size_dim_tag(1))
```

Change this to:

```python
in0 = Data(
  name="in0", shape=(None, None, n_dim), batch_dim_axis=1, auto_create_placeholders=True)
in1 = Data(
  # same time as first in in0
  name="in1", dim_tags=[in0.dim_tags[i] for i in (1, 0, 3)], auto_create_placeholders=True)
in2 = Data(
  # same time as second in in0
  name="in2", dim_tags=[in0.dim_tags[i] for i in (2, 1, 3)], auto_create_placeholders=True)
extern_data.register_data(in0)
extern_data.register_data(in1)
extern_data.register_data(in2)
``` |
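The key idea of the rewrite above is to construct the new `Data` objects with the very same dim tag objects from the start, so that no later `declare_same_as` merge is needed. A toy illustration with stand-in classes (not the real RETURNN `Dim`/`Data` API):

```python
class Dim:
    """Stand-in dim tag: object identity, not value equality, is what matters."""
    def __init__(self, size):
        self.size = size  # None means dynamic (e.g. a variable time axis)

class Data:
    def __init__(self, name, dim_tags):
        self.name = name
        self.dim_tags = list(dim_tags)

time_dim, feat_dim = Dim(None), Dim(128)
in0 = Data("in0", [time_dim, feat_dim])
# Reuse in0's time dim directly instead of creating a new one and merging later.
in1 = Data("in1", [in0.dim_tags[0], feat_dim])

assert in1.dim_tags[0] is in0.dim_tags[0]  # trivially the same, nothing to declare
```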
In |
I haven't looked too much into the other test cases yet, but from a first glance I think they are a bit more difficult. It might be that this now even detects some error which we have not noticed before. E.g. in …
The size placeholders really look different. They should not become the same.
So |
returnn/tf/util/data.py
Outdated
@@ -998,12 +998,11 @@ def declare_same_as(self, other): | |||
if self.dyn_size is not None and other_same_base.dyn_size is not None: |
Can you move this whole block of code (`if self.dyn_size is not None and other_same_base.dyn_size is not None: ...`) a few lines up, just before the line `if self_same_as is not self: ...`?
This should make the error message and stacktrace clearer, because then you still see the dim tags separated and not yet merged.
The trace is pretty big but I think this is the most relevant part:
Finished training in epoch 2.
Run 3: Train results:
{1: EpochData(learningRate=0.01, error={
'dev_error': 0.6000000089406967,
'dev_score': 1.0964271708790534,
'train_error': 0.700000010430813,
'train_score': 1.0993032428518419,
}),
2: EpochData(learningRate=0.01, error={
'dev_error': 0.6000000089406967,
'dev_score': 1.0963989421188671,
'train_error': 0.6600000098347665,
'train_score': 1.0985237666744554,
})}
Run 3: Forward cv seq 0:
[1 2 0 1 2]
Epoch 1, error key 'dev_error', current value 0.600000 vs prev value 0.600000, equal?
Epoch 1, error key 'dev_score', current value 1.096427 vs prev value 1.096427, equal?
Epoch 1, error key 'train_error', current value 0.700000 vs prev value 0.680000, equal?
EXCEPTION
Traceback (most recent call last):
File "/Users/darkrock/returnn/tests/test_TFEngine.py", line 4406, in <module>
line: v()
locals:
v = <local> <function test_rec_subnet_auto_optimize at 0x1658a9120>
File "/Users/darkrock/returnn/tests/test_TFEngine.py", line 2603, in test_rec_subnet_auto_optimize
line: run(run_idx=3, optimize_move_layers_out=True)
locals:
run = <local> <function test_rec_subnet_auto_optimize.<locals>.run at 0x168c6df30>
run_idx = <not found>
optimize_move_layers_out = <not found>
File "/Users/darkrock/returnn/tests/test_TFEngine.py", line 2594, in test_rec_subnet_auto_optimize.<locals>.run
line: numpy.testing.assert_almost_equal(error_value, prev_error_value, decimal=3)
locals:
numpy = <global> <module 'numpy' from '/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/numpy/__init__.py'>
numpy.testing = <global> <module 'numpy.testing' from '/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/numpy/testing/__init__.py'>
numpy.testing.assert_almost_equal = <global> <function assert_almost_equal at 0x165877ac0>
error_value = <local> 0.700000010430813
prev_error_value = <local> 0.6800000101327897
decimal = <not found>
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/numpy/testing/_private/utils.py", line 599, in assert_almost_equal
line: raise AssertionError(_build_err_msg())
locals:
AssertionError = <builtin> <class 'AssertionError'>
_build_err_msg = <local> <function assert_almost_equal.<locals>._build_err_msg at 0x168c6de10>
AssertionError:
Arrays are not almost equal to 3 decimals
ACTUAL: 0.700000010430813
DESIRED: 0.6800000101327897
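For reference on why this assertion fails: `numpy.testing.assert_almost_equal` with `decimal=3` accepts an absolute difference up to about 1.5e-3, so the ~0.02 gap between the two error values above is far outside tolerance. A minimal reproduction:

```python
import numpy

# Within decimal=3 tolerance (abs difference below 1.5e-3): passes silently.
numpy.testing.assert_almost_equal(0.7001, 0.7002, decimal=3)

# The values from the failing test differ by ~0.02, well beyond the tolerance.
try:
    numpy.testing.assert_almost_equal(0.700000010430813, 0.6800000101327897, decimal=3)
    raised = False
except AssertionError:
    raised = True
assert raised
```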
I don't exactly understand your comment. Did you do this change? I don't see it here?
I did the change, didn't commit it yet. I'll push it so you can see the stacktrace in the pipeline.
But I don't see this stacktrace/error you posted here in the pipeline?
returnn/tf/util/data.py
Outdated
@@ -983,6 +983,14 @@ def declare_same_as(self, other): | |||
if self_derived_bases.issubset(other_derived_bases): | |||
# Avoid cycles on derived_from_tag. https://github.com/rwth-i6/returnn/issues/1054 | |||
return other.declare_same_as(self) | |||
if self.dyn_size is not None and other_same_base.dyn_size is not None: |
It's very strange that moving this up now causes different behavior, i.e. that some tests are passing now which were not passing before.
I now wonder whether the check here is actually not so correct, or maybe just incomplete. The `_maybe_update` call below might update `self.dyn_size`. But `self.dyn_size` is also only set if `self.batch` is properly set. In case `self.batch is None`, which is totally valid, `self.dyn_size` is likely also not set.
What we actually want to test is: if there is any combination of `batch` and `ctx` where both `self` and `other` are defined (via the `get_for_batch_ctx` logic), and their `dyn_size` differs, then we raise this exception.
I will come up with some code for this.
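In the spirit of that description, here is a rough sketch with stand-in classes. This is not the RETURNN implementation the comment promises; `get_for_batch_ctx` is only mimicked, and the `dyn_size` values are plain strings standing in for placeholder tensors:

```python
class DimStub:
    """Stand-in dim tag: maps (batch, ctx) to a dyn_size placeholder name."""
    def __init__(self, sizes_by_batch_ctx):
        self._sizes = sizes_by_batch_ctx

    def get_for_batch_ctx(self, batch, ctx):
        # None means this dim is not defined for that (batch, ctx) combination.
        return self._sizes.get((batch, ctx))

def check_dyn_sizes_consistent(self_dim, other_dim, batch_ctx_combinations):
    """Raise if both dims are defined for some (batch, ctx) but with different sizes."""
    for batch, ctx in batch_ctx_combinations:
        a = self_dim.get_for_batch_ctx(batch, ctx)
        b = other_dim.get_for_batch_ctx(batch, ctx)
        if a is not None and b is not None and a != b:
            raise ValueError(
                "Dim tags are same with different size placeholders: %r vs %r" % (a, b))

d_self = DimStub({("batch1", None): "r_dim0_size"})
d_other = DimStub({("batch1", None): "t_dim0_size_1"})
try:
    check_dyn_sizes_consistent(d_self, d_other, [("batch1", None)])
    raised = False
except ValueError:
    raised = True
assert raised  # both defined for the same (batch, ctx) with different sizes
```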
Strange, my first update also did not yield the previous behavior. I am now trying something more.
Fix #1141.
Introduces a new behavior version (#508).