Some issues in Tutorial_Multidimensional_Motif_Discovery and MDL #946
-
Also, I'm not sure whether this is related to issue (2) mentioned in the previous post, but I feel that the current min/max computation may not be entirely correct because we are finding the min and max considering ALL values across ALL dimensions. Shouldn't we compute the min/max of each dimension separately?
-
I don't know, I followed what you did and got the same MDL results both times (without setting the last value to 1000 and then with setting the last value to 1000):
and it prints:
This produces
Are you sure? In the code that you provided, Wait, why are you doing this?
This takes the first row of all three time series and multiplies it by 1000. Then it takes the second row of all three time series and multiplies it by 100. Finally, it takes the third row of all three time series and multiplies it by 10. Is this what you really want? I would've thought that you wanted to multiply ALL values of the first time series by 1000, and so on.
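To make the difference concrete, here is a minimal NumPy sketch. The toy array `T`, the `scale` vector, and the row/column layouts are assumptions made for illustration and are not the code from the original post:

```python
import numpy as np

# Hypothetical toy data: 3 time series stored as rows, each of length 5.
T = np.arange(15, dtype=np.float64).reshape(3, 5)
scale = np.array([1000.0, 100.0, 10.0])

# Scale EVERY value of time series i by scale[i].
T_scaled = T * scale[:, np.newaxis]

# If the time series are stored as columns instead (shape (5, 3)),
# broadcasting over the last axis gives the same per-series scaling.
T_cols_scaled = T.T * scale

# Scaling the first three *time points* is a different operation:
# it touches all three series, but only at indices 0, 1, and 2.
T_rows_scaled = T.T.copy()
T_rows_scaled[:3] *= scale[:, np.newaxis]
```

Which form is right depends on whether the time series run along rows or columns, so it is worth checking the shape before scaling.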
-
Without having put too much thought into it, I don't think so. If you apply min/max to each dimension separately, then you'd be discretizing each dimension using completely independent functions. I believe that the same discretization function must be applied uniformly across all dimensions using the same min/max.
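A small sketch of that argument follows. The `discretize` helper is a hypothetical linear min/max mapping onto `2**n_bit` integer levels, assumed here for illustration and not necessarily identical to what STUMPY does internally:

```python
import numpy as np

def discretize(a, a_min, a_max, n_bit=8):
    # Hypothetical helper: map [a_min, a_max] linearly onto 1..2**n_bit.
    scaled = (a - a_min) / (a_max - a_min) * (2**n_bit - 1)
    return np.round(scaled).astype(np.int64) + 1

# Hypothetical toy data: two dimensions with different ranges.
T = np.array([
    [0.0, 5.0, 10.0],
    [0.0, 5.0, 100.0],
])

# Shared min/max: one discretization function for all dimensions.
shared = discretize(T, T.min(), T.max())

# Per-dimension min/max: an independent function for each dimension.
per_dim = discretize(
    T, T.min(axis=1, keepdims=True), T.max(axis=1, keepdims=True)
)
```

With the shared min/max, the raw value 5.0 maps to the same integer in both dimensions. With per-dimension min/max, the same raw value maps to different integers, which is what "completely independent functions" means in practice.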
-
Oops, my bad! The title of the second part should have been:
So, I meant when normalize is
Again... my bad! That should have been
Would you mind trying this?
@seanlaw
-
Okay, I am able to reproduce it now. In both cases, I think the issue stems from the fact that one or more of the time series has a significantly larger/different min/max range, which then affects the MDL modeling.

In our current implementation, we are assuming that the data from each time series are sampled from the same distribution (e.g., all of the time series come from three different thermostats that are sitting in the same room). However, it's possible that you have three time series that are collecting values from different distributions (e.g., all three sensors are in the same room but one is measuring temperature, a second is measuring pressure, and a third is measuring the amount of CO2 gas). In the first case, it's likely "okay" to use the default discretization function. In the latter case, however, it might not make any sense and, instead, the user should specify their own discretization function.

Perhaps your question is whether there is a "smarter" way to warn the user that the default discretization function may be bad/insufficient for their data, and/or whether there is a better default discretization function?
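The effect of a single large value on a shared min/max discretization can be sketched as follows. The `discretize` helper and the toy data are assumptions for illustration only, not the library's actual code or the notebook's data:

```python
import numpy as np

def discretize(a, a_min, a_max, n_bit=8):
    # Hypothetical helper: map [a_min, a_max] linearly onto 1..2**n_bit.
    scaled = (a - a_min) / (a_max - a_min) * (2**n_bit - 1)
    return np.round(scaled).astype(np.int64) + 1

# Hypothetical toy data: 3 dimensions, each holding the values 0..7.
T = np.tile(np.arange(8, dtype=np.float64), (3, 1))

# Distinct levels used by dimension 0 when all ranges are comparable.
levels_before = np.unique(discretize(T, T.min(), T.max())[0]).size

# Set the last value of the last dimension to 1000, as in issue (2).
T[-1, -1] = 1000.0

# The shared max is now 1000, so dimension 0 is squeezed into fewer levels.
levels_after = np.unique(discretize(T, T.min(), T.max())[0]).size
```

Dimension 0 is unchanged, yet it ends up with far fewer distinct levels once the shared maximum jumps, so the bit costs that MDL compares no longer reflect its structure.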
Yes and no. It depends. If the scales of the time series are unrelated, then "yes". If the scales of the time series are related (as in the former case above), then "no".
Again, this is fine in the former case above but not fine for the latter case.
-
Yes... I noticed it after playing with data and checking out the results.
Yeah, and I was hoping to see whether it helps me find a solution for #942. My main reason for creating this issue was to dig a little deeper and see whether it is possible to consider a customized offset
You are right. Sadly, I don't know either.
-
I have been looking through different documents to better understand MDL, and I came across this tutorial notebook, which explains multidimensional motif discovery. I discovered a few issues:
(1) The locations of co-motifs do not match
According to Fig. 2 in Matrix Profile VI, the locations of motifs in the first two dimensions are the same. Personally, I call these co-motifs, i.e., the motif pair `(A, A')` in one dimension and the motif pair `(B, B')` in another dimension start at the same index. (Also see Definition 11.)

The toy data provided in the notebook, however, does not result in matching indices for the motifs in the first two dimensions.
(2) When I set `normalize=True`, everything is good. But if I set the last value of the time series in the last dimension to `1000`, I get an inconclusive result when I use MDL.

And I see this plot when I visualize the MDL results:
In this case, the minimum is at index 2. However, we know that this is not correct. It is interesting that the elbow still indicates the correct result:
(3) Let's set `normalize` to False again. Also, let's scale the time series in dimensions 0, 1, and 2 by 1000, 100, and 10, respectively.

And I get this:
But I was expecting to get the same index for the first two dimensions. In this case, I think the reason is that we are just adding the distances across dimensions. See:
https://github.com/EitanHemed/stumpy/blob/d569c9adbb5f4fd3ba018661a78ac80cbb2d5808/stumpy/core.py#L3999-L4001
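A toy illustration of why summing non-normalized distances lets the largest-scale dimension dominate is given below. The subsequences `Q` and `C` and the scale factors are hypothetical, and this is not the code at the link above:

```python
import numpy as np

# Hypothetical subsequences: row i holds the subsequence from dimension i.
Q = np.array([[0.0, 1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0, 1.0]])
C = np.array([[0.0, 1.0, 0.0, 2.0],
              [1.0, 0.0, 1.0, 0.0]])

# Non-normalized Euclidean distance, computed separately per dimension.
d = np.linalg.norm(Q - C, axis=1)

# Scale dimension 0 by 1000 and leave dimension 1 untouched.
scale = np.array([1000.0, 1.0])[:, np.newaxis]
d_scaled = np.linalg.norm(scale * Q - scale * C, axis=1)

# Fraction of the summed distance contributed by dimension 0.
share_before = d[0] / d.sum()
share_after = d_scaled[0] / d_scaled.sum()
```

Before scaling, dimension 1 is the worse match and contributes most of the sum. After scaling, dimension 0 accounts for nearly all of it, even though the shapes have not changed.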
While this can make sense when `normalize=True`, it may not be appropriate to just add them together (but I do understand that we probably do not have any other choice here). Note that if we apply the matrix profile to each dimension individually, we get the correct answer (though issue (1) still exists). However, if we just apply the multi-dimensional matrix profile, we get a strange result because the scales of the time series are not the same, and this affects the result when `normalize==False`.

Maybe it is not an issue(?!), but I still expected to get the correct answer since applying the matrix profile to each time series reveals co-motifs in the first two dimensions. So, maybe we just add a note in the docstring saying that it is better to normalize the WHOLE time series in EACH dimension first before passing it to `mstump(...., normalize=False)`.
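One way the suggested preprocessing could look is sketched below. The helper name `standardize_each_dimension` and the toy data are made up for illustration, and the `stumpy.mstump` call is left as a comment so that the snippet runs with NumPy alone:

```python
import numpy as np

def standardize_each_dimension(T):
    # Hypothetical helper: z-score each WHOLE time series (row) on its own.
    # Assumes no row is constant; a zero standard deviation would divide by zero.
    T = np.asarray(T, dtype=np.float64)
    mu = T.mean(axis=1, keepdims=True)
    sigma = T.std(axis=1, keepdims=True)
    return (T - mu) / sigma

# Hypothetical toy data: 3 dimensions scaled by 1000, 100, and 10.
T = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 1.0, 3.0, 5.0],
    [5.0, 3.0, 1.0, 2.0, 4.0],
])
scale = np.array([1000.0, 100.0, 10.0])[:, np.newaxis]

Z = standardize_each_dimension(T * scale)

# The per-dimension scale factors are removed by the standardization.
same_as_unscaled = np.allclose(Z, standardize_each_dimension(T))

# The result could then be passed on, e.g.:
# mps, indices = stumpy.mstump(Z, m, normalize=False)
```

Note that this standardizes each whole series once, which is not the same as the per-subsequence z-normalization applied when `normalize=True`.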