What is the proper data preprocessing before using FPCA? #369

hovinh · 2021-10-15T07:20:28Z

hovinh
Oct 15, 2021

Hi scikit-fda team,

I have a time-series dataset with 50 traces and approximate 330 data points per trace.

The mean trace looks like this:

I would like to apply FPCA on them and later use PC score for a classification task. There are 2 ways I tried out:

Method 1: apply FPCA directly on the traces in discrete form.

Top 5 components' explained variance ratio are: [0.41167929 0.14467096 0.08427026 0.04618083 0.03430349]
Method 2: use Elastic Registration to remove phase variation in prior to apply FPCA.

Elastic Registration:

FPCA:

Top 5 components' explained variance ratio are: [0.11482679 0.10225417 0.09011544 0.08024505 0.06961226]

Given the above observation, I have the following question:

Is it true if I conclude Elastic Registration is not a good way to preprocess data in prior to feed into FPCA, based on the explained variance ratio? Do you know why it is so?

In one of your example, FPCA can be done on either discrete form or functional form. Therefore, I tried out the same experiment with basis found in issue #367 (BSpline, n_basis=200) and obtained similar result in explained variance ratio, with and without registration, so this seems data form is irrelevant. Do you have any suggestion on proper way to preprocess my data before feeding in FPCA?

Thank you.

Sincerely,
Vinh

vnmabus · 2021-10-15T11:37:34Z

vnmabus
Oct 15, 2021
Maintainer

As this is not an issue of the package, but instead a usage question, I am moving it to Discussions.

Your data is very atypical for Functional Data Analysis. Usually data in FDA is continuous and may take a continuum of values, while your data seems to take only two values (as far as I can see, the jump from one to the other is instantaneous, between measures). My first approach would be to use multivariate tools encoding each signal as a sequence of ones and zeros. Alternatively if you know that the data has always the same number of peaks you can "compress" your data as a vector of positions and durations of the peaks.

If you really like to use functional tools, for example for alignment, I would use the discretized representation or maybe a wavelet basis, such as the haar basis. This basis is not currently implemented, but you can add custom bases easily by just subclassing the Basis class and implementing the _evaluate method (see for example how it is done in Monomial).

1 reply

vnmabus Oct 15, 2021
Maintainer

Sorry, I mixed this question and the previous one. I will answer in a separate answer.

vnmabus · 2021-10-15T11:46:54Z

vnmabus
Oct 15, 2021
Maintainer

Your data has almost all the variation in phase instead of amplitude. It is thus not very surprising that your classification, and PCA, becomes worse after alignment.

What you can do here is to register the data, and then use the obtained warpings in phase space (the warping_ attribute of the registration class) for classification and PCA. Maybe you could try that and post what happens.

3 replies

hovinh Oct 18, 2021
Author

Thank you, I have these follow-up comments:

Correct me if I'm wrong, PCA tries to capture variation as much as possible, i.e. deviation from the mean. Because my data has variation mainly in phase, not amplitude, hence what's left after data registration is actually noise, which is not useful. Therefore you said there are 2 ways to capture phase variation?
- either do FPCA on the non-registered discretized data
- or use the warping_ attribute
May I ask what's the difference between apply FPCA on discretized vs functional format? In which case the latter is more preferable, cause it seems the former can alleviate the basis selection hassle for the user but I wonder any shortcoming/tradeoff?
I will try out the warping_ attribute and let you know.

Cheers,
Vinh

vnmabus Oct 18, 2021
Maintainer

* Correct me if I'm wrong, PCA tries to capture variation as much as possible, i.e. deviation from the mean. Because my data has variation mainly in phase, not amplitude, hence what's left after data registration is actually noise, which is not useful. Therefore you said there are 2 ways to capture phase variation?
  
  * either do FPCA on the non-registered discretized data
  * or use the `warping_` attribute

Yeah, totally.

* May I ask what's the difference between apply FPCA on discretized vs functional format? In which case the latter is more preferable, cause it seems the former can alleviate the basis selection hassle for the user but I wonder any shortcoming/tradeoff?

Well, conversion to a basis is a way to smooth your data, removing noise in the process. It could be also interesting in further analysis. For example, if you use linear regression without any kind of penalization using discretized data, the learned projection function could easily be "overfitted" and completely discontinuous. By using a basis for it, that can't happen, and the results generalize better and are more interpretable.

If you only want to perform PCA, probably it wouldn't matter too much.

* I will try out the `warping_` attribute and let you know.

Thank you! That could be interesting.

hovinh Oct 26, 2021
Author

Hi vnmabus,

I have studied further my use case and learnt that the first 50 time units of my traces are important, hence I cropped them. The problem I deal with is anomaly detection, with the signature pattern is that the trace does not reset completely to zero during the [20-35] unit interval, i.e zero point shift. In the figure below, you can see that there is a magnitude difference in the mentioned interval.

I have fitted BSplines (order = 2), applied Elastic Registration then FPCA (n_components = 5). Gladly, the 5th component does capture this zero point shift signature.

Plotting FPC score 1 (X axis) vs FPC score 5 (Y axis), we can see the anomaly traces (in red) concentrate on a particular area.

I have tried out with warping_ and realized it's functional data too. Did you mean unless exploring the original trace, if what we care about is phase variation, we can study the warping function instead? And if it's promising, we can apply FPCA on top to have FPC scores as features?

Thank you,
Vinh

vnmabus · 2021-10-26T09:28:43Z

vnmabus
Oct 26, 2021
Maintainer

I am glad you made it to work!

Did you mean unless exploring the original trace, if what we care about is phase variation, we can study the warping function instead?

Yeah, that was the idea. But now that you told me more about your problem, it seems that the information you want is in amplitude space, so aligning the data would maybe be better. I now also understand why you did not want to use a binary representation. I think the original images misguided me about the nature of your problem.

The question I have now is if you should not attempt to align the data and pick the point in the middle between the two maxima, as it seems that every point in that interval has all the information you care about.

2 replies

hovinh Oct 28, 2021
Author

Thank you for the comment. Picking the points in the middle works for this initial study, but I do have other traces measuring different aspect of my object with different patterns so I try to make the approach generalizable.

Speaking of which, this reminds me of a particular case whose dataset consists of only straight lines. ElasticRegistration threw out an error. Do you think this is expected behaviour?

Error details:

<ipython-input-78-2e90e2892696> in vanilla_warping(fd_good, fd_bad, sensor)
      7     elastic_registration = ElasticRegistration()
      8     start = time.time()
----> 9     fd_good_registered = elastic_registration.fit_transform(fd_good)
     10     end = time.time()
     11 

/opt/conda/envs/fda_py37/lib/python3.7/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
    842         if y is None:
    843             # fit method of arity 1 (unsupervised transformation)
--> 844             return self.fit(X, **fit_params).transform(X)
    845         else:
    846             # fit method of arity 2 (supervised transformation)

/opt/conda/envs/fda_py37/lib/python3.7/site-packages/skfda/preprocessing/registration/elastic.py in fit(self, X, y)
    680             self.template_ = self.template  # Template already constructed
    681         else:
--> 682             self.template_ = self.template(X)
    683 
    684         # Constructs the SRSF of the template

/opt/conda/envs/fda_py37/lib/python3.7/site-packages/skfda/preprocessing/registration/elastic.py in elastic_mean(fdatagrid, penalty, center, max_iter, tol, initial, grid_dim, **kwargs)
    547     if center:
    548         # Gamma mean in Hilbert Sphere
--> 549         mean_normalized = warping_mean(gammas, **kwargs)
    550 
    551         gamma_mean = FDataGrid(

/opt/conda/envs/fda_py37/lib/python3.7/site-packages/skfda/preprocessing/registration/elastic.py in warping_mean(warping, max_iter, tol, step_size)
    373         np.square(mu, out=mu)[0],
    374         x=eval_points,
--> 375         initial=0,
    376     )
    377 

/opt/conda/envs/fda_py37/lib/python3.7/site-packages/scipy/integrate/_quadrature.py in cumtrapz(y, x, dx, axis, initial)
    294     `cumulative_trapezoid` instead.
    295     """
--> 296     return cumulative_trapezoid(y, x=x, dx=dx, axis=axis, initial=initial)
    297 
    298 

/opt/conda/envs/fda_py37/lib/python3.7/site-packages/scipy/integrate/_quadrature.py in cumulative_trapezoid(y, x, dx, axis, initial)
    360             # reshape to correct shape
    361             shape = [1] * y.ndim
--> 362             shape[axis] = -1
    363             d = d.reshape(shape)
    364         elif len(x.shape) != len(y.shape):

IndexError: list assignment index out of range

vnmabus Oct 28, 2021
Maintainer

That does look like a bug. If you can find a simplified reproducible case where it happens, please open an issue and I will take a look.

ego-thales · 2023-03-30T09:42:19Z

ego-thales
Mar 30, 2023

Just a note: it seems that from your data, you lose the abnormality when aligning, but this might not be the case if you try aligning the derivatives. Indeed, in the decreasing phase, you would have 2 peaks for anomalies, and only one for nominal samples. It might be another lead.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

What is the proper data preprocessing before using FPCA? #369

{{title}}

Replies: 4 comments 6 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

Select a reply

What is the proper data preprocessing before using FPCA? #369

hovinh Oct 15, 2021

Replies: 4 comments · 6 replies

vnmabus Oct 15, 2021 Maintainer

vnmabus Oct 15, 2021 Maintainer

vnmabus Oct 15, 2021 Maintainer

hovinh Oct 18, 2021 Author

vnmabus Oct 18, 2021 Maintainer

hovinh Oct 26, 2021 Author

vnmabus Oct 26, 2021 Maintainer

hovinh Oct 28, 2021 Author

vnmabus Oct 28, 2021 Maintainer

ego-thales Mar 30, 2023

hovinh
Oct 15, 2021

Replies: 4 comments 6 replies

vnmabus
Oct 15, 2021
Maintainer

vnmabus Oct 15, 2021
Maintainer

vnmabus
Oct 15, 2021
Maintainer

hovinh Oct 18, 2021
Author

vnmabus Oct 18, 2021
Maintainer

hovinh Oct 26, 2021
Author

vnmabus
Oct 26, 2021
Maintainer

hovinh Oct 28, 2021
Author

vnmabus Oct 28, 2021
Maintainer

ego-thales
Mar 30, 2023