Artificial information #209
Comments
I am interested in working on this problem; I came across it on the GSoC ideas page. Can someone help me understand the current status of this problem statement now that #234 is in place? What future work is targeted here to extend the project?
Hi @faze-geek (cc @milankl),
We tried possibility 2 from the graph in the issue description below, which is trickier than we thought, because the "somehow" isn't well constrained and so it doesn't generalise as easily. We haven't tried possibility 1 yet, which I hope would be more promising.
Dequantizing works well. Try something like:

```python
import numpy as np
import xarray as xr


def dequantize(array, sig_fig):
    # noise spans +/- half a unit in the last significant digit (10**sig_fig)
    a = 0.5 * 10**sig_fig
    noise = np.random.uniform(-a, a, size=array.shape)
    return array + noise


ds = xr.tutorial.open_dataset("air_temperature")
ds['air'] = dequantize(ds['air'], sig_fig=-1)
```

I'm assuming the data were rounded rather than truncated, but I don't think this will make any difference in practice.
I think we're talking in different directions here. Isn't dequantizing what causes the artificial information in the first place?
I played around with this a little out of curiosity. I think it is possible to expand on the idea of @thodson-usgs a bit, considering how the quantization is done. Assuming the scaling factor for quantization is derived as `scale_factor = (max - min) / (2**n_bits - 1)`, one way forward I could see is to open the file in question as usual, read the scaling factor and the number of bits used to store the quantized data from the file metadata for each affected variable, and then add noise that is maybe one order of magnitude smaller than the scaling factor.
I hope I am not pointing out the obvious here; if so, sorry! I would be curious whether there are any gaping issues in my logic or whether the general approach may be worth pursuing further. Some unstructured experiments and sample code can be found here: https://gist.github.com/JoelJaeschke/5951314b62e112b0ccbe98fe75b8b997
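A minimal sketch of what that could look like with xarray, assuming the packing parameters end up in the variable's `.encoding` after CF decoding; the function name, the factor of 0.1, and the file/variable names in the usage comment are illustrative and not taken from the gist above:

```python
import numpy as np


def add_subscale_noise(ds, var):
    """Add uniform noise one order of magnitude below the linear-packing step
    of an xarray variable, read from its on-disk encoding (if present)."""
    scale_factor = ds[var].encoding.get("scale_factor")
    if scale_factor is None:
        return ds  # variable was not stored with linear packing
    amplitude = 0.1 * abs(scale_factor)
    noise = np.random.uniform(-amplitude, amplitude, size=ds[var].shape)
    ds[var] = ds[var] + noise
    return ds


# hypothetical usage on a packed ERA5 subset:
# ds = xr.open_dataset("era5_subset.nc")  # scale_factor/add_offset land in .encoding
# ds = add_subscale_noise(ds, "u10")
```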
No, it's absolutely possible to use information about the linear packing that was previously applied to set some bound on the precision, instead of assuming that a float32 has full float32 precision. But I think it points to the same problem: you need to use some information from before the decoding of the linear packing happened. That's why I thought, why not directly compute the information content from the quantized uint values? That would give you, say, 14 bits of the uint16 carrying information, so you could chop those last two off if you want to stay in quantised space, or, if you want to decode back to floats, you could translate that to keepbits. In this case
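For the "chop the last bits off in quantised space" part, a minimal numpy sketch (plain truncation of the trailing bit planes; the function name and example values are mine):

```python
import numpy as np


def chop_trailing_bits(packed, keepbits, nbits=16):
    """Zero the trailing (nbits - keepbits) bits of unsigned-integer data,
    i.e. discard the trailing bit planes while staying in quantized space."""
    drop = nbits - keepbits
    mask = np.uint16(((1 << nbits) - 1) ^ ((1 << drop) - 1))  # keepbits=14 -> 0xFFFC
    return packed & mask


packed = np.array([1023, 40000, 65535], dtype=np.uint16)
print(chop_trailing_bits(packed, keepbits=14))  # [ 1020 40000 65532]
```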
What you are suggesting makes sense, and I have played around with this some more, using some ERA5 wind speed data. I don't know if you already have some idea in mind about translating the keepbits from quantized space back to dequantized space, so I thought about the following. The smallest change in dequantized space we can resolve is the scaling factor. To translate the integer keepbits to float keepbits, I thought about the following: assuming the full integer range is used for the quantized storage, the maximum dequantized value is `add_offset + scale_factor * (2**n_bits - 1)`, so one could pick the float keepbits such that the rounding error at that maximum value stays below the scaling factor.
I have not really given any thought to the error bounds; however, one thing that came to mind is that the error will likely vary in dequantized space, due to the inhomogeneous distribution of floating point numbers. Taking the example of wind speed again, values near the surface are much smaller than values aloft, so the spacing of representable floats (and hence the rounding error) differs between them.
Again, I hope I am not stating the obvious here, but do you see issues with this?
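To make the translation concrete, here is one possible sketch under the assumptions above (full integer range used; float keepbits chosen so the rounding error at the largest dequantized value stays below the scale factor). This is my own illustration rather than the exact calculation meant in the comment, and the example packing parameters are made up:

```python
import math


def float_keepbits_from_packing(scale_factor, add_offset, nbits=16):
    """Estimate mantissa keepbits such that bitrounding changes the largest
    dequantized value by no more than the linear-packing step."""
    max_value = add_offset + scale_factor * (2**nbits - 1)
    exponent = math.floor(math.log2(abs(max_value)))
    # mantissa spacing with k kept bits near max_value is roughly 2**(exponent - k),
    # so require 2**(exponent - k) <= scale_factor
    return max(0, math.ceil(exponent - math.log2(scale_factor)))


# made-up packing parameters: step of 0.001, no offset
print(float_keepbits_from_packing(scale_factor=0.001, add_offset=0.0))  # 16
```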
We just discussed the artificial information problem and illustrated it:
So, ideally, one would have an original dataset (blue), analyse the information, bitround it, and obtain a bitrounded dataset (red) such that the information is simply truncated at, e.g., the 99% threshold. However, most climate data does not come in full precision but is already quantized (orange). E.g. all data from ECMWF is quantized, including ERA-5, which is very popular and probably the most downloaded climate dataset out there. Having only quantized data, dequantization often creates artificial information, which complicates the analysis and especially poses users of xbitinfo with the problem of "what is that?", and we would need to explain artificial information and how to circumvent or avoid this problem when assessing the number of keepbits.
There might be several solutions to this problem; illustrated here are two:

1. If the data comes quantized, actually do the information analysis on the quantized integers (and not on floats after dequantization), obtain some keepbits with respect to the integer encoding, and translate them to an error (relative/absolute) that could then be translated to keepbits for floating-point numbers. How well this works we don't know, but it would be awesome to test. A rough sketch of this idea follows after the list.
2. Given an original dataset, we have the true information distribution at hand (purple line), meaning that if we obtain artificial information there might be a way of correcting for it. For example, there might be some simple way of flagging some information as artificial so it can be removed, and the keepbits could then be obtained from there. Maybe that's a good supervised learning problem: find a function that maps real + artificial information (black) to real information only (purple). This would be another way of addressing the problem.
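As a crude illustration of option 1 (analysing information directly on the quantized integers), here is a numpy sketch that estimates, per bit plane, the mutual information between adjacent values. This is only a simplified stand-in for the bitwise real-information measure xbitinfo computes, and the synthetic data and function name are illustrative:

```python
import numpy as np


def bitplane_mutual_information(packed, nbits=16):
    """Mutual information (bits) between each bit plane of adjacent values, MSB first.
    Bit planes whose bits cannot be predicted from the neighbouring value carry
    no real information and are candidates for chopping/rounding."""
    a = (packed[:-1, None] >> np.arange(nbits)) & 1  # bit planes of value i
    b = (packed[1:, None] >> np.arange(nbits)) & 1   # bit planes of value i+1
    mi = np.empty(nbits)
    for k in range(nbits):
        idx = 2 * a[:, k] + b[:, k]                  # joint state 0..3
        joint = np.bincount(idx, minlength=4).reshape(2, 2) / idx.size
        px, py = joint.sum(axis=1), joint.sum(axis=0)
        with np.errstate(divide="ignore", invalid="ignore"):
            terms = joint * np.log2(joint / np.outer(px, py))
        mi[k] = np.nan_to_num(terms).sum()
    return mi[::-1]  # reorder from LSB-first to MSB-first


# smooth synthetic series standing in for a quantized (uint16) variable
x = np.linspace(0, 20, 5000)
rng = np.random.default_rng(0)
packed = (30000 + 2000 * np.sin(x) + rng.normal(0, 5, x.size)).astype(np.uint16)
print(bitplane_mutual_information(packed).round(3))
```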