Add specific thresholds for detecting quality errors in vitamins and other components [quality data] #11083
Comments
We should exclude the supplements category for this. I'd rather start with a low value and increase it if needed. I assume this will be added to the vitamins/minerals taxonomy.
This is very interesting! I have looked into this a little, and here are my first comments.
1. Raw products vs the rest
The thresholds you have mentioned work for raw products. But the majority of food products are processed, and many have added vitamins or micronutrients at higher levels than your thresholds. I don't know if we can identify all the raw products at once: is there a specific category for them?
2. The quest to clear outliers
If the product is not itself a vitamin or micronutrient product, it should never contain more than XXg of a vitamin or a micronutrient. I would even start with a very high value, such as 20 g / 100 g. As of 2024-12-13, a query on Mirabelle counts 542 products like this that don't already have a data quality error.
3. Level of "errorness"
Some data quality errors have a bigger impact than others. A wrong value for fats can lead to a wrong Nutri-Score or some other wrong nutritional evaluation. Issues with vitamins or micronutrients seem less important IMHO, but there might be cases where they matter: any opinion? Should we create a new level of "minor data quality errors"?
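The coarse outlier check from point 2 could be sketched roughly as follows. This is only an illustration, not the project's actual implementation: the product layout (`categories_tags`, `vitamins_g_per_100g`), the supplement category tag, and the function name are all assumptions made for the example. The 20 g / 100 g limit is the starting value suggested above.

```python
# Starting limit suggested in the discussion: no non-supplement product
# should contain more than this amount of a single vitamin or micronutrient.
COARSE_LIMIT_G_PER_100G = 20.0

# Hypothetical category tag used to skip supplements (assumption).
EXCLUDED_CATEGORIES = {"en:dietary-supplements"}


def find_outliers(product):
    """Return the names of vitamins/micronutrients above the coarse limit.

    `product` is a plain dict with two assumed keys:
      - "categories_tags": list of category tags
      - "vitamins_g_per_100g": mapping of nutrient name -> grams per 100 g
    """
    categories = set(product.get("categories_tags", []))
    if categories & EXCLUDED_CATEGORIES:
        # Supplements legitimately contain large amounts; skip them.
        return []
    values = product.get("vitamins_g_per_100g", {})
    return [name for name, value in values.items()
            if value > COARSE_LIMIT_G_PER_100G]
```

Starting from a limit this high keeps false positives close to zero; the limit can then be lowered step by step as flagged products get reviewed.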
Do you have raw food examples? I agree it is a minor error. Not sure about an error classification. More an issue for the daily mail: what to include and what not.
I suggest starting with a threshold above which we are certain the values are errors. Then we can lower the threshold to the point where we start to get false positives. Some false positives (how many?) are acceptable as long as we can detect errors. What do we do with those: ignore the generated error in some way?
Problem
The current threshold for triggering a quality error for a vitamin's amount is 105g per 100g. I think it is not precise enough and we can do much better!
In the screenshot below, only 5 values raise an error, although all of them should, since the unit is wrong. Here the problem comes from the units, but it could just as well come from a typo, as is often the case with nutrition facts errors.
Proposed solution
I suggest we establish a specific threshold for each vitamin so we can detect new quality errors. I've extracted the max values for each vitamin so we can set a threshold value above the max known value.
Retrievable data I've extracted from CIQUAL and USDA (the proposed threshold here is 4 times greater than the maximum known value): nutrient max values.ods
Expected outcome
Many new errors would be raised but we would be able to fix the products' data more easily and then improve the overall quality of the database :)
Note: I've arbitrarily chosen a factor of 4 for the example but we should discuss together which value would be the best.
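To make the proposal concrete, here is a minimal sketch of a per-vitamin threshold computed as the maximum known value multiplied by a factor. The maximum values below are made-up placeholders, not the figures from the CIQUAL/USDA extraction, and the names (`MAX_KNOWN_G_PER_100G`, `threshold_for`, `exceeds_threshold`) are hypothetical.

```python
# Placeholder maximum known values in grams per 100 g.
# These numbers are illustrative only; the real ones would come
# from the CIQUAL/USDA extraction attached to this issue.
MAX_KNOWN_G_PER_100G = {
    "vitamin-c": 2.0,
    "vitamin-d": 0.0003,
}

# Factor chosen arbitrarily for the example; still to be discussed.
SAFETY_FACTOR = 4


def threshold_for(nutrient, factor=SAFETY_FACTOR):
    """Threshold = factor x maximum known value for this nutrient."""
    return MAX_KNOWN_G_PER_100G[nutrient] * factor


def exceeds_threshold(nutrient, value_g_per_100g, factor=SAFETY_FACTOR):
    """True if the declared value is above the nutrient's threshold."""
    return value_g_per_100g > threshold_for(nutrient, factor)


# A unit mix-up (mg typed as g) is caught well below the generic 105 g limit:
# 0.3 g of vitamin D per 100 g is far above 4 x 0.0003 g.
print(exceeds_threshold("vitamin-d", 0.3))
```

Keeping the factor as a parameter makes it easy to try several values against the database and count how many products each one would flag before settling on one.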