YODA2 writer #53
Dear @GraemeWatt - here's my attempt at tweaking the YODA converter to support the upcoming YODA2 release. I'm still in the process of validating it locally: my plan was to download the YAML inputs for the Inspire IDs in the usual way. I noticed occasional crashes where the YAML parsing of the input downloaded from HepData fails, e.g. similar to this:
Is this expected / should I just skip these? Thanks!
@20DM, thanks for making a start on this! The problem with the YAML parsing is that the validation has become stricter over time, so some older records can fail validation against the most recent schema. The workaround is to validate against the original schema using an option.
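The actual option name is elided above, but the fallback idea can be sketched generically: try the most recent schema first and fall back to the record's original schema version. All names below are illustrative stand-ins, not the converter's real API.

```python
def validate_with_fallback(doc, validators):
    """Try validators from newest schema to oldest; return the schema
    version that accepted the document, else re-raise the last error.

    `validators` is a list of (version, callable) pairs whose callables
    raise ValueError on failure. Purely illustrative names.
    """
    last_err = None
    for version, validate in validators:
        try:
            validate(doc)
            return version
        except ValueError as err:
            last_err = err
    raise last_err


def strict(doc):
    # Stand-in for validation against the most recent schema.
    if "legacy_field" in doc:
        raise ValueError("unknown key: legacy_field")


def lenient(doc):
    # Stand-in for validation against the record's original schema.
    pass
```

With these stubs, a record that only satisfies the old schema is accepted on the second attempt rather than crashing the conversion outright.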
Thanks @GraemeWatt! That fixed a lot of the parsing errors. There are still O(10) of the following type left:
Are they also "fixable" with an option or shall I skip them?
That particular case came up in a previous discussion. I just fixed the corresponding table by directly editing the YAML data file stored on disk to remove the extra … We don't have other options to skip errors, but in most cases a YODA file will still be produced, including in the case you mentioned (with an extra …). It would be helpful if you could list the O(10) records that still give an error, together with the corresponding error messages. These are likely caused by incorrectly formatted records, so the fix is either to update the records or to modify the converter to tolerate the incorrect formatting. Some errors might already be logged as open issues in this repo.
Adding some more printout, I realised that all 9 messages came from the same entry, which was indeed the STAR measurement (793126) that you've just fixed. I've synced the YAML and the messages have disappeared! Apologies for the noise and thanks for fixing!
Hi @GraemeWatt - here's another odd one I've come across, cf. the second independent column of Table 18: https://www.hepdata.net/record/56218. I could try editing the array writer to give preference to the low/high edges if they exist and there's a type mismatch with the value, but perhaps fixing the YAML directly is the preferred approach here. Any preference?
PS - this sort of patch is what I had in mind:
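The patch itself is lost from the thread, but the idea described above can be sketched: prefer the explicit low/high edges when the 'value' field is not numeric. This is my reconstruction of the idea, not the actual diff, and the field names mirror HEPData's YAML layout.

```python
def resolve_limits(entry):
    """Return (low, high) for one bin entry of an independent variable.

    If 'value' is not numeric (e.g. a stray string) but explicit
    low/high edges exist, prefer the edges - a sketch of the patch
    idea discussed above, not the converter's actual code.
    """
    try:
        value = float(entry.get("value"))
    except (TypeError, ValueError):
        value = None  # type mismatch: 'value' is not numeric
    has_edges = "low" in entry and "high" in entry
    if value is None and has_edges:
        return entry["low"], entry["high"]
    if value is not None:
        return value, value
    raise ValueError("entry has neither a numeric value nor low/high edges")
```

A point-like entry still falls back to `(value, value)`, so well-formed records are unaffected.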
That's a strange one. The combination of a string value with numeric low/high edges …
Hi @GraemeWatt - ah, I hadn't even clocked that the uncertainty intervals don't match. Even worse - but good that you found Mike's original submission! Looking at the paper now, I would argue for the second option: the …
I agree it's better as a dependent variable, so I've reformatted the table. It's an open issue to allow two dependent variables to be plotted against each other, as seems to be common in heavy-ion publications. I spotted another mistake in the qualifier …
Thanks @GraemeWatt - just to confirm that this fixes the conversion on my side as well. 👍
Hi @GraemeWatt - I found one more like this, cf. Table 11 here. I believe this is the last one with a pseudo-independent variable.
That record was encoded by me in 2013, with data from Table 11 of the record taken from Table 1 of the publication. The first row of the original input file has …
Hi Graeme, apologies for the delay on here. It took me longer than expected, but I've got Rivet up and running with YODA2 now, and I'm currently in the process of trying to get the full suite of routines to run with the reference data coming from this draft converter. It seems to me that we can get rid of a lot of the custom Python patches that we apply as part of the HepData sync (which was the aim of the game, of course), but there are a few entries that need a bit of thought.

One example is this record from AMY, where nearly all the tables have a binned differential cross-section with an additional overlapping bin at the end for the average. The averages should probably have gone into separate tables rather than being superimposed onto the differential versions. In fact, we currently use a Python patch to remove this last bin, since we don't need/want a histogram bin corresponding to the average; that's of course a histogram-level quantity that would normally be calculated at the very end of an analysis run. Could we just split the averages off into a dedicated table?
I agree that the averages, for example the last bin of this HEPData table, would be better given in a separate HEPData table. The encoding just reflects the presentation of the journal publication (Table 5). It was not foreseen in 1990 how the data would be used in 2023. If there are only a small number of affected HEPData records relevant for Rivet, I can upload revised versions with the averages moved to separate tables (or I can give you Uploader permissions). Otherwise, it is probably possible to write code to automatically remove the inclusive bins in the YAML to YODA conversion. For example, if there are more than two bins, find the low and high values of all bins; then, if the low and high values of a single bin match the low and high values of all bins, that single bin should be removed.
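The detection rule suggested above can be sketched directly; this is a minimal illustration of the described heuristic, not the converter's actual implementation:

```python
def drop_inclusive_bins(bins):
    """Remove a bin whose [low, high] range spans all other bins.

    `bins` is a list of (low, high) tuples. Following the rule in the
    thread: only applied when there are more than two bins, and only a
    bin matching both the overall low and the overall high is dropped.
    """
    if len(bins) <= 2:
        return bins
    lo = min(b[0] for b in bins)
    hi = max(b[1] for b in bins)
    kept = [b for b in bins if not (b[0] == lo and b[1] == hi)]
    # Only act if exactly one bin (the inclusive/average one) was removed.
    return kept if len(kept) == len(bins) - 1 else bins
```

For an AMY-style table of three differential bins plus an overall-average bin, `drop_inclusive_bins([(0, 1), (1, 2), (2, 3), (0, 3)])` would keep just the three differential bins.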
Hi Graeme - just to keep you updated: I'm on the final stretch in terms of converting the reference data files from Y1 to Y2, still ironing out a few more edge cases with small tweaks to YODA and/or the converter. You may be pleased to hear that the vast majority of our custom Python patches in Rivet become obsolete thanks to the new types, bringing the count down by an order of magnitude, from O(150) to O(15) currently. 😅 Some things to flag up:
There are a handful more that I'm looking into.
Sounds like great progress! Brief response to your comments:
Another weird one that I wanted to point out: the lower bin edges in T12 of TASSO-266893 seem to be messed up. I think they may have unintentionally ended up copying the lower bin edges from T6, while the upper edges are correct - the plot makes little sense to me otherwise. If I keep the leading 0.0 as the overall lowest bin edge and then only accept the upper bin edges, I get a reasonable binning for this distribution:
I almost didn't notice this one, because this is how the HepData-to-YODA converter constructs the edges in practice, and hence gets it right, but we might wanna fix the record anyway. (The paper is behind a paywall unfortunately, so I can't easily double-check.)
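The edge-reconstruction approach described above is easy to sketch; the numbers in the comments are illustrative, not the actual TASSO values:

```python
def rebuild_edges(lows, highs):
    """Keep only the very first lower edge and otherwise trust the
    upper edges - the approach described above for the suspect table."""
    edges = [lows[0]]
    for high in highs:
        edges.append(high)
    return edges


def suspect_lows(lows, highs):
    """Indices where the recorded lower edge disagrees with the
    previous bin's upper edge - a cheap way to spot mangled edges."""
    return [i for i in range(1, len(lows)) if lows[i] != highs[i - 1]]
```

For example, `rebuild_edges([0.0, 0.9, 1.2], [0.9, 1.2, 1.5])` gives the contiguous edges `[0.0, 0.9, 1.2, 1.5]`, while `suspect_lows` flags any bin whose stated lower edge does not match its neighbour.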
Just to add that with the latest logic I pushed, I don't see any issues with the various records that include the average bins along with the main distribution. The logic is a little slower CPU-wise, as it checks previous indices/edges with oven gloves, but it seems to work just fine as far as I can tell. Let me know if you have any suggestions for improvements, though.

I also introduced two new options for (de-)selecting specific table identifiers. This is useful for a few records that contain huge covariance matrices that would otherwise take a long time to render, but wouldn't be used in practice. They are still included by default, but if we know in advance that we don't want to ship them with every Rivet release, we can now disable their conversion/inclusion via the analysis info file and save ourselves some time & disk space when doing the HepData syncing in the future.
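The (de-)selection options aren't spelled out in the thread, but glob-style matching on table identifiers is one natural shape for them. The function below is a sketch of that idea with made-up parameter names, not the converter's actual interface; as in the thread, tables are kept by default.

```python
import fnmatch


def table_selected(name, include=None, exclude=None):
    """Glob-based (de-)selection of table identifiers (illustrative).

    A table is kept unless an include list exists that it misses,
    or an exclude pattern matches it.
    """
    if include is not None and not any(
            fnmatch.fnmatch(name, pat) for pat in include):
        return False
    if exclude is not None and any(
            fnmatch.fnmatch(name, pat) for pat in exclude):
        return False
    return True
```

With `exclude=["Covariance*"]`, the huge covariance-matrix tables would be skipped while everything else converts as before.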
Regarding Table 12 of ins266893, the mistake looks to be in the tabulated data of the original publication: |
Could we please remove the hyperlinks from the error labels in CDF-552797? I can try and escape the extra quotes in the converter, but I don't think the error labels should allow hyperlinks to be submitted in the first place... This feels like a security issue - I'm assuming the uploader has some sanity checks to prevent people from uploading dodgy strings with code in them?
I've shortened the error label from …
Hi Graeme - the error breakdowns in the three tables of ins1294140 appear to have been merged with the central value...? Can we fix this one?
Hi Graeme, just to update you on the status: I managed to work around the remaining issues with Python patches (will continue with the syncing after the releases are out) and got a Rivet branch with all the reference data converted to the new YODA2 formats using the HepData converter from this pull request. I copied the original converter into a "YODA1" converter, and for the time being we'll provide a YODA1-style writer in the new YODA as well. This legacy writer is marked as deprecated, and the plan would be to remove it eventually (e.g. in a few years) once users have had a chance to migrate their setups to the new Rivet+YODA versions.

There's a couple of things I still wanna look into, but I reckon we'll be ready to make an alpha release for YODA2 in the coming days - of course, it would be great if we could make sure that everything is working on the HepData side once that alpha is out / before the actual production release. I'll undraft this pull request now -- feel free to review whenever is good for you. Cheers,
Thanks Chris! I'll look into this soon. Just one comment for now: could you please update the testsuite to work with the renamed …
Tests re-enabled now!
* Upgrade actions in ci.yml and use Python 3.10 (not 3.8).
* Use hepdata/hepdata-converter:latest Docker image (until new tag).
* Python requirements now need to be installed as not in Docker image.
* Renamed 'master' branch to 'main' as with other HEPData repositories.
* Update 'language' and 'intersphinx_mapping' for newer Sphinx versions.
* Replace virtualenvwrapper commands by standard Python 'venv' module.
* Update Read the Docs configuration file '.readthedocs.yml'.
* Also add 'yoda1' to the output formats in 'convert' docstring.
Thanks for adding the tests! I prepared a Docker image with YODA v2.0.0alpha installed. If you want to upgrade the YODA version in future, you could submit a PR updating the …

Could you please also clarify the timescale for when you want the new …
Hi Graeme, thanks for the feedback! I'll look into additional test coverage in the next few days. I agree regarding the download options. I think realistically we'll need the option to download both the new and the old version, since it will take some time for people to migrate, and ideally we don't want to hinder them from submitting routines in the old format for the time being. Would you have an estimate for how long introducing the additional download option might take? For the proper YODA 2.0.0 release, we were thinking that we should probably wait for HepData to support the new format, as otherwise there isn't too much motivation for people to make the switch.
Added a few more unit tests to cover the pattern (un-)matching, as well as an example that includes a 0-dimensional object (i.e. one that has a dependent axis but no independent axis, such as an integrated cross-section). Should hopefully increase the coverage to an acceptable level.
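For reference, a 0-dimensional table of this kind looks roughly like the following in HEPData's YAML data-file format (the numbers and labels here are illustrative, not taken from a real record):

```yaml
# A single integrated cross-section: no independent variables at all.
independent_variables: []
dependent_variables:
- header: {name: 'SIG(integrated)', units: 'PB'}
  values:
  - value: 51.2          # illustrative central value
    errors:
    - {symerror: 1.9, label: 'stat'}
```

The converter then has to emit a YODA object with a dependent axis only, which is exactly the edge case the new unit test exercises.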
* Add "html_title" to avoid warning when building epub format: WARNING: conf value "epub_title" (or "html_title") should not be empty for EPUB3
* Explain how to build docs locally using Sphinx in installation.rst.
* Add "SPHINXOPTS = -W --keep-going" to Makefile for Sphinx builds.
New YODA writers to work with the upcoming YODA2 release.

A new YODA1 writer is introduced, intended as a temporary legacy writer that will convert data tables to YODA1-style scatters (but without the error breakdown). The current YODA writer is modified to convert the data tables to the new BinnedEstimateND objects. Discrete integer and string axes are now also supported.