Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make std::filesystem::path conversion to/from UTF-8 encoded string explicit #4631

Open
wants to merge 4 commits into
base: develop
Choose a base branch
from

Conversation

risa2000
Copy link

This PR makes the default conversions of std::filesystem::path from/to JSON UTF-8 encoded string explicit.
This is a follow up on #4271.

Original implementation relied on the platform locale to be UTF-8 and used std::string as std::filesystem::path conversion source/target. This did not work on Windows (where the implicit locale is using "multibyte" encoding) and may potentially trigger a similar issue on platforms which do not have the locale set to UTF-8.

Proposed change uses explicit conversion functions from std library to UTF-8 encoded string. Unfortunately, the UTF-8 handling of std library changed from C++17 to C++20 so the code uses different codepaths for corresponding C++ standards.

This code has been inspired by:
https://github.com/alf-p-steinbach/C---how-to---make-non-English-text-work-in-Windows/blob/main/microlibs/cppm/stdlib_workarounds/fs_path.hpp#L19

@coveralls
Copy link

coveralls commented Jan 26, 2025

Coverage Status

coverage: 99.186%. remained the same
when pulling 5f12f8c on risa2000:patch-fspath-conversions
into 606b634 on nlohmann:develop.

@risa2000 risa2000 force-pushed the patch-fspath-conversions branch from 61a26dd to a395539 Compare January 26, 2025 19:41
@github-actions github-actions bot added L and removed M labels Jan 26, 2025
@risa2000 risa2000 force-pushed the patch-fspath-conversions branch from a395539 to 008c50b Compare January 27, 2025 06:16
@risa2000
Copy link
Author

I am bit confused about how make amalgamate works. It seems to arbitrary change formatting on some files, even when I do only a small change (e.g. unit-conversions.cpp) and sometimes it looks like those changes are different each time I run it?

I could not run it on Windows (seems like Makefile is not compatible with nmake) so I run it on macOS.

@nlohmann
Copy link
Owner

I am bit confused about how make amalgamate works. It seems to arbitrary change formatting on some files, even when I do only a small change (e.g. unit-conversions.cpp) and sometimes it looks like those changes are different each time I run it?

asytle and the amalgamation is deterministic. There should not be different results after subsequent calls.

I could not run it on Windows (seems like Makefile is not compatible with nmake) so I run it on macOS.

It's based on Python. The calls the Makefile makes are:

python3 -mvenv tools/astyle/venv
tools/astyle/venv/bin/pip3 install --quiet --upgrade pip
tools/astyle/venv/bin/pip3 install --quiet -r tools/astyle/requirements.txt
tools/amalgamate/amalgamate.py -c tools/amalgamate/config_json_fwd.json -s . --verbose=yes
tools/astyle/venv/bin/astyle --project=tools/astyle/.astylerc include/nlohmann/adl_serializer.hpp include/nlohmann/byte_container_with_subtype.hpp include/nlohmann/detail/abi_macros.hpp include/nlohmann/detail/conversions/from_json.hpp include/nlohmann/detail/conversions/to_chars.hpp include/nlohmann/detail/conversions/to_json.hpp include/nlohmann/detail/exceptions.hpp include/nlohmann/detail/hash.hpp include/nlohmann/detail/input/binary_reader.hpp include/nlohmann/detail/input/input_adapters.hpp include/nlohmann/detail/input/json_sax.hpp include/nlohmann/detail/input/lexer.hpp include/nlohmann/detail/input/parser.hpp include/nlohmann/detail/input/position_t.hpp include/nlohmann/detail/iterators/internal_iterator.hpp include/nlohmann/detail/iterators/iter_impl.hpp include/nlohmann/detail/iterators/iteration_proxy.hpp include/nlohmann/detail/iterators/iterator_traits.hpp include/nlohmann/detail/iterators/json_reverse_iterator.hpp include/nlohmann/detail/iterators/primitive_iterator.hpp include/nlohmann/detail/json_custom_base_class.hpp include/nlohmann/detail/json_pointer.hpp include/nlohmann/detail/json_ref.hpp include/nlohmann/detail/macro_scope.hpp include/nlohmann/detail/macro_unscope.hpp include/nlohmann/detail/meta/call_std/begin.hpp include/nlohmann/detail/meta/call_std/end.hpp include/nlohmann/detail/meta/cpp_future.hpp include/nlohmann/detail/meta/detected.hpp include/nlohmann/detail/meta/identity_tag.hpp include/nlohmann/detail/meta/is_sax.hpp include/nlohmann/detail/meta/std_fs.hpp include/nlohmann/detail/meta/type_traits.hpp include/nlohmann/detail/meta/void_t.hpp include/nlohmann/detail/output/binary_writer.hpp include/nlohmann/detail/output/output_adapters.hpp include/nlohmann/detail/output/serializer.hpp include/nlohmann/detail/string_concat.hpp include/nlohmann/detail/string_escape.hpp include/nlohmann/detail/string_utils.hpp include/nlohmann/detail/value_t.hpp include/nlohmann/json.hpp include/nlohmann/json_fwd.hpp include/nlohmann/ordered_map.hpp include/nlohmann/thirdparty/hedley/hedley.hpp include/nlohmann/thirdparty/hedley/hedley_undef.hpp tests/abi/config/config.hpp tests/abi/config/custom.cpp tests/abi/config/default.cpp tests/abi/config/noversion.cpp tests/abi/diag/diag.cpp tests/abi/diag/diag.hpp tests/abi/diag/diag_off.cpp tests/abi/diag/diag_on.cpp tests/abi/inline_ns/use_current.cpp tests/abi/inline_ns/use_v3_10_5.cpp tests/abi/main.cpp tests/benchmarks/src/benchmarks.cpp tests/cmake_add_subdirectory/project/main.cpp tests/cmake_fetch_content/project/main.cpp tests/cmake_fetch_content2/project/main.cpp tests/cmake_import/project/main.cpp tests/cmake_import_minver/project/main.cpp tests/cmake_target_include_directories/project/Bar.cpp tests/cmake_target_include_directories/project/Bar.hpp tests/cmake_target_include_directories/project/Foo.cpp tests/cmake_target_include_directories/project/Foo.hpp tests/cmake_target_include_directories/project/main.cpp tests/cuda_example/json_cuda.cu tests/src/fuzzer-driver_afl.cpp tests/src/fuzzer-parse_bjdata.cpp tests/src/fuzzer-parse_bson.cpp tests/src/fuzzer-parse_cbor.cpp tests/src/fuzzer-parse_json.cpp tests/src/fuzzer-parse_msgpack.cpp tests/src/fuzzer-parse_ubjson.cpp tests/src/make_test_data_available.hpp tests/src/test_utils.hpp tests/src/unit-32bit.cpp tests/src/unit-algorithms.cpp tests/src/unit-allocator.cpp tests/src/unit-alt-string.cpp tests/src/unit-assert_macro.cpp tests/src/unit-binary_formats.cpp tests/src/unit-bjdata.cpp tests/src/unit-bson.cpp tests/src/unit-byte_container_with_subtype.cpp tests/src/unit-capacity.cpp tests/src/unit-cbor.cpp tests/src/unit-class_const_iterator.cpp tests/src/unit-class_iterator.cpp tests/src/unit-class_lexer.cpp tests/src/unit-class_parser.cpp tests/src/unit-class_parser_diagnostic_positions.cpp tests/src/unit-comparison.cpp tests/src/unit-concepts.cpp tests/src/unit-constructor1.cpp tests/src/unit-constructor2.cpp tests/src/unit-convenience.cpp tests/src/unit-conversions.cpp tests/src/unit-custom-base-class.cpp tests/src/unit-deserialization.cpp tests/src/unit-diagnostic-positions-only.cpp tests/src/unit-diagnostic-positions.cpp tests/src/unit-diagnostics.cpp tests/src/unit-disabled_exceptions.cpp tests/src/unit-element_access1.cpp tests/src/unit-element_access2.cpp tests/src/unit-hash.cpp tests/src/unit-inspection.cpp tests/src/unit-items.cpp tests/src/unit-iterators1.cpp tests/src/unit-iterators2.cpp tests/src/unit-iterators3.cpp tests/src/unit-json_patch.cpp tests/src/unit-json_pointer.cpp tests/src/unit-large_json.cpp tests/src/unit-locale-cpp.cpp tests/src/unit-merge_patch.cpp tests/src/unit-meta.cpp tests/src/unit-modifiers.cpp tests/src/unit-msgpack.cpp tests/src/unit-no-mem-leak-on-adl-serialize.cpp tests/src/unit-noexcept.cpp tests/src/unit-ordered_json.cpp tests/src/unit-ordered_map.cpp tests/src/unit-pointer_access.cpp tests/src/unit-readme.cpp tests/src/unit-reference_access.cpp tests/src/unit-regression1.cpp tests/src/unit-regression2.cpp tests/src/unit-serialization.cpp tests/src/unit-testsuites.cpp tests/src/unit-to_chars.cpp tests/src/unit-type_traits.cpp tests/src/unit-ubjson.cpp tests/src/unit-udl.cpp tests/src/unit-udt.cpp tests/src/unit-udt_macro.cpp tests/src/unit-unicode1.cpp tests/src/unit-unicode2.cpp tests/src/unit-unicode3.cpp tests/src/unit-unicode4.cpp tests/src/unit-unicode5.cpp tests/src/unit-user_defined_input.cpp tests/src/unit-windows_h.cpp tests/src/unit-wstring.cpp tests/src/unit.cpp single_include/nlohmann/json.hpp single_include/nlohmann/json_fwd.hpp docs/mkdocs/docs/examples/*.cpp

@risa2000
Copy link
Author

After I pushed my PR, I noticed here when checking the diff that in my PR the amalgamation changed unit-conversions.cpp format in a way it obscured the change. So I run make amalgamate again and it seemed to change the formatting again, but still making changes on many more places. So I am giving up - I do not know how to keep the original format while running make amalgamate. Probably formatting changed meanwhile.

@risa2000
Copy link
Author

Just for the record, I do not know how to fix the remaining failing tests. It looks like the problems are not related to the changes I made, plus I cannot reproduce any of them on my setup (Windows & macOS).

@nlohmann
Copy link
Owner

@risa2000 risa2000 force-pushed the patch-fspath-conversions branch from 008c50b to 144f340 Compare January 27, 2025 22:47
@github-actions github-actions bot added M and removed L labels Jan 27, 2025
@risa2000
Copy link
Author

I believe I figured out why I had those unpredictable changes in unit-conversions.cpp. It was not amalgamate but VSCode, which took the liberty to reformat the file on save! amalgamate just then only run on already reformatted text. I pushed a reverted version of unit-conversions.cpp to see if it fixes the clang-tidy errors.
For the failing tests I have an idea, but I will need to push some experimental code to see the effects. While the test throws at testing regression for #3070 I suspect it is not a real reason.

@nlohmann
Copy link
Owner

I believe I figured out why I had those unpredictable changes in unit-conversions.cpp. It was not amalgamate but VSCode, which took the liberty to reformat the file on save! amalgamate just then only run on already reformatted text.

😄 Good to know the reason for that behavior!

I pushed a reverted version of unit-conversions.cpp to see if it fixes the clang-tidy errors. For the failing tests I have an idea, but I will need to push some experimental code to see the effects. While the test throws at testing regression for #3070 I suspect it is not a real reason.

Great!

@risa2000 risa2000 force-pushed the patch-fspath-conversions branch from fa03950 to 18ed74e Compare January 28, 2025 23:40
@risa2000 risa2000 force-pushed the patch-fspath-conversions branch from 18ed74e to 7974976 Compare January 28, 2025 23:41
@risa2000
Copy link
Author

risa2000 commented Jan 29, 2025

I figured out the problem with my original patch, but it will need somewhat deeper analysis, so bear with me.

The failing tests for the original patch were all related to unit-regression.cpp in C++20 incarnation. What happened I will demonstrate on the test 25_ci_cmake_options (ci_test_noglobaludls), which runs on Ubuntu with GNU 9.4.0 compiler and fails running test-regression2_cpp20.

Since gcc-9.4.0 does not support c++20 directly (https://gcc.gnu.org/onlinedocs/gcc-9.4.0/cpp/Standard-Predefined-Macros.html#Standard-Predefined-Macros), I guess cmake, when instructed to compile C++20 variant instructed the compiler with -std=gnu++2a as the best guess.
Compiling with this option defines __cplusplus=201709L (https://godbolt.org/z/5fTcazYrW)
This value is not an official std value (201703L defines c++17 and 202002L defines c++20), but according to gcc-9.4.0 mentioned earlier defines experimental features of the future standard.

The problem is/was that __cplusplus=201709L gets decoded by macro_scope.hpp to JSON_HAS_CPP_17, but not required JSON_HAS_CPP_20, which was intended by the test. But using -std=gnu++2a enabled the "experimental" future features, including new UTF-8 handling in std::filesystem::path (released in C++20). This led to discrepancy between what compiler enforced and what the library enforced.

In particular the use of function std::filesystem::path::u8string() which returns std::string in c++17 and std::u8string in c++20 led to compiling the corresponding to_json function using c++17 code path, using std::filesystem::path::u8string() to initialize BasicJsonType which failed, because the library decoded the returned std::u8string as an "array".

In order to mitigate those corner cases, I changed the logic in macro_scope.hpp in a way it is more inline with intended behavior. The original logic was using "conservative" approach - it is not C++X, until __cplusplus confirms it. I.e.

#if (defined(__cplusplus) && __cplusplus >= 202002L) || (defined(_MSVC_LANG) && _MSVC_LANG >= 202002L)
    #define JSON_HAS_CPP_20
...

to: it is C++X, once __cplusplus is "over" C++(X-1), i.e.

#if (defined(__cplusplus) && __cplusplus > 201703L) || (defined(_MSVC_LANG) && _MSVC_LANG > 201703L)
     #define JSON_HAS_CPP_20
...

This fixed all remaining failing tests related to test-regression2_cpp20, but broke some other tests.

Now, the first response might be - probably this is not a right approach - and we should go back to the original logic. But I would argue that the original logic was wrong because it led to misleading tests.

In the original logic test-regression2_cpp20 was intentionally compiled with c++20 features forced by the compiler, but the library was fully in c++17 mode because JSON_HAS_CPP_20 was not defined. So it basically run c++17 test again.

The other option about how to go around it is fixing now broken tests since they are probably a victim of the same limit conditions but now from the other side. I.e. they are now asked to run the actual features, which the "experimental" implementation might have not (yet) implemented.

The philosophical question is what is the point of testing those incomplete C++ implementations anyway. In other words what should be the expected result from running c++20 version of the test on gcc-9.4.0, which does not officially support c++20, but may support it partially (or fully) as experimental? It may succeed, or fail, but it should be probably fine either way.

It seems that until now these tests, which were borderline two C++ standard, always succeeded, because they were in fact the same tests as the ones for the previous C++ standard.

@nlohmann Any thought on this?

@gregmarr
Copy link
Contributor

https://github.com/nlohmann/json/blob/develop/tests/CMakeLists.txt#L38-L40
https://github.com/nlohmann/json/blob/develop/tests/CMakeLists.txt#L78-L83

The cmake tests currently force gcc 8 to not be c++20. That could also be done for gcc 9. Although I'm not sure that fixing this just for the tests is a good idea.

@risa2000
Copy link
Author

https://github.com/nlohmann/json/blob/develop/tests/CMakeLists.txt#L38-L40 https://github.com/nlohmann/json/blob/develop/tests/CMakeLists.txt#L78-L83

The cmake tests currently force gcc 8 to not be c++20. That could also be done for gcc 9. Although I'm not sure that fixing this just for the tests is a good idea.

gcc 9 is no longer a problem, it seems that transitioning from c++17 to "experimental" c++20 worked well with the new CPP STD detection logic. My guess would be that at that time the compiler writers got better at that.

Now the failing tests are e.g. g++-5 when trying to compile test-conversions_cpp17 (with -std=gnu++1z, probably as the best shot at forcing c++17 on this compiler) and failing at finding #include <optional>. So this now works the other way, cmake and the compiler doing their best at compiling for c++17, it just happens that <optional> did not make it into this compiler version/c++ standard experimental support.

Similar the other failure with clang-3.4 trying to target "experimental" c++14 with -std=gnu++1y and failing at finding std::enable_if_t, which only get introduced in c++14 but apparently not in this compiler.

Quite a few tests got cancelled because of these two failing, so it is difficult to estimate what is the real impact. I still prefer these tests to fail, because they fail rightfully as the compiler/toolchain do not support the features these tests are trying to test.

@risa2000
Copy link
Author

What might be a solution to this is using cmake detection for which c++ standard a particular compiler supports through CMAKE_CXX_KNOWN_FEATURES and build the tests only for those which are fully supported - i.e. no experimental/partial support.

cmake claims this should be working for some ancient compilers. I am not sure that all the tests use cmake though.
The scary part, on the other hand, is this:

The following meta features indicate general support for the associated language standard. It reflects the language support claimed by the compiler, but it does not necessarily imply complete conformance to that standard.

@risa2000
Copy link
Author

risa2000 commented Feb 1, 2025

@nlohmann @gregmarr
Could you run CI tests in a way they won't get cancelled when one in the row fails?
To get better visibility on how big the impact of the change is.

@nlohmann
Copy link
Owner

nlohmann commented Feb 1, 2025

Sorry, but this in not possible. :(

@risa2000
Copy link
Author

risa2000 commented Feb 1, 2025

@nlohmann Ok, I will try to tackle it one by one then (or batch by batch), using the approach pointed out by @gregmarr earlier. It seems that the relatively "modern" compilers are fine, so it might not be that bad.
The strategy would be to drop the tests for particular C++ standard for compilers which do not have required support.

It would help if I could schedule only particular test for running, but I guess this might not be possible either.

@nlohmann
Copy link
Owner

nlohmann commented Feb 1, 2025

You could remove the compilers you know to work from the matrix.

@github-actions github-actions bot added L CI and removed M labels Feb 1, 2025
@risa2000 risa2000 force-pushed the patch-fspath-conversions branch 2 times, most recently from 28fe853 to 7974976 Compare February 1, 2025 13:14
@github-actions github-actions bot added M and removed L CI labels Feb 1, 2025
@risa2000
Copy link
Author

risa2000 commented Feb 1, 2025

I could not change the workflow for ubuntu.yml in the commit (which makes sense for security reasons), so I reverted it back. I figured out how to run it on my fork.

@github-actions github-actions bot added the CMake label Feb 1, 2025
@risa2000
Copy link
Author

risa2000 commented Feb 1, 2025

The last commit drops tests which failed because the particular compiler did not support all the features of requested C++ standard:

  • Dropped c++17 tests for g++-5, g++-6 which do not have <optional>.
  • Dropped c++14 test for clang-3.4 because it does not have std::enable_if_t.

In total 3 tests were dropped.

@risa2000 risa2000 force-pushed the patch-fspath-conversions branch from 5f46f56 to 5f12f8c Compare February 1, 2025 20:00
@github-actions github-actions bot added CI and removed M labels Feb 1, 2025
@risa2000
Copy link
Author

risa2000 commented Feb 1, 2025

Running VS 2017 compiler with /std:c++latest by appveyor.yml caused the tests being compiled with c++20 support, except the compiler did not support char8_t.
This commit sets VS 2017 compiler options to /std:c++17.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants