Change: Use simdjson's ondemand API #5

Draft · spnda wants to merge 6 commits into main
Conversation

@spnda spnda commented Oct 16, 2022

In 0fe58d7 we switched from the on-demand API to the DOM API as it was slightly faster. This switches back to the on-demand API, using a different approach to parsing that will hopefully be faster than the current DOM-based code.

Design

The first iteration of the new parsing interface also benefited the DOM API, which is why it was already merged separately. However, this patch additionally ports the idea of iterating over all fields linearly to every function. Currently, we only iterate over fields this way for the root of the JSON document, using hashed strings to speed up the switch; in every other function the fields may still appear in the exact opposite order (the worst case) to how we read them, which is why I want to use this idea everywhere in fastgltf.
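To illustrate, here is a minimal sketch of that idea (a hypothetical parseNode and a hand-rolled FNV-1a hash, not fastgltf's actual code): iterate the object's fields in document order and dispatch on a compile-time hash of the key, so every field is read exactly once no matter where it appears.

#include <simdjson.h>
#include <cstdint>
#include <string_view>

// Hypothetical compile-time FNV-1a hash used to switch on field names.
constexpr std::uint32_t hashKey(std::string_view str) {
    std::uint32_t hash = 2166136261u;
    for (char c : str) {
        hash ^= static_cast<std::uint8_t>(c);
        hash *= 16777619u;
    }
    return hash;
}

simdjson::error_code parseNode(simdjson::ondemand::object object) {
    for (auto field : object) {
        std::string_view key;
        if (auto error = field.unescaped_key().get(key); error != simdjson::SUCCESS)
            return error;
        switch (hashKey(key)) {
            case hashKey("name"): /* read the node name */ break;
            case hashKey("mesh"): /* read the mesh index */ break;
            default: break; // unknown fields are skipped automatically by on-demand
        }
    }
    return simdjson::SUCCESS;
}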

Performance

Performance has gotten worse, which is why this is a draft PR; I will look into this drawback later. From a quick profiling run, it seems the ondemand::parser::iterate function takes a considerable amount of time, while the rest executes within at most 500 microseconds.

[Benchmark table: mean load-from-memory times for the Buggy glTF (16k lines, no embedded buffers or images); results not captured in this export]

@spnda spnda force-pushed the main branch 2 times, most recently from e8089ca to 5117fa2 on October 31, 2022, 21:35
@spnda spnda changed the title from "Change: New parse API and use of simdjson's ondemand API" to "Change: Use simdjson's ondemand API" on Dec 25, 2022
lemire commented Mar 14, 2023

  1. It is unsafe to assume that the builtin implementation is fallback. On my laptop, it is arm64. It should not be necessary.
  2. The count_elements() calls that the code keeps computing are probably hurting performance.
  3. Try compiling simdjson with SIMDJSON_VERBOSE_LOGGING (see the snippet below) and inspect the resulting log to see whether it looks sensible.
  4. Doing some profiling to see whether you have specific bottlenecks could be great (e.g., perf record/perf report).
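(For point 3: SIMDJSON_VERBOSE_LOGGING is simdjson's documented logging switch; it just has to be defined before the header is included, e.g.:

// Must be defined before simdjson.h is included (or passed via the build system).
#define SIMDJSON_VERBOSE_LOGGING 1
#include <simdjson.h>

simdjson will then log every structural step it takes during parsing, including fields it merely skips.)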

spnda commented Mar 17, 2023

> 1. It is unsafe to assume that the builtin implementation is fallback. On my laptop, it is arm64. It should not be necessary.
> 2. The count_elements() calls that the code keeps computing are probably hurting performance.
> 3. Try compiling simdjson with SIMDJSON_VERBOSE_LOGGING and inspect the resulting log to see whether it looks sensible.
> 4. Doing some profiling to see whether you have specific bottlenecks could be great (e.g., perf record/perf report).

  1. I've switched to including the simdjson header in fastgltf_parser.hpp. However, I really don't want to expose any headers that I don't actually need to expose. In this case, I wrap everything from simdjson in a unique_ptr, essentially using PIMPL (see the sketch after this list), which has worked amazingly so far. And as simdjson::ondemand is an alias for the builtin implementation, I can't just set it to anything, as you've said. Is there a recommended way of forward-declaring these types?
  2. I've removed those, and while looking at some data from SIMDJSON_VERBOSE_LOGGING I noticed that it was reading the whole file while searching for a field I used to "verify" some basics of the file (asset info, extension info). I removed that part and reduced the runtime (for one specific glTF file) from 5.2 ms to 4.6 ms on average, which is a lot quicker but actually still slower than the 4.2 ms of the DOM implementation.
  3. See above. I saw a gazillion lines of "skip" before it even started reading anything.
  4. I'm on my Windows machine right now, so I can't give much info. I did use Tracy once to analyze a few things in simdjson and found that stage 1 was sometimes taking up significantly more than half the time (which seemed a bit suspect). But I will re-test on my Mac within a few days, so I can get back to the simdjson issue with more information on what takes the longest and what the biggest issue in my code is.
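For context, a minimal sketch of the PIMPL arrangement from point 1 (names simplified and hypothetical, not fastgltf's exact layout): the public header only forward-declares an implementation struct, so <simdjson.h> never leaks to library users.

// fastgltf_parser.hpp -- public header, no simdjson include needed
#include <memory>

namespace fastgltf {
    struct ParserData; // defined in the .cpp, owns all simdjson state

    class Parser {
    public:
        Parser();
        ~Parser(); // defined out of line, where ParserData is complete

    private:
        std::unique_ptr<ParserData> data;
    };
} // namespace fastgltf

// fastgltf.cpp -- private implementation file
#include <simdjson.h>

namespace fastgltf {
    struct ParserData {
        simdjson::ondemand::parser parser;
    };

    Parser::Parser() : data(std::make_unique<ParserData>()) {}
    Parser::~Parser() = default;
} // namespace fastgltf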

Also sorry for the late response, I had a lot of stuff to do for school. I really appreciate your help.

lemire commented Mar 21, 2023

Thanks.

The functions that have to do with simdjson don’t have to be declared in your header file, do they?

Or, at least, they don’t have to be in a public header file?

lemire commented Mar 21, 2023

> I'm on my Windows machine right now, so I can't give much info.

If you are compiling under Windows, I recommend using Clang if possible. See

https://lemire.me/blog/2023/03/03/float-parsing-benchmark-regular-visual-studio-clangcl-and-linux-gcc/

spnda commented Mar 22, 2023

> If you are compiling under Windows, I recommend using Clang if possible. See
>
> https://lemire.me/blog/2023/03/03/float-parsing-benchmark-regular-visual-studio-clangcl-and-linux-gcc/

Thanks for the read. I've gone ahead and run my benchmarks using Clang-CL 16 (clang-cl from the LLVM downloads). It has actually made the fastgltf cases often four times faster on both master and this branch, though this PR's branch is still consistently lagging behind.

[Benchmark results screenshot]

lemire commented Mar 22, 2023

> It has actually made the fastgltf cases often four times faster on both master and this branch.

I see something like a factor of two... but yeah, Visual Studio has disappointing performance in some cases I care about.

@spnda spnda force-pushed the use_ondemand branch 2 times, most recently from ca17e96 to fc5b3c2 on February 2, 2024, 14:37
spnda commented Feb 2, 2024

@lemire I've revisited this PR because I want to add support for #41, which I think I want to do using raw JSON tokens as provided by the on-demand API.

Using verbose logging, I already managed to restructure my code so that there are no skips at all during the entire parsing process. I've looked through some things using Xcode Instruments and found some issues. I fixed a few of them, involving allocations and vector resizing where it wasn't needed, but this one stumps me (there's more than just the preview shown, and ignore the return at the top of the function, which I accidentally committed):

fastgltf/src/fastgltf.cpp

Lines 1330 to 1400 in fc5b3c2

// Function for parsing the min and max arrays for accessors
auto parseMinMax = [&](simdjson::ondemand::array& array, decltype(Accessor::max)& ref) -> fastgltf::Error {
    return Error::None;
    decltype(Accessor::max) variant;

    using double_vec = std::variant_alternative_t<1, decltype(Accessor::max)>;
    using int64_vec = std::variant_alternative_t<2, decltype(Accessor::max)>;

    // It's possible the min/max fields come before the accessor type is declared,
    // in which case we don't know the size.
    // This single line is actually incredibly important as this function without it takes up
    // roughly 15% of the entire parsing process on average due to vector resizing.
    auto initialCount = accessor.type == AccessorType::Invalid ? 0 : getNumComponents(accessor.type);
    if (accessor.componentType == ComponentType::Float || accessor.componentType == ComponentType::Double) {
        variant = FASTGLTF_CONSTRUCT_PMR_RESOURCE(double_vec, resourceAllocator.get(), 0);
    } else {
        variant = FASTGLTF_CONSTRUCT_PMR_RESOURCE(int64_vec, resourceAllocator.get(), 0);
    }

    for (auto element : array) {
        ondemand::number num;
        if (auto error = element.get_number().get(num); error != SUCCESS) {
            return error == INCORRECT_TYPE ? Error::InvalidGltf : Error::InvalidJson;
        }
        switch (num.get_number_type()) {
            case ondemand::number_type::floating_point_number: {
                // We can't safely promote double to ints. Therefore, if the element is a double,
                // but our component type is not a floating point, that's invalid.
                if (accessor.componentType != ComponentType::Float && accessor.componentType != ComponentType::Double) {
                    return Error::InvalidGltf;
                }
                if (!std::holds_alternative<double_vec>(variant)) {
                    return Error::InvalidGltf;
                }
                std::get<double_vec>(variant).emplace_back(num.get_double());
                break;
            }
            case ondemand::number_type::signed_integer: {
                std::int64_t value = num.get_int64();
                if (auto* doubles = std::get_if<double_vec>(&variant); doubles != nullptr) {
                    (*doubles).emplace_back(static_cast<double>(value));
                } else if (auto* ints = std::get_if<int64_vec>(&variant); ints != nullptr) {
                    (*ints).emplace_back(static_cast<std::int64_t>(value));
                } else {
                    return Error::InvalidGltf;
                }
                break;
            }
            case ondemand::number_type::unsigned_integer: {
                // Note that the glTF spec doesn't care about any integer larger than 32-bits, so
                // truncating uint64 to int64 wouldn't make any difference, as those large values
                // aren't allowed anyway.
                std::uint64_t value = num.get_uint64();
                if (auto* doubles = std::get_if<double_vec>(&variant); doubles != nullptr) {
                    (*doubles).emplace_back(static_cast<double>(value));
                } else if (auto* ints = std::get_if<int64_vec>(&variant); ints != nullptr) {
                    (*ints).emplace_back(static_cast<std::int64_t>(value));
                } else {
                    return Error::InvalidGltf;
                }
                break;
            }
        }
    }

    ref = std::move(variant);
    return Error::None;
};

This lambda function takes up 36% of the entire average runtime of the parseAccessors function. It essentially just reads a number from an array, checks whether it's a double, float, or integer, and then converts it appropriately. Is there some better way to do this with simdjson, so that its runtime is not so excessive? Do you have any suggestions for improving that function?

[Xcode Instruments profiling screenshot, 2024-02-02]

@lemire
Copy link

lemire commented Feb 2, 2024

@spnda Having a look.

Review thread on src/fastgltf.cpp (outdated):
        break;
    }
    case ondemand::number_type::unsigned_integer: {
        // Note that the glTF spec doesn't care about any integer larger than 32-bits, so
@lemire

It should only trigger unsigned_integer if the integer is larger than or equal to 2**63, so maybe this case is an error for you, or needs to be treated as a double?

@spnda spnda (Owner, Author) Feb 2, 2024

The data included in the arrays can at most be 2**32 (or 32-bit floats), so that case is probably just never reached. If it's only ever taken with values larger than or equal to 2**63, I guess I'd be safe to just remove it. Though for correctness, would it really be bad to keep it? Or are you saying that I should always error in this case because such large numbers shouldn't exist?

@lemire

> would it really be bad to keep it?

I recommend that you simplify your code prior to optimizing it. It is much easier to optimize simpler code.

I tried dumping the assembly from this function and I got several pages of instructions. It is just not possible to understand that much code at once. If you can trim it down to something simpler, then it might be much easier to understand where the performance challenges are.


for (auto element : array) {
@lemire lemire Feb 2, 2024

I am not sure I understand this code, but I would expect that you could do...

if (accessor.componentType == ComponentType::Double) {
    auto& v = std::get<double_vec>(variant); // note the reference; otherwise you append to a copy
    for (auto element : array) {
        // error handling as needed
        v.push_back(element.get_double());
    }
}

And so forth. The number type is a fallback for complicated scenarios, where you need to branch according to the number type, but if you know that a double is good enough in some instances, then just grab that.

@spnda spnda (Owner, Author) Feb 2, 2024

Okay, so a few things first:

  • I should actually be downcasting to single precision in all cases and never use double directly:

    For floating-point components, JSON-stored minimum and maximum values represent single precision floats and SHOULD be rounded to single precision before usage to avoid any potential boundary mismatches.

  • JSON does not distinguish between integers and floating-point numbers, as you know. Therefore, based on accessor.componentType, I have to cast either to float or to an integer type. The glTF spec specifically says this:

    Array elements MUST be treated as having the same data type as accessor’s componentType.

    From that, I understand that if the component type specifies an integer but the value in the JSON is 3.5 for whatever reason, I have to just cast it to an integer. The specification makes no further clarifications on how the values should be treated, as far as I can tell.

As I don't know how simdjson parses or identifies numbers internally, would you say it'd be safe to just have this instead, to avoid the ondemand::number object?

switch (accessor.componentType) {
    case ComponentType::Double:
        v.emplace_back(element.get_double());
        break;
    case ComponentType::Float:
        ...
}

@spnda spnda (Owner, Author)

Oh, and most importantly: because I am doing order-agnostic parsing, there's a chance the componentType field hasn't been parsed yet when this function executes, which forces me to rely on simdjson's number-type identification. So the proposed code isn't a possibility, and this also reveals an issue with the current implementation, because I shouldn't be reading anything from the accessor object while parsing.

@lemire

Converting to float from double is typically done with a single instruction (e.g., cvtsd2ss). It is a high latency operation, but not overly expensive. So if you don't need to immediately check the result, it should be quite cheap.

> would you say it'd be safe to just have this

I wouldn't use a switch-case to handle individual elements if I could avoid it; I would have distinct code paths based on componentType. It depends on the context, but repeated calls into a switch/case can be detrimental to performance. Not always; it depends on how the compiler decides to handle it and how precisely you wrote the code. But to be safe, assume that a switch/case adds significant overhead, because it might. It is not the switch-case per se that's the problem; it is that it prevents some useful optimizations by the compiler.

Of course, trying to parse "3.5" as an integer will cause an error, but you could handle this with an exceptional case (using a cold path if needed). A sketch of what I mean follows.
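A sketch of that suggestion applied here, assuming componentType were known before the array is read (which, as noted in the next reply, order-agnostic parsing does not guarantee): the type branch is hoisted out of the loop, leaving two tight loops with no per-element dispatch.

if (accessor.componentType == ComponentType::Float || accessor.componentType == ComponentType::Double) {
    auto& values = std::get<double_vec>(variant);
    for (auto element : array) {
        double value;
        if (element.get_double().get(value) != SUCCESS)
            return Error::InvalidGltf;
        values.emplace_back(value);
    }
} else {
    auto& values = std::get<int64_vec>(variant);
    for (auto element : array) {
        std::int64_t value;
        if (element.get_int64().get(value) != SUCCESS)
            return Error::InvalidGltf; // cold path: e.g. "3.5" in an integer accessor
        values.emplace_back(value);
    }
}
ref = std::move(variant);
return Error::None;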

@lemire

> So the proposed code isn't a possibility

My proposal is that you try it and benchmark. I submit to you that you need to run experiments to find the right design, even if it means temporarily having code that does not solve the right problem. You need to identify the bottleneck, and that's damn difficult if you are just staring at high-level code.

@spnda spnda (Owner, Author) Feb 2, 2024

I wrote a small benchmark which parses the same JSON array containing integers and doubles (50-50 distribution). It turns out get_number is ever so slightly faster in my little experiment, even with the switch-case covering those three cases. Though the difference is so small (0.1 µs to 0.3 µs) that I think it makes no difference in real-world scenarios, and it is probably within the margin of error anyway. I'll try some other experiments and see what I can find...

lemire commented Feb 2, 2024

> This lambda function takes up 36% of the entire average runtime of the parseAccessors function. It essentially just reads a number from an array, checks whether it's a double, float, or integer, and then converts it appropriately. Is there some better way to do this with simdjson, so that its runtime is not so excessive? Do you have any suggestions for improving that function?

The whole function does a lot of complicated work.

If you turn it into something like ...

auto& v = std::get<double_vec>(variant); // note the reference; otherwise you append to a copy
for (auto element : array) {
    // error handling as needed
    v.push_back(element.get_double());
}

You should be pretty close to being limited by the number-parsing speed. If not, then you have other overhead.

My recommendation is to try to break down your code into simpler functions. There is a lot going on.

It may also help to include a standard benchmark as part of the project. If I could run a benchmark easily, I might be able to give you more insights. Just looking at high-level code buys only so much; I need to know what the code is doing in practice.

lemire commented Feb 2, 2024

This part is important: I recommend including a standard benchmark as part of your project. If I could just do...

cmake --build build
./build/benchmark

That would be tremendously useful, not just to me but to anyone who cares about performance.

lemire commented Feb 2, 2024

Feel free to steal some of my benchmarking code:

https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/2023/11/28/benchmarks

lemire commented Feb 2, 2024

My recommendation is to rely more on experiments and repeated high-quality benchmarks than on profiling. Profilers mislead us all the time. They may point at a potential direction, but that's all. There are exceptions: if you do pure numerical analysis, you may find 3 lines of code where your program spends 99% of its time, and then you rewrite them in assembly and you are done... but for a problem such as this one, experiments are very likely to be the way forward. One needs to gain insight into what is costing what.

For example, how much does the get_number() function cost as opposed to get_double()? It takes just one experiment to find out; a sketch of such an experiment follows.
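To make that concrete, here is a sketch of such an experiment (assuming simdjson with exceptions enabled; the JSON content and iteration count are made up): both variants walk the same array and differ only in whether they go through ondemand::number.

#include <simdjson.h>
#include <chrono>
#include <cstdio>

int main() {
    using namespace simdjson;
    padded_string json(R"({"values":[1.5,2,3.25,4,5.5,6,7.75,8]})");
    ondemand::parser parser;

    for (int mode = 0; mode < 2; ++mode) {
        double sum = 0; // accumulated so the compiler cannot discard the parsing
        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < 100000; ++i) {
            auto doc = parser.iterate(json);
            for (auto element : doc["values"].get_array()) {
                if (mode == 0) {
                    sum += double(element.get_double());
                } else {
                    ondemand::number num = element.get_number();
                    sum += num.get_number_type() == ondemand::number_type::floating_point_number
                               ? num.get_double()
                               : static_cast<double>(num.get_int64());
                }
            }
        }
        auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                      std::chrono::steady_clock::now() - start).count();
        std::printf("%s: %lld ns (sum=%f)\n", mode == 0 ? "get_double" : "get_number",
                    static_cast<long long>(ns), sum);
    }
}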

spnda commented Feb 2, 2024

> This part is important: I recommend including a standard benchmark as part of your project. If I could just do...
>
> cmake --build build
> ./build/benchmark
>
> That would be tremendously useful, not just to me but to anyone who cares about performance.

I already include benchmarks (which I am also using right now locally); they use the Catch2 testing framework. I should probably add something to the README about how to build and run the tests...

This is how you'd build and run the benchmarks. Note that these require downloading external sample assets, because I currently just benchmark an entire run over a JSON asset. I mostly wrote them to work well for me locally and for the CI.

python fetch_test_deps.py
cmake -B build -DFASTGLTF_ENABLE_TESTS=ON
cmake --build build --target fastgltf_tests
build/tests/fastgltf_tests "[gltf-benchmark]"

(Note that the quotation marks in the last line might just be a requirement of zsh, which I'm using locally, so you might have to remove them.)

lemire commented Feb 2, 2024

> I already include benchmarks

The parseAccessors function is a tiny component of this benchmark.

[Two profiling screenshots, 2024-02-02]

lemire commented Feb 2, 2024

@spnda Here is my recommendation.

Take your current code, do all the parsing and error handling, but do not touch your data structures: do not call reserve, emplace_back, or push_back, at least as far as the numbers are concerned.

Next, do the reverse: do not call get_number. Iterate through the array, but do not parse the values; just add random numbers to your data structures (say, random numbers between 0 and 1).

And then compare with the current code.

Lay out the three results. This should tell you a lot about where the time goes; the sketch below shows the two modified variants.

My general point is that you need to run experiments and collect data.
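A sketch of the two modified variants, reusing the types and error handling from the parseMinMax snippet above:

// Variant B: parsing only -- keep all the simdjson work, never touch the containers.
double sink = 0; // accumulated so the compiler cannot remove the parsing
for (auto element : array) {
    double value;
    if (element.get_double().get(value) != SUCCESS)
        return Error::InvalidGltf;
    sink += value;
}

// Variant C: storage only -- iterate but never parse; store canned values instead.
auto& values = std::get<double_vec>(variant);
for (auto element : array) {
    (void)element;            // unconsumed values are skipped by on-demand
    values.emplace_back(0.5); // stand-in for a parsed number
}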

spnda commented Feb 2, 2024

> @spnda Here is my recommendation.
>
> Take your current code, do all the parsing and error handling, but do not touch your data structures: do not call reserve, emplace_back, or push_back, at least as far as the numbers are concerned.
>
> Next, do the reverse: do not call get_number. Iterate through the array, but do not parse the values; just add random numbers to your data structures (say, random numbers between 0 and 1).
>
> And then compare with the current code.
>
> Lay out the three results. This should tell you a lot about where the time goes.
>
> My general point is that you need to run experiments and collect data.

Well, big thanks for taking the time to look over my project. I'll do some testing along the lines of what you suggested; I'm not very knowledgeable about these low-level optimizations, so much of this is still new to me. Using profiling, I managed to find a few issues in my code apart from the parseMinMax lambda, which has already had a big impact on the final speed. The original question of why DOM was faster is now mostly solved, at least on my arm64 M3. I still expect the on-demand code to be quicker than it currently is, but any improvement is welcome right now. On my 5800X, however, I still have a measurable performance regression in all cases, so I'll have to investigate on both chips at the same time to cover both. Another user also suggested on Discord that the slowdown might be cache/memory related, which I have yet to test and confirm.

lemire commented Feb 3, 2024

@spnda The reason I raise these questions is that software performance is complex and requires hard numbers and experimentation.

For example, just by tracing the code without too much attention, I noticed that you are often storing a single integer in a std::vector which itself sits inside a std::variant. This is fine if that's what you need, but constructing these data structures is expensive, and the parsing routines themselves may not carry much weight.

But it is impossible to tell without experimentation, and profiling won't tell you.
