Change: Use simdjson's ondemand API #5

Draft · spnda wants to merge 6 commits into main
Conversation

@spnda spnda commented Oct 16, 2022

In 0fe58d7 we switched from the on-demand API to the DOM API as it was slightly faster. This switches back to the on-demand API, using a different approach to parsing that will hopefully be faster than the current DOM-based code.

Design

The first iteration of the new parsing interface also benefited the DOM API, which is why it was already merged separately. However, this patch additionally ports the idea of iterating over all fields linearly to every function. Currently, we only iterate over fields this way for the root of the JSON document, using hashed strings to speed up the switch; in every other function the fields may still appear in the exact opposite order (the worst case) to how we read them, which is why I want to use this idea everywhere in fastgltf.
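To illustrate, here is a minimal sketch of that idea (a hypothetical parseNode and a hand-rolled FNV-1a hash, not fastgltf's actual code): iterate the object's fields in document order and dispatch on a compile-time hash of the key, so every field is read exactly once no matter where it appears.

#include <simdjson.h>
#include <cstdint>
#include <string_view>

// Hypothetical compile-time FNV-1a hash used to switch on field names.
constexpr std::uint32_t hashKey(std::string_view str) {
    std::uint32_t hash = 2166136261u;
    for (char c : str) {
        hash ^= static_cast<std::uint8_t>(c);
        hash *= 16777619u;
    }
    return hash;
}

simdjson::error_code parseNode(simdjson::ondemand::object object) {
    for (auto field : object) {
        std::string_view key;
        if (auto error = field.unescaped_key().get(key); error != simdjson::SUCCESS)
            return error;
        switch (hashKey(key)) {
            case hashKey("name"): /* read the node name */ break;
            case hashKey("mesh"): /* read the mesh index */ break;
            default: break; // unknown fields are skipped automatically by on-demand
        }
    }
    return simdjson::SUCCESS;
}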

Performance

Performance has gotten worse, which is why this is a draft PR; I will look into this drawback later. From a quick profiling run, it seems the ondemand::parser::iterate function takes a considerable amount of time, while the rest executes within at most 500 microseconds.

[Benchmark table: mean load-from-memory times for the Buggy glTF (16k lines, no embedded buffers or images); results not captured in this export]

@spnda spnda force-pushed the main branch 2 times, most recently from e8089ca to 5117fa2 on October 31, 2022, 21:35
@spnda spnda changed the title from "Change: New parse API and use of simdjson's ondemand API" to "Change: Use simdjson's ondemand API" on Dec 25, 2022
lemire commented Mar 14, 2023

  1. It is unsafe to assume that the builtin implementation is fallback. On my laptop, it is arm64. It should not be necessary.
  2. The count_elements() calls that the code keeps computing are probably hurting performance.
  3. Try compiling simdjson with SIMDJSON_VERBOSE_LOGGING (see the snippet below) and inspect the resulting log to see whether it looks sensible.
  4. Doing some profiling to see whether you have specific bottlenecks could be great (e.g., perf record/perf report).
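(For point 3: SIMDJSON_VERBOSE_LOGGING is simdjson's documented logging switch; it just has to be defined before the header is included, e.g.:

// Must be defined before simdjson.h is included (or passed via the build system).
#define SIMDJSON_VERBOSE_LOGGING 1
#include <simdjson.h>

simdjson will then log every structural step it takes during parsing, including fields it merely skips.)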

spnda commented Mar 17, 2023

> 1. It is unsafe to assume that the builtin implementation is fallback. On my laptop, it is arm64. It should not be necessary.
> 2. The count_elements() calls that the code keeps computing are probably hurting performance.
> 3. Try compiling simdjson with SIMDJSON_VERBOSE_LOGGING and inspect the resulting log to see whether it looks sensible.
> 4. Doing some profiling to see whether you have specific bottlenecks could be great (e.g., perf record/perf report).

  1. I've switched to including the simdjson header in fastgltf_parser.hpp. However, I really don't want to expose any headers that I don't actually need to expose. In this case, I wrap everything from simdjson in a unique_ptr, essentially using PIMPL (see the sketch after this list), which has worked amazingly so far. And as simdjson::ondemand is an alias for the builtin implementation, I can't just set it to anything, as you've said. Is there a recommended way of forward-declaring these types?
  2. I've removed those, and while looking at some data from SIMDJSON_VERBOSE_LOGGING I noticed that it was reading the whole file while searching for a field I used to "verify" some basics of the file (asset info, extension info). I removed that part and reduced the runtime (for one specific glTF file) from 5.2 ms to 4.6 ms on average, which is a lot quicker but actually still slower than the 4.2 ms of the DOM implementation.
  3. See above. I saw a gazillion lines of "skip" before it even started reading anything.
  4. I'm on my Windows machine right now, so I can't give much info. I did use Tracy once to analyze a few things in simdjson and found that stage 1 was sometimes taking up significantly more than half the time (which seemed a bit suspect). But I will re-test on my Mac within a few days, so I can get back to the simdjson issue with more information on what takes the longest and what the biggest issue in my code is.
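For context, a minimal sketch of the PIMPL arrangement from point 1 (names simplified and hypothetical, not fastgltf's exact layout): the public header only forward-declares an implementation struct, so <simdjson.h> never leaks to library users.

// fastgltf_parser.hpp -- public header, no simdjson include needed
#include <memory>

namespace fastgltf {
    struct ParserData; // defined in the .cpp, owns all simdjson state

    class Parser {
    public:
        Parser();
        ~Parser(); // defined out of line, where ParserData is complete

    private:
        std::unique_ptr<ParserData> data;
    };
} // namespace fastgltf

// fastgltf.cpp -- private implementation file
#include <simdjson.h>

namespace fastgltf {
    struct ParserData {
        simdjson::ondemand::parser parser;
    };

    Parser::Parser() : data(std::make_unique<ParserData>()) {}
    Parser::~Parser() = default;
} // namespace fastgltf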

Also sorry for the late response, I had a lot of stuff to do for school. I really appreciate your help.

lemire commented Mar 21, 2023

Thanks.

The functions that have to do with simdjson don’t have to be declared in your header file, do they?

Or, at least, they don’t have to be in a public header file?

lemire commented Mar 21, 2023

> I'm on my Windows machine right now, so I can't give much info.

If you are compiling under Windows, I recommend using Clang if possible. See

https://lemire.me/blog/2023/03/03/float-parsing-benchmark-regular-visual-studio-clangcl-and-linux-gcc/

spnda commented Mar 22, 2023

> If you are compiling under Windows, I recommend using Clang if possible. See
>
> https://lemire.me/blog/2023/03/03/float-parsing-benchmark-regular-visual-studio-clangcl-and-linux-gcc/

Thanks for the read. I've gone ahead and run my benchmarks using Clang-CL 16 (clang-cl from the LLVM downloads). It has actually made the fastgltf cases often four times faster on both master and this branch, though this PR's branch is still consistently lagging behind.

[Benchmark results screenshot]

lemire commented Mar 22, 2023

> It has actually made the fastgltf cases often four times faster on both master and this branch.

I see something like a factor of two... but yeah, Visual Studio has disappointing performance in some cases I care about.

@spnda spnda force-pushed the use_ondemand branch 2 times, most recently from ca17e96 to fc5b3c2 on February 2, 2024, 14:37
spnda commented Feb 2, 2024

@lemire I've revisited this PR because I want to add support for #41, which I think I want to do using raw JSON tokens as provided by the on-demand API.

Using verbose logging, I already managed to restructure my code so that there are no skips at all during the entire parsing process. I've looked through some things using Xcode Instruments and found some issues. I fixed a few of them, involving allocations and vector resizing where it wasn't needed, but this one stumps me (there's more than just the preview shown, and ignore the return at the top of the function, which I accidentally committed):

fastgltf/src/fastgltf.cpp

Lines 1330 to 1400 in fc5b3c2

// Function for parsing the min and max arrays for accessors
auto parseMinMax = [&](simdjson::ondemand::array& array, decltype(Accessor::max)& ref) -> fastgltf::Error {
    return Error::None;
    decltype(Accessor::max) variant;

    using double_vec = std::variant_alternative_t<1, decltype(Accessor::max)>;
    using int64_vec = std::variant_alternative_t<2, decltype(Accessor::max)>;

    // It's possible the min/max fields come before the accessor type is declared,
    // in which case we don't know the size.
    // This single line is actually incredibly important as this function without it takes up
    // roughly 15% of the entire parsing process on average due to vector resizing.
    auto initialCount = accessor.type == AccessorType::Invalid ? 0 : getNumComponents(accessor.type);
    if (accessor.componentType == ComponentType::Float || accessor.componentType == ComponentType::Double) {
        variant = FASTGLTF_CONSTRUCT_PMR_RESOURCE(double_vec, resourceAllocator.get(), 0);
    } else {
        variant = FASTGLTF_CONSTRUCT_PMR_RESOURCE(int64_vec, resourceAllocator.get(), 0);
    }

    for (auto element : array) {
        ondemand::number num;
        if (auto error = element.get_number().get(num); error != SUCCESS) {
            return error == INCORRECT_TYPE ? Error::InvalidGltf : Error::InvalidJson;
        }
        switch (num.get_number_type()) {
            case ondemand::number_type::floating_point_number: {
                // We can't safely promote double to ints. Therefore, if the element is a double,
                // but our component type is not a floating point, that's invalid.
                if (accessor.componentType != ComponentType::Float && accessor.componentType != ComponentType::Double) {
                    return Error::InvalidGltf;
                }
                if (!std::holds_alternative<double_vec>(variant)) {
                    return Error::InvalidGltf;
                }
                std::get<double_vec>(variant).emplace_back(num.get_double());
                break;
            }
            case ondemand::number_type::signed_integer: {
                std::int64_t value = num.get_int64();
                if (auto* doubles = std::get_if<double_vec>(&variant); doubles != nullptr) {
                    (*doubles).emplace_back(static_cast<double>(value));
                } else if (auto* ints = std::get_if<int64_vec>(&variant); ints != nullptr) {
                    (*ints).emplace_back(static_cast<std::int64_t>(value));
                } else {
                    return Error::InvalidGltf;
                }
                break;
            }
            case ondemand::number_type::unsigned_integer: {
                // Note that the glTF spec doesn't care about any integer larger than 32-bits, so
                // truncating uint64 to int64 wouldn't make any difference, as those large values
                // aren't allowed anyway.
                std::uint64_t value = num.get_uint64();
                if (auto* doubles = std::get_if<double_vec>(&variant); doubles != nullptr) {
                    (*doubles).emplace_back(static_cast<double>(value));
                } else if (auto* ints = std::get_if<int64_vec>(&variant); ints != nullptr) {
                    (*ints).emplace_back(static_cast<std::int64_t>(value));
                } else {
                    return Error::InvalidGltf;
                }
                break;
            }
        }
    }

    ref = std::move(variant);
    return Error::None;
};

This lambda function takes up 36% of the entire average runtime of the parseAccessors function. It essentially just reads a number from an array, checks whether it's a double, float, or integer, and then converts it appropriately. Is there some better way to do this with simdjson, so that its runtime is not so excessive? Do you have any suggestions for improving that function?

[Xcode Instruments profiling screenshot, 2024-02-02]

@lemire
Copy link

lemire commented Feb 2, 2024

@spnda Having a look.

Review thread on src/fastgltf.cpp (outdated):
        break;
    }
    case ondemand::number_type::unsigned_integer: {
        // Note that the glTF spec doesn't care about any integer larger than 32-bits, so
@lemire

It should only trigger unsigned_integer if the integer is larger than or equal to 2**63, so maybe this case is an error for you, or needs to be treated as a double?

@spnda spnda (Owner, Author) Feb 2, 2024

The data included in the arrays can at most be 2**32 (or 32-bit floats), so that case is probably just never reached. If it's only ever taken with values larger than or equal to 2**63, I guess I'd be safe to just remove it. Though for correctness, would it really be bad to keep it? Or are you saying that I should always error in this case because such large numbers shouldn't exist?

@lemire

> would it really be bad to keep it?

I recommend that you simplify your code prior to optimizing it. It is much easier to optimize simpler code.

I tried dumping the assembly from this function and I got several pages of instructions. It is just not possible to understand that much code at once. If you can trim it down to something simpler, then it might be much easier to understand where the performance challenges are.


for (auto element : array) {
@lemire lemire Feb 2, 2024

I am not sure I understand this code, but I would expect that you could do...

if (accessor.componentType == ComponentType::Double) {
    auto& v = std::get<double_vec>(variant); // note the reference; otherwise you append to a copy
    for (auto element : array) {
        // error handling as needed
        v.push_back(element.get_double());
    }
}

And so forth. The number type is a fallback for complicated scenarios, where you need to branch according to the number type, but if you know that a double is good enough in some instances, then just grab that.

@spnda spnda (Owner, Author) Feb 2, 2024

Okay, so a few things first:

  • I should actually be downcasting to single precision in all cases and never use double directly:

    For floating-point components, JSON-stored minimum and maximum values represent single precision floats and SHOULD be rounded to single precision before usage to avoid any potential boundary mismatches.

  • JSON does not distinguish between integers and floating-point numbers, as you know. Therefore, based on accessor.componentType, I have to cast either to float or to an integer type. The glTF spec specifically says this:

    Array elements MUST be treated as having the same data type as accessor’s componentType.

    From that, I understand that if the component type specifies an integer but the value in the JSON is 3.5 for whatever reason, I have to just cast it to an integer. The specification makes no further clarifications on how the values should be treated, as far as I can tell.

As I don't know how simdjson parses or identifies numbers internally, would you say it'd be safe to just have this instead, to avoid the ondemand::number object?

switch (accessor.componentType) {
    case ComponentType::Double:
        v.emplace_back(element.get_double());
        break;
    case ComponentType::Float:
        ...
}

@spnda spnda (Owner, Author)

Oh, and most importantly: because I am doing order-agnostic parsing, there's a chance the componentType field hasn't been parsed yet when this function executes, which forces me to rely on simdjson's number-type identification. So the proposed code isn't a possibility, and this also reveals an issue with the current implementation, because I shouldn't be reading anything from the accessor object while parsing.

@lemire

Converting to float from double is typically done with a single instruction (e.g., cvtsd2ss). It is a high latency operation, but not overly expensive. So if you don't need to immediately check the result, it should be quite cheap.

> would you say it'd be safe to just have this

I wouldn't use a switch-case to handle individual elements if I could avoid it; I would have distinct code paths based on componentType. It depends on the context, but repeated calls into a switch/case can be detrimental to performance. Not always; it depends on how the compiler decides to handle it and how precisely you wrote the code. But to be safe, assume that a switch/case adds significant overhead, because it might. It is not the switch-case per se that's the problem; it is that it prevents some useful optimizations by the compiler.

Of course, trying to parse "3.5" as an integer will cause an error, but you could handle this with an exceptional case (using a cold path if needed). A sketch of what I mean follows.
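A sketch of that suggestion applied here, assuming componentType were known before the array is read (which, as noted in the next reply, order-agnostic parsing does not guarantee): the type branch is hoisted out of the loop, leaving two tight loops with no per-element dispatch.

if (accessor.componentType == ComponentType::Float || accessor.componentType == ComponentType::Double) {
    auto& values = std::get<double_vec>(variant);
    for (auto element : array) {
        double value;
        if (element.get_double().get(value) != SUCCESS)
            return Error::InvalidGltf;
        values.emplace_back(value);
    }
} else {
    auto& values = std::get<int64_vec>(variant);
    for (auto element : array) {
        std::int64_t value;
        if (element.get_int64().get(value) != SUCCESS)
            return Error::InvalidGltf; // cold path: e.g. "3.5" in an integer accessor
        values.emplace_back(value);
    }
}
ref = std::move(variant);
return Error::None;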

@lemire

> So the proposed code isn't a possibility

My proposal is that you try it and benchmark. I submit to you that you need to run experiments to find the right design, even if it means temporarily having code that does not solve the right problem. You need to identify the bottleneck, and that's damn difficult if you are just staring at high-level code.

@spnda spnda (Owner, Author) Feb 2, 2024

I wrote a small benchmark which parses the same JSON array containing integers and doubles (50-50 distribution). It turns out get_number is ever so slightly faster in my little experiment, even with the switch-case covering those three cases. Though the difference is so small (0.1 µs to 0.3 µs) that I think it makes no difference in real-world scenarios, and it is probably within the margin of error anyway. I'll try some other experiments and see what I can find...

lemire commented Feb 2, 2024

> This lambda function takes up 36% of the entire average runtime of the parseAccessors function. It essentially just reads a number from an array, checks whether it's a double, float, or integer, and then converts it appropriately. Is there some better way to do this with simdjson, so that its runtime is not so excessive? Do you have any suggestions for improving that function?

The whole function does a lot of complicated work.

If you turn it into something like ...

auto& v = std::get<double_vec>(variant); // note the reference; otherwise you append to a copy
for (auto element : array) {
    // error handling as needed
    v.push_back(element.get_double());
}

You should be pretty close to being limited by the number-parsing speed. If not, then you have other overhead.

My recommendation is to try to break down your code into simpler functions. There is a lot going on.

It may also help to include a standard benchmark as part of the project. If I could run a benchmark easily, I might be able to give you more insights. Just looking at high-level code buys only so much; I need to know what the code is doing in practice.

lemire commented Feb 2, 2024

This part is important: I recommend including a standard benchmark as part of your project. If I could just do...

cmake --build build
./build/benchmark

That would be tremendously useful, not just to me but to anyone who cares about performance.

lemire commented Feb 2, 2024

Feel free to steal some of my benchmarking code:

https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/2023/11/28/benchmarks

lemire commented Feb 2, 2024

My recommendation is to rely more on experiments and repeated high-quality benchmarks than on profiling. Profilers mislead us all the time. They may point at a potential direction, but that's all. There are exceptions: if you do pure numerical analysis, you may find 3 lines of code where your program spends 99% of its time, and then you rewrite them in assembly and you are done... but for a problem such as this one, experiments are very likely to be the way forward. One needs to gain insight into what is costing what.

For example, how much does the get_number() function cost as opposed to get_double()? It takes just one experiment to find out; a sketch of such an experiment follows.
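To make that concrete, here is a sketch of such an experiment (assuming simdjson with exceptions enabled; the JSON content and iteration count are made up): both variants walk the same array and differ only in whether they go through ondemand::number.

#include <simdjson.h>
#include <chrono>
#include <cstdio>

int main() {
    using namespace simdjson;
    padded_string json(R"({"values":[1.5,2,3.25,4,5.5,6,7.75,8]})");
    ondemand::parser parser;

    for (int mode = 0; mode < 2; ++mode) {
        double sum = 0; // accumulated so the compiler cannot discard the parsing
        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < 100000; ++i) {
            auto doc = parser.iterate(json);
            for (auto element : doc["values"].get_array()) {
                if (mode == 0) {
                    sum += double(element.get_double());
                } else {
                    ondemand::number num = element.get_number();
                    sum += num.get_number_type() == ondemand::number_type::floating_point_number
                               ? num.get_double()
                               : static_cast<double>(num.get_int64());
                }
            }
        }
        auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                      std::chrono::steady_clock::now() - start).count();
        std::printf("%s: %lld ns (sum=%f)\n", mode == 0 ? "get_double" : "get_number",
                    static_cast<long long>(ns), sum);
    }
}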

spnda commented Feb 2, 2024

> This part is important: I recommend including a standard benchmark as part of your project. If I could just do...
>
> cmake --build build
> ./build/benchmark
>
> That would be tremendously useful, not just to me but to anyone who cares about performance.

I already include benchmarks (which I am also using right now locally); they use the Catch2 testing framework. I should probably add something to the README about how to build and run the tests...

This is how you'd build and run the benchmarks. Note that these require downloading external sample assets, because I currently just benchmark an entire run over a JSON asset. I mostly wrote them to work well for me locally and for the CI.

python fetch_test_deps.py
cmake -B build -DFASTGLTF_ENABLE_TESTS=ON
cmake --build build --target fastgltf_tests
build/tests/fastgltf_tests "[gltf-benchmark]"

(Note that the quotation marks in the last line might just be a requirement of zsh, which I'm using locally, so you might have to remove them.)

lemire commented Feb 2, 2024

> I already include benchmarks

The parseAccessors function is a tiny component of this benchmark.

[Two profiling screenshots, 2024-02-02]

lemire commented Feb 2, 2024

@spnda Here is my recommendation.

Take your current code, do all the parsing and error handling, but do not touch your data structures: do not call reserve, emplace_back, or push_back, at least as far as the numbers are concerned.

Next, do the reverse: do not call get_number. Iterate through the array, but do not parse the values; just add random numbers to your data structures (say, random numbers between 0 and 1).

And then compare with the current code.

Lay out the three results. This should tell you a lot about where the time goes; the sketch below shows the two modified variants.

My general point is that you need to run experiments and collect data.
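A sketch of the two modified variants, reusing the types and error handling from the parseMinMax snippet above:

// Variant B: parsing only -- keep all the simdjson work, never touch the containers.
double sink = 0; // accumulated so the compiler cannot remove the parsing
for (auto element : array) {
    double value;
    if (element.get_double().get(value) != SUCCESS)
        return Error::InvalidGltf;
    sink += value;
}

// Variant C: storage only -- iterate but never parse; store canned values instead.
auto& values = std::get<double_vec>(variant);
for (auto element : array) {
    (void)element;            // unconsumed values are skipped by on-demand
    values.emplace_back(0.5); // stand-in for a parsed number
}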

spnda commented Feb 2, 2024

> @spnda Here is my recommendation.
>
> Take your current code, do all the parsing and error handling, but do not touch your data structures: do not call reserve, emplace_back, or push_back, at least as far as the numbers are concerned.
>
> Next, do the reverse: do not call get_number. Iterate through the array, but do not parse the values; just add random numbers to your data structures (say, random numbers between 0 and 1).
>
> And then compare with the current code.
>
> Lay out the three results. This should tell you a lot about where the time goes.
>
> My general point is that you need to run experiments and collect data.

Well, big thanks for taking the time to look over my project. I'll do some testing along the lines of what you suggested; I'm not very knowledgeable about these low-level optimizations, so much of this is still new to me. Using profiling, I managed to find a few issues in my code apart from the parseMinMax lambda, which has already had a big impact on the final speed. The original question of why DOM was faster is now mostly solved, at least on my arm64 M3. I still expect the on-demand code to be quicker than it currently is, but any improvement is welcome right now. On my 5800X, however, I still have a measurable performance regression in all cases, so I'll have to investigate on both chips at the same time to cover both. Another user also suggested on Discord that the slowdown might be cache/memory related, which I have yet to test and confirm.

lemire commented Feb 3, 2024

@spnda The reason I raise these questions is that software performance is complex and requires hard numbers and experimentation.

For example, just by tracing the code without too much attention, I noticed that you are often storing a single integer in a std::vector which itself sits inside a std::variant. This is fine if that's what you need, but constructing these data structures is expensive, and the parsing routines themselves may not carry much weight.

But it is impossible to tell without experimentation, and profiling won't tell you.
