
Improve the performance of certain GraphNode methods and avoid recalculating the AABB multiple times during rendering. #7245

Open · wants to merge 10 commits into base: main

Conversation

querielo (Contributor) commented Jan 3, 2025

I experimented with placing multiple animated skinned meshes in PlayCanvas and in Unity 6.1 (which is going to support WebGPU) to investigate the significant performance issues with skinned meshes.

I understand that it's best to bake animations for skinned meshes instead of using the Anim component when dealing with a high number of animated skinned meshes. But still...

This PR proposes replacing the recursive implementation of core GraphNode methods with an iterative approach, and avoiding recalculating the AABB multiple times during rendering, to improve the performance of skinned meshes.

Here are my results (main branch vs. PR):

Main branch: Screenshot 2025-01-03 at 11 31 08
PR: Screenshot 2025-01-03 at 12 52 36

Maksims (Collaborator) commented Jan 3, 2025

Please test the memory footprint and allocation/garbage-collection difference in dynamic scenes with and without this PR. Glancing at the code, this PR will introduce a lot of allocations (arrays, pop, unshift, etc.) into hot code (code that might run multiple times on every frame).

This might lead to minor or major GC stalls.

querielo (Contributor, Author) commented Jan 3, 2025

@Maksims That's a good point; that's why _tmpGraphNodeStack was added. It doesn't look like there is a big difference in terms of memory:

Main: Screenshot 2025-01-03 at 14 39 48
PR: Screenshot 2025-01-03 at 14 40 02
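The idea behind _tmpGraphNodeStack is to reuse one module-level array across traversals instead of allocating a new one per call. A minimal sketch of that pattern (countNodes is a hypothetical example; only the shared-array idea is from the PR):

```javascript
// Shared, module-level scratch array: reusing it keeps GC pressure low
// in per-frame hot paths. (_tmpStack is illustrative, not the PR's name.)
const _tmpStack = [];

function countNodes(root) {
    let count = 0;
    _tmpStack.length = 0;   // reset without reallocating
    _tmpStack.push(root);
    while (_tmpStack.length > 0) {
        const node = _tmpStack.pop();
        count++;
        const children = node.children;
        for (let i = 0; i < children.length; i++) {
            _tmpStack.push(children[i]);
        }
    }
    return count;
}
```

After the first few calls the array's backing storage stabilizes at the deepest traversal seen so far, so steady-state frames allocate nothing.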

@@ -98,6 +73,12 @@ function findNode(node, test) {
* a powerful set of features that are leveraged by the `Entity` class.
*/
class GraphNode extends EventHandler {
/**
* @type {GraphNode[]}
* @ignore
A Contributor commented on this diff:

Should this be @private instead?

willeastcott (Contributor) commented Jan 3, 2025

This is very cool. I was wondering: since this is such performance-critical code, would it be more efficient to store the stack pointer (index)? Then a pop is a decrement and a push is an array write at that index. And to start using the stack, just reset the pointer to 0. Obviously, Array#push and Array#pop are very optimized, so I don't know if this would really make a tangible difference.
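The pointer-based stack described above could look like this minimal sketch (the names _stack and visitDepthFirst are illustrative, not the PR's actual code):

```javascript
// Persistent array plus an explicit top index: push is a write-at-index,
// pop is a decrement, and clearing the stack is just resetting the pointer.
const _stack = [];
let _top = 0;

function visitDepthFirst(root, visit) {
    _top = 0;                 // "clear" the stack
    _stack[_top++] = root;    // push
    while (_top > 0) {
        const node = _stack[--_top];  // pop
        visit(node);
        const children = node.children;
        // Push children in reverse so they are visited left-to-right.
        for (let i = children.length - 1; i >= 0; i--) {
            _stack[_top++] = children[i];
        }
    }
}
```

One caveat: entries above the pointer keep references to nodes alive until they are overwritten, which can delay garbage collection of those nodes; writing `_stack[_top] = null` after each pop trades an extra write for releasing the reference.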

willeastcott added the performance (Relating to load times or frame rate) label on Jan 3, 2025
@@ -594,11 +593,22 @@ class GraphNode extends EventHandler {
const results = [];
const test = createTest(attr, value);

this.forEach((node) => {
const stack = GraphNode._tmpGraphNodeStack;
A Contributor commented on this diff:

Since forEach is now optimized to execute iteratively, why duplicate the iterative loop again here?
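The suggestion amounts to implementing the search on top of the iterative forEach instead of duplicating the loop, e.g. (a sketch; forEachNode is a hypothetical stand-in for GraphNode#forEach, assumed to visit the node and all its descendants):

```javascript
// Build find() on top of the already-iterative traversal, so the
// queue-based loop exists in exactly one place.
function find(node, test) {
    const results = [];
    forEachNode(node, (n) => {
        if (test(n)) results.push(n);
    });
    return results;
}

// Minimal stand-in for the iterative forEach, for illustration only.
function forEachNode(node, callback) {
    const queue = [node];
    while (queue.length > 0) {
        const current = queue.shift();   // breadth-first
        callback(current);
        for (const child of current.children) queue.push(child);
    }
}
```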

AlexAPPi (Contributor) commented Jan 3, 2025

This approach eats up a lot of memory; maybe we should move away from storing object references in arrays by index?

about: https://www.mattzeunert.com/2018/01/25/v8-javascript-memory-quiz.html

jsbench: https://jsbench.me/cvm5h498pk/1

export interface Node {
    get parent(): Node;
    get children(): Node[];
}

export type SKIP_DIVE_OPERATION = 1;
export type ABORT_OPERATION = 2;
export type TECWDC<T> = (child: T, index: number, depth: number) => void | SKIP_DIVE_OPERATION | ABORT_OPERATION;

export function everyChildWithDive<T extends Node>(node: T, callback: TECWDC<T>, offset?: number): void | ABORT_OPERATION;
export function everyChildWithDive(node: Node, callback: TECWDC<Node>, offset: number = 0): void | ABORT_OPERATION {

    let children = node.children;
    let length   = children.length;

    if (offset < 0) {
        offset += length;
    }

    if (length < 1 || offset < 0) {
        return;
    }

    let parent = node;
    let stackIndex = 0;

    // The stack holds, per ancestor level, the sibling index to resume at.
    const stack = [offset];

    while (offset < length) {

        const child  = children[offset];
        const result = callback(child, offset, stackIndex);

        if (result === 2) {
            return 2; // ABORT_OPERATION
        }

        offset++;

        if (result !== 1) { // not SKIP_DIVE_OPERATION

            const childChildren = child.children;
            const childChildrenLength = childChildren.length;

            if (childChildrenLength > 0) {

                // Save the resume index for this level, then descend.
                stack.push(0);
                stack[stackIndex++] = offset;

                offset   = 0;
                parent   = child;
                children = childChildren;
                length   = childChildrenLength;

                continue;
            }
        }

        // Ascend while the current level is exhausted.
        while (offset === length && stackIndex > 0) {
            stack.pop();
            offset   = stack[--stackIndex];
            parent   = parent.parent;
            children = parent.children;
            length   = children.length;
        }
    }
}

querielo (Contributor, Author) commented Jan 3, 2025

@AlexAPPi I'll look at your suggestion tomorrow, ok?

Regarding my solution, as noted in the comment, I'm using shift() in the forEach method to implement a breadth-first approach (it is a queue instead of a stack):

const queue = [node];
while (queue.length > 0) {
    // shift() is used for breadth-first approach; use pop() for depth-first
    const current = queue.shift();
    
    // . . .
    
    const children = current.children;
    for (let i = 0; i < children.length; i++) {
        queue.push(children[i]);
    }
}

While breadth-first traversal tends to be slower and consume more memory than depth-first, it's necessary here to pass unit tests that expect a specific order. I'd prefer to use pop() for a depth-first approach, but this would require modifying some unit tests. I'm hesitant to make such changes, as they might disrupt components that depend on the current order.
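The depth-first variant discussed above could look like this (a sketch; pushing children in reverse keeps the left-to-right order within each subtree, but the overall visit order is still depth-first, which is why the existing unit tests would need updating):

```javascript
// Depth-first traversal: same shape as the breadth-first loop, but with
// pop() on a stack instead of shift() on a queue.
function forEachDepthFirst(node, callback) {
    const stack = [node];
    while (stack.length > 0) {
        const current = stack.pop();
        callback(current);
        const children = current.children;
        // Reverse push so siblings are visited left-to-right.
        for (let i = children.length - 1; i >= 0; i--) {
            stack.push(children[i]);
        }
    }
}
```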

Note: it appears that the third solution from https://jsbench.me/cvm5h498pk/1 has a bug (I'm not sure about the purpose of the break statement).

Screenshot 2025-01-04 at 00 04 56

I cannot run the fourth solution:

Screenshot 2025-01-04 at 00 07 57

But as far as I can see, the depth-first solution is faster than the second solution:

Screenshot 2025-01-04 at 00 10 41

Thank you for this article!

@willeastcott can I change the breadth-first traversal to depth-first, and update the unit test that requires breadth-first order? I'll try to use an offset instead of shift(), but that probably consumes more memory; alternatively, I could implement a fast queue.

querielo (Contributor, Author) commented Jan 4, 2025

Screenshot 2025-01-04 at 13 04 25

I'll try to implement a solution with a queue pool (breadth-first). It is about 20% slower than using a pure offset, but requires less memory. For example, on my scene with 21,943 nodes, the capacity of the auto-extending circular queue is 4096.

Screenshot 2025-01-04 at 13 18 58
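A minimal sketch of the auto-extending circular queue idea (the PR's actual Queue class may differ in API and details):

```javascript
// Ring-buffer queue: dequeue is O(1) without shifting elements, and the
// backing array doubles when full, so capacity settles at the widest
// traversal frontier seen so far.
class RingQueue {
    constructor(capacity = 16) {
        this._items = new Array(capacity);
        this._head = 0;   // index of the next item to dequeue
        this._size = 0;
    }

    get length() { return this._size; }

    enqueue(item) {
        if (this._size === this._items.length) this._grow();
        const tail = (this._head + this._size) % this._items.length;
        this._items[tail] = item;
        this._size++;
    }

    dequeue() {
        if (this._size === 0) return undefined;
        const item = this._items[this._head];
        this._items[this._head] = undefined;   // drop the reference for GC
        this._head = (this._head + 1) % this._items.length;
        this._size--;
        return item;
    }

    _grow() {
        // Double capacity and unwrap the contents into a fresh array.
        const old = this._items;
        const next = new Array(old.length * 2);
        for (let i = 0; i < this._size; i++) {
            next[i] = old[(this._head + i) % old.length];
        }
        this._items = next;
        this._head = 0;
    }
}
```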

… tests for them, use the classes in the forEach method of GraphNode
querielo (Contributor, Author) commented Jan 5, 2025

Update:

  • Introduced a new Queue class that implements a circular buffer, automatically resizing its capacity as needed.
    • Added tests for Queue.
  • Added a free method to ObjectPool to allow deallocation of objects back into the pool.
    • Refactored the internal _resize logic in ObjectPool to track object indices.
    • Added tests for ObjectPool.
  • Updated GraphNode.forEach to use a pooled Queue for iteration, reducing memory allocations in breadth-first traversals.
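The pool-plus-queue combination described above could be wired up roughly like this (a hypothetical sketch: plain arrays stand in for the PR's Queue class, and this ObjectPool is illustrative, not the engine's actual class):

```javascript
// Minimal pool with a free() method: allocate() reuses released objects
// instead of constructing new ones.
class ObjectPool {
    constructor(factory) {
        this._factory = factory;
        this._free = [];
    }

    allocate() {
        return this._free.length > 0 ? this._free.pop() : this._factory();
    }

    free(obj) {
        this._free.push(obj);   // return the object for later reuse
    }
}

const queuePool = new ObjectPool(() => []);

// Breadth-first traversal that borrows a queue from the pool and releases
// it afterwards, so repeated or nested calls reuse the same buffers.
function forEachBreadthFirst(node, callback) {
    const queue = queuePool.allocate();
    queue.length = 0;
    queue.push(node);
    let head = 0;                       // offset-based dequeue, no shift()
    while (head < queue.length) {
        const current = queue[head++];
        callback(current);
        for (const child of current.children) queue.push(child);
    }
    queuePool.free(queue);
}
```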

I checked the method on a large-scale scene to confirm its performance and accuracy. The first run is slow (about 8 ms), but subsequent runs are fast.

Main: Screenshot 2025-01-05 at 14 43 31
PR: Screenshot 2025-01-05 at 14 43 15

Labels: performance (Relating to load times or frame rate)

4 participants