Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: Hybrid sorting algorithm is incorrect #208

Open
3 tasks done
ashbob999 opened this issue Jan 3, 2025 · 2 comments
Open
3 tasks done

Bug: Hybrid sorting algorithm is incorrect #208

ashbob999 opened this issue Jan 3, 2025 · 2 comments
Labels
bug Something isn't working

Comments

@ashbob999
Copy link
Contributor

Describe the bug

The hybrid sorting algorithm in bench_sort.cpp has some issues leading to an incorrect performance improvement of 3x (#173).

Issue 1

The 4 characters are inserted into the upper 32-bits of the order variable in the wrong order. So str[3] is the most significant byte of the order, leading to incorrect initial sorting order.

// What if we take up-to 4 first characters and the index
    for (size_t i = 0; i != strings.size(); ++i)
        std::memcpy((char *)&order[i] + offset_in_word, strings[order[i]].c_str(),
                    std::min<std::size_t>(strings[order[i]].size(), 4ul));

Issue 2

Currently offset_in_word is set to 0, which means that the order array which holds the 32-bit string order indexes, gets overridden by the first 4 chars of the string, then subsequently cleared. So the final sort does nothing as all values in order are zero.

Although setting offset_in_word to 4 works for hybrid_sort_cpp it breaks hybrid_stable_sort_cpp.

Fixes

After temporarily fixing the above issues it becomes about 1.1x slower than sz_sort. In which the bottle neck seems to be the final sort.

I managed to improve this by calling std::sort on each of the partially sorted chunks instead of the all strings, which makes it ~2.25x faster than sz_sort.

Fixed version (with `offset_in_word=4`)
static idx_t hybrid_sort_cpp_v2(strings_t const &strings, sz_u64_t *order) {

    // What if we take up-to 4 first characters and the index
    for (size_t i = 0; i != strings.size(); ++i) {
        size_t index = order[i];

        for (size_t j = 0; j < std::min<std::size_t>(strings[(sz_size_t)index].size(), 4ul); ++j) {
            std::memcpy((char *)&order[i] + offset_in_word + 3 - j, strings[(sz_size_t)index].c_str() + j, 1ul);
        }
    }

    std::sort(order, order + strings.size(), [&](sz_u64_t i, sz_u64_t j) {
        char *i_bytes = (char *)&i;
        char *j_bytes = (char *)&j;
        return *(uint32_t *)(i_bytes + offset_in_word) < *(uint32_t *)(j_bytes + offset_in_word);
    });

    const auto extractBytes = [](sz_u64_t v) -> uint32_t {
        char *bytes = (char *)&v;
        return *(uint32_t *)(bytes + offset_in_word);
    };

    if (strings.size() >= 2) {
        size_t prevIndex = 0;
        uint64_t prevBytes = extractBytes(order[0]);

        for (size_t i = 1; i < strings.size(); ++i) {
            //
            uint32_t bytes = extractBytes(order[i]);
            if (bytes != prevBytes) {
                std::sort(order + prevIndex, order + i, [&](sz_u64_t i, sz_u64_t j) {
                   // Assumes: offset_in_word==4
                    sz_size_t i_index = i & 0xFFFF'FFFF;
                    sz_size_t j_index = j & 0xFFFF'FFFF;
                    return strings[i_index] < strings[j_index];
                });
                prevIndex = i;
                prevBytes = bytes;
            }
        }

        std::sort(order + prevIndex, order + strings.size(), [&](sz_u64_t i, sz_u64_t j) {
            sz_size_t i_index = i & 0xFFFF'FFFF;
            sz_size_t j_index = j & 0xFFFF'FFFF;
            return strings[i_index] < strings[j_index];
        });
    }

    for (size_t i = 0; i != strings.size(); ++i) std::memset((char *)&order[i] + offset_in_word, 0, 4ul);

    return strings.size();
}

I did not create a PR, because these fixes break the stable sort.

Steps to reproduce

See Above.

Expected behavior

See Above.

StringZilla version

3.11.3

Operating System

Windows 10

Hardware architecture

x86

Which interface are you using?

Other bindings

Contact Details

No response

Are you open to being tagged as a contributor?

  • I am open to being mentioned in the project .git history as a contributor

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct
@ashbob999 ashbob999 added the bug Something isn't working label Jan 3, 2025
@ashvardanian
Copy link
Owner

ashvardanian commented Jan 3, 2025

Amazing! I will look into ways to fix this.

@ashbob999
Copy link
Contributor Author

@ashvardanian I have found the issue, permute_base was being passed into the hybrid stable sort bench_permute call instead of permute_new so the expect_same call would most likely always fail if the input had duplicate strings.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants