Skip to content

Bug: Last elements in basic_charset initialization are discarded #200

@alexbarev

Description

@alexbarev

Describe the bug

When initializing a basic_charset with character arrays such as those returned by digits() and others, the last element of the array is incorrectly discarded.

inline carray<10> const &digits() noexcept {
static carray<10> const all = {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'};
return all;
}

This is due to the use of count_characters - 1 in the loop that populates the basic_charset bitset.

template <std::size_t count_characters>
explicit basic_charset(char_type const (&chars)[count_characters]) noexcept : basic_charset() {
static_assert(count_characters > 0, "Character array cannot be empty");
for (std::size_t i = 0; i < count_characters - 1; ++i) { // count_characters - 1 to exclude the null terminator
char_type c = chars[i];
bitset_._u64s[sz_bitcast(sz_u8_t, c) >> 6] |= (1ull << (sz_bitcast(sz_u8_t, c) & 63u));
}

While this prevents including a null terminator for string literals like sz::char_set("x"), it causes incorrect behavior when handling character arrays that do not have a null terminator, resulting in the exclusion of the final character.

I believe this local block is the full extent of the affected code.

inline char_set ascii_letters_set() { return char_set {ascii_letters()}; }
inline char_set ascii_lowercase_set() { return char_set {ascii_lowercase()}; }
inline char_set ascii_uppercase_set() { return char_set {ascii_uppercase()}; }
inline char_set ascii_printables_set() { return char_set {ascii_printables()}; }
inline char_set ascii_controls_set() { return char_set {ascii_controls()}; }
inline char_set digits_set() { return char_set {digits()}; }
inline char_set hexdigits_set() { return char_set {hexdigits()}; }
inline char_set octdigits_set() { return char_set {octdigits()}; }
inline char_set punctuation_set() { return char_set {punctuation()}; }
inline char_set whitespaces_set() { return char_set {whitespaces()}; }
inline char_set newlines_set() { return char_set {newlines()}; }
inline char_set base64_set() { return char_set {base64()}; }

Steps to reproduce

#include "stringzilla/stringzilla.hpp"

namespace sz = ashvardanian::stringzilla;

int main() {
        sz::string haystack = "239";

        // Test with null-terminated string
        assert(haystack.contains_only(sz::char_set("0123456789")));
        // Passes: null terminator is correctly discarded

        // Test with initializer list
        static std::initializer_list all = {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'};
        assert (haystack.contains_only(sz::char_set {all}));
        // Passes: constructor for initializer list is called

        // Test with carray
        sz::carray<10> all_digits = {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'};
        assert(haystack.contains_only(sz::char_set {all_digits}));
        assert(haystack.is_digit());
        // Fails: '9' is incorrectly discarded
}

Expected behavior

No asserts

StringZilla version

v3.11.0

Operating System

Ubuntu 22.04.5

Hardware architecture

x86

Which interface are you using?

C++ bindings

Contact Details

No response

Are you open to being tagged as a contributor?

  • I am open to being mentioned in the project .git history as a contributor

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions