Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: Add null terminators to character arrays to include final characters #202

Merged

Conversation

alexbarev
Copy link
Collaborator

This PR fixes the issue where the last character of certain character arrays was not included in basic_charset. By explicitly adding a '\0' terminator to these arrays (e.g., ascii_letters()), all intended characters are now properly included.

In addition, new tests have been added for the ASCII utility functions provided by sz::string and sz::string_view. Previously, these tests would fail due to the missing final character. With this fix, they pass successfully, confirming that the issue is resolved.

#200

@alexbarev
Copy link
Collaborator Author

alexbarev commented Dec 7, 2024

@ashvardanian On my local computer (WSL Ubuntu 22.04, x86_64), I only encounter test issues with C++20. It seems unrelated to my changes.

/home/kent/code/forks/StringZilla/scripts/test.cpp: In function ‘int main(int, const char**)’:
/home/kent/code/forks/StringZilla/scripts/test.cpp:1605:17: error: ‘_sz_edit_distance_skewed_diagonals_upto63_avx512’ was not declared in this scope; did you mean ‘_sz_edit_distance_skewed_diagonals_serial’?
 1605 |     auto dist = _sz_edit_distance_skewed_diagonals_upto63_avx512("kiten", 5, "katerinas", 9, SZ_SIZE_MAX);
      |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                 _sz_edit_distance_skewed_diagonals_serial
compilation terminated due to -Wfatal-errors.
gmake[2]: *** [CMakeFiles/stringzilla_test_cpp20_x86_serial.dir/build.make:76: CMakeFiles/stringzilla_test_cpp20_x86_serial.dir/scripts/test.cpp.o] Error 1
gmake[1]: *** [CMakeFiles/Makefile2:1127: CMakeFiles/stringzilla_test_cpp20_x86_serial.dir/all] Error 2
gmake: *** [Makefile:101: all] Error 2

@ashvardanian
Copy link
Owner

@alexbarev, is there a different way to solve this by constraining the type-conversion rules? Extending every carray with a zero element at the end isn't a great solution in my opinion, as it signals the wrong size of the character set.

@alexbarev
Copy link
Collaborator Author

@alexbarev, is there a different way to solve this by constraining the type-conversion rules? Extending every carray with a zero element at the end isn't a great solution in my opinion, as it signals the wrong size of the character set.

I agree it looks unappealing.

Apparently, the issue is that carray must be converted to std::initializer_list<int> because that is the only constructor in char_set that does not discard the last character. Some compilers support std::initializer_list<int> i(v.data(), v.data() + v.size()), but this goes against the standard, which states that initializer_list only has a default constructor.

I see nothing that can be done with the current constructor from c array that is being called now, because strings literals are also cast to a C-style string where a null terminator \0 is placed at the end. I will search if some template hack can be done.

template <std::size_t count_characters>
explicit basic_charset(char_type const (&chars)[count_characters]) noexcept : basic_charset() {
static_assert(count_characters > 0, "Character array cannot be empty");
for (std::size_t i = 0; i < count_characters - 1; ++i) { // count_characters - 1 to exclude the null terminator
char_type c = chars[i];
bitset_._u64s[sz_bitcast(sz_u8_t, c) >> 6] |= (1ull << (sz_bitcast(sz_u8_t, c) & 63u));
}
}
template <std::size_t count_characters>
explicit basic_charset(std::array<char_type, count_characters> const &chars) noexcept : basic_charset() {
static_assert(count_characters > 0, "Character array cannot be empty");
for (std::size_t i = 0; i < count_characters - 1; ++i) { // count_characters - 1 to exclude the null terminator
char_type c = chars[i];
bitset_._u64s[sz_bitcast(sz_u8_t, c) >> 6] |= (1ull << (sz_bitcast(sz_u8_t, c) & 63u));
}
}

@alexbarev
Copy link
Collaborator Author

alexbarev commented Dec 8, 2024

Hi there,

Without modifying basic_charset or appending '\0' to each utility array of ASCII characters, I propose the following two solutions:

Option 1: Converter Function

Introduce a helper function to handle correct convertation from array to char_set:

template <std::size_t count_characters>
char_set char_set_converter(char const (&chars)[count_characters]) {
    return char_set(chars).add(chars[count_characters - 1]);
}

This function must be applied in every ..._set() function that initializes a char_set:

inline char_set digits_set() { 
    static char_set charset = char_set_converter(digits()); 
    return charset; 
}

There was no static before, but was there any specific reason not to include it?

Option 2: Wrapper Class

With this approach, I aim to provide a struct that enhances safety against incorrect usage. However, it introduces significantly more code and stores a reference as a class data member. I'm unsure if this is entirely appropriate here — it's just an idea.

template <std::size_t count_characters>
class carray_wrapper {
    using carray = char[count_characters];
    carray const & all;
    char_set const charset;
  public:
    carray_wrapper (char const (&chars)[count_characters]) noexcept
        : all(chars), charset{char_set{all}.add(all[count_characters - 1])} {}

    operator char_set() const noexcept {  
        return charset; 
    }

    carray const & operator ()() const noexcept {
        return all;
    }
};

namespace detail {
    char const digits[10] = {'0', '1', '2', '3', '4', '5',  '6', '7', '8',  '9'};
    //.....
}

carray_wrapper<sizeof(detail::digits)> digits{detail::digits};

inline char_set digits_set() { 
    // User-defined conversion ensures `char_set` includes the ad-hoc last character:
    static char_set charset {digits}; 
    return charset; 
}

This preserves the previous behavior of returning raw arrays for string operations through the digits() syntax.

@alexbarev alexbarev force-pushed the 200-fix-ascii-utilities-bug branch from 245c85a to 86f53d9 Compare December 8, 2024 23:20
@ashvardanian ashvardanian merged commit 002e433 into ashvardanian:main-dev Dec 9, 2024
2 of 10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants