[BUG] `String.endswith` breaks when used on a string containing UTF-8 characters. #3903

thatstoasty · 2024-12-21T01:36:06Z

Bug description

endswith returns the wrong result when the string contains Unicode (multi-byte UTF-8) characters. The example below should paint a clear picture. This may already be fixed in one of the open UTF-8 PRs, but I wanted to raise this issue just in case.

Steps to reproduce

Repro example:

def main():
    var test: String = "│\n"
    print(test.removesuffix("\n")) # Does not strip the newline.
    print(test.endswith("\n")) # Returns False

System information

- What OS did you install Mojo on? macOS 14.6.1
- Provide version information for Mojo by pasting the output of `mojo -v`: 24.6
- Provide Magic CLI version by pasting the output of `magic -V` or `magic --version`: magic 0.5.1 (based on pixi 0.37.0)
- Optionally, provide more information with `magic info`.


thatstoasty · 2024-12-21T01:37:38Z

@martinvuyk You've done a lot of work already on the utf8 side of String, so you may have already fixed this in one of your PRs🙂

martinvuyk · 2024-12-21T02:43:21Z

@thatstoasty I haven't gotten around to those functions yet, but I think I found the problem. StringSlice.__len__() counts Unicode codepoints, while String's does not (it should in the future). StringSlice.find() works by byte offset, when it should work by Unicode codepoints.

All of that context to explain:

    fn endswith(
        self, suffix: StringSlice, start: Int = 0, end: Int = -1
    ) -> Bool:
        """Verify if the `StringSlice` end with the specified suffix between
        start and end positions.

        Args:
            suffix: The suffix to check.
            start: The start offset from which to check.
            end: The end offset from which to check.

        Returns:
            True if the `self[start:end]` is suffixed by the input suffix.
        """
        if len(suffix) > len(self):
            return False
        if end == -1:
            return self.rfind(suffix, start) + len(suffix) == len(self)
        return StringSlice[origin](
            ptr=self.unsafe_ptr() + start, length=end - start
        ).endswith(suffix)

The line self.rfind(suffix, start) + len(suffix) == len(self) is to blame, since rfind returns a byte offset while len() counts Unicode codepoints. Or at least I think that line is the problem. This might get fixed when we switch to full Unicode support (find(), __getitem__ and len() for strings should all work by Unicode codepoints).
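To make the suspected mismatch concrete, here is a small model of the arithmetic using the string from the repro. It is written in Python rather than Mojo and only mimics the units involved (byte offsets versus codepoint counts); the variable names are illustrative and none of this is the stdlib code.

```python
# Python model of the arithmetic only; this is not Mojo stdlib code.
text = "│\n"    # "│" is U+2502, which takes 3 bytes in UTF-8
suffix = "\n"

text_bytes = text.encode("utf-8")
suffix_bytes = suffix.encode("utf-8")

byte_offset = text_bytes.rfind(suffix_bytes)   # position counted in bytes
codepoint_len = len(text)                      # Python's len() counts codepoints

# Mixed units, as in the suspected line: byte offset + codepoint lengths.
mixed_units = byte_offset + len(suffix) == codepoint_len

# Same check with every quantity measured in bytes.
bytes_only = byte_offset + len(suffix_bytes) == len(text_bytes)

print(byte_offset, codepoint_len, len(text_bytes))
print(mixed_units, bytes_only)
```

This prints `3 2 4` and then `False True`: with mixed units the comparison is 3 + 1 == 2, which matches the False reported in the issue, while the all-bytes comparison is 3 + 1 == 4.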

If that's not the case then I'm as lost as you are 😅
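One way to sidestep the unit mismatch, sketched in Python under the assumption that both strings are valid UTF-8, is to compare the trailing bytes directly so that every length is measured in bytes. The helper name `endswith_bytes` is hypothetical, and this is not necessarily how the stdlib fix is written.

```python
def endswith_bytes(text: str, suffix: str) -> bool:
    """Hypothetical helper: suffix check done entirely in UTF-8 bytes."""
    data = text.encode("utf-8")
    tail = suffix.encode("utf-8")
    if len(tail) > len(data):
        return False
    # Compare the last len(tail) bytes; no codepoint counts are involved.
    return data[len(data) - len(tail):] == tail


print(endswith_bytes("│\n", "\n"))
print(endswith_bytes("│\n", "│"))
```

For valid UTF-8 a byte-level match at the end also falls on a codepoint boundary, because the first byte of the suffix is never a continuation byte. The two calls print `True` and `False`.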

martinvuyk · 2024-12-21T02:48:45Z

@JoeLoser FYI this kind of problem is why I'm insisting that we need #3548 and should switch to full Unicode ASAP

thatstoasty added the bug and mojo-repo labels on Dec 21, 2024

martinvuyk mentioned this issue Jan 8, 2025

[stdlib] Fix startswith() and endswith() #3922

Open

ConnorGray closed this as completed Jan 8, 2025
