Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] String.endswith breaks when used on a string containing UTF-8 characters. #3903

Closed
thatstoasty opened this issue Dec 21, 2024 · 3 comments
Labels
bug Something isn't working mojo-repo Tag all issues with this label

Comments

@thatstoasty
Copy link
Contributor

thatstoasty commented Dec 21, 2024

Bug description

endswith returns the wrong result when unicode characters are part of the string. The example below should paint a clear picture. This may be fixed in one of the open utf8 PRs already, but I wanted to raise this issue just in case.

Steps to reproduce

Repro example:

def main():
    var test: String = "\n"
    print(test.removesuffix("\n")) # Does not strip the newline.
    print(test.endswith("\n")) # Returns False

System information

- What OS did you do install Mojo on ? MacOS 14.6.1
- Provide version information for Mojo by pasting the output of `mojo -v` 24.6
- Provide Magic CLI version by pasting the output of `magic -V` or `magic --version` magic 0.5.1 (based on pixi 0.37.0)
- Optionally, provide more information with `magic info`.
@thatstoasty thatstoasty added bug Something isn't working mojo-repo Tag all issues with this label labels Dec 21, 2024
@thatstoasty
Copy link
Contributor Author

@martinvuyk You've done a lot of work already on the utf8 side of String, so you may have already fixed this in one of your PRs🙂

@martinvuyk
Copy link
Contributor

@thatstoasty I haven't gotten around to those functions yet, but I think I found the problem. StringSlice.__len__() works by unicode codepoints and String doesn't (it should in the future). StringSlice.find() works by byte offset, and it should be by unicode codepoints.

All of that context to explain:

    fn endswith(
        self, suffix: StringSlice, start: Int = 0, end: Int = -1
    ) -> Bool:
        """Verify if the `StringSlice` end with the specified suffix between
        start and end positions.

        Args:
            suffix: The suffix to check.
            start: The start offset from which to check.
            end: The end offset from which to check.

        Returns:
            True if the `self[start:end]` is suffixed by the input suffix.
        """
        if len(suffix) > len(self):
            return False
        if end == -1:
            return self.rfind(suffix, start) + len(suffix) == len(self)
        return StringSlice[origin](
            ptr=self.unsafe_ptr() + start, length=end - start
        ).endswith(suffix)

The line self.rfind(suffix, start) + len(suffix) == len(self) is to blame. Since rfind returns a byte offset and len() unicode codepoints. Or at least I think that line is the problem. This might get fixed when we switch to full unicode support (find(), __getitem__ and len() for strings should all work by unicode codepoints).

If that's not the case then I'm as lost as you are 😅

@martinvuyk
Copy link
Contributor

@JoeLoser FYI this kind of problems is why I'm insisting we need #3548 and to do the switch to full unicode ASAP

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working mojo-repo Tag all issues with this label
Projects
None yet
Development

No branches or pull requests

2 participants