Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade to Prism 1.2.0 #2266

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from
Draft

Conversation

herwinw
Copy link
Member

@herwinw herwinw commented Oct 14, 2024

Still a few encoding issues that need to be fixed before we can release this.

@herwinw herwinw self-assigned this Oct 14, 2024
@herwinw
Copy link
Member Author

herwinw commented Oct 15, 2024

The failure in the bytecode test can be reproduced without the bytecode too. I've extracted the following script:

# -*- encoding: binary -*-

puts "😊".encoding

This outputs ASCII-8BIT with Prism 1.1, and UTF-8 with Prism 1.2.

If we run this script with bin/parse of Prism (or bin/natalie --ast,), it shows the difference in the parsed nodes:

# Prism 1.1
@ StringNode (location: (3,5)-(3,11))
├── flags: ∅
├── opening_loc: (3,5)-(3,6) = "\""
├── content_loc: (3,6)-(3,10) = "\xF0\x9F\x98\x8A"
├── closing_loc: (3,10)-(3,11) = "\""
└── unescaped: "\xF0\x9F\x98\x8A"

# Prism 1.2
@ StringNode (location: (3,5)-(3,11))
├── flags: ∅
├── opening_loc: (3,5)-(3,6) = "\""
├── content_loc: (3,6)-(3,10) = "😊"
├── closing_loc: (3,10)-(3,11) = "\""
└── unescaped: "\xF0\x9F\x98\x8A"

The content is now encoded as UTF8, even though we marked the file as binary. For comparison, both Prism 1.1 and Prism 1.2 show everything in UTF8 if we update the encoding line

@ StringNode (location: (3,5)-(3,11))
├── flags: ∅
├── opening_loc: (3,5)-(3,6) = "\""
├── content_loc: (3,6)-(3,10) = "😊"
├── closing_loc: (3,10)-(3,11) = "\""
└── unescaped: "😊"

We determine the encoding of the String object in pass1.rb:

def encoding_for_string_node(node)
  if node.forced_utf8_encoding?
    Encoding::UTF_8
  elsif node.forced_binary_encoding?
    Encoding::ASCII_8BIT
  else
    @file.encoding
  end
end

Prism 1.2 now return true for the statement node.forced_utf8_encoding?. This result cascades into the VM instructions.

# Prism 1.1
push_string "\xF0\x9F\x98\x8A", 4, ASCII-8BIT

# Prism 1.2
push_string "\xF0\x9F\x98\x8A", 4, UTF-8

We could probably fix this by returning the file encoding earlier, and fall back to the forced_*_encoding? methods if there is no file encoding, but this might be a Prism bug.
The release notes don't indicate any changes related to encodings.

@herwinw
Copy link
Member Author

herwinw commented Oct 15, 2024

Related upstream issues:

So I guess it's intended behaviour, but I will have to read these PRs again because I don't yet understand why.

@herwinw
Copy link
Member Author

herwinw commented Nov 4, 2024

I'm going to put this one on hold for now, I'm just not sure about these character encoding changes and I kind of want to see how MRI 3.4 will handle this. Just two more months.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant