Use string scanner with baseparser #105
09b7fb9 to 4c56eb8
4c56eb8 to 8edd4ce
 def read
   begin
-    @buffer << readline
+    @scanner.string = @scanner.rest + readline
Can we use @scanner << readline here?
Changing to @scanner << readline causes the following error in JRuby.
https://github.com/ruby/rexml/actions/runs/7434514894/job/20228750659#step:4:43
Error: test_rexml(REXMLTests::TestIssuezillaParsing):
REXML::ParseException: No close tag for /issuezilla/issue[118]/activity[2]
Line: -1
Position: -1
Last 80 unconsumed characters:
/Users/naitoh/ghq/github.com/naitoh/rexml/lib/rexml/parsers/treeparser.rb:28:in `parse'
/Users/naitoh/ghq/github.com/naitoh/rexml/lib/rexml/document.rb:448:in `build'
/Users/naitoh/ghq/github.com/naitoh/rexml/lib/rexml/document.rb:101:in `initialize'
org/jruby/RubyClass.java:904:in `new'
/Users/naitoh/ghq/github.com/naitoh/rexml/test/test_rexml_issuezilla.rb:8:in `block in test_rexml'
5: include Helper::Fixture
6: def test_rexml
7: doc = File.open(fixture_path("ofbiz-issues-full-177.xml")) do |f|
=> 8: REXML::Document.new(f)
9: end
10: ctr = 1
11: doc.root.each_element('//issue') do |issue|
org/jruby/RubyIO.java:1179:in `open'
/Users/naitoh/ghq/github.com/naitoh/rexml/test/test_rexml_issuezilla.rb:7:in `test_rexml'
org/jruby/RubyKernel.java:1310:in `catch'
org/jruby/RubyKernel.java:1305:in `catch'
org/jruby/RubyKernel.java:1310:in `catch'
org/jruby/RubyKernel.java:1305:in `catch'
I am not sure why the error occurs, but I suspect that ruby/strscan#78 may be related.
It seems that the issue shows that scanner.string = scanner.rest + XXX has a problem but scanner << XXX doesn't have a problem. So it may not be related...
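For readers comparing the two refill styles discussed in this thread, here is a minimal sketch using plain StringScanner. The helper names refill_by_rebuilding and refill_by_appending are hypothetical and are not part of REXML; the sketch only illustrates the difference in behavior and does not reproduce the JRuby failure.

```ruby
require "strscan"

# Hypothetical helper: rebuild the buffer from the unconsumed tail plus new data.
# Assigning to StringScanner#string resets the scan position to 0 and
# allocates a new string for the concatenation.
def refill_by_rebuilding(scanner, chunk)
  scanner.string = scanner.rest + chunk
  scanner
end

# Hypothetical helper: append new data to the scanner's string in place.
# The scan position is left unchanged and already-consumed text stays in the buffer.
def refill_by_appending(scanner, chunk)
  scanner << chunk
  scanner
end

rebuilt = StringScanner.new("<a>".dup)
rebuilt.scan(/<a>/)
refill_by_rebuilding(rebuilt, "<b/>")

appended = StringScanner.new("<a>".dup)
appended.scan(/<a>/)
refill_by_appending(appended, "<b/>")

# Both scanners now see the same unconsumed text, but at different positions.
p [rebuilt.pos, rebuilt.rest]
p [appended.pos, appended.rest]
```

Appending avoids copying the unconsumed tail on every read, at the cost of keeping consumed text in the buffer.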
ruby/strscan#78 has been fixed. Could you try with the latest strscan?
I tried, but it did not fix it...
@naitoh Thank you! I will try to fix it today.
ruby/strscan#83 is fixed and will be released soon!
@kou ruby/strscan#84 has been merged into master and I confirmed that JRuby's @scanner << readline works.
Thanks!
lib/rexml/parsers/baseparser.rb
Outdated
- match = @source.match( ENTITYDECL, true ).to_a.compact
- match[0] = :entitydecl
+ match = @source.match( ENTITYDECL, true )
+ match = match.nil? ? [:entitydecl] : [:entitydecl, *match.captures.compact.reject(&:empty?)]
Hmm. This assumes that match is a StringScanner.
How about returning @scanner.captures instead of @scanner from @source.match?
I believe that ruby/strscan#72 needs to be merged in order to use @scanner.captures.
Merged and released.
Changed source#match? to return scanner.captures instead of @scanner.
Added a compact option to @source.match in 8bc8955.
This improves processing speed by returning @scanner.captures.compact if compact=true and @scanner if compact=false.
Added match? and removed the compact option in 50b3057.
Removed Source#match?; Source#match now returns @scanner.
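A minimal sketch of the design this thread settles on, where match hands back the scanner itself on success and nil on failure, so callers can read captures from it. TinySource and TINY_ENTITY_PATTERN are hypothetical stand-ins, not REXML::Source or its actual regular expressions, and treating the second argument as a "consume" flag is an assumption based on the calls shown in the diffs.

```ruby
require "strscan"

# Hypothetical, simplified pattern; not REXML's ENTITYDECL.
TINY_ENTITY_PATTERN = /<!ENTITY\s+(\w+)\s+"([^"]*)"\s*>/

# Hypothetical stand-in for a source object that wraps a StringScanner.
class TinySource
  def initialize(text)
    @scanner = StringScanner.new(text)
  end

  # Returns the scanner when the pattern matches at the current position,
  # otherwise nil. When cons is true the matched text is consumed (scan);
  # when false the position is left untouched (check).
  def match(pattern, cons = false)
    matched = cons ? @scanner.scan(pattern) : @scanner.check(pattern)
    matched.nil? ? nil : @scanner
  end
end

source = TinySource.new('<!ENTITY name "value">')
result = source.match(TINY_ENTITY_PATTERN, true)
p result.captures if result
```

Returning the scanner avoids building an array for callers that only need to know whether the pattern matched, while callers that need groups can still use result[1] or result.captures.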
8edd4ce to fcc4db8
@kou I used I don't think this is a good idea... https://github.com/ruby/rexml/actions/runs/7510468370/job/20448939679?pr=105
fcc4db8 to 8227cc2
Add compact option to Improve processing speed by returning https://github.com/ruby/rexml/actions/runs/7512802060/job/20453872698?pr=105
It seems that calling How about providing We'll use
diff --git a/lib/rexml/parsers/baseparser.rb b/lib/rexml/parsers/baseparser.rb
index 305b120..610209e 100644
--- a/lib/rexml/parsers/baseparser.rb
+++ b/lib/rexml/parsers/baseparser.rb
@@ -223,13 +223,13 @@ module REXML
             return process_instruction
           when DOCTYPE_START
             base_error_message = "Malformed DOCTYPE"
-            @source.match(DOCTYPE_START, true)
+            @source.match?(DOCTYPE_START, true)
             @nsstack.unshift(curr_ns=Set.new)
             name = parse_name(base_error_message)
-            if @source.match(/\A\s*\[/um, true)
+            if @source.match?(/\A\s*\[/um, true)
               id = [nil, nil, nil]
               @document_status = :in_doctype
-            elsif @source.match(/\A\s*>/um, true)
+            elsif @source.match?(/\A\s*>/um, true)
               id = [nil, nil, nil]
               @document_status = :after_doctype
             else
Hmm. https://github.com/ruby/rexml/pull/105/files#r1451610001 may fix this.
8bc8955 to 50b3057
It was not fixed...
Added https://github.com/ruby/rexml/actions/runs/7516658356/job/20461969215?pr=105
OK. It seems that we don't access all captured results in our use case.
ruby/ruby#9536 will fix the CI failure.
50b3057 to ec62e37
I removed https://github.com/ruby/rexml/actions/runs/7519306454/job/20467689597?pr=105
head CI jobs are fixed.
lib/rexml/parsers/baseparser.rb
Outdated
- match = @source.match( ENTITYDECL, true ).to_a.compact
- match[0] = :entitydecl
+ match = @source.match( ENTITYDECL, true )
+ match = match.nil? ? [:entitydecl] : [:entitydecl, *match.captures.compact]
Do we need the match.nil? check here?
(Is there any case where the above @source.match() fails?)
If the match.nil? check is removed, all tests succeed.
But if the string <!ENTITY> comes in, @source.match() returns nil and undefined method `captures' for nil is raised.
However, <!ENTITY> violates the XML specification and should be treated as an error, so I removed the match.nil? check.
https://xml.coverpages.org/xmlBNF.html
EntityDecl ::= '<!ENTITY' S Name S EntityDef S? '>' /* General entities */
| '<!ENTITY' S '%' S Name S EntityDef S? '>' /* Parameter entities */
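A rough sketch of the point made above: when the declaration does not match the grammar, the match result is nil, so a parser can raise a parse error explicitly instead of failing on nil. SIMPLE_ENTITYDECL, SketchParseError, and parse_entitydecl_sketch are hypothetical names for illustration; the pattern covers only a small subset of the EntityDecl production and is not REXML's ENTITYDECL.

```ruby
require "strscan"

# Hypothetical, simplified pattern: general entities with a double-quoted value only.
SIMPLE_ENTITYDECL = /<!ENTITY\s+(\w+)\s+"([^"]*)"\s*>/

# Hypothetical error class standing in for a real parse exception.
class SketchParseError < StandardError; end

# Returns [:entitydecl, name, value] for a well-formed declaration and
# raises SketchParseError for input such as "<!ENTITY>".
def parse_entitydecl_sketch(text)
  scanner = StringScanner.new(text)
  unless scanner.scan(SIMPLE_ENTITYDECL)
    raise SketchParseError, "Malformed entity declaration"
  end
  [:entitydecl, *scanner.captures]
end

p parse_entitydecl_sketch('<!ENTITY name "value">')

begin
  parse_entitydecl_sketch("<!ENTITY>")
rescue SketchParseError => e
  puts "rejected: #{e.message}"
end
```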
OK. Could you add a test for the case?
Or we can do it as a separate PR.
OK. I added a test for this case.
[Why] Using StringScanner reduces string copying and speeds up parsing.
ec62e37 to 995d3e2
995d3e2 to ba9f7fc
…ve processing speed.
ba9f7fc to eeb45e1
+1
Could you update the PR description before we merge this?
@kou
Thanks!
Thanks for your review!!!
Using StringScanner reduces string copying and speeds up parsing.
I also removed unnecessary methods.
https://github.com/ruby/rexml/actions/runs/7549990000/job/20554906140?pr=105