Use string scanner with baseparser #105
09b7fb9 to
4c56eb8
Compare
4c56eb8 to
8edd4ce
Compare
def read
begin
- @buffer << readline
+ @scanner.string = @scanner.rest + readline
Can we use @scanner << readline here?
Changing to @scanner << readline causes the following error in JRuby.
https://github.com/ruby/rexml/actions/runs/7434514894/job/20228750659#step:4:43
Error: test_rexml(REXMLTests::TestIssuezillaParsing):
REXML::ParseException: No close tag for /issuezilla/issue[118]/activity[2]
Line: -1
Position: -1
Last 80 unconsumed characters:
/Users/naitoh/ghq/github.com/naitoh/rexml/lib/rexml/parsers/treeparser.rb:28:in `parse'
/Users/naitoh/ghq/github.com/naitoh/rexml/lib/rexml/document.rb:448:in `build'
/Users/naitoh/ghq/github.com/naitoh/rexml/lib/rexml/document.rb:101:in `initialize'
org/jruby/RubyClass.java:904:in `new'
/Users/naitoh/ghq/github.com/naitoh/rexml/test/test_rexml_issuezilla.rb:8:in `block in test_rexml'
5: include Helper::Fixture
6: def test_rexml
7: doc = File.open(fixture_path("ofbiz-issues-full-177.xml")) do |f|
=> 8: REXML::Document.new(f)
9: end
10: ctr = 1
11: doc.root.each_element('//issue') do |issue|
org/jruby/RubyIO.java:1179:in `open'
/Users/naitoh/ghq/github.com/naitoh/rexml/test/test_rexml_issuezilla.rb:7:in `test_rexml'
org/jruby/RubyKernel.java:1310:in `catch'
org/jruby/RubyKernel.java:1305:in `catch'
org/jruby/RubyKernel.java:1310:in `catch'
org/jruby/RubyKernel.java:1305:in `catch'
I am not sure why the error occurs, but I suspect that ruby/strscan#78 may be related.
That issue reports a problem with scanner.string = scanner.rest + XXX, not with scanner << XXX. So it may not be related...
ruby/strscan#78 has been fixed. Could you try with the latest strscan?
I tried, but it did not fix it...
@naitoh Thank you! I will try to fix it today.
ruby/strscan#83 is fixed and will be released soon!
@kou
ruby/strscan#84 has been merged into master, and I confirmed that @scanner << readline now works on JRuby.
Thanks!
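The two ways of feeding more input to the scanner discussed in this thread can be compared with a minimal sketch. It uses only the standard strscan library; the variable names and the sample input are illustrative assumptions, not code from the PR.

```ruby
require "strscan"

# Approach 1: rebuild the string from the unconsumed rest plus the new chunk.
# Assigning to #string= resets the scan position to 0.
rebuilt = StringScanner.new(String.new("<a>"))
rebuilt.scan(/<a>/)                     # consume "<a>"
rebuilt.string = rebuilt.rest + "<b/>"  # allocates a new String

# Approach 2: append in place with #<<.
# The already consumed prefix stays in the buffer and the position is kept.
appended = StringScanner.new(String.new("<a>"))
appended.scan(/<a>/)                    # consume "<a>"
appended << "<b/>"                      # no intermediate String

puts rebuilt.pos      # position after the rebuild
puts appended.pos     # position after the append
puts appended.string  # whole buffer, including the consumed prefix
```

Both scanners end up with the same unconsumed text. The difference is that the first copies the remaining input on every read, while the second keeps growing a single buffer.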
lib/rexml/parsers/baseparser.rb
Outdated
- match = @source.match( ENTITYDECL, true ).to_a.compact
- match[0] = :entitydecl
+ match = @source.match( ENTITYDECL, true )
+ match = match.nil? ? [:entitydecl] : [:entitydecl, *match.captures.compact.reject(&:empty?)]
Hmm. This assumes that match is a StringScanner.
How about having @source.match return @scanner.captures instead of @scanner?
I believe that ruby/strscan#72 needs to be merged in order to use @scanner.captures.
Merged and released.
Changed source#match? to return scanner.captures instead of @scanner.
Added a compact option to @source.match in 8bc8955.
It improves processing speed by returning @scanner.captures.compact if compact is true and @scanner if compact is false.
Added match? and removed compact option in 50b3057.
Removed Source#match? and made Source#match return @scanner.
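As a rough illustration of building an event array from StringScanner#captures, here is a minimal sketch. SIMPLE_ENTITYDECL is a hypothetical, heavily simplified stand-in for REXML's real ENTITYDECL pattern, and the surrounding code is an assumption for illustration, not the parser's actual implementation.

```ruby
require "strscan"

# Hypothetical simplified pattern (assumption): a name and a quoted value.
SIMPLE_ENTITYDECL = /\s*<!ENTITY\s+(\w+)\s+"([^"]*)"\s*>/

scanner = StringScanner.new('<!ENTITY name "value">')

event = nil
if scanner.scan(SIMPLE_ENTITYDECL)
  # #captures returns only the capture groups, without the whole match,
  # so the event tag can be prepended directly.
  event = [:entitydecl, *scanner.captures]
end

p event
```

Returning the scanner itself from a match helper leaves it to the caller to decide whether to read the captures at all, which is the trade-off discussed in this thread.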
8edd4ce to
fcc4db8
Compare
@kou I used I don't think this is a good idea...
https://github.com/ruby/rexml/actions/runs/7510468370/job/20448939679?pr=105
fcc4db8 to
8227cc2
Compare
Add compact option to Improve processing speed by returning
https://github.com/ruby/rexml/actions/runs/7512802060/job/20453872698?pr=105
It seems that calling How about providing We'll use
diff --git a/lib/rexml/parsers/baseparser.rb b/lib/rexml/parsers/baseparser.rb
index 305b120..610209e 100644
--- a/lib/rexml/parsers/baseparser.rb
+++ b/lib/rexml/parsers/baseparser.rb
@@ -223,13 +223,13 @@ module REXML
return process_instruction
when DOCTYPE_START
base_error_message = "Malformed DOCTYPE"
- @source.match(DOCTYPE_START, true)
+ @source.match?(DOCTYPE_START, true)
@nsstack.unshift(curr_ns=Set.new)
name = parse_name(base_error_message)
- if @source.match(/\A\s*\[/um, true)
+ if @source.match?(/\A\s*\[/um, true)
id = [nil, nil, nil]
@document_status = :in_doctype
- elsif @source.match(/\A\s*>/um, true)
+ elsif @source.match?(/\A\s*>/um, true)
id = [nil, nil, nil]
@document_status = :after_doctype
else
Hmm. https://github.com/ruby/rexml/pull/105/files#r1451610001 may fix this.
8bc8955 to
50b3057
Compare
It was not fixed...
Added https://github.com/ruby/rexml/actions/runs/7516658356/job/20461969215?pr=105
OK. It seems that we don't access all captured results in our use case.
ruby/ruby#9536 will fix the CI failure.
50b3057 to
ec62e37
Compare
I removed https://github.com/ruby/rexml/actions/runs/7519306454/job/20467689597?pr=105
kou
left a comment
head CI jobs are fixed.
lib/rexml/parsers/baseparser.rb
Outdated
- match = @source.match( ENTITYDECL, true ).to_a.compact
- match[0] = :entitydecl
+ match = @source.match( ENTITYDECL, true )
+ match = match.nil? ? [:entitydecl] : [:entitydecl, *match.captures.compact]
Do we need match.nil? check here?
(Is there any case that the above @source.match() failed?)
If the match.nil? check is removed, all tests still succeed.
However, if the string <!ENTITY> comes in, @source.match() returns nil and undefined method `captures' for nil is raised.
Since <!ENTITY> violates the XML specification, it should be treated as an error anyway.
So I removed the match.nil? check.
https://xml.coverpages.org/xmlBNF.html
EntityDecl ::= '<!ENTITY' S Name S EntityDef S? '>' /* General entities */
| '<!ENTITY' S '%' S Name S EntityDef S? '>' /* Parameter entities */
OK. Could you add a test for the case?
Or we can do it as a separated PR.
OK.
I added a test for this case.
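The failure mode described in this thread (a bare <!ENTITY> that does not satisfy the EntityDecl production, so the match is nil) can be reproduced with a minimal sketch. ENTITYDECL_SKETCH and match_entitydecl are hypothetical names introduced here for illustration; REXML's real pattern and Source#match are more involved.

```ruby
require "strscan"

# Hypothetical simplified pattern (assumption): '<!ENTITY' S Name S "value" S? '>'
ENTITYDECL_SKETCH = /\s*<!ENTITY\s+(\w+)\s+"([^"]*)"\s*>/

# Hypothetical helper: returns the scanner on a match and nil otherwise.
def match_entitydecl(text)
  scanner = StringScanner.new(text)
  scanner.scan(ENTITYDECL_SKETCH) ? scanner : nil
end

# A well-formed declaration yields captures.
well_formed = match_entitydecl('<!ENTITY name "value">')
event = [:entitydecl, *well_formed.captures]

# A bare <!ENTITY> has no Name or EntityDef, so the match fails.
malformed = match_entitydecl("<!ENTITY>")

# Without a nil check, reading captures from the failed match raises.
error_class = begin
  malformed.captures
  nil
rescue NoMethodError => e
  e.class
end

p event
p malformed
p error_class
```

A test for this case would feed such a declaration to the parser and assert that a parse error is reported rather than a NoMethodError leaking out.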
[Why] Using StringScanner reduces string copying and improves processing speed.
ec62e37 to
995d3e2
Compare
995d3e2 to
ba9f7fc
Compare
…ve processing speed.
ba9f7fc to
eeb45e1
Compare
kou
left a comment
+1
Could you update the PR description before we merge this?
@kou
Thanks!
Thanks for your review!!!
Using StringScanner reduces string copying and improves processing speed.
I also removed unnecessary methods.
https://github.com/ruby/rexml/actions/runs/7549990000/job/20554906140?pr=105