Skip to content

Conversation

@tenderlove
Copy link
Member

@tenderlove tenderlove commented Feb 22, 2024

This commit adds scan_byte and peek_byte. scan_byte will scan the current byte, return it as an integer, and advance the cursor. peek_byte will return the current byte as an integer without advancing the cursor.

Currently StringScanner#get_byte returns a string, but I want to get the current byte without allocating a string. I think this will help with writing high performance lexers.

Thanks!

@tenderlove tenderlove requested a review from kou February 22, 2024 23:13
Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that #read_byte and #peek_byte are better because #getbyte is obsoleted by #get_byte.

BTW, #read_byte may not a good name... I don't feel that it implies that integer not string... Do you have other candidate name? (I'll also consider better name.)

@tenderlove
Copy link
Member Author

I think that #read_byte and #peek_byte are better because #getbyte is obsoleted by #get_byte.

Adding an underscore seems fine. I chose readbyte because of IO#readbyte:

>> File.open("Rakefile").readbyte
=> 114
>> File.open("Rakefile").getbyte
=> 114
>>

BTW, #read_byte may not a good name... I don't feel that it implies that integer not string... Do you have other candidate name? (I'll also consider better name.)

I'm not sure what to name it. I think String#getbyte returned a string in the Ruby 1.8 era, but since Ruby 1.9, it returned an integer.

Maybe getchi? Like getch, but returns an int? I'm not sure 😅

@tenderlove tenderlove force-pushed the readbyte branch 2 times, most recently from 55cde40 to 1e8ec46 Compare February 22, 2024 23:58
Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

I chose readbyte because of IO#readbyte.

Ah, OK. I didn't notice it. It makes sense. Let's use read (read_byte) for this feature.

We may want to deprecate get_byte and getbyte.

Could you update these lists?

strscan/ext/strscan/strscan.c

Lines 1604 to 1619 in 338d870

* === Advancing the Scan Pointer
*
* - #getch
* - #get_byte
* - #scan
* - #scan_until
* - #skip
* - #skip_until
*
* === Looking Ahead
*
* - #check
* - #check_until
* - #exist?
* - #match?
* - #peek

Could you also update the PR description because we use the PR description for commit message?

@headius Could you review JRuby part?

@@ -1,5 +1,6 @@
#!/usr/bin/env ruby

gem 'strscan'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow! I didn't notice that we need this to use installed gem with Ruby 3.2 and 3.3.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might be able to use bundle exec? I'm not sure what the policy of bundled gems is. I thought you didn't need gem either, but it seems to fix tests in CI.

@kou kou requested a review from headius February 23, 2024 07:21
@tenderlove
Copy link
Member Author

tenderlove commented Feb 23, 2024

I chose readbyte because of IO#readbyte.

Ah, OK. I didn't notice it. It makes sense. Let's use read (read_byte) for this feature.

We may want to deprecate get_byte and getbyte.

I kind of wish getbyte returns an int so that StringScanner API was similar to IO and String (so that people don't need to learn new method names). But changing the return value seems to introduce an incompatibility. Maybe we could do strscan version 4.0.0 and change the return value?

Could you update these lists?

Fixed!

Could you also update the PR description because we use the PR description for commit message?

Done!

@kou
Copy link
Member

kou commented Feb 24, 2024

I can understand what you said. I will reconsider API. Please give me a few days...

@kou
Copy link
Member

kou commented Feb 25, 2024

How about using scan_byte instead of readbyte/read_byte?
Because:

  • IO#readbyte raises EOFError for EOF case but we don't want to do it.
  • StringScanner isn't an IO-like object. So read* is a bit strange.
  • We have scan/scan_until that advance the scan pointer. Adding scan_byte to the scan family will not be strange.

I can understand the same getbyte method name may be easy to guess but I don't want to introduce a backward incompatibility without migration period. (StringScanner isn't IO-like/String-like object. So IO/String like method name may not be suitable for StringScanner.)

@tenderlove
Copy link
Member Author

How about using scan_byte instead of readbyte/read_byte?

Yes! I like this a lot. I'll change the method name.

Because:

  • IO#readbyte raises EOFError for EOF case but we don't want to do it.
  • StringScanner isn't an IO-like object. So read* is a bit strange.
  • We have scan/scan_until that advance the scan pointer. Adding scan_byte to the scan family will not be strange.

💯

tenderlove and others added 3 commits February 25, 2024 15:26
This commit adds `scan_byte` and `peek_byte`. `scan_byte` will read the
current byte, return it as an integer, and advance the cursor.
`peek_byte` will return the current byte as an integer without advancing
the cursor.
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
@kou kou changed the title Add a method for peeking and reading bytes as integers Add a method for peeking and scanning bytes as integers Feb 26, 2024
@kou
Copy link
Member

kou commented Feb 26, 2024

Let's merge this.

@headius If you have any comment, please let us know later.

@kou kou merged commit 873aba2 into master Feb 26, 2024
@kou kou deleted the readbyte branch February 26, 2024 00:45
matzbot pushed a commit to ruby/ruby that referenced this pull request Feb 26, 2024
integers
(ruby/strscan#89)

This commit adds `scan_byte` and `peek_byte`. `scan_byte` will scan the
current byte, return it as an integer, and advance the cursor.
`peek_byte` will return the current byte as an integer without advancing
the cursor.

Currently `StringScanner#get_byte` returns a string, but I want to get
the current byte without allocating a string. I think this will help
with writing high performance lexers.

---------

ruby/strscan@873aba2e5d

Co-authored-by: Sutou Kouhei <kou@clear-code.com>
@headius
Copy link
Contributor

headius commented Feb 26, 2024

I was unable to review in time but there are some performance improvements I will submit as a separate PR.

@headius
Copy link
Contributor

headius commented Feb 26, 2024

The PR in #90 optimizes the new methods and should be merged before release. The original was functionally correct but str.getBytes() returns a copy of the byte range, not a pointer to the actual byte range within the string (because we COW and JVM can't return a pointer to an array offset, use ByteList#get to get the byte at its correct offset).

kou pushed a commit that referenced this pull request Feb 27, 2024
A few simple changes on top of #89:

* Fields accessed more than once should be loaded into a local variable
(curr, byteList, runtime). Otherwise, the JVM in some modes will re-load
from memory due to the possibility of a concurrent update (better
caching, constant propagation).
* Use ByteList.get(index) to access directly rather than str.getBytes
which makes a copy byte[] (reduce allocation, let ByteList calculate
begin offset and 0xFF masking).

Performance is nearly 2x better, mostly because we've eliminated the
copy:

BEFORE:
```
[] strscan $ jruby -Xcompile.invokedynamic -Ilib -rstrscan -rbenchmark -e 'ss = StringScanner.new("x"); 10.times { puts Benchmark.measure { i = 0; while i < 100_000_000; i+=1; ss.peek_byte; end } }'
  1.870000   0.080000   1.950000 (  1.393004)
  1.300000   0.020000   1.320000 (  1.221120)
  1.210000   0.000000   1.210000 (  1.201006)
  1.220000   0.010000   1.230000 (  1.188845)
  1.240000   0.000000   1.240000 (  1.226052)
  1.230000   0.010000   1.240000 (  1.203179)
  1.490000   0.000000   1.490000 (  1.254489)
  1.230000   0.010000   1.240000 (  1.218346)
  1.230000   0.000000   1.230000 (  1.206789)
  1.300000   0.010000   1.310000 (  1.237078)
```

AFTER:
```
[] strscan $ jruby -Xcompile.invokedynamic -Ilib -rstrscan -rbenchmark -e 'ss = StringScanner.new("x"); 10.times { puts Benchmark.measure { i = 0; while i < 100_000_000; i+=1; ss.peek_byte; end } }'
  1.250000   0.080000   1.330000 (  0.937111)
  0.740000   0.000000   0.740000 (  0.707315)
  0.710000   0.010000   0.720000 (  0.701804)
  0.710000   0.000000   0.710000 (  0.697933)
  0.710000   0.000000   0.710000 (  0.704491)
  0.730000   0.010000   0.740000 (  0.712514)
  0.710000   0.000000   0.710000 (  0.701410)
  0.740000   0.000000   0.740000 (  0.715182)
  0.720000   0.010000   0.730000 (  0.703247)
  0.770000   0.000000   0.770000 (  0.768175)
```
kou pushed a commit that referenced this pull request Mar 27, 2024
…#91)

The methods were added in #89 but they aren't implemented in TruffleRuby
yet. So let's omit them for now to have CI green.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants