Skip to content

Conversation

@schneems
Copy link
Collaborator

@schneems schneems commented Nov 1, 2021

The "banner" is now based on lexical analysis rather than parser regex (fix #68, fix #87)

As described in #68 (comment). This implements a method of checking the source code for logically missing component pairs:

  • {}
  • []
  • ||
  • ()
  • keyword/end

It seems to be much more robust than using a regex on the parser errors. It also seems to provide more confident results than the parser errors.

Here's example output:

$ DEBUG_DISPLAY=1 be rspec spec/integration/ --format=f

Unmatched keyword, missing `end' ?

    1  Rails.application.routes.draw do
❯ 113    namespace :admin do
❯ 116    match "/foobar(*path)", via: :all, to: redirect { |_params, req|
❯ 120    }
  121  end


syntax error, unexpected ':', expecting end-of-input

   1  describe "webmock tests" do
  22    it "body" do
  27      query = Cutlass::FunctionQuery.new(
❯ 28        port: port
❯ 29        body: body
  30      ).call
  34    end
  35  end


Unmatched keyword, missing `end' ?

   5  module DerailedBenchmarks
   6    class RequireTree
   7      REQUIRED_BY = {}
   9      attr_reader   :name
  10      attr_writer   :cost
  11      attr_accessor :parent
❯ 13      def initialize(name)
❯ 18      def self.reset!
❯ 25      end
  73    end
  74  end


--> /Users/rschneeman/Documents/projects/dead_end/spec/fixtures/ruby_buildpack.rb.txt

Unmatched `|', missing `|' ?

❯ 1067    def add_yarn_binary
❯ 1068      return [] if yarn_preinstalled?
❯ 1069  |
❯ 1075    end


--> /Users/rschneeman/Documents/projects/dead_end/spec/fixtures/this_project_extra_def.rb.txt

Unmatched keyword, missing `end' ?

   1  module SyntaxErrorSearch
   3    class DisplayInvalidBlocks
   4      attr_reader :filename
❯ 36      def filename
❯ 38      def code_with_filename
❯ 45      end
  63    end
  64  end


--> /var/folders/b5/6sd8s7l16lqdll3_kzc0shy00000gq/T/20211101-3815-gljgc

Unmatched keyword, missing `end' ?

   16  class Rexe
❯  77    class Lookups
❯  78      def input_modes
❯ 148    end
  551  end

For cases where we are not confident about why code is invalid, we still fall back on the parser error messages. That allows this functionality to act as a progressive enhancement rather than a wholesale replacement.

We may be able to specialize other checks in the future as well. The most common one for me that comes up is missing a trailing comma in a hash/array/method-call:

query = Cutlass::FunctionQuery.new(
  port: port # <== here
  body: body
).call

Fix bug causing poor results (fix #95, fix #88)

As listed in #95 there are other cases where we have valid code nested in a way we didn't previously handle (via CleanDocument). The specific case is:

def ruby_install_binstub_path(ruby_layer_path = ".")
  @ruby_install_binstub_path ||=
    if ruby_version.build?
      "#{build_ruby_path}/bin"
    elsif ruby_version
      "#{ruby_layer_path}/#{slug_vendor_ruby}/bin"
    else
      ""
    end
end

As it's being through being validated/hidden, the inside is removed first, which generates an invalid code block.

This fix is to re-check all code in a block even if it is "hidden".

This conveniently drastically improves the results of #88 beyond just that small case, the whole output becomes very focused

close #95
close #88

CodeBlock Performance

28x faster empty line checks on CodeBlock

Benchmark:

require 'ripper'
require 'benchmark/ips'

@array = 20.times.map do
  obj = "puts a + b"
  def obj.empty?
    true
  end

  def obj.hidden?
    true
  end
  obj
end

def invalid?(source)
  source = source.join if source.is_a?(Array)
  source = source.to_s

  Ripper.new(source).tap(&:parse).error?
end

def valid?(source)
  !invalid?(source)
end

b = Benchmark.ips do |b|
  b.report("empty") { @array.all? {|x| x.empty? || x.hidden? } }
  b.report("ripper") { valid?(@array) }
  b.compare!
end

Gives us:

Warming up --------------------------------------
               empty    59.015k i/100ms
              ripper     2.064k i/100ms
Calculating -------------------------------------
               empty    594.981k (± 3.6%) i/s -      3.010M in   5.065759s
              ripper     20.694k (± 3.3%) i/s -    105.264k in   5.092576s

Comparison:
               empty:   594981.2 i/s
              ripper:    20693.8 i/s - 28.75x  (± 0.00) slower

This reduces overall test run time from ~5.3 seconds to 5.1 seconds. It's not a huge win, but it is something.

We could try to also use this same strategy in the CodeFrontier where we're likely doing a bunch of redundant parsing (i.e. adding empty lines won't ever change the parse results, but it is expensive to re-parse everything.

Frontier check performance

The error case with the ruby buildpack source code was added for "Fix bug causing poor results" which is a pathological case. The syntax error sits at the lowest level of indentation and the file is over a thousand lines long. This ran with no problem on my machine, but when running on shared hardware with circleci it hit the timeout causing tests to fail.

That failure prompted me to apply similar logic from the CodeBlock performance improvement to the CodeFrontier because I know that its where a bulk of time is spent. I had the insight that we cannot ever detect a fixed document unless we also have one or more blocks that are "invalid". With that info I was able to write a short-circuit check that skips Ripper when there are no invalid blocks on the frontier.

That shaved off a whole second from the test suite 🏎!!

Described in #68 (comment). This implements a method of checking the source code for logically missing component pairs:

- {}
- []
- |<value>|
- ()
- keyword/end

It seems to be much more robust than using a regex on the parser errors. It also seems to provide more confident results than the parser errors.

For cases where we are not confident about why code is invalid, we still fall back on the parser error messages. That allows this functionality to act as a progressive enhancement rather than a wholesale replacement.

We may be able to specialize other checks in the future as well. The most common one for me that comes up is missing a trailing comma in a hash/array/method-call:

```
    query = Cutlass::FunctionQuery.new(
      port: port # <== here
      body: body
    ).call
```
The integration test was poorly named. It's really integration cases for when we're not using a different process.
As listed in #95 there are other cases where we have valid code nested in a way we didn't previously handle (via `CleanDocument`). The specific case is:

```ruby
def ruby_install_binstub_path(ruby_layer_path = ".")
  @ruby_install_binstub_path ||=
    if ruby_version.build?
      "#{build_ruby_path}/bin"
    elsif ruby_version
      "#{ruby_layer_path}/#{slug_vendor_ruby}/bin"
    else
      ""
    end
end
```

As it's being through being validated/hidden, the inside is removed first, which generates an invalid code block.

This fix is to re-check all code in a block even if it is "hidden".

This conveniently drastically improves the results of #88 beyond just that small case, the whole output becomes very focused

close #95
close #88
Benchmark:

```
require 'ripper'
require 'benchmark/ips'

@array = 20.times.map do
  obj = "puts a + b"
  def obj.empty?
    true
  end

  def obj.hidden?
    true
  end
  obj
end

def invalid?(source)
  source = source.join if source.is_a?(Array)
  source = source.to_s

  Ripper.new(source).tap(&:parse).error?
end

def valid?(source)
  !invalid?(source)
end

b = Benchmark.ips do |b|
  b.report("empty") { @array.all? {|x| x.empty? || x.hidden? } }
  b.report("ripper") { valid?(@array) }
  b.compare!
end
```

Gives us:

```
Warming up --------------------------------------
               empty    59.015k i/100ms
              ripper     2.064k i/100ms
Calculating -------------------------------------
               empty    594.981k (± 3.6%) i/s -      3.010M in   5.065759s
              ripper     20.694k (± 3.3%) i/s -    105.264k in   5.092576s

Comparison:
               empty:   594981.2 i/s
              ripper:    20693.8 i/s - 28.75x  (± 0.00) slower
```

This reduces overall test run time from ~5.3 seconds to 5.1 seconds. It's not a huge win, but it is something.

We could try to also use this same strategy in the CodeFrontier where we're likely doing a bunch of redundant parsing (i.e. adding empty lines won't ever change the parse results, but it is expensive to re-parse everything.
Before:

```
$ git co 903d434
$ time be rspec ./spec/ --format=f
bundle exec rspec ./spec/ --format=f  4.52s user 1.38s system 97% cpu 6.032 total
```

After:

```
$ git co -
$ time be rspec ./spec/ --format=f
bundle exec rspec ./spec/ --format=f  4.07s user 1.39s system 97% cpu 5.593 total
```

Benchmarking just that test:

Before:

```
  0.899206   0.011891   0.911097 (  0.911433)
```

After:


```
  0.485635   0.008435   0.494070 (  0.495051)
```
@schneems schneems force-pushed the schneems/lex-banner branch from 8314e15 to cabce76 Compare November 2, 2021 02:06
Before:

```
  0.518002   0.007275   0.525277 (  0.526825)
```

After:

```
  0.277188   0.004640   0.281828 (  0.281940)
```

Instead of checking if the frontier holds all syntax errors any time there's an `invalid?` block in the frontier, only check when an invalid block is added. This prevents redundant checks as invalid blocks that are already on the frontier won't change and therefore won't give different results.

This also prevents a pathological performance case where there are two syntax errors, one at the furthest indentation and one at the lowest indentation.
@schneems schneems force-pushed the schneems/lex-banner branch from cabce76 to 9bfed9b Compare November 2, 2021 02:11
@schneems schneems merged commit 67eb5dc into main Nov 2, 2021
@schneems schneems deleted the schneems/lex-banner branch November 2, 2021 20:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

2 participants