Skip to content

[Discussion] Need for early-returning friendly iteration interface #417

@Moelf

Description

@Moelf

this is an example to demonstrate what "early returning" mean following a discussion on Slack with Alexander Plavin

tl;dr

The idea is that you have N columns you filter on, however, maybe 20% of the N columns are enough to early fail 80% of the rows. We need an ergonomic and 0-overhead interface to delay the allocation (or worse, when Feather is compressed) as much as possible.

Setup

julia> using Arrow, BenchmarkTools

julia> function gendata()
           x = [rand(rand(0:10)) for _ = 1:10^5]
           y = [randn(rand(0:10)) for _ = 1:10^5]
           (;x, y)
       end

julia> foreach(1:10) do _
           Arrow.append("./out.feather", gendata())
       end

Benchmark

julia> const tbl = Arrow.Table("out.feather");

julia> function kernel1(xs, ys)
           s1 = maximum(ys; init=0.0)
           s1 < 0.95 && return false

           maximum(xs; init=0.0) < 0.7 && return false
           return true
       end

julia> @benchmark map(kernel1, tbl.x, tbl.y)
BenchmarkTools.Trial: 19 samples with 1 evaluation.
 Range (min  max):  264.955 ms  271.889 ms  ┊ GC (min  max): 3.83%  4.93%
 Time  (median):     267.430 ms               ┊ GC (median):    3.82%
 Time  (mean ± σ):   267.704 ms ±   2.096 ms  ┊ GC (mean ± σ):  4.17% ± 0.47%

  ▁   ▁ ██    ▁▁     ▁ ▁     █▁  ▁  ▁     ▁         ▁       ▁ ▁
  █▁▁▁█▁██▁▁▁▁██▁▁▁▁▁█▁█▁▁▁▁▁██▁▁█▁▁█▁▁▁▁▁█▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁█▁█ ▁
  265 ms           Histogram: frequency by time          272 ms <

 Memory estimate: 192.34 MiB, allocs estimate: 2000004.

julia> function method2(xss, yss)
           map(axes(yss, 1)) do i
               ys = yss[i]
               s1 = maximum(ys; init=0.0)
               s1 < 0.95 && return false

               xs = xss[i]
               maximum(xs; init=0.0) < 0.7 && return false
               return true
           end
       end

julia> @benchmark method2(tbl.x, tbl.y)
BenchmarkTools.Trial: 25 samples with 1 evaluation.
 Range (min  max):  197.487 ms  219.503 ms  ┊ GC (min  max): 2.89%  2.82%
 Time  (median):     200.121 ms               ┊ GC (median):    2.95%
 Time  (mean ± σ):   202.677 ms ±   6.567 ms  ┊ GC (mean ± σ):  3.21% ± 0.48%

  █▁ █   ▁        ▄
  ██▁█▁▆▆█▆▁▁▁▁▁▁▆█▆▁▁▁▁▁▆▁▁▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆▆▁▁▆ ▁
  197 ms           Histogram: frequency by time          220 ms <

 Memory estimate: 148.03 MiB, allocs estimate: 1536913.

julia> map(kernel1, tbl.x, tbl.y) == method2(tbl.x, tbl.y)
true

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions