
Conversation

@oliverpool
Contributor

Description

This package has an interesting approach of splitting the file. However, I think that typical CSV processing is CPU-bound, not IO-bound.

This is reflected in the benchmarks with the fakeProcessRow function.
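For context, this is the benchmark helper (matching the diff quoted later in this thread); the fixed delay makes per-row processing, not reading, the dominant cost:

import "time"

// fakeProcessRow simulates each row costing one millisecond of work,
// so the benchmark is processing-bound rather than I/O-bound.
func fakeProcessRow(_ []string) {
	time.Sleep(time.Millisecond)
}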

I created a benchmark that uses the stdlib csv reader and spawns multiple workers for record processing: Benchmark50000Rows_50Mb_withGoCsvReaderReadOneByOneProcessParalell. A sketch of the approach is below.
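The gist of it (a minimal sketch with illustrative names, assuming one worker per CPU; the actual benchmark code may differ in details): one goroutine reads records sequentially with the stdlib csv.Reader while a pool of workers processes them.

import (
	"encoding/csv"
	"io"
	"os"
	"runtime"
	"sync"
)

func readOneByOneProcessParallel(path string, processRow func([]string)) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	// Workers consume records from a channel and process them in parallel.
	records := make(chan []string)
	var wg sync.WaitGroup
	for i := 0; i < runtime.NumCPU(); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for rec := range records {
				processRow(rec)
			}
		}()
	}

	// A single reader feeds the workers one record at a time.
	// Note: ReuseRecord must stay off, since each slice escapes to a worker.
	r := csv.NewReader(f)
	var readErr error
	for {
		rec, err := r.Read()
		if err != nil {
			if err != io.EOF {
				readErr = err
			}
			break
		}
		records <- rec
	}
	close(records)
	wg.Wait()
	return readErr
}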

The result is not far from Benchmark50000Rows_50Mb_withBigCsvReader:

go test -timeout=20m -benchmem -benchtime=1x -bench=.
goos: linux
goarch: amd64
pkg: github.com/actforgood/bigcsvreader
cpu: Intel(R) Core(TM) i7-6770HQ CPU @ 2.60GHz
Benchmark50000Rows_50Mb_withBigCsvReader-8                                     1        7249230548 ns/op        61740648 B/op     100291 allocs/op
Benchmark50000Rows_50Mb_withGoCsvReaderReadAll-8                               1        56917609645 ns/op       68507744 B/op     100044 allocs/op
Benchmark50000Rows_50Mb_withGoCsvReaderReadOneByOneAndReuseRecord-8            1        57368588456 ns/op       57606088 B/op      50019 allocs/op
Benchmark50000Rows_50Mb_withGoCsvReaderReadOneByOneProcessParalell-8           1        7275138093 ns/op        61607144 B/op     100037 allocs/op

Type of change

  • Benchmark

Checklist:

  • I added examples and changed the documentation accordingly (not yet).

Contributor License Agreement
By submitting this pull request, I confirm that my contribution is made under the terms of the project's license.

@bogcon
Collaborator

bogcon commented Jun 20, 2023

Hi @oliverpool
Nice discovery.

I tried to isolate the I/O (reading) improvement on a bigger file, ~500Mb instead of ~50Mb, and got the results below:

// diff
 func fakeProcessRow(_ []string) {
-       time.Sleep(time.Millisecond)
+       // time.Sleep(time.Millisecond)
 }

 func Benchmark50000Rows_50Mb_withBigCsvReader(b *testing.B) {
-       benchmarkBigCsvReader(5e4)(b)
+       benchmarkBigCsvReader(5e5)(b)
 }

 func Benchmark50000Rows_50Mb_withGoCsvReaderReadOneByOneProcessParalell(b *testing.B) {
-       benchmarkGoCsvReaderReadOneByOneProcessParalell(5e4)(b)
+       benchmarkGoCsvReaderReadOneByOneProcessParalell(5e5)(b)
 }
// results
Benchmark50000Rows_50Mb_withBigCsvReader-8                             	       2	1417864642 ns/op	616139968 B/op	 1000259 allocs/op
Benchmark50000Rows_50Mb_withGoCsvReaderReadOneByOneProcessParalell-8   	       2	4566982168 ns/op	616006568 B/op	 1000028 allocs/op

@bogcon bogcon merged commit 364d029 into actforgood:main Jun 20, 2023
@oliverpool
Contributor Author

oliverpool commented Jun 20, 2023

Thanks!

Maybe it would be worth mentioning in the description/readme that this package is mostly useful for IO-bound workloads (since, as the newly added benchmarks show, when the CPU is the bottleneck, the stdlib csv reader is fast enough). :)

@bogcon
Collaborator

bogcon commented Jun 20, 2023

I added this note to the readme.
