Conversation
|
Oh, I missed the existing one. It's actually an internal package, but it's reusable for this use case, I guess. |
- (a breaking change) Use signed int in Arrow intermediates (see the sketch below)
- Support some logical types
- Fix some test cases
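As a rough illustration of the first item (not the PR's actual code), switching the Arrow intermediate to a signed builder looks roughly like this with the Arrow Go library, assuming the pre-v1 module path github.com/apache/arrow/go/arrow:

package main

import (
	"fmt"

	"github.com/apache/arrow/go/arrow/array"
	"github.com/apache/arrow/go/arrow/memory"
)

func main() {
	pool := memory.NewGoAllocator()

	// Signed 64-bit intermediate instead of an unsigned builder; callers that
	// expected uint64 values out of the Arrow record see this as a breaking change.
	b := array.NewInt64Builder(pool)
	defer b.Release()

	b.AppendValues([]int64{1, -2, 3}, nil)
	arr := b.NewInt64Array()
	defer arr.Release()

	fmt.Println(arr.Int64Values()) // [1 -2 3]
}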
|
I finally examined a mem pprof result. It shows lower usage than the current version's ( #44 (comment) ). The reduction effect is: |
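For reference, the heap snapshot behind a mem pprof result like this can be captured with the standard runtime/pprof API; a minimal sketch (the workload placeholder is ours, not the project's code):

package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
)

func main() {
	// ... run the conversion workload here (placeholder) ...

	f, err := os.Create("mem.prof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	runtime.GC() // get up-to-date heap statistics before the snapshot
	if err := pprof.WriteHeapProfile(f); err != nil {
		log.Fatal(err)
	}
	// Inspect with: go tool pprof -alloc_space mem.prof
}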
|
Here's a quick performance test. I gave it the dummy input Avro file below. With the current version: With the latest (this pull request) version: The elapsed time increased by about 1.5x ... The latest version's path contains an additional (inputs) -> map -> arrow conversion, so that's not surprising, and we can possibly reduce the time if we remove the redundant (inputs) -> map conversion layer. |
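For context, a minimal sketch of the two hops being compared, assuming goavro for Avro decoding and the Arrow Go builders; the tiny schema and names are illustrative, not the PR's actual code:

package main

import (
	"fmt"
	"log"

	"github.com/apache/arrow/go/arrow"
	"github.com/apache/arrow/go/arrow/array"
	"github.com/apache/arrow/go/arrow/memory"
	"github.com/linkedin/goavro/v2"
)

func main() {
	codec, err := goavro.NewCodec(`{"type":"record","name":"r","fields":[{"name":"status","type":"long"}]}`)
	if err != nil {
		log.Fatal(err)
	}

	// Encode one dummy record so there are Avro bytes to decode.
	avroBytes, err := codec.BinaryFromNative(nil, map[string]interface{}{"status": int64(200)})
	if err != nil {
		log.Fatal(err)
	}

	// Hop 1: Avro bytes -> generic map (the redundant intermediate layer).
	native, _, err := codec.NativeFromBinary(avroBytes)
	if err != nil {
		log.Fatal(err)
	}
	rec := native.(map[string]interface{})

	// Hop 2: map -> Arrow builder. A direct decoder-to-builder path would
	// skip the map allocation entirely.
	pool := memory.NewGoAllocator()
	schema := arrow.NewSchema([]arrow.Field{{Name: "status", Type: arrow.PrimitiveTypes.Int64}}, nil)
	b := array.NewRecordBuilder(pool, schema)
	defer b.Release()
	b.Field(0).(*array.Int64Builder).Append(rec["status"].(int64))

	out := b.NewRecord()
	defer out.Release()
	fmt.Println(out.NumRows()) // 1
}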
Finally supported! If we have this schema: And these record values, which partially match the schema's field names and values but contain some nulls that the schema does not allow: Then columnify failed with a schema mismatch and this error message: By the way, the latest release version |
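To illustrate the failure mode with a minimal stand-in (not the author's original schema or records), encoding a record whose non-nullable string field holds null is rejected as a schema mismatch, for example by the goavro codec:

package main

import (
	"fmt"

	"github.com/linkedin/goavro/v2"
)

func main() {
	// "source" is a plain (non-union) string, so null is not allowed.
	codec, err := goavro.NewCodec(`{"type":"record","name":"r","fields":[{"name":"source","type":"string"}]}`)
	if err != nil {
		panic(err)
	}

	// The record "partially matches": the field name exists but its value is nil.
	_, err = codec.BinaryFromNative(nil, map[string]interface{}{"source": nil})
	fmt.Println(err) // non-nil: the codec rejects null for a non-nullable field
}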
I will profile CPU usage next. |
Codecov Report
@@ Coverage Diff @@
## master #47 +/- ##
===========================================
- Coverage 70.05% 58.36% -11.70%
===========================================
Files 19 18 -1
Lines 875 1237 +362
===========================================
+ Hits 613 722 +109
- Misses 203 462 +259
+ Partials 59 53 -6
|
|
I added a benchmark and profiling to the CI job. The CPU profiling result was here: And it has some high cum% consumers: It seems that we don't have many parts left to tune on our side now. So I guess what we can do next is work on parts related to dependency modules, mainly parquet-go. |
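For reproducing this kind of cum% listing locally, a Go benchmark plus the standard -cpuprofile flag is enough. The benchmark below is a self-contained stand-in (it exercises only Avro decoding and is not the repository's actual CI benchmark); running it with `go test -bench . -cpuprofile cpu.prof` and `go tool pprof -top cpu.prof` produces a flat%/cum% listing like the one referenced above.

package bench

import (
	"testing"

	"github.com/linkedin/goavro/v2"
)

// Stand-in benchmark for the conversion hot path (hypothetical, decode-only).
func BenchmarkDecodeAvro(b *testing.B) {
	codec, err := goavro.NewCodec(`{"type":"record","name":"r","fields":[{"name":"log","type":"string"}]}`)
	if err != nil {
		b.Fatal(err)
	}
	buf, err := codec.BinaryFromNative(nil, map[string]interface{}{"log": "GET /index 200"})
	if err != nil {
		b.Fatal(err)
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if _, _, err := codec.NativeFromBinary(buf); err != nil {
			b.Fatal(err)
		}
	}
}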
|
I cannot convert msgpack to parquet using columnify with this PR, so I didn't measure memory usage. But v0.0.3 can convert it. I used the following schema and data. {
"name": "RailsAccessLog",
"type": "record",
"fields": [
{
"name": "container_id",
"type": "string"
},
{
"name": "container_name",
"type": "string"
},
{
"name": "source",
"type": "string"
},
{
"name": "log",
"type": "string"
},
{
"name": "__fluentd_address__",
"type": "string"
},
{
"name": "__fluentd_host__",
"type": "string"
},
{
"name": "action",
"type": ["null", "string"]
},
{
"name": "controller",
"type": ["null", "string"]
},
{
"name": "role",
"type": "string"
},
{
"name": "host",
"type": "string"
},
{
"name": "location",
"type": ["null", "string"]
},
{
"name": "severity",
"type": ["null", "string"],
"default": "INFO"
},
{
"name": "status",
"type": "int"
},
{
"name": "db",
"type": ["null", "float"]
},
{
"name": "view",
"type": ["null", "float"]
},
{
"name": "duration",
"type": ["null", "float"]
},
{
"name": "method",
"type": "string"
},
{
"name": "path",
"type": "string"
},
{
"name": "format",
"type": ["null", "string"]
},
{
"name": "error",
"type": ["null", "string"]
},
{
"name": "remote_ip",
"type": ["null", "string"]
},
{
"name": "agent",
"type": ["null", "string"]
},
{
"name": "authenticated_user_id",
"type": ["null", "string"]
},
{
"name": "params",
"type": ["null", "string"]
},
{
"name": "tag",
"type": "string"
},
{
"name": "time",
"type": "string"
}
]
}
I can convert msgpack to parquet after I replaced from |
|
I could reproduce that. Actually, RSS is still quite high (but I found that the memprofile result is not so terrible, which is curious). Anyway, I would like to find another way to reduce it. Finally supporting streaming conversion ... ? That's not an easy way, but it would be more effective. |
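A minimal sketch of the chunked/streaming idea, assuming fixed-size flushes of the Arrow builder (this is only the shape of the approach, not the project's design):

package main

import (
	"fmt"

	"github.com/apache/arrow/go/arrow"
	"github.com/apache/arrow/go/arrow/array"
	"github.com/apache/arrow/go/arrow/memory"
)

func main() {
	const chunkSize = 2
	input := []int64{200, 404, 500, 301, 200} // stand-in for decoded record values

	pool := memory.NewGoAllocator()
	schema := arrow.NewSchema([]arrow.Field{{Name: "status", Type: arrow.PrimitiveTypes.Int64}}, nil)
	b := array.NewRecordBuilder(pool, schema)
	defer b.Release()

	flush := func() {
		rec := b.NewRecord() // takes the buffered rows and resets the builder
		defer rec.Release()
		fmt.Println("flushed rows:", rec.NumRows()) // here: write one row group / chunk
	}

	for i, v := range input {
		b.Field(0).(*array.Int64Builder).Append(v)
		if (i+1)%chunkSize == 0 {
			flush() // resident memory stays bounded by the chunk, not the whole input
		}
	}
	flush() // flush the remaining tail
}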
|
For using Arrow instead of naive |
#45
TODO