[Feature, Perf] Audit and optimize the embedded files accessing code #714

@JanKrivanek

Context

Inspired by the discussion here: #711 (comment)
Embedded files are currently read eagerly and in full into memory when a binlog is opened, even though they might never be accessed.

Gotchas

Embedded files are a zip archive nested within the zip stream of a binlog - accessing them later on would require one of these options:

  • either leaving the stream open (and hence not verifying that it is properly terminated)
  • or rereading and decompressing the entire binlog archive again
  • or copying the embedded zip archive into a separate temporary file

Each of these options has significant downsides. Which one is best would need to be determined by testing.
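For illustration, here is a minimal Python sketch of the third option (copying the embedded archive into a temporary file). It does not use the real binlog format: it assumes a toy container, a gzip stream whose payload is an 8-byte length, then the zip bytes, then the event bytes. The names `build_toy_binlog` and `spill_embedded_archive` are made up for this example.

```python
import gzip
import io
import struct
import tempfile
import zipfile


def build_toy_binlog(files: dict[str, str], events: bytes) -> bytes:
    """Build a toy log: gzip(u64 zip_length | zip bytes | event bytes).

    This layout is an assumption for the sketch, not the real binlog format.
    """
    zip_buf = io.BytesIO()
    with zipfile.ZipFile(zip_buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, text in files.items():
            zf.writestr(name, text)
    zip_bytes = zip_buf.getvalue()
    payload = struct.pack("<Q", len(zip_bytes)) + zip_bytes + events
    return gzip.compress(payload)


def spill_embedded_archive(binlog: bytes):
    """Copy the embedded zip to a temp file in chunks; return (temp_file, events)."""
    tmp = tempfile.TemporaryFile()
    with gzip.GzipFile(fileobj=io.BytesIO(binlog)) as gz:
        (zip_len,) = struct.unpack("<Q", gz.read(8))
        remaining = zip_len
        while remaining:
            chunk = gz.read(min(64 * 1024, remaining))
            if not chunk:
                raise EOFError("truncated embedded archive")
            tmp.write(chunk)
            remaining -= len(chunk)
        # Reading to the end of the stream also checks the gzip trailer,
        # so the outer stream can be closed as "properly terminated".
        events = gz.read()
    tmp.seek(0)
    return tmp, events


blob = build_toy_binlog({"a.proj": "<Project />"}, b"EVENTS")
tmp, events = spill_embedded_archive(blob)
with zipfile.ZipFile(tmp) as zf:
    # Individual entries are only decompressed when they are requested.
    print(zf.namelist(), zf.read("a.proj"))
tmp.close()
```

The cost in this sketch is one extra sequential write of the archive to disk; in exchange the outer stream is fully read and closed, and embedded files are decompressed only on demand.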

Alternative

Redesigning the binlog format.
E.g.: the compressed events stream and the files zip archive would be two independent streams within a single file. The file would have a few empty bytes preallocated at the beginning, and those would then be overwritten as the binlog is written, with:

  • size of the compressed events stream (so that it can be quickly skipped in the uncompressed FileStream and the next stream, the files ZipArchive, can be read)
  • size of the ZipArchive (as this cannot be reliably obtained from ZipArchive for possibly larger archives)
  • indication that the file was properly terminated (having this at the beginning of the file, instead of the end, allows the completeness check on initial open, without the need to read the files ZipArchive).
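A rough Python sketch of the preallocated-header idea follows. Everything about the layout is hypothetical: the `BLG2` magic, the field order and widths in `HEADER`, and the function names are invented for the example and are not a proposal for the actual on-disk format. For simplicity the zip archive is built in memory before being appended.

```python
import gzip
import io
import struct
import zipfile

# Hypothetical header: magic, events stream size, archive size, "complete" flag.
HEADER = struct.Struct("<4sQQB")
MAGIC = b"BLG2"  # made-up marker for this sketch


def write_log(out, events: bytes, files: dict[str, str]) -> None:
    # Reserve the header with zeros; the sizes are not known yet.
    out.write(HEADER.pack(MAGIC, 0, 0, 0))

    # Stream 1: compressed events, written directly to the output.
    start = out.tell()
    with gzip.GzipFile(fileobj=out, mode="wb") as gz:
        gz.write(events)
    events_size = out.tell() - start

    # Stream 2: the files archive, appended after the events stream.
    zip_buf = io.BytesIO()
    with zipfile.ZipFile(zip_buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, text in files.items():
            zf.writestr(name, text)
    archive = zip_buf.getvalue()
    out.write(archive)

    # Last step: go back and overwrite the reserved bytes.
    out.seek(0)
    out.write(HEADER.pack(MAGIC, events_size, len(archive), 1))
    out.seek(0, io.SEEK_END)


def read_header(f):
    f.seek(0)
    magic, events_size, archive_size, complete = HEADER.unpack(f.read(HEADER.size))
    if magic != MAGIC:
        raise ValueError("not a log in the sketched format")
    return events_size, archive_size, bool(complete)


def read_events(f) -> bytes:
    events_size, _, _ = read_header(f)
    f.seek(HEADER.size)
    return gzip.decompress(f.read(events_size))


def open_embedded_files(f) -> zipfile.ZipFile:
    events_size, archive_size, complete = read_header(f)
    if not complete:
        raise ValueError("log was not properly terminated")
    # Skip the events stream without decompressing it.
    f.seek(HEADER.size + events_size)
    return zipfile.ZipFile(io.BytesIO(f.read(archive_size)))
```

With such a layout a reader can check completeness from the first few bytes, and can reach the files archive with a single seek instead of decompressing the events stream first.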

Other possible alterations to the format could be made at the same time to optimize the compression ratio, speed up redaction workflows, etc. - e.g.:

  • deduplicated strings are packed together and possibly compressed in a stream separate from the rest of the events. Again, the offset would be part of the initial 'table of contents'.
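As a toy illustration of the separate string stream, the Python sketch below stores each distinct string once in its own compressed blob and has events refer to strings by index. The JSON encoding and the names `pack_events` and `unpack_events` are assumptions made for brevity; a real format would presumably use a binary encoding.

```python
import json
import zlib


def pack_events(events: list[list[str]]) -> tuple[bytes, bytes]:
    """Return (strings_blob, events_blob), each compressed independently."""
    index: dict[str, int] = {}
    encoded = []
    for event in events:
        # A new string gets the next free id; a repeated one reuses its id.
        encoded.append([index.setdefault(s, len(index)) for s in event])
    # dicts keep insertion order, so list(index) is ordered by id.
    strings_blob = zlib.compress(json.dumps(list(index)).encode("utf-8"))
    events_blob = zlib.compress(json.dumps(encoded).encode("utf-8"))
    return strings_blob, events_blob


def unpack_events(strings_blob: bytes, events_blob: bytes) -> list[list[str]]:
    strings = json.loads(zlib.decompress(strings_blob))
    encoded = json.loads(zlib.decompress(events_blob))
    return [[strings[i] for i in event] for event in encoded]
```

Keeping the strings in their own stream means a redaction pass could, in principle, rewrite only that stream; whether this helps in practice would need to be measured.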

FYI @rokonec - he came up with some of these ideas
