Bigger chunks, faster queries. #1048
I guess you want this so back-porting fixes from Prometheus v1 is easier? It's not really something I'm worried about, TBH: there is no development work happening on Prometheus v1, so we're more likely to find and fix issues in this package ourselves.
I started with it in another package, but as I've modelled this as another chunk encoding, there needed to be a bunch of references to it in the chunk package to make it work with the rest of our code. Therefore I just moved it into this package. Given that, I was thinking of renaming the directory to
Also so we can see where bugs came from. I'm going to have to think about this one: "Prom v1 verbatim" is pretty easy to justify, and "Prom v2" similarly, but "Prom v2 data wedged into the code structure from v1" will take time.
If every chunk in a long-running series is ~12 hours, then every second chunk is indexed in two day-buckets, so we have 50% extra index entries. I'm beginning to think that day buckets are too short.
Looking at the series store: if we have 4-hour chunks, we write 7x3=21 series entries per day. With this we have 9 entries per day, so we write fewer than half the entries (9 vs 21). The number of label entries stays the same, let's say 10.

For the chunk store, 4-hour chunks would have 21x10 entries; big chunks would have 9x10.

So sure, it's 50% when compared with itself, but compared to short chunks I think the number of extra overlaps you write remains the same, basically one per day. There is, after all, only one start and end to each day.
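The bucket-overlap arithmetic above can be sanity-checked with a small simulation. This is an illustrative sketch only: it counts (chunk, bucket) index entries for chunks tiled at an arbitrary offset across day-long buckets, and deliberately ignores Cortex's per-schema multipliers (the "x3" and "x10" factors in the comment above).

```go
package main

import "fmt"

// entriesPerDay tiles chunks of length chunkHrs (starting at offsetHrs)
// across `days` day-long index buckets and returns the average number of
// (chunk, bucket) index entries written per day. A chunk that straddles a
// bucket boundary is indexed in both buckets.
func entriesPerDay(chunkHrs, offsetHrs, days int) float64 {
	const bucketHrs = 24
	entries := 0
	for start := offsetHrs; start+chunkHrs <= days*bucketHrs; start += chunkHrs {
		end := start + chunkHrs - 1 // last hour covered by the chunk
		entries += end/bucketHrs - start/bucketHrs + 1
	}
	return float64(entries) / float64(days)
}

func main() {
	// 4h chunks: 6 chunks/day, one straddles a boundary -> ~7 entries/day.
	fmt.Println(entriesPerDay(4, 2, 1000))
	// 12h chunks: 2 chunks/day, every second one straddles -> ~3 entries/day,
	// i.e. the 50% overhead mentioned above.
	fmt.Println(entriesPerDay(12, 6, 1000))
}
```

Under these assumptions both chunk sizes pay roughly one extra boundary entry per day, which matches the "only one start and end to each day" point.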
My comment was not intended as a criticism of bigger chunks, but as a broader comment on the DB design. A 50% overhead is worth talking about no matter what your starting point is. Another way to go would be to stop the double index writes, and look up the index for one extra bucket on queries.
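The "one extra bucket on queries" alternative could look something like the sketch below. The function name and the maxChunkAgeHrs bound are hypothetical, not Cortex's actual schema code: the idea is that if chunks are never longer than some cap, the query side can cover boundary-straddling chunks by also reading the bucket just before the query range, instead of the write side indexing those chunks twice.

```go
package main

import "fmt"

// bucketsForQuery returns the day-bucket indexes to consult for a query
// over [fromHrs, throughHrs]. Instead of writing a boundary-straddling
// chunk into two buckets, the query looks back far enough to catch any
// chunk that started in the previous bucket but overlaps the range.
// maxChunkAgeHrs is an assumed upper bound on chunk length.
func bucketsForQuery(fromHrs, throughHrs, maxChunkAgeHrs int) []int {
	const bucketHrs = 24
	first := (fromHrs - maxChunkAgeHrs) / bucketHrs
	if first < 0 {
		first = 0
	}
	last := throughHrs / bucketHrs
	out := []int{}
	for b := first; b <= last; b++ {
		out = append(out, b)
	}
	return out
}

func main() {
	// A query over hours 30..50 with up-to-12h chunks must also check
	// bucket 0, since a chunk starting at hour 18 could still overlap.
	fmt.Println(bucketsForQuery(30, 50, 12)) // [0 1 2]
}
```

The trade-off is one extra bucket read per query in exchange for roughly one fewer index write per series per day.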
Let's take this into a separate issue; I agree it's a valid concern, but it's unrelated to this PR.
Also, remove assumptions about marshalled chunk length. Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
… to reduce amount of iteration we have to do through the chunk to find the right place. Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
…memory. Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
I've moved the chunk code into
A bigchunk is a slice of prometheus/tsdb chunks. Each individual chunk is fixed at 120 samples; after that we add a new one. There is no upper bound on the number of samples in a bigchunk.
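The growable structure described above can be sketched as follows. The names (bigchunk, sample, maxSamplesPerSubChunk) are illustrative, not the actual Cortex types, and the real implementation appends to prometheus/tsdb XOR-encoded chunks rather than plain slices:

```go
package main

// maxSamplesPerSubChunk mirrors the 120-sample cap on each underlying
// tsdb chunk described in the PR.
const maxSamplesPerSubChunk = 120

type sample struct {
	t int64   // timestamp (ms)
	v float64 // value
}

// bigchunk is a growable chunk built from fixed-size sub-chunks.
type bigchunk struct {
	subChunks [][]sample
}

// Add appends a sample, starting a new sub-chunk whenever the current
// one reaches the cap. There is no upper bound on total samples.
func (b *bigchunk) Add(s sample) {
	n := len(b.subChunks)
	if n == 0 || len(b.subChunks[n-1]) >= maxSamplesPerSubChunk {
		b.subChunks = append(b.subChunks, make([]sample, 0, maxSamplesPerSubChunk))
		n++
	}
	b.subChunks[n-1] = append(b.subChunks[n-1], s)
}

// Len returns the total number of samples across all sub-chunks.
func (b *bigchunk) Len() int {
	total := 0
	for _, c := range b.subChunks {
		total += len(c)
	}
	return total
}

func main() {
	var b bigchunk
	for i := 0; i < 250; i++ {
		b.Add(sample{t: int64(i), v: float64(i)})
	}
	// 250 samples split into sub-chunks of 120, 120, and 10.
	_ = b.Len()
}
```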
As part of this PR I've removed a bunch of unused code in the chunk package.
Fixes #1045, fixes #766, fixes #300