Use TSDB's WAL for writes. #1103
Conversation
Force-pushed 047b710 to 4390f26
- Add github.com/prometheus/tsdb/wal
- Update github.com/prometheus/client_golang for the WrapRegistererWith function.

Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>

@tomwilkie do you plan to fix this PR?
Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
Rebase with master
Currently, the idea is to only write the WAL. Reading the WAL on startup without any downtime of ingesters is a little tricky, as reading the WAL takes a lot of time - hence we need ingesters dedicated to reading the WAL before they take any writes. That would be follow-up work after this.

As for using the written WAL, we can read the WAL and directly flush the chunks to the chunk store in case an ingester crashes. A tool/script to do that would again be a follow-up to this (or should I add it in the same PR?)

We (@gouthamve, @tomwilkie and I) also discussed Prometheus's tsdb vs only the WAL, and concluded that adopting tsdb would be very tricky right now as it requires a lot of changes, and we would need a way to handle the churn. Hence the plan is to go ahead with the WAL for now.
Force-pushed 673b0bc to 76dcda0
gouthamve left a comment:

LGTM with nits! Thanks for all the work and patience, Ganesh!
```go
	return nil, err
}
elapsed := time.Since(start)
level.Info(util.Logger).Log("msg", "recovery from WAL completed", "time", elapsed.String())
```
Can we also make this a metric? So that we can compare the duration changes over releases and also correlate it with the number of series, etc.
pkg/ingester/ingester.go
Outdated

```go
// Push implements client.IngesterServer
func (i *Ingester) Push(ctx old_ctx.Context, req *client.WriteRequest) (*client.WriteResponse, error) {
```
pkg/ingester/ingester.go
Outdated

```go
	// A small number of chunks per series - 10*(8^(7-1)) = 2.6m.
	Buckets: prometheus.ExponentialBuckets(10, 8, 7),
}),
walReplayDuration: prometheus.NewSummary(prometheus.SummaryOpts{
```
This could be just a gauge, as it doesn't change at all. No need for a summary.

@gouthamve done in the last commit.
pkg/ingester/ingester.go
Outdated

I think this metric is not registered. You should register it below.
pkg/ingester/ingester.go
Outdated

```diff
 	}
 }
-client.ReuseSlice(req.Timeseries)
+defer client.ReuseSlice(req.Timeseries)
```
Pointing to #2000 here, as it also includes the same fix and there is discussion going on about it. I have found the WAL to cause panics without that fix, so maybe we want to wait for a conclusion on that PR (merging this PR without this fix would make it unsafe to deploy the WAL).

It has been merged and I have rebased to remove the change from this PR.
docs/guides/ingesters-with-wal.md
Outdated

* `--ingester.wal-dir` to the directory where the WAL data should be stored and/or recovered from.
* `--ingester.wal-enabled` to `true`, which enables writing to the WAL during ingestion.
* `--ingester.checkpoint-enabled` to `true` to enable checkpointing of in-memory chunks to disk. This is optional and helps speed up the replay process.
* `--ingester.checkpoint-duration` to the interval at which checkpoints should be created.

Can we have a recommendation here (what is the default?)
The recommendation depends on the need and the ingester size. The default is 30m; I will add that here, but we use 15m in our dev environment.
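Put together, a WAL-enabled ingester invocation covering the flags quoted above might look like the following (a config sketch only; the binary name, paths, and the 15m checkpoint interval are illustrative, with 30m being the default discussed above):

```
ingester \
  --ingester.wal-enabled=true \
  --ingester.wal-dir=/data/wal \
  --ingester.checkpoint-enabled=true \
  --ingester.checkpoint-duration=15m \
  --ingester.recover-from-wal=true
```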
docs/guides/ingesters-with-wal.md
Outdated

* `--ingester.recover-from-wal` to `true` to recover data from an existing WAL. The data is recovered even if the WAL is disabled and this is set to `true`. The WAL dir needs to be set for this.
  * If you are going to enable the WAL, it is advisable to always set this to `true`.

## Stuff that is changed automatically when WAL is enabled
Changes in lifecycle when WAL is enabled
docs/guides/ingesters-with-wal.md
Outdated

## Stuff that is changed automatically when WAL is enabled

1. Flushing of data to the chunk store during rollouts or scale down is disabled. This is because during a rollout of a statefulset there is no 1 ingester leaving and joining each at the same time, rather the same ingester is shut down and brought back again with updated config. Hence flushing is skipped and the data is recovered from the WAL.
there are no ingesters that are simultaneously leaving and joining.
docs/guides/ingesters-with-wal.md
Outdated

To use the WAL, there are some changes that need to be made in the deployment.

## Things to change
Changes to the deployment
docs/guides/ingesters-with-wal.md
Outdated

Let's take an example of 4 ingesters. The migration would look something like this:

1. Bring up 1 stateful ingester `ingester-0` and wait till it's ready (accepting read and write requests).
docs/guides/ingesters-with-wal.md
Outdated

### Scale up

Scaling up is the same as what you would do without the WAL or statefulsets. Add 1 ingester at a time.
Why do we need to add one at a time? This limitation doesn't exist; I have sometimes added several at the same time.

I thought we should scale up 1 at a time (due to the ring logic?); if that is not the case then I will modify the description.
docs/guides/ingesters-with-wal.md
Outdated

There are 2 ways to do it, with the latter being a fallback option.

**First option**
Consider you have 4 ingesters `ingester-0 ingester-1 ingester-2 ingester-3` and you want to scale down to 2 ingesters; the ingesters which will be shut down according to statefulset rules are `ingester-2 ingester-3`.
`ingester-2 ingester-3` → `ingester-3` and then `ingester-2` (statefulsets shut pods down in reverse ordinal order).
As per https://docs.google.com/document/d/1n1HcdgmsqaVwqxKg_POZqIREOSqZIAlrxuyx4YYV_pA/edit
Fixes #12