distsql: collect tableReader stats and output to trace by asubiotto · Pull Request #24529 · cockroachdb/cockroach

asubiotto · 2018-04-05T21:18:10Z

This PR is split into two commits. The first commit lays the groundwork for stat collection code. The stats are modeled as a proto to be sent over the wire in the future (we will be sending these stats as metadata). The second commit modifies the tableReader to use this InputStatCollector to count the number of rows it reads. Example usage:

root@:26257/> SELECT COUNT(*) FROM test.xyz;
+-------+
| count |
+-------+
|     7 |
+-------+
(1 row)

Time: 257.249037ms

root@:26257/> SET tracing = ON;
SET

Time: 362.575µs

root@:26257/> SELECT x,y,z FROM test.xyz ORDER BY x;
+---+---+------+
| x | y |  z   |
+---+---+------+
| 1 | 1 | NULL |
| 1 | 1 |    2 |
| 1 | 1 |    2 |
| 1 | 2 |    1 |
| 2 | 2 |    3 |
| 4 | 5 |    6 |
| 4 | 1 |    6 |
+---+---+------+
(7 rows)

Time: 207.334037ms

root@:26257/> SHOW TRACE FOR SESSION;

... rest of trace elided ...

| 2018-04-05 20:53:01.571467+00:00 | 21s794ms540µs951ns | waiting for read lock                                                    | [n1,s1,r23/1:/Table/5{2-3}]                                |     | /cockroach.roachpb.Internal/Batch |   13 |
| 2018-04-05 20:53:01.777272+00:00 | 22s345µs429ns      | read completed                                                           | [n1,s1,r23/1:/Table/5{2-3}]                                |     | /cockroach.roachpb.Internal/Batch |   13 |
| 2018-04-05 20:53:01.777435+00:00 | 22s508µs797ns      | stat summary for tableReader input                                       | [client=[::1]:64047,user=root,n1,TableReader]              |     | table reader                      |   11 |
|                                  |                    |                                                                          |                                                            |     |                                   |      |
|                                  |                    |  rows read         7                                                     |                                                            |     |                                   |      |

... rest of trace elided ...

And then through jaeger:

Note that we can't currently run SHOW TRACE FOR <QUERY> in distsql mode, which I hope to fix soon (#16562)

Any suggestions to improve the output welcome.

cc @knz @jordanlewis

cockroach-teamcity · 2018-04-05T21:18:20Z

This change is

tbg · 2018-04-05T21:23:30Z

Glad to see work in this direction! I was just wondering the other day how we could get better insights here. From the PR description, it looks like you're printing the stats into an event message. Have you considered making them tags instead? Instead of a message you could have tags of the type "tablereader.spans.read" -> 7 (and I'm anticipating that we'll have the RocksDB performance counters similarly). I'm not sure this is a better idea, btw, and even if it is it's straightforward to change it later.

asubiotto · 2018-04-05T21:45:34Z

I haven't considered any alternatives to the trace output since I'm not too familiar with logging. Thanks for suggesting the tags, I'll try those out.

tbg · 2018-04-05T22:01:09Z

Sounds good! I just want to reiterate that maybe this just won't work very well. But it's worth a quick experiment. (Even if you add tags, you probably still want to print it into the text so that it's properly exposed during SHOW TRACE FOR).

petermattis · 2018-04-06T13:49:23Z

Review status: 0 of 7 files reviewed at latest revision, all discussions resolved, all commit checks successful.

pkg/sql/distsqlrun/data.proto, line 265 at r1 (raw file):

message InputStats {
  // name describes the input whose stats were collected.
  optional string name = 1 [(gogoproto.nullable) = false];

Streams that connect distsql processors have a stream_id. Perhaps that should be used to identify the stats instead of a name. @andreimatei Might have opinions about this.

Comments from Reviewable

asubiotto · 2018-04-09T17:52:27Z

I think we need to resolve the question of how to send structured stats data back (e.g. to display in EXPLAIN). Initially, I was thinking of having these stats sent back as ProducerMetadata. The alternative is to, as @tschottdorf suggests, add tags to the processor's span. We could prefix the stats tags with stats. and print those out, to not have to care about any other changes in the stats output. This sounds a bit brittle though. @andreimatei, what do you think?

asubiotto · 2018-04-09T17:56:14Z

Review status: 0 of 7 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.

pkg/sql/distsqlrun/data.proto, line 265 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Streams that connect distsql processors have a stream_id. Perhaps that should be used to identify the stats instead of a name. @andreimatei Might have opinions about this.

I was thinking that this name would describe what the stats are about, rather than identify the stats (e.g. "left input of hashJoiner"). The identification will be done at a later stage, and I'm not sure stream_ids are enough to do that. The goal is to group a bunch of these stats (InputStats, MemoryStats, etc...) into a StatsGroup identified by a unique processor id and a stage in the logical plan.

Comments from Reviewable

andreimatei · 2018-04-09T18:20:06Z

The idea of using span tags for this information seems weird to me. If a processor collects 5 stats, would that be on tag or 5? Would they really look nice in the Jaegar UI? One argument I'd maybe buy was if putting them as tags would let any of the tracing services provide structured searching over them. But I doubt it...
If anything, I think the concept of span baggage items is closer in intent to this use case. But I don't know if tracers render baggage at all.
Another thing to consider is that I'm not convinced that the current mode of running with one span per processor works well for fused processors in the long run. The alternative would be to have all fused processors use a single span - which seems more like the canonical use of spans, but also has disadvantages. If multiple processors were to share a span, we'd need to consider what we want that span's tags to be.

The way we've returned structured info from processors so far is through ProducerMetadata. If the stats introduced here need to be "structured", then I think my preference would be for processors to do both - return some metadata and also render that metadata as a trace message.
We could teach SHOW TRACE FOR to recognize these messages and bump them to the top of their spans.

Review status: 0 of 7 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.

pkg/sql/distsqlrun/data.proto, line 265 at r1 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

I was thinking that this name would describe what the stats are about, rather than identify the stats (e.g. "left input of hashJoiner"). The identification will be done at a later stage, and I'm not sure stream_ids are enough to do that. The goal is to group a bunch of these stats (InputStats, MemoryStats, etc...) into a StatsGroup identified by a unique processor id and a stage in the logical plan.

An "input" is a stream, right? So I think it makes sense to rename this to StreamStats and put a stream_id in it...
I think the concept of a StatsGroup probably makes sense, but I don't think that's in conflict with the StreamStats?

Comments from Reviewable

tbg · 2018-04-09T19:19:07Z

The idea of using span tags for this information seems weird to me. If a processor collects 5 stats, would that be on tag or 5? Would they really look nice in the Jaegar UI? One argument I'd maybe buy was if putting them as tags would let any of the tracing services provide structured searching over them. But I doubt it...
If anything, I think the concept of span baggage items is closer in intent to this use case. But I don't know if tracers render baggage at all.

Could you elaborate on why that seems weird to you? Think of a span like a struct { messages []string, tags map[string]string }. You would use a tag for structured information, which this is. And I think you would have a tag for each metric you're collecting. Also, this isn't limited to DistSQL. For example, you may want your SHOW TRACE FOR COUNT(*) to tell you how many RocksDB tombstones were skipped over. You get a snowball trace bag, and simply add up the corresponding values. How to aggregate could be obvious from the tag name: count.<anything>.

For example, if you want to know how many rows you've read, you let every tablereader's span emit a count count.read (or whatever). Building something DistSQL specific would be unfortunate.

Baggage items are not the right choice here. The storage layer does not need to know how many rows the table reader on top of it has already read.

Review status: 0 of 7 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.

Comments from Reviewable

andreimatei · 2018-04-09T19:25:22Z

Could you elaborate on why that seems weird to you?

My mind is not set on this, so let me ask you a question in return: any log message provides some information (at the very lease its presence/absence is a bit). What nuggets of this information should be tags?

Review status: 0 of 7 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.

Comments from Reviewable

asubiotto · 2018-04-09T19:27:09Z

Review status: 0 of 7 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.

pkg/sql/distsqlrun/data.proto, line 265 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

An "input" is a stream, right? So I think it makes sense to rename this to StreamStats and put a stream_id in it...
I think the concept of a StatsGroup probably makes sense, but I don't think that's in conflict with the StreamStats?

TIL, I thought streams were only remote. It's not in conflict, but I'm not sure what we get out of having a stream_id here when I'm going to add identifiers for a group of stats later. Also, taking this PR as a specific example, not all InputStats will correspond to a stream. Here, the tableReader doesn't have an input from a plan perspective but we still want to collect these stats from its interaction with the rowFetcher. I think it's better to leave this as it is and there's no harm in keeping it general.

Comments from Reviewable

tbg · 2018-04-09T20:26:51Z

Log messages are meant for human consumption and so their form can change from one day to the next (you don't go around wondering whether changing a log.Event will mess up some parsing somewhere else). For span tag keys the semantics are that they are a label for which there is a registry (i.e. you don't just make them up free-form). Span tags are represented by a map so they are used for key lookup while events are not supposed to be searched. I think you'll find that represented in how jaeger and lightstep interpret tags; they are sensitive to certain standard tags.

I don't think that you're seriously suggesting that we don't use tags and instead parse the information out of log messages because well, not having to do that is exactly why tags exist.

See https://github.com/opentracing/specification/blob/master/semantic_conventions.md for some intuition about span tags. Tags are usually something like component, db.statement, user, http_code, etc. Using it to propagate information that can be accumulated is not something mentioned there, but it seems a reasonable fit.

Take a look at the tentative API below. We should show the tags inSHOW TRACE just before the span closes (because in the ordinary SHOW TRACE FOR that's kind of what you want, you don't want it aggregated -- it's the raw view of the timeline).

const keysRead = 12515 // example
log.CountEvent(ctx, "tablereader.keys", keysRead)
// ^- calls
// (opentracing.Span).SetTag("count.tablereader.keys", fmt.Sprintf("%v", keysRead))

But the upside is that you can imagine running (mod syntax bikeshed) SHOW TRACE (STATS) FOR SELECT COUNT(*) ... and get something like

metric                    | value
--------------------------+---------
count.tablereader.keys    | 12345
count.replicas.contacted  | 591
count.requests.           | 25
count.writes              | 3
...

I'm not sure what everyone else's experience is, but I've often wanted to isolate specific information from a trace and that has always been a problem since everything is currently unstructured. Would be good to change that.

This proposal isn't perfect because what if CountEvent is called multiple times for a span? We don't necessarily want to have to parse the string, add, and re-encode it. Maybe that's just not something that should be done? But I think some version of that proposal should be feasible, and we would make a central registry of tags and tie it all together that way. I'd prefer that over cooking up some DistSQL-specific mechanism.

Happy to chat about this more and help out if what I'm proposing is determined worth a shot but hits a snag.

Review status: 0 of 7 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.

Comments from Reviewable

andreimatei · 2018-04-09T21:53:46Z

you don't go around wondering whether changing a log.Event will mess up some parsing somewhere else

Well, actually, we already have logs that we can't change willy-nilly because of the damn "KV tracing" - those are logs selected by some regex.
I always thought the implementation of KV tracing in particular was a bad idea, but I've also thought that as tracing starts being used for testing more and more (which it already is), we should have a set of messages for which the format string is a constant shared by the code doing the logging and by a "trace comprehension" library.

Anyway, as I was telling you, I am inspired by the idea of doing something for these stats which is not specific to DistSQL and also about a metrics-like API, and I guess that reflecting these things in tags rather than messages makes sense. In fact, I think we should access all the metrics that we already have through such an API so that the metrics that a request contributed to are reflected in that request's trace. "What if CountEvent is called multiple times for a span" is something that we really need to decide on though, as is what we want to aggregate per span and what we want to aggregate across spans too. For example, when I run a SELECT, I want to see how many ranges were touched as a top-level summed up value (at least summed up across all KV requests done by a KV fetcher).

Review status: 0 of 7 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.

Comments from Reviewable

knz · 2018-04-10T14:03:50Z

@tschottdorf while I am agnostic on whether tags are the best way to do things conceptually, I'd like to point out that our current implementation of tags (using a linked list of tag structs implementing the Context interface, which need to be traversed to arrive to the actual context) is not performance-friendly with larger tag sets.

Until we solve that implementation problem I'd say we shouldn't use tags unless a specific benchmark measures the corresponding overhead.

tbg · 2018-04-10T14:39:50Z

@knz you're talking about the wrong tags 😄 My "tags" are opentracing span tags. They are internally backed by a map[string]string.

On a related note though, I was thinking what we are really winning by the use of opentracing tags and it's not much. Sure, the various opentracing-compatible tools will allow us to search spans by tag which is nice. But for a metrics-y use case, that seems mostly not very interesting, and being forced into all-strings is pretty terrible for things that are mostly integers.

So what if instead we put a struct with performance counters in (our impl of) span which is then recorded in RecordedSpan? I think this means that we need to widen the opentracing.Span interface, but that seems acceptable. We won't get the "benefit" of having these counters show up on lightstep or wherever, but that seems OK (and we could emulate it by also writing a tag whenever a counter is used).

@andreimatei, what do you think? This seems relatively straightforward to prototype.

asubiotto · 2018-04-11T14:29:47Z

@tschottdorf I think this RecordedSpan idea is the way to go because of the ability to have any layer add its own stats (not just distsql) and keep the benefits of structuring these stats as protos. This PR is in line with the idea of adding these stats to a proto so I don't think there is any extra work to do at this stage. I appreciate the input.

@solongordon, this PR is ready for a review after I resolve the merge conflicts.

solongordon · 2018-04-11T20:37:46Z

Couple quibbles but looks like a good start in general.

Review status: 0 of 7 files reviewed at latest revision, 1 unresolved discussion.

pkg/sql/distsqlrun/stats.go, line 57 at r3 (raw file):

func (is InputStats) SummarizeStats() string {
	var buffer bytes.Buffer
	buffer.WriteString(fmt.Sprintf("stat summary for %s\n", is.Name))

Nit: Do you know what the deal is with the blank line in the trace output?

pkg/sql/distsqlrun/stats.go, line 58 at r3 (raw file):

	var buffer bytes.Buffer
	buffer.WriteString(fmt.Sprintf("stat summary for %s\n", is.Name))
	buffer.WriteString(fmt.Sprintf("%10s%10d", "rows read", is.NumRows))

I would personally find a simple "rows read: 7" format to be more readable, but I'm also happy to wait on bikeshedding this.

Comments from Reviewable

Will be used by processors in future commits. Currently only counts the number of rows read but will be extended. Release note: None

Just number of rows read for now. This information is only collected and output if the tracing span is recording. Part of work for #19476 Release note: None

asubiotto · 2018-04-12T20:41:05Z

Review status: 0 of 7 files reviewed at latest revision, 3 unresolved discussions.

pkg/sql/distsqlrun/stats.go, line 57 at r3 (raw file):

Previously, solongordon (Solon) wrote…

Nit: Do you know what the deal is with the blank line in the trace output?

No idea. I changed to follow the same format as runtime stats:

320 I180412 20:25:00.604783 230 server/status/runtime.go:219  [n1] runtime stats: 238 MiB RSS, 163 goroutines, 94 MiB/45 MiB/164 MiB GO alloc/idle/total, 96 MiB/134 MiB CGO alloc/total, 89.70cgo/sec, 0.01/0.00 %(u/s)time, 0.00 %gc (0x)

pkg/sql/distsqlrun/stats.go, line 58 at r3 (raw file):

Previously, solongordon (Solon) wrote…

I would personally find a simple "rows read: 7" format to be more readable, but I'm also happy to wait on bikeshedding this.

Changed the format, I agree with you. Had the width to make more stats easier to read.

Comments from Reviewable

asubiotto · 2018-04-12T20:47:25Z

bors r+

@knz

24529: distsql: collect tableReader stats and output to trace r=asubiotto a=asubiotto This PR is split into two commits. The first commit lays the groundwork for stat collection code. The stats are modeled as a proto to be sent over the wire in the future (we will be sending these stats as metadata). The second commit modifies the tableReader to use this InputStatCollector to count the number of rows it reads. Example usage: ``` root@:26257/> SELECT COUNT(*) FROM test.xyz; +-------+ | count | +-------+ | 7 | +-------+ (1 row) Time: 257.249037ms root@:26257/> SET tracing = ON; SET Time: 362.575µs root@:26257/> SELECT x,y,z FROM test.xyz ORDER BY x; +---+---+------+ | x | y | z | +---+---+------+ | 1 | 1 | NULL | | 1 | 1 | 2 | | 1 | 1 | 2 | | 1 | 2 | 1 | | 2 | 2 | 3 | | 4 | 5 | 6 | | 4 | 1 | 6 | +---+---+------+ (7 rows) Time: 207.334037ms root@:26257/> SHOW TRACE FOR SESSION; ... rest of trace elided ... | 2018-04-05 20:53:01.571467+00:00 | 21s794ms540µs951ns | waiting for read lock | [n1,s1,r23/1:/Table/5{2-3}] | | /cockroach.roachpb.Internal/Batch | 13 | | 2018-04-05 20:53:01.777272+00:00 | 22s345µs429ns | read completed | [n1,s1,r23/1:/Table/5{2-3}] | | /cockroach.roachpb.Internal/Batch | 13 | | 2018-04-05 20:53:01.777435+00:00 | 22s508µs797ns | stat summary for tableReader input | [client=[::1]:64047,user=root,n1,TableReader] | | table reader | 11 | | | | | | | | | | | | rows read 7 | | | | | ... rest of trace elided ... ``` And then through jaeger: <img width="1427" alt="screen shot 2018-04-05 at 4 53 55 pm" src="https://user-images.githubusercontent.com/10560359/38392141-740ae6d8-38f4-11e8-9406-2c167142eba6.png"> Note that we can't currently run `SHOW TRACE FOR <QUERY>` in distsql mode, which I hope to fix soon (#16562) Any suggestions to improve the output welcome. cc @knz @jordanlewis 24715: opt: Fix flaky test r=andy-kimball a=andy-kimball Update the TestNoMallocs to use the SucceedsSoon method in order to repeatedly retry until zero memory allocations are detected. Release note: None 24752: cli: Fix number of args expected by `debug gossip-values` r=a-robinson a=a-robinson Fixes #24751 Release note: None It looks like the problem is that my original usage string (`gossip-values <directory>`) was badly copied and pasted from other debug commands, and then assumed necessary by #24399 when in reality no positional argument is used by the command. Co-authored-by: Alfonso Subiotto Marqués <alfonso@cockroachlabs.com> Co-authored-by: Andrew Kimball <andyk@cockroachlabs.com> Co-authored-by: Alex Robinson <alexdwanerobinson@gmail.com>

craig · 2018-04-12T21:16:08Z

Build succeeded

GitHub CI (Cockroach)

The recent cockroachdb#24529 made a change: it switched to starting the tr's input using the tr's ctx, instead of the ctx passed to tr.Start(). I think the change was inadvertent. It diverges from the pattern used by every other processor that implements RowSource: a proc's ctx is not supposed to be used by other procs - e.g. it would lead to contradictory logging tags. In this particular case of the tableReader, perhaps this is not the worst thing in the world: the tr's input is not a generic rowSource, it is precisely a RowFetcher (wrapped in layers and layers). So maybe it's not so bad for the RowFetcher in particular to use a tr's ctx. But now we've also introduced the InputStatCollector as a possible input, so we're really pretending that the tr is just like any other proc with an input. So, I went for uniformity. Release note: None

24793: distsqlrun: fix tableReader ctx r=andreimatei a=andreimatei The recent #24529 made a change: it switched to starting the tr's input using the tr's ctx, instead of the ctx passed to tr.Start(). I think the change was inadvertent. It diverges from the pattern used by every other processor that implements RowSource: a proc's ctx is not supposed to be used by other procs - e.g. it would lead to contradictory logging tags. In this particular case of the tableReader, perhaps this is not the worst thing in the world: the tr's input is not a generic rowSource, it is precisely a RowFetcher (wrapped in layers and layers). So maybe it's not so bad for the RowFetcher in particular to use a tr's ctx. But now we've also introduced the InputStatCollector as a possible input, so we're really pretending that the tr is just like any other proc with an input. So, I went for uniformity. Release note: None Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>

asubiotto requested review from a team and solongordon April 5, 2018 21:18

asubiotto mentioned this pull request Apr 9, 2018

sql: collect additional runtime query statistics #19476

Closed

asubiotto added 2 commits April 12, 2018 16:39

distsql: add InputStatCollector to collect simple input stats

b1fa0c3

Will be used by processors in future commits. Currently only counts the number of rows read but will be extended. Release note: None

distsql: collect tableReader stats and output to trace

6015b79

Just number of rows read for now. This information is only collected and output if the tracing span is recording. Part of work for #19476 Release note: None

craig Bot merged commit 6015b79 into cockroachdb:master Apr 12, 2018

andreimatei mentioned this pull request Apr 13, 2018

distsqlrun: fix tableReader ctx #24793

Merged

asubiotto mentioned this pull request May 3, 2018

tracing: add execution stats to span #25299

Closed

asubiotto deleted the pstats branch May 22, 2018 16:00

Conversation

asubiotto commented Apr 5, 2018

Uh oh!

cockroach-teamcity commented Apr 5, 2018

Uh oh!

tbg commented Apr 5, 2018

Uh oh!

asubiotto commented Apr 5, 2018

Uh oh!

tbg commented Apr 5, 2018

Uh oh!

petermattis commented Apr 6, 2018

Uh oh!

asubiotto commented Apr 9, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

asubiotto commented Apr 9, 2018

Uh oh!

andreimatei commented Apr 9, 2018

Uh oh!

tbg commented Apr 9, 2018

Uh oh!

andreimatei commented Apr 9, 2018

Uh oh!

asubiotto commented Apr 9, 2018

Uh oh!

tbg commented Apr 9, 2018

Uh oh!

andreimatei commented Apr 9, 2018

Uh oh!

knz commented Apr 10, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tbg commented Apr 10, 2018

Uh oh!

asubiotto commented Apr 11, 2018

Uh oh!

solongordon commented Apr 11, 2018

Uh oh!

asubiotto commented Apr 12, 2018

Uh oh!

asubiotto commented Apr 12, 2018

Uh oh!

craig Bot commented Apr 12, 2018

Build succeeded

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

asubiotto commented Apr 9, 2018 •

edited

Loading

knz commented Apr 10, 2018 •

edited

Loading