
[WIP] Use BigTable Columns as range keys#753

Closed
gouthamve wants to merge 3 commits into cortexproject:master from gouthamve:cortex-714

Conversation

@gouthamve
Contributor

@gouthamve gouthamve commented Mar 16, 2018

@tomwilkie Still have TODOs that I want to tackle in refactorings. Also I used SHA256 to hash, let me know if that's okay.

One curious thing to check is the prefix queries, I haven't found anything other than regex to fix that.

/cc @bboreham



See cortexproject#714 for motivation.

One thing to note: unlike existing schemas, this one changes the
structure of the stored data. That forced some backwards-incompatible
changes, because the `StorageClient` interface carries no time
information in `QueryPages`, making it impossible to know whether we
need to range over the row key or over the columns to retrieve the data.

For maintaining backwards compat, use `chunk.storage-client=gcpv1`.

Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
@tomwilkie
Contributor

Thanks Goutham! Will take a look tomorrow.

@bboreham
Contributor

@gouthamve please give the motivation for this change.

@gouthamve
Contributor Author

Sorry, I explained everything in the commit message but forgot to post it here too.

The motivation comes from #714: this would mean accessing fewer rows per query.

srv *bttest.Server
name string

version int
Contributor


I'm generally less of a fan of "version" fields, and more of a fan of feature flags, mainly because feature flags are more self-explanatory.

// this format, so we need to do a proper migration.
rowKey := hashValue + separator + string(rangeValue)
hasher := sha256.New()
hasher.Write([]byte(hashValue))
Contributor


SHA256 is pretty slow, and we don't need its cryptographic properties. I'd probably have picked FNV (fast and pretty uniform), and then appended the row key to the final hashed value to guarantee uniqueness.
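A minimal sketch of that suggestion, assuming FNV-1a over the hash value with the original key appended; the function name and key layout are illustrative, not code from the PR:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// buildRowKey hashes the hash value with FNV-1a (fast, non-cryptographic)
// and appends the original key, so two distinct keys can never collapse
// into the same row even if the hashes collide.
func buildRowKey(hashValue string) string {
	h := fnv.New64a()
	h.Write([]byte(hashValue))
	return fmt.Sprintf("%016x-%s", h.Sum64(), hashValue)
}

func main() {
	fmt.Println(buildRowKey("fakeMetric/day17241"))
}
```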

func (s *storageClient) NewWriteBatch() chunk.WriteBatch {
return bigtableWriteBatch{
func (s *storageClientV2) NewWriteBatch() chunk.WriteBatch {
return bigtableWriteBatchV2{
Contributor


Do we really need two implementations? Or can we abstract away row and column key generation into a set of functions that switch off the flag?
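A single implementation along those lines might look roughly like this; everything here (names, the separator, the flag) is illustrative, not code from the PR:

```go
package main

import "fmt"

const separator = "\x00"

// keysFor sketches the suggested refactor: one storage client where a
// flag chooses the keying scheme, instead of two parallel implementations.
func keysFor(hashValue, rangeValue string, columnRangeKeys bool) (rowKey, columnKey string) {
	if columnRangeKeys {
		// New scheme: the range value becomes the column qualifier.
		return hashValue, rangeValue
	}
	// Old scheme: the range value is folded into the row key,
	// and a single fixed column holds the cell value.
	return hashValue + separator + rangeValue, "c"
}

func main() {
	row, col := keysFor("metricHash", "chunkID", true)
	fmt.Println(row, col)
}
```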


if len(query.RangeValuePrefix) > 0 {
rowRange = bigtable.PrefixRange(query.HashValue + separator + string(query.RangeValuePrefix))
rOpts = append(rOpts, bigtable.RowFilter(bigtable.ColumnFilter(string(query.RangeValuePrefix)+".*"))) // TODO: Check again and anchor.
Contributor


Hmm yeah this is not particularly desirable. @mbrukman do you know of a better way?

Contributor


@garye, what are your thoughts on this?


I'm a bit confused, this is doing a ColumnFilter with a value prefix?

Contributor Author


Yep. We want to filter by prefix.


I was just confused because this is setting up a filter on column names, but the RangeValuePrefix implies that the filter should be used for values instead.

Contributor


This is just a Cortex-ism; our data model at this level is super simple, consisting of three-tuples of (hash value, range value, 'cell' value). We always specify the hash value (mapped to the row key in Bigtable), want to do range queries over the range value, and equality filtering on the 'cell' value.

Sorry for the confusion.


Makes sense, thanks for the explanation!

I'm not sure there's a significantly better way to do this... It's probably worth doing this filtering on the Bigtable side, but if the amount of data this ColumnFilter would filter out is small, consider measuring it against doing it all client-side.

Contributor


I wonder, would it be valid to apply the same trick @bcotton added to the row ranges: use a range query from prefix to prefix+null?
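For reference, the general form of that trick is a half-open range [prefix, successor(prefix)), where the successor increments the prefix's last incrementable byte; this is what the Go Bigtable client's PrefixRange computes for row keys. A stand-alone sketch (illustrative, not code from the PR):

```go
package main

import "fmt"

// prefixSuccessor returns the smallest string lexicographically greater
// than every string starting with prefix, so [prefix, successor) scans
// exactly the keys with that prefix.
func prefixSuccessor(prefix string) string {
	b := []byte(prefix)
	// Increment the last byte that can be incremented, dropping any
	// trailing 0xFF bytes; an all-0xFF prefix has no finite upper bound.
	for i := len(b) - 1; i >= 0; i-- {
		if b[i] != 0xFF {
			b[i]++
			return string(b[:i+1])
		}
	}
	return ""
}

func main() {
	fmt.Println(prefixSuccessor("chunk-")) // '-' + 1 is '.', so "chunk."
}
```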

}, bigtable.RowFilter(bigtable.FamilyFilter(columnFamily)))
hasher := sha256.New()
hasher.Write([]byte(query.HashValue))
hashValue := string(hasher.Sum(nil))
Contributor


Would be good to share this with the code to generate the row key for writes.


for i := range val {
val[i].Column = strings.TrimPrefix(val[i].Column, columnFamily+":")
// TODO: Hacky hacky ^
Contributor

@tomwilkie tomwilkie Mar 17, 2018


You could push this into the readbatch.RangeValue function (if you want to).
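That suggestion could look roughly like this; the function name and column family value are illustrative, not code from the PR:

```go
package main

import (
	"fmt"
	"strings"
)

// Assumed column family name, for illustration only.
const columnFamily = "f"

// rangeValueFromColumn strips the "family:" prefix that Bigtable
// prepends to column names, inside the read-batch accessor rather than
// by post-processing every item in QueryPages.
func rangeValueFromColumn(column string) []byte {
	return []byte(strings.TrimPrefix(column, columnFamily+":"))
}

func main() {
	fmt.Println(string(rangeValueFromColumn("f:rangeValue1"))) // prints "rangeValue1"
}
```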

Name: c.batch[0].tableName,
ProvisionedRead: 100000,
ProvisionedWrite: 100000,
}))
Contributor


Not a huge fan of using the mock store to get the expected result; I think I'd prefer having the test case spell out explicitly what we expect to return.

Contributor Author


Hmm, it actually increases the test verbosity by a fair amount. I'd rather add test cases to increase confidence in the in-mem version, but it's your call.

@tomwilkie
Contributor

Done a first pass; let me know when you've addressed the comments.

Also, we should test this out and get some performance results before we merge; it's not guaranteed to be an improvement.

Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
@gouthamve
Contributor Author

@tomwilkie PTAL. I've addressed most of the comments.

I've made it a single implementation as much as possible, but the QueryPages implementation differs enough to warrant a new type, I guess. I'm not a big fan of queryFn, as I think it hurts readability; for example, if it is:

func QueryPages() {
  return sc.queryFn()
}

Where is this queryFn passed? Do we want to have large fn definitions in the struct initialiser?

For the tests, I've tested using the in-mem/mock implementation, as in TSDB, where the mock has its own tests to make sure it works as intended. I can do that, or just test the implementation independently. Your call.

@bboreham
Contributor

bboreham commented Apr 3, 2018

What is the status of this PR?

@tomwilkie
Contributor

tomwilkie commented Apr 3, 2018 via email

@bboreham
Contributor

How did the testing go?

@csmarchbanks
Contributor

Looking through this, it appears the change is not backwards compatible right now? Did I miss something, or are there plans to use a -from flag?

Also, if you need some more testing done, FreshTracks would be happy to help out!

@tomwilkie
Contributor

We dropped the ball on this: we did some testing, and couldn't explain a difference (an increase) in the size of the Bigtables. It's still a WIP.

@tomwilkie tomwilkie changed the title Use BigTable Columns as range keys [WIP] Use BigTable Columns as range keys Jun 16, 2018
@tomwilkie
Contributor

This is blocked on pushing the schema switching up a layer (to the chunk store) so we can have a migration path, as discussed on a recent community call.


6 participants