Task table query optimizations #12537
kkonstantine wants to merge 6 commits into apache:0.22.1 from
Conversation
Thanks for your first PR, @kkonstantine! The performance of this API has been an issue recently, so thanks for trying to solve it! 🙂 Your overall approach looks good. If I understand correctly, you have changed the payload column type so that subfields can be queried directly. Compared to #12404, I have some concerns:
It would be nice if you could share some more detailed results of your performance evaluation (ideally metadata storage footprint, memory footprint of this API, and query times), as that would help us compare the two approaches. Overall, I think there is some merit to combining them.
Just to add context: we did have some discussions among the team about whether to pursue the JSONB approach, and we came to the local decision to pull the values out into their own columns for compatibility reasons. We are unsure which exact versions of which databases are being used by all of the deployments in the wild. While I'm certain that some people are actively using versions of MySQL/Maria/Postgres that support JSONB, I'm pretty sure there are people out there who are not (our tests using Derby are another case of that). So, basically, "lowest common denominator" thinking is what led to the decision to pursue the route of #12404.

The idea of using JSONB is definitely very nice and, as Kashif mentions, could potentially simplify our lives in the future. It makes sense to me to have a metadata driver that assumes data is stored as JSONB for deployments running on a database that supports it, so I think that's great. I just primarily worry about the compatibility aspects and how we would roll out a switch to JSONB.

One takeaway from this exercise that I think is worth noting for everybody: any field that we expose in a Druid system table probably deserves some form of special handling when it comes to reading it from the DB. Whether that is JSONB or a dedicated column can be an implementation detail, but I think we should set a target that all columns exposed from system tables can be accessed in a cheap, resource-efficient manner.
This pull request has been marked as stale due to 60 days of inactivity.
This pull request/issue has been closed due to lack of activity. If you think that
Posting an early and rough draft of the work I started a few weeks ago, given that #12404 was opened recently and was just brought to my attention. This PR and #12404 share similar goals regarding the optimization of queries on the task endpoints.
My approach is a bit different. I evaluated the tradeoffs among three options: adding new columns for specific subfields (as #12404 proposes), writing queries that deserialize the payload on the fly, and the approach taken here.
In this PR the suggestion is to keep querying the payload columns, but to select specific subfields after effectively changing the type of those columns from BYTEA to JSONB. The change currently applies to the Postgres backend; a port to MySQL should be possible as well, while support for the non-production Derby backend does not seem straightforward at the moment (which partly explains why this PR lacks tests and is not ready for a full review). The PR also adds a query parameter to two task endpoints so that callers can request only the specific fields required by the web console.
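To make the idea concrete, here is a rough sketch (not code from this PR) of what the Postgres-specific change could look like: a one-time migration of the payload column from BYTEA to JSONB, and a query that projects only the subfields the console needs. The table/column names (`druid_tasks`, `payload`) match Druid's defaults, but the exact SQL, subfield names, and the `build_subfield_query` helper are assumptions for illustration.

```python
# DDL to migrate the existing column in place. BYTEA stores the UTF-8 JSON
# bytes of the task payload, so we decode before casting to JSONB.
# (Postgres-only; Derby has no JSONB type, which is the compatibility concern.)
MIGRATE_PAYLOAD_DDL = """
ALTER TABLE druid_tasks
  ALTER COLUMN payload TYPE JSONB
  USING convert_from(payload, 'UTF8')::JSONB
"""

# With JSONB in place, a query can project individual subfields instead of
# shipping the full payload and deserializing it server-side. The subfield
# names here ("type", "dataSource") are illustrative.
SELECT_TASK_SUBFIELDS = """
SELECT id,
       payload->>'type'       AS task_type,
       payload->>'dataSource' AS datasource
FROM druid_tasks
WHERE active = TRUE
"""

def build_subfield_query(fields):
    """Hypothetical helper: build a projection over an allow-listed set of
    top-level JSONB payload keys (callers must validate `fields`; values are
    interpolated, not parameterized, so never pass user input directly)."""
    cols = ", ".join(f"payload->>'{f}' AS {f}" for f in fields)
    return f"SELECT id, {cols} FROM druid_tasks"
```

A query-parameter-driven endpoint could then call something like `build_subfield_query(["type", "dataSource"])` to fetch only what the web console asks for, rather than the whole payload.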
Regarding the tradeoffs of this approach, here's a quick list of things I can think of:
PROS:
CONS:
This PR currently lacks code to migrate task history, as well as tests. I would add both if we think this is a plausible approach. Also, any new names are not necessarily final or well-thought-out suggestions.
Note that this PR is based off of the 0.22.1 branch, but I will rebase on top of master if any of the changes here prove valuable. Also, please note that this is my first contribution to the Druid project, and this PR probably reflects that fact in various ways.
I'd appreciate some high-level comments on the direction of this early draft and on whether it's an approach we could follow, or combine with the one proposed in #12404.
cc @imply-cheddar @AmatyaAvadhanula @xvrl
This PR has: