[WIP][SPARK-29018][SQL] Implement Spark Thrift Server with it's own code base on PROTOCOL_VERSION_V9 #25721
Conversation
gentle ping @juliuszsompolski @wangyum
It's a lot to digest @AngersZhuuuu. To help see what the actual changes are, could you separate out any parts that are just removing old code / moving code around from the actual changes?
I think it would be best if you first had a PR with all the changes done as much as possible in the existing code directories, without moving stuff around, and only then, after that is committed, reorganize and move the code around in another PR.
Good idea; otherwise it's too hard to track all the points I changed.
@AngersZhuuuu @wangyum Could you address the comment and move it forward?
@gatorsmile Thanks for your attention.
Hi @AngersZhuuuu - one more question. Why did you decide to base it on V9 instead of the newest V11? Are there any particular problems with supporting the latest ones?
It just didn't occur to me while I was doing it that there was a higher protocol version, so I based it directly on hive-2.3.5's protocol v9. But it's easy to upgrade the protocol version, since it's backward compatible. I will restart this based on the new version.
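For context, HiveServer2-style clients and servers negotiate the session protocol in OpenSession by settling on the lower of the two versions, which is why a server built on a newer protocol keeps working for older clients. A minimal sketch of that rule, assuming the TProtocolVersion enum from the Hive 2.x thrift bindings:

```scala
import org.apache.hive.service.rpc.thrift.TProtocolVersion

// The server answers OpenSession with min(client, server), so a server
// built against V11 can still serve clients that only speak V9 or older.
def negotiateProtocol(
    client: TProtocolVersion,
    server: TProtocolVersion): TProtocolVersion = {
  if (client.getValue < server.getValue) client else server
}
```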
juliuszsompolski left a comment
Thanks @AngersZhuuuu !
I made a very high-altitude pass over this PR.
What I understand you are doing is
- removing the Hive v1.2.1 and v2.3.5 code and replacing it with the equivalent code translated into scala, removing dead code that was not needed.
- cutting away the Hive middle layer. E.g. SparkSQLSessionManager used to extend Hive's SessionManager etc. etc. Now all the classes are implemented directly.
- in the process, removing any code that actually depended on Hive.
A few open questions that I have:
- Do we want to translate into Scala all the code that Spark does not modify at all, or just keep it in Java? (I don't have an answer to that - on one hand, when it stays in Java, we can more easily port future fixes from Hive; on the other hand, we may not be doing that anyway, and then there's the benefit of unifying the code base into Scala.)
  - BTW: did you use some automatic tool to do the translation from Java to Scala?
- Could you describe where the old code depended on Hive, and what concrete changes were needed to cut that dependency? Was it only removing dead code that is overridden by Spark anyway?
- Could you list what dead code that was relevant only to Hive was removed?
@@ -15,7 +15,7 @@
 * limitations under the License.
 */

package org.apache.spark.sql.hive.thriftserver
Maybe fold it into SparkMetadataOperation, now that we have a common superclass?
<dependency>
  <groupId>${hive.group}</groupId>
  <artifactId>hive-service</artifactId>
I thought that this was cutting away the dependency on Hive?
Also, does it need to be defined in the root pom.xml? (genuinely asking: I know little about Maven)
Ok, I see from the description that it is actually for beeline and the Hive JDBC client... It would be nice if those dependency changes could live inside hive-thriftserver, but like I said, I don't know much about Maven...
> I thought that this was cutting away the dependency on Hive? Also, does it need to be defined in the root pom.xml?

We just import this and use it for beeline, so we don't need to spend more time dealing with version dependencies.
</dependency>
<dependency>
  <groupId>${hive.group}</groupId>
  <artifactId>hive-service</artifactId>
Or is it that we now depend on it just so that the Hive JDBC client can work?
> Or is it that we now depend on it just so that the Hive JDBC client can work?

Yes, beeline relies on hive-beeline -> hive-jdbc -> hive-service, so I just import this part for beeline, and it follows the Hive version. So it is OK to do this.
Looking into it a bit more, the PR description actually does describe most of the changes, but it's just hard to grasp in one huge PR... Maybe have 4 separate PRs for:
For reviewing we could have 1, 2 and 3 separately on top of each other, plus a PR that does 1+2+3 that can be used for testing the new code. Mostly 3 will be what needs to be reviewed. Then 1, 2, 3 can be committed in quick succession (but I think it's worth keeping them as 3 separate commits).
@juliuszsompolski I will focus on the code changes that resolve the Hive dependency and remove unused code. While doing this, some code is public and used by both 1.2.1/2.3.5, but most classes differ somewhat between 1.2.1 and 2.3.5 - there was a big change from 1.2.1 to 2.3.5. As for importing hive-service, it is acceptable, since we implement all the thrift server code in Spark/Scala.
Good suggestion. Maybe start building it completely separate in
Yeah, we can just copy the existing test files from hive-thrift to make sure the code is right, and don't break
Doing it like this may require more time to deal with new module issues. cc @gatorsmile @wangyum Any advice about the development process?
Translating all the code seems too heavy. And if we build it with protocol v11, we can't directly apply it to v1.2.1/v2.3.5. And one important point: what do you think about
This works well in our internal self-built thrift server.
juliuszsompolski left a comment
Re: RowSet directly with Spark rows: +1, but see comments.
if (sparkRow.isNullAt(curCol)) {
  row += null
} else {
  addNonNullColumnValue(sparkRow, row, curCol)
}
Using RowSet with Spark Rows is fine, but if you now just do resultRowSet.addRow(sparkRow), then it will e.g. not convert Array, Map, Struct or Interval to String, like addNonNullColumnValue did, so you still need to add these conversions somewhere. It would probably be easiest with a Project with casts added on top of the query if needed.
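For illustration, here is a rough sketch of the kind of conversion addNonNullColumnValue performed (the helper name toRowSetValue and its shape are hypothetical, not the PR's actual code): complex and interval values are rendered as strings before going into the RowSet, while other values pass through.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Hypothetical helper: Array, Map, Struct and Interval columns must be
// serialized to strings for the thrift RowSet; other types pass through.
def toRowSetValue(row: Row, ordinal: Int, dataType: DataType): Any =
  dataType match {
    case _: ArrayType | _: MapType | _: StructType | CalendarIntervalType =>
      String.valueOf(row.get(ordinal))
    case _ =>
      row.get(ordinal)
  }
```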
def getResultSetSchema: TableSchema = resultSchema
def getResultSetSchema: StructType = {
TableSchema in Hive format still needs to be returned as TTableSchema in TGetResultSetMetadataResp. If you return a Spark StructType here, how do you get TableSchema out of it for the result?
Edit: Ok, I found that you implemented a SchemaMapper for that at the thrift layer. 👍
> TableSchema in Hive format still needs to be returned as TTableSchema in TGetResultSetMetadataResp. If you return a Spark StructType here, how do you get TableSchema out of it for the result?
It's converted with the same rule as in SparkExecuteStatementOperation.
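For illustration, a minimal sketch of what such a schema mapping at the thrift layer could look like. This is not the PR's actual SchemaMapper: it assumes the Hive 2.x generated thrift classes, and the field.dataType-to-TTypeId mapping is collapsed to a single STRING case for brevity.

```scala
import org.apache.hive.service.rpc.thrift._
import org.apache.spark.sql.types.StructType

def toTTableSchema(schema: StructType): TTableSchema = {
  val tSchema = new TTableSchema()
  schema.fields.zipWithIndex.foreach { case (field, pos) =>
    // A real mapper would translate field.dataType to the matching TTypeId.
    val typeDesc = new TTypeDesc()
    typeDesc.addToTypes(
      TTypeEntry.primitiveEntry(new TPrimitiveTypeEntry(TTypeId.STRING_TYPE)))
    val col = new TColumnDesc()
    col.setColumnName(field.name)
    col.setTypeDesc(typeDesc)
    col.setPosition(pos)
    tSchema.addToColumns(col)
  }
  tSchema
}
```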
By the way, we may need to implement Spark's own JDBC client to remove the imports of hive-beeline, hive-jdbc and hive-service.
See the Google Doc: https://docs.google.com/document/d/1oRgzt83vCGykkb45VjzvTEdRC_-ympRLB24pUWqk82o/edit#
I am confused. It seems that in this PR you already did the translation from Java to Scala of all the Hive Thriftserver code? I assume it was mostly somehow autogenerated? I am also not sure whether we should translate those from Java to Scala at all. Maybe we should keep these in Java, and only implement the Spark-specific stuff in Scala, removing the Java Hive stuff that is not needed anymore. So e.g.
I think I would keep them in Java to avoid potential errors in translation, and also to see more easily if we want to port any future Hive changes to them.
It's OK to do it like this, since base classes like
It was not just translated - well, the first pass was a direct translation. However, some code has parts that conflict between Hive versions, and I need to resolve these to make sure it works with the different Hive versions 1.2.1/2.3.5. The good news is that some basic Hive classes are stable. But ultimately, we want to implement a thrift server that is completely independent of Hive.
Can one of the admins verify this patch?
Hi, I am curious to ask, is this still in progress?
@Jacobwu123 There has been work in a fork at https://github.com/spark-thriftserver/spark-thriftserver
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
The current Spark Thrift Server has to work around many Hive version problems and implements too many unused features from HiveServer2. Maybe we should implement the thrift server with Spark's own code.
Changes:
- hive-service API code dependency for beeline
- org.apache.spark.sql.hive.thriftserver.cli.Type
- org.apache.spark.sql.hive.thriftserver.utils.LogHelper/VariableSubstitution

Why are the changes needed?
Get rid of the HiveServer2 API's limits and fit Spark better.
Does this PR introduce any user-facing change?
NO
How was this patch tested?