
Conversation

@AngersZhuuuu
Contributor

What changes were proposed in this pull request?

The current Spark Thrift Server has to work around many Hive version problems and implements many unused features inherited from HiveServer2. We should consider implementing the Thrift server with Spark's own code.

Changes:

  1. Construct RowSet directly from StructType and Row (see the sketch after this list)
  2. To avoid conflicts with the old Thrift server, pass a HiveConf for execution and move all HiveMetaStore-related actions into the DelegationTokenHandler class
  3. Remove actions that are only needed for Hive execution
  4. Add a hive-service API dependency for beeline
  5. Implement classes to handle version problems, such as org.apache.spark.sql.hive.thriftserver.cli.Type and org.apache.spark.sql.hive.thriftserver.utils.LogHelper/VariableSubstitution
  6. Remove the ThriftserverShimUtils class and the directories for Hive versions v1.2.1 and v2.3.5
  7. Base the implementation on PROTOCOL_VERSION_V9. Since the protocol is backwards compatible, when the protocol version is updated we can simply replace the old version and add the new methods
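For change 1, a minimal sketch of what constructing a row-based RowSet directly from Spark's Row and StructType could look like. The class and method names are illustrative, assuming the generated hive-service-rpc thrift classes; the PR's actual RowSet.scala may differ.

import org.apache.hive.service.rpc.thrift._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Hypothetical sketch: accumulate TRows straight from Spark Rows,
// guided by the Spark StructType instead of a Hive TableSchema.
class SparkRowBasedSet(schema: StructType) {
  private val rows = new java.util.ArrayList[TRow]()

  def addRow(sparkRow: Row): this.type = {
    val tRow = new TRow()
    schema.fields.indices.foreach { i =>
      tRow.addToColVals(toColumnValue(schema.fields(i).dataType, sparkRow, i))
    }
    rows.add(tRow)
    this
  }

  private def toColumnValue(dt: DataType, row: Row, i: Int): TColumnValue = dt match {
    case BooleanType =>
      val v = new TBoolValue()
      if (!row.isNullAt(i)) v.setValue(row.getBoolean(i))
      TColumnValue.boolVal(v)
    case IntegerType =>
      val v = new TI32Value()
      if (!row.isNullAt(i)) v.setValue(row.getInt(i))
      TColumnValue.i32Val(v)
    // ... remaining primitive types elided; complex types are
    // rendered as strings (see the review discussion below).
    case _ =>
      val v = new TStringValue()
      if (!row.isNullAt(i)) v.setValue(String.valueOf(row.get(i)))
      TColumnValue.stringVal(v)
  }

  def toTRowSet(startOffset: Long): TRowSet = new TRowSet(startOffset, rows)
}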

Why are the changes needed?

Get rid of the HiveServer2 API's limitations and make the server fit Spark better.

Does this PR introduce any user-facing change?

NO

How was this patch tested?

@AngersZhuuuu AngersZhuuuu changed the title [SPARK-29018][SQL] Implement Spark Thrift Server with it's own code base on PROTOCOL_VERSION_V9 [WIP][SPARK-29018][SQL] Implement Spark Thrift Server with it's own code base on PROTOCOL_VERSION_V9 Sep 8, 2019
@AngersZhuuuu
Contributor Author

gentle ping @juliuszsompolski @wangyum

@juliuszsompolski
Contributor

It's a lot to digest @AngersZhuuuu. To help see what the actual changes are, could you separate out any parts that just remove old code / move code around from the actual changes?

@juliuszsompolski
Contributor

I think it would be best if you first had a PR with all the changes done as much as possible in existing code directories, without moving stuff around, and only then after that is committed, reorganize and move around the code in another PR.

@AngersZhuuuu
Contributor Author

I think it would be best if you first had a PR with all the changes done as much as possible in existing code directories, without moving stuff around, and only then after that is committed, reorganize and move around the code in another PR.

Good idea, it's too hard to follow all the points I changed in one PR.
I will create a new one.

@gatorsmile
Member

@AngersZhuuuu @wangyum Could you address the comment and move it forward?

@AngersZhuuuu
Contributor Author

@gatorsmile Thanks for your attention.
Since there have been a lot of changes to sql/hive-thriftserver recently, I paused this and worked on other issues. I will follow @juliuszsompolski's comment and adapt this to the current code.

@juliuszsompolski
Contributor

Hi @AngersZhuuuu - one more question. Why did you decide to base it on V9, instead of the newest V11. Are there any particular problems with supporting the latest ones?

@AngersZhuuuu
Contributor Author

Hi @AngersZhuuuu - one more question. Why did you decide to base it on V9, instead of the newest V11. Are there any particular problems with supporting the latest ones?

It just didn't occur to me when I was doing it that there was a higher protocol version..., so I based it directly on hive-2.3.5's protocol v9.

But it's easy to upgrade the protocol version, since the protocol is backwards compatible. I will restart this based on the new version.
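For reference, backwards compatibility here means the server answers with min(client version, server version), as HiveServer2 does in OpenSession. A minimal sketch, assuming the generated TProtocolVersion enum from hive-service-rpc (not code from this PR):

import org.apache.hive.service.rpc.thrift.TProtocolVersion

// The server never speaks a newer protocol than the client requested,
// so raising the server's maximum version keeps old clients working.
def negotiateVersion(
    client: TProtocolVersion,
    server: TProtocolVersion): TProtocolVersion = {
  if (client.getValue < server.getValue) client else server
}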

Contributor

@juliuszsompolski juliuszsompolski left a comment


Thanks @AngersZhuuuu !
I made a very high-altitude pass over this PR.
What I understand you are doing is

  • removing the Hive v1.2.1 and v2.3.5 code and replacing it with the equivalent code translated into scala, removing dead code that was not needed.
  • cutting away the Hive middle layer. E.g. SparkSQLSessionManager used to extend Hive's SessionManager etc. etc. Now all the classes are implemented directly.
  • in the process, removing any code that actually depended on Hive.

A few open questions that I have:

  • Do we want to translate to scala also all the code that Spark does not modify at all, or just keep it in Java? (I don't have an answer to that - on one hand, when it stays in Java, we can more easily port future fixes from Hive; on the other hand, we may not be doing that anyway, in which case unifying the code base into scala would make sense)
    -- BTW: Did you use some automatic tool to do the translation from Java to Scala?
  • Could you describe where the old code was depending on Hive, and what concrete changes needed to be made to cut that dependency? Was it only removing dead code that is overridden by Spark anyway?
  • Could you list what dead code that was relevant only to Hive was removed?

@@ -15,7 +15,7 @@
* limitations under the License.
*/

package org.apache.spark.sql.hive.thriftserver
Contributor


maybe fold it into SparkMetadataOperation, now that we have a common superclass?


<dependency>
<groupId>${hive.group}</groupId>
<artifactId>hive-service</artifactId>
Contributor


I thought that this was cutting away the dependency on Hive?

Also, does it need to be defined in root pom.xml? (genuinely asking: I know little about maven)

Contributor


Ok, I see from the description that it is actually for beeline and the Hive JDBC client... It would be nice if those dependency changes could be inside hive-thriftserver, but like I said, I don't know much about maven...

Contributor Author


I thought that this was cutting away the dependency on Hive?

Also, does it need to be defined in root pom.xml? (genuinely asking: I know little about maven)

We just import this and use it for beeline, so we don't need to spend more time dealing with version dependencies.

</dependency>
<dependency>
<groupId>${hive.group}</groupId>
<artifactId>hive-service</artifactId>
Contributor


Or is it that we now depend on it just so that Hive JDBC client can work?

Contributor Author


Or is it that we now depend on it just so that Hive JDBC client can work?

Yes, beeline relies on hive-beeline -> hive-jdbc -> hive-service.
So I just import this part for beeline, and this part follows the Hive version.
So it is OK to do this.

@juliuszsompolski
Contributor

Looking into it a bit more, the PR description actually does describe most of the changes, but it's just hard to work through in one huge PR...

Maybe separately have 4 PRs for:

  1. all the gen/ code generated anew in sql/hive-thriftserver/src/gen - it has a new package now, so it will remain unused. (bonus: could we make these thrift files be generated on the fly? seems possible: https://stackoverflow.com/questions/18767986/how-can-i-compile-all-thrift-files-thrift-as-a-maven-phase)
  2. on top of that, all the Thriftserver code that is just translated from Java to Scala without changes - it has a new package now, so it will remain unused.
  3. on top of that, all the classes that are actually modified - and at this point stop compiling the v1.2.1 and v2.3.5 dirs and use the new code.
  4. At the end, on top of that delete the v1.2.1 and v2.3.5 dirs, and only then redirect the build to actually use the new code.

For reviewing we could have 1, 2 and 3 separately on top of each other, plus a PR that does 1+2+3 that can be used for testing the new code. Mostly 3 will be what needs to be reviewed. Then 1, 2, 3 can be committed in quick succession (but I think it's worth keeping them in 3 separate commits).
4 can be done afterwards for cleanup.

@AngersZhuuuu
Contributor Author

AngersZhuuuu commented Oct 21, 2019

@juliuszsompolski
Good suggestion. How about building a totally new Thrift server in a new package (your suggestion) to make sure the whole process is OK? Then, after everything is done, rename it back to replace the old one and remove the 1.2.1/2.3.5 dirs.

I will focus on the code changes that remove the Hive dependency and unused code,
and give a detailed job list.

While doing this, I found that some code is public and shared by 1.2.1/2.3.5, but most classes differ somewhat between 1.2.1 and 2.3.5, with big changes from 1.2.1 to 2.3.5.

About importing hive-service: it is acceptable, since we implement all the Thrift server code in Spark/Scala.
hive-jdbc has to use hive-service code, and since we need to remove the Hive dependency we must delete the 1.2.1/2.3.5 folders. We just import what hive-jdbc needs so it can be used by bin/beeline. It won't impact our Thrift server's code, and we don't need to worry about Hive version changes.

@juliuszsompolski
Contributor

juliuszsompolski commented Oct 21, 2019

Good suggestion. Maybe start building it completely separately in spark/sql/spark-thriftserver (or just spark/sql/thriftserver), and in the end maybe even keep it named like that, given that we won't have ties to "hive" anymore.

@AngersZhuuuu
Contributor Author

Good suggestion. Maybe start building it completely separately in spark/sql/spark-thriftserver (or just spark/sql/thriftserver), and in the end maybe even keep it named like that, given that we won't have ties to "hive" anymore.

Yeah, we can just copy the existing test files from hive-thriftserver to make sure the code is right, without breaking hive-thriftserver's development, and finally backport them to spark-thriftserver.

Doing it like this may require more time to deal with new-module issues.
It slows down the development process by making it step by step, but makes it clearer.

cc @gatorsmile @wangyum Any advice about the development process?

@AngersZhuuuu
Contributor Author

AngersZhuuuu commented Oct 22, 2019

2. on top of that, all the Thriftserver code that is just translated from Java to Scala without changes - it has a new package now, so it will remain unused.

Translating all the code seems too heavy. And since we build it with protocol v11, we can't directly apply it to v1.2.1/v2.3.5.
I prefer to first build the framework based on protocol v11, i.e. the common classes and processing like Operation and SessionManager, and then fill in the details (see the sketch below).
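As an illustration only, such a framework class could start as small as the sketch below; the names and state transitions are hypothetical, loosely mirroring HiveServer2's Operation lifecycle rather than this PR's code.

object OperationState extends Enumeration {
  val Initialized, Running, Finished, Error, Closed = Value
}

// Skeleton of a protocol-agnostic Operation base class; concrete operations
// (execute statement, get tables, ...) only fill in runInternal.
abstract class Operation(val statement: String) {
  @volatile private var state = OperationState.Initialized

  protected def runInternal(): Unit

  final def run(): Unit = {
    state = OperationState.Running
    try {
      runInternal()
      state = OperationState.Finished
    } catch {
      case e: Throwable =>
        state = OperationState.Error
        throw e
    }
  }

  def close(): Unit = { state = OperationState.Closed }
  def getState: OperationState.Value = state
}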

And one important point, what do you think about

Construct RowSet with StructType and Row

This works well in our internal self-built Thrift server.

https://github.com/AngersZhuuuu/spark/blob/SPARK-29018/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/cli/RowSet.scala

Contributor

@juliuszsompolski juliuszsompolski left a comment


Re: RowSet directly with Spark rows: +1, but see comments.

if (sparkRow.isNullAt(curCol)) {
row += null
} else {
addNonNullColumnValue(sparkRow, row, curCol)
Contributor


Using RowSet with Spark Rows is fine, but if you now just do resultRowSet.addRow(sparkRow), then it will e.g. not convert Array, Map, Struct or Interval to String, like addNonNullColumnValue did, so you still need to add these conversions somewhere. Probably would be easiest with a Project with casts added on top of the query if needed.
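A rough sketch of that suggestion, assuming the public Dataset/Column API (the helper name is made up): cast complex-typed columns to string in one extra projection before filling the RowSet, so addRow never sees an Array, Map, Struct or Interval value.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Spark can cast any type to StringType, so one projection on top of the
// query replaces the per-value conversion that addNonNullColumnValue did.
def stringifyComplexColumns(df: DataFrame): DataFrame = {
  val projected = df.schema.fields.map { f =>
    f.dataType match {
      case _: ArrayType | _: MapType | _: StructType | CalendarIntervalType =>
        col(f.name).cast(StringType).as(f.name)
      case _ => col(f.name)
    }
  }
  df.select(projected: _*)
}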

}

def getResultSetSchema: TableSchema = resultSchema
def getResultSetSchema: StructType = {
Contributor


TableSchema in Hive format still needs to be returned as TTableSchema in TGetResultSetMetadataResp. If you return a Spark StructType here, how do you get TableSchema out of it for the result?
Edit: Ok, I found that you implemented a SchemaMapper for that at the thrift layer. 👍

Contributor Author

@AngersZhuuuu AngersZhuuuu Oct 24, 2019


TableSchema in Hive format still needs to be returned as TTableSchema in TGetResultSetMetadataResp. If you return a Spark StructType here, how do you get TableSchema out of it for the result?
Edit: Ok, I found that you implemented a SchemaMapper for that at the thrift layer. 👍

It converts with the same rules as in SparkExecuteStatementOperation; a sketch is below.
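For illustration, a hedged sketch of such a StructType-to-TTableSchema mapping, assuming the generated hive-service-rpc classes; the actual SchemaMapper rules may differ.

import org.apache.hive.service.rpc.thrift._
import org.apache.spark.sql.types._

object SchemaMapperSketch {
  def toTTableSchema(schema: StructType): TTableSchema = {
    val tSchema = new TTableSchema()
    schema.fields.zipWithIndex.foreach { case (field, pos) =>
      val desc = new TColumnDesc()
      desc.setColumnName(field.name)
      desc.setPosition(pos)
      val typeDesc = new TTypeDesc()
      typeDesc.addToTypes(
        TTypeEntry.primitiveEntry(new TPrimitiveTypeEntry(toTTypeId(field.dataType))))
      desc.setTypeDesc(typeDesc)
      tSchema.addToColumns(desc)
    }
    tSchema
  }

  // Complex types are reported as STRING, matching the value-side conversion.
  private def toTTypeId(dt: DataType): TTypeId = dt match {
    case BooleanType => TTypeId.BOOLEAN_TYPE
    case IntegerType => TTypeId.INT_TYPE
    case LongType    => TTypeId.BIGINT_TYPE
    case DoubleType  => TTypeId.DOUBLE_TYPE
    case _           => TTypeId.STRING_TYPE
  }
}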

By the way, we may need to implement Spark's own JDBC client to remove the imports of hive-beeline, hive-jdbc and hive-service.

See Google Doc https://docs.google.com/document/d/1oRgzt83vCGykkb45VjzvTEdRC_-ympRLB24pUWqk82o/edit#

@juliuszsompolski
Contributor

  1. on top of that, all the Thriftserver code that is just translated from Java to Scala without changes - it has a new package now, so it will remain unused.

Translating all the code seems too heavy. And since we build it with protocol v11, we can't directly apply it to v1.2.1/v2.3.5.
I prefer to first build the framework based on protocol v11, i.e. the common classes and processing like Operation and SessionManager, and then fill in the details.

I am confused. It seems that in this PR you already did the translation from Java to Scala of all the Hive Thriftserver code? I assume it was mostly somehow autogenerated?
A lot of these classes don't really change much from Hive to Spark, except for the mechanical translation from Java to Scala, renaming package names, and removing the dependence on some Hive objects. Do I see that correctly?
I would commit these in a separate PR, to separate "mechanical" changes from places that were actually rewritten.

I am also not sure whether we should translate those from Java to Scala at all. Maybe we should keep these in Java, and only implement the Spark-specific stuff in scala, removing the Java Hive stuff that is not needed anymore. So e.g.

  • Keep CLIService.java, ThriftCLIService.java, ThriftHttpServlet.java, ... - all things that don't really get modified by Spark in Java
  • Do Spark specific implementation in scala, and remove the no longer needed Java thriftserver impl. E.g. (Spark)ExecuteStatementOperation.scala and remove ExecuteStatementOperation.java; (Spark)OperationManager.scala and remove OperationManager.java etc. etc.
    OR
    Translate all these Java files to scala, like is done in this PR?

I think I would keep them in Java to avoid potential errors in translation, and also to make it easier to see if we want to port any future Hive changes to them.
@gatorsmile @rxin - what do you think?

@AngersZhuuuu
Contributor Author

  • Keep CLIService.java, ThriftCLIService.java, ThriftHttpServlet.java, ... - all things that don't really get modified by Spark in Java
  • Do Spark specific implementation in scala, and remove the no longer needed Java thriftserver impl. E.g. (Spark)ExecuteStatementOperation.scala and remove ExecuteStatementOperation.java; (Spark)OperationManager.scala and remove OperationManager.java etc. etc.
    OR
    Translate all these Java files to scala, like is done in this PR?

It's OK to do it like this, since base classes like ThriftCLIService and ThriftHttpServlet do not rely on Hive. We can keep them as Java code, and then implement Operations, HiveSession, SessionManager, etc. in scala.

@AngersZhuuuu
Contributor Author

I am confused. It seems that in this PR you already did the translation from Java to Scala of all the Hive Thriftserver code? I assume it was mostly somehow autogenerated?
A lot of these classes don't really change much from Hive to Spark, except for the mechanical translation from Java to Scala, renaming package names, and removing the dependence on some Hive objects. Do I see that correctly?
I would commit these in a separate PR, to separate "mechanical" changes from places that were actually rewritten.

Not just translated: the first pass was a direct translation. However, some code had Hive version conflicts that I needed to resolve, to make sure it can work with the different Hive versions 1.2.1/2.3.5. Some basic Hive classes are stable, which is the best news. But finally, we want to implement a Thrift server that is completely independent of Hive.

@AmplabJenkins

Can one of the admins verify this patch?

@therealJacobWu

Hi, I am curious to ask, is this still in progress?

@juliuszsompolski
Contributor

juliuszsompolski commented Mar 17, 2020

@Jacobwu123 There has been work in a fork in https://github.com/spark-thriftserver/spark-thriftserver

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jun 26, 2020
@github-actions github-actions bot closed this Jun 27, 2020
