
Conversation

@AngersZhuuuu
Contributor

What changes were proposed in this pull request?

The current Spark Thrift Server has to work around many Hive version problems and implements many unused features inherited from HiveServer2. We should consider implementing the Thrift server with Spark's own code.

Changes:

  1. Construct RowSet directly from StructType and Row (see the sketch after this list)
  2. To avoid conflicts with the old Thrift server, pass a HiveConf for execution and move all HiveMetaStore-related actions into the DelegationTokenHandler class
  3. Remove actions that are only needed for Hive execution
  4. Add a hive-service API dependency for beeline
  5. Implement classes to handle version problems, such as org.apache.spark.sql.hive.thriftserver.cli.Type and org.apache.spark.sql.hive.thriftserver.utils.LogHelper/VariableSubstitution
  6. Remove the ThriftserverShimUtils class and the directories for Hive versions v1.2.1 and v2.3.5
  7. Base the implementation on PROTOCOL_VERSION_V9. Since the protocol is backwards compatible, when the protocol version is updated we can simply replace the old version and add the new methods
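For change 1, a minimal sketch of what constructing a row-based RowSet directly from Spark's Row and StructType could look like. The class and method names are illustrative, assuming the generated hive-service-rpc thrift classes; the PR's actual RowSet.scala may differ.

import org.apache.hive.service.rpc.thrift._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Hypothetical sketch: accumulate TRows straight from Spark Rows,
// guided by the Spark StructType instead of a Hive TableSchema.
class SparkRowBasedSet(schema: StructType) {
  private val rows = new java.util.ArrayList[TRow]()

  def addRow(sparkRow: Row): this.type = {
    val tRow = new TRow()
    schema.fields.indices.foreach { i =>
      tRow.addToColVals(toColumnValue(schema.fields(i).dataType, sparkRow, i))
    }
    rows.add(tRow)
    this
  }

  private def toColumnValue(dt: DataType, row: Row, i: Int): TColumnValue = dt match {
    case BooleanType =>
      val v = new TBoolValue()
      if (!row.isNullAt(i)) v.setValue(row.getBoolean(i))
      TColumnValue.boolVal(v)
    case IntegerType =>
      val v = new TI32Value()
      if (!row.isNullAt(i)) v.setValue(row.getInt(i))
      TColumnValue.i32Val(v)
    // ... remaining primitive types elided; complex types are
    // rendered as strings (see the review discussion below).
    case _ =>
      val v = new TStringValue()
      if (!row.isNullAt(i)) v.setValue(String.valueOf(row.get(i)))
      TColumnValue.stringVal(v)
  }

  def toTRowSet(startOffset: Long): TRowSet = new TRowSet(startOffset, rows)
}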

Why are the changes needed?

Get rid of the HiveServer2 API's limitations and make the server fit Spark better.

Does this PR introduce any user-facing change?

NO

How was this patch tested?

@AngersZhuuuu AngersZhuuuu changed the title [SPARK-29018][SQL] Implement Spark Thrift Server with it's own code base on PROTOCOL_VERSION_V9 [WIP][SPARK-29018][SQL] Implement Spark Thrift Server with it's own code base on PROTOCOL_VERSION_V9 Sep 8, 2019
@AngersZhuuuu
Contributor Author

gentle ping @juliuszsompolski @wangyum

@juliuszsompolski
Contributor

It's a lot to digest @AngersZhuuuu. To help see what the actual changes are, could you separate out any parts that just remove old code / move code around from the actual changes?

@juliuszsompolski
Contributor

I think it would be best if you first had a PR with all the changes done as much as possible in existing code directories, without moving stuff around, and only then after that is committed, reorganize and move around the code in another PR.

@AngersZhuuuu
Contributor Author

I think it would be best if you first had a PR with all the changes done as much as possible in existing code directories, without moving stuff around, and only then after that is committed, reorganize and move around the code in another PR.

Good idea, it's too hard to follow all the points I changed in one PR.
I will create a new one.

@gatorsmile
Member

@AngersZhuuuu @wangyum Could you address the comment and move it forward?

@AngersZhuuuu
Contributor Author

@gatorsmile Thanks for your attention.
Since there have been a lot of changes to sql/hive-thriftserver recently, I paused this and worked on other issues. I will follow @juliuszsompolski's comment and adapt this to the current code.

@juliuszsompolski
Contributor

Hi @AngersZhuuuu - one more question. Why did you decide to base it on V9, instead of the newest V11. Are there any particular problems with supporting the latest ones?

@AngersZhuuuu
Contributor Author

Hi @AngersZhuuuu - one more question. Why did you decide to base it on V9, instead of the newest V11. Are there any particular problems with supporting the latest ones?

It just didn't occur to me when I was doing it that there was a higher protocol version..., so I based it directly on hive-2.3.5's protocol v9.

But it's easy to upgrade the protocol version, since the protocol is backwards compatible. I will restart this based on the new version.
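For reference, backwards compatibility here means the server answers with min(client version, server version), as HiveServer2 does in OpenSession. A minimal sketch, assuming the generated TProtocolVersion enum from hive-service-rpc (not code from this PR):

import org.apache.hive.service.rpc.thrift.TProtocolVersion

// The server never speaks a newer protocol than the client requested,
// so raising the server's maximum version keeps old clients working.
def negotiateVersion(
    client: TProtocolVersion,
    server: TProtocolVersion): TProtocolVersion = {
  if (client.getValue < server.getValue) client else server
}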

Contributor

@juliuszsompolski juliuszsompolski left a comment


Thanks @AngersZhuuuu !
I made a very high-altitude pass over this PR.
What I understand you are doing is

  • removing the Hive v1.2.1 and v2.3.5 code and replacing it with the equivalent code translated into scala, removing dead code that was not needed.
  • cutting away the Hive middle layer. E.g. SparkSQLSessionManager used to extend Hive's SessionManager etc. etc. Now all the classes are implemented directly.
  • in the process, removing any code that actually depended on Hive.

A few open questions that I have:

  • Do we want to translate to scala also all the code that Spark does not modify at all, or just keep it in Java? (I don't have an answer to that - on one hand, when it stays in Java, we can more easily port future fixes from Hive; on the other hand, we may not be doing that anyway, in which case unifying the code base into scala would make sense)
    -- BTW: Did you use some automatic tool to do the translation from Java to Scala?
  • Could you describe where the old code was depending on Hive, and what concrete changes needed to be made to cut that dependency? Was it only removing dead code that is overridden by Spark anyway?
  • Could you list what dead code that was relevant only to Hive was removed?

@@ -15,7 +15,7 @@
* limitations under the License.
*/

package org.apache.spark.sql.hive.thriftserver
Contributor


maybe fold it into SparkMetadataOperation, now that we have a common superclass?


<dependency>
<groupId>${hive.group}</groupId>
<artifactId>hive-service</artifactId>
Contributor


I thought that this was cutting away the dependency on Hive?

Also, does it need to be defined in root pom.xml? (genuinely asking: I know little about maven)

Contributor


Ok, I see from the description that it is actually for beeline and the Hive JDBC client... It would be nice if those dependency changes could be inside hive-thriftserver, but like I said, I don't know much about maven...

Contributor Author


I thought that this was cutting away the dependency on Hive?

Also, does it need to be defined in root pom.xml? (genuinely asking: I know little about maven)

We just import this and use it for beeline, so we don't need to spend more time dealing with version dependencies.

</dependency>
<dependency>
<groupId>${hive.group}</groupId>
<artifactId>hive-service</artifactId>
Contributor


Or is it that we now depend on it just so that Hive JDBC client can work?

Contributor Author


Or is it that we now depend on it just so that Hive JDBC client can work?

Yes, beeline relies on hive-beeline -> hive-jdbc -> hive-service.
So I just import this part for beeline, and this part follows the Hive version.
So it is OK to do this.

@juliuszsompolski
Contributor

Looking into it a bit more, the PR description actually does describe most of the changes, but it's just hard to work through in one huge PR...

Maybe separately have 4 PRs for:

  1. all the gen/ code generated anew in sql/hive-thriftserver/src/gen - it has a new package now, so it will remain unused. (bonus: could we make these thrift files be generated on the fly? seems possible: https://stackoverflow.com/questions/18767986/how-can-i-compile-all-thrift-files-thrift-as-a-maven-phase)
  2. on top of that, all the Thriftserver code that is just translated from Java to Scala without changes - it has a new package now, so it will remain unused.
  3. on top of that, all the classes that are actually modified - and at this point stop compiling the v1.2.1 and v2.3.5 dirs and use the new code.
  4. At the end, on top of that delete the v1.2.1 and v2.3.5 dirs, and only then redirect the build to actually use the new code.

For reviewing we could have 1, 2 and 3 separately on top of each other, plus a PR that does 1+2+3 that can be used for testing the new code. Mostly 3 will be what needs to be reviewed. Then 1, 2, 3 can be committed in quick succession (but I think it's worth keeping them in 3 separate commits).
4 can be done afterwards for cleanup.

@AngersZhuuuu
Contributor Author

AngersZhuuuu commented Oct 21, 2019

@juliuszsompolski
Good suggestion. How about building a totally new Thrift server in a new package (your suggestion) to make sure the whole process is OK? Then, after everything is done, rename it back to replace the old one and remove the 1.2.1/2.3.5 dirs.

I will focus on the code changes that remove the Hive dependency and unused code,
and give a detailed job list.

While doing this, I found that some code is public and shared by 1.2.1/2.3.5, but most classes differ somewhat between 1.2.1 and 2.3.5, with big changes from 1.2.1 to 2.3.5.

About importing hive-service: it is acceptable, since we implement all the Thrift server code in Spark/Scala.
hive-jdbc has to use hive-service code, and since we need to remove the Hive dependency we must delete the 1.2.1/2.3.5 folders. We just import what hive-jdbc needs so it can be used by bin/beeline. It won't impact our Thrift server's code, and we don't need to worry about Hive version changes.

@juliuszsompolski
Contributor

juliuszsompolski commented Oct 21, 2019

Good suggestion. Maybe start building it completely separately in spark/sql/spark-thriftserver (or just spark/sql/thriftserver), and in the end maybe even keep it named like that, given that we won't have ties to "hive" anymore.

@AngersZhuuuu
Contributor Author

Good suggestion. Maybe start building it completely separately in spark/sql/spark-thriftserver (or just spark/sql/thriftserver), and in the end maybe even keep it named like that, given that we won't have ties to "hive" anymore.

Yeah, we can just copy the existing test files from hive-thriftserver to make sure the code is right, without breaking hive-thriftserver's development, and finally backport them to spark-thriftserver.

Doing it like this may require more time to deal with new-module issues.
It slows down the development process by making it step by step, but makes it clearer.

cc @gatorsmile @wangyum Any advice about the development process?

@AngersZhuuuu
Contributor Author

AngersZhuuuu commented Oct 22, 2019

2. on top of that, all the Thriftserver code that is just translated from Java to Scala without changes - it has a new package now, so it will remain unused.

Translating all the code seems too heavy. And since we build it with protocol v11, we can't directly apply it to v1.2.1/v2.3.5.
I prefer to first build the framework based on protocol v11, i.e. the common classes and processing like Operation and SessionManager, and then fill in the details (see the sketch below).
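As an illustration only, such a framework class could start as small as the sketch below; the names and state transitions are hypothetical, loosely mirroring HiveServer2's Operation lifecycle rather than this PR's code.

object OperationState extends Enumeration {
  val Initialized, Running, Finished, Error, Closed = Value
}

// Skeleton of a protocol-agnostic Operation base class; concrete operations
// (execute statement, get tables, ...) only fill in runInternal.
abstract class Operation(val statement: String) {
  @volatile private var state = OperationState.Initialized

  protected def runInternal(): Unit

  final def run(): Unit = {
    state = OperationState.Running
    try {
      runInternal()
      state = OperationState.Finished
    } catch {
      case e: Throwable =>
        state = OperationState.Error
        throw e
    }
  }

  def close(): Unit = { state = OperationState.Closed }
  def getState: OperationState.Value = state
}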

And one important point, what do you think about

Construct RowSet with StructType and Row

This works well in our internal self-built Thrift server.

https://github.com/AngersZhuuuu/spark/blob/SPARK-29018/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/cli/RowSet.scala

Contributor

@juliuszsompolski juliuszsompolski left a comment


Re: RowSet directly with Spark rows: +1, but see comments.

if (sparkRow.isNullAt(curCol)) {
row += null
} else {
addNonNullColumnValue(sparkRow, row, curCol)
Contributor


Using RowSet with Spark Rows is fine, but if you now just do resultRowSet.addRow(sparkRow), then it will e.g. not convert Array, Map, Struct or Interval to String, like addNonNullColumnValue did, so you still need to add these conversions somewhere. Probably would be easiest with a Project with casts added on top of the query if needed.
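A rough sketch of that suggestion, assuming the public Dataset/Column API (the helper name is made up): cast complex-typed columns to string in one extra projection before filling the RowSet, so addRow never sees an Array, Map, Struct or Interval value.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Spark can cast any type to StringType, so one projection on top of the
// query replaces the per-value conversion that addNonNullColumnValue did.
def stringifyComplexColumns(df: DataFrame): DataFrame = {
  val projected = df.schema.fields.map { f =>
    f.dataType match {
      case _: ArrayType | _: MapType | _: StructType | CalendarIntervalType =>
        col(f.name).cast(StringType).as(f.name)
      case _ => col(f.name)
    }
  }
  df.select(projected: _*)
}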

}

def getResultSetSchema: TableSchema = resultSchema
def getResultSetSchema: StructType = {
Contributor


TableSchema in Hive format still needs to be returned as TTableSchema in TGetResultSetMetadataResp. If you return a Spark StructType here, how do you get TableSchema out of it for the result?
Edit: Ok, I found that you implemented a SchemaMapper for that at the thrift layer. 👍

Contributor Author

@AngersZhuuuu AngersZhuuuu Oct 24, 2019


TableSchema in Hive format still needs to be returned as TTableSchema in TGetResultSetMetadataResp. If you return a Spark StructType here, how do you get TableSchema out of it for the result?
Edit: Ok, I found that you implemented a SchemaMapper for that at the thrift layer. 👍

It converts with the same rules as in SparkExecuteStatementOperation; a sketch is below.
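For illustration, a hedged sketch of such a StructType-to-TTableSchema mapping, assuming the generated hive-service-rpc classes; the actual SchemaMapper rules may differ.

import org.apache.hive.service.rpc.thrift._
import org.apache.spark.sql.types._

object SchemaMapperSketch {
  def toTTableSchema(schema: StructType): TTableSchema = {
    val tSchema = new TTableSchema()
    schema.fields.zipWithIndex.foreach { case (field, pos) =>
      val desc = new TColumnDesc()
      desc.setColumnName(field.name)
      desc.setPosition(pos)
      val typeDesc = new TTypeDesc()
      typeDesc.addToTypes(
        TTypeEntry.primitiveEntry(new TPrimitiveTypeEntry(toTTypeId(field.dataType))))
      desc.setTypeDesc(typeDesc)
      tSchema.addToColumns(desc)
    }
    tSchema
  }

  // Complex types are reported as STRING, matching the value-side conversion.
  private def toTTypeId(dt: DataType): TTypeId = dt match {
    case BooleanType => TTypeId.BOOLEAN_TYPE
    case IntegerType => TTypeId.INT_TYPE
    case LongType    => TTypeId.BIGINT_TYPE
    case DoubleType  => TTypeId.DOUBLE_TYPE
    case _           => TTypeId.STRING_TYPE
  }
}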

By the way, we may need to implement Spark's own JDBC client to remove the imports of hive-beeline, hive-jdbc and hive-service.

See Google Doc https://docs.google.com/document/d/1oRgzt83vCGykkb45VjzvTEdRC_-ympRLB24pUWqk82o/edit#

@juliuszsompolski
Contributor

  1. on top of that, all the Thriftserver code that is just translated from Java to Scala without changes - it has a new package now, so it will remain unused.

Translating all the code seems too heavy. And since we build it with protocol v11, we can't directly apply it to v1.2.1/v2.3.5.
I prefer to first build the framework based on protocol v11, i.e. the common classes and processing like Operation and SessionManager, and then fill in the details.

I am confused. It seems that in this PR you already did the translation from Java to Scala of all the Hive Thriftserver code? I assume it was mostly somehow autogenerated?
A lot of these classes don't really change much from Hive to Spark, except for the mechanical translation from Java to Scala, renaming package names, and removing the dependence on some Hive objects. Do I see that correctly?
I would commit these in a separate PR, to separate "mechanical" changes from places that were actually rewritten.

I am also not sure whether we should translate those from Java to Scala at all. Maybe we should keep these in Java, and only implement the Spark-specific stuff in scala, removing the Java Hive stuff that is not needed anymore. So e.g.

  • Keep CLIService.java, ThriftCLIService.java, ThriftHttpServlet.java, ... - all things that don't really get modified by Spark in Java
  • Do Spark specific implementation in scala, and remove the no longer needed Java thriftserver impl. E.g. (Spark)ExecuteStatementOperation.scala and remove ExecuteStatementOperation.java; (Spark)OperationManager.scala and remove OperationManager.java etc. etc.
    OR
    Translate all these Java files to scala, like is done in this PR?

I think I would keep them in Java to avoid potential errors in translation, and also to make it easier to see if we want to port any future Hive changes to them.
@gatorsmile @rxin - what do you think?

@AngersZhuuuu
Contributor Author

  • Keep CLIService.java, ThriftCLIService.java, ThriftHttpServlet.java, ... - all things that don't really get modified by Spark in Java
  • Do Spark specific implementation in scala, and remove the no longer needed Java thriftserver impl. E.g. (Spark)ExecuteStatementOperation.scala and remove ExecuteStatementOperation.java; (Spark)OperationManager.scala and remove OperationManager.java etc. etc.
    OR
    Translate all these Java files to scala, like is done in this PR?

It's OK to do it like this, since base classes like ThriftCLIService and ThriftHttpServlet do not rely on Hive. We can keep them as Java code, and then implement Operations, HiveSession, SessionManager, etc. in scala.

@AngersZhuuuu
Contributor Author

I am confused. It seems that in this PR you already did the translation from Java to Scala of all the Hive Thriftserver code? I assume it was mostly somehow autogenerated?
A lot of these classes don't really change much from Hive to Spark, except for the mechanical translation from Java to Scala, renaming package names, and removing the dependence on some Hive objects. Do I see that correctly?
I would commit these in a separate PR, to separate "mechanical" changes from places that were actually rewritten.

Not just translated: the first pass was a direct translation. However, some code had Hive version conflicts that I needed to resolve, to make sure it can work with the different Hive versions 1.2.1/2.3.5. Some basic Hive classes are stable, which is the best news. But finally, we want to implement a Thrift server that is completely independent of Hive.

@AmplabJenkins

Can one of the admins verify this patch?

@therealJacobWu

Hi, I am curious to ask, is this still in progress?

@juliuszsompolski
Contributor

juliuszsompolski commented Mar 17, 2020

@Jacobwu123 There has been work in a fork in https://github.com/spark-thriftserver/spark-thriftserver

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jun 26, 2020
@github-actions github-actions bot closed this Jun 27, 2020
