
I find the existing Spark implementations for accessing a traditional database very restrictive and limited. In particular:

  1. Use of bind variables is not possible.
  2. Passing partitioning parameters to your generated SQL is very restricted.

Most bothersome is that I cannot customize how the partitioning takes place: all the API lets me do is name a partitioning column and supply upper and lower boundaries, and only a numeric column and numeric values are allowed. I understand I can hand my query to the database as a subquery and map my partitioning column onto a numeric value, but that produces very inefficient execution plans on the database side, where partition pruning (true Oracle table partitions) and/or use of indexes cannot be applied effectively.
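
For reference, the only partitioning hook I can find looks roughly like this (a sketch against the JdbcRDD constructor; the connection details, table, and column names are made up, and I show Scala here only because that is the language JdbcRDD itself is written in):

    import java.sql.DriverManager

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.JdbcRDD

    // The SQL must contain exactly two '?' placeholders, which Spark fills
    // with numeric lower/upper bounds for each partition -- the restriction
    // described above.
    def restrictedExample(sc: SparkContext): JdbcRDD[String] =
      new JdbcRDD(
        sc,
        () => DriverManager.getConnection("jdbc:oracle:thin:@//dbhost:1521/ORCL", "scott", "tiger"),
        "SELECT status FROM orders WHERE order_id >= ? AND order_id <= ?",
        lowerBound = 1L,          // must be numeric
        upperBound = 10000000L,   // must be numeric
        numPartitions = 10,
        mapRow = rs => rs.getString("status"))

Everything about how the data is split is reduced to those two numeric bounds.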

Is there any way for me to get around those restrictions? Can I customize my query better, or build my own partitioning logic? Ideally, I want to wrap my own custom JDBC code in an Iterator that can be executed lazily and does not load the entire result set into memory (the way JdbcRDD works).
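
To make the idea concrete, this is roughly the shape of the iterator I have in mind (just a plain-JDBC sketch; the class name is my own invention, and again I show Scala only for brevity):

    import java.sql.ResultSet

    // Pulls rows from the ResultSet one at a time instead of materialising
    // the whole result set; the cursor is closed once it is exhausted.
    class ResultSetIterator[T](rs: ResultSet, mapRow: ResultSet => T) extends Iterator[T] {
      private var hasMore = rs.next()

      override def hasNext: Boolean = hasMore

      override def next(): T = {
        if (!hasMore) throw new NoSuchElementException("result set exhausted")
        val row = mapRow(rs)
        hasMore = rs.next()
        if (!hasMore) rs.close()   // real code should also close the statement and connection
        row
      }
    }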

Oh - I prefer to do all this using Java, not Scala.

2 Answers


Take a look at the JdbcRDD source code. There's not much to it.

You can get the flexibility you're looking for by writing a custom RDD type based on this code, or even by subclassing it and overriding getPartitions() and compute().
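
As a rough illustration (a sketch, not Spark's actual API beyond RDD, Partition, and TaskContext; the class and parameter names are invented), a custom RDD along these lines could carry one set of bind values per partition:

    import java.sql.{Connection, ResultSet}

    import scala.reflect.ClassTag

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    // One partition per set of bind values supplied by the caller.
    case class BindPartition(idx: Int, binds: Seq[AnyRef]) extends Partition {
      override def index: Int = idx
    }

    class CustomJdbcRDD[T: ClassTag](
        sc: SparkContext,
        createConnection: () => Connection,
        sql: String,                         // any SQL, any number of '?' placeholders
        bindsPerPartition: Seq[Seq[AnyRef]],
        mapRow: ResultSet => T)
      extends RDD[T](sc, Nil) {

      // Custom partitioning logic: here, simply one partition per bind set.
      override def getPartitions: Array[Partition] =
        bindsPerPartition.zipWithIndex
          .map { case (binds, i) => BindPartition(i, binds) }
          .toArray

      // Runs on the executors and streams the result set lazily.
      override def compute(split: Partition, context: TaskContext): Iterator[T] = {
        val part = split.asInstanceOf[BindPartition]
        val conn = createConnection()
        val stmt = conn.prepareStatement(sql)
        part.binds.zipWithIndex.foreach { case (v, i) => stmt.setObject(i + 1, v) }
        val rs = stmt.executeQuery()

        new Iterator[T] {
          private var hasMore = rs.next()
          override def hasNext: Boolean = hasMore
          override def next(): T = {
            val row = mapRow(rs)
            hasMore = rs.next()
            if (!hasMore) { rs.close(); stmt.close(); conn.close() } // simplistic cleanup
            row
          }
        }
      }
    }

Because you control both getPartitions() and the SQL, the partitions can be made to line up with real Oracle table partitions, and bind variables fall out naturally from the PreparedStatement.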


3 Comments

I was already looking into customizing JdbcRDD, but initially ran aground trying to do it in Java. However, I gained a somewhat better understanding of Scala (the implementation language of most of Spark, including JdbcRDD) and picked up the basics of converting between the two languages' types. I was then able to extend the base RDD into my own class supporting both very granular partitioning control and bind variables! So neat that I feel it should be part of the standard package.
Right :) I suggested looking at JdbcRDD because it's (relatively) easy to read and customize. It's what I've chosen in the past to get something working quickly. However, the future of JDBC in Spark is presumably the DataFrame-oriented implementation in JDBCRDD. If you're looking to improve the "standard package", I would suggest focusing your efforts there :)
Yes, thanks. It was actually your answer that made me more persistent in trying to get it to work. It is surprisingly simple, and it helped me gain a better understanding of how Spark works. I also think your recent comment may have answered the next few questions on my mind: I was wondering how I can gather schema information and send it back to the driver and/or subsequent tasks efficiently. I think the DataFrame-friendly JDBCRDD might hold the answer?

I studied both JdbcRDD and the new Spark SQL Data Sources API. Neither of them supports your requirements.

Most likely this will have to be your own implementation. I recommend writing it against the new Data Sources API rather than subclassing JdbcRDD, which became obsolete in Spark 1.3.
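
Roughly, the skeleton for that route looks like this (a sketch against the Spark 1.3+ sources API; the relation class and the "query" option are placeholders of my own, and the scan is left as a stub):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // Relation that would delegate to your own partition-aware JDBC RDD.
    class CustomJdbcRelation(
        override val sqlContext: SQLContext,
        query: String)
      extends BaseRelation with TableScan {

      // In real code, derive this from the database metadata (ResultSetMetaData).
      override def schema: StructType =
        StructType(StructField("value", StringType, nullable = true) :: Nil)

      override def buildScan(): RDD[Row] =
        // Stub: plug in the custom RDD (bind variables, custom partitions) built from `query` here.
        sqlContext.sparkContext.emptyRDD[Row]
    }

    // Spark looks up a class named DefaultSource in the package you pass to .format()/USING.
    class DefaultSource extends RelationProvider {
      override def createRelation(
          sqlContext: SQLContext,
          parameters: Map[String, String]): BaseRelation =
        new CustomJdbcRelation(sqlContext, parameters("query"))
    }

Column pruning and filter pushdown can then be added by mixing in PrunedFilteredScan instead of TableScan.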

