
I find the existing Spark implementations for accessing a traditional database very restrictive and limited. In particular:

  1. Use of bind variables is not possible.
  2. Passing partitioning parameters to your generated SQL is very restricted.

Most bothersome is that I cannot customize how the partitioning takes place: all the API lets me do is name a partitioning column and supply upper and lower boundaries, and only a numeric column and numeric values are allowed. I understand I can hand my query to the database as a subquery and map my partitioning column onto a numeric value, but that produces very inefficient execution plans on the database side, where partition pruning (true Oracle table partitions) and/or use of indexes cannot be applied effectively.
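
For reference, the only partitioning hook I can find looks roughly like this (a sketch against the JdbcRDD constructor; the connection details, table, and column names are made up, and I show Scala here only because that is the language JdbcRDD itself is written in):

    import java.sql.DriverManager

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.JdbcRDD

    // The SQL must contain exactly two '?' placeholders, which Spark fills
    // with numeric lower/upper bounds for each partition -- the restriction
    // described above.
    def restrictedExample(sc: SparkContext): JdbcRDD[String] =
      new JdbcRDD(
        sc,
        () => DriverManager.getConnection("jdbc:oracle:thin:@//dbhost:1521/ORCL", "scott", "tiger"),
        "SELECT status FROM orders WHERE order_id >= ? AND order_id <= ?",
        lowerBound = 1L,          // must be numeric
        upperBound = 10000000L,   // must be numeric
        numPartitions = 10,
        mapRow = rs => rs.getString("status"))

Everything about how the data is split is reduced to those two numeric bounds.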

Is there any way for me to get around those restrictions? Can I customize my query better, or build my own partitioning logic? Ideally, I want to wrap my own custom JDBC code in an Iterator that can be executed lazily and does not load the entire result set into memory (the way JdbcRDD works).
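
To make the idea concrete, this is roughly the shape of the iterator I have in mind (just a plain-JDBC sketch; the class name is my own invention, and again I show Scala only for brevity):

    import java.sql.ResultSet

    // Pulls rows from the ResultSet one at a time instead of materialising
    // the whole result set; the cursor is closed once it is exhausted.
    class ResultSetIterator[T](rs: ResultSet, mapRow: ResultSet => T) extends Iterator[T] {
      private var hasMore = rs.next()

      override def hasNext: Boolean = hasMore

      override def next(): T = {
        if (!hasMore) throw new NoSuchElementException("result set exhausted")
        val row = mapRow(rs)
        hasMore = rs.next()
        if (!hasMore) rs.close()   // real code should also close the statement and connection
        row
      }
    }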

Oh - I prefer to do all this using Java, not Scala.

2 Answers


Take a look at the JdbcRDD source code. There's not much to it.

You can get the flexibility you're looking for by writing a custom RDD type based on this code, or even by subclassing it and overriding getPartitions() and compute().
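
As a rough illustration (a sketch, not Spark's actual API beyond RDD, Partition, and TaskContext; the class and parameter names are invented), a custom RDD along these lines could carry one set of bind values per partition:

    import java.sql.{Connection, ResultSet}

    import scala.reflect.ClassTag

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    // One partition per set of bind values supplied by the caller.
    case class BindPartition(idx: Int, binds: Seq[AnyRef]) extends Partition {
      override def index: Int = idx
    }

    class CustomJdbcRDD[T: ClassTag](
        sc: SparkContext,
        createConnection: () => Connection,
        sql: String,                         // any SQL, any number of '?' placeholders
        bindsPerPartition: Seq[Seq[AnyRef]],
        mapRow: ResultSet => T)
      extends RDD[T](sc, Nil) {

      // Custom partitioning logic: here, simply one partition per bind set.
      override def getPartitions: Array[Partition] =
        bindsPerPartition.zipWithIndex
          .map { case (binds, i) => BindPartition(i, binds) }
          .toArray

      // Runs on the executors and streams the result set lazily.
      override def compute(split: Partition, context: TaskContext): Iterator[T] = {
        val part = split.asInstanceOf[BindPartition]
        val conn = createConnection()
        val stmt = conn.prepareStatement(sql)
        part.binds.zipWithIndex.foreach { case (v, i) => stmt.setObject(i + 1, v) }
        val rs = stmt.executeQuery()

        new Iterator[T] {
          private var hasMore = rs.next()
          override def hasNext: Boolean = hasMore
          override def next(): T = {
            val row = mapRow(rs)
            hasMore = rs.next()
            if (!hasMore) { rs.close(); stmt.close(); conn.close() } // simplistic cleanup
            row
          }
        }
      }
    }

Because you control both getPartitions() and the SQL, the partitions can be made to line up with real Oracle table partitions, and bind variables fall out naturally from the PreparedStatement.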


3 Comments

I was already looking into customizing JdbcRDD, but initially ran aground trying to do it in Java. However, I gained a somewhat better understanding of Scala (the implementation language of most of Spark, including JdbcRDD) and picked up the basics of converting between the two languages' types. I was then able to extend the base RDD into my own class supporting both very granular partitioning control and bind variables! So neat that I feel it should be part of the standard package.
Right :) I suggested looking at JdbcRDD because it's (relatively) easy to read and customize. It's what I've chosen in the past to get something working quickly. However, the future of JDBC in Spark is presumably the DataFrame-oriented implementation in JDBCRDD. If you're looking to improve the "standard package", I would suggest focusing your efforts there :)
Yes, thanks. It was actually your answer that made me more persistent in trying to get it to work. It is surprisingly simple, and it helped me gain a better understanding of how Spark works. I also think your recent comment may have answered the next few questions on my mind: I was wondering how I can gather schema information and send it back to the driver and/or subsequent tasks efficiently. I think the DataFrame-friendly JDBCRDD might hold the answer?

I studied both JdbcRDD and the new Spark SQL Data Sources API. Neither of them supports your requirements.

Most likely this will have to be your own implementation. I recommend writing it against the new Data Sources API rather than subclassing JdbcRDD, which became obsolete in Spark 1.3.
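
Roughly, the skeleton for that route looks like this (a sketch against the Spark 1.3+ sources API; the relation class and the "query" option are placeholders of my own, and the scan is left as a stub):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // Relation that would delegate to your own partition-aware JDBC RDD.
    class CustomJdbcRelation(
        override val sqlContext: SQLContext,
        query: String)
      extends BaseRelation with TableScan {

      // In real code, derive this from the database metadata (ResultSetMetaData).
      override def schema: StructType =
        StructType(StructField("value", StringType, nullable = true) :: Nil)

      override def buildScan(): RDD[Row] =
        // Stub: plug in the custom RDD (bind variables, custom partitions) built from `query` here.
        sqlContext.sparkContext.emptyRDD[Row]
    }

    // Spark looks up a class named DefaultSource in the package you pass to .format()/USING.
    class DefaultSource extends RelationProvider {
      override def createRelation(
          sqlContext: SQLContext,
          parameters: Map[String, String]): BaseRelation =
        new CustomJdbcRelation(sqlContext, parameters("query"))
    }

Column pruning and filter pushdown can then be added by mixing in PrunedFilteredScan instead of TableScan.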

