My main question concerns about performance. Looking at the code below:
query = """
SELECT Name, Id FROM Customers WHERE Id <> 1 ORDER BY Id
"""
df = spark.read.format(jdbc) \
.option("url", "connectionString") \
.option("user", user) \
.option("password", password) \
.option("numPartitions", 10) \
.option("partitionColumn", "Id") \
.option("lowerBound", lowerBound) \
.option("upperBound", upperBound) \
.option("dbtable", query) \
.load()
As far as I understand, this command will be sent to the DB process the query and return the value to spark.
Now considering the code below:
df = spark.read.jdbc(url = mssqlconnection,
table = "dbo.Customers",
properties = mssql_prop
).select(
f.col("Id"),
f.col("Name")
).where("Id = <> 1").orderBy(f.col("Id"))
I know that spark will load the entire table into memory and then execute the filters on the dataframe.
Finally, the last code snippet:
df = spark.read.jdbc(url = mssqlconnection,
table = "dbo.Customers",
properties = mssql_prop
)
final_df = spark_session.sql("""
SELECT Name, Id FROM Customers WHERE Id <> 1 ORDER BY Id
""")
I have 3 questions:
- Among the 3 codes, which one is the most correct. I always use the second approach, is this correct?
- What is the difference between using a spark.sql and using the commands directly in the dataframe according to the second code snipper?
- What is the ideal number of lines for me to start using spark? Is it worth using in queries that return less than 1 million rows?