
Suppose I have something like

df1 = sqlContext.sql("select count(1) as ct1 from big_table_1")
df2 = sqlContext.sql("select count(1) as ct2 from big_table_2")
df1.show()
df2.show()

Within each table (whether a Hive table or a temporary view), the rows are counted in parallel across the worker nodes, assuming the underlying DataFrame has more than one partition.

Is there also a way I can get the two tables to count in parallel? Is this even possible in PySpark?
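To make the question concrete, here is a minimal sketch (not part of the original post) of what "counting the two tables in parallel" might look like using driver-side threads, assuming sqlContext and both tables exist. Spark can schedule jobs submitted from separate driver threads concurrently, subject to available executors and the scheduler mode:

from concurrent.futures import ThreadPoolExecutor

# Hypothetical sketch: run each count as its own Spark job from a separate
# driver thread so the scheduler can overlap them, resources permitting.
def run_count(query):
    # collect() triggers the job; each query returns a single-row result
    return sqlContext.sql(query).collect()

queries = [
    "select count(1) as ct1 from big_table_1",
    "select count(1) as ct2 from big_table_2",
]

with ThreadPoolExecutor(max_workers=2) as pool:
    ct1_rows, ct2_rows = pool.map(run_count, queries)

print(ct1_rows, ct2_rows)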

  • Why don't you do a union all if it's just a count? (A sketch of that follows these comments.) Commented Jun 21, 2018 at 20:17
  • It might not be; I'm just giving that as a minimal example. Commented Jun 21, 2018 at 20:47
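As a rough sketch of the union all suggestion from the comments (assuming both tables exist and the real workload really is just a count), the two counts can be folded into a single query so Spark plans them together; whether the two scans actually overlap then depends on the physical plan and available executors:

# Hypothetical sketch of the union all approach from the comments
df_both = sqlContext.sql("""
    select 'big_table_1' as tbl, count(1) as ct from big_table_1
    union all
    select 'big_table_2' as tbl, count(1) as ct from big_table_2
""")
df_both.show()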
