I'm trying to query a table and use the result (a dataframe) of that query as the IN clause of another query.

From the first query I have the dataframe below:

+-----------------+
|key              |
+-----------------+
|   10000000000004|
|   10000000000003|
|   10000000000008|
|   10000000000009|
|   10000000000007|
|   10000000000006|
|   10000000000010|
|   10000000000002|
+-----------------+ 

And now I want to run a query like the one below using the values of that dataframe dynamically instead of hard coding the values:

spark.sql("""select country from table1 where key in (10000000000004, 10000000000003, 10000000000008, 10000000000009, 10000000000007, 10000000000006, 10000000000010, 10000000000002)""").show()

I tried the following, however it didn't work:

df = spark.sql("""select key from table0 """)
a = df.select("key").collect()
spark.sql("""select country from table1 where key in ({0})""".format(a)).show()

Can somebody help me?

  • This should be done using a join. Commented Jan 8, 2020 at 15:51

1 Answer

You should use an (inner) join between two data frames to get the countries you would like. See my example:

# Create a list of countries with their Ids
countries = [('Netherlands', 1), ('France', 2), ('Germany', 3), ('Belgium', 4)]

# Create a list of Ids
numbers = [(1,), (2,)]  

# Create two data frames
df_countries = spark.createDataFrame(countries, ['CountryName', 'Id'])
df_numbers = spark.createDataFrame(numbers, ['Id'])

The data frames look like the following:

df_countries:

+-----------+---+
|CountryName| Id| 
+-----------+---+
|Netherlands|  1|
|     France|  2|
|    Germany|  3|
|    Belgium|  4|
+-----------+---+

df_numbers:
+---+
| Id|
+---+
|  1|
|  2|
+---+

You can join them as follows:

# Join the data frames (not the raw Python lists) on the shared Id column
df_countries.join(df_numbers, on='Id', how='inner').show()

Resulting in:

+---+-----------+
| Id|CountryName|
+---+-----------+
|  1|Netherlands|
|  2|     France|
+---+-----------+
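If only the columns of the first data frame are needed, which is what a `where key in (...)` filter returns, a left semi join may be a closer match than an inner join. The sketch below assumes the `df_countries` and `df_numbers` data frames from above; the helper name `filter_by_ids` is hypothetical and only there for illustration.

```python
def filter_by_ids(df_left, df_ids, key="Id"):
    """Keep the rows of df_left whose key also appears in df_ids.

    Behaves like `WHERE key IN (SELECT key FROM ids)`.
    """
    # left_semi returns only df_left's columns, and each matching row of
    # df_left appears once even if df_ids contains the same key several times
    return df_left.join(df_ids, on=key, how="left_semi")
```

It could be called as `filter_by_ids(df_countries, df_numbers).show()`. With an inner join, duplicate Ids in `df_numbers` would duplicate the matching country rows; the semi join avoids that.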

Hope that clears things up!


1 Comment

A join is the correct way, but it can be slower than a direct query. When the where condition involves only one or two values, it can be better to extract them and use them in the query instead of joining against a big dataset.
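The extract-the-values approach can be sketched as follows. The attempt in the question failed because `collect()` returns a list of `Row` objects, so `format(a)` pasted their printed form into the query rather than the plain numbers. The helper names `build_in_list` and `countries_for_keys` are hypothetical; `table0`, `table1`, and `key` are taken from the question, and the keys are assumed to be integers.

```python
def build_in_list(keys):
    """Render integer keys as the body of a SQL IN (...) clause."""
    if not keys:
        raise ValueError("IN () with no values is not valid SQL")
    # int() rejects non-numeric input, so only digits reach the query text
    return ", ".join(str(int(k)) for k in keys)


def countries_for_keys(spark):
    """Run the second query using the keys returned by the first one."""
    # collect() returns Row objects; pull the plain value out of each one
    rows = spark.sql("select key from table0").collect()
    keys = [row["key"] for row in rows]
    query = "select country from table1 where key in ({0})".format(
        build_in_list(keys)
    )
    return spark.sql(query)
```

Calling `countries_for_keys(spark).show()` would then print the matching countries. This pulls every key to the driver, so it is only reasonable for a small number of values.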
