
I have a DataFrame like this (inspired by this question, with a slightly different setup):

from pyspark.sql import Row

df3 = hive_context.createDataFrame([
    Row(a=107831, f=3),
    Row(a=107531, f=2),
    Row(a=125231, f=2)
])

Based on this, I create two new objects, each a subset of the original DataFrame:

from pyspark.sql.functions import col

df1 = (df3
  .filter(((col('a') == 107831) & (col('f') == 3))|
          ((col('a') == 125231) & (col('f') == 2))))

df2 = (df3
  .filter(((col('a') == 107831) & (col('f') == 3))|
          ((col('a') == 107531) & (col('f') == 2))))
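
For reference, the two subsets should look like this (row order may vary):

df1.show()
# +------+---+
# |     a|  f|
# +------+---+
# |107831|  3|
# |125231|  2|
# +------+---+

df2.show()
# +------+---+
# |     a|  f|
# +------+---+
# |107831|  3|
# |107531|  2|
# +------+---+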

Then I would like to join those two subsets and select the f column from each of them:

a = (df1
  .join(df2, (df1['a'] == df2['a']), how = 'full')
  .select(df1['f'], df2['f']).collect())

However, I get [Row(f=None, f=None), Row(f=2, f=2), Row(f=3, f=3)]

instead of the desired [Row(f=3, f=3), Row(f=None, f=2), Row(f=2, f=None)], or expressed as a table:

+------+----+------+----+
|     a|   f|     a|   f|
+------+----+------+----+
|107831|   3|107831|   3|
|  null|null|107531|   2|
|125231|   2|  null|null|
+------+----+------+----+

Does anyone know how to resolve this? Do I have to store df1 and df2 somewhere?

When I run the scenario as in the question linked above, I get the expected results:

df1 = hive_context.createDataFrame([
    Row(a=107831, f=3),
    Row(a=125231, f=2),
])

df2 = hive_context.createDataFrame([
    Row(a=107831, f=4),
    Row(a=107531, f=2),
])

a = df1.join(df2, (df1['a'] == df2['a']), how = 'full').select(df1['f'], df2['f']).collect()
a
# [Row(f=3, f=4), Row(f=None, f=2), Row(f=2, f=None)]

I am running Python 3.6 and Spark 2.3.

  • The join conditions are different. Commented Sep 4, 2018 at 17:26
  • @AnkitKumarNamdeo That was a typo, thank you for catching it. However, it has no impact on the described behavior: PySpark still always returns the column from the DataFrame on the left side of the join. Commented Sep 4, 2018 at 17:38
  • They both should produce the same output. Can you please run them again? I think the join condition was the only problem. Commented Sep 4, 2018 at 17:52
  • It is not. I tried it on the actual dataset and on a sample dataset; it just doesn't work. Commented Sep 5, 2018 at 7:50

1 Answer


When the column names are duplicated and both DataFrames share the same lineage (here df1 and df2 are both derived from df3), df1['f'] and df2['f'] resolve to the same underlying column, so the select returns the left-hand column twice. Use aliases on your DataFrames and select through them to avoid the ambiguity:

a = df1.alias('l').join(df2.alias('r'), on='a', how = 'full').select('l.f', 'r.f').collect()
print(a)
#[Row(f=3, f=3), Row(f=None, f=2), Row(f=2, f=None)]
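
If you need the two f columns to be distinguishable in the collected rows, you can also rename them in the same select. A minimal sketch building on the aliases above (the names f_left and f_right are just illustrative; row order may vary):

from pyspark.sql.functions import col

a = (df1.alias('l')
  .join(df2.alias('r'), on='a', how='full')
  .select(col('l.f').alias('f_left'), col('r.f').alias('f_right'))
  .collect())
print(a)
#[Row(f_left=3, f_right=3), Row(f_left=None, f_right=2), Row(f_left=2, f_right=None)]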

1 Comment

I was using another approach: df1.join(df2, (df1['a'] == df2['a']), how = 'full').toDF('a_1', 'f_1', 'a_2', 'f_2').select('f_1', 'f_2').collect(). However, I like yours better. Thank you!
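
For completeness, the toDF approach from the comment as a self-contained snippet: toDF renames the four columns of the joined result positionally, so the two f columns can then be selected unambiguously by their new names (a_1, f_1, a_2, f_2 are the names chosen in the comment; row order may vary):

a = (df1
  .join(df2, (df1['a'] == df2['a']), how = 'full')
  .toDF('a_1', 'f_1', 'a_2', 'f_2')
  .select('f_1', 'f_2')
  .collect())
#[Row(f_1=3, f_2=3), Row(f_1=None, f_2=2), Row(f_1=2, f_2=None)]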
