
I have a DataFrame like this (inspired by this question, with a slightly different setup):

from pyspark.sql import Row

df3 = hive_context.createDataFrame([
    Row(a=107831, f=3),
    Row(a=107531, f=2),
    Row(a=125231, f=2)
])

Based on this, I create two new objects, each a subset of the original DataFrame:

from pyspark.sql.functions import col

df1 = (df3
  .filter(((col('a') == 107831) & (col('f') == 3))|
          ((col('a') == 125231) & (col('f') == 2))))

df2 = (df3
  .filter(((col('a') == 107831) & (col('f') == 3))|
          ((col('a') == 107531) & (col('f') == 2))))
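
For reference, the two subsets should look like this (row order may vary):

df1.show()
# +------+---+
# |     a|  f|
# +------+---+
# |107831|  3|
# |125231|  2|
# +------+---+

df2.show()
# +------+---+
# |     a|  f|
# +------+---+
# |107831|  3|
# |107531|  2|
# +------+---+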

Then I would like to join those two subsets and select the f column from each of them:

a = (df1
  .join(df2, (df1['a'] == df2['a']), how = 'full')
  .select(df1['f'], df2['f']).collect())

However, I get [Row(f=None, f=None), Row(f=2, f=2), Row(f=3, f=3)]

instead of the desired [Row(f=3, f=3), Row(f=None, f=2), Row(f=2, f=None)], or expressed as a table:

+------+----+------+----+
|     a|   f|     a|   f|
+------+----+------+----+
|107831|   3|107831|   3|
|  null|null|107531|   2|
|125231|   2|  null|null|
+------+----+------+----+

Does anyone know how to resolve this? Do I have to store df1 and df2 somewhere?

When I run the scenario as in the question linked above, I get the expected results:

df1 = hive_context.createDataFrame([
    Row(a=107831, f=3),
    Row(a=125231, f=2),
])

df2 = hive_context.createDataFrame([
    Row(a=107831, f=4),
    Row(a=107531, f=2),
])

a = df1.join(df2, (df1['a'] == df2['a']), how = 'full').select(df1['f'], df2['f']).collect()
a
# [Row(f=3, f=4), Row(f=None, f=2), Row(f=2, f=None)]

I am running Python 3.6 and Spark 2.3.

  • The join conditions are different. Commented Sep 4, 2018 at 17:26
  • @AnkitKumarNamdeo That was a typo, thank you for catching it. However, it has no impact on the described behavior: PySpark still always returns the column from the DataFrame on the left side of the join. Commented Sep 4, 2018 at 17:38
  • They both should produce the same output. Can you please run them again? I think the join condition was the only problem. Commented Sep 4, 2018 at 17:52
  • It is not. I tried it on the actual dataset and on a sample dataset; it just doesn't work. Commented Sep 5, 2018 at 7:50

1 Answer


When the column names are duplicated and both DataFrames share the same lineage (here df1 and df2 are both derived from df3), df1['f'] and df2['f'] resolve to the same underlying column, so the select returns the left-hand column twice. Use aliases on your DataFrames and select through them to avoid the ambiguity:

a = df1.alias('l').join(df2.alias('r'), on='a', how = 'full').select('l.f', 'r.f').collect()
print(a)
#[Row(f=3, f=3), Row(f=None, f=2), Row(f=2, f=None)]
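
If you need the two f columns to be distinguishable in the collected rows, you can also rename them in the same select. A minimal sketch building on the aliases above (the names f_left and f_right are just illustrative; row order may vary):

from pyspark.sql.functions import col

a = (df1.alias('l')
  .join(df2.alias('r'), on='a', how='full')
  .select(col('l.f').alias('f_left'), col('r.f').alias('f_right'))
  .collect())
print(a)
#[Row(f_left=3, f_right=3), Row(f_left=None, f_right=2), Row(f_left=2, f_right=None)]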

1 Comment

I was using another approach: df1.join(df2, (df1['a'] == df2['a']), how = 'full').toDF('a_1', 'f_1', 'a_2', 'f_2').select('f_1', 'f_2').collect(). However, I like yours better. Thank you!
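
For completeness, the toDF approach from the comment as a self-contained snippet: toDF renames the four columns of the joined result positionally, so the two f columns can then be selected unambiguously by their new names (a_1, f_1, a_2, f_2 are the names chosen in the comment; row order may vary):

a = (df1
  .join(df2, (df1['a'] == df2['a']), how = 'full')
  .toDF('a_1', 'f_1', 'a_2', 'f_2')
  .select('f_1', 'f_2')
  .collect())
#[Row(f_1=3, f_2=3), Row(f_1=None, f_2=2), Row(f_1=2, f_2=None)]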
