Comparing two data frames with different numbers of columns in Scala

Question

I have two data frames, df1 and df2.

df1 has 174 columns and df2 has 175 columns.

How can I find which column is extra?

Thanks for the solution, but in my case the number of columns in the two data frames is different.
– Yogesh, Commented Dec 28, 2021 at 17:07

Alex Ott · Accepted Answer · 2021-12-29 09:34:40Z

3

Just convert column lists into sets, and use diff operations on these sets, like this:

df2.columns.toSet.diff(df1.columns.toSet)

Please note that the order of comparison matters: df1.columns.toSet.diff(df2.columns.toSet) returns only the columns that are in df1 but not in df2, so it will not find the extra column in df2. If you want a diff that does not depend on which data frame has the extra columns, you can use something like this:

df2.columns.toSet.diff(df1.columns.toSet).union(
  df1.columns.toSet.diff(df2.columns.toSet))

edited Dec 29, 2021 at 9:34

answered Dec 28, 2021 at 17:36


Karthikeyan Rasipalay Durairaj · Accepted Answer · 2021-12-28 17:48:21Z

0

In PySpark, you can use the logic below.

from pyspark.sql import SparkSession

# Reuse the active session (e.g. in the pyspark shell) or create one.
spark = SparkSession.builder.getOrCreate()

dept = [("Finance", 10),
        ("Marketing", 20),
        ("Sales", 30),
        ("IT", 40)]
deptColumns = ["dept_name", "dept_id"]

# Same rows plus one extra column.
dept1 = [("Finance", 10, "999"),
         ("Marketing", 20, "999"),
         ("Sales", 30, "999"),
         ("IT", 40, "999")]
deptColumns1 = ["dept_name", "dept_id", "extracol"]

deptDF = spark.createDataFrame(data=dept, schema=deptColumns)
dept1DF = spark.createDataFrame(data=dept1, schema=deptColumns1)
deptDF_columns = deptDF.schema.names
dept1DF_columns = dept1DF.schema.names

# Collect the columns present in dept1DF but missing from deptDF.
list_difference = []
for item in dept1DF_columns:
    if item not in deptDF_columns:
        list_difference.append(item)

print(list_difference)

Tested code: [screenshot not preserved]
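The loop above can also be written as a small helper. This is only a minimal sketch: the function name extra_columns is hypothetical, and plain Python lists stand in for deptDF.columns and dept1DF.columns so that it runs without a Spark session.

```python
def extra_columns(base_columns, other_columns):
    """Return the names in other_columns that are absent from base_columns,
    keeping the order in which they appear in other_columns."""
    base = set(base_columns)  # a set gives fast membership checks
    return [name for name in other_columns if name not in base]


# Plain lists stand in for deptDF.columns and dept1DF.columns (assumption).
dept_columns = ["dept_name", "dept_id"]
dept1_columns = ["dept_name", "dept_id", "extracol"]

print(extra_columns(dept_columns, dept1_columns))  # ['extracol']
print(extra_columns(dept1_columns, dept_columns))  # []
```

As with the Scala diff, the argument order decides which side's extra columns are reported, so the second call returns an empty list.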

answered Dec 28, 2021 at 17:48


2 Comments

Syed Shahzer Over a year ago

You haven’t considered the scenario where deptDF_columns has an extra column. list_difference = set(deptDF_columns) ^ set(dept1DF_columns) should give you the difference between the two lists.

Karthikeyan Rasipalay Durairaj Over a year ago

Thanks for checking. I have considered that scenario too. Can you please check line 13 in the screenshot?
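The symmetric-difference idea raised in the comments above can be sketched as follows. The helper name column_diff and the sample column name legacy_id are hypothetical, and plain lists again stand in for the two data frames' column lists. Results are sorted because sets do not preserve order.

```python
def column_diff(left_columns, right_columns):
    """Report which column names are unique to each side."""
    left, right = set(left_columns), set(right_columns)
    return {
        "only_in_left": sorted(left - right),
        "only_in_right": sorted(right - left),
        # ^ is the symmetric difference: names found on exactly one side
        "either_side": sorted(left ^ right),
    }


result = column_diff(
    ["dept_name", "dept_id", "legacy_id"],  # hypothetical left column list
    ["dept_name", "dept_id", "extracol"],   # hypothetical right column list
)
print(result["only_in_left"])   # ['legacy_id']
print(result["only_in_right"])  # ['extracol']
print(result["either_side"])    # ['extracol', 'legacy_id']
```

Keeping the two one-sided differences alongside the symmetric difference shows not just which columns differ but which data frame each one belongs to.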
