2

I have two data frame df1 and df2.

df1 have 174 columns and df2 have 175 columns.

How I can find which column is extra ?

2

2 Answers 2

3

Just convert column lists into sets, and use diff operations on these sets, like this:

df2.columns.toSet.diff(df1.columns.toSet)

Please note that the order of comparison matters, like, df1.columns.toSet.diff(df2.columns.toSet) won't produce a required diff. If you want to have diff independent of position, you can use something like this:

df2.columns.toSet.diff(df1.columns.toSet).union(
  df1.columns.toSet.diff(df2.columns.toSet))
Sign up to request clarification or add additional context in comments.

Comments

0

In pyspark , You can use below logic .

dept = [("Finance",10), 
        ("Marketing",20), 
        ("Sales",30), 
        ("IT",40) 
      ]
deptColumns = ["dept_name","dept_id"]

dept1 = [("Finance",10,'999'), 
        ("Marketing",20,'999'), 
        ("Sales",30,'999'), 
        ("IT",40,'999') 
      ]
deptColumns1 = ["dept_name","dept_id","extracol"]

deptDF = spark.createDataFrame(data=dept, schema = deptColumns)
dept1DF = spark.createDataFrame(data=dept1, schema = deptColumns1)
deptDF_columns=deptDF.schema.names
dept1DF_columns=dept1DF.schema.names

list_difference = []
for item in dept1DF_columns:
  if item not in deptDF_columns:
     list_difference.append(item)

print(list_difference)

Tested code :

enter image description here

2 Comments

You haven’t consider the scenario where deptDF_columns has an extra column. list_difference = set(deptDF_columns) ^ set(dept1DF_columns) should give you the difference in the 2 lists.
Thanks for your checking. I have considered that scenario too. Can you please check the line number 13 in screenshot

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.