
I have a spark dataframe and want to drop only the last column.

I tried

df.drop(df.columns.last)

but got the error: AttributeError: 'list' object has no attribute 'last'.

I also tried:

df = df.drop(df.columns[-1])

but this dropped every column that has the same name as the last one.

Using Spark 2.4

  • It is better to drop by name. withColumn can alter the order of the columns Commented Jan 23, 2020 at 16:46
  • @Joe I would recommend the following: 1) Save the column names to a list: colnames = df.columns 2) Rename the columns so the names are unique: df = df.toDF(*map(str, range(len(colnames)))) 3) Drop the last column: df = df.drop(df.columns[-1]) 4) Rename the columns back to the original: df = df.toDF(*colnames[:-1]). Ping me if the question is reopened and I will post an answer. Commented Jan 23, 2020 at 18:26
  • @pault I reopened the question Commented Jan 23, 2020 at 19:32

2 Answers


Here is an approach you can take to drop any column by index.

Suppose you had the following DataFrame:

import numpy as np

np.random.seed(1)
data = np.random.randint(0, 10, size=(3, 3))

df = spark.createDataFrame(data.astype(int).tolist(), ["a", "b", "a"])
df.show()
#+---+---+---+
#|  a|  b|  a|
#+---+---+---+
#|  5|  8|  9|
#|  5|  0|  0|
#|  1|  7|  6|
#+---+---+---+

First save the original column names.

colnames = df.columns
print(colnames)
#['a', 'b', 'a']

Then rename all of the columns in the DataFrame using range so the new column names are unique (they will simply be the column index).

df = df.toDF(*map(str, range(len(colnames))))
print(df.columns)
#['0', '1', '2']

Now drop the last column and rename the columns using the saved column names from the first step (excluding the last column).

df = df.drop(df.columns[-1]).toDF(*colnames[:-1])
df.show()
#+---+---+
#|  a|  b|
#+---+---+
#|  5|  8|
#|  5|  0|
#|  1|  7|
#+---+---+

You can easily expand this to any index, since we renamed using range.
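For reference, the rename-drop-rename steps above can be wrapped into a small helper that drops any column by (non-negative) positional index. The name drop_col_by_index is hypothetical, not part of the Spark API; this is a sketch of the same trick, not a definitive implementation:

```python
def drop_col_by_index(df, index):
    """Drop the column at the given non-negative position, even when
    column names are duplicated. (`drop_col_by_index` is a hypothetical
    helper name, not a Spark API.)"""
    colnames = df.columns
    # Rename every column to its stringified index so all names are unique
    df = df.toDF(*map(str, range(len(colnames))))
    # Drop the target position, then restore the remaining original names
    kept = [name for i, name in enumerate(colnames) if i != index]
    return df.drop(str(index)).toDF(*kept)
```

For example, drop_col_by_index(df, 1) would remove the b column while keeping both a columns.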


I broke it up into steps for explanation purposes, but you can also do this more compactly as follows:

colnames = df.columns
df = df.toDF(*map(str, range(len(colnames))))\
    .drop(str(len(colnames)-1))\
    .toDF(*colnames[:-1])


It is better to drop a column by name. Some operations, like withColumn, can alter the order of the columns. If a DataFrame has duplicate column names coming out of a join, refer to the column as dataframe.column_name instead of by "columnName", which causes ambiguity.

df3 = df1.join(df2, df1.c1 == df2.c1).drop(df2.c1)

In general, use df.drop(df.columnName).

4 Comments

Column name could be ambiguous since there will be 2 columns with name c1 here.
If I do df2.drop('c1') then it's ambiguous. But if I do df2.drop(df2.c1) then it is not. Please try.
I will need example and spark version to reproduce.
If the df has multiple columns with the same name and are thus ambiguous, this doesn't work. This is a circumstantial answer that works after this particular join.
