
I have a spark dataframe and want to drop only the last column.

I tried

df.drop(df.columns.last)

but got the error: AttributeError: 'list' object has no attribute 'last'.

I also tried:

df = df.drop(df.columns[-1])

but this dropped every column that has the same name as the last one.

Using Spark 2.4

  • It is better to drop by name. withColumn can alter the order of the columns Commented Jan 23, 2020 at 16:46
  • @Joe I would recommend the following: 1) Save the column names to a list: colnames = df.columns 2) Rename the columns so the names are unique: df = df.toDF(*map(str, range(len(colnames)))) 3) Drop the last column: df = df.drop(df.columns[-1]) 4) Rename the columns back to the original: df = df.toDF(*colnames[:-1]). Ping me if the question is reopened and I will post an answer. Commented Jan 23, 2020 at 18:26
  • @pault I reopened the question Commented Jan 23, 2020 at 19:32

2 Answers


Here is an approach you can take to drop any column by index.

Suppose you had the following DataFrame:

import numpy as np

np.random.seed(1)
data = np.random.randint(0, 10, size=(3, 3))

df = spark.createDataFrame(data.astype(int).tolist(), ["a", "b", "a"])
df.show()
#+---+---+---+
#|  a|  b|  a|
#+---+---+---+
#|  5|  8|  9|
#|  5|  0|  0|
#|  1|  7|  6|
#+---+---+---+

First save the original column names.

colnames = df.columns
print(colnames)
#['a', 'b', 'a']

Then rename all of the columns in the DataFrame using range so the new column names are unique (they will simply be the column index).

df = df.toDF(*map(str, range(len(colnames))))
print(df.columns)
#['0', '1', '2']

Now drop the last column and rename the columns using the saved column names from the first step (excluding the last column).

df = df.drop(df.columns[-1]).toDF(*colnames[:-1])
df.show()
#+---+---+
#|  a|  b|
#+---+---+
#|  5|  8|
#|  5|  0|
#|  1|  7|
#+---+---+

You can easily expand this to any index, since we renamed using range.
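For reference, the rename-drop-rename steps above can be wrapped into a small helper that drops any column by (non-negative) positional index. The name drop_col_by_index is hypothetical, not part of the Spark API; this is a sketch of the same trick, not a definitive implementation:

```python
def drop_col_by_index(df, index):
    """Drop the column at the given non-negative position, even when
    column names are duplicated. (`drop_col_by_index` is a hypothetical
    helper name, not a Spark API.)"""
    colnames = df.columns
    # Rename every column to its stringified index so all names are unique
    df = df.toDF(*map(str, range(len(colnames))))
    # Drop the target position, then restore the remaining original names
    kept = [name for i, name in enumerate(colnames) if i != index]
    return df.drop(str(index)).toDF(*kept)
```

For example, drop_col_by_index(df, 1) would remove the b column while keeping both a columns.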


I broke it up into steps for explanation purposes, but you can also do this more compactly as follows:

colnames = df.columns
df = df.toDF(*map(str, range(len(colnames))))\
    .drop(str(len(colnames)-1))\
    .toDF(*colnames[:-1])


It is better to drop a column by name. Some operations, like withColumn, can alter the order of the columns. If a DataFrame has duplicate column names coming out of a join, refer to the column as dataframe.column_name instead of by "columnName", which causes ambiguity.

df3 = df1.join(df2, df1.c1 == df2.c1).drop(df2.c1)

In general, use df.drop(df.columnName).

4 Comments

Column name could be ambiguous since there will be 2 columns with name c1 here.
If I do df2.drop('c1') then it's ambiguous. But if I do df2.drop(df2.c1) then it is not. Please try.
I will need example and spark version to reproduce.
If the df has multiple columns with the same name and are thus ambiguous, this doesn't work. This is a circumstantial answer that works after this particular join.
