I have a dataframe with thousands of columns that I would like to pass to the greatest function without specifying each column name individually. How can I do that?

As an example, I have a df with 3 columns that I am passing to greatest by specifying each one explicitly: df.x, df.y, and so on.

>>> from pyspark.sql.functions import greatest
>>> df = sqlContext.createDataFrame([(1, 4, 3)], ['x', 'y', 'z'])
>>> df.select(greatest(df.x, df.y, df.z).alias('greatest')).show()
+--------+
|greatest|
+--------+
|       4|
+--------+

In the above example I had only 3 columns, but with thousands of them it is impractical to list each one. A couple of things I tried didn't work. I am missing some crucial Python...

df.select(greatest(",".join(df.columns)).alias('greatest')).show()
ValueError: greatest should take at least two columns

df.select(greatest(",".join(df.columns),df[0]).alias('greatest')).show()
u"cannot resolve 'x,y,z' given input columns: [x, y, z];"

df.select(greatest([c for c in df.columns],df[0]).alias('greatest')).show()
Method col([class java.util.ArrayList]) does not exist
  • Use pandas. You can use apply() with pandas to get the max value from each row (if that is what you are looking for). Commented Feb 8, 2018 at 11:06
  • Haven't tried this one but it would make sense: df.select(greatest(*[col(c) for c in df.columns]).alias('greatest')).show() Commented Feb 8, 2018 at 11:07
  • @mkaran - It works. But what does the * mean here? Commented Feb 8, 2018 at 11:13
  • The * unpacks the list so that greatest is called with separate positional arguments instead of a single list. Commented Feb 8, 2018 at 11:21
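
To make the comments above concrete, here is a minimal sketch in plain Python, no Spark needed. count_args is a hypothetical stand-in for any function declared with *cols (as greatest is); it only reports how many positional arguments it received. It suggests why the join and list attempts in the question each hand over a single argument, while the * form hands over one argument per column.

```python
# Hypothetical stand-in for a variadic function such as greatest(*cols).
# It only counts the positional arguments it was called with.
def count_args(*cols):
    return len(cols)

columns = ["x", "y", "z"]

joined = count_args(",".join(columns))  # one string: "x,y,z"
as_list = count_args(columns)           # one list object
unpacked = count_args(*columns)         # three separate strings

print(joined, as_list, unpacked)  # prints: 1 1 3
```

A single joined string is treated as one (non-existent) column name, and a bare list is one argument of the wrong type, which matches the three errors shown in the question.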

1 Answer

greatest supports positional arguments*

pyspark.sql.functions.greatest(*cols)

(this is why you can call greatest(df.x, df.y, df.z)), so just unpack the list of column names:

df = sqlContext.createDataFrame([(1, 4, 3)], ['x', 'y', 'z'])
df.select(greatest(*df.columns))

* Quoting the Python glossary, a positional argument is

  • ... an argument that is not a keyword argument. Positional arguments can appear at the beginning of an argument list and/or be passed as elements of an iterable preceded by *. For example, 3 and 5 are both positional arguments in the following calls:

    complex(3, 5)
    complex(*(3, 5))
    

Furthermore:

1 Comment

*cols or *df.columns - Does it return a list or comma separated columns as expected by greatest? I am always getting confused with it.
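
On the question in this comment, a small plain-Python sketch may help. show_cols is a hypothetical function used only for illustration. As far as the language goes, * does not "return" anything: in a call it spreads the elements of an iterable into separate positional arguments, and in a def it collects whatever positional arguments arrive into a tuple.

```python
def show_cols(*cols):
    # In the definition, *cols collects the positional arguments into a tuple.
    return type(cols).__name__, cols

columns = ["x", "y", "z"]

# In the call, *columns spreads the list into three separate arguments.
kind, received = show_cols(*columns)

# Spelling the arguments out by hand gives exactly the same call.
same = show_cols("x", "y", "z") == show_cols(*columns)

print(kind, received, same)  # prints: tuple ('x', 'y', 'z') True
```

So greatest(*df.columns) is the same call as writing every column name out by hand, separated by commas.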
