
I'm facing an issue when mixing Python's map and lambda functions with PySpark.

Given df1, my source dataframe:

Animals     | Food      | Home
----------------------------------
Monkey      | Banana    | Jungle
Dog         | Meat      | Garden
Cat         | Fish      | House
Elephant    | Banana    | Jungle
Lion        | Meat      | Desert

I want to create another dataframe, df2, with two columns and one row per column of df1 (3 in my example). The first column would contain the names of df1's columns. The second column would contain an array of the values with the most occurrences (top n=3 in the example below) along with their counts.

Column      | Content
-----------------------------------------------------------
Animals     | [("Cat", 1), ("Dog", 1), ("Elephant", 1)]
Food        | [("Banana", 2), ("Meat", 2), ("Fish", 1)]
Home        | [("Jungle", 2), ("Desert", 1), ("Garden", 1)]

I tried to do it with Python lists, map and lambda functions, but I ran into conflicts with PySpark functions:

from pyspark.sql import functions as F

def transform(df1):
    # Number of entries to keep per column
    n = 3
    # Add a column for the occurrence count
    df1 = df1.withColumn("future_occurences", F.lit(1))

    df2 = df1.withColumn("Content",
        F.array(
            F.create_map(
                lambda x: (x,
                    [
                        str(row[x]) for row in df1.groupBy(x).agg(
                            F.sum("future_occurences").alias("occurences")
                        ).orderBy(
                            F.desc("occurences")
                        ).select(x).limit(n).collect()
                    ]
                ), df1.columns
            )
        )
    )
    return df2

The error is:

TypeError: Invalid argument, not a string or column: <function <lambda> at 0x7fc844430410> of type <type 'function'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.

Any idea how to fix it?

Thanks a lot!

    This can be done, but it's not really the type of problem that spark is designed for. You can treat each column independently and union the results. How do you break ties? Why is it Cat, Dog, Elephant when the other two animals also have a count of 1? Commented Jun 24, 2019 at 14:58
  • @PentaKill I prefer to post my code to illustrate the problem I'm facing. I don't understand why you say it is useless. Commented Jun 24, 2019 at 15:08
  • @pault thanks for your comment. I'm new to Spark so I still need to learn. Yes, I guess I could treat the columns independently, but I wasn't sure it was the best solution. I break ties in alphabetical order. That is why I didn't display Lion and Monkey. Commented Jun 24, 2019 at 15:08

1 Answer


Here is one possible solution, in which the Content column will be an array of StructType with two named fields: Content and count.

from pyspark.sql.functions import col, collect_list, desc, lit, struct
from functools import reduce 

def transform(df, n):
    # For each column: count the values, keep the top n (ties broken
    # alphabetically), collect the results into one row per column,
    # then union the per-column DataFrames together.
    return reduce(
        lambda a, b: a.unionAll(b),
        (
            df.groupBy(c).count()
                .orderBy(desc("count"), c)
                .limit(n)
                .withColumn("Column", lit(c))
                .groupBy("Column")
                .agg(
                    collect_list(
                        struct(
                            col(c).cast("string").alias("Content"),
                            "count"
                        )
                    ).alias("Content")
                )
            for c in df.columns
        )
    )

This function will iterate through each of the columns in the input DataFrame, df, and count the occurrences of each value. Then we orderBy the count (descending) and the column value itself (alphabetically), and keep only the first n rows (limit(n)).

Next, collect the values into an array of structs and finally union together the results for each column. Since the union requires each DataFrame to have the same schema, you will need to cast the column value to a string.

n = 3
df1 = transform(df, n)
df1.show(truncate=False)
#+-------+------------------------------------+
#|Column |Content                             |
#+-------+------------------------------------+
#|Animals|[[Cat,1], [Dog,1], [Elephant,1]]    |
#|Food   |[[Banana,2], [Meat,2], [Fish,1]]    |
#|Home   |[[Jungle,2], [Desert,1], [Garden,1]]|
#+-------+------------------------------------+

This isn't exactly the same output that you asked for, but it will probably be sufficient for your needs. (Spark doesn't have tuples in the way you described.) Here's the new schema:

df1.printSchema()
#root
# |-- Column: string (nullable = false)
# |-- Content: array (nullable = true)
# |    |-- element: struct (containsNull = true)
# |    |    |-- Content: string (nullable = true)
# |    |    |-- count: long (nullable = false)
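As a quick sanity check, the same top-n-per-column logic can be reproduced in plain Python with `collections.Counter` on the sample data. This is a driver-side sketch for tiny datasets only, not a replacement for the Spark version, but it confirms the tie-breaking behavior (count descending, then value ascending):

```python
from collections import Counter

# Sample data from the question.
rows = [
    ("Monkey", "Banana", "Jungle"),
    ("Dog", "Meat", "Garden"),
    ("Cat", "Fish", "House"),
    ("Elephant", "Banana", "Jungle"),
    ("Lion", "Meat", "Desert"),
]
columns = ["Animals", "Food", "Home"]
n = 3

result = {}
for i, c in enumerate(columns):
    counts = Counter(row[i] for row in rows)
    # Sort by count (descending), then value (alphabetically) to break
    # ties, matching orderBy(desc("count"), c) in the Spark version.
    result[c] = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

print(result)
# {'Animals': [('Cat', 1), ('Dog', 1), ('Elephant', 1)],
#  'Food': [('Banana', 2), ('Meat', 2), ('Fish', 1)],
#  'Home': [('Jungle', 2), ('Desert', 1), ('Garden', 1)]}
```

The printed dictionary matches the Spark output above row for row.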

4 Comments

thanks for your solution which seems to fully answer my needs. However it leads to an error Union can only be performed on tables with the compatible column types. array<struct<Content:boolean,count:bigint>> <> array<struct<Content:string,count:bigint>> at the second column of the 2th table. I don't understand where the boolean type comes from.
excellent! thanks a lot! The only thing is I've got Animals|[[Content: Cat, count: 1], [Content: Dog, count: 1], [Content: Elephant, count: 1]] Is it possible to remove the field names in the struct? Even if I remove the alias there is still a name.
@Maxbester you can change the struct to array (after from pyspark.sql.functions import array) which would leave you with a WrappedArray. I'm not sure why this matters to you - what's the end goal?
ok I will try. Sorry I thought I explained the goal in my initial question. Actually the purpose is to validate a dataset creation. I want to make sure data is relevant in each column. Obviously my dataset is much bigger than the one I gave as an example.
