I have a PySpark dataframe and I want to use two of its columns to output a dictionary.
input pyspark dataframe:
col1 | col2 | col3
v    | 3    | a
d    | 2    | b
q    | 9    | g
output:
dict = {'v': 3, 'd': 2, 'q': 9}
How should I do this efficiently?
I believe you can achieve this by converting the DataFrame (with only the two columns you want) to an RDD:
data_rdd = data.select(['col1', 'col2']).rdd
then create an RDD of key-value pairs from the two columns using the map function:
kp_rdd = data_rdd.map(lambda row: (row[0], row[1]))
and then collect as map:
result_dict = kp_rdd.collectAsMap()  # avoid naming it dict, which shadows the built-in
That's the main idea; sorry, I don't have an instance of PySpark running right now to test it.
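Putting those steps together, a minimal end-to-end sketch (assuming a SparkSession named spark and the sample data from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = spark.createDataFrame(
    [('v', 3, 'a'), ('d', 2, 'b'), ('q', 9, 'g')],
    ['col1', 'col2', 'col3'])

# select the key and value columns, map each Row to a (key, value)
# tuple, and collect to the driver as a Python dict
result_dict = (data.select(['col1', 'col2'])
               .rdd
               .map(lambda row: (row[0], row[1]))
               .collectAsMap())
print(result_dict)  # {'v': 3, 'd': 2, 'q': 9}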
A few different options here depending on the format you need. Check these out; I'm using the structured API. If you need to persist the result, either save it as a JSON dict or preserve the schema with Parquet (see the sketch after the examples).
from pyspark.sql.functions import to_json, create_map, col
df = spark.createDataFrame(
    [('v', 3, 'a'),
     ('d', 2, 'b'),
     ('q', 9, 'g')],
    ["c1", "c2", "c3"])
mapDF = df.select(create_map(col("c1"), col("c2")).alias("mapper"))
mapDF.show(3)
+--------+
| mapper|
+--------+
|[v -> 3]|
|[d -> 2]|
|[q -> 9]|
+--------+
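If you want to collapse those per-row maps into the single Python dict from the question, one option (a sketch; after collect(), each MapType value arrives on the driver as a Python dict):

merged = {}
for row in mapDF.collect():       # brings all rows to the driver
    merged.update(row["mapper"])  # each MapType value is already a dict
print(merged)  # {'v': 3, 'd': 2, 'q': 9}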
dictDF = df.select(to_json(create_map(col("c1"), col("c2"))).alias("dict"))
dictDF.show()
+-------+
| dict|
+-------+
|{"v":3}|
|{"d":2}|
|{"q":9}|
+-------+
keyValueDF = (df.selectExpr("(c1, c2) as keyValueDict")
              .select(to_json(col("keyValueDict")).alias("keyValueDict")))
keyValueDF.show()
+-----------------+
| keyValueDict|
+-----------------+
|{"c1":"v","c2":3}|
|{"c1":"d","c2":2}|
|{"c1":"q","c2":9}|
+-----------------+
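For the persistence options mentioned up top, a minimal sketch (the /tmp paths are placeholders; adjust to your environment):

# one JSON object per line, e.g. {"c1":"v","c2":3}
df.select("c1", "c2").write.mode("overwrite").json("/tmp/kv_json")

# Parquet keeps the schema and column types intact
df.select("c1", "c2").write.mode("overwrite").parquet("/tmp/kv_parquet")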