
Is there an equivalent of Python's eval() function in PySpark?

I am trying to convert Python code into PySpark.

I am querying a DataFrame, and one of the columns holds the data shown below, but in string format.

[{u'date': u'2015-02-08', u'by': u'[email protected]', u'value': u'NA'}, {u'date': u'2016-02-08', u'by': u'[email protected]', u'value': u'applicable'}, {u'date': u'2017-02-08', u'by': u'[email protected]', u'value': u'ufc'}]

Assume that 'x' is the column which holds this value in the DataFrame.

Now I want to parse that string column 'x' into a list so that I can pass it to a mapPartitions function.

I want to avoid iterating over each row on my driver; that's the reason I am thinking this way.

In plain Python, using the eval() function, I get the output below:

x = "[{u'date': u'2015-02-08', u'by': u'[email protected]', u'value': u'NA'}, {u'date': u'2016-02-08', u'by': u'[email protected]', u'value': u'applicable'}, {u'date': u'2017-02-08', u'by': u'[email protected]', u'value': u'ufc'}]"

parsed = eval(x)  # avoid shadowing the built-in name 'list'

for i in parsed:  print i

Output (this is what I want in PySpark as well):

{u'date': u'2015-02-08', u'by': u'[email protected]', u'value': u'NA'}
{u'date': u'2016-02-08', u'by': u'[email protected]', u'value': u'applicable'}
{u'date': u'2017-02-08', u'by': u'[email protected]', u'value': u'ufc'}
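
As an aside, ast.literal_eval from the standard library does the same parsing here but only accepts literals, so it is the safer choice over eval:

import ast

parsed = ast.literal_eval(x)  # same result as eval(x), without executing arbitrary code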

How can I do this in PySpark?

  • I don't see in what sense that looked similar to my question; it's nowhere near. Commented Mar 10, 2018 at 2:30
  • Sorry, I may have misunderstood your question then; I thought you wanted to convert row values to columns, which is what your example looks like. Commented Mar 10, 2018 at 2:31
  • Nope, but thanks for trying. Commented Mar 10, 2018 at 2:32

1 Answer


You can use the from_json function to convert your JSON string into actual structured data. For that you will have to define a schema matching your JSON string, and finally use the explode function to separate the array of structs into different rows, just as eval did.

If you have data such as

x = "[{u'date': u'2015-02-08', u'by': u'[email protected]', u'value': u'NA'}, {u'date': u'2016-02-08', u'by': u'[email protected]', u'value': u'applicable'}, {u'date': u'2017-02-08', u'by': u'[email protected]', u'value': u'ufc'}]"

then a DataFrame can be created with

df = sqlContext.createDataFrame([(x,),], ["x"])  # a list with one single-element tuple: one row, one column named "x"

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|x                                                                                                                                                                                                              |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{u'date': u'2015-02-08', u'by': u'[email protected]', u'value': u'NA'}, {u'date': u'2016-02-08', u'by': u'[email protected]', u'value': u'applicable'}, {u'date': u'2017-02-08', u'by': u'[email protected]', u'value': u'ufc'}]|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+


root
 |-- x: string (nullable = true)
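
Since you mention in the comments that the string is stored in a table, the same DataFrame could equally come from a query; my_table here is a hypothetical name:

df = sqlContext.sql("SELECT x FROM my_table")  # hypothetical table name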

Using JSON

As explained above, you would need a schema together with the regexp_replace, from_json and explode functions:

from pyspark.sql import types as T

# schema for the array of structs, matching the three fields in each dict
schema = T.ArrayType(T.StructType([
    T.StructField('date', T.StringType()),
    T.StructField('by', T.StringType()),
    T.StructField('value', T.StringType())
]))
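
Since Spark 2.3, from_json also accepts the schema as a DDL-formatted string, so the same schema could likely be written more compactly (an untested sketch):

schema = "array<struct<date:string,by:string,value:string>>"  # DDL string form, Spark 2.3+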

from pyspark.sql import functions as F

# drop the u'' prefixes (Spark's JSON reader accepts single-quoted strings by default),
# parse the string with the schema, then explode the array into one row per struct
df = df.withColumn("x", F.explode(F.from_json(F.regexp_replace(df['x'], "(u')", "'"), schema=schema)))

which should give you

+-----------------------------------+
|x                                  |
+-----------------------------------+
|[2015-02-08,[email protected],NA]         |
|[2016-02-08,[email protected],applicable]|
|[2017-02-08,[email protected],ufc]      |
+-----------------------------------+

root
 |-- x: struct (nullable = true)
 |    |-- date: string (nullable = true)
 |    |-- by: string (nullable = true)
 |    |-- value: string (nullable = true)
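
Because x is now a struct column, its fields can be promoted to top-level columns with a star expansion, for example:

df_flat = df.select("x.*")  # yields the columns date, by and value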

If you require the JSON strings as shown in the question, you can use the to_json function:

df = df.withColumn("x", F.to_json(df['x']))

which will give you

+-------------------------------------------------------------+
|x                                                            |
+-------------------------------------------------------------+
|{"date":"2015-02-08","by":"[email protected]","value":"NA"}         |
|{"date":"2016-02-08","by":"[email protected]","value":"applicable"}|
|{"date":"2017-02-08","by":"[email protected]","value":"ufc"}      |
+-------------------------------------------------------------+
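
From here you can hand the rows to mapPartitions on the underlying RDD, which was your original goal. A minimal sketch, where handle_partition stands in for your real per-partition logic:

import json

def handle_partition(rows):
    # runs on the executors: parse each JSON string back into a dict
    for row in rows:
        yield json.loads(row['x'])  # replace with your actual processing

result = df.rdd.mapPartitions(handle_partition)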

Using strings only

If you don't want to go through all the complexities of JSON, you can simply work with strings. For that you would need nested regexp_replace, split and explode functions:

from pyspark.sql import functions as F

df = df.withColumn("x", F.explode(F.split(
    F.regexp_replace(
        F.regexp_replace(
            F.regexp_replace(df['x'], "(u')", "'"),  # drop the u'' prefixes
            r"[\[\]\s]", ""),                        # drop the brackets and any whitespace
        r"\},\{", "};&;{"),                          # mark the boundary between the dicts
    ";&;")))                                         # split on the marker and explode into rows

which should give you

+-------------------------------------------------------------+
|x                                                            |
+-------------------------------------------------------------+
|{'date':'2015-02-08','by':'[email protected]','value':'NA'}         |
|{'date':'2016-02-08','by':'[email protected]','value':'applicable'}|
|{'date':'2017-02-08','by':'[email protected]','value':'ufc'}      |
+-------------------------------------------------------------+
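
If what you really want is eval itself, a further alternative (not part of the original answer) is to run ast.literal_eval inside a UDF so the parsing happens on the executors; it even copes with the u'' prefixes directly, so no regexp_replace is needed. A sketch reusing the ArrayType schema defined above:

import ast
from pyspark.sql import functions as F

# parse the Python-literal string on the executors; 'schema' is the ArrayType defined earlier
parse_literals = F.udf(ast.literal_eval, schema)
df = df.withColumn("x", F.explode(parse_literals(df['x'])))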

4 Comments

  • That u' that you removed is actually a unicode prefix, which I have to remove as well. How can we strip the unicode prefix and read the value simply as a string?
  • Do you have it saved in a text file or somewhere else? Please share how the unicode appears in the data.
  • It is stored in a table as a string, which I query to build the DataFrame.
  • Can you please explain this line: sqlContext.createDataFrame([(x,),], ["x"])? What exactly is the code inside the parentheses doing, and how does it work?
