Can someone tell me how to convert a list containing strings to a DataFrame in PySpark? I am using Python 3.6 with Spark 2.2.1. I have just started learning the Spark environment, and my data looks like this:

my_data = [['apple','ball','ballon'],['cat','camel','james'],['none','focus','cake']]

Now I want to create a DataFrame as follows:

+----+---------------------------+
| ID | words                     |
+----+---------------------------+
| 1  | ['apple','ball','ballon'] |
| 2  | ['cat','camel','james']   |
| 3  | ['none','focus','cake']   |
+----+---------------------------+
I also want to add an ID column, which is not present in the data.

  • This question is about two unrelated things: building a dataframe from a list and adding an ordinal column. Attempting to do both at once results in a confusing implementation. There are far simpler ways to build a dataframe from a list if we do not insist on the ID, and far simpler ways to add the ID after the fact (see the sketch after these comments). The question shows up in searches for converting a list to a dataframe, and the answers are not suitable outside the specific case of this question. Commented Feb 11, 2023 at 1:05
  • Also, the question title is incorrect. What's actually being asked is how to create an enumeration of a list in Spark, similar to Python's enumerate. Commented Feb 11, 2023 at 1:06
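
For reference, a minimal sketch of doing the two steps separately (assuming the usual SparkSession named spark, as in the answers below): build the DataFrame first, then attach an ID with monotonically_increasing_id:

from pyspark.sql.functions import monotonically_increasing_id

# build a one-column DataFrame from the list of lists
df = spark.createDataFrame([(words,) for words in my_data], ["words"])

# attach an ID after the fact; the values are increasing and unique,
# but not guaranteed to be consecutive
df = df.withColumn("ID", monotonically_increasing_id())
df.show(truncate=False)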

4 Answers


You can convert the list to a list of Row objects, then use spark.createDataFrame, which will infer the schema from your data:

from pyspark.sql import Row
R = Row('ID', 'words')

# use enumerate to add the ID column
spark.createDataFrame([R(i, x) for i, x in enumerate(my_data)]).show() 
+---+--------------------+
| ID|               words|
+---+--------------------+
|  0|[apple, ball, bal...|
|  1| [cat, camel, james]|
|  2| [none, focus, cake]|
+---+--------------------+
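
Note that enumerate numbers from 0; if you want IDs starting at 1, as in the question's expected output, enumerate accepts a start argument:

spark.createDataFrame([R(i, x) for i, x in enumerate(my_data, start=1)]).show()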

3 Comments

Thanks for your reply, but I am getting the following error when I run the code: Py4JJavaError: An error occurred while calling o40.describe. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 2.0 failed 1 times, most recent failure: Lost task 1.0 in stage 2.0 (TID 3, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "pyspark/worker.py", line 123, in main ("%d.%d" % sys.version_info[:2], version))
Try restarting the pyspark shell. The error doesn't seem to be related to the code.
Isn't it awesome! Exactly what I was searching for.

Try this -

data_array = []
for i in range(len(my_data)):
    data_array.append((i, my_data[i]))

# note: the method is createDataFrame (capital F)
df = spark.createDataFrame(data=data_array, schema=["ID", "words"])

df.show()
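
The same result reads more idiomatically with enumerate and a list comprehension; a sketch of the equivalent:

data_array = [(i, words) for i, words in enumerate(my_data)]
df = spark.createDataFrame(data=data_array, schema=["ID", "words"])
df.show()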



Try this -- the simplest approach:

from datetime import datetime
from pyspark.sql import Row

utc = datetime.utcnow()  # placeholder; utc was undefined in the original snippet
x = Row(utc_timestamp=utc, routine='routine name', message='your message')
data = [x]
df = sqlContext.createDataFrame(data)  # in Spark 2.x, spark.createDataFrame(data) is preferred
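
Applied to the question's data, the same Row pattern would look like this (a sketch, using the Spark 2.x spark session):

rows = [Row(ID=i, words=words) for i, words in enumerate(my_data)]
df = spark.createDataFrame(rows)
df.show(truncate=False)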



Simple Approach:

my_data = [['apple','ball','ballon'],['cat','camel','james'],['none','focus','cake']]

spark.sparkContext.parallelize(my_data).zipWithIndex() \
    .toDF(["words", "id"]).show(truncate=False)

+---------------------+---+
|words                |id |
+---------------------+---+
|[apple, ball, ballon]|0  |
|[cat, camel, james]  |1  |
|[none, focus, cake]  |2  |
+---------------------+---+
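
Since zipWithIndex yields (element, index) pairs, the index lands in the second column. To match the question's layout with the ID first, a select can reorder the columns:

spark.sparkContext.parallelize(my_data).zipWithIndex() \
    .toDF(["words", "id"]).select("id", "words").show(truncate=False)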

