Can someone tell me how to convert a list containing strings to a DataFrame in PySpark? I am using Python 3.6 with Spark 2.2.1. I have just started learning the Spark environment, and my data looks like this:

my_data = [['apple','ball','ballon'],['cat','camel','james'],['none','focus','cake']]

Now I want to create a DataFrame as follows:

+----+---------------------------+
| ID | words                     |
+----+---------------------------+
| 1  | ['apple','ball','ballon'] |
| 2  | ['cat','camel','james']   |
| 3  | ['none','focus','cake']   |
+----+---------------------------+
I also want to add an ID column, which is not present in the data.

  • This question is about two unrelated things: building a dataframe from a list and adding an ordinal column. Attempting to do both at once results in a confusing implementation. There are far simpler ways to build a dataframe from a list if we do not insist on the ID, and far simpler ways to add the ID after the fact (see the sketch after these comments). The question shows up in searches for converting a list to a dataframe, and the answers are not suitable outside the specific case of this question. Commented Feb 11, 2023 at 1:05
  • Also, the question title is incorrect. What's actually being asked is how to create an enumeration of a list in Spark, similar to Python's enumerate. Commented Feb 11, 2023 at 1:06
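
For reference, a minimal sketch of doing the two steps separately (assuming the usual SparkSession named spark, as in the answers below): build the DataFrame first, then attach an ID with monotonically_increasing_id:

from pyspark.sql.functions import monotonically_increasing_id

# build a one-column DataFrame from the list of lists
df = spark.createDataFrame([(words,) for words in my_data], ["words"])

# attach an ID after the fact; the values are increasing and unique,
# but not guaranteed to be consecutive
df = df.withColumn("ID", monotonically_increasing_id())
df.show(truncate=False)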

4 Answers


You can convert the list to a list of Row objects, then use spark.createDataFrame, which will infer the schema from your data:

from pyspark.sql import Row
R = Row('ID', 'words')

# use enumerate to add the ID column
spark.createDataFrame([R(i, x) for i, x in enumerate(my_data)]).show() 
+---+--------------------+
| ID|               words|
+---+--------------------+
|  0|[apple, ball, bal...|
|  1| [cat, camel, james]|
|  2| [none, focus, cake]|
+---+--------------------+
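
Note that enumerate numbers from 0; if you want IDs starting at 1, as in the question's expected output, enumerate accepts a start argument:

spark.createDataFrame([R(i, x) for i, x in enumerate(my_data, start=1)]).show()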

3 Comments

Thanks for your reply, but I am getting the following error when I run the code: Py4JJavaError: An error occurred while calling o40.describe. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 2.0 failed 1 times, most recent failure: Lost task 1.0 in stage 2.0 (TID 3, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "pyspark/worker.py", line 123, in main ("%d.%d" % sys.version_info[:2], version))
Try restarting the pyspark shell. The error doesn't seem to be related to the code.
Isn't it awesome! Exactly what I was searching for.

Try this -

data_array = []
for i in range(len(my_data)):
    data_array.append((i, my_data[i]))

# note: the method is createDataFrame (capital F)
df = spark.createDataFrame(data=data_array, schema=["ID", "words"])

df.show()
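
The same result reads more idiomatically with enumerate and a list comprehension; a sketch of the equivalent:

data_array = [(i, words) for i, words in enumerate(my_data)]
df = spark.createDataFrame(data=data_array, schema=["ID", "words"])
df.show()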



Try this -- the simplest approach:

from datetime import datetime
from pyspark.sql import Row

utc = datetime.utcnow()  # placeholder; utc was undefined in the original snippet
x = Row(utc_timestamp=utc, routine='routine name', message='your message')
data = [x]
df = sqlContext.createDataFrame(data)  # in Spark 2.x, spark.createDataFrame(data) is preferred
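
Applied to the question's data, the same Row pattern would look like this (a sketch, using the Spark 2.x spark session):

rows = [Row(ID=i, words=words) for i, words in enumerate(my_data)]
df = spark.createDataFrame(rows)
df.show(truncate=False)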



Simple Approach:

my_data = [['apple','ball','ballon'],['cat','camel','james'],['none','focus','cake']]

spark.sparkContext.parallelize(my_data).zipWithIndex() \
    .toDF(["words", "id"]).show(truncate=False)

+---------------------+---+
|words                |id |
+---------------------+---+
|[apple, ball, ballon]|0  |
|[cat, camel, james]  |1  |
|[none, focus, cake]  |2  |
+---------------------+---+
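
Since zipWithIndex yields (element, index) pairs, the index lands in the second column. To match the question's layout with the ID first, a select can reorder the columns:

spark.sparkContext.parallelize(my_data).zipWithIndex() \
    .toDF(["words", "id"]).select("id", "words").show(truncate=False)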

