
Spark version: 2.1

For example, in PySpark, I create a list:

test_list = [['Hello', 'world'], ['I', 'am', 'fine']]

How do I then create a DataFrame from test_list, where the DataFrame's type is like below:

DataFrame[words: array<string>]

5 Answers


Here is how:

from pyspark.sql.types import *

cSchema = StructType([StructField("WordList", ArrayType(StringType()))])

# notice the extra square brackets: each row is a one-element list
# holding that row's whole word array
test_list = [[['Hello', 'world']], [['I', 'am', 'fine']]]

df = spark.createDataFrame(test_list, schema=cSchema)
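
For reference, the resulting schema and contents should look roughly like this (a sketch, assuming the spark session used above):

df.printSchema()

root
 |-- WordList: array (nullable = true)
 |    |-- element: string (containsNull = true)

df.show()

+--------------+
|      WordList|
+--------------+
|[Hello, world]|
| [I, am, fine]|
+--------------+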

2 Comments

For anyone who just wants to convert a flat list of strings and is impressed by the ridiculous lack of proper documentation: you cannot convert 1-d objects directly; you have to transform them into a list of one-element tuples, like [(t,) for t in list_of_strings] (see the sketch after these comments).
Is there a reason why from ... import *, almost universally considered an antipattern in Python, is advisable here?
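
A minimal sketch of the workaround from the first comment above, assuming an existing spark session; the list contents and column name are illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

list_of_strings = ['Hello', 'world', 'I', 'am', 'fine']

# wrap each string in a one-element tuple so it becomes its own row
df = spark.createDataFrame([(t,) for t in list_of_strings], ['word'])

df.show()

+-----+
| word|
+-----+
|Hello|
|world|
|    I|
|   am|
| fine|
+-----+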

I had to work with multiple columns and types; the example below has one string column and one integer column. A slight adjustment to Pushkr's code above gives:

from pyspark.sql.types import *

cSchema = StructType([StructField("Words", StringType()),
                      StructField("total", IntegerType())])

test_list = [['Hello', 1], ['I am fine', 3]]

df = spark.createDataFrame(test_list, schema=cSchema)

Output:

df.show()

+---------+-----+
|    Words|total|
+---------+-----+
|    Hello|    1|
|I am fine|    3|
+---------+-----+

1 Comment

Same question I asked on another answer: Is there a reason why from ... import *, almost universally considered an antipattern in Python, is advisable here?

You should use a list of Row objects ([Row]) to create the DataFrame.

from pyspark.sql import Row

# each inner list of test_list becomes one Row with a single 'words' field
spark.createDataFrame(list(map(lambda x: Row(words=x), test_list)))
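
With the question's test_list, schema inference on the Row objects should give exactly the type asked for (a quick check, assuming the same spark session):

df = spark.createDataFrame(list(map(lambda x: Row(words=x), test_list)))
print(df)

DataFrame[words: array<string>]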

1 Comment

Should be spark.createDataFrame

If the columns are in different lists, then use the code below, adjusted as needed:

l1 = [1, 2, 3, 4]
l2 = ['a', 'b', 'c', 'd']
l3 = []
if len(l1) == len(l2):
    for i in range(len(l1)):
        l3.append((l1[i], l2[i]))
print('List:', l3)
columns = ['Id', 'Name']
df = spark.createDataFrame(l3, columns)
display(df)  # display() is Databricks-specific; use df.show() elsewhere
Output:

List: [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]

Id  Name
1   a
2   b
3   c
4   d
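
As a side note, the pairing loop above can be written more compactly with the built-in zip; a sketch under the same assumptions (an existing spark session):

l1 = [1, 2, 3, 4]
l2 = ['a', 'b', 'c', 'd']

# zip pairs elements positionally; note it silently truncates to the
# shorter list, unlike the explicit length check above
df = spark.createDataFrame(list(zip(l1, l2)), ['Id', 'Name'])
df.show()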


You can create an RDD from the input first and then convert the constructed RDD to a DataFrame:

import sqlContext.implicits._

val testList = Array(Array("Hello", "world"), Array("I", "am", "fine"))

// create the RDD
val testListRDD = sc.parallelize(testList)
val flatTestListRDD = testListRDD.flatMap(entry => entry)

// convert the RDD to a DataFrame (note: flatMap gives one string per row)
val testListDF = flatTestListRDD.toDF
testListDF.show
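
For completeness, a rough PySpark equivalent of the same RDD route (a sketch, assuming the spark session from the other answers; like the Scala version, it flattens the input to one word per row rather than the array column the question asked for):

rdd = spark.sparkContext.parallelize([['Hello', 'world'], ['I', 'am', 'fine']])
flat = rdd.flatMap(lambda entry: entry)  # one word per element

# toDF needs row-like records, so wrap each word in a one-element tuple
df = flat.map(lambda w: (w,)).toDF(['value'])
df.show()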

1 Comment

This appears to be Scala code and not Python, for anyone wondering why this is downvoted. The question is explicitly tagged pyspark.
