Create single row dataframe from list of list PySpark

Question

I have data like this: data = [[1.1, 1.2], [1.3, 1.4], [1.5, 1.6]]. I want to create a PySpark dataframe from it.

I have already tried

dataframe = sqlContext.createDataFrame(data, ['features'])

but I always get

+--------+---+
|features| _2|
+--------+---+
|     1.1|1.2|
|     1.3|1.4|
|     1.5|1.6|
+--------+---+

How can I get a result like the one below?

+----------+
|features  |
+----------+
|[1.1, 1.2]|
|[1.3, 1.4]|
|[1.5, 1.6]|
+----------+

You can create a schema and provide it when creating the dataframe.
– koiralo, Commented Feb 12, 2018 at 11:11
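
One possible reading of this suggestion, as a sketch rather than the commenter's own code: declare a single column of type array<double> explicitly and wrap each inner list in a one-element tuple. The function name build_features_df and its spark argument (an already running SparkSession) are assumptions made for illustration. The pyspark import sits inside the function so that the data preparation step can be run without a Spark installation.

```python
data = [[1.1, 1.2], [1.3, 1.4], [1.5, 1.6]]

# One tuple per row; each tuple holds a single element, the whole inner list.
rows = [(features,) for features in data]


def build_features_df(spark):
    """Create a one-column DataFrame of arrays.

    Assumes `spark` is an existing SparkSession (hypothetical argument).
    """
    from pyspark.sql.types import ArrayType, DoubleType, StructField, StructType

    # Explicit schema: one nullable column named "features" of type array<double>.
    schema = StructType([StructField("features", ArrayType(DoubleType()), True)])
    return spark.createDataFrame(rows, schema)
```

With a running session, build_features_df(spark).printSchema() should report features as an array of doubles.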

pault · Accepted Answer · 2018-02-12 17:04:38Z

2

I find it's useful to think of the argument to createDataFrame() as a list of tuples where each entry in the list corresponds to a row in the DataFrame and each element of the tuple corresponds to a column.

You can get your desired output by making each element in the list a tuple:

data = [([1.1, 1.2],), ([1.3, 1.4],), ([1.5, 1.6],)]
dataframe = sqlCtx.createDataFrame(data, ['features'])
dataframe.show()
#+----------+
#|  features|
#+----------+
#|[1.1, 1.2]|
#|[1.3, 1.4]|
#|[1.5, 1.6]|
#+----------+

Or if changing the source is cumbersome, you can equivalently do:

data = [[1.1, 1.2], [1.3, 1.4], [1.5, 1.6]]
dataframe = sqlCtx.createDataFrame([(x,) for x in data], ['features'])
dataframe.show()
#+----------+
#|  features|
#+----------+
#|[1.1, 1.2]|
#|[1.3, 1.4]|
#|[1.5, 1.6]|
#+----------+
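
The row-and-column reading described in this answer can be checked in plain Python before involving Spark. The following is a small sketch, not part of the original answer; the helper name columns_per_row is hypothetical. It counts the elements in each row, which is the number of columns createDataFrame would see for that row.

```python
data = [[1.1, 1.2], [1.3, 1.4], [1.5, 1.6]]


def columns_per_row(rows):
    """Return the set of row lengths, i.e. how many columns each row would yield."""
    return {len(row) for row in rows}


# Each inner list is read as a row with two elements, so two columns result.
print(columns_per_row(data))      # {2}

# Wrapping each inner list in a one-element tuple leaves one element per row.
wrapped = [(features,) for features in data]
print(columns_per_row(wrapped))   # {1}
print(wrapped[0])                 # ([1.1, 1.2],)
```

This matches the question's output: with two elements per row and only one name supplied, the second column received the default name _2.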

edited Feb 12, 2018 at 17:04

answered Feb 12, 2018 at 16:19

pault


Anahcolus · Answer · 2018-02-12 11:23:17Z

0

You need a map function that wraps each inner list in an outer list, so that each row holds a single array-valued element, and then pass the result to createDataFrame:

dataframe = sqlContext.createDataFrame(sc.parallelize(data).map(lambda x: [x]), ['features'])

This should give the output you want:

+----------+
|  features|
+----------+
|[1.1, 1.2]|
|[1.3, 1.4]|
|[1.5, 1.6]|
+----------+

answered Feb 12, 2018 at 11:23

Anahcolus


pratiklodha · Answer · 2018-02-12 12:04:54Z

0

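You should use VectorAssembler. From your code, I guess you are doing this to train a machine learning model, and VectorAssembler works best for that case. You can also add the assembler to a pipeline. Note that data here has to be a DataFrame with one numeric column per feature, and that the resulting features column is an ML vector rather than an array.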

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler

assemble_feature = VectorAssembler(inputCols=data.columns, outputCol='features')
pipeline = Pipeline(stages=[assemble_feature])
pipeline.fit(data).transform(data)
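
To connect this with the list from the question, here is a minimal sketch of how the input DataFrame might be built first. It assumes a running SparkSession passed in as spark; the column names f1 and f2 and the function name assemble_features are hypothetical. Here each inner list is deliberately kept as a two-column row, because VectorAssembler combines numeric columns into one vector column.

```python
data = [[1.1, 1.2], [1.3, 1.4], [1.5, 1.6]]

# Hypothetical column names, one per position in each inner list.
feature_cols = ["f1", "f2"]

# One tuple per row, one tuple element per numeric column.
rows = [tuple(values) for values in data]


def assemble_features(spark):
    """Build a two-column DataFrame and add a vector-valued 'features' column.

    Assumes `spark` is an existing SparkSession (hypothetical argument).
    """
    from pyspark.ml.feature import VectorAssembler

    df = spark.createDataFrame(rows, feature_cols)
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    # The result keeps f1 and f2 and appends 'features' as an ML vector.
    return assembler.transform(df)
```

If only the array column from the question is needed, the tuple-wrapping approach in the other answers is the more direct route; this variant is mainly useful when the column feeds a pyspark.ml estimator.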

answered Feb 12, 2018 at 12:04

pratiklodha
